Jobs-API
The Jobs API is a tool for logging the outcome of ETL scripting jobs. It is designed to support the task of "incremental" data loads—i.e., loading only new or updated records from a source database to a destination database.
The typical ETL workflow using the Job API is:
At the beginning of your ETL job, query the Job API to retrieve the timestamp of the last time your job ran successfully.
Use this timestamp as a temporal filter to extract new/modifed records from the source database.
At the completion of your ETL job, register the successful outcome with the Job API.
Querying the API
The test instance is available at http://transportation-data-test.austintexas.io/jobs, and it can be queried using any PostGREST route, as described in the PostgREST documentation.
Field Definitions:
Fieldname
Field Type
Description
id
SERIAL PRIMARY KEY
System-generated incremental ID
name
TEXT NOT NULL
Name descriptor of the job which uniquely identifies the ETL job
start_date
TIMESTAMP WITH TIME ZONE NOT NULL
The job's start datetime
end_date
TIMESTAMP WITH TIME ZONE
The job's end datetime
message
TEXT
Optional message
status
TEXT NOT NULL
The status of the ETL job. Use 'in_progress', 'success', or 'error'
records_processed
INTEGER
The number of records processed by the job.
source
TEXT NOT NULL
The name source database of the ETL job.
destination
TEXT NOT NULL
The name of the destination database of the ETL job
Python Example
We've written a Python utility to interact with the jobs interface. It's included in our transportation-data-utils (tdutils) pacakge. tdutils
requires Python 3 and can be installed with $ pip install tdutils
.
Begin by importing the jobutil
library and creating a new job instance:
Assuming you've run this job successfully in the past, you can use the most_recent
method to return a unix timestamp from the last successful job run:
Register your new job with job.start()
. The API returns a JSON representation of newly created job record:
We have successfuly registered a new job. Note that the status was automatically set to in_progress
. If we were to call job.most_recent()
again, the results are the same as before, because the status of current job is not success
:
From here, continue your ETL process as needed. When your ETL is complete, use job.result()
to update your job status accordingly. Again, a JSON representation of the job record is returned:
Note that the end_date was populated automatically. Because the job was successful, calling job.most_recent()
now returns the timestamp of our current job:
You should also register job failures. The errror
result is availabe for such purposes:
Last updated