Making a Simple Data Pipeline Part 2: Automating ETL

Schedule Python and SQL scripts to keep your dataset clean and up-to-date in a Postgres database

Andrew Doss
The Inner Join


Want to try it yourself? First, sign up for bit.io to get instant access to a free Postgres database. Then clone the GitHub repo and give it a try!

Where we left off

In Making a Simple Data Pipeline Part 1: The ETL Pattern, we explained that the Extract, Transform, Load (ETL) process is a general computing pattern for getting data in the right location and format for use.

A scheduled ETL process helps us get prepared data into a common database where we can further join, transform, and access the data for particular use cases such as analytics and ML.

We previously walked through each of the extract, transform, and load steps using simple Python scripts. Now it’s time to put the pieces together to complete an automated pipeline.

We will walk through key code snippets together, and the full implementation and documentation is available in this repo. You can see the end product in our public Postgres database.

Putting the pieces together

We need a way to put the ETL steps together. For this simple implementation, we will define an additional Python main.py script that executes an ETL process using command-line arguments for the (optional) transformation function name, data source, and database destination:

python main.py -name <TRANSFORM_FUNCTION_NAME> \
'<DATA_SOURCE_URL>' \
'<USERNAME/REPO_NAME.DESTINATION_TABLE_NAME>'

For example, we can run the ETL process for the NYT cases and deaths data as follows:

python main.py -name nyt_case_counties \
'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv' \
'demo_user_etl/simple_pipeline.case_counties'

To ingest a local CSV file instead, simply add the option -local_source and replace the URL with a file path.
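For example, if you had already downloaded the NYT file to a local path (the path below is hypothetical), the equivalent job might look like this:

python main.py -name nyt_case_counties -local_source \
'data/us-counties.csv' \
'demo_user_etl/simple_pipeline.case_counties'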

Next, let’s review how this command-line script is implemented. We start at the bottom of main.py. Lines 5–15 parse and validate the provided command-line arguments. Then, the script calls the main function with the parsed arguments.
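The repo contains the authoritative version of main.py; as a rough, condensed sketch (assuming argparse, with the imports shown in the next snippet), that bottom portion might look something like this:

if __name__ == '__main__':
    # Parse and validate the command-line arguments
    parser = argparse.ArgumentParser(description='Run one ETL job')
    parser.add_argument('-name', default=None,
                        help='optional name of the transformation function to apply')
    parser.add_argument('-local_source', action='store_true',
                        help='treat the source as a local file path instead of a URL')
    parser.add_argument('source', help='data source URL or local file path')
    parser.add_argument('destination',
                        help='destination table, e.g. username/repo_name.table_name')
    args = parser.parse_args()

    # Hand the parsed arguments off to the main function
    main(args)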

We can now continue with the top portion of main.py.

Lines 16–17 read the Postgres connection string from your environment or a .env file (see the main repo for more help, if needed) and assign the string to the module-level variable PG_CONN_STRING.

The main function is defined in lines 20–48. This function calls the appropriate extract, transform, and load functions from the previously shown modules, based on the provided command-line options and arguments.
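Here is a similarly hedged sketch of that top portion. The module and function names (extract, transform, load, csv_from_get_request, csv_from_local, to_table), the PG_CONN_STRING environment variable name, and the use of python-dotenv are assumptions based on the description above and Part 1, so check the repo for the exact interface:

import argparse
import os

from dotenv import load_dotenv

import extract
import load
import transform

# Read the Postgres connection string from the environment or a .env file
# and assign it to a module-level variable
load_dotenv()
PG_CONN_STRING = os.environ['PG_CONN_STRING']


def main(args):
    """Run one ETL job based on the parsed command-line arguments."""
    # Extract: read a DataFrame from a URL or a local CSV file
    if args.local_source:
        df = extract.csv_from_local(args.source)
    else:
        df = extract.csv_from_get_request(args.source)

    # Transform: apply the named transformation function, if one was provided
    if args.name is not None:
        df = getattr(transform, args.name)(df)

    # Load: write the DataFrame to the destination table in Postgres
    load.to_table(df, PG_CONN_STRING, args.destination)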

In exchange for this slightly more complex parameterized command-line script, we’ve gained the ability to drive ETL processes (with their own source, transformation function, and destination table) from the command line for each of our three data sources using a single Python script. This interface sets us up for our next step — scheduling the ETL processes for the two sources with daily updates.

On schedule

We already have what we need for on-demand ETL processes for our three data sources, but manually running the daily updates is both tedious and error-prone. This is not an article on scheduling best practices, but we will cover a very simple scheduling implementation to illustrate the concept and get you started.

Unix-like operating systems can run a utility called cron that automatically runs jobs on a schedule. We can define a shell script that calls our command-line Python ETL script for each data source:
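Here is a sketch of what that scheduled_run.sh could look like. The second source's URL, transformation function, and table name are placeholders, and depending on your setup you may also need to activate a virtual environment first:

#!/bin/bash
# Run the ETL jobs for the two data sources that update daily

python main.py -name nyt_case_counties \
'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv' \
'demo_user_etl/simple_pipeline.case_counties'

python main.py -name <TRANSFORM_FUNCTION_NAME> \
'<SECOND_DAILY_SOURCE_URL>' \
'demo_user_etl/simple_pipeline.<DESTINATION_TABLE_NAME>'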

Then we can instruct cron to execute the shell script at the same time each day. We can also run the Python script directly from cron, but wrapping all of our terminal commands in a shell script makes the cron “job definition” cleaner.

You may need to install cron with the appropriate package manager for your system. With cron installed, you can then edit a crontab file to schedule this script to run daily. Here is the line we added to our crontab file:

45 09 * * * cd ~/Documents/simple_pipeline && ./scheduled_run.sh

The 45 09 * * * portion instructs cron to run the job every day at 9:45 AM local time (during our daily standup), and the cd ~/Documents/simple_pipeline && ./scheduled_run.sh portion specifies the job itself.

Here is a summary of the cron schedule syntax from Wikipedia:

# ┌───────────── minute (0 - 59)
# │ ┌───────────── hour (0 - 23)
# │ │ ┌───────────── day of the month (1 - 31)
# │ │ │ ┌───────────── month (1 - 12)
# │ │ │ │ ┌───────────── day of the week (0 - 6) (Sun. to Sat.)
# │ │ │ │ │
# * * * * * <command to execute>

As-is, this scheduling implementation lacks notifications and logging (although cron supports both) — but it is a quick way to get started with scheduled ETL scripts.

This is not primarily a tutorial on shell scripting or cron, so if you need help, please let us know in the comments or refer to the full repo for additional documentation.

That’s it! We’ve now implemented and scheduled our ETL process. From here, we have one bonus step to discuss — post-load transformation.

Extract, Transform, Load, Transform…

ETL is a useful general pattern for data ingestion. However, in a modern data environment, some or all of the transformation step of data preparation may occur after loading the data sources into a common data warehouse. There are multiple reasons for this:

  1. The transformations may be defined by downstream analysts who can iterate without being coupled to the upstream pipeline maintainers.
  2. The data warehouse may be a more efficient or capable compute environment than the ingestion pipeline.
  3. The same data sources might be needed for several different final representations.

We will demonstrate a post-load transformation by scheduling an in-database SQL script ca_covid_data.sql to update a California COVID cases and vaccinations table each time the data sources are reloaded:

Lines 1–14 create a California COVID data table, if it does not already exist. Line 16 clears any existing data from the table, and lines 18–38 query the three loaded datasets and insert the results into the California COVID data table.
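The full script lives in the repo; the condensed sketch below shows its general shape only, and the table and column names here are hypothetical rather than the actual schema:

-- Create the California COVID data table, if it does not already exist
CREATE TABLE IF NOT EXISTS ca_covid_data (
    date date,
    county text,
    cases integer,
    deaths integer,
    vaccinations integer,
    population integer
);

-- Clear any existing rows so the table reflects the latest loads
DELETE FROM ca_covid_data;

-- Query the three loaded source tables and insert the California rows
INSERT INTO ca_covid_data
SELECT c.date, c.county, c.cases, c.deaths, v.vaccinations, p.population
FROM case_counties c
JOIN vaccinations v ON v.date = c.date AND v.county = c.county
JOIN county_population p ON p.county = c.county
WHERE c.state = 'California';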

We can run this SQL script using a second simple Python script sql_executor.py that takes one command-line parameter — the path to the script ca_covid_data.sql — and connects to bit.io for in-database execution:

python sql_executor.py ca_covid_data.sql
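As with main.py, the repo holds the real implementation; this is a minimal sketch assuming psycopg2, python-dotenv, and the same PG_CONN_STRING environment variable as before:

import os
import sys

import psycopg2
from dotenv import load_dotenv

# Read the Postgres connection string from the environment or a .env file
load_dotenv()
PG_CONN_STRING = os.environ['PG_CONN_STRING']


def execute_sql_file(path):
    """Read a SQL file and execute its contents against the database."""
    with open(path) as f:
        sql = f.read()
    conn = psycopg2.connect(PG_CONN_STRING)
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
        conn.commit()
    finally:
        conn.close()


if __name__ == '__main__':
    # The single command-line argument is the path to the SQL script
    execute_sql_file(sys.argv[1])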

Finally, we can add one more line to our scheduled shell script to re-run the in-database table update query after each run of the ETL jobs.
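For example, the end of scheduled_run.sh might gain a line like this:

# Rebuild the California COVID data table from the freshly reloaded sources
python sql_executor.py ca_covid_data.sql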

The flexibility and convenience of post-load transformations are among the many great reasons to work with your data in an actual database like Postgres on bit.io.

In a diagram of this pipeline, the California COVID data table would be a single derived node. A working data warehouse, however, will often have a whole network of such nodes derived through in-database transformations.

A minimum viable pipeline

If you’ve made it this far, you’ve completed our walkthrough and are now ready to start building your own simple, scheduled data pipelines using Python and Postgres.

Data pipelining is a deep topic, and we’ve covered just enough to get started. If you’d like to keep going and make a more maintainable pipeline, check out Making a Simple Data Pipeline Part 3: Testing ETL, where we introduce both code and data tests. We also have a final section, Part 4, on automating pipeline runs and testing with GitHub Actions, coming soon.

If you haven’t yet, we invite you to sign up for bit.io to get instant access to a free, private Postgres database, clone the GitHub repo, and give it a try on a dataset that you need!

Interested in future Inner Join publications and related bit.io data content? Please consider subscribing to our weekly newsletter.

Appendix

Series overview

This article is part of a four-part series on making a simple, yet effective, ETL pipeline. We minimize the use of ETL tools and frameworks to keep the implementation simple and the focus on fundamental concepts. Each part introduces a new concept along the way to building the full pipeline located in this repo.

  1. Part 1: The ETL Pattern
  2. Part 2: Automating ETL
  3. Part 3: Testing ETL
  4. Part 4: CI/CD with GitHub Actions

Additional considerations

This series aims to illustrate the ETL pattern with a simple, usable implementation. To maintain that focus, some details have been left to this appendix.

  • Best practices — this series glosses over some important practices for making robust production pipelines: staging tables, incremental loads, containerization/dependency management, event messaging/alerting, error handling, parallel processing, configuration files, data modeling, and more. There are great resources available for learning to add these best practices to your pipelines.
  • ETL vs. ELT vs. ETLT — the ETL pattern can have a connotation of one bespoke ETL process loading an exact table for each end use case. In a modern data environment, a lot of transformation work happens post-load, inside a data warehouse. This leads to the term “ELT” or the unwieldy “ETLT”. Put simply, you may want to keep pre-load transformations light, if you use them at all, to enable iteration on transformations within the data warehouse.

Keep Reading

We’ve written a whole series on ETL pipelines! Check them out here:

  • Core Concepts and Key Skills
  • Focus on Automation
  • ETL In Action
