Cron, Anacron, and Launchd for Data Pipeline Scheduling

How to automate ETL pipelines with command line tools and a remote PostgreSQL database

Daniel Liden
The Inner Join

--

Want to try it yourself? First, sign up for bit.io. Then clone the GitHub repository and get started!

In this article, we’re returning to automation to explore a few different options for scheduling ETL pipelines from the command line. In particular, we’ll show how to use cron, anacron, and launchd. These tools are primarily for Linux and MacOS. Similar functionality for Windows can be found in the task scheduler (sign up for our newsletter or check back on the Inner Join for an upcoming article on the task scheduler).

Running a pipeline once may result in quickly-outdated data and a challenging and time-consuming update process.

These command line tools provide some of the easiest and most versatile ways of scheduling ETL processes. They can be run from server environments or right from your laptop. However, it’s important to pick the right tool for the job and to understand how the different tools handle common situations such as missing a scheduled run. Scheduling ETL pipelines with these tools is a good way to ensure your data are up to date when you need them without investing a lot of time and effort in more complicated tools.

A carefully-constructed and automated ETL pipeline can ensure that your data are up to date and ready to use whenever you need them. Cron, anacron, and launchd are three great tools for automating ETL pipelines from the command line.

For more details on the structure and implementation of ETL pipelines themselves, check out our two-part series on making a simple data pipeline (part 1, part 2), our article on scheduling a notebook to run at regular intervals with Deepnote (one of the easiest ways to get started), and our guide on logging IoT data to a cloud Postgres database on bit.io using Python on a Raspberry Pi.

Table of Contents

  • Example ETL Pipeline
  • What is cron?
  • What is anacron?
  • What is launchd?
  • Which one should I use?
  • Your Turn
  • Keep Reading

Example ETL Pipeline

In this example, we’ll be using data from the New York Times Books API. That said, the methods described below can be applied to a wide range of different ETL pipeline scripts.

Our example pipeline uses data from the New York Times Books API. You can find details on setting up the pipeline yourself in the GitHub repository.

Our pipeline downloads the full list of bestseller lists (fiction, nonfiction, graphic novels, etc.) as well as the bestseller lists themselves for fiction and for nonfiction; applies some simple transformations (for example, we remove some of the e-commerce links but make sure to retain those for Bookshop and Indiebound; we love indie booksellers!); and loads these tables to a cloud PostgreSQL bit.io repository. You can check out the pipeline code on GitHub.

The main components of our pipeline are:

  • extract.py: Contains a general method for obtaining JSON-formatted data with a GET request
  • transform.py: Contains simple transformations for two types of lists from the New York Times Books API: (1) the list of all of the different bestseller lists; and (2) the bestseller lists themselves.
  • load.py: Contains functions, largely sourced from here, for loading the transformed data to bit.io.
  • main.py: Brings the three files above together and contains logic for uploading the list of bestseller lists along with the combined print and ebook fiction and nonfiction lists to bit.io. The extract/transform/load modules are imported into main.py; main.py is all we need to automate (see the sketch below).
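As a rough sketch of how these modules fit together (the function and table names below are illustrative placeholders, not necessarily the repo’s exact API), main.py looks something like this:

# main.py (illustrative sketch; function and table names are placeholders)
import os

import extract
import transform
import load

# The repo reads credentials and connection details from environment variables.
NYT_API_KEY = os.environ["NYT_API_KEY"]
LISTS_URL = "https://api.nytimes.com/svc/books/v3/lists/names.json"

def main():
    # Extract: fetch JSON from the NYT Books API with a GET request
    raw = extract.get_json(LISTS_URL, params={"api-key": NYT_API_KEY})
    # Transform: keep and clean the columns we want (e.g., drop most e-commerce links)
    df = transform.transform_list_of_lists(raw)
    # Load: write the table to the bit.io PostgreSQL repository
    load.to_bitio(df, table_name="bestseller_lists")

if __name__ == "__main__":
    main()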

The end result of the pipeline is a PostgreSQL database accessible via the in-browser SQL query editor, the bit.io API, and a whole host of integrations (SQL clients, Python, R, etc.).

The end result of our automated pipeline: our data are now in an online PostgreSQL database on bit.io. The pipeline will keep this data repository up to date as new data are released each week.

Now, let’s get into the automation tools themselves, starting with cron.

What is cron?

cron is the classic command line scheduling utility, first released in 1975. cron can be used to schedule jobs to run at fixed dates/times (e.g. “September 15 at 12:45”) or intervals (“every three days”).

How to use cron

In order to schedule execution of a given script using cron, we first call crontab -e on the command line. This opens a temporary copy of the user’s crontab file, which stores the list of the user’s cron jobs. In that file, on a new line, we specify the minute, hour, day of month, month, and day of week, in that order, followed by the command to execute. Each of the five time fields must take some value; a value of * means the job runs for every value of that field (e.g. every minute, every hour, etc.).

Once you add a job, save, and close the crontab file, the new cron job is installed and the cron utility executes the job according to the specified schedule (specific implementation details vary by system).

For example, * * * * * command will run command every minute. 0 * * * * command will run command during the first minute of every hour. 5 * 5 * 5 command will run command at minute 5 of every hour on the 5th day of each month and on every Friday (when both day fields are restricted, cron runs the job if either one matches). The schematic below shows what each of the five fields corresponds to. You can copy this into your crontab for an easy reference.

Structure of a cron schedule
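Here is a plain-text rendering of that schematic, written as cron comments so it can be pasted directly into a crontab:

# ┌───────────── minute (0-59)
# │ ┌─────────── hour (0-23)
# │ │ ┌───────── day of month (1-31)
# │ │ │ ┌─────── month (1-12)
# │ │ │ │ ┌───── day of week (0-6, Sunday = 0; some implementations also accept 7)
# │ │ │ │ │
# * * * * * command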

There are a few additional ways to specify times:

  • , can be used to list values. 30 1,5,9 * * * command will execute command at 1:30, 5:30, and 9:30.
  • - specifies a range of values. 0 12 * * 1-5 command will execute command at noon Monday through Friday (but not on weekends).
  • / specifies step values, i.e. “every nth unit.” */5 * * * * command will run command every five minutes; 0 12 2-10/2 * * command will run command at noon on every second day of the month, from the 2nd through the 10th.
  • As the prior example shows, these operations can be combined. For example, 0,2-10,28,30-59/5 16 * * * command will run command at 4:00 PM, every minute from 4:02 to 4:10 PM, at 4:28 PM, and every five minutes from 4:30 through 4:55 PM.

To list the contents of your crontab (i.e. to list installed cron jobs), use crontab -l. crontab -r will delete the crontab.

Example

We’ll run our script weekly, shortly after 7 p.m. Eastern, so that our repository is updated shortly after the updated best-seller lists are published:

“Come holiday or hurricane, one thing you can count on is that The New York Times’s best-seller lists will be published online every Wednesday at 7 p.m. Eastern.” — NYT Best-Sellers List Staff

We can schedule this update with cron as follows.

  1. First, access the crontab file with crontab -e.
  2. Next, we set up a line in our crontab to (1) change to the correct project directory; (2) activate the project’s virtual environment; and (3) execute the main.py script:

15 16 * * 3 cd /path/to/project && source ./env/bin/activate && python ./src/main.py

For greater readability, we could also substitute WED for 3 and write:

15 16 * * WED cd /path/to/project && source ./env/bin/activate && python ./src/main.py

These two formulations are equivalent. Both mean that every Wednesday at 4:15 PM Pacific time (7:15 PM Eastern), cron will execute main.py, running the pipeline described above.

3. Lastly, we save the edited crontab to install the new cron job.

Caveats

  • cron jobs will not be run retroactively if the system is powered off at the scheduled run time. For example, if a cron job is set up on a laptop, and the laptop is off when the job is scheduled to run, the cron job will not run until the next scheduled time.
  • There are different cron implementations. The functionality and syntax described here is fairly basic and should work across systems (excluding Windows), though it’s always important to consult the system-specific documentation. Some implementations have extended functionality, such as the ability to schedule a script to run daily with @daily instead of the usual syntax.
  • cron jobs use the computer’s/server’s time zone. If running cron on a server, it is good practice to use UTC for the time zone.
  • By default, crontab -e edits the logged-in user’s crontab file. In general, you should avoid scheduling cron jobs with the system’s root account. Scheduling on the root account makes it much easier to break something with a typo, while scheduling on user-specific accounts provides clearer delineation of ownership and responsibility on multi-user systems (see the example below).
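On a related note, an administrator on a multi-user system can manage a specific user’s crontab with the -u flag rather than putting jobs in root’s own crontab (the username below is a placeholder):

# edit or list the crontab of user "etl_user" (requires superuser privileges)
sudo crontab -u etl_user -e
sudo crontab -u etl_user -l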

cron Extensions and Resources

  • The site crontab.guru provides an interactive tool for specifying the time intervals of interest and for learning cron’s syntax, which can be challenging at first.
  • Testing cron jobs can be tricky. One very simple way to ensure a job will execute without errors is to first schedule the job a minute or two in the future (see the example just after this list).
  • cron provides some basic logging through email. “Email,” in this case, refers to the user’s local mail account on the computer or server, not to an email provider like Gmail. To access these emails, simply type mail at the command line and review the messages. This is a useful first debugging step for failed cron jobs.
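Building on the testing tip above, it also helps to redirect a test job’s output to a log file, so errors are easy to spot without digging through local mail. A sketch (paths are placeholders; calling the virtual environment’s python directly avoids shell-compatibility issues with source under /bin/sh):

# temporary test entry: run every minute and capture stdout/stderr in a log
* * * * * cd /path/to/project && ./env/bin/python ./src/main.py >> /tmp/pipeline_test.log 2>&1

Once the job runs cleanly, replace the test entry with the real schedule.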

What is anacron?

Based on the name, you might guess that anacron is similar to cron. While both are command line scheduling systems, they have some important differences:

  • The minimum time increment in anacron is days (while cron jobs can be scheduled in terms of hours and minutes). anacron checks whether a job has been run in the last n days (where n is specified by the user) and runs the specified command if it has not.
  • anacron will execute missed jobs once it has the opportunity to do so (e.g. if the computer was powered off when the job was scheduled to run) while cron will not. This makes anacron more suitable for running on laptop or desktop computers because, unlike cron, it does not depend on the device being active exactly when the job is scheduled.
  • cron is a daemon; anacron is not. Thus, depending on the system, anacron must be invoked through other means (such as systemd or cron).
  • anacron is not included with MacOS.

How to Use anacron

Edit the /etc/anacrontab file with your preferred editor: for example, vi /etc/anacrontab. anacron jobs are specified with the following format:

#period in days   delay in minutes    job-identifier    command

For example, the following entry in the anacrontab file will print the date and time to a file each day, 15 minutes after anacron reads the anacrontab:

#period in days    delay in minutes    job-identifier    command
1 15 daily_fifteen echo `date` >> /home/username/execution_time.txt

After an anacron job has been added to the anacrontab file, the anacron utility checks whether the job has been run in the last n days, where n is the “period in days” entry (anacron tracks each job’s last run date, by job-identifier, in timestamp files, typically under /var/spool/anacron). If not, anacron runs the scheduled job after the specified delay in minutes.

Example

For our Books example, we need to remember the “every n days” format of anacron scheduling. The exact timing will depend on which day it is when we register the command in anacron. Suppose it’s Wednesday. Recall, the book lists are updated weekly at 7:00 PM Eastern each Wednesday. So we’ll give the job a seven-day period with a 975-minute delay: if anacron performs its daily check at midnight, the job runs at approximately 4:15 PM Pacific every Wednesday (and will still run later if that time is missed).

#period in days    delay in minutes    job-identifier    command
7 975 nyt_books cd /path/to/project && source ./env/bin/activate && python ./src/main.py

Caveats

  • As with cron, implementations and usage details for anacron can vary by system.
  • anacron is not as widely used, documented, or supported as cron.
  • Unlike cron, anacron must be run as root (though there are some workarounds for anyone wanting to run anacron under a specific user account).
  • Fine timing control is more difficult under anacron than cron. The days an anacron job will run are determined by when the job is defined (e.g. a job set to run every other day starting Monday will run on different days than the same job starting Tuesday).

anacron Extensions and Resources

  • It’s possible to test an anacron job with anacron -ndf. The -n flag runs the jobs with no delay; the -f flag forces execution of the jobs, regardless of when they were last run; and the -d flag ensures the job is not forked to the background, so information about execution is printed to the terminal. If testing in this way, it’s important to ensure that running the scheduled script at the wrong time won’t have undesirable consequences.
  • The anacrontab file specifies a number of additional variables defining how and when anacron should run. For example, the START_HOURS_RANGE variable defines when jobs can be run (e.g. 9-17 would specify that jobs should only run during the 9-to-5 workday). The RANDOM_DELAY variable sets the upper bound of a random delay in minutes before the execution of each job; a value of 10 would add between 0 and 10 minutes to the delay in minutes specified for each job in the anacrontab. These variables are typically placed at the top of the anacrontab in the format VARIABLE=value. The specifics for a given system can be found in the anacrontab man pages (type man anacrontab in the terminal).
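Putting these pieces together, a complete anacrontab using both variables might look like the following (values and paths are illustrative):

# /etc/anacrontab (illustrative example)
SHELL=/bin/bash   # use bash so that `source` works in the job command
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
START_HOURS_RANGE=9-17   # only start jobs between 9 AM and 5 PM
RANDOM_DELAY=10          # add 0-10 random minutes to each job's own delay

#period in days   delay in minutes   job-identifier   command
7 975 nyt_books cd /path/to/project && source ./env/bin/activate && python ./src/main.py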

What is launchd?

launchd is the “canonical way to launch a daemon” on MacOS (see man launchd on MacOS). In fact, on MacOS, the cron utility is itself launched by launchd.

launchd is an extremely powerful application that can be used in a wide variety of ways. Below, we’ll cover just one of many possible ways to use launchd to schedule a data pipeline.

How to Use launchd

Each scheduled launchd job gets its own XML configuration file, called a “property list” or plist file. The specific configurations available in an XML plist file can be found with man launchd.plist. We will focus on those most relevant to setting up a repeating data pipeline.

  • Label uniquely identifies the job.
  • WorkingDirectory specifies the directory from which the job should run.
  • StartCalendarInterval defines the interval on which the job will run. The keys are Minute, Hour, Day, Weekday, and Month. Keys left undefined behave the same as * time values in cron.
  • StartInterval takes an integer N and specifies that the job should run every N seconds.
  • Program defines the path to an executable.
  • ProgramArguments specifies a vector of arguments to be passed to the job. This must be defined if Program is not. Without Program, this can be used to directly call utilities such as touch and cp, though the syntax around, for example, calling multiple commands or using special characters for pipes and redirection can be tricky. We’ll focus on the Program approach here.

To schedule a job, we first must create a .plist file in the correct location. We will use the /Library/LaunchAgents directory, which holds per-user agents installed by the administrator (agents for just your own account can also go in ~/Library/LaunchAgents). More details on the available directories in which launchd plists can be saved are available in the official documentation.

A simple launchd plist to call on a bash script every 5 minutes might look like this:

A simple launchd plist example
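In text form, that plist might read as follows (the label and script path are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.every-five-minutes</string>
    <key>Program</key>
    <string>/path/to/myscript.sh</string>
    <!-- run every 300 seconds, i.e. every five minutes -->
    <key>StartInterval</key>
    <integer>300</integer>
</dict>
</plist>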

We’ve used the StartInterval key to specify that the job should run every 300 seconds (five minutes) and the Program key to point to the bash script we want the job to execute.

Example

To schedule the New York Times Books ETL Pipeline, we’ll first write a short bash script to set up our Python environment and run main.py. Note that we do not have to change the working directory in the script: we specify this option in our launchd plist.
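The script itself can be a minimal sketch like the following (it assumes, per the note above, that the plist’s WorkingDirectory key points at the project root):

#!/bin/bash
# myscript.sh: activate the project's virtual environment and run the pipeline.
# No cd is needed here; launchd's WorkingDirectory key already sets the directory.
source ./env/bin/activate
python ./src/main.py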

We save this script as myscript.sh and make it executable with chmod +x myscript.sh.

Then we set up our launchd plist as described above. As with cron, we use 15, 16, and 3 as the values for Minute, Hour, and Weekday, respectively. This will ensure the script runs every Wednesday at 4:15 PM Pacific time.

We’ve set the working directory using the WorkingDirectory key; provided the path to the script with the Program key; and defined a dictionary with keys for Minute, Hour, and Weekday to specify the day and time the script should run using the StartCalendarInterval key.
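In text form, this plist might look like the following (the label and paths are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.nyt-books-etl</string>
    <key>WorkingDirectory</key>
    <string>/path/to/project</string>
    <key>Program</key>
    <string>/path/to/project/myscript.sh</string>
    <!-- every Wednesday (Weekday 3) at 16:15, i.e. 4:15 PM -->
    <key>StartCalendarInterval</key>
    <dict>
        <key>Minute</key>
        <integer>15</integer>
        <key>Hour</key>
        <integer>16</integer>
        <key>Weekday</key>
        <integer>3</integer>
    </dict>
</dict>
</plist>

Note that launchd generally won’t see a new plist until it is loaded, either at the next login or explicitly with, for example, launchctl load /Library/LaunchAgents/com.example.nyt-books-etl.plist.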

Caveats

  • Though launchd offers a great deal of flexibility and many more configuration options than cron or anacron, it is also considerably more complex. A single-line cron job may take a dozen or more lines to represent in a launchd plist.
  • It is more difficult to find relevant documentation and tutorials for specific launchd applications than for specific cron applications.
  • launchd jobs will not execute if the computer is powered off at the scheduled time. If the device is asleep at the scheduled time, however, jobs scheduled with StartCalendarInterval will execute when the computer wakes.

launchd Extensions and Resources

  • This blog post provides an excellent overview of the essential job-scheduling functionality of launchd.
  • To run a scheduled launchd job immediately (regardless of schedule), call launchctl start <job.label>, where <job.label> is the Label key defined in the launchd plist. This is a good way to make sure the job will run correctly (though it does not guarantee that the schedule is set up correctly).
  • A launchd job can be configured to run based on a number of different triggers, not just time.
  • To confirm that a job is scheduled, you can use launchctl list | grep <job.label>, where <job.label> is the Label key defined in the launchd plist.

Which one should I use?

Each of these three options has benefits and tradeoffs. Here are a few suggestions for choosing the option that works best for you:

  • If you’re on Windows, use the task scheduler (which we will cover in a later post), or set up cron or anacron jobs in the Windows Subsystem for Linux (WSL).
  • If you’re on MacOS and you need the maximum amount of customization, use launchd. launchd provides the greatest amount of control over many automation features such as setting root and working directories, setting environment variables, managing system resources, prioritization, debugging, and logging. Furthermore, its “canonical” status in the MacOS ecosystem suggests it will be supported for some time to come.
  • If your job is running from a server or if you need fine-grained control over the time intervals, use cron. With cron, you can specify the exact minute of the day when you want a job to execute. However, if your computer is suspended/off when the cron job is scheduled, it will not run. cron is, therefore, particularly suitable for use in an always-on server environment.
  • Use anacron if you are on Linux and want to be sure your jobs will run even if your computer is occasionally off. anacron jobs will run if more than the specified number of days have elapsed since the last run, regardless of the specific scheduled time. This makes anacron suitable for use on personal computers that may be off at different times of day. However, running scripts at specific times is easier with cron or launchd than with anacron.

Your Turn

Want to try some of these approaches yourself? Clone the New York Times Book ETL Pipeline GitHub Repo if you need an example to work with. The README covers how to get your NYT API keys, sign up for bit.io, set up environment variables, and execute the ETL pipeline. Here are some ideas to get you started:

  • Schedule the extraction of more of the lists from the New York Times API. The current example only gets the combined print and e-book fiction and nonfiction lists, but there are others, such as picture books, business books, and graphic novels.
  • Update the pipeline to keep a historical record of the lists by first loading all of the old lists into a bit.io repository and then appending each new list according to a schedule (instead of overwriting the list each week).
  • Extend the pipeline: use the data from bit.io to populate a dashboard with visualizations of the data. You might consider publishing a Deepnote notebook with a dashboard layout or hosting some Python- or R-generated visualizations on Datapane.
  • Improve the automation of the ETL pipeline. It runs on a schedule now, but what if something goes wrong? How can we diagnose or fix it? How can we avoid accidentally overwriting the existing table with bad data? Consider building some basic testing or logging capabilities. We’ll be writing on these topics in a future post, so make sure to sign up for our newsletter!

Keep Reading

We’ve written a whole series on ETL pipelines! Check them out here:

Core Concepts and Key Skills

Focus on Automation

ETL In Action
