Cron, Anacron, and Launchd for Data Pipeline Scheduling
How to automate ETL pipelines with command line tools and a remote PostgreSQL database
Want to try it yourself? First, sign up for bit.io. Then clone the GitHub repository and get started!
In this article, we’re returning to automation to explore a few different options for scheduling ETL pipelines from the command line. In particular, we’ll show how to use `cron`, `anacron`, and `launchd`. These tools are primarily for Linux and MacOS. Similar functionality for Windows can be found in the Task Scheduler (sign up for our newsletter or check back on the Inner Join for an upcoming article on the Task Scheduler).
These command line tools provide some of the easiest and most versatile ways of scheduling ETL processes. They can be run from server environments or right from your laptop. However, it’s important to pick the right tool for the job and to understand how the different tools handle common situations such as missing a scheduled run. Scheduling ETL pipelines with these tools is a good way to ensure your data are up to date when you need them without investing a lot of time and effort in more complicated tools.
For more details on the structure and implementation of ETL pipelines themselves, check out our two-part series on making a simple data pipeline (part 1, part 2), our article on scheduling a notebook to run at regular intervals with Deepnote (one of the easiest ways to get started), and our guide on logging IoT data to a cloud Postgres database on bit.io using Python on a Raspberry Pi.
Table of Contents
Example ETL Pipeline
In this example, we’ll be using data from the New York Times Books API. That said, the methods described below can be applied to a wide range of different ETL pipeline scripts.


Our pipeline downloads the full list of bestseller lists (fiction, nonfiction, graphic novels, etc.) as well as the bestseller lists themselves for fiction and for nonfiction; applies some simple transformations (for example, we remove some of the e-commerce links but make sure to retain those for Bookshop and Indiebound — we love indie booksellers!); and loads these tables to a cloud PostgreSQL bit.io repository. You can check out the pipeline code on GitHub.
The main components of our pipeline are:
- `extract.py`: Contains a general method for obtaining JSON-formatted data with a GET request.
- `transform.py`: Contains simple transformations for two types of list from the New York Times Books API: (1) a list of all of the different bestseller lists; and (2) the bestseller lists themselves.
- `load.py`: Contains functions, largely sourced from here, for loading the transformed data to bit.io.
- `main.py`: Brings the three files above together and contains logic for uploading the list of bestseller lists along with the combined print and e-book fiction and nonfiction lists to bit.io. The extract/transform/load modules are imported into `main.py`; `main.py` is all we need to automate.
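As a rough sketch of how these pieces fit together (the function names and record fields below are illustrative assumptions, not the repository's exact API):

```python
# A minimal sketch of the extract/transform/load layout.
# All names here (extract, transform, load, main, the field names)
# are illustrative assumptions, not the repository's exact API.

def extract(fetch_json, url):
    """Obtain JSON-formatted data; fetch_json would wrap a GET request."""
    return fetch_json(url)

def transform(records):
    """Keep only the fields we care about, dropping e-commerce links
    other than the (hypothetical) Bookshop and Indiebound fields."""
    keep = {"title", "author", "rank", "bookshop_url", "indiebound_url"}
    return [{k: v for k, v in r.items() if k in keep} for r in records]

def load(rows, write_row):
    """Load transformed rows to the database; write_row would wrap an INSERT."""
    for row in rows:
        write_row(row)

def main(fetch_json, write_row, url):
    """main.py simply chains the three steps."""
    load(transform(extract(fetch_json, url)), write_row)
```

Because the extract and load steps are passed in as functions, the chain is easy to exercise without a network connection or database.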
The end result of the pipeline is a PostgreSQL database accessible via the in-browser SQL query editor, the bit.io API, and a whole host of integrations (SQL clients, Python, R, etc.).

Now, let’s get into the automation tools themselves, starting with `cron`.
What is `cron`?
`cron` is the classic command line scheduling utility, first released in 1975. `cron` can be used to schedule jobs to run at fixed dates/times (e.g. “September 15 at 12:45”) or intervals (“every three days”).
How to use cron
In order to schedule execution of a given script using `cron`, we first call `crontab -e` on the command line. This opens a temporary copy of the user’s crontab file, which stores the list of a user’s `cron` jobs. In that file, on a new line, we specify the minute, hour, day of month, month, and day of week, in that order, followed by the command to execute. Each of the five time fields must take some value; a value of `*` means the job will run at every corresponding time interval (e.g. every minute, every hour, etc.).
Once you add a job, save, and close the crontab file, the new `cron` job is installed and the `cron` utility executes the job according to the specified schedule (specific implementation details vary by system).
For example, `* * * * * command` will run `command` every minute. `0 * * * * command` will run `command` during the first minute of every hour. `5 * 5 * 5 command` will run `command` at minute 5 of every hour on the 5th day of each month and on Fridays. The schematic below shows what each of the five fields corresponds to. You can copy it into your crontab for easy reference.
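This field schematic is a standard one (a version of it appears in many crontab man pages), written so it can be pasted into a crontab as a comment:

```
# ┌───────────── minute (0-59)
# │ ┌─────────── hour (0-23)
# │ │ ┌───────── day of month (1-31)
# │ │ │ ┌─────── month (1-12)
# │ │ │ │ ┌───── day of week (0-6, Sunday = 0)
# │ │ │ │ │
# * * * * *  command to execute
```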
There are a few additional ways to specify times:
- `,` can be used to list values. `30 1,5,9 * * * command` will execute `command` at 1:30, 5:30, and 9:30.
- `-` specifies a range of values. `0 12 * * 1-5 command` will execute `command` at noon daily from Monday through Friday (but not on weekends).
- `/` specifies step values: the job runs after the passage of each given number of units. `*/5 * * * * command` will run `command` every five minutes; `0 12 2-10/2 * * command` will run `command` at noon on every second day from the 2nd through the 10th of each month.
- As the prior example shows, these operations can be combined. For example, `0,2-10,28,30-59/5 16 * * * command` will run `command` at 4:00 PM, every minute from 4:02 to 4:10 PM, at 4:28 PM, and every five minutes from 4:30 to 4:59 PM.
To list the contents of your crontab (i.e. to list installed `cron` jobs), use `crontab -l`. `crontab -r` will delete the crontab.
Example
We’ll run our script weekly, shortly after 7 p.m. Eastern, so that our repository is updated shortly after the updated best-seller lists are published:
“Come holiday or hurricane, one thing you can count on is that The New York Times’s best-seller lists will be published online every Wednesday at 7 p.m. Eastern.” — NYT Best-Sellers List Staff
We can schedule this update with `cron` as follows.
1. First, access the crontab file with `crontab -e`.
2. Next, set up a line in the crontab to (1) change to the correct project directory; (2) activate the correct virtual environment; and (3) execute the `main.py` script:

15 16 * * 3 cd /path/to/project && source ./env/bin/activate && python ./src/main.py

For greater readability, we could also substitute `WED` for `3` and write:

15 16 * * WED cd /path/to/project && source ./env/bin/activate && python ./src/main.py

These two formulations are equivalent. Both mean that every Wednesday at 4:15 PM Pacific time (which translates to 7:15 PM Eastern), `cron` will execute `main.py`, running the pipeline described above.

3. Lastly, we save the edited crontab to install the new `cron` job.
Caveats
- `cron` jobs will not be run retroactively if the system is powered off at the scheduled run time. For example, if a `cron` job is set up on a laptop, and the laptop is off when the job is scheduled to run, the `cron` job will not run until the next scheduled time.
- There are different `cron` implementations. The functionality and syntax described here are fairly basic and should work across systems (excluding Windows), though it’s always important to consult the system-specific documentation. Some implementations have extended functionality, such as the ability to schedule a script to run daily with `@daily` instead of the usual syntax.
- `cron` jobs use the computer’s/server’s time zone. If running `cron` on a server, it is good practice to use UTC for the time zone.
- By default, `crontab -e` edits the logged-in user’s crontab file. In general, you should avoid scheduling `cron` jobs with the system’s root account. Scheduling on the root account makes it much easier to break something with a typo, while scheduling on user-specific accounts provides clearer delineation of ownership and responsibility on multi-user systems.
`cron` Extensions and Resources
- The site crontab.guru provides an interactive tool for specifying the time intervals of interest and for learning `cron`’s syntax, which can be challenging at first.
- Testing `cron` jobs can be tricky. One very simple way to ensure a job will execute without errors is to first schedule the job a minute or two in the future.
- `cron` provides some basic logging through email. “Email,” in this case, refers to the user’s local mail account on the computer or server, not to an email provider like Gmail. To access these messages, simply type `mail` at the command line and review them. This is a useful first debugging step for failed `cron` jobs.
- More tips.
What is `anacron`?
Based on the name, you might guess that `anacron` is similar to `cron`. While both are command line scheduling systems, they have some important differences:
- The minimum time increment in `anacron` is days (while `cron` jobs can be scheduled in terms of hours and minutes). `anacron` checks whether a job has been run in the last `n` days (where `n` is specified by the user) and runs the specified command if it has not.
- `anacron` will execute missed jobs once it has the opportunity to do so (e.g. if the computer was powered off when the job was scheduled to run) while `cron` will not. This makes `anacron` more suitable for running on laptop or desktop computers because, unlike `cron`, it does not depend on the device being active exactly when the job is scheduled.
- `cron` is a daemon; `anacron` is not. Thus, depending on the system, `anacron` must be invoked through other means (such as `systemd` or `cron`).
- `anacron` is not available for MacOS.
How to Use anacron
Edit the `/etc/anacrontab` file with your preferred editor: for example, `vi /etc/anacrontab`. `anacron` jobs are specified with the following format:
#period in days delay in minutes job-identifier command
For example, the following entry in the anacrontab file will print the date and time to a file each day, with a 15-minute delay from when `anacron` reads the anacrontab:
#period in days delay in minutes job-identifier command
1 15 daily_fifteen echo `date` >> /home/username/execution_time.txt
After an `anacron` job has been added to the anacrontab file, the `anacron` utility checks whether the job has been run in the last `n` days, where `n` is the “period in days” entry. If not, `anacron` runs the scheduled job after the specified delay in minutes.
Example
For our Books example, we need to remember the “every `n` days” format of `anacron` scheduling. The exact schedule will depend on which day it is when we register the command in `anacron`. Suppose it’s Wednesday. Recall, the book lists are updated weekly at 7:00 PM Eastern each Wednesday. So we’ll run `anacron` on a seven-day period with a 975-minute delay, thus ensuring it runs every Wednesday at approximately 4:15 PM Pacific (and will still run later if that time is missed).
#period in days delay in minutes job-identifier command
7 975 nyt_books cd /path/to/project && source ./env/bin/activate && python ./src/main.py
Caveats
- As with `cron`, implementations and usage details for `anacron` can vary by system. `anacron` is not as widely used, documented, or supported as `cron`.
- Unlike `cron`, `anacron` must be run as root (though there are some workarounds for anyone wanting to run `anacron` under a specific user account).
- Fine timing control is more difficult under `anacron` than `cron`. The days an `anacron` job will run are determined by when the job is defined (e.g. a job set to run every other day starting Monday will run on different days than the same job starting Tuesday).
`anacron` Extensions and Resources
- It’s possible to test an `anacron` job with `anacron -ndf`. The `-n` flag runs the jobs with no delay; the `-f` flag forces execution of the jobs, regardless of when they were last run; and the `-d` flag ensures the job is not forked to the background, so information about execution is printed to the terminal. If testing in this way, it’s important to ensure that running the scheduled script at the wrong time won’t have undesirable consequences.
- The anacrontab file specifies a number of additional variables defining how and when `anacron` should run. For example, the `START_HOURS_RANGE` variable defines when jobs can be run (e.g. 9-17 would specify that a job should only run during the 9-to-5 workday). The `RANDOM_DELAY` variable sets a random delay in minutes before the execution of each job; a value of 10 would add between 0 and 10 minutes to the delay in minutes specified for each job in the anacrontab. These variables are typically placed at the top of the anacrontab in the format `VARIABLE=value`. The specifics for a given system can be found in the anacrontab man pages (type `man anacrontab` in the terminal).
What is `launchd`?
`launchd` is the “canonical way to launch a daemon” on MacOS (see `man launchd` on MacOS). In fact, on MacOS, the `cron` utility is itself launched by `launchd`.
`launchd` is an extremely powerful application that can be used in a wide variety of ways. Below, we’ll cover just one of many possible ways to use `launchd` to schedule a data pipeline.
How to Use launchd
Each scheduled `launchd` job gets its own XML configuration file, called a “property list” or plist file. The specific configurations available in a plist file can be found with `man launchd.plist`. We will focus on those most relevant to setting up a repeating data pipeline.
- `Label` uniquely identifies the job.
- `WorkingDirectory` specifies the directory from which the job should run.
- `StartCalendarInterval` defines the calendar interval on which the job will run. The keys are Minute, Hour, Day, Weekday, and Month. Keys left undefined behave the same as `*` time values in `cron`.
- `StartInterval` takes an integer `N` and specifies that the job should run every `N` seconds.
- `Program` defines a path to an executable.
- `ProgramArguments` specifies a vector of arguments to be passed to the job. This must be defined if `Program` is not. Without `Program`, this can be used to directly call utilities such as `touch` and `cp`, though the syntax around, for example, calling multiple commands or using special characters for pipes and redirection can be tricky. We’ll focus on the `Program` approach here.
To schedule a job, we first must create a `.plist` file in the correct location. We will use the `/Library/LaunchAgents` directory, which is for user-defined agents. More details on the available directories in which `launchd` plists can be saved are available in the official documentation.
A simple `launchd` plist to call a bash script every five minutes might look like this:
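A minimal sketch of such a plist (the label `com.example.every5min` and the script path are placeholder assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Unique identifier for the job -->
    <key>Label</key>
    <string>com.example.every5min</string>
    <!-- Path to the executable to run -->
    <key>Program</key>
    <string>/Users/username/myscript.sh</string>
    <!-- Run every 300 seconds (five minutes) -->
    <key>StartInterval</key>
    <integer>300</integer>
</dict>
</plist>
```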
We’ve used the `StartInterval` key to specify that the job should run every 300 seconds (five minutes) and the `Program` key to point to the bash script we want the job to execute.
Example
To schedule the New York Times Books ETL pipeline, we’ll first write a short bash script to set up our Python environment and run `main.py`. Note that we do not have to change the working directory in the script: we specify this option in our `launchd` plist.
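Such a script (a sketch assuming the same virtual-environment layout as in the cron example above) might look like:

```shell
#!/bin/bash
# Activate the project's Python virtual environment and run the pipeline.
# Paths are relative because the working directory is set in the launchd plist.
source ./env/bin/activate
python ./src/main.py
```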
We save this script as `myscript.sh` and make it executable with `chmod +x myscript.sh`.
Then we set up our `launchd` plist as described above. As with `cron`, we use 15, 16, and 3 as the values for Minute, Hour, and Weekday, respectively. This will ensure the script runs every Wednesday at 4:15 PM Pacific time.
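Putting this together, the scheduling plist might look like the following sketch (the label and paths are placeholder assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.example.nytbooks</string>
    <!-- Relative paths in myscript.sh resolve from the project root -->
    <key>WorkingDirectory</key>
    <string>/path/to/project</string>
    <key>Program</key>
    <string>/path/to/project/myscript.sh</string>
    <!-- Every Wednesday (Weekday 3) at 4:15 PM -->
    <key>StartCalendarInterval</key>
    <dict>
        <key>Minute</key>
        <integer>15</integer>
        <key>Hour</key>
        <integer>16</integer>
        <key>Weekday</key>
        <integer>3</integer>
    </dict>
</dict>
</plist>
```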
We’ve set the working directory using the `WorkingDirectory` key; provided the path to the script with the `Program` key; and, under the `StartCalendarInterval` key, defined a dictionary with keys for `Minute`, `Hour`, and `Weekday` to specify the day and time the script should run.
Caveats
- Though `launchd` offers a great deal of flexibility and many more configuration options than `cron` or `anacron`, it is also considerably more complex. A single-line `cron` job may take a dozen or more lines to represent in a `launchd` plist.
- It is more difficult to find relevant documentation and tutorials for specific `launchd` applications than for specific `cron` applications.
- `launchd` jobs will not execute if the computer is powered off at the scheduled time. If the device is asleep at the scheduled time, however, jobs scheduled with `StartCalendarInterval` will execute when the computer wakes.
`launchd` Extensions and Resources
- This blog post provides an excellent overview of the essential job-scheduling functionality of `launchd`.
- To run a scheduled `launchd` job immediately (regardless of schedule), call `launchctl start <job.label>`, where `<job.label>` is the `Label` key defined in the `launchd` plist. This is a good way to make sure the job will run correctly (though it does not guarantee that the schedule is set up correctly).
- A `launchd` job can be configured to run based on a number of different triggers, not just time.
- To confirm that a job is scheduled, you can use `launchctl list | grep <job.label>`, where `<job.label>` is the `Label` key defined in the `launchd` plist.
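Note that `launchd` only picks up a plist once it has been loaded; a typical test session (the label and path here are placeholder assumptions) might look like:

```shell
# Register the agent with launchd (it will also load at the next login)
launchctl load /Library/LaunchAgents/com.example.nytbooks.plist

# Trigger the job immediately to verify it runs correctly
launchctl start com.example.nytbooks

# Confirm the job is registered
launchctl list | grep com.example.nytbooks
```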
Which one should I use?
Each of these three options has benefits and tradeoffs. Here are a few suggestions for choosing the option that works best for you:
- If you’re on Windows, use the Task Scheduler (which we will cover in a later post), or set up `cron` or `anacron` jobs in the Windows Subsystem for Linux (WSL).
- If you’re on MacOS and you need the maximum amount of customization, use `launchd`. `launchd` provides the greatest amount of control over many automation features, such as setting root and working directories, setting environment variables, managing system resources, prioritization, debugging, and logging. Furthermore, its “canonical” status in the MacOS ecosystem suggests it will be supported for some time to come.
- If your job is running from a server or if you need fine-grained control over the time intervals, use `cron`. With `cron`, you can specify the exact minute of the day when you want a job to execute. However, if your computer is suspended or off when the `cron` job is scheduled, it will not run. `cron` is, therefore, particularly suitable for use in an always-on server environment.
- Use `anacron` if you are on Linux and want to be sure your jobs will run even if your computer is occasionally off. `anacron` jobs will run if more than the specified number of days have elapsed since the last run, regardless of the specific scheduled time. This makes `anacron` suitable for use on personal computers that may be off at different times of day. However, running scripts at specific times is easier with `cron` or `launchd` than with `anacron`.
Your Turn
Want to try some of these approaches yourself? Clone the New York Times Book ETL Pipeline GitHub repo if you need an example to work with. The README covers how to get your NYT API keys, sign up for bit.io, set up environment variables, and execute the ETL pipeline. Here are some ideas to get you started:
- Schedule the extraction of more of the lists from the New York Times API. The current example only gets the combined print and e-book fiction and nonfiction lists, but there are others, such as picture books, business books, and graphic novels.
- Update the pipeline to keep a historical record of the lists by first loading all of the old lists into a bit.io repository and then appending each new list according to a schedule (instead of overwriting the list each week).
- Extend the pipeline: use the data from bit.io to populate a dashboard with visualizations of the data. You might consider publishing a Deepnote notebook with a dashboard layout or hosting some Python- or R-generated visualizations on Datapane.
- Improve the automation of the ETL pipeline. It runs on a schedule now, but what if something goes wrong? How can we diagnose or fix it? How can we avoid accidentally overwriting the existing table with bad data? Consider building some basic testing or logging capabilities. We’ll be writing on these topics in a future post, so make sure to sign up for our newsletter!
Keep Reading
We’ve written a whole series on ETL pipelines! Check them out here: