Fighting COVID-19 With Data Engineering

Adam Fletcher
The Inner Join
Published in
6 min readJun 18, 2021

--

Early in the pandemic, my co-founder Jonathan and I took a break from bit.io to work with US Digital Response (USDR) to help governments with any pandemic-related data needs. Below is the story of our work; originally written in April 2020 as a summary for USDR, I wanted to post this here to encourage others to work with USDR by going in to some detail about what the works is and what the impact can be.

RTCOVIDStatus, Pennsylvania, and Data Automation

April, 2020

State leaders need accurate, up-to-date information about their state’s healthcare assets to make decisions that protect public safety. From tracking available beds to predicting staffing needs, data drives these decisions. The time it takes to get from collecting data to insight, though, comes at a real human cost. This was especially evident in the early stages of the COVID epidemic. We had systems producing data almost immediately, but it wasn’t easy to work with. Data collection is a tedious, manual process, and, ultimately, researchers shouldn’t be the ones collecting data anyway. It’s data engineering’s job to provide the information they need to make critical decisions. The process urgently needed streamlining. Reducing time to insight in responding to the COVID crisis can literally save lives.

Recognizing the urgency of this problem, a team of volunteers from US Digital Response partnered with the Commonwealth of Pennsylvania to automate the collection of hospital data. Pennsylvania was already doing an excellent job of recording hospital resource data (hourly, from all hospitals, in CSV format to a central SFTP server) and reporting (via ArcGIS dashboards). However, they lacked resources and time to improve the ingestion, transformation, and analysis stages of the data lifecycle. RTCOVIDStatus, as the project was called, was able to help.

The RTCOVIDStatus team excels at the whole lifecycle of data — recording, ingestion, transformation, analysis, and reporting. We were able to help localities with any aspect of this lifecycle; in this case, we automated COVID-19-related data ingestion and processing for Pennsylvania at the state level.

The project enabled Pennsylvania to launch an online dashboard to track the number of available hospital beds and ventilators on a county-by-county basis. Access to this data saved staff dozens of hours per week. One of the core tenets of RTCOVIDStatus was to not require any additional time or intervention from healthcare workers, so those hours saved went directly back to the front lines.

Background

At the beginning of the COVID-19 crisis, Pennsylvania mandated that all hospitals report resource availability every hour. The resources tracked include all beds by type, PPE by type, ventilators, ECMO, NP specimen supplies, employee availability, and COVID-19-specific data, including number of patients diagnosed with or suspected of having COVID-19. PA has roughly 280 hospitals reporting this information (this number changes as temporary hospitals open or close, hospitals are renamed, and so on).

Some of the 170+ columns of data used by PA

This data is reported into a master CSV file with 170+ columns and one row per hospital. The most recent file along with all historical hourly submissions are available on an FTP server.

Before RTCOVIDStatus was involved, the process the GIS team in Pennsylvania had for generating an ArcGIS report from this data was mostly manual and a time-sink. They had to manually download the latest file, apply transformations on the file (mostly column renames, as ArcGIS’ CSV parser is fragile, and fixing some incorrect latitude/longitude information), upload the file to ArcGIS by overwriting the previous CSV file, and generating a map from the CSV file. The GIS team did this a few times per day when they had free time available. As a result, the dashboard often lagged behind the data. Further, the manual process took a significant portion of a GIS engineer’s time. Instead of finding insights in the data, the engineer was busy moving data around.

RTCOVIDStatus’s Contribution

After discussing needs and challenges with PA’s GIS and emergency response team, RTCOVIDStatus suggested it could replace the manual process with an automated one — a high-value offering that wouldn’t interrupt the workflows of the GIS engineers. To provide that automation, the team at RTCOVIDStatus accomplished the following:

  1. Built python scripts to download the latest data from the SFTP site
  2. Transform that data into a format which ArcGIS could parse
  3. Upload that data into ArcGIS to be used in a dashboard
  4. Trigger this whole flow once every 15 minutes via Google Cloud Platform’s Cloud Functions and Cloud Scheduler

By automating away manual work for GIS engineers, the team freed the PA GIS engineers up to do value-added work and the automation ensured that the dashboard was always up-to-date.

The GIS team in Pennsylvania quickly followed with more requests. Since the start of our engagement, RTCOVIDStatus has:

  • Built a historical database (in ArcGIS) of all the data gathered since the start of PA’s gathering processes
  • Added summary analysis tables in ArcGIS
  • Added county information to the hospital data
  • Summarized data by hospital for the public release of the dashboard
  • Built a database of PA hospitals, past and present

The code is all open-source and available here.

The Impact

Automating the data ingestion and processing saved hundreds if not thousand of human hours during the course of the pandemic. To highlight the value of our contribution, in the first month RTCOVIDStatus was live, there were 1.7 million total views of the live-updating dashboard that the (now) live-updating data underpins (‘PA Launches Hospital Preparedness Database’ announcement).

We collaborated directly with Pennsylvania’s GIS team led by Carrie Tropasso. We cannot say enough good things about what this team has achieved over the course of the last fifteen months. Carrie’s work and guidance was critical to our effort; our team built on top of PA’s work.

Pennsylvania itself deserves praise for their overall data-driven response to COVID-19. They used current, relevant data to make public health decisions in uncertain, volatile times. Knowing what we know now, it’s hard not to think about what we could have done differently, both individually and as a global community. This partnership highlights what I want to see more of in the data community going forward: more immediate productivity with data so key stakeholders like policy makers and researchers can take decisive action that has a real impact on the world.

[Back to Today] Some Lessons

I don’t go into this in the original summary of work above, but I feel it’s important to mention some of the realities of working with the data that cropped up.

Hospitals are more mobile than you might expect.

Hospital data — literally the list of hospital names & addresses — was surprisingly dirty. The 290+ hospitals in PA would frequently change addresses, names, and locations, and some reported data would use a different name for the same hospital. Oh, and some hospitals split county lines, and we were rolling up data by county, so the numbers would change depending on which entrance to that hospital was used in that hour’s reporting.

Columns — we started this in March 2020, right when the pandemic was taking off. It was not clear to anyone what data was needed for response to the pandemic so the column counts, names, and data types were changing on a daily basis as everyone learned what data might be useful for pandemic response.

Bad CSVs — as usual with CSVs, some were broken. There’s a subtle bug in the ArcGIS CSV importer where double quoted strings with certain other features would break the importer and some columns would shift in the middle of the data. We had to figure out these issues and write code to handle bad CSVs on ingestion, and push for the upstream providers to generate ArcGIS-compatible CSVs.

Go Volunteer!

If you’re reading this blog, you have the technical expertise to make an impact in the world around you. Volunteering with USDR or similar groups, writing open source software, even simply applying your programmer’s mindset to working on social problems can have tremendous benefits.

Today, reach out to USDR. Reach out to your local volunteer groups and take some time to listen to what their top technology problems are, and get in there and solve those problems. You’ll be happy you did.

I’d love to hear similar stories; have you used data to have a social impact? Reply below and tell me about it!

--

--