A Week to a Day: Machine Learning Pipeline Optimization at Clover

Jason Nance
Clover Health
6 min read · May 31, 2023


Illustrations by Lisa Xu

Clover’s data science team builds machine learning (ML) models designed to improve the detection and management of chronic diseases. One of the things that makes our platform unique is the feedback loop that allows for rapid iteration and increased model accuracy. In 2022, the pipeline that processes our data for ML took almost a week to run. A week is lightning quick by healthcare IT standards, but as a technology company, we weren’t happy with that.

Join us as we share some of the optimizations our ML engineers made to bring our typical pipeline run duration down from over a week to less than a day!

Getting faster by being lazier

Our pipeline uses very common tools: Python code running in Airflow and backed by a PostgreSQL database. We followed a simple 3-step process to optimize it:

  1. Understand what our code was doing
  2. Determine what our code didn’t need to be doing
  3. Remove as much unnecessary work as possible

The first step was probably the most important. We made heavy use of profiling tools to understand what our code was doing. Since we use Google Cloud Platform, we leveraged Cloud Profiler to transparently measure what our jobs were spending the most time doing.
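
For reference, enabling Cloud Profiler in a Python job looks roughly like the sketch below; the service name and version are placeholders rather than our actual configuration.

```python
# A minimal sketch of turning on Cloud Profiler in a Python job.
import googlecloudprofiler

try:
    googlecloudprofiler.start(
        service="ml-pipeline-example-job",  # hypothetical service name
        service_version="1.0.0",
        verbose=3,  # logging level: 0 = errors only, 3 = most verbose
    )
except (ValueError, NotImplementedError) as exc:
    # Profiling is best-effort; the job should keep running without it.
    print(f"Cloud Profiler not started: {exc}")
```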

Identifying unnecessary work required a strong understanding of our code. We asked ourselves questions like, “Is this database round-trip necessary?”, or “Is this date parsing function more flexible than we need?”

Once we knew what to cut out, the last part was the easiest, although careful testing and comparisons were essential to ensure our changes didn’t alter the pipeline’s behavior.

With that framework in mind, here are some things we found and improved on.

Time slowed to a crawl

We found something surprising while examining profiles for one of our longer jobs: roughly 20% of its 10+ hour runtime was being spent parsing timestamps from strings. This seemed excessive, since the strings were all stored in standard ISO 8601 format.

After investigating, we switched from pandas’ very flexible yet slow to_datetime() function to the standard library’s much more rigid and faster datetime.fromisoformat() method. Our timestamp-parsing code then stopped running in slow motion.
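
To illustrate the difference (with made-up sample strings), the swap looked roughly like this:

```python
from datetime import datetime

import pandas as pd

iso_strings = ["2022-01-15T08:30:00", "2022-01-16T09:45:12"]  # illustrative samples

# Flexible but slow: pandas sniffs the format of every string it parses.
parsed_slow = [pd.to_datetime(s) for s in iso_strings]

# Rigid but fast: fromisoformat() only accepts ISO 8601, which is all we store.
parsed_fast = [datetime.fromisoformat(s) for s in iso_strings]
```

One caveat: before Python 3.11, fromisoformat() only handles a subset of ISO 8601 (it rejects a trailing “Z”, for example), so this swap only works if your stored format matches what it accepts.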

More than enough protobuf

Clover makes heavy use of Protocol Buffers and gRPC for service-to-service communication, and our ML pipeline even receives its raw data as protobuf messages. These messages’ data can be accessed directly in Python code, but you can also convert them to standard Python dictionaries for more familiar semantics.

We found a 4-hour job that was spending 3 hours dutifully converting protobuf messages to dictionaries. We realized we could read the same data directly from the messages without converting them, thereby relieving 75% of that job’s duties.
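
In spirit, the change looked something like the sketch below; the message type and field names are hypothetical.

```python
from google.protobuf.json_format import MessageToDict


def total_paid_slow(claims):
    # Converting each message to a dict just to read one field is expensive.
    return sum(MessageToDict(claim)["paidAmount"] for claim in claims)


def total_paid_fast(claims):
    # The same field can be read directly off the protobuf message.
    return sum(claim.paid_amount for claim in claims)
```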

The case of the missing index

It was a tale as old as time: a database query performing a lookup on an ID column accounted for most of the runtime of an 11-hour job. But there was a unique index that included the ID column! What gives?

Upon closer inspection, the relevant ID wasn’t the first column in the unique index, meaning our PostgreSQL database elected not to use the index for lookups based solely on that ID. We quickly rectified that by adding a separate unique index just for the problematic ID, making the lookups much faster.
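
As a rough sketch (with hypothetical table and column names), the fix amounted to something like this:

```python
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://user:password@host/db")  # placeholder DSN

with engine.begin() as conn:
    # The existing unique index covered (batch_id, record_id), so lookups on
    # record_id alone couldn't use it. A dedicated index fixes that:
    conn.execute(sa.text(
        "CREATE UNIQUE INDEX IF NOT EXISTS ix_records_record_id "
        "ON records (record_id)"
    ))

    # EXPLAIN should now show an index scan instead of a sequential scan.
    for row in conn.execute(
        sa.text("EXPLAIN SELECT * FROM records WHERE record_id = :rid"),
        {"rid": 12345},
    ):
        print(row[0])
```

On a busy production table, you’d likely prefer CREATE INDEX CONCURRENTLY (run outside a transaction) so the table isn’t locked while the index builds.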

Pivoting to SQL

The final step in our job to generate aggregated features for our ML models was calculating a massive pivot using the aptly named DataFrame.pivot() method. Not only did the computation consume a ton of memory and require us to run it in batches, but it was the only part of the job that required retrieving all the data from the database. This led us to ask: could we calculate the pivot in SQL?

One very ugly set of CASE statements later, we were able to run the huge pivot entirely in our database, drastically reduce memory usage, eliminate the need for batching, and cut a 20-hour job down to only 5 hours.
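
A simplified version of the idea, with hypothetical feature names, looks like this:

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://user:password@host/db")  # placeholder DSN

# Before: pull every row into memory and pivot in pandas (memory-heavy, batched).
long_df = pd.read_sql("SELECT member_id, feature_name, value FROM features", engine)
wide_df = long_df.pivot(index="member_id", columns="feature_name", values="value")

# After: one CASE expression per feature lets the database do the pivot.
pivot_sql = """
    SELECT
        member_id,
        MAX(CASE WHEN feature_name = 'age'        THEN value END) AS age,
        MAX(CASE WHEN feature_name = 'risk_score' THEN value END) AS risk_score
    FROM features
    GROUP BY member_id
"""
wide_df = pd.read_sql(pivot_sql, engine)
```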

Similarly, a separate process generating our analytic datasets for ML model training and inference was querying each aggregated feature from the database separately and joining them in memory as pandas DataFrames. We rewrote this to do as much of the joining in the database as possible and, after adding appropriate indexes, realized an even more impressive speedup from 20 hours to 2 hours.
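
The before-and-after shape of that change, again with hypothetical table names, looked roughly like this:

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://user:password@host/db")  # placeholder DSN

# Before: one query per feature table, merged in memory with pandas.
feature_a = pd.read_sql("SELECT member_id, value AS feature_a FROM feature_a", engine)
feature_b = pd.read_sql("SELECT member_id, value AS feature_b FROM feature_b", engine)
dataset = feature_a.merge(feature_b, on="member_id", how="left")

# After: a single joined query, executed where the data and indexes live.
dataset = pd.read_sql(
    """
    SELECT a.member_id, a.value AS feature_a, b.value AS feature_b
    FROM feature_a AS a
    LEFT JOIN feature_b AS b USING (member_id)
    """,
    engine,
)
```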

Not-so-incremental improvements

We found several ways to improve the speed of our processing jobs, but ingestion was by far the longest-running step. At the beginning of our pipeline, we were reloading all raw data used for ML, even if the data hadn’t changed. This was obviously a huge waste of time, but performing incremental calculations would require drastic logic changes across our whole pipeline.

Fortunately, our raw data and the APIs we used to access it already contained the key ingredient for an incremental pipeline: a field representing the time the record was last modified. We set out to retrofit our entire pipeline with an incremental processing mode based on that timestamp.
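
At its core, the incremental mode boils down to a watermark query; the sketch below uses hypothetical table, column, and function names.

```python
import sqlalchemy as sa


def load_changed_records(conn, last_watermark):
    """Fetch only records modified since the previous successful run."""
    rows = conn.execute(
        sa.text(
            "SELECT * FROM raw_records "
            "WHERE last_modified_at > :watermark "
            "ORDER BY last_modified_at"
        ),
        {"watermark": last_watermark},
    ).fetchall()

    # Persist the new watermark only after downstream processing succeeds.
    new_watermark = max(
        (row.last_modified_at for row in rows), default=last_watermark
    )
    return rows, new_watermark
```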

This was a long and arduous journey, since our work on each step of the pipeline revealed subtle edge cases in the incremental processing logic. We performed exhaustive at-scale testing in our staging environment with rigorous before-and-after comparisons to convince ourselves we hadn’t broken anything. We also built tools to help our data scientists manage the new functionality, including a command to trigger full refreshes on downstream jobs; this was useful when changing a job’s logic, since the changed job and all downstream jobs would then need to reprocess all of their data.

Fortunately, the reward at the end made our incremental processing quest totally worth it. Ponderous ingestion jobs that ran for over 60 hours and dragged out longer every week were reduced to 5 hours, or even less if no new data was available. Fifteen-hour downstream tasks also shrank to 5 hours or less, depending on the amount of new data. With incremental processing fully implemented, we finally saw our regular runtimes drop below the 24-hour mark.

Learnings and what’s next

We took away several learnings from our optimization adventure:

  • Timestamp parsing can be really slow, but it doesn’t need to be if your timestamp formats are uniform
  • In-memory protobuf message conversion can be slow and should be avoided unless necessary
  • Database indexes are important for good performance in read-heavy workloads, but they have to be specified correctly
  • Offloading calculations to the database can be a large performance boost, especially when it saves you from having to retrieve the data from the database
  • Most of all, try not to re-process data that hasn’t changed!

Our data scientists are very happy with the new sub-day runtime, but our ML engineers suspect they could be happier still. We’re adding data and functionality to the pipeline all the time, and we anticipate reaching the limits of Python and PostgreSQL at some point in the near future. We’re currently investigating ways to re-architect the pipeline to use more scalable compute and storage engines, such as Apache Spark and BigQuery.
