Platform Engineering journey at Clover

Desmond Ho
Clover Health
Feb 29, 2024


Photo via Pxhere

Background

As a healthtech company, Clover has used cloud technologies from the very beginning. AWS was our major cloud provider until we migrated to GCP in 2019. PostgreSQL and Google Kubernetes Engine play a major part in our software stack. As we grew, it became increasingly challenging to keep our infrastructure up to date and to keep track of changes. We've learnt and evolved over the years in how we manage our cloud infrastructure, and as the Clover engineering team, we're very excited to share our thought process along the way: how we arrived at our current state of managing the infrastructure, and how we will continue to evolve to the next state.

Bumping on the IaC road

If you look back 10 years, Infrastructure as Code (IaC) took very different forms. Virtual machines were often created by hand, and the deployment process might have involved copying the latest release assets from a developer's laptop. Replicating an environment became easier over time, but it could still involve taking a snapshot of an existing virtual machine or replicating a hard disk. As the ecosystem evolved, tools such as Chef, Puppet and SaltStack gained popularity. Combined with HashiCorp's Packer, it became feasible to deploy identical virtual machines, and the term "Cattle vs Pets" was coined.

To explain the "Cattle vs Pets" analogy: traditionally, when a virtual machine was "sick", we would diagnose and resolve the issue. There might be service degradation or an outage until the "illness" was diagnosed, the medicine administered and the virtual machine "cured". That virtual machine is the pet in the analogy; we care for it and want it to be well. With cattle, if a virtual machine is "sick", we may quickly decide to delete it and replace it with a new one.

When Clover started migrating to GCP in 2019, we decided to use Terraform to manage our infrastructure for the following reasons:

  • We're fully in the cloud with no on-prem infrastructure, and GCP has good support for Terraform through its official Terraform modules
  • It’s easy to deploy to multiple environments with Terraform workspaces
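As a hedged sketch of the workspace approach (the resource and bucket names below are illustrative, not ours): a single configuration can read `terraform.workspace` to parameterize per-environment resources, with the environment selected via `terraform workspace select staging`.

```hcl
# Sketch only: the workspace name drives per-environment naming.
locals {
  environment = terraform.workspace # e.g. "staging" or "production"
}

# Hypothetical bucket whose name varies by workspace.
resource "google_storage_bucket" "example" {
  name     = "clover-example-${local.environment}"
  location = "US"
}
```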

Once we settled on moving forward with Terraform, we needed to design the IaC repository layout and deployment pipeline.

Back in 2019, our infrastructure was rather simple. In each environment, we mainly needed a GKE cluster, SQL databases and GCS buckets. We decided to build a monolithic Terraform module which could create all the infrastructure needed for an environment from a single .tfvars file.

# tfvars Example

# GKE section
gke = {
  enabled = true
  # (other settings)
}

# SQL section
sql_instances = {
  "database1" = {
    # (options in database1)
  },
  "database2" = {
    # (options in database2)
  },
}

# (etc...)

For deployment, we wrote a script that ran in our CI.

When a PR was opened, CI would plan the change for our environments, and then apply it when merged to master. Here is our previous workflow.

Workflows with monolith TF repo

This kind of worked, but it also gave us the following pain points as our system grew organically:

  • As more and more components were needed, the .tfvars file grew longer.
  • The growth of the Terraform modules contributed to longer waits for planning and applying.
  • If we wanted to test upgrades or changes in a specific environment, we had to introduce conditional blocks based on the environment, which made the module harder to maintain.
  • In the CI pipeline, the plan might look fine, but because our CI/CD runs terraform apply after a PR is merged, we would occasionally still encounter errors during the apply stage. This blocked other collaborators wanting to merge and apply while the code was being fixed or rolled back.
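To illustrate the environment-conditional pain point, a monolithic module ends up sprinkled with gates like this hypothetical one, each of which makes the shared module harder to reason about:

```hcl
# Hypothetical example: gating an experimental upgrade on the environment name.
resource "google_container_node_pool" "experimental" {
  # Only create the new node pool in staging while the upgrade is being tested.
  count = var.environment == "staging" ? 1 : 0
  # (other settings)
}
```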

Revamping our IaC strategy

Terraform is a great tool. We were able to manage our infra in GCP with it and our team members know how to use it.

However, there were several things on our wishlist which we wanted to solve:

  • Easily plan and change one part or resource of our infra. For example, we might have dozens of GCS buckets in a single GCP project; if we want to modify only one of them, how do we do that without also planning all the other buckets?
  • Keeping things simple and not repeating ourselves
  • Making it easier to replicate or redeploy infrastructure
  • A better way to review infrastructure changes, giving us more confidence to merge changes into master

IaC structure

After doing some research and a proof of concept, we were inspired by this Terragrunt doc, which provides a good example in the terragrunt-infrastructure-live-example repository.

With a little twist, here is the structure we use at Clover:

clover-infra-repo/
├─ _envcommon/
│  ├─ gcp_gcs.hcl
│  ├─ gcp_gke.hcl
│  ├─ (etc...)
├─ cloverhealth/
│  ├─ gcp/
│  │  ├─ gcp_project_1/
│  │  │  ├─ project.hcl
│  │  │  ├─ global/
│  │  │  │  ├─ iam/
│  │  │  │  │  ├─ serviceaccounts/
│  │  │  │  │  │  ├─ clover-sa-1/
│  │  │  │  │  │  │  ├─ terragrunt.hcl
│  │  │  │  │  ├─ (other iam resources...)/
│  │  │  ├─ us-west1/
│  │  │  │  ├─ region.hcl
│  │  │  │  ├─ gcs/
│  │  │  │  │  ├─ clover-bucket-1/
│  │  │  │  │  │  ├─ terragrunt.hcl
│  │  │  │  ├─ (other resources...)/
│  │  ├─ gcp_project_2/
│  │  │  ├─ (similar setup as project 1...)
├─ README.md

With the setup above, we host all the resource templates under the _envcommon folder. We also put some default values in project.hcl and region.hcl, so each resource references its template and inherits the default inputs and values according to the folder it is located in.
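As a sketch of how this looks in practice (paths and names are illustrative, and the include pattern follows the terragrunt-infrastructure-live-example convention), a leaf terragrunt.hcl mostly just points at the shared template:

```hcl
# cloverhealth/gcp/gcp_project_1/us-west1/gcs/clover-bucket-1/terragrunt.hcl
# Sketch only: the shared GCS template supplies the module source and defaults.
include "root" {
  path = find_in_parent_folders()
}

include "envcommon" {
  path = "${dirname(find_in_parent_folders())}/_envcommon/gcp_gcs.hcl"
}

# Only the inputs unique to this bucket live here; everything else is inherited.
inputs = {
  name = "clover-bucket-1"
}
```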

You may also be wondering how we choose which Terraform module to use, and whether we still write modules on our own. Our current strategy is to use Google's official modules as much as possible, but they still don't cover all of our use cases, so we write our own modules when needed.
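For example (a hedged sketch: the version pin and inputs are illustrative, not our actual configuration), an _envcommon template can source one of Google's official modules straight from the Terraform registry:

```hcl
# _envcommon/gcp_gcs.hcl (sketch): shared template for GCS buckets.
terraform {
  # Google's official Cloud Storage module from the registry; version is illustrative.
  source = "tfr:///terraform-google-modules/cloud-storage/google?version=5.0.0"
}

locals {
  # Read per-project defaults from the project.hcl found up the folder tree.
  project_vars = read_terragrunt_config(find_in_parent_folders("project.hcl"))
}

# Illustrative default inputs inherited by every bucket that includes this file.
inputs = {
  project_id = local.project_vars.locals.project_id
  location   = "us-west1"
}
```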

With the new IaC structure, we are able to achieve what we needed, which is:

  • Change part of our infrastructure easily
  • Keep our repo DRY (Don't Repeat Yourself)
  • Redeploy infrastructure easily by simply copying and pasting a folder
  • Join the community in maintaining Terraform modules

IaC review process

The next topic is how we deploy things within our CI pipeline. As our main focus was getting a better review process, we found that Atlantis is a great tool for us.

Atlantis requires a config file to know how to plan and apply resources. Since we use Terragrunt with Atlantis, the terragrunt-atlantis-config project helps us generate the desired config dynamically. This tool gives us the flexibility to inject extra options when Atlantis kicks in (e.g. using different Terraform versions for different modules). Plus, it evaluates dependencies, making sure changes cascade down correctly.

So each time an engineer wants to make a change, they create a PR, and Atlantis comments on the PR with the Terraform plan.

Example Atlantis plan screenshot

Once the PR is approved, we just comment atlantis apply in the PR to ask Atlantis to apply the approved plan for us.

Example Atlantis apply screenshot

For compliance, we need to make sure all changes are approved, so we simply enable branch protection to ensure that all merged changes come from approved PRs and are applied through Atlantis.

Required Atlantis apply screenshot

With Atlantis, we updated our workflow as below:

Current workflow with Atlantis

With Atlantis and some setup in Github, we are able to:

  • Have a clear view of what is changed within the PR
  • Merge with confidence, because the change has already been applied at the PR level
  • Ensure conflicting changes aren't in flight at the same time
  • Enable engineers to self-serve

Summary and future improvements

We built this IaC model entirely on open source tools. We'd like to give a shout-out to those communities, and we're amazed at how the infrastructure world (now called platform engineering) has changed.

Clover has been using this IaC model for a few months. So far it has solved most of the pain points we had, and it gives our engineers a better way to set up the infrastructure they need. Although there is a learning curve for engineers managing resources with Terragrunt, once examples are set up for a specific type of resource, people are able to follow the patterns relatively easily.

The Clover SRE team believes this is the midway point of our Platform Engineering journey, and there are still some topics we look forward to improving:

  • Importing non-managed resources
  • Monitoring and alerting on configuration drift
  • Testing on Terragrunt / Terraform

We hope you found something useful in this write-up of our experiences. See you next time!
