Why Clover Chose gRPC and Protocol Buffers Over Other API Technologies

API Choices when rebuilding our flagship product: Clover Assistant

Danny Vu
Clover Health

--

When we began iterating on Clover Assistant in 2019, we had a rare opportunity to rebuild the infrastructure from the ground up. This gave us the flexibility to consider different technologies that would best suit the software we wanted to build. One of the open questions we had was: what API technology would best suit us going forward?

Behind the Clover Assistant sits a sophisticated full stack technology platform.

Our Problem Statement

Our first iteration of Clover Assistant used REST APIs for service-to-service communication. REST was adequate and served its purpose (it was simple and easy to implement), but it also presented some challenges, most notably brittle and inconsistent data exchange.

How so?

  1. Data exchange was brittle because we lacked shared schemas, schema versioning, and support for backwards compatibility.
  2. Data exchange was inconsistent, varying with the source and target of the data.
  3. Streaming large amounts of data over REST can be cumbersome.

Historically, sharing data models across domains has been problematic. For one, the JSON generated on the API side might be well-structured with known schemas, but because those schemas aren't shared in any meaningful way, the data effectively becomes unstructured on the data platform side of things. Furthermore, we were required to build tons of custom serializers to validate incoming and outgoing data within our APIs, which was incredibly time-consuming.

Adopting shared schema definitions

To address this problem, we knew we had to adopt shared schema definitions and homogenize our data exchange interfaces. We looked at different technologies and weighed their pros and cons to achieve this.

Our first idea was to share Python objects whose schemas were defined with Marshmallow annotations: infer the schema, then share those objects and schemas between the interfacing services. This would have minimized engineering lift and allowed us to continue using REST APIs.
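For context, here is a minimal sketch of what that approach might have looked like. The field names mirror the Visit object shown later in this post and are purely illustrative, not our actual schema:

# A rough sketch of the shared-Python-objects idea: a Marshmallow schema that
# every interfacing service would import and run by hand. Field names mirror
# the Visit object shown later; this is illustrative, not our production code.
from marshmallow import Schema, fields, validate


class VisitSchema(Schema):
    visit_id = fields.String(required=True)
    status = fields.String(
        required=True,
        validate=validate.OneOf(["CREATED", "IN_PROGRESS", "SIGNED"]),
    )
    patient_id = fields.String(required=True)
    procedure_codes = fields.List(fields.String())


# Each producer and consumer validates payloads explicitly -- exactly the kind
# of hand-rolled serialization work we wanted to stop doing.
payload = {"visit_id": "v-123", "status": "CREATED", "patient_id": "p-456"}
visit = VisitSchema().load(payload)  # raises ValidationError on bad input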

However, since we had this chance to choose a foundational technology to build upon, we also wanted to maximize potential performance gains. When handling schema enforcement and data validation on bulk data transfers, for example, REST performed adequately at transfer rates of 50–100 MB/s, but binary-based transfers were much more performant once we pushed beyond that range (>100 MB/s).

So, we started looking at some of the binary formats for data serialization.

There are a number of RPC (remote procedure call) frameworks that standardize binary serialization. We considered packaging this up ourselves, but ultimately decided to explore open-source solutions.

At the time, there were several decent options for data serialization. We looked at Thrift, Avro, and Cap'n Proto, but the combination of gRPC and Protobuf won out over all of them due to their widespread adoption across the developer community.

gRPC is used by Netflix, Square, Cisco, etc.

In all of these cases, a shared schema is required to describe the binary encoding format, and all of them have grown to support a wide range of programming languages. Requiring a shared schema as the natural interface between services has some incredible benefits:

  • The Protobuf binary-encoded data is much more compact than even binary JSON variants because it can omit field names when encoding the data.
  • Since the schema is required for decoding the data, it can serve as a source of truth for documentation purposes. No more chasing to get API documentation updated!
  • Forward and backward compatibility are much easier to reason about by looking at the schema changes. This makes it easier to manage migrations between clients and services.
  • Statically typed programming languages can generate code from the schema and enable type checking at compile time. No more building serializers!
  • We use TypeScript in the front-end, and the code generated from the protobuf definitions enables automatic type checking for all of our code that consumes data from the back-end services, without us having to redefine any types.

gRPC works across many languages and platforms. Image courtesy: https://grpc.io/

Sample schema definition

Here is a sample protobuf schema for a Visit object:

syntax = "proto3";

package visits.visits.v1;

enum VisitStatus {
  VISIT_STATUS_INVALID = 0;
  VISIT_STATUS_CREATED = 1;
  VISIT_STATUS_IN_PROGRESS = 2;
  VISIT_STATUS_SIGNED = 3;
}

message Visit {
  string visit_id = 1;
  VisitStatus status = 2;
  string patient_id = 3;
  repeated string procedure_codes = 4;
}

The build process automatically gives us a Python data serializer:

_VISIT = _descriptor.Descriptor(
  name='Visit',
  full_name='visits.visits.v1.Visit',
  filename=None,
  file=DESCRIPTOR,
  containing_type=None,
  create_key=_descriptor._internal_create_key,
  fields=[
    _descriptor.FieldDescriptor(
      name='visit_id', full_name='visits.visits.v1.Visit.visit_id', index=0,
      number=1, type=9, cpp_type=9, label=1,
      has_default_value=False, default_value=b"".decode('utf-8'),
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
    _descriptor.FieldDescriptor(
      name='status', full_name='visits.visits.v1.Visit.status', index=1,
      number=2, type=14, cpp_type=8, label=1,
      has_default_value=False, default_value=0,
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
    _descriptor.FieldDescriptor(
      name='patient_id', full_name='visits.visits.v1.Visit.patient_id', index=2,
      number=3, type=9, cpp_type=9, label=1,
      has_default_value=False, default_value=b"".decode('utf-8'),
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
    _descriptor.FieldDescriptor(
      name='procedure_codes', full_name='visits.visits.v1.Visit.procedure_codes', index=3,
      number=4, type=9, cpp_type=9, label=3,
      has_default_value=False, default_value=[],
      message_type=None, enum_type=None, containing_type=None,
      is_extension=False, extension_scope=None,
      serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
  ],
  serialized_start=483,
  serialized_end=1483,
)

As well as an automatically generated and typed TypeScript data serializer in the front-end:

export class Visit extends jspb.Message {
  getVisitId(): string;
  setVisitId(value: string): Visit;

  getStatus(): VisitStatus;
  setStatus(value: VisitStatus): Visit;

  getPatientId(): string;
  setPatientId(value: string): Visit;

  getProcedureCodesList(): Array<string>;
  setProcedureCodesList(value: Array<string>): Visit;
  clearProcedureCodesList(): Visit;
  addProcedureCodes(value: string, index?: number): Visit;

  serializeBinary(): Uint8Array;
  toObject(includeInstance?: boolean): Visit.AsObject;
  static toObject(includeInstance: boolean, msg: Visit): Visit.AsObject;
  static serializeBinaryToWriter(message: Visit, writer: jspb.BinaryWriter): void;
  static deserializeBinary(bytes: Uint8Array): Visit;
  static deserializeBinaryFromReader(message: Visit, reader: jspb.BinaryReader): Visit;
}
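
To make the payoff concrete, here is a small sketch of working with the generated Python class. We assume the generated module is named visit_pb2 (the real module name depends on the .proto file name), and the field values are made up:

# Constructing, encoding, and decoding a Visit with the generated Python class.
# The module name visit_pb2 and the sample values are illustrative assumptions.
import json

import visit_pb2

visit = visit_pb2.Visit(
    visit_id="v-123",
    status=visit_pb2.VISIT_STATUS_IN_PROGRESS,
    patient_id="p-456",
    procedure_codes=["99213", "93000"],
)

# Binary encoding omits field names, so the payload stays compact.
payload = visit.SerializeToString()

# Decoding enforces the schema and field types for us -- no hand-written serializers.
decoded = visit_pb2.Visit()
decoded.ParseFromString(payload)
assert decoded.status == visit_pb2.VISIT_STATUS_IN_PROGRESS

# Rough comparison against a JSON encoding that has to carry the field names.
as_json = json.dumps({
    "visit_id": "v-123",
    "status": "VISIT_STATUS_IN_PROGRESS",
    "patient_id": "p-456",
    "procedure_codes": ["99213", "93000"],
})
print(len(payload), len(as_json))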

Shared schema definitions and performance improvements

With this approach, the data passed between services is compact and efficient: it serializes and deserializes very quickly, and bulk data transfer works seamlessly with full validation and schema enforcement. Protocol buffers and gRPC also support streaming, so a database service can stream results through the RPC as they are produced, increasing throughput.
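
As a rough illustration, a server-streaming RPC in Python could look something like the sketch below. The module, service, and method names (visit_pb2_grpc, VisitService, StreamVisits) and the query helper are assumptions for illustration, not our actual service:

# A minimal sketch of gRPC server-side streaming in Python, assuming a proto
# service with "rpc StreamVisits(StreamVisitsRequest) returns (stream Visit)".
# All names here (visit_pb2, visit_pb2_grpc, query_visits) are illustrative.
from concurrent import futures

import grpc

import visit_pb2
import visit_pb2_grpc


class VisitService(visit_pb2_grpc.VisitServiceServicer):
    def StreamVisits(self, request, context):
        # Yield each result as it comes back from the database instead of
        # buffering one huge response in memory.
        for row in query_visits(request.patient_id):  # hypothetical DB helper
            yield visit_pb2.Visit(
                visit_id=row.visit_id,
                status=row.status,
                patient_id=row.patient_id,
                procedure_codes=list(row.procedure_codes),
            )


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    visit_pb2_grpc.add_VisitServiceServicer_to_server(VisitService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()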

How much more efficient was gRPC than REST?

We did a time trial of sending 100,000 medication statement records through a Django REST API versus a gRPC API. In both cases we sent the records in batches of 500, and gRPC showed an almost 2x performance increase (both trials ran on a blocking, single-threaded web server):

Django REST API:
Time to run: 31:50.86 minutes

gRPC API:
Time to run: 17:26.26 minutes

The above numbers are already impressive, but when we compare performance with multi-core threading/multiprocessing, we see nearly 10x improvements. Adding gRPC streaming to the mix makes it even more of an unfair comparison!
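
For a sense of what the client side of this trial looks like, here is a hedged sketch of batching records and pushing them over gRPC with a thread pool. The stub, request, and RPC names are assumptions, not our actual API:

# Sketch of the batched upload client. The names medication_pb2,
# medication_pb2_grpc, MedicationStatementServiceStub, and UploadBatch are
# illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

import grpc

import medication_pb2
import medication_pb2_grpc


def batches(records, size=500):
    for i in range(0, len(records), size):
        yield records[i:i + size]


def upload_all(records, workers=8):
    channel = grpc.insecure_channel("localhost:50051")
    stub = medication_pb2_grpc.MedicationStatementServiceStub(channel)

    def send(batch):
        return stub.UploadBatch(medication_pb2.UploadBatchRequest(statements=batch))

    # workers=1 reproduces the blocking, single-threaded numbers above;
    # a larger pool is where the ~10x improvement shows up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(send, batches(records)))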

Conclusion

To solve the problem of brittle data exchange between all of Clover’s services, we adopted an RPC framework for binary data serialization with shared schema definitions. This not only gave us a source of truth for encoding and decoding data, but yielded many other benefits including performance, flexible forward and backward API versioning, and multi-language automatically generated serializers.

In the end, we chose gRPC and Protobuf above all because documentation and quality of support were paramount to us. Cap'n Proto was slightly faster, but much less widely adopted within the community.

This problem of shared data interfacing has actually been around at Clover since 2015. When engineers first started looking for a solution, most of these RPC options lacked adequate Python 3 support and broad language coverage. They have come quite a long way since then.

However, adopting gRPC comes with its own set of drawbacks (there is a give and take for everything), such as:

  • Adding more complexity to the build process to share the schemas and bump dependencies between services.
  • Larger code bundle sizes to incorporate all the auto-generated protobuf serializers.
  • Serialized data is not human-readable (as compared to JSON) which can make debugging more difficult.
  • Ramping up engineers to switch from thinking about HTTP requests to gRPC calls and proto syntax.

But in the end, the positives have far outweighed the negatives:

  • Bumping dependencies to share schemas is much easier than writing custom serializers.
  • We've used various bundle optimization techniques to minimize our front-end JavaScript packages.
  • Debugging hasn’t been too difficult, since data that can’t be deserialized is pretty obvious to spot.
  • Clover engineers have learned to embrace the proto.

Acknowledgment: This article would not be possible without the hard work of previous Clover engineers like Alec Clowes and Jane Williams. Anyone who has achieved anything in technology stands on the shoulders of giants.
