Building a Control Plane for Lyft’s Shared Development Environment

Michael Meng
Lyft Engineering
Sep 6, 2023 · 11 min read


Background

Note: This publication assumes you have basic familiarity with the service mesh pattern (e.g. Istio, Linkerd, Envoy — created at Lyft!) in microservice architectures. In addition, it is recommended you read the 2021 precursor post written by my colleague, Matt Grossman.

Lyft runs hundreds of microservices to power the company’s offerings. Our team, the Developer Infrastructure team, aims to build the best tools to enable microservice owners (our “customers”) to reliably and quickly test changes in a local and/or end-to-end environment.

Testing in a prod-like environment plays a critical role in Lyft’s Software Development Life Cycle (SDLC). In 2021, the DevInfra team launched a fundamental shift in our developer environments to scale developer productivity: instead of providing difficult-to-maintain, fully isolated environments (EC2 VMs), we built tooling to isolate requests and route them to pull requests (PRs) deployed within our shared staging environment. We called this workflow staging overrides. In a nutshell, it breaks down into 3 steps:

  1. Offloaded deployment: create a deployment of your experimental code in our shared staging that doesn’t register with service discovery.
  2. Routing overrides metadata: embed metadata in API request headers defining which offloaded deployment the request will get routed to.
  3. Context propagation: routing overrides metadata is propagated throughout the service mesh. At the appropriate upstream call, an Envoy filter mutates the routing to the offloaded deployment under test. The term context propagation comes from distributed tracing.
Header metadata allows users to modify call flow on a per-request basis

Step 2 is achieved via an in-house Electron application, henceforth referred to as ProxyApp, that functions as a man-in-the-middle (MITM) proxy similar to Charles. ProxyApp proxies calls between the mobile app and the staging API Gateway. As a developer, you write TypeScript configuration snippets in ProxyApp to specify SHA-based routing overrides. ProxyApp then translates these SHAs into the IP addresses of your offloaded deployments and embeds the resulting IP-based routing overrides metadata into the baggage of the OpenTracing HTTP header x-ot-span-context (baggage is a key-value structure carried within the header). This is the header that undergoes the context propagation referenced in Step 3. Our 2021 infrastructure leaned on ProxyApp to embed routing overrides metadata because it was convenient to implement, but in practice it came with many limitations in usability, availability, and extensibility.

Routing overrides metadata with ProxyApp
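For intuition, the metadata ProxyApp packs into the baggage looks conceptually like the object below. The real payload is a base64-encoded protobuf, and these field names are illustrative rather than our actual schema:

// Conceptual view of the routing-overrides metadata carried in the
// x-ot-span-context baggage (illustrative field names, not the real protobuf).
const routingOverrides = {
  overrides: [
    // "Route calls bound for service_A to the pod running my offloaded deployment"
    { service: 'service_A', podIp: '10.0.42.7' },
  ],
}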

Staging overrides was well received by our product engineers. After 100+ services onboarded in the first few months, we gathered feedback from customer teams on potential upgrades and started to dream bigger. We wanted to mature the tooling UX and expand coverage across our microservices by supporting new use cases. We had delivered a successful 0-to-1; now it was time to double down on this testing paradigm and build out the 1-to-10.

Goal

The vision for this 1-to-10 centered on moving the translation of routing overrides from ProxyApp to a central Envoy control plane. Test requests from clients such as the iOS rider app, Chrome web browser, in-house CLIs, and cURL add a header with an identifier we call the Context ID. As a request propagates throughout the service mesh, its Context ID is used to query the control plane for routing modifications.

Goodbye ProxyApp! Hello Context IDs!
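As a rough sketch, opting a test request into a Context ID sandbox is just a matter of adding one header; the header name and host below are placeholders for illustration rather than the exact values we use:

// Minimal sketch: any HTTP client (cURL, the mobile app's network layer, a CLI)
// can tag a request with a Context ID by setting a single header.
const response = await fetch('https://api.staging.example.com/v1/ping', {
  headers: { 'x-lyft-context-id': 'mmeng' }, // hypothetical header name
})
console.log(response.status)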

What did moving from ProxyApp to a centralized control plane achieve?

  1. By removing the local-machine dependency from our cloud-based development environments, we boosted environment availability: overrides are now persisted to a durable datastore rather than read from an Electron app, where their lifetime was tied to laptop availability.
  2. We created a better abstraction and mental model for one’s individualized test sandbox. With the ProxyApp workflow, an engineer needed to perform TypeScript surgery to attach routing overrides metadata to a test request; in contrast, with the control plane, an engineer only needs to specify their Context ID.
  3. We unlocked new collaboration opportunities with shareable public URLs. More on this in the Use Case: Frontend Preview Environments section. Plus, this Context ID abstraction came with support for multiple environments per developer.
  4. We increased the extensibility of staging overrides by moving the context injection duties from ProxyApp to Envoy (Edge Gateway and Sidecar Envoy). This unlocked new infrastructure capabilities, such as acceptance testing. More on this in the Extensions: Safer Automated Testing section.

The following sections of this article will cover:

  • Control Plane for Context IDs — a deep dive into building the control plane to teach Envoy how to natively translate a Context ID into routing overrides and re-route an in-flight request.
  • Use Case: Frontend Preview Environments — an exploration of one example use case where the Context ID paradigm brought significant gains in developer productivity.
  • Extensions: MITM Filter & Safer Automated Testing — a discussion of how we re-plumbed the MITM functionality and leveraged the increased extensibility of routing overrides to integrate with automated testing infrastructure.

Control Plane for Context IDs

As discussed above, in the ProxyApp era of staging overrides we used our ContextProp filter to extract override information from x-ot-span-context and redirect the request to our offloaded deployment via Envoy’s ORIGINAL_DST cluster.

Traffic processing in our Envoy filter chain, prior to Context IDs

We evolved this mechanism into a control plane that translates the Context ID into a specific behavior for that request, at that hop in time. The control plane comprises two main components: ContextRouter, an Envoy decoder filter, and ContextManager, a Go service that stores the mappings of Context IDs to routing overrides.

First, let’s cover ContextRouter. This filter is layered right before our ContextProp filter, but still after basic API Gateway filters such as rate limiting and auth, ultimately chained together by the HTTP connection manager. The responsibility of ContextRouter is to translate a Context ID HTTP header into our beloved x-ot-span-context header with routing overrides in the trace baggage. This is the x-ot-span-context that ContextProp already understands how to use for request forwarding.

ContextRouter does this Context ID translation by consulting ContextManager, via Envoy’s built-in gRPC client. ContextManager returns routing overrides for a given Context ID, in the form of IP addresses of the offloaded Kubernetes pods.

ContextRouter translates Context IDs into IP overrides that ContextProp understands
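Expressed as TypeScript types for readability (the real exchange is gRPC/protobuf, and these names are approximations rather than our actual schema), the lookup between ContextRouter and ContextManager boils down to:

// Hypothetical shape of the ContextRouter -> ContextManager lookup.
interface GetRoutingOverridesRequest {
  contextId: string // taken from the Context ID HTTP header, e.g. 'mmeng'
}

interface RoutingOverride {
  service: string // the upstream hop this override applies to
  podIp: string // IP of the offloaded Kubernetes pod to route to instead
}

interface GetRoutingOverridesResponse {
  overrides: RoutingOverride[]
}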

This new control plane paradigm flipped the availability of staging overrides from ephemeral to durable, since routing overrides are persisted in ContextManager’s datastore. In the ProxyApp generation, if Teammate A had their laptop closed, Teammate B could not use the routing overrides set up on Teammate A’s proxy URL. However, with the newfound durability of Context IDs, a San Francisco-based engineer can sleep well at night knowing that their teammate in Eastern Europe can still use the same Context ID sandbox environment.

Routing overrides in ContextManager’s datastore

To enable our customers to use this new workflow, we built out tooling to support creating and modifying Context IDs. When a Lyft engineer performs an offloaded deploy for their GitHub PR, a Context ID is created, mapping to a routing override for the commit SHA that was offloaded. Customers may also edit these Context IDs for more complex use cases, such as simultaneously testing changes across 2+ services. We leveraged Lyft’s OSS tool for infrastructure management, Clutch, to build a portal where customers can manage the overrides associated with Context IDs.

Clutch portal for Context IDs, replacing delicate TypeScript configuration
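Conceptually, the mapping a customer manages through Clutch (and that ContextManager persists) looks something like the record below, with illustrative field names:

// An illustrative Context ID record: one ID maps to SHA-based overrides
// for one or more services.
const contextRecord = {
  contextId: 'mmeng',
  overrides: [
    { service: 'service_A', sha: 'abc1234' }, // created automatically on offloaded deploy
    { service: 'service_B', sha: 'def5678' }, // added in Clutch to test two services together
  ],
}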

The job of ContextManager is to translate these SHA-based overrides into IP-based overrides when ContextRouter asks. It does this by talking to the Kubernetes API, which allows querying the state of API objects — in this case, pods. Once ContextRouter receives the IP-based overrides, it performs the packaging operations — creating the right protobuf, base64-encoding, etc. — to neatly assemble x-ot-span-context. If that packaging process sounds familiar, it’s because it moved from ProxyApp into ContextRouter, native to Envoy!

Journey of translating a Context ID
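Here is a self-contained sketch of that translation step. ContextManager is written in Go and reads pod state from the Kubernetes API; in this sketch the pod list is passed in directly, and all names are illustrative:

// Minimal sketch of ContextManager's SHA -> IP translation. In reality the pod
// list comes from the Kubernetes API, filtered to offloaded deployments.
interface ShaOverride { service: string; sha: string }
interface OffloadedPod { service: string; sha: string; ip: string }
interface IpOverride { service: string; podIp: string }

function resolveOverrides(overrides: ShaOverride[], pods: OffloadedPod[]): IpOverride[] {
  return overrides.flatMap(({ service, sha }) => {
    const pod = pods.find((p) => p.service === service && p.sha === sha)
    return pod ? [{ service, podIp: pod.ip }] : [] // drop overrides with no live pod
  })
}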

To summarize, the new Context ID test journey for a Lyft engineer looks like this:

  1. Specify your Context ID in a test client, say the Lyft iOS rider app.
  2. Simulate a Lyft ride 🚗 or other test scenario.
  3. A multitude of requests are generated from ride request to pickup to drop-off. At the request level, the control plane translates your Context ID into routing overrides.
  4. With that routing information, Envoy will route to your offloaded deployment at the appropriate hop to exercise the work-in-progress business logic you wanted to test.

Use Case: Frontend Preview Environments

With our control plane built, we’re halfway to delivering this next generation of developer environments. Now the attention turns to: how do we give clients across all use cases — from iOS/Android mobile apps to Chrome/Safari/etc. web browsers — the same fast, reliable experience of using a Context ID sandbox?

On the mobile side, in our company-internal pre-release apps, we leveraged existing infrastructure to inject our Context ID HTTP header into all outbound requests towards our API Gateway.

Pointing a mobile client at a Context ID sandbox

On the web side, the existing workflow for sharing changes was limited. Teams stuck to the tried-and-true methods of sharing screenshots or screen recordings back and forth, spinning up the branch locally, or deploying to staging. Each of these iteration cycles took more than 10 minutes and painfully slowed development.

With our new control plane, we were able to leverage the fact that Envoy natively resolves Context IDs to easily create shareable links for frontend changes in progress. All you need to do is specify your Context ID via a query parameter: for example, applytodrive-staging.lyft.net?lyft_context_id=mmeng. The query parameter design fit naturally into Lyft’s infrastructure: it didn’t require scaling TLS certificate management and avoided issues with SAML redirects.

When the plugin we added to our Next.js ecosystem sees this query parameter, it sets a Context ID cookie. Our ContextRouter filter reads the Cookie header and understands how to route the request accordingly to your offloaded deployments.

Frontend preview environment for lyft.com
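As a rough sketch of that hand-off, a Next.js middleware can copy the query parameter into a cookie; the cookie name and the exact mechanics of our plugin are assumptions here:

// middleware.ts: a minimal sketch, not our actual Next.js plugin.
import { NextRequest, NextResponse } from 'next/server'

export function middleware(request: NextRequest) {
  const contextId = request.nextUrl.searchParams.get('lyft_context_id')
  const response = NextResponse.next()
  if (contextId) {
    // Persist the Context ID so subsequent requests (without the query
    // parameter) still reach the same sandbox via the Cookie header.
    response.cookies.set('lyft_context_id', contextId) // cookie name assumed
  }
  return response
}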

Extensions: MITM Filter

Even with all the benefits of Context IDs, developers still needed a way to inspect requests and responses in this new workflow. To accomplish this, we built another Envoy filter, henceforth referred to as the MITM filter. The essence of the MITM filter is that it relays HTTP and gRPC requests to the ProxyApp server using a bidirectional gRPC stream. For the curious Envoy-oriented reader, we considered using the external processing OSS filter, but it didn’t quite fit our needs. We wanted more control at each hop: for instance, the ext_proc filter always calls the ext_proc server, but we only wanted to relay requests when users explicitly tell us they want to tap the traffic of a specific service.

A request is only intercepted if the Context ID contains metadata opting in to interception; otherwise, the MITM filter acts as a passthrough. We refer to the act of opting in as ConnectContext. From the Lyft engineer’s perspective, you specify which requests to intercept based on the source (downstream) and destination (upstream) services. For example, the following config would intercept requests egressing to service_B or ingressing from service_A.

ConnectContext({
  // Opt-in requests tagged with this Context ID HTTP header
  context: 'mmeng',
  intercept_to: ['service_B'],
  intercept_from: ['service_A'],
})

ConnectContext can intercept requests/responses for any service, at any hop, in real-time. Contrast this to the previous generation of staging overrides: to see what upstream requests are made by your offloaded facet, you had to comb through historical logs in Kibana.

Routing requests back to ProxyApp from the MITM Filter

MITM interception also extends to mocking applications. In the previous staging overrides workflow, testing what happens if an upstream service gives back a certain response meant one needed to run the downstream service locally and hardcode the upstream call response. With ConnectContext, mocking the response for an upstream request can be a one-liner. Here’s an example of mocking one’s upstream response code to 500 to test fault tolerance.

post('/v1/post-endpoint', async (c) => {
  // Example mock: set the foo header to value bar on the request
  c.request.headers.set('foo', 'bar')

  // Fetch the real upstream response
  await c.fetchApiResponse()

  // Example mock: set the response code to 500
  c.response.status = 500
})

Extensions: Safer Automated Testing

Lastly, moving the context injection responsibilities from ProxyApp to Envoy (Edge Gateway and Sidecar Envoy) elevated the extensibility of routing overrides, because we are now able to attach routing overrides to requests that originate from within the service mesh. With this, we unlocked several new infrastructure capabilities for safer, automated testing.

The first infrastructure piece we improved was Lyft’s automated acceptance tests (henceforth abbreviated as ATs): we moved their runtime from post-PR-merge to pre-PR-merge to improve the reliability of our shared staging environment. Previously, ATs ran post-merge against the main staging deployment, which sometimes meant bad code was deployed and caused staging outages. Moving to pre-PR-merge ATs meant the microservice worker that runs ATs could instead route to an offloaded deployment, so bad changes are caught before they ever reach staging. Before, when x-ot-span-context was strictly handled by ProxyApp, requests originating from within the service mesh were unable to attach routing overrides to offloaded deployments. Now, with Sidecar Envoy natively understanding Context IDs, we were able to implement this design.

The second integration point arose when a customer team, Dispatch, wanted to predict how PRs would alter marketplace metrics. The idea: create an offloaded deployment of the PR under test, have a service send historical dispatch cycles from S3 to both the offloaded deployment and the main staging deployment, and compare the outputs. Similar to pre-PR-merge ATs, the Dispatch team was able to have their mesh-originated requests attach a Context ID upon egress and trust that they would be routed to the intended offloaded deployment.

Conclusion

Usage of context propagation for developer testing is growing rapidly among companies (DoorDash, Uber, Cruise/Yelp) managing complex microservice architectures. At Lyft, we combined OpenTracing/OpenTelemetry, Envoy (HTTP filters, ORIGINAL_DST clusters), Kubernetes (Go control plane), and in-house UI/UX to shift engineers into this new paradigm of development.

Context IDs significantly improved developer productivity at Lyft, across all types of engineering disciplines.

  • Backend: LOC (lines of code) of Typescript configuration were completely eliminated, in favor of the aforementioned Clutch UI. Environment availability advanced from ephemeral to durable.
  • Frontend: workflow times were cut down from minutes to seconds with Context ID preview URLs.
  • New use cases: safer pre-deploy automated testing with acceptance tests and mesh-originated experiments.

Building out a centralized control plane brought significant improvements to the usability, availability, and extensibility of staging overrides. As a result, we were able to double down on the paradigm of context propagation-based development workflows for Lyft and realize the full potential of request isolation in a prod-like environment of manual and simulated traffic.

Special thank you to all the engineers who helped make this vision a reality: Pierre-Guillaume Herveou, Max Melamed, Scott Wilson, Matt Grossman, Jake Kaufman, Kyle Xiao, Anatolii Kurochkin, Seun Suberu, Rithu John, and Brian Balser. This project would not have been possible without your technical depth, grit, and contributions. Lastly, thank you to the copy editors of this blog post: Taylor Overturf and Jake Kaufman.

Lyft is hiring—if you’re interested in working on infrastructure problems like these, please take a look at our careers page.
