Vulnerability Management at Lyft: Enforcing the Cascade - Part 1

Alex Chantavy
Lyft Engineering
Nov 17, 2022 · 13 min read


Diagram: converting container scan data into tickets, linked with automated pull requests

Abstract

Over the past 2 years, we’ve built a comprehensive vulnerability management program at Lyft. This blog post will focus on the systems we’ve built to address OS and OS-package level vulnerabilities in a timely manner across hundreds of services run on Kubernetes. Along the way, we’ll highlight the technical challenges we encountered and how we eliminated most of the work required from other engineers. In this first of two posts, we describe our graph approach to finding where a given vulnerability was introduced — a key building block that enables automation of most of the patch process.

Introduction

Resolving software vulnerabilities on a microservices architecture is a complicated task, as most of the vulnerable software installed on a given service’s container image will not have been installed by the service itself — rather, it is more likely to have been included by one of the service’s parent images, possibly several layers deep.

This first of two posts focuses on containers and tells the story of how and why we decided to use a graph approach — powered by cartography — to locate vulnerabilities in container images and understand parent-child relationships between these images in the form of dependency trees.

In the second blog of this series, we’ll show how this data model can be used to automate the fix process so that in most cases all a service owner has to do is ship an auto-generated pull request (which usually deploys automatically too). With hundreds of services, it is not viable to open individual tickets with teams and manually track mitigation from there. Our system automatically marks tickets as complete when we detect that the underlying vulnerabilities are resolved.

This is a complex set of tools to build and maintain in-house, but we decided to build rather than buy because, at the time of our initial research, no existing solution or set of solutions fit our requirements (detailed in later sections).

It’s also worth noting that this blog only covers OS package dependencies and not how to handle vulnerable libraries from specific languages like Python, Java, and Go. Handling language-specific vulnerabilities adds complexity that is beyond the scope of this post because each language has its own notion of dependency management and specific ways to mitigate issues.

Scope, requirements, and principles

Before diving into technical specifics, we need to explain what exactly we were trying to accomplish with our vulnerability management program. For our initial scope, we decided to prioritize publicly accessible interfaces, which included internet-exposed endpoints, Kubernetes hosts, and container images.

Our program has 3 requirements:

  1. Continuously scan in-scope resources for vulnerabilities. This is made more difficult because we have many homegrown, bespoke infrastructure components that don’t easily integrate with vulnerability detection/mitigation tools available commercially or in open source.
  2. Remediate detected vulnerabilities in a timely manner. Publicly known computer security flaws are cataloged as CVEs (Common Vulnerabilities and Exposures), which are assigned severity levels by well-known organizations based on their risk and impact. We file Jira tickets to teams on a monthly basis and set timelines for resolution based on the severities found across multiple vulnerability data sources, manually triaging any conflicts (a simplified sketch of this severity-to-deadline mapping appears after this list). If our scan detects a critical vulnerability, then we cannot wait for the standard monthly cadence and must resolve the vulnerability as soon as possible.
  3. Provide quarterly reports on our progress. We’ve previously blogged on how we do this with Flyte workflows.
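
As referenced in requirement (2), here is a minimal sketch of merging severities from multiple data feeds and turning them into a resolution deadline. The feed names, severity ranks, and day counts below are hypothetical placeholders for illustration, not our actual policy.

```
from datetime import date, timedelta

# Hypothetical severity ranking and resolution windows -- illustrative values only.
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}
RESOLUTION_DAYS = {"LOW": 90, "MEDIUM": 60, "HIGH": 30, "CRITICAL": 1}

def merged_severity(severities_by_source: dict) -> str:
    """Conservatively take the worst severity reported across data sources.

    Disagreements between feeds still get flagged for manual triage elsewhere.
    """
    return max(severities_by_source.values(), key=lambda s: SEVERITY_RANK[s])

def due_date(severities_by_source: dict, found_on: date) -> date:
    return found_on + timedelta(days=RESOLUTION_DAYS[merged_severity(severities_by_source)])

# Example: two feeds disagree about CVE-123; we treat it as HIGH and give it 30 days.
print(due_date({"feed_a": "HIGH", "feed_b": "MEDIUM"}, date(2022, 11, 1)))
```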

We’ve tried to adhere to some key principles:

  1. Prefer fixing over telling. Wherever possible we should try to fix issues ourselves rather than delegating them to other engineers so that they can continue focusing on the priorities of their teams. Obviously this doesn’t work for risky, business-critical services that we do not own, but service teams are busy and we need to be cognizant of that.
  2. Prefer automation. This ties into our requirement to run the program in a sustainable, steady state.
  3. Any guidance we (as the security team) provide to a service team must be actionable. We must only ask a service team to fix an issue if we are confident that it is fixable. Many documented vulnerabilities do not have known fixes, so we must triage data feeds to avoid spamming teams with problems that they can’t do anything about. Critical issues without known fixes require a more all-hands-on-deck approach, but that is out of scope for this post.

Lyft’s image deployment flow

As mentioned earlier, we run hundreds of services whose images are deployed as containers on Kubernetes. We decided that an image presented a risk only if it was deployed and actively running in a production or staging environment. Put another way, an image presents no risk if it never actually runs. Given time and compute constraints at our scale, we minimized our scope by scanning only the resources that present risk.

Every time a developer pushes code to a service’s GitHub repository, our CI (Continuous Integration) pipeline builds a container image, tags it with the Git commit revision, and pushes it to that service’s AWS ECR (Elastic Container Registry) repository:

Diagram: every commit pushed to a service’s GitHub repo gets built as a container image and saved in AWS ECR
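
A stripped-down sketch of that CI step might look like the following; the registry URL, service name, and docker invocation are placeholders, not necessarily how our pipeline actually does it.

```
import subprocess

def build_and_push(service: str, git_sha: str, registry: str) -> str:
    """Build the service image, tag it with the Git commit revision, and push it to ECR.

    `registry` is a placeholder such as 123456789012.dkr.ecr.us-east-1.amazonaws.com;
    authentication against ECR is assumed to have happened already.
    """
    image = f"{registry}/{service}:{git_sha}"
    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)
    return image

# Example call CI might make for the commit that triggered the build (hypothetical values):
# build_and_push("users", "0a1b2c3", "123456789012.dkr.ecr.us-east-1.amazonaws.com")
```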

Problem: where do we track all these assets?

The foundation for any vulnerability management solution is its inventory system. We chose to unify all of our program data (ECR images, Kubernetes hosts, network scans, etc.) using cartography, not only because a graph approach is able to faithfully represent things like image lineage data (covered in a lot of detail in the next section), but also because it lets us easily correlate any ECR image to its service definition, on-call team, GitHub repo, and pretty much anything else related to it, as shown below:

Diagram: rough sketch of our issue management graph data model, with Jira ticket nodes linked to their associated software package, image, and service nodes
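
To give a flavor of those correlations, here is a minimal sketch of querying the graph with the neo4j Python driver. The node labels and relationship types (Service, ECRImage, GitHubRepository, Team, DEPLOYED, BUILT_FROM, OWNED_BY) are illustrative stand-ins, not cartography’s exact schema.

```
from neo4j import GraphDatabase

# Illustrative query: given an ECR image digest, find the service that runs it, the
# GitHub repo it was built from, and the team that owns it. Labels and relationship
# types are stand-ins for our schema, not cartography's exact data model.
QUERY = """
MATCH (img:ECRImage {digest: $digest})<-[:DEPLOYED]-(svc:Service)
MATCH (svc)-[:BUILT_FROM]->(repo:GitHubRepository)
MATCH (svc)-[:OWNED_BY]->(team:Team)
RETURN svc.name AS service, repo.fullname AS repo, team.oncall AS oncall
"""

def attribute_image(driver, digest: str) -> list:
    with driver.session() as session:
        return [record.data() for record in session.run(QUERY, digest=digest)]

if __name__ == "__main__":
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    print(attribute_image(driver, "sha256:example-digest"))
```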

In the past we’ve found cartography useful in other graph problems, like understanding which users can access a resource, so it was exciting to see container vulnerability management take the shape of a graph problem too. Being able to quickly attribute assets to services, teams, and on-calls, as well as identify other affected assets and potential attack paths, is incredibly useful during security incident response, but we digress; that’s a topic for another article.

The initial architecture for our program involved aggregating most of our assets to cartography and looked something like this:

Diagram: using cartography as a vulnerability management inventory. Infra data is aggregated into the graph database, and various reporting mechanisms query the graph to determine action items for service teams.

Problem: how to fix a single vulnerability on a given service

If there is a vulnerable software package on a service’s container, how can we fix it?

If the service installed a package itself on the image, then it’s straightforward to update the package by redeploying the service (which can install updated OS packages automatically in most cases) or editing the service’s Dockerfile and redeploying.

However, service images at Lyft are built from a global base image, so if a vulnerable package was introduced by a service’s base image, then the service must first wait for the base image owner to publish an updated revision and then consume the updated packages by rebuilding and redeploying.

Put another way, there is nothing that a service owner can do to fix a vulnerable package that was introduced in one of its parent images until the parent updates itself. Conversely, even if a vulnerable parent image updates itself, all of its children will still remain vulnerable until they consume the latest revision of that parent image. An earlier Lyft Security blog post gives a good overview of how our images are structured and how we’ve decided to generate pull requests to child images when parent images change.

If we run a scan tool on a service image, it will tell us what software packages were installed on the image and what vulnerabilities applied to each package. However, this is not immediately useful because we don’t know if a given vulnerable package was introduced by the service itself or a parent image.

Diagram: a service image is composed of multiple layers; which image layer introduced each CVE?

In the example diagram above, we needed a way to differentiate between issues introduced by the users service itself versus those introduced by one of its parent images. This is simple with our graph model: we scan all images that are part of a service’s parent-child lineage for vulnerabilities and find the greatest ancestor of each affected package.

Some sophisticated scanner tools are able to deconstruct and report on image layers, but we found their output hard to read and could not correlate their results back to our own parent-child image tree, where each image node may be owned by a different team. It was simpler for us to essentially diff scan results to calculate the origin of an issue.

Diagram: using a graph approach to calculate the origin of each CVE
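
Here is a minimal sketch of that diffing logic, assuming we already have per-image scan results and the parent lineage of a service image. The data structures (and the exact parent revision used by users@0x0C) are illustrative, not our production code.

```
# image -> set of (package, CVE) findings, as produced by our scanner.
# The data loosely mirrors the toy example above: CVE-123 on package XYZ first
# appears in bionicbase, while CVE-456 is introduced by the users service itself.
def origin_of_findings(lineage: list, scans: dict) -> dict:
    """Attribute each (package, CVE) finding on the child image to the greatest
    ancestor that also has it. `lineage` is ordered child -> parent."""
    origins = {}
    for finding in scans[lineage[0]]:
        # Walk from the greatest ancestor down; the first image containing the
        # finding is where it was introduced.
        for image in reversed(lineage):
            if finding in scans.get(image, set()):
                origins[finding] = image
                break
    return origins

scans = {
    "users@0x0C": {("xyz", "CVE-123"), ("leftpadder", "CVE-456")},
    "s2ipython3@0x12": {("xyz", "CVE-123")},
    "bionicbase@0xEE": {("xyz", "CVE-123")},
}
print(origin_of_findings(["users@0x0C", "s2ipython3@0x12", "bionicbase@0xEE"], scans))
# CVE-123 is attributed to bionicbase@0xEE; CVE-456 to users@0x0C itself.
```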

Scenario 1: service fixes the problem themselves

In the above example, if the users service introduced a vulnerable package, the owning team (Identity, in this case) must remove or update it to a fixed version and deploy the change.

Scenario 2: cascading fixes down from parent images to children

Things are more interesting when we need to deal with parent images. If users inherited a vulnerable package through one of its parent images, then the parent must fix the issue by either removing the package or updating to a fixed version and publishing a new revision. Finally, the Identity team must update the users service to consume the new revision.

Explained concretely, in the diagram above, CVE-123 affects users@0x0C via its dependency bionicbase@0xEE. To resolve CVE-123, we need to know if the latest production version of bionicbase is still affected by CVE-123. If not, then that’s great; bionicbase has fixed the problem! The fixes must cascade down: s2ipython3 now needs to update its production version to consume the latest version of bionicbase, and finally after that, users must consume the fixed s2ipython3 and redeploy.
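
Here is a rough sketch of that cascade check over the same toy data. The lookup tables and revision numbers are illustrative; the real system derives them from the graph.

```
def cascade_plan(cve: str, origin_chain: list, latest_prod: dict, scans: dict) -> list:
    """Given that `cve` originates at the last image in `origin_chain` (ordered
    child -> parent), describe what has to happen for the child to be fixed."""
    origin_project = origin_chain[-1].split("@")[0]
    latest = latest_prod[origin_project]
    if cve in scans.get(latest, set()):
        return [f"blocked: {origin_project} must first publish a revision without {cve}"]
    steps = [f"{origin_project} already fixed {cve} in {latest}"]
    # Cascade the fix back down the chain: each child rebuilds on the latest
    # revision of its parent, then its own child does the same.
    for i in range(len(origin_chain) - 2, -1, -1):
        child = origin_chain[i].split("@")[0]
        parent = origin_chain[i + 1].split("@")[0]
        steps.append(f"{child}: rebuild and redeploy on the latest {parent}")
    return steps

scans = {"bionicbase@0xEE": {"CVE-123"}, "bionicbase@0x4F": set()}
latest_prod = {"bionicbase": "bionicbase@0x4F"}
print(cascade_plan("CVE-123", ["users@0x0C", "s2ipython3@0x12", "bionicbase@0xEE"],
                   latest_prod, scans))
# -> bionicbase already fixed CVE-123; s2ipython3 and then users rebuild on their parents.
```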

A better scanner

Our initial scanning solution used AWS ECR’s built-in basic scanner, which is powered by an open source project called Clair. The biggest downside we experienced with the basic scanner was that its results do not include fixed versions of software packages. Obviously we can’t ask service teams to fix vulnerabilities that are not actually fixable — this makes for a bad developer experience, and they would learn to ignore our tickets and potentially take our guidance less seriously in the future.

As a temporary solution, we wrote a one-off script to take our scan results and correlate them against the Ubuntu CVE tracker. Maintaining our own logic to find fixed versions adds complexity, so we were eager to get rid of this and find a better solution.
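
That interim script boiled down to a join between the basic scanner’s findings and a fixed-version table derived from the tracker data. A simplified sketch, with made-up data and the tracker parsing omitted:

```
# (package, ubuntu release) -> {CVE id: fixed version}, built offline from the
# Ubuntu CVE tracker. The entries below are made up for illustration.
FIXED_VERSIONS = {
    ("openssl", "bionic"): {"CVE-123": "1.1.1-1ubuntu2.1"},
}

def annotate_with_fixes(findings: list, release: str = "bionic") -> list:
    """Attach a fixed version (if known) to each (package, CVE) finding so we only
    file tickets for issues a team can actually act on."""
    annotated = []
    for package, cve in findings:
        fixed = FIXED_VERSIONS.get((package, release), {}).get(cve)
        annotated.append({"package": package, "cve": cve,
                          "fixed_version": fixed, "actionable": fixed is not None})
    return annotated

print(annotate_with_fixes([("openssl", "CVE-123"), ("bash", "CVE-456")]))
```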

We were thrilled to find Trivy, an open source tool that scans containers, includes fixed versions, and maintains a data feed. This was huge because now we knew which findings were actually actionable without reinventing our own correlation logic. Trivy is also configurable with Open Policy Agent rules, which let us filter out irrelevant or low-signal results. Big thanks to the Trivy maintainers!
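
To give a feel for what that buys us, here is a minimal sketch of shelling out to Trivy and keeping only findings with a known fixed version. The flags and JSON fields shown (--format json, --severity, FixedVersion) exist in Trivy, but treat the exact invocation and report shape as assumptions about your Trivy version rather than a description of our production setup.

```
import json
import subprocess

def actionable_findings(image_ref: str) -> list:
    """Run Trivy against an image and return only findings that have a fixed version."""
    proc = subprocess.run(
        ["trivy", "image", "--format", "json", "--severity", "HIGH,CRITICAL", image_ref],
        check=True, capture_output=True, text=True,
    )
    report = json.loads(proc.stdout)
    findings = []
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("FixedVersion"):  # skip issues nobody can act on yet
                findings.append({
                    "cve": vuln["VulnerabilityID"],
                    "package": vuln["PkgName"],
                    "installed": vuln["InstalledVersion"],
                    "fixed": vuln["FixedVersion"],
                    "severity": vuln["Severity"],
                })
    return findings

# Example (hypothetical image reference):
# print(actionable_findings("123456789012.dkr.ecr.us-east-1.amazonaws.com/users:0x0C"))
```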

Problem: expanding to scan 1000+ services

Next, we needed an efficient way to scan all of our services. The naive approach of scanning all images across all ECR repos is too slow. Every time a developer pushes code to a GitHub repo, CI builds a container image even if it is never deployed to production. Our ECR repos house over 400,000 images at any given time (we keep images around so it’s fast to revert a bad change that makes it to production). If it takes an average of 20 seconds to scan a single image, it will take over 85 days to scan all of them one by one. Even if we were willing to spend the money to parallelize this with more compute resources, the naive approach is wasteful because less than 1% of images would ever be deployed.

Keeping images up to date involves cascading software fixes down dependency trees, and we were able to save significant time and resources by identifying the parts of a tree that were deployed to production and therefore presented actual risk to us.

Essentially,

  1. We scan service image revisions that are deployed to production and staging because they actually have bits that execute and therefore present risk.
  2. We scan the latest versions of parent images so that we can determine if a parent image update will fix issues discovered by (1).
  3. This approach allows us to scan as few images as possible while still being able to determine how fixes can be cascaded down from parents to children.

A concrete example

Suppose we have a service called users and it is built from a parent image called s2ipython3. In turn, s2ipython3 is built from an image called bionicbase.

Diagram: the users service and its parent images. Each revision of each image can depend on a parent project at a different revision.

In this diagram, note that:

  • The users service has multiple images but only revision 0x0C is deployed to production.
  • Each users image depends on different revisions of parent project s2ipython3: 0x11 and 0x12.
  • The production version of s2ipython3 is 0xF3, but it has no children because no child projects have updated to include it as a dependency yet.

After we build the dependency tree, we then identify which images to scan by taking the union of the nodes visited by two different graph traversals described below.

Traversal (1) — Production images from bottom to top

To find out what vulnerabilities apply to which layer, we scan all service images deployed to production and traverse up to all parent images and greatest ancestors. The greatest ancestor in the below example is bionicbase@0xEE.

Diagram: traversal 1 visualized, going from the users service’s production image up through its parents

Traversal (2) — Scan production images

In traversal 2, we scan the production versions of all images. The black stars denote our starting points from each project and the dark gray circles (e.g. users@0x0C) denote images identified as scan targets.

Diagram: traversal 2 visualized, scanning the latest production version of each project to determine whether updating to the latest parent image would actually fix any vulnerabilities

In the above diagrams, note that users@0x0C and s2ipython3@0xF3 are covered by both traversals (1) and (2).

Here’s the important part: traversal (2) uniquely includes parent image bionicbase@0x4F. We need to scan bionicbase@0x4F even though it isn’t currently used by any child images because it is the latest version of bionicbase. If CVE-123 affects software package XYZ version 2.1 on bionicbase@0xEE, then we need to know if bionicbase@0x4F (the latest version) fixes that issue. If so, that’s great; we should now cascade this change down the tree by instructing child s2ipython3 to consume the latest fix, and then instructing the users service to consume the latest s2ipython3.
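
Putting the two traversals together, here is a rough sketch of the target-selection logic over a simple in-memory tree. The data structures and the exact parent revision used by users@0x0C are illustrative; in reality we run equivalent traversals as graph queries against cartography.

```
def ancestors(image: str, parent_of: dict) -> set:
    """Traversal (1): walk from a deployed service image up through every parent image."""
    seen = set()
    while image in parent_of:
        image = parent_of[image]
        seen.add(image)
    return seen

def scan_targets(deployed_service_images: set, parent_of: dict, latest_prod: dict) -> set:
    targets = set(deployed_service_images)
    for image in deployed_service_images:
        targets |= ancestors(image, parent_of)   # traversal (1): deployed images and their ancestors
    targets |= set(latest_prod.values())         # traversal (2): latest production image of each project
    return targets

# Toy tree loosely based on the diagrams above (parent revisions are illustrative).
parent_of = {"users@0x0C": "s2ipython3@0x12", "s2ipython3@0x12": "bionicbase@0xEE"}
latest_prod = {"users": "users@0x0C", "s2ipython3": "s2ipython3@0xF3", "bionicbase": "bionicbase@0x4F"}
print(scan_targets({"users@0x0C"}, parent_of, latest_prod))
# Five images to scan instead of every image ever built.
```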

Our data model and scan process summarized

The final result is that we have an intelligent way to go from naively scanning all images one by one:

Diagram: naively scanning 400,000 images one by one

To being selective:

Diagram: our graph-traversal approach lets us scan just 4,000 images instead of 400,000

while still making necessary calculations.

You might be wondering why we don’t run this set of scanner tools as part of our CI to block deployments from running vulnerable bits in the first place. The quick answer is that we do use a separate set of tools to perform security checks in CI, but we will always need the ability to scan images that we’ve identified as running in production. CVE definitions change daily, so an initial CI check will only ever be one piece of our security posture.

Conclusion

One major challenge with vulnerability management on internally managed container images was identifying where a given issue was introduced so that we could determine the appropriate owner and the fix. In many cases, a service owner must wait for upstream dependencies to update first so that our automated systems can cascade those changes down the image tree. Designing and building this data model and set of algorithms was difficult but instrumental for everything that would come later.

We’re not done telling this story yet, but this is a decent stopping point to share some learnings and takeaways:

  • We have many homegrown components — such as our own notion of a service-to-base image dependency tree — that made it difficult to integrate with existing vulnerability detection/mitigation tools. In many cases, the effort to integrate was greater than the effort to build out the logic ourselves.
  • We spent a lot of time on data quality. We started off with the default AWS ECR scanner, realized that having fixed-version data was as important as the scanning itself, and pivoted to Trivy to get it. As great as this is, it would help us further cut down on noise if Trivy could tell us whether a vulnerable codepath would never be called or executed by our services. There are a lot of different projects in the supply chain security space working on this though, and we’ll be watching developments here closely.

Our two-year vulnerability management journey has not been as smooth as it could have been. Our largest technical hurdles involved non-actionable vulnerabilities, complex business logic, and randomizing teams when we made mistakes. This last point is the most important: if we lost the trust of fellow engineering teams, they would not take future recommendations seriously.

We’ve also put more focus on taking care of our teammates by making sure we can continue to run this program in a sustainable way. We’ve benefited from building controls, such as alarms to know about data quality issues, feature flags to handle controlled releases, and circuit breakers to prevent our state machine from making obviously incorrect decisions.

In our next blog post, we’ll go into detail about the automated machinery we built around our data model to keep parent and child container images up to date, and we’ll tell some war stories about how early iterations regrettably ended up randomizing all of our engineering teams.

Have you worked on container vulnerability management at your company? I’d love to swap notes — leave a comment below or reach me at @alexchantavy on Twitter.

If you’re interested in working on Security problems like these then take a look at our careers page.
