Building Real-time Machine Learning Foundations at Lyft

Konstantin Gizdarski
Lyft Engineering
Jun 28, 2023

Written by Konstantin Gizdarski and Martin Liu at Lyft.

In early 2022, Lyft already had a comprehensive Machine Learning Platform called LyftLearn composed of model serving, training, CI/CD, feature serving, and model monitoring systems.

On the real-time front, LyftLearn supported real-time inference and input feature validation. However, streaming data was not supported as a first-class citizen across many of the platform’s systems — such as training, complex monitoring, and others.

While several teams were using streaming data in their Machine Learning (ML) workflows, doing so was a laborious process, sometimes requiring weeks or months of engineering effort. On the flip side, there was substantial appetite from developers at Lyft to build real-time ML systems.

Lyft is a real-time marketplace and many teams benefit from enhancing their machine learning models with real-time signals.

To meet the needs of our customers, we kicked off the Real-time Machine Learning with Streaming initiative. Our goal was to develop foundations that would enable the hundreds of ML developers at Lyft to efficiently develop new models and enhance existing models with streaming data.

In this blog post, we will discuss some of what we built in support of that goal and the lessons we learned along the way.

Capabilities of Real-time Machine Learning

One of the first questions we asked ourselves was: what are the general use cases within the ML ecosystem that can leverage streaming data?

We identified three main capabilities of real-time ML applications which could leverage streaming:

  1. Real-time Features → Computing features with real-time streaming data
  2. Real-time Learning → Training models with real-time streaming data
  3. Event Driven Decisions → Making decisions, e.g. retraining a model, triggering an alert, or running an inference call, with real-time streaming data

These three capabilities were exhaustive based on our exploration and similar to those described in the Flink use-cases documentation.

The Event Driven Decisions capability in particular turned out to be general enough to be applicable to a wide range of use cases.

Shortly after we built it, another pod within our team used it to build a Real-time Anomaly Detection product. At the time of writing, a Mapping team is working to use the Event Driven Decisions product to rebuild Lyft’s Traffic infrastructure by aggregating data per geohash and applying a model.
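To make this concrete, here is a rough sketch of how a per-geohash aggregation feeding a model could be expressed with the RealtimeMLPipeline interface introduced later in this post. The table, columns, and ModelInvocationSink below are hypothetical and for illustration only, not the Mapping team’s actual pipeline.

traffic_sql = """
SELECT geohash, window_start AS rowtime, AVG(speed_mph) AS avg_speed
FROM TABLE(
    TUMBLE(TABLE traffic_events, DESCRIPTOR(rowtime), INTERVAL '5' MINUTES)
)
GROUP BY geohash, window_start, window_end
"""

pipe = RealtimeMLPipeline()
pipe\
    .query(traffic_sql)\
    .add_sink(ModelInvocationSink('traffic_model'))  # hypothetical sink that applies a model to each aggregate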

It was great to see our newly built abstractions used right away to build new, higher-order abstractions, a testament to their utility.

Real-time Capabilities Across the ML Lifecycle

On the LyftLearn team, a core focus has been to promote a rigorous machine learning lifecycle, codified in the diagram below.

In support of that focus, for each of the real-time ML capabilities, we endeavored to provide support across all stages of a model’s lifecycle — from prototyping to development to production and through maintenance.

The steps of the ML Model Development Lifecycle at Lyft.

In our prior work in building LyftLearn Serving, a key design tenet was creating a uniform training and serving environment for models — in the form of a shared Docker image. This ensured any subtle bugs stemming from differences in runtimes would be stamped out.

With streaming, our goal was to do the same — provide uniformity across environments to ensure reproducibility and development velocity.

Developers had to be able to develop a real-time pipeline in a notebook, test it on realistic streaming data (in that same notebook) to ensure their code was functionally and logically correct, commit the code to a GitHub repository, and deploy the pipeline to production, all with as little friction as possible.

The code for a real-time ML application should be written once and run everywhere.

Designing a Common Interface: RealtimeMLPipeline

Recall that our aim was to simplify integrating streaming into ML models for developers by abstracting away technical complexity.

To facilitate this, we created a common interface called RealtimeMLPipeline for defining all real-time ML applications. The interface was designed such that a minimal amount of metadata was needed to construct a pipeline object which performs a given capability.

Example RealtimeMLPipeline: driver_accept_proportion_10m

To better understand the RealtimeMLPipeline interface, we can take a closer look at a Real-time Features use case as an example.

To define a RealtimeMLPipeline which encodes a real-time feature, a developer provides metadata such as a feature name and version, along with a query to compute the feature, and instantiates a RealtimeMLPipeline Python object. That Python object is portable across all environments that we support: local test environment, notebook, staging, and production.

The following code block is an example which defines real-time feature computation using the RealtimeMLPipeline interface:

feature_sql = """
SELECT driver_id AS entity_id, window_start AS rowtime, count_accepted / count_total AS feature_value
FROM (
    SELECT driver_id,
           window_start,
           CAST(sum(case when status = 'accepted' then 1.0 else 0.0 end) AS DOUBLE) AS count_accepted,
           CAST(count(*) AS DOUBLE) AS count_total
    FROM TABLE(
        TUMBLE(TABLE driver_notification_result, DESCRIPTOR(rowtime), INTERVAL '10' MINUTES)
    )
    GROUP BY driver_id, window_start, window_end
)
"""

# Sink that delivers computed feature values to the feature store.
feature_sink = DsFeaturesSink()
# Feature metadata: name, feature group, entity type, and value type.
feature_definition = FeatureDefinition('driver_accept_proportion_10m', 'some_feature_group', Entity.DRIVER, 'float')

pipe = RealtimeMLPipeline()
pipe\
    .query(feature_sql)\
    .register_feature(feature_definition)\
    .add_sink(feature_sink)

In this example, a feature called driver_accept_proportion_10m is defined, which represents the proportion of notifications a driver accepts over each ten-minute tumbling window.

The RealtimeMLPipeline object constructed can be executed in a Jupyter notebook, in a local test environment, and against staging and production Flink clusters. In a notebook or local test environment, the pipeline runs against an ad hoc Flink cluster and the output is written to the local file system for validation.

In staging and production, the pipeline runs against a multi-tenant, production-grade and production-scale Flink cluster and outputs computed features to Kafka, which in turn delivers them to our Feature Storage infrastructure.

Note that in the code block above we defined a RealtimeMLPipeline as pipe. To run it in any of these environments, we simply need to execute the following code:

pipe.run()

You can imagine that pipe object being serialized and loaded in different environments.
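As a rough illustration of that portability (the serialization details here are simplified; pickle stands in for whatever mechanism is actually used), the same definition written in a notebook could be persisted and executed elsewhere:

import pickle

# In the notebook: persist the pipeline definition.
with open('driver_accept_proportion_10m.pkl', 'wb') as f:
    pickle.dump(pipe, f)

# In another environment (e.g. a staging or production job): load and run the same definition.
with open('driver_accept_proportion_10m.pkl', 'rb') as f:
    pipe = pickle.load(f)
pipe.run()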

RealtimeMLPipeline Across Development and Production Environments

Architecture diagram for the prototyping and production environments, illustrating the Analytics Event Abstraction that enables developing streaming applications in a notebook as well as deploying them to staging and production.

In the diagram above, we illustrate how RealtimeMLPipeline operates uniformly across both development and production environments.

One key component is the Analytics Event Abstraction layer. It abstracts away the data sources used in a RealtimeMLPipeline, so the pipeline reads warehoused data from S3 when run in a non-production environment and reads from a real-time data stream (Kinesis at the time of writing) in production.
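The core idea can be sketched in a few lines of Python. The function name and connector options below are illustrative assumptions, not Lyft’s actual implementation; they simply show how one logical event table can resolve to different physical sources depending on the environment.

def source_options(event_name: str, environment: str) -> dict:
    """Flink connector options for a logical event table: a Kinesis stream in
    production, the warehoused copy of the same events in S3 everywhere else
    (illustrative sketch only)."""
    if environment == 'production':
        return {
            'connector': 'kinesis',
            'stream': event_name,
        }
    return {
        'connector': 'filesystem',
        'path': f's3://analytics-events/{event_name}/',
        'format': 'parquet',
    }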

Another key component is the ad-hoc Flink cluster that we spawn alongside the Jupyter notebook in our prototyping environment; via PyFlink, it runs the same version of Flink as our production cluster.
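For context, creating a local PyFlink execution environment for this kind of ad-hoc use looks roughly like the following. This is a generic PyFlink sketch rather than the exact setup of our prototyping environment, where the notebook talks to a separate ad-hoc Flink cluster.

from pyflink.table import EnvironmentSettings, TableEnvironment

# A streaming TableEnvironment, pinned to the same Flink version we run in production.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Pipelines can then register sources and sinks and execute their SQL against it, e.g.:
# t_env.execute_sql(feature_sql)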

With this design, we achieved uniform behavior of a complex distributed system across two different operational environments: (1) a Kubernetes-based hosted Jupyter environment that consumes event data from an S3 FileSystem connector and (2) a Kubernetes-based production environment that consumes data from Kafka and Kinesis.

This uniform behavior across the ML lifecycle enables significantly faster iteration on real-time applications, reducing the time to build a RealtimeMLPipeline from many weeks to days.

Developers could quickly iterate on their applications in a notebook and then deploy the same code in production.

Applying Real-time ML with Streaming to Lyft’s Business

Before starting the project, we had identified several new real-time ML use cases and performed simulation and offline analysis, which demonstrated that incorporating real-time data into our ML models could enhance performance metrics.

The two Alpha use cases we started out with were (1) rapidly retraining a secondary model to correct bias in our ETA model and (2) computing certain Safety Features (much like the driver feature example) per driver.

When our real-time capabilities had come together and our Alpha use cases were shipped, we started to evangelize the capabilities through internal talks, our team’s newsletter, and meetings with other teams. The value, both in time saved building new real-time ML use cases and in reduced long-term operational burden, was clear to most. The Real-time ML project garnered a pipeline of customers from almost all engineering pillars at the company, including Rider, Driver, Marketplace, Mapping, and Safety.

Over time, we settled on an engagement model. In the early days, someone from our team would work with engineers from a partner team to scope out, design, and build a real-time ML app. We’d iterate together on adding observability, setting up an experiment, and analyzing the results. We called this the Open Beta phase of the project.

As we refined our onboarding process and improved our documentation, our involvement decreased, and we moved towards a self-service model where partner teams would primarily follow how-to guides and tutorials, and only ask for occasional support when they needed to make tricky decisions or work through kinks.

With our first few use cases, we would spend multiple weeks to get an application up and running. Today, it only takes a LyftLearn engineer a few days to launch a new real-time ML application.

Technical Challenges

Ensuring uniform behavior turned out to be far trickier for streaming applications than it was in LyftLearn Serving. The reason: streaming is almost always stateful and almost always distributed, two properties that dramatically increase the complexity of software development.

Furthermore, the abstractions in streaming are not always intuitive for data scientists or software engineers — or for LyftLearn engineers for that matter — so the learning curve was steep.

While working on building the real-time ML capabilities, some of the technical challenges we faced stemmed from:

  • The steep learning curve for developers using streaming, which required us to package our interfaces to be as straightforward as possible while still being flexible.
  • The ephemeral nature of streaming data which made back-testing challenging.
  • The lack of streaming data support in a notebook environment due to security limitations which required us to simulate streaming data with warehoused data.
  • The difficulty of debugging stateful and distributed streaming jobs compared to traditional code.
  • The opacity caused by the multiple processes comprising a single application. For example, when running a Flink job in a notebook, we were dealing with the JVM, a Python kernel, a notebook pod, a job manager pod, task manager pods, etc.

It took some patience, but over time, our team gained expertise in both operating and extending Flink.

At Lyft, we run a fork of the open source Flink project, and an important early investment was revamping our release process so we could iterate on and release that fork faster. This investment saved us time as we tuned the project.

In terms of concrete changes to Flink, we added FileConnector support to read warehoused data from S3, changed the order in which files are read from a FileSystemConnector, forked apache-flink-libraries to inject the necessary Java extensions to support S3 and Hive connectivity, exposed the Flink UI in our prototyping environment, instrumented additional stats to diagnose issues, and unpinned numpy dependencies in PyFlink, among other changes.

In the grand scheme of things, these were small changes, but being able to iterate on our own fork rather than having to make changes upstream, and having an efficient process for doing so, was imperative to progress.

Note: while we do not discuss the Flink stack at Lyft at length in this post, we recommend this talk by Micah Wylde for some historical context.

Key Lessons

Looking back, many of the key lessons from this project were around how to develop and scale new capabilities that are based on complex technology within an organization.

In this section, we highlight some of those lessons.

  1. Identify and validate need for capabilities → start by talking to customers to identify the set of capabilities that may be useful to build. Ensure each new capability in your platform addresses an actual problem that they have right now. If there’s no immediate demand, consider designing the platform to extend for that capability rather than immediately developing it.
  2. Find Alpha users to maintain customer focus early on → start with a few Alpha users per capability to ensure you stay customer-focused as you build. One Alpha customer is usually not enough. Aim for at least two; three is better. You will work closely with these customers to evolve your system.
  3. Design an extensible and simple interface → by striving for a flexible yet simple interface upfront, you can avoid bumping into limitations as more requirements emerge.
  4. Design an easy onboarding process → by simplifying onboarding with a prototyping environment that is quick to set up and well documented, you will enable people to tinker with and learn your system. It will also provide quick validation to you and your stakeholders that whatever you are building actually works.
  5. Drive adoption via documentation and then evangelization → after some time, your platform’s success will depend on being able to onboard customers in a scalable way. The best way to do that is to invest in documentation. A successful platform will generate more onboarding support burden than you as an individual have bandwidth for. Once you have adequate documentation, you can evangelize the capability and point interested people to the combination of the documentation and the prototyping environment.

Acknowledgements

Special thank you to Martin Liu for spearheading Real-time ML with Streaming at Lyft and for the outstanding technical depth, grit, and patience. This project would not have been possible without your initiative and leadership.

We are grateful to the engineering partners and leaders without whose contributions the project would not have been possible, including Seth Saperstein, Ravi Magham, Hakan Baba, Mihir Mathur, Xiao Zheng, and Shiraz Zaman. Thank you for your sponsorship, guidance, and contributions.

Thank you to our partners and early adopters such as the ETA and Traffic teams, Market Signals, Trust and Safety, and several other teams. Special thanks to Jacob van Gogh, Alex Contryman, Xiaoyi Duan, Quinn Liu, and Jieyi Wang for being some of the early adopters of this technology.

Finally, I want to acknowledge the rest of the LyftLearn team — Jonas Timmermann, Alex Jaffe, Adriana Deneault, Andy Rosales, Anindya Saha, Eric Yaklin, and Rajeev Prabhakar — for being the best of teammates.

About Lyft

Lyft is a leading rideshare company with no shortage of real-time, machine learning, and real-time machine learning problems to solve. If you are interested in the problems described here, reach out and come work with us.

Visit Lyft Careers to see our openings.

The Meanders of the Arda River, a different type of stream. Photo by Konstantin Gizdarski, 2021.
