Managing inventory levels is one of the biggest challenges for any convenience and grocery retailer on DoorDash. Maintaining accurate inventory levels in a timely manner becomes especially challenging when many constantly moving variables may be changing the on-hand inventory count. Situations that may affect inventory levels include, but are not limited to:

  • Items expiring
  • Items being removed due to damage
  • Vendors sending items different from what was ordered

After an inventory update is made to the database, failure to reflect accurate inventory levels across the rest of the system in real time can result in underbuying and overselling, both of which lead to negative customer experiences and hurt the business. Overselling happens when customers order items that are listed as in stock on the platform but are actually out of stock; the merchant may then be forced to refund or substitute the item, resulting in a subpar customer experience. Underbuying happens when replenished inventory has not been updated on the platform, which gives customers less selection even though the items are available and costs the business potential sales.

DashMart, DoorDash’s first-party convenience and grocery offering, is no exception to this challenge. While building out DashMart’s internal inventory management system to help DashMart associates manage inventory, the DashMart engineering team came to realize that since the inventory tables were so core and foundational to different operational use cases in a DashMart, some actions or code must be triggered every time the inventory level changes. Achieving this task in a clean, fault-tolerant way is non-trivial due to all the complex ways inventory levels can change.

The solution to real-time processing of inventory changes

The simplest approach to propagating inventory level changes in the database to the rest of the system may have been to invoke the service code to take actions every time something that affects the inventory table is called. However, this approach is difficult to maintain and error-prone, as there are many different code paths that affect inventory levels. Additionally, it couples the action of changing the inventory with the reaction to inventory changes.

Instead, since the inventory levels are stored in specific CockroachDB tables, we decided to leverage CockroachDB’s changefeed to send data changes to Kafka, which then starts Cadence workflows to accomplish whatever task needs to be done. This approach is fault-tolerant, near real-time, horizontally scalable, and more maintainable for engineers because there is a clear separation of concerns and layers of abstraction.

More specifically, the high-level solution utilizing changefeeds is as follows (Figure 1):

  • Create separate Kafka topics to receive the data changes from the inventory tables
  • Configure changefeeds on the inventory tables to publish to the Kafka topics from the previous step (see the configuration sketch below)
  • Consume the Kafka messages and start different Cadence workflows based on the data changes
Figure 1: High-Level Architecture of Consuming CockroachDB Updates for Different Use Cases

As illustrated in the diagram above, multiple tables can be configured with changefeeds to send messages to Kafka. We currently have two inventory tables with slightly different business needs. We have set up one Kafka consumer per table (more details on how the consumer is set up below). Consumers can choose which Cadence workflow they want to start. Note that the consumers do not have to start any Cadence workflows: they can choose to ignore the Kafka message, or do something else completely (e.g. interact with the database). For our use cases, we wrapped everything in Cadence to take advantage of Cadence’s fault-tolerance, logging, and retry capabilities.
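
As a rough illustration of the changefeed configuration step, enabling a changefeed on one inventory table might look like the sketch below. The connection string, table name, and Kafka broker address are placeholders rather than DashMart’s actual configuration, and the exact changefeed options depend on the cluster setup.

```kotlin
import java.sql.DriverManager

// Hypothetical sketch: enable a CockroachDB changefeed on an inventory table so that
// row-level changes are published to Kafka. By default the Kafka topic is named after
// the table; all names and credentials below are placeholders.
fun createInventoryChangefeed() {
    DriverManager.getConnection(
        "jdbc:postgresql://cockroachdb-host:26257/dashmart?sslmode=require",
        "app_user",
        "app_password"
    ).use { conn ->
        conn.createStatement().use { stmt ->
            // CREATE CHANGEFEED starts a long-running job that emits a JSON payload
            // to the sink for every insert, update, and delete on the table.
            stmt.execute(
                """
                CREATE CHANGEFEED FOR TABLE inventory
                INTO 'kafka://kafka-broker:9092'
                WITH updated, resolved
                """.trimIndent()
            )
        }
    }
}
```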

We wrote a general Kafka processor and abstract stream processing Cadence workflow to process the inventory updates from the CockroachDB changefeed. The general Kafka processor provides a simple three-method interface for client code to implement subprocessors that can kick off different Cadence workflows. The framework also handles errors, logging, and duplicate or stale updates while leaving the behavior configurable through the subprocessors and concrete Cadence workflow implementations.
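
The actual interface is internal to DoorDash, but a minimal sketch of what a subprocessor contract and the fan-out processor might look like is below; the names, method signatures, and payload shape are illustrative assumptions.

```kotlin
import com.fasterxml.jackson.databind.JsonNode

// Hypothetical three-method subprocessor interface: decide whether a changefeed
// message is relevant, translate it into a typed update, and react to it
// (e.g., by starting or signaling a Cadence workflow).
interface InventoryUpdateSubProcessor {
    fun shouldProcess(payload: JsonNode): Boolean
    fun parse(payload: JsonNode): InventoryUpdate
    fun process(update: InventoryUpdate)
}

// Illustrative typed representation of a changefeed row.
data class InventoryUpdate(
    val itemId: String,
    val facilityId: String,
    val quantityOnHand: Int,
    val updatedAt: Long
)

// The general Kafka processor fans each message out to its registered subprocessors;
// error handling, logging, and stale/duplicate filtering are omitted for brevity.
class InventoryUpdateProcessor(private val subProcessors: List<InventoryUpdateSubProcessor>) {
    fun onMessage(payload: JsonNode) {
        subProcessors
            .filter { it.shouldProcess(payload) }
            .forEach { it.process(it.parse(payload)) }
    }
}
```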

The abstract stream processing Cadence workflow is implemented as multiple long-running workflows that process messages through an input queue. The queue is populated through Cadence signals using the SignalWithStart API. We have also implemented functionality to easily batch process messages from the queue if the client implementation desires. Once the long-running Cadence workflow reaches the client-specified duration, it either completes or starts a new workflow, depending on whether there are more messages that still need to be processed.
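
A rough sketch of this long-running workflow pattern, using the Cadence Java client from Kotlin, is shown below. The workflow names, queue type, and run duration are illustrative, and the real implementation also supports batching and richer error handling; `InventoryUpdate` is the illustrative data class from the previous sketch.

```kotlin
import com.uber.cadence.workflow.SignalMethod
import com.uber.cadence.workflow.Workflow
import com.uber.cadence.workflow.WorkflowMethod
import java.time.Duration

// Hypothetical long-running stream-processing workflow. Messages arrive as signals
// (sent via SignalWithStart from the Kafka consumer) and are drained from an
// in-workflow queue; after the run duration elapses, the workflow either completes
// or hands any remaining messages to a fresh run.
interface InventoryStreamWorkflow {
    @WorkflowMethod
    fun run()

    @SignalMethod
    fun enqueue(update: InventoryUpdate)
}

class InventoryStreamWorkflowImpl : InventoryStreamWorkflow {
    private val queue = ArrayDeque<InventoryUpdate>()

    override fun run() {
        val runDuration = Duration.ofMinutes(30) // client-specified duration (placeholder)
        val deadline = Workflow.currentTimeMillis() + runDuration.toMillis()
        while (Workflow.currentTimeMillis() < deadline) {
            // Block until a message arrives or the remaining run time elapses.
            Workflow.await(Duration.ofMillis(deadline - Workflow.currentTimeMillis())) {
                queue.isNotEmpty()
            }
            while (queue.isNotEmpty()) {
                val update = queue.removeFirst()
                // In practice the work is delegated to Cadence activities (omitted here).
            }
        }
        // Messages signaled after the deadline are carried over to a new workflow run.
        if (queue.isNotEmpty()) {
            Workflow.continueAsNew()
        }
    }

    override fun enqueue(update: InventoryUpdate) {
        queue.add(update)
    }
}
```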

Figure 2: Single Kafka Consumer for Starting Different Cadence Jobs for Different Use Cases

We also considered an alternative design in which each Kafka topic has multiple consumers, each handling a different task, instead of one consumer with many different subprocessors. However, DoorDash’s internal server framework for Kafka consumers only allows one consumer per Kafka topic. This limitation provided a strong incentive to use one consumer with multiple subprocessors and avoid writing custom Kafka consumer logic.

Figure 3: Alternate Design for One Kafka Consumer per Use Case

Building for requirement extensibility and code maintainability

As mentioned, today DashMart writes to two separate tables for inventory levels for different business needs. Initially, there was a business requirement where we only wanted one table to kick off a certain Cadence workflow, and did not want the other table to kick off that Cadence workflow. Business requirements changed later, and we decided that we wanted the other table to kick off the Cadence workflow as well. The layers of abstraction in the framework made it very easy to add that new functionality: simply add the existing Kafka subprocessor to the existing processor. Enabling the functionality was as simple as a one-line code change.
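
In terms of the sketch above, that change amounts to registering the already-written subprocessor with the second table’s processor; the names here are, again, hypothetical:

```kotlin
// Hypothetical: enabling the shared Cadence workflow for the second inventory table
// is just adding the existing subprocessor when that table's processor is built.
fun buildSecondTableProcessor(
    existingSubProcessors: List<InventoryUpdateSubProcessor>,
    sharedWorkflowSubProcessor: InventoryUpdateSubProcessor
): InventoryUpdateProcessor =
    InventoryUpdateProcessor(existingSubProcessors + sharedWorkflowSubProcessor)
```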

If new functionality needs to be added, a new Cadence workflow and subprocessor need to be written, and the subprocessor is then added to the existing processor, preserving a clear separation of concerns. Again, engineers adding new functionality do not need to worry about duplicate Kafka messages, logging, retries, and so on, since that is all handled by the framework code. This setup lets engineers focus on the business logic and worry less about resiliency, failure modes, logging, alerting, and recovery.

Inventory table schema evolution was also considered in this design. The CockroachDB changefeed exports JSON, so any schema changes are reflected in the JSON payload. As long as the data deserialization is written in a backwards-compatible way (e.g., do not fail deserialization on unknown properties, and make columns that are to be deleted nullable), schema evolution can happen seamlessly and without any breaking deployments.
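
With Jackson, for example, that kind of tolerant deserialization might look like the sketch below; the row shape and field names are placeholders:

```kotlin
import com.fasterxml.jackson.annotation.JsonIgnoreProperties
import com.fasterxml.jackson.module.kotlin.jacksonObjectMapper
import com.fasterxml.jackson.module.kotlin.readValue

// Hypothetical payload for a changefeed row; real column names will differ.
// Unknown properties (newly added columns) are ignored, and columns slated for
// removal are nullable so their absence does not break deserialization.
@JsonIgnoreProperties(ignoreUnknown = true)
data class InventoryRow(
    val itemId: String,
    val quantityOnHand: Int,
    val legacyBinLocation: String? = null // nullable: this column may be dropped later
)

private val mapper = jacksonObjectMapper()

fun parseRow(json: String): InventoryRow = mapper.readValue(json)
```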

Ensuring durability and recovery with Cadence

We use Cadence to handle retries, and it is easy to recover from failed Cadence workflows or even failed Kafka event consumption. We recently experienced some failed Cadence workflows due to a connection leak from other unrelated features. Thanks to the way everything was abstracted, we simply updated the “last updated” column for the affected rows in the inventory tables, which automatically sent updates to Kafka and started new workflows in place of the failed ones.
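
Because the changefeed picks up any row change, that replay can be as simple as touching the affected rows, as in the sketch below; the table and column names are placeholders.

```kotlin
import java.sql.Connection

// Hypothetical sketch: bumping the last-updated timestamp on the affected rows
// re-emits them through the changefeed, which re-publishes them to Kafka and
// kicks off fresh Cadence workflows in place of the failed ones.
fun replayInventoryRows(conn: Connection, itemIds: List<String>) {
    conn.prepareStatement(
        "UPDATE inventory SET last_updated = now() WHERE item_id = ANY (?)"
    ).use { stmt ->
        stmt.setArray(1, conn.createArrayOf("VARCHAR", itemIds.toTypedArray()))
        stmt.executeUpdate()
    }
}
```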

An additional layer of protection can be added with a dead-letter queue for Kafka messages that fail to be processed. The dead-letter queue would allow us to debug failed message consumption more easily and replay only the failed messages that make it to the dead-letter queue. This additional capability has not yet been implemented since we have not seen many failures, but is something on the roadmap for engineering excellence.

Utilizing Kafka pods for better scalability

We have a number of Kafka pods running the Kafka consumers that consume messages from the Kafka topics, and separate Cadence pods running the Cadence workflows. We have tried sending thousands of simultaneous database updates to the existing Kafka pods, and the resulting Cadence workflows all completed without issues. We can scale up the Kafka and Cadence pods if our system health metrics indicate that we need more resources to process a growing number of updates.

Conclusion

With CockroachDB’s changefeed feature, DashMart has built a scalable and durable system that can react to database updates in real time. Kafka adds an additional layer of resiliency for moving data from one system to another, and Cadence provides further robustness along with easy visibility into successes and failures through its user interface. Creating a general framework for the Kafka and Cadence portions makes the system easily extensible: adding new functionality involves only writing the core business logic, saving developers the time and effort of figuring out how to move the data around in a fast and durable way.