Real-time event processing is a critical component of a distributed system's scalability. At DoorDash, we rely on message queue systems based on Kafka to handle billions of real-time events. One of the challenges we face, however, is how to properly validate the system before going live.

Traditionally, an isolated environment such as staging is used to validate new features. But setting up a separate data traffic pipeline in a staging environment to mimic billions of real-time events is difficult and inefficient, and it requires ongoing maintenance to keep the data up to date. To address this challenge, the team at DoorDash embraced testing in production via a multi-tenant architecture, which leverages a single microservice stack for all kinds of traffic, including test traffic.

In such a multi-tenant architecture, isolation is implemented at the infrastructure layer. Here, we delve into how we set up multi-tenancy in a messaging queue system based on Kafka.

The world of multi-tenancy

DoorDash has pioneered testing in production, which utilizes the production environment for end-to-end testing. This provides a number of advantages, including reduced operational overhead. But it also brings forth interesting challenges around isolating production and test traffic flowing through the same stack. We solve this using a fully multi-tenant architecture where data and traffic are isolated at the infrastructure layer with minimal interference with the application logic.

Multi-tenancy involves designing a software application and its supporting infrastructure to serve multiple customer segments or tenants. At DoorDash, we have introduced a new tenant called doortest in the production environment. Under this tenant, the same application or service instances are shared with different user management, data, and configurations, ensuring efficient and effective testing in a production-like environment.

Data isolation in multi-tenants

In a multi-tenant environment, data isolation is crucial to ensure that tenants don’t impact each other. While we have achieved this in databases, it also needs to be extended to other infrastructure components. In Kafka, a test tenant processing production events can cause data inconsistencies, leading to outages and other incidents.

The traditional approach of using separate Kafka topics for each tenant has limitations, including poor scalability as the number of tenant environments grows and inaccurate load testing.

To overcome these challenges, DoorDash has made Kafka a tenant-aware application, which allows different tenants to share the same topic. Figure 1 below provides an overview of the Kafka workflow architecture.

Figure 1: Multi-tenant Kafka Workflow

In this workflow, messages originating from various tenant environments are tagged with distinct tenant information by an OpenTelemetry (OTEL) agent; OTEL is an open-source framework that provides tools and software to collect and process telemetry data from cloud-native applications. OTEL uses native Kafka headers to propagate context. Upon receipt by the consumer, the context filter relays messages containing the appropriate tenant information to the processor. This ensures that sandbox consumers mirror the configurations of production consumers and subscribe to the same topic.
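As an illustration, context propagation through native Kafka headers can be sketched as follows. This is a hypothetical sketch, not DoorDash's actual code: the header keys (`tenant-id`, `route-host`) and the function name are assumptions, and the real OTEL propagator uses its own keys and wire format.

```python
# Hypothetical sketch of header-based context propagation. The header
# keys ("tenant-id", "route-host") are illustrative, not the keys the
# actual OTEL propagator uses.

def inject_tenant_context(headers, tenant_id, route_host=None):
    """Return a copy of a Kafka header list with tenant context appended.

    Kafka headers are (key, bytes) pairs, so the context travels with
    the message itself rather than inside the payload.
    """
    enriched = list(headers)
    enriched.append(("tenant-id", tenant_id.encode("utf-8")))
    if route_host is not None:
        enriched.append(("route-host", route_host.encode("utf-8")))
    return enriched

# A doortest event targeting a specific sandbox deployment:
headers = inject_tenant_context([], "doortest", route_host="sandbox-abc123")
```

Because the context lives in headers rather than the payload, consumers can filter on tenancy without deserializing the message body.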

To achieve this, we made several changes to Kafka producer and consumer clients as described below.

Kafka producer with context propagation 

As explained in a previous post, OTEL provides custom context propagation, which simplifies the implementation of multi-tenancy on the Kafka producer side.

Each event sent out by the Kafka producer includes propagated tenant and route information.

Additionally, we have scenarios in which a single service requires multiple sandbox environments. To distinguish which sandbox environment an event is directed toward, we incorporate route information that maps a production service application name to a sandbox host. A unique host label is generated upon sandbox deployment. The host label varies between deployments but remains consistent among all pods within the same deployment. The host label is set as an environment variable on each pod, which provides the route information for context propagation. Both of these contexts can easily be configured through an internal UI tool.
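Reading the route context from the pod environment might look like the sketch below. `HOST_LABEL` is an assumed variable name for illustration; the actual deployment tooling may use a different one.

```python
import os

# Minimal sketch of deriving the route context from a pod's environment.
# HOST_LABEL is an assumed variable name, not necessarily the real one.

def route_context(environ=None):
    """Return the route header to propagate, or an empty dict when no
    host label is set (i.e., outside a sandbox deployment)."""
    environ = os.environ if environ is None else environ
    host_label = environ.get("HOST_LABEL")
    return {"route-host": host_label} if host_label else {}
```

All pods within one sandbox deployment share the same label, so they all emit the same route context.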

Kafka consumer as a service

At DoorDash, the Asgard framework offers a range of standard libraries that encapsulate commonly used server and client functionality. Asgard dependencies are presented as a single opaque list, providing all the boilerplate necessary for integrating widely used libraries and hiding their versions behind one Asgard version. Asgard also offers YAML configuration files for various environments, such as prod and sandbox.

Asgard lets product team engineers concentrate solely on implementing the business logic in their services. For Kafka consumers, Asgard runs as a service, exposing only YAML configuration files and an event-processing method for developers to implement.
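The consumer-as-a-service pattern can be illustrated with the sketch below. The names (`run_consumer`, the handler signature) are hypothetical, not Asgard's actual API; the point is that the framework owns the polling loop while the product team supplies only the handler.

```python
# Illustrative sketch of the consumer-as-a-service pattern: the framework
# owns polling, deserialization, and error handling; the product team
# supplies only an event handler. Names here are hypothetical.

from typing import Callable, Iterable

def run_consumer(events: Iterable[bytes],
                 handler: Callable[[bytes], None]) -> int:
    """Framework-owned loop: deliver each polled event to the handler
    and return the number of events processed."""
    count = 0
    for event in events:
        handler(event)  # business logic lives entirely in the handler
        count += 1
    return count

# A product team writes only the handler:
seen = []
run_consumer([b"order_created", b"order_delivered"], seen.append)
```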

Figure 2 below shows an overview of Asgard. Thanks to this framework, product team engineers only need to focus on the YAML configuration and Service implementation sections.

Figure 2: Asgard Framework

The Asgard framework allows us to inject multi-tenancy awareness for Kafka consumers in one place, which is then automatically applied to all the product team's services.

Consumer group isolation

Consumer groups allow Kafka consumers to work together and process events from a topic in parallel. Events sent to the same topic will be load-balanced to all consumers in the same group, meaning the first requirement is to set different consumer groups for various tenants. We offer two ways to do consumer group isolation in a sandbox environment.

The first option is manual configuration, where the user can update the YAML config file and set a different group ID for the sandbox environment.

The second option is auto-generation, which is enabled by default for Asgard Kafka consumers. When running in a sandbox environment, the Asgard Kafka consumer service automatically appends the host label as a suffix to the group ID. This ensures that different sandbox deployments have different consumer groups and that, within the same deployment, all consumer pods are part of the same consumer group. This approach ensures proper load balancing of events to all consumers within the same group while maintaining isolation between different tenant groups.

Here is an example configuration:

kafka:
    groupId: xxx_group_id
    randomTenantGroupId: true

Another important consideration is setting the auto.offset.reset property for the Kafka consumer. In the sandbox environment, we set it to latest by default. This is to prevent the inefficient polling of all existing events in the Kafka cluster whenever a new deployment occurs. Instead, the consumer starts from the latest available event.
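Taken together, deriving the effective consumer properties might look like this sketch. The suffix format (`base-hostlabel`), the function name, and the restriction of `auto.offset.reset` to sandbox environments are assumptions for illustration; only the property names are standard Kafka client configuration keys.

```python
# Hypothetical sketch of deriving consumer properties per environment.
# The suffix format is an assumption; the property names are standard
# Kafka client configuration keys.

def consumer_config(base_group_id, env, host_label=None):
    """Build Kafka consumer properties for a prod or sandbox environment."""
    config = {"group.id": base_group_id}
    if env == "sandbox" and host_label:
        # Distinct group per sandbox deployment, shared across its pods.
        config["group.id"] = f"{base_group_id}-{host_label}"
        # Skip the topic backlog whenever a fresh sandbox is deployed.
        config["auto.offset.reset"] = "latest"
    return config
```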

Tenant and route context isolation

The test tenant Kafka consumer can now subscribe to the same topic as the production tenant to receive real-time events. The next step is to filter out events not targeted to the current tenant consumers. 

To achieve this, we introduced an additional Kafka consumer config field that accepts a list of allowed tenant events. By setting this config field, the Kafka consumer verifies the tenant context information and skips non-matching events. This step ensures that sandbox consumers do not accidentally process events intended for production consumers.

After that, there is another filter based on the route information. We compare the host label retrieved from the environment variable with the one inside the route context header to determine whether the current consumer is the event's target destination. This step ensures that production and sandbox consumers do not process events destined for a different deployment. In the absence of route information, the production tenant processes the doortest events, ensuring that test traffic still gets processed when no sandbox is deployed for the service.

For example, our Advertisements Team sought to segregate production and testing events to prevent adverse impacts on our ad serving algorithms caused by production services processing test events. Consequently, they opted for the config pattern, explicitly defining allowedConsumerTenancies for both production and sandbox environments.

In production environment:

kafka:
    allowedConsumerTenancies: 
      - prod
    ...

In sandbox environment:

kafka:
    allowedConsumerTenancies: 
      - doortest
    …

Meanwhile, our Logistics Team preferred not to handle the responsibility of deploying sandboxes solely for processing all test events. They found it safe for their production services to handle both production and test events. However, they aimed to restrict sandboxes to processing specific test events following the deployment of a new release. To achieve this, they simply set enableTenantRouting to true.

kafka:
    enableTenantRouting: true
    …

Separately, our Dasher Team wanted to shadow all the production events to test a new alternative architecture. This was safe since the processing of the events did not mutate production data. To achieve this, they simply set enableTenantRouting to false.

kafka:
    enableTenantRouting: false
    …

The table in Figure 3 combines the tenant and routing context to show which Kafka consumer from each environment will handle a given message.

| Consumer Env | Tenant ID (*) | Route Info (*) | Allowed Consumer Tenancies (**) | Process Event? |
|---|---|---|---|---|
| prod | prod | N/A | prod | Yes |
| prod | doortest | N/A | prod | No |
| sandbox | prod | N/A | doortest | No |
| sandbox | doortest | N/A | doortest | Yes |
| prod | prod | N/A | both | Yes |
| prod | doortest | absent | both | Yes |
| prod | doortest | present | both | No |
| sandbox | prod | N/A | both | No |
| sandbox | doortest | sandbox host is not a match | both | No |
| sandbox | doortest | sandbox host is a match | both | Yes |
Figure 3: Kafka message consumption decision table
(*) from Kafka event context
(**) from YAML config
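The decision table in Figure 3 can be condensed into a single predicate. The sketch below is reconstructed from the table rows, not the framework's actual code; in particular, the final fallback line (a sandbox whose allowlist contains only doortest processes unrouted test events) is an assumption inferred from the table.

```python
# Hypothetical predicate reconstructed from the decision table; not the
# framework's actual implementation.

def should_process(consumer_env, tenant_id, route_host,
                   allowed_tenancies, consumer_host=None):
    """Return True if this consumer should process the event.

    consumer_env:      "prod" or "sandbox"
    tenant_id:         tenant context from the event headers
    route_host:        route context from the event headers, or None
    allowed_tenancies: the allowedConsumerTenancies list from YAML config
    consumer_host:     this deployment's host label (sandbox only)
    """
    # 1. Tenant filter: drop events whose tenant is not allowed.
    if tenant_id not in allowed_tenancies:
        return False
    # Sandbox consumers never process production-tenant events.
    if consumer_env == "sandbox" and tenant_id == "prod":
        return False
    # 2. Route filter: a route header targets one specific sandbox host.
    if route_host is not None:
        return consumer_host == route_host
    # No route info: production is the fallback for test events; a sandbox
    # scoped exclusively to doortest also processes them (assumption).
    return consumer_env == "prod" or list(allowed_tenancies) == ["doortest"]
```

Each row of Figure 3 corresponds to one evaluation of this predicate.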

Putting it all together

With this new multi-tenant aware Kafka, testing Kafka applications in isolation has become easier for the developers. No code changes are required; developers only need to add a single line to the configuration file. This update addresses several use cases, including the consumption of messages with designated tenant IDs and routing contexts. Additionally, it ensures that all Kafka messages are consumed without any being left unprocessed.

This solution ensures that the multi-tenancy paradigm is fully realized in Kafka, providing data isolation between different tenants and avoiding potential issues with data inconsistencies. Overall, this is a crucial step toward achieving a more robust and reliable production environment at DoorDash.

Conclusion

In summary, DoorDash has implemented a multi-tenancy awareness system for both Kafka producers and consumers that makes the production environment’s tech stack more efficient and developer-friendly for testing new features and patches. DoorDash has streamlined the test-and-release process for product team engineers through simple YAML file configurations while ensuring the security and isolation of each tenant’s data. The result is a more robust and simpler testing-in-production environment.