In early 2022, we reviewed the architecture of DoorDash’s global search and concluded that, given our rapid growth, we would be unable to scale the system efficiently within three years, particularly as global search shifted from a store-only to a hybrid item-and-store search experience.

Our analysis identified Elasticsearch as the architecture’s primary bottleneck. Two aspects of that search engine were causing the trouble: its document-replication mechanism and its lack of support for complex document relationships. In addition, Elasticsearch provides no built-in capabilities for query understanding and ranking.

We decided the best way to address these challenges was to move away from Elasticsearch to a homegrown search engine, with Apache Lucene at its core. The Search Engine uses a segment-replication model and separates indexing traffic from search traffic. We designed the index to store multiple types of documents with relationships between them. Following the migration to DoorDash’s Search Engine, we saw a 50% reduction in p99.9 latency and a 75% decrease in hardware costs.

Path to Our Search Engine

We wanted the new system to be a horizontally scalable, general-purpose search engine that could absorb any volume of traffic, indexing or searching, by adding more replicas. We also designed the service to be a one-stop solution for every DoorDash team that needs a search engine.

Apache Lucene, the new system’s foundation, is a mature information retrieval library used in several other systems, including Elasticsearch and Apache Solr. Because the library provides all the primitives needed to build a search engine, we only had to design and build opinionated services on top of it.

The Search Engine Components

To address scalability challenges, we adopted a segment-replication model. We split indexing and searching responsibilities into two distinct services – indexer and searcher, as shown in Figure 1 below. The indexer is a non-replicated service that handles all incoming indexing traffic and uploads newly created index segments to S3 for searcher consumption. The searcher is a replicated service that serves queries against the index downloaded from S3.
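As a rough sketch of the indexer’s publish step, the flow might look like the following; the ObjectStore interface stands in for an S3 client and is purely hypothetical, while the Lucene calls are real:

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;

// Hypothetical S3 wrapper; the real upload path is not described in the post.
interface ObjectStore {
    void upload(Directory dir, String fileName) throws Exception;
}

// Commit the index, read the latest commit point, and ship every file that
// belongs to it (including the segments_N file) to object storage, where
// searchers can download it.
final class SegmentPublisher {
    private final ObjectStore objectStore;

    SegmentPublisher(ObjectStore objectStore) {
        this.objectStore = objectStore;
    }

    void publish(IndexWriter writer) throws Exception {
        writer.commit(); // flush buffered documents into immutable segments
        Directory dir = writer.getDirectory();
        SegmentInfos commit = SegmentInfos.readLatestCommit(dir);
        for (String fileName : commit.files(true)) {
            objectStore.upload(dir, fileName);
        }
    }
}
```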

Because the searcher is not responsible for indexing traffic, it only needs to scale proportionally to the search traffic. In other words, the searcher will not be affected by any volume of indexing traffic. The indexer is not a replicated service; horizontally scaling the indexer means increasing the number of index shards, which could be expensive. To alleviate that issue, we split the indexing traffic into bulk and high-priority updates. The high-priority updates are applied immediately, while the bulk updates are only applied during the next full index build cycle, usually every six hours.
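A minimal sketch of that split, assuming a hypothetical UpdateStore that buffers bulk changes until the next full build:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// Hypothetical buffer where bulk updates wait for the next full index build.
interface UpdateStore {
    void save(String docId, Document doc);
}

final class UpdateHandler {
    private final IndexWriter writer;    // live index of the current generation
    private final UpdateStore bulkStore;

    UpdateHandler(IndexWriter writer, UpdateStore bulkStore) {
        this.writer = writer;
        this.bulkStore = bulkStore;
    }

    void handle(String docId, Document doc, boolean highPriority) throws Exception {
        if (highPriority) {
            // Applied immediately: replaces any existing document with this id.
            writer.updateDocument(new Term("id", docId), doc);
        } else {
            // Deferred: picked up by the next full build, roughly every six hours.
            bulkStore.save(docId, doc);
        }
    }
}
```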

Figure 1: The Search Stack Architecture

Indexers and searchers alone are not enough to serve a query, because the index may consist of multiple shards. We therefore designed the broker service as an aggregation layer that fans the query out to each relevant index shard and merges the results. The broker also rewrites the user’s raw query with the help of a query understanding and planning service.
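A minimal sketch of the broker’s scatter-gather step, with ShardClient and Hit as hypothetical stand-ins for the real RPC layer:

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;

record Hit(String docId, double score) {}

interface ShardClient {
    List<Hit> search(String rewrittenQuery, int k);
}

final class Broker {
    private final List<ShardClient> shards;

    Broker(List<ShardClient> shards) {
        this.shards = shards;
    }

    // Fan the already-rewritten query out to every relevant shard in parallel,
    // then merge the per-shard top-k lists into a single global top-k.
    List<Hit> search(String rewrittenQuery, int k) {
        List<CompletableFuture<List<Hit>>> futures = shards.stream()
                .map(s -> CompletableFuture.supplyAsync(() -> s.search(rewrittenQuery, k)))
                .toList();
        return futures.stream()
                .flatMap(f -> f.join().stream())
                .sorted(Comparator.comparingDouble(Hit::score).reversed())
                .limit(k)
                .toList();
    }
}
```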

We also needed a component to perform query understanding and query planning. This component must know the specifics of a particular index and the business domain in which the index is used. Outsourcing this responsibility to clients would be suboptimal, because each client would have to replicate the logic and keep it up to date. With the logic consolidated in a query planning service, clients only need to know the high-level interface without getting into the details of query internals.


General Purpose Search Engine

As a general-purpose search engine, the Search Engine must power not only DoorDash’s store and item search but also serve every team that needs an information retrieval solution. That meant designing the system with a clear separation between core search and business logic. A user must be able to express business logic with little to no code changes, and that logic must be completely isolated from the logic of other users.

The best way to separate core search from business logic is to introduce a declarative configuration for the index schema and provide a generic query language. The index schema lets users define strongly typed documents, or namespaces, and create relationships between namespaces. A namespace definition consists of three primary parts:

  • Indexed fields are fields the indexer processes and writes, in one form or another, into the index; not all of them end up in the inverted index itself. The Search Engine supports all Apache Lucene field types, including text fields, numeric doc values, dimensional points, and KNN vectors.
  • Computed fields are fields computed dynamically at query time from inputs such as the query, the indexed fields, and other computed fields. The computed fields framework provides a means to express complex ranking functions and custom business logic; for example, we can define BM25 or an ML model as a computed field (see the sketch after this list).
  • Query planning pipelines define how raw client queries are processed into the final form used to retrieve and rank documents. The primary objective is to encapsulate the business logic and store it in one place. For example, a client calling DoorDash’s global search does not need to deal with the complexity of geo constraints if that logic is implemented in a query planning pipeline; the client only needs to supply the coordinates or geo-hash of the delivery address and the name of the query planning pipeline to invoke.
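To make the computed-field idea concrete, here is a minimal sketch of how such a field could be expressed with Lucene’s expressions module, assuming a hypothetical popularity doc-values field; the formula is illustrative, not DoorDash’s actual ranking function:

```java
import java.text.ParseException;
import org.apache.lucene.expressions.Expression;
import org.apache.lucene.expressions.SimpleBindings;
import org.apache.lucene.expressions.js.JavascriptCompiler;
import org.apache.lucene.queries.function.FunctionScoreQuery;
import org.apache.lucene.search.DoubleValuesSource;
import org.apache.lucene.search.Query;

// Wrap a retrieval query so documents are re-scored by a computed value that
// blends the BM25 score with a (hypothetical) "popularity" numeric field.
static Query withPopularityBoost(Query baseQuery) throws ParseException {
    Expression expr = JavascriptCompiler.compile("_score + 0.2 * ln(1 + popularity)");

    SimpleBindings bindings = new SimpleBindings();
    bindings.add("_score", DoubleValuesSource.SCORES); // the query's BM25 score
    bindings.add("popularity", DoubleValuesSource.fromLongField("popularity"));

    return new FunctionScoreQuery(baseQuery, expr.getDoubleValuesSource(bindings));
}
```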

In addition to the flexible index schema model, we created an SQL-like API as a powerful, flexible query language that lets customers express their business logic with minimal code changes. The API provides a standard set of search operators, such as keyword groups, filter constraints, sorting by fields, and a list of returned fields. Additionally, the Search Engine supports join and dedupe operators.
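To give a flavor of the API, a query against a hypothetical store namespace might look like the following; both the query grammar and the SearchClient interface here are illustrative, not the engine’s actual syntax:

```java
// Illustrative only: the real grammar and client API are not shown in the post.
String query = """
    SELECT store_id, store_name
    FROM store
    WHERE MATCHES(name_tokens, 'burrito')
      AND distance_km < 5
    ORDER BY ranking_score DESC
    LIMIT 50
    """;
SearchResponse response = searchClient.execute(query, "global_store_search"); // pipeline to invoke
```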

To support the join operator, we designed relationships between namespaces. A relationship can be either local-join or block-join (nested). The local-join relationship between parent and child namespaces guarantees that a child document is added to an index shard only if a parent document references it. The block-join relationship works similarly, but the parent and its children must be indexed together as a single block. Both options have trade-offs: local-join allows documents to be updated independently but requires executing queries sequentially, while block-join allows faster query execution but requires reindexing the whole document block.
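To make the block-join flavor concrete, here is a minimal sketch using Lucene’s join module; the field names (docType, item_name, store_id) are hypothetical, not DoorDash’s actual schema:

```java
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

// Index a store and its items as one block: children first, parent last.
void indexStoreBlock(IndexWriter writer) throws Exception {
    Document item1 = new Document();
    item1.add(new TextField("item_name", "carne asada burrito", Field.Store.NO));
    Document item2 = new Document();
    item2.add(new TextField("item_name", "chicken quesadilla", Field.Store.NO));
    Document store = new Document();
    store.add(new StringField("docType", "store", Field.Store.NO));
    store.add(new StringField("store_id", "42", Field.Store.YES));
    writer.addDocuments(List.of(item1, item2, store)); // one atomic block
}

// Find stores that have at least one item matching "burrito".
Query storesWithBurritos() {
    BitSetProducer parentFilter =
            new QueryBitSetProducer(new TermQuery(new Term("docType", "store")));
    Query childQuery = new TermQuery(new Term("item_name", "burrito"));
    return new ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.Max);
}
```

Because the children and the parent are written in a single addDocuments call, updating one item means reindexing the whole block, which is exactly the trade-off described above.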

Tenant Isolation and Search Stacks

Data and traffic isolation are important to users of a general-purpose search engine. To provide this isolation, we designed a search stack: a collection of search services dedicated to one particular index. A component of one search stack only knows how to build or query its own index, so sudden issues in one search stack cannot spill over into others. Additionally, we can easily attribute all provisioned resources to the tenants that own them, keeping each tenant accountable.

Search stacks are great for isolating tenants’ index schemas and services. Beyond that, we wanted an easy way to mutate the index schema and stack configuration without worrying about the backward compatibility of changes. Users must be able to change the index schema or fleet configuration and deploy the changes as soon as they are free of internal contradictions.

We designed a special component called the control plane, an orchestration service responsible for stack mutation, as shown in Figure 2 below. The control plane rolls a stack forward by gradually deploying a new generation and descaling the previous one. A generation has a fixed version of the search Docker image to deploy; all search components in the same generation share the same code version, index schema, and fleet configuration. Components are isolated within their generation and can only communicate with components of the same generation: a searcher can only consume an index produced by that generation’s indexer, and a broker can only query that generation’s searchers.

Figure 2: Deployment of a New Stack Generation

This simplifies user-side changes in exchange for a more complex deployment pipeline. The control plane deploys a new generation of a stack every six hours, although that cadence can be set to any arbitrary timing. It starts by cutting a new release of the search repository. When the release is ready, the control plane deploys a new stack, starting with the indexer. The indexer builds a new index from scratch (a full index build) and catches up with high-priority updates. After the indexer signals that the new index is ready, the control plane gradually scales up the serving side of the new generation and descales the previous one.
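Under the hood, the rollout might be orchestrated roughly as follows; every helper in this sketch is hypothetical, and error handling and rollback are omitted:

```java
// A simplified sketch of the generation rollout described above.
public void rollOutGeneration(StackConfig config) throws Exception {
    Generation next = Generation.from(config, cutRelease()); // new search release

    // Deploy the indexer first; wait for the full index build from scratch,
    // then for high-priority updates to catch up.
    deploy(next.indexer());
    awaitFullIndexBuild(next.indexer());
    awaitHighPriorityCatchUp(next.indexer());

    // Shift serving capacity step by step: scale up the new generation's
    // searchers and broker while descaling the previous generation's.
    Generation previous = currentGeneration();
    for (int percent = 10; percent <= 100; percent += 10) {
        scaleServing(next, percent);
        scaleServing(previous, 100 - percent);
        awaitHealthy(next);
    }
    markCurrent(next);
}
```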

Conclusion

We spent 2023 implementing the Search Engine and migrating DoorDash to it. In the first half of the year, we delivered the initial version of the system and migrated the global store search, which cut store retrieval latency in half and reduced fleet cost fourfold.

During the second half of the year, we added support for join queries, query planning, and ML ranking functions. We migrated query understanding from the client to the query planning layer, so any client can now call the search without replicating complex query-building logic. Join queries and ML ranking let us run global item search without first calling the store index. These features contributed to significant improvements in the precision and recall of the item index.

Migrating to an in-house search engine has given us tight control over the index structure and the query flow. The Search Engine lets us create a flexible, generic solution with features optimized for specific DoorDash needs and the scalability to grow at the same pace as DoorDash’s business.