Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts

March 12, 2024 12 Minute Read Machine Learning 0

Chi Zhang

Chi Zhang is a Machine Learning Engineer on ETA ML team at DoorDash. Her focus is on predictive modeling and using ML to solving real-world problems.

Recent Posts

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts 12 Minute Read

How DoorDash Quickly Spins Up Multiple Image Recognition Use Cases 6 Minute Read

Chi Zhang

Lewis Warne

Lewis is a Data Scientist at DoorDash.

Recent Posts

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts 12 Minute Read

Lewis Warne

Ziqi Jiang

Ziqi Jiang is a Machine Learning Engineer on ETA ML team at DoorDash. She concentrates on enhancing ETA models to provide consumers with more precise expectations.

Recent Posts

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts 12 Minute Read

Ziqi Jiang

Qingyang Xu

Qingyang Xu is a Machine Learning Engineer on ETA ML team at DoorDash. He combines state-of-the-art machine learning and optimization techniques to solve challenging prediction problems and improve consumer experience

Recent Posts

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts 12 Minute Read

Qingyang Xu

Hubert Jenq

Hubert is a Machine Learning Engineer at DoorDash.

Recent Posts

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts 12 Minute Read

Hubert Jenq

Jianzhe Luo

Jianzhe is a data science manager at DoorDash, since the summer of 2020, working with the logistics team. He holds a PhD in Operations Research from the University of North Carolina at Chapel Hill.

Recent Posts

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts 12 Minute Read

Using ML and Optimization to Solve DoorDash’s Dispatch Problem 18 Minute Read

Jianzhe Luo

The DoorDash ETA team is committed to providing an accurate and reliable estimated time of arrival (ETA) as a cornerstone DoorDash consumer experience. We want to ensure that every customer can trust our ETAs, ensuring a high-quality experience in which their food arrives on time every time.

With more than 2 billion orders annually, our dynamic engineering challenge is to improve and maintain accuracy at scale while managing a variety of conditions within diverse delivery and merchant scenarios. Here we delve into three critical focus areas aimed at accomplishing this:

Extending our predictive capabilities across a broad spectrum of delivery types and ETA scenarios
Harnessing our extensive data to enhance prediction accuracy
Addressing each delivery’s variability in timing, geography, and conditions

To address these challenges, we've developed a NextGen ETA Machine Learning (ML) system, as shown in Figure 1 below. The key novelty in NextGen’s architecture is a two-layer structure which decouples the decision-making layer from the base prediction problem. The goal of the base layer is to provide unbiased accuracy-first prediction with uncertainty estimation through probabilistic predictions. Then the decision layer leverages the base model’s predictions to solve various multi-objective optimization problems for different businesses. This is an evolution from our previous ETA modeling method [1, 2] which focus more on long-tail minimization.

This harnesses state-of-the-art deep learning (DL) algorithms through a novel two-layer ML architecture that provides precise ETA predictions from vast, real-world data sets for optimal robustness and generalizability. The new system introduces:

Multi-task modeling to predict the various types of ETAs via a single model
Cutting-edge DL algorithms, ensuring state-of-the-art prediction accuracy
A probabilistic base layer coupled with a distinct decision layer to quantify and communicate uncertainty

*Figure 1. NextGen ETA Machine Learning System*

The base layer model outputs a predicted distribution to estimate expected ETA time. The decision layer leverages the base model’s predictions to solve various multi-objective optimization problems for different businesses.

Scaling ETAs Across Different Delivery Types

DoorDash's ETAs cater to various customer interaction stages and delivery types. Initially, customers can use ETAs on the home page to help them decide between restaurants and other food merchants. The features available for predicting these ETAs are limited because they are calculated before the customer has placed anything in their cart; feature latency must remain low to quickly predict ETAs for all nearby providers.

Delivery types range from a Dasher picking up prepared meals to grocery orders that require in-store shopping, which introduces distinct delivery dynamics.

Historically, this diversity required multiple specialized models, each finely tuned for specific scenarios, leading to a cumbersome array of models. While accurate, this approach proved unsustainable because of the overhead required to adapt and evolve each model separately.

The pivot to multi-task (MT) modeling addresses these challenges head-on. This approach enables us to streamline ETA predictions across different customer touchpoints and delivery types within a singular, adaptable framework.

Nowadays, MT architecture is commonly used in large-scale ML systems such as computer vision and recommendation systems. Our MT architecture consists of a shared heavyweight foundation layer, followed by a specialized lightweight task head for each prediction use case (see “Base Layer” in Figure 1 above).

The consolidated MT model offers four distinct advantages:

We can quickly onboard new ETA prediction use cases by attaching new task heads to the foundation layer and fine-tuning the predictions for the new task.
This model offers a seamless experience across the platform by providing consistent ETA predictions through different stages of the consumer's journey, including on the home page, store page, and checkout page. Before MT, we trained separate models for each stage, which led to discrepancies in ETA predictions across stages, negatively affecting the consumer experience.
Using the MT model, we can fully leverage the tremendous capacity of DL models, using high-dimensional input features to capture the uncertainty in ETA.
The MT model elegantly solves the data imbalance issue via transfer learning. At DoorDash, certain types of ETA predictions, for example Dasher delivery, are more common than others, such as consumer pick-up. If we train separate models for each use case, the infrequent ones will suffer from lower prediction accuracy. MT improves these infrequent use cases by transferring ETA patterns learned from frequent use cases. For example, MT greatly improved ETA predictions for deliveries in Australia, which previously used a separately trained model.

However, the novel MT architecture involves unique challenges. Model training is more complex because each model must simultaneously learn to predict different ETA use cases. After experimenting with different training strategies, we found that sequentially training different heads, or tasks, achieves the best overall prediction performance. Additionally, there is higher model latency in the MT architecture because of the neural network’s larger number of parameters. We address this through close collaboration with backend engineering teams.

Subscribe for weekly updates

Using Deep Learning to Enhance Accuracy

In our quest to maximize ETA prediction accuracy, we encountered a challenge with our tree-based ML models: they reached a performance plateau beyond which further model enhancements, additional features, or more training data fail to yield significant improvements. Tree-based models also could not generalize effectively to unseen or rare use cases — a common issue with these types of models.

Using tree-based models often resulted in tradeoffs that sacrificed earliness to reduce lateness, and vice versa, rather than improving on-time accuracy. To overcome this and align with our system design goals, we revamped our ETA model architecture from tree-based to neural networks that could provide more accurate, robust, and generalizable ETA prediction performance.

We plan to publish more blog posts on MT ETA model development. Here are high-level summaries of the three key areas we will address:

Feature engineering: We integrated feature-embedding techniques to effectively encapsulate a broad spectrum of high cardinality data. This advancement has significantly refined our model's ability to discern and learn from intricate patterns unique to specific data segments.
Enhanced model capability: Our model was augmented with advanced components pivotal in detecting both high- and low-level signals and in modeling complex interactions between features.
Flexible optimization: We leveraged the adaptable nature of the DL optimization process to provide flexibility for our model to meet our diverse set of objectives more effectively.

Leveraging Probabilistic Models for More Accurate ETAs

DoorDash believes it is pivotal to ensure customer trust. Customers depend on our delivery time estimates to make choices between restaurants and to organize their schedules. Previously, we used machine learning to produce estimates based on analyzing past delivery data, factoring in various elements like food preparation time and Dasher availability.

However, aspects of the delivery process are inherently uncertain and involve factors we can't fully predict or control. This unpredictability can affect out accuracy. Consider these scenarios:

Component	Unknown Factor	Potential Impact
Food preparation time	# of sit-down and non-DoorDash restaurant orders	Delay in food readiness
Dasher availability	Whether nearby Dashers will accept the order	Pickup time variability
Parking availability	In dense urban areas, Dashers may have trouble finding parking	Pickup and/or drop-off delays

By factoring in these uncertainties, we aim to refine our delivery estimates, balancing accuracy with the reality of unpredictable factors. We do this by leveraging the innovative two-layer structure which employs a probabilistic prediction model as the base layer.

Consider two theoretical deliveries as shown in Figure 2 below, both with a predicted ETA of 20 minutes, but one with a larger variance and a higher chance of being either early or late.

*Figure 2: Modeling the uncertainty of delivery time via by predicting its probability distribution*

These visuals underscore an important concept: It’s as crucial to understand the probability of all possible outcomes via a distribution's shape and spread as it is to know the mean ETA, or most likely arrival time. That’s why our new base layer was developed to produce a probabilistic forecast of delivery time, giving us a much more developed understanding of uncertainty.

Deep Dive — Evaluating the Accuracy of a Probabilistic Forecast

With our shift to probabilistic forecasts, we encounter a new challenge: Measuring the accuracy of our predictions. Unlike point estimates, where accuracy is assessed in a straightforward manner through metrics like mean absolute error (MAE) or root mean square error (RMSE), probabilistic predictions require a different approach.

Weather forecasting offers valuable insights into what is required to meet this challenge. Forecasters predict probabilities for weather events in a similar manner to our delivery time predictions. In both cases, there's a single actual outcome, whether it's a specific temperature or a delivery time, involving two primary metrics: Calibration and accuracy.

Calibration

Probabilistic calibration ensures a model's predicted distributions align closely with actual outcomes. In other words, realizations should be indistinguishable from random draws from predictive distributions. Consider a weather model that predicts temperature ranges with certain probabilities. A model that consistently underestimates high temperatures likely has a calibration issue.

To apply this to DoorDash’s ETA model, imagine our deliveries all have the same underlying distribution. Figure 3 shows lines on a Weibull cumulative distribution function (CDF) that denote five equal probability buckets. If the predicted distribution is accurate, we can expect an equal number of deliveries to land in each bucket.

*Figure 3: CDF of Weibull distribution, used to model long-tailed delivery time*

The following simulation illustrates the different types of calibration errors. In Figure 4 below, we show the results of simulating two underlying distributions versus a well-calibrated static predicted distribution. To create a probability integral transform (PIT) histogram visualization, we drew 500 deliveries and plotted the actual delivery times against our predicted quantiles to inspect probabilistic calibration.

*Figure 4: Use PIT to measure the calibration of the predicted probability distribution*

These visuals show how well our model captures the real-world variability in delivery times. Calibration is key to ensuring that our predictions are not just accurate for the most likely delivery time, but also reliable and reflective of actual delivery scenarios.

Accuracy

While calibration is essential, it cannot by itself encompass the complete picture of a model's accuracy. To holistically assess our model, we need a unified metric that can compare different models effectively. Here, we borrow from the weather forecasting field once again, employing a technique known as the continuous ranked probability score (CRPS).

CRPS extends the concept of MAE to probabilistic forecasts. The formula for CRPS is:

CPRS(F(· |X), x) ﹦∫ (F(y|X) - 1{y ≥ x} )² dy

Where:

F(y|X) represents the CDF of the forecast, conditioned on the covariates X
x is the single observed delivery time for an order
1{y ≥ x} is the indicator function, equal to 1 if y ≥ x and 0 otherwise

In simpler terms, CRPS measures how well the predicted distribution (every point weighted by its probability) aligns with the actual observed delivery time. Visually, we want to minimize the area between the predicted CDF and a vertical line at the observed delivery time, with a higher CRPS penalty for the area further away from the observed delivery time due to the square. Figure 5 shows two examples of this:

*Figure 5: Use CRPS to measure the alignment of predicted distribution with ground truth*

By averaging the CRPS across multiple predictions, we derive a single, comparable score for each model. This score is crucial to identify both those areas where our model excels and where it needs refinement, ensuring continuous improvement in ETA predictions. We have made significant strides in the probabilistic prediction and distribution evaluation, marking a milestone in our journey. But this journey is far from over; we anticipate further improvements and breakthroughs as we continue to refine and advance our efforts.

Experiment Results

Combining the above techniques, we developed three versions of NextGen ETA models which all improve consumer outcomes.

Accuracy: Our north star metric for the NextGen ETA metric measures how often a delivery arrives on time according to our prediction and the overall quality of our model
Consistency: Our guardrail metric ensures our ETAs are consistent and that customers don't see large changes in their ETAs

Chart — *Figure 6: Accuracy and consistency improvements achieved by next-gen ETA models*

Conclusion

At DoorDash, our commitment to providing transparent and accurate information is a fundamental part of fostering trust with our consumers. Understanding that the ETA is crucial for our customers, we've dedicated ourselves to enhancing our estimates’ precision. Through embracing advanced predictive modeling with multi-task models, deep learning, and probabilistic forecasts, we're producing world-class predictions while accounting for real-world uncertainties. This approach doesn't only improve our service; it reinforces our reputation as a reliable and customer-centric platform, ensuring that every ETA we provide is as accurate and trustworthy as possible.

Related Positions

Director/Senior Director, Marketing Analytics San Francisco, CA; Seattle, WA; New York, NY; Los Angeles, CA; Chicago, IL; Austin, TX; Washington, D.C. See All Jobs

Augmenting Fuzzy Matching with Human Review to Maximize Precision and Recall

When we encountered a business problem that required an error rate of zero, we turned to a classification model with human review

Robert B. Kaspar

Mitchell Koch 9 Minute Read

Machine Learning

Experiment Rigor for Switchback Experiment Analysis

At DoorDash, we believe in learning from our marketplace of Consumers, Dashers, and Merchants and thus rely heavily on experimentation to make the data-driven product and business decisions. Although the majority of the experiments conducted at DoorDash are A/B tests or difference-in-difference analyses, DoorDash occasionally relies on a type of experimentation internally referred to as ...

Carla Sneider

Yixin Tang 14 Minute Read

Machine Learning

Why Good Forecasts Treat Human Input as Part of the Model

At DoorDash, getting forecasting right is critical to the success of our logistics-driven business, but historical data alone isn’t enough to predict future demand. We need to ensure there are enough Dashers, our name for delivery drivers, in each market for timely order delivery. And even though it seems like people’s demand for food delivery ...

Brian Seo 10 Minute Read

Machine Learning

Using a Human-in-the-Loop to Overcome the Cold Start Problem in Menu Item Tagging

Learn how we built a classification model quickly, cheaply, and at scale

Abhi Ramachandran 19 Minute Read

Machine Learning

Leveraging Causal Inference to Generate Accurate Forecasts

Learn how DoorDash captures hard to measure macroeconomic effects like IRS refunds and the effect of Daylight savings in these case studies

Chad Akkoyun

Qiyun Pan 11 Minute Read

Machine Learning

Things Not Strings: Understanding Search Intent with Better Recall

For every growing company using an out-of-the-box search solution there comes a point when the corpus and query volume get so big that developing a system to understand user search intent is needed to consistently show relevant results. We ran into a similar problem at DoorDash where, after we set up a basic “out-of-the-box” search ...

Siddharth Kumar

Jimmy Zhou

Xiaochang Miao

Ashwin Kachhara 19 Minute Read

Machine Learning

Improving Experimental Power through Control Using Predictions as Covariate (CUPAC)

Too much varience can reducing experimental power. Learn how we solved this problem with our new CUPAC method

Jeff Li

Yixin Tang

Jared Bauman 9 Minute Read

Machine Learning

Building DoorDash’s Product Knowledge Graph with Large Language Models

Using Large Language Models, DoorDash built a product knowledge graph hat captures relationships between millions of consumer goods.

Steven Xu

Sree Chaitanya Vadrevu 9 Minute Read

Machine Learning

Increasing Operational Efficiency with Scalable Forecasting

Scaling forecasting to a large data team is not practical without a scalable platform. Learn how we built forecast factory at DoorDash

Ryan Schork 11 Minute Read

Thank you for subscribing!

Want More
Engineering Updates?

Susbscribe to the DoorDash engineering blog

High Uncertainty	Low Uncertainty

This distribution shows a wide spread, indicating significant variability. Although the average ETA is 20 minutes, actual delivery times may vary widely, reflecting a high level of uncertainty.	Here the distribution is closely clustered around the 20-minute mark. This tight grouping suggests that deliveries will likely align closely with the estimated time, indicating lower uncertainty.

Well-Calibrated	Under-Dispersion	Over-Dispersion

Deliveries are evenly distributed across the quantiles, indicating accurate predictions.	A U-shaped pattern suggests the model predicts a wider range than observed, indicating over-caution.	An inverse-U shape reveals the model underestimates variability, missing the extreme cases.

Less Accurate	More Accurate

Shows a larger area between the predicted CDF and the observed time, indicating lower accuracy.	Exhibits a smaller area, suggesting a closer alignment between prediction and reality.

Improving ETAs with Multi-Task Models, Deep Learning, and Probabilistic Forecasts

Chi Zhang

Recent Posts

Lewis Warne

Recent Posts

Ziqi Jiang

Recent Posts

Qingyang Xu

Recent Posts

Hubert Jenq

Recent Posts

Jianzhe Luo

Recent Posts

Scaling ETAs Across Different Delivery Types

Subscribe for weekly updates

Using Deep Learning to Enhance Accuracy

Leveraging Probabilistic Models for More Accurate ETAs

Deep Dive — Evaluating the Accuracy of a Probabilistic Forecast

Calibration

Accuracy

Experiment Results

Conclusion

Popular Posts

Related Positions

You May Also Like

Augmenting Fuzzy Matching with Human Review to Maximize Precision and Recall

Experiment Rigor for Switchback Experiment Analysis

Why Good Forecasts Treat Human Input as Part of the Model

Using a Human-in-the-Loop to Overcome the Cold Start Problem in Menu Item Tagging

Leveraging Causal Inference to Generate Accurate Forecasts

Things Not Strings: Understanding Search Intent with Better Recall

Improving Experimental Power through Control Using Predictions as Covariate (CUPAC)

Building DoorDash’s Product Knowledge Graph with Large Language Models

Increasing Operational Efficiency with Scalable Forecasting