DeepETA_ How Uber Predicts Arrival Times Using Deep Learning

6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog
Uber Blog Sign up
Engineering
AI, Data / ML
DeepETA: How Uber Predicts Arrival Times Using Deep

Learning
February 10, 2022 / Global
At Uber, magical customer experiences depend on accurate arrival time predictions (ETAs). We
use ETAs to calculate fares, estimate pickup times, match riders to drivers, plan deliveries, and
more. Traditional routing engines compute ETAs by dividing up the road network into small
road segments represented by weighted edges in a graph. They use shortest-path algorithms
to find the best path through the graph and add up the weights to derive an ETA. But as we all
know, the map is not the terrain: a road graph is just a model, and it can’t perfectly capture
conditions on the ground. Moreover, we may not know which route a particular rider and driver
will choose to their destination. By training machine learning (ML) models on top of the road
https://www.uber.com/en-SA/blog/deepeta-how-uber-predicts-arrival-times/ 1/18
graph prediction using historical data in combination with real-time signals, we can refine
ETAs that better predict real-world outcomes.
For several years, Uber used gradient-boosted decision tree ensembles to refine ETA
predictions. The ETA model and its training dataset grew steadily larger with each release. To
keep pace with this growth, Uber’s Apache Spark™ team contributed upstream improvements
[1, 2] to XGBoost to allow the model to grow ever deeper, making it one of the largest and
deepest XGBoost ensembles in the world at that time. Eventually, we reached a point where
increasing the dataset and model size using XGBoost became untenable. To continue scaling
the model and improving accuracy, we decided to explore deep learning because of the
relative ease of scaling to large datasets using data-parallel SGD [3]. To justify switching to
deep learning we needed to overcome three main challenges:
Latency: The model must return an ETA within a few milliseconds at most.
Accuracy: The mean absolute error (MAE) must improve significantly over the
incumbent XGBoost model.
Generality: The model must provide ETA predictions globally across all of Uber’s lines of
business such as mobility and delivery.
To meet these challenges, Uber AI partnered with Uber’s Maps team on a project called
DeepETA to develop a low-latency deep neural network architecture for global ETA prediction.
In this blog post, we’ll be walking you through some of the learnings and design choices that
helped DeepETA become the new production ETA model at Uber. For more information, please
refer to our full paper.
Problem Statement
In the past few years, there has been increasing interest in systems that combine physical
models of the world with deep learning. We take a similar approach to ETA prediction at Uber.
Our physical model is a routing engine that uses map data and real-time traffic measurements
to predict an ETA as a sum of segment-wise traversal times along the best path between two
points. We then use machine learning to predict the residual between the routing engine ETA
and real-world observed outcomes. We call this hybrid approach ETA post-processing, and
DeepETA is an example of a post-processing model. From a practical perspective, it’s generally
easier to assimilate new data sources and accommodate fast-changing business
requirements by updating the post-processing model than it is to refactor the routing engine
itself.
Figure 1: Hybrid approach of ETA post-processing using ML models
To be able to predict the ETA residual, a post-processing ML model takes into account spatial
and temporal features, such as the origin, destination and time of the request, as well
information about real-time traffic and the nature of the request, such as whether it is a
delivery dropoff or rideshare pickup, as illustrated in Figure 1. This post-processing model is
the highest QPS (queries per second) model at Uber. This model needs to be fast so as not to
add too much latency to an ETA request, and they need to improve ETA accuracy as measured
by mean absolute error (MAE) across different segments of the data.
How We Made It Accurate

The DeepETA team tested and tuned 7 different neural network architectures: MLP [4], NODE
[5], TabNet [6], Sparsely Gated Mixture-of-Experts [7], HyperNetworks [8], Transformer [9] and
Linear Transformer [10]. We found that an encoder-decoder architecture with self-attention
provided the best accuracy. Figure 2 illustrates the high-level architecture of our design. In
addition, we tested different feature encodings and found that discretizing and embedding all
of the inputs to the model provided a significant lift over alternatives.
Figure 2: An overview of the DeepETA model pipeline
Encoder with Self-Attention

Many people are familiar with the Transformer architecture because of its applications in
Natural Language Processing and Computer Vision, but it might not be obvious how
Transformers can be applied to tabular data problems like ETA prediction. The defining
innovation of Transformer is the self-attention mechanism. Self-attention is a sequence-to-
sequence operation that takes in a sequence of vectors and produces a reweighted sequence
of vectors. Details can be found in the Transformer paper [9].
In a language model, each vector represents a single word token, but in the case of DeepETA,
each vector represents a single feature, such as the origin of the trip or the time of day. Self-
attention uncovers pairwise interactions among the K features in a tabular dataset by explicitly
computing a K*K attention matrix of pairwise dot products, using the softmax of these scaled
dot-products to reweight the features. When the self-attention layer processes each feature, it
looks at every other feature in the input for clues and outputs the representation of this
feature as a weighted sum of all features. This process is illustrated in Figure 3. Through this
way, we can bake the understanding of all the temporal and spatial features into the one
feature currently being processed and focus on the features that matter. In contrast to
language models, there is no positional encoding in DeepETA since the order of the features
doesn’t matter.
Figure 3: Attention matrix in self-attention
Taking a trip from origin A to destination B as an example, the self-attention layer scales the
importance of the features given the time of the day, origin and destination, traffic conditions
etc. A visualization of the self-attention is shown in Figure 4 (generated from Tensor2Tensor
[11]) with 8 colors corresponding to 8 attention heads and shares corresponding to random
generated attention weights.
Figure 4: An illustration of pairwise dot products of

input features: colors correspond to attention heads
and color shares correspond to random generated
attention weights
Feature Encoding
Continuous and Categorical Features
The DeepETA model embeds all categorical features, and bucketizes all continuous features
before embedding them. Somewhat counterintuitively, bucketizing continuous features led to
better accuracy than using continuous features directly. While bucketizing isn’t strictly
necessary because neural networks can learn any non-linear discontinuous function, a
network with bucketized features may have an advantage because it doesn’t have to spend any
of its parameter budget learning to partition the input space. Like the Gradient Boosted
Decision Tree Neural Network paper [12], we found that using quantile buckets provided better
accuracy than equal-width buckets. We suspect that quantile buckets perform well because
they maximize entropy: for any fixed number of buckets, quantile buckets convey the most
information (in bits) about the original feature value compared to any other bucketing scheme.
Geospatial Embeddings
Post-processing models receive the origin and destination of a trip as latitudes and longitudes.
Because these start and end points are very important for predicting ETAs, DeepETA encodes
them differently from other continuous features. Location data is distributed very unevenly
over the globe and contains information at multiple spatial resolutions. We therefore quantize
locations into multiple resolution grids based on latitude and longitude. As the resolution
increases, the number of distinct grid cells grows exponentially and the average amount of
data in each grid cell decreases proportionally. We explored three different strategies for
mapping these grids to embeddings:
Exact indexing, which maps each grid cell to a dedicated embedding. This takes up the
most space.
Feature hashing [16], which maps each grid cell into a compact range of bins using a
hash function. The number of bins is much smaller than with exact indexing.
Multiple feature hashing, which extends feature hashing by mapping each grid cell to
multiple compact ranges of bins using independent hash functions. See Figure 5:
Figure 5: An illustration of multi-resolution location grids with multiple

feature hashing using independent hash functions h1 and h2
Our experiments showed that while feature hashing saved space compared to exact indexing,
the accuracy was the same or slightly worse depending on the grid resolution. This is likely due
to hash collisions that cause some information to be lost. Multiple feature hashing provided
the best accuracy and latency while still saving space compared to exact indexing. This implies
that the network is able to combine the information from multiple independent hash buckets
to undo the negative effects of single-bucket collisions.
How We Made It Fast

DeepETA’s serving latency requirements are very stringent. While it’s possible to accelerate
inference by using specialized hardware or post-training optimizations, in this section we
discuss architectural design decisions that helped DeepETA minimize latency.
Fast Transformer
While the transformer-based encoder provided the best accuracy, it was too slow to meet the
latency requirements for online real-time serving. The original self-attention model has
quadratic complexity because it computes a K K attention matrix from K inputs. There have
been multiple research works that linearize the self-attention calculation, for example linear
transformer [10], linformer [13], performer [14]. After experimentation, we chose the linear
transformer, which uses the kernel trick to avoid calculating the attention matrix.
To illustrate the time complexity, let’s use the following example: say we have K inputs with
dimension d. The original transformer’s time complexity is O(K2d), while the linear
transformer’s time complexity is O(Kd2). If we have 40 features all with 8 dimensions, i.e. K = 40,
d = 8, K2d = 12,800, while Kd2 = 2,560. It’s clear that the linear transformer is faster whenever
K>d.
More Embeddings, Fewer Layers

Another secret to making DeepETA fast is to utilize feature sparsity. While the model has
hundreds of millions of parameters, any one prediction touches only a tiny fraction of them,
roughly 0.25%. How did we achieve this?
First of all, the model itself is relatively shallow with just a handful of layers. The vast majority
of the parameters exist in embedding lookup tables. By discretizing the inputs and mapping
them to embeddings, we avoid evaluating any of the unused embedding table parameters.
Discretizing the inputs gives us a clear speed advantage at serving time compared to
alternative implementations. Take the geospatial embeddings pictured in Figure 5 as an
example. To map a latitude and longitude to an embedding, DeepETA simply quantizes the
coordinates and performs a hash lookup, which takes O(1) time. In comparison, storing
embeddings in a tree data structure would require O(log N) lookup time, while using fully-
connected layers to learn the same mapping would require O(N2) lookup time. Seen from this
perspective, discretizing and embedding inputs is simply an instance of the classic space vs
time tradeoff in computer science: by precomputing partial answers in the form of large
embedding tables learned during training, we reduce the amount of computation needed at
serving time.
How We Made It General

One of the design goals of DeepETA is to provide a general ETA model that serves all of Uber’s
lines of business across the globe. This can be challenging because different lines of business
have different needs and different data distributions. The overall model structure is shown in
Figure 6 below.
Figure 6: Illustration of the DeepETA model structure
Bias Adjustment Decoder

Once we have learned meaningful feature representations, the next step is to decode them
and make predictions. In our case, the decoder is a fully connected neural network with a
segment bias adjustment layer. The distribution of absolute errors varies by a lot across
delivery trips vs rides trips, long vs short trips, pick-up vs drop-off trips, and also across global
mega-regions. Adding bias adjustment layers to adjust the raw prediction for each of the
different segments can account for their natural variations and in turn improve the MAE. This
approach performs better than simply adding segment features to the model. The reason we
implemented a bias adjustment layer instead of a multi-task decoder is due to latency
constraints. We also employ a few tricks to further improve prediction accuracy, such as using
ReLU at output to force predicted ETA to be positive; clamping to reduce the effect of extreme
values.
Asymmetric Huber Loss
Different business use cases require different types of ETA point estimates and will also have
varying proportions of outliers in their data. For example, we want to estimate a mean ETA for
fare computation but need to control for the effect of outliers. Other use cases may call for a
specific quantile of the ETA distribution. To accommodate this diversity, DeepETA uses a
parameterized loss function, asymmetric Huber loss, which is robust to outliers and can
support a range of commonly used point estimates.
Asymmetric Huber loss has two parameters, delta and omega, that control the degree of
robustness to outliers and the degree of asymmetry respectively. By varying delta (as shown in
Figure 7), squared error and absolute error can be smoothly interpolated, with the latter being
less sensitive to outliers. By varying omega, you can control the relative cost of
underprediction vs overprediction, which is useful in situations where being a minute late is
worse than being a minute early. These parameters not only make it possible to mimic other
commonly used regression loss functions, but also make it possible to tailor the point estimate
produced by the model to meet diverse business goals.
Figure 7: Asymmetric Huber Loss
How We Train and Serve the Model

We leveraged the Canvas framework from Uber’s ML platform – Michelangelo [15] to train and
deploy the model. The exact architecture we prototyped is essentially depicted in Figure 8.
Once the model is trained and deployed to Michelangelo, we need to serve those predictions to
users for real-time ETA prediction. The high-level architecture is shown in Figure 9 below: Uber
consumers’ requests are routed through various services to the uRoute service. The uRoute
service serves as a frontend for all routing lookups. It makes requests to the routing engine to
produce route-lines and ETAs. It uses this ETA and other model features to make requests to
the Michelangelo Online prediction service to get predictions from the DeepETA model.
Periodic auto retraining workflows are set up to retrain and validate the model.
Figure 8: Model Re-training and deployment pipeline
Figure 9: High-level system diagram of online serving
Conclusions and Future Work

We have launched this DeepETA model into production for global 4-wheel ETA prediction. The
DeepETA model launch makes it both possible and efficient to train and serve large-scale
Deep Learning models that predict ETAs better than XGBoost approaches. DeepETA delivers
an immediate improvement to metrics in production and establishes a model foundation that

can be reused for multiple consumer use cases.
DeepETA enables the ETA team to explore diverse model architectures and output multiple
ETAs that can be tailored for each consumer. More work is underway to further expand the
accuracy improvements it can deliver by looking at every aspect of the modeling process
covering Dataset, Features, Transformations, Model Architectures, Training Algorithms, Loss
Functions, and Tooling/Infrastructure Improvements. In the future, the team will explore
enhancements such as continuous, incremental training, which will enable ETAs to be trained
on fresher data.
Acknowledgements
DeepETA is a product of an Uber AI <> Maps partnership. We’d like to thank the following people
who helped make this possible:
Software Engineers: Michael Albada, Apurva Bhandari, Aditya Bhave, Tanmay Binaykiya,
Chongxiao Cao, Sashikanth Chandrasekaran, Eric Chen, Hao Jiang, Vladimir Kuzmin,
Michael Mallory, Michael Mui, Sean Po, Rich Porter, Vivek Sah, Raajay Viswanathan,
Joseph Wang, Peng Zhang
Data Scientists: Wenqi Hu, Alvira Swalin
Research Scientists: Olcay Cirit, Eric Frank, Xinyu Hu, Felipe Petroski Such
Product Managers: Wataru Ueno
Technical Program Managers: Samuel Chan, Ramit Hora
Eng Managers: Bozhena Bidyuk, Himaanshu Gupta, Sally Lee, Xu Ning
Leadership: Jaikumar Ganesh, Bobby Parikh, Smitha Shyam, Thomas Williams, Zoubin
Ghahramani, Jan Pedersen, Douglas Bemis
Apache Spark™ and Spark™ are trademarks of the Apache Software Foundation in the United
States and/or other countries. No endorsement by The Apache Software Foundation is implied
by the use of these marks.
References
1. Distributed training upstream improvement contributed by Uber’s Spark team.
2. Scalability upstream improvement contributed by Uber’s Spark team.
3. Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for
TensorFlow.
4. Multilayer perceptron
5. Popov, Sergei, Stanislav Morozov, and Artem Babenko. “Neural oblivious decision
ensembles for deep learning on tabular data.” arXiv preprint arXiv:1909.06312 (2019).
6. Arık, Sercan O., and Tomas Pfister. “Tabnet: Attentive interpretable tabular learning.” arXiv
(2020).
7. Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey
Hinton, and Jeff Dean. “Outrageously large neural networks: The sparsely-gated mixture-
of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).
8. Ha, David, Andrew Dai, and Quoc V. Le. “Hypernetworks.” arXiv preprint arXiv:1609.09106
(2016).
9. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need.” In Advances in
neural information processing systems, pp. 5998-6008. 2017.
10. Katharopoulos, Angelos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret.
“Transformers are rnns: Fast autoregressive transformers with linear attention.” In
International Conference on Machine Learning, pp. 5156-5165. PMLR, 2020.
11. Vaswani, Ashish, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez,
Stephan Gouws, Llion Jones et al. “Tensor2tensor for neural machine translation.” arXiv
preprint arXiv:1803.07416 (2018).
12. Saberian, Mohammad, Pablo Delgado, and Yves Raimond. “Gradient boosted decision
tree neural network.” arXiv preprint arXiv:1910.09340 (2019).
13. Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. “Linformer: Self-
attention with linear complexity.” arXiv preprint arXiv:2006.04768 (2020).
14. Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins et al. “Rethinking attention with performers.” arXiv preprint
arXiv:2009.14794 (2020).
15. Meet Michelangelo: Uber’s Machine Learning Platform
16. Weinberger, Kilian, et al. “Feature hashing for large scale multitask learning.” Proceedings
of the 26th annual international conference on machine learning. 2009.
Xinyu Hu
Xinyu Hu is a Senior Research Scientist at Uber AI focused on large-
scale machine learning applications in spatial-temporal problems
and causal inference. Xinyu holds a Ph.D. in Biostatistics from
Columbia University.
Olcay Cirit
Olcay Cirit is a Staff Research Scientist at Uber AI focused on ML
systems and large-scale deep learning problems. Prior to Uber AI, he
worked on ad targeting at Google.
Tanmay Binaykiya
Tanmay Binaykiya is a Senior Software Engineer on the Maps ETA
Prediction team, and focuses on projects at the intersection of ML
and systems.
Ramit Hora
Ramit Hora is the Technical Program Management Lead for Uber AI
& Maps where he works to drive innovation across Uber’s
technical systems and product experiences.
Posted by Xinyu Hu, Olcay Cirit, Tanmay Binaykiya, Ramit Hora

Category: AI Data / ML
Related articles
Engineering, AI, Backend
Personalized Marketing at Scale: Uber’s Out-of-App Recommendation

System
June 13 / Global
Engineering, AI, Data / ML
From Predictive to Generative – How Michelangelo Accelerates Uber’s

AI Journey
May 2 / Global
AI
DragonCrawl: Generative AI for High-Quality Mobile Testing
April 23 / Global
Engineering, AI, Data / ML
Scaling AI/ML Infrastructure at Uber
March 28 / Global
Engineering, AI, Backend, Data / ML
DataCentral: Uber’s Big Data Observability and Chargeback Platform
February 1 / Global
Most popular
Engineering, AI, Backend June 13 / Global
Personalized Marketing at Scale: Uber’s Out-of-App Recommendation System
Engineering, Backend June 18 / Global

Debugging with Production Neighbors – Powered by SLATE
Engineering, Backend June 20 / Global

How Uber ensures Apache Cassandra®’s tolerance for single-zone failure
Engineering June 27 / Global

Modernizing Logging at Uber with CLP (Part II)
View more stories
Uber
Visit Help Center
Company
About us
Our offerings
Newsroom
Investors
Blog
Careers
AI
Gift cards
Products
Ride
Drive
Deliver
Eat
Uber for Business
Uber Freight
Global citizenship
Safety
Diversity and Inclusion
Sustainability
Travel
Airports
Cities
English
Riyadh
© 2024 Uber Technologies Inc.
Privacy Accessibility Terms

DeepETA_ How Uber Predicts Arrival Times Using Deep Learning

Uploaded by

Copyright:

Available Formats

You might also like

DeepETA_ How Uber Predicts Arrival Times Using Deep Learning

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DeepETA_ How Uber Predicts Arrival Times Using Deep Learning

Uploaded by

Copyright:

Available Formats

6/29/24, 6:54 PM DeepETA: How Uber Predicts Arrival Times Using Deep Learning | Uber Blog

Uber Blog Sign up

DeepETA: How Uber Predicts Arrival Times Using Deep

Figure 1: Hybrid approach of ETA post-processing using ML models

How We Made It Accurate

Encoder with Self-Attention

Figure 3: Attention matrix in self-attention

Figure 4: An illustration of pairwise dot products of

Continuous and Categorical Features

Figure 5: An illustration of multi-resolution location grids with multiple

How We Made It Fast

More Embeddings, Fewer Layers

How We Made It General

Figure 6: Illustration of the DeepETA model structure

Bias Adjustment Decoder

Asymmetric Huber Loss

Figure 7: Asymmetric Huber Loss

How We Train and Serve the Model

Figure 8: Model Re-training and deployment pipeline

Figure 9: High-level system diagram of online serving

Conclusions and Future Work

an immediate improvement to metrics in production and establishes a model foundation that

Posted by Xinyu Hu, Olcay Cirit, Tanmay Binaykiya, Ramit Hora

Engineering, AI, Backend

Personalized Marketing at Scale: Uber’s Out-of-App Recommendation

Engineering, AI, Data / ML

From Predictive to Generative – How Michelangelo Accelerates Uber’s

DragonCrawl: Generative AI for High-Quality Mobile Testing

Engineering, AI, Data / ML

Scaling AI/ML Infrastructure at Uber

Engineering, AI, Backend, Data / ML

DataCentral: Uber’s Big Data Observability and Chargeback Platform

Engineering, Backend June 18 / Global

Engineering, Backend June 20 / Global

Engineering June 27 / Global

View more stories

Visit Help Center

Uber for Business

Diversity and Inclusion

© 2024 Uber Technologies Inc.

Privacy Accessibility Terms

You might also like