
What Truly Matters in Trajectory Prediction for

Autonomous Driving?

Haoran Wu1,2∗, Tran Phong2∗, Cunjun Yu2∗, Panpan Cai3, Sifa Zheng1, David Hsu2
1 Tsinghua University   2 National University of Singapore   3 Shanghai Jiao Tong University

arXiv:2306.15136v1 [cs.RO] 27 Jun 2023

Abstract

In the autonomous driving system, trajectory prediction plays a vital role in ensur-
ing safety and facilitating smooth navigation. However, we observe a substantial
discrepancy between the accuracy of predictors on fixed datasets and their driving
performance when used in downstream tasks. This discrepancy arises from two
overlooked factors in the current evaluation protocols of trajectory prediction: 1)
the dynamics gap between the dataset and real driving scenario; and 2) the com-
putational efficiency of predictors. In real-world scenarios, prediction algorithms
influence the behavior of autonomous vehicles, which, in turn, alter the behaviors
of other agents on the road. This interaction results in predictor-specific dynamics
that directly impact prediction results. As other agents’ responses are predeter-
mined in datasets, a significant dynamics gap arises between evaluations conducted
on fixed datasets and actual driving scenarios. Furthermore, focusing solely on
accuracy fails to address the demand for computational efficiency, which is critical
for the real-time response required by the autonomous driving system. Therefore,
in this paper, we demonstrate that an interactive, task-driven evaluation approach
for trajectory prediction is crucial to reflect its efficacy for autonomous driving.

1 Introduction
Current trajectory prediction evaluation [19, 5, 3] relies on real-world datasets, operating under the
assumption that dataset accuracy is equivalent to prediction capability. We refer to this as Static
Evaluation. This methodology, however, falls short when the predictor serves as a sub-module for
downstream tasks in Autonomous Driving (AD) [18, 15]. As illustrated in Figure 1, the evaluation of
Average Displacement Error (ADE) and Final Displacement Error (FDE) on the dataset does not necessarily
reflect the actual driving performance [25, 4]. This discrepancy stems from two factors: the dynamics
gap between fixed datasets and AD systems, and the computational efficiency of predictors.
The dynamics gap arises from the fact that the behavior of the autonomous vehicle, also known as
the ego-agent, changes with different trajectory predictors. In real-world scenarios, the ego-agent
utilizes trajectory predictions to determine its actions. However, different trajectory predictions result
in varied behaviors of the ego-agent, which, in turn, influence the future behaviors of other road
users, leading to different dynamics within the environment. This directly affects the prediction
results as other agents behave differently. Consequently, there exists a disparity between the dynamics
represented in the dataset and the actual driving scenario when assessing a specific trajectory predictor.
To tackle this issue, we propose the use of an interactive simulation environment to evaluate the
predictor for downstream decision-making. This environment enables us to calculate a "Dynamic
ADE/FDE" while the ego-agent operates with the specific predictor, thus mitigating the dynamics
gap. We demonstrate a strong correlation between Dynamic ADE/FDE and driving performance

∗ Equal contribution.

Preprint. Under review.


[Figure 1 panels: scatter plots of Driving Performance vs ADE (top) and vs FDE (bottom) for the
selected predictors, with "Assumption" and "Reality" trend curves, alongside per-predictor radar
charts over 1/ADE or 1/FDE, Efficiency, Comfort, Safety, and Driving performance.]

Figure 1: Prediction performance vs driving performance. Contrary to popular belief (Black Curve),
our study surprisingly indicates no strong correlation between current prediction evaluation metrics
and real driving performance (Red Curve). Eight representative prediction models are selected:
Constant Velocity (CV) [20], Constant Acceleration (CA) [20], K-Nearest Neighbor (KNN) [5],
Social KNN (S-KNN) [5], HiVT [27], LaneGCN [14], LSTM and Social LSTM (S-LSTM) [1].
Details can be found in the supplementary materials.

through extensive experiments. This underscores the significance of addressing the dynamics gap
and emphasizes the importance of incorporating it into the evaluation process.
The computational efficiency of trajectory prediction models is also a critical aspect of driving perfor-
mance. The downstream planners in AD systems, with their varying levels of complexity, impose
different efficiency requirements on these models. Consequently, a balance between computational
efficiency and prediction accuracy is essential, especially as simpler planners [23] can accommodate
slow predictors, whereas sophisticated planners [22] necessitate efficient predictors for safety and
timely responses. To delve deeper into this balance, we conduct experiments with differing time
budgets for planning execution. Our findings suggest that computational efficiency is the determinant
of driving performance under time constraints, while the Dynamic ADE/FDE becomes predominant
in scenarios with sufficient time. This interplay between dynamics gap mitigation and computational
efficiency underscores the complexity of optimizing trajectory prediction in AD systems.
In this paper, we aim to address two specific aspects of trajectory prediction for AD systems. Firstly,
we uncover the limitations of existing trajectory prediction evaluation methods, highlighting their
inability to accurately reflect driving performance. Secondly, building on the identified shortcomings,
we introduce and validate task-driven interactive evaluation metrics. These metrics provide a more
effective way of assessing prediction models for autonomous driving by considering the dynamics
gap between datasets and AD systems and the demand for real-time responses.

2 Related Work

2.1 Motion Prediction and Evaluation


Motion prediction methods can be classified along three dimensions [10]: modeling approach, output
type, and situational awareness. The modeling approach includes physics-based models [20, 2]
that use physics to simulate agents’ forward motion, and learning-based models [27, 14] that learn
and predict motion patterns from data. The output type can be intention, single-trajectory [17],
multi-trajectory [11], or occupancy map [8, 9]. These outputs differ in the type of motion they
predict and how they handle the uncertainty of future states. The situational awareness includes
unawareness, interaction [1], scene, and map awareness [21]. It refers to the predictor’s ability to
incorporate environmental information, which is crucial for collision avoidance and efficient driving.
Most researchers [19, 16] and competitions [5, 3] evaluate the performance of prediction models
on real-world datasets, in which ADE/FDE and their probabilistic variants minADE/minFDE are
commonly used metrics. However, these evaluation protocols fail to capture the dynamics gap
between datasets and real-world scenarios. The actions of the ego-agent are influenced by predictors,
which in turn affect the reactive motions of other agents. In this study, we select four model-based
and six learning-based models with varying output types and situational awareness to cover a wide
range of prediction models. We implement these predictors in an interactive simulation environment
to demonstrate how current prediction evaluation protocols fall short by neglecting the dynamics gap.

2.2 Task-aware Motion Prediction and Evaluation


Task-aware motion prediction remains an underexplored area in research. While some studies touch
upon the subject, they still focus on proposing task-aware metrics for training or eliminating improper
predictions on datasets. One notable example of training on task-aware metrics is the Planning KL
Divergence (PKL) metric [18]. Although designed for 3D object detection, it measures the similarity
between detection and ground truth by calculating the difference in ego-plan performance. In the
context of motion prediction, Rowan et al. [15] propose a control-aware metric (CAPO) similar to
PKL. CAPO employs attention [24] to find correlations between predicted trajectories, assigning
higher weights to agents inducing more significant reactions. Similarly, the Task-Informed method
[12] assumes a set of candidate trajectories from the planner and adds the training loss for each
candidate. Another line of work focuses on designing task-aware functions to eliminate improper
predictions [13, 7]. The proposed metric can capture unrealistic predictions and better correlate with
driving performance in planner-agnostic settings. However, these works are evaluated in an open-loop
manner, neglecting the impact of the dynamics gap and the computational efficiency of predictors. In
this paper, we utilize the interactive, task-driven evaluation to demonstrate how these two neglected
factors influence driving performance.

3 Planning with Motion Prediction

In this section, we present the problem formulation for evaluating motion prediction with planning in
the context of autonomous driving. The main goal is to create a safe and efficient driving plan based
on the prediction of traffic participants’ motion. We denote the ego-agent, which is the Autonomous
Vehicle (AV), as A, and the n surrounding traffic participants as i ∈ {1, . . . , n}.
State. The state of the system is represented by the tuple s = (sA , s1 , . . . , sn , C) for s ∈ S, where
sA denotes the state of the AV, si denotes the state of the i-th surrounding traffic participant, and C
represents the context information. The state for each vehicle includes its position, velocity, heading,
and other relevant attributes, e.g., sA = (xA , yA , vA , θA ). The context information C may include
road geometry, lane topology, centerlines, drivable area, and other relevant features.
Action. At each time step, the AV can take an action a ∈ A, where A is the action space. The action
may consist of both longitudinal and lateral components, such as acceleration and steering angle, e.g.,
a = (aaccel , asteer ).
Transition Function. The transition function f : S × A → S defines the dynamics of the system
and how the system evolves as a result of the ego vehicle’s action. In the planner, this refers to
the motion prediction model. We denote the motion prediction model as a function $\hat{X}_i = M(X_i, C^t)$,
where $X_i = \{s_i^{t-T_o+1}, \ldots, s_i^{t-1}, s_i^t\}$ denotes the history of the $i$-th surrounding traffic participant
over the observation horizon $T_o$ at timestep $t$, and $\hat{X}_i$ denotes the predicted future states of the $i$-th
traffic participant over the prediction horizon $T_p$. Given the history $X_i$ of a traffic participant $i$ and
the context information $C^t$, the motion prediction model computes the predicted future states as
$\hat{X}_i = M(X_i, C^t) = \{\hat{s}_i^{t+1}, \hat{s}_i^{t+2}, \ldots, \hat{s}_i^{t+T_p}\}$.
Planner. The planner is represented as a mapping function a = π(s, M ) that takes the current state
s ∈ S and the motion prediction model M as input, and produces an action a ∈ A as output. The
planner is designed to find the optimal sequence of actions by minimizing the objective function
$J(s, a_1, \ldots, a_T)$, considering the predicted motion of surrounding traffic participants and the map
information.
Objective Function. The objective of the AV is to safely and efficiently reach its destination while
considering the predicted motion of the surrounding traffic participants using the motion prediction
model M . The cost function c : S × A → R maps a state-action pair to a real-valued cost. The
objective function is the accumulated cost over a planning horizon $T$, defined as
$J(s, a_1, \ldots, a_T) = \sum_{t=1}^{T} c(s^t, a^t)$. The objective is to find an optimal sequence of actions
that minimizes this objective function.
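A minimal sketch of how a planner could score candidate action sequences under this objective is shown below; the transition and cost functions are placeholders standing in for the predictor-based rollout and the (unspecified) cost terms.

```python
from typing import Callable, List, Sequence, Tuple

State = dict                   # placeholder for the system state s
Action = Tuple[float, float]   # (acceleration, steering angle)


def rollout_cost(s0: State, actions: Sequence[Action],
                 transition: Callable[[State, Action], State],
                 cost: Callable[[State, Action], float]) -> float:
    """Accumulate J(s, a_1, ..., a_T) = sum_t c(s^t, a^t) along one rollout."""
    total, s = 0.0, s0
    for a in actions:
        s = transition(s, a)   # next system state, using the motion predictor internally
        total += cost(s, a)    # stage cost (e.g., safety, efficiency, comfort terms)
    return total


def plan(s0: State, candidates: List[Sequence[Action]],
         transition: Callable[[State, Action], State],
         cost: Callable[[State, Action], float]) -> Sequence[Action]:
    """Return the candidate action sequence with the minimum accumulated cost."""
    return min(candidates, key=lambda seq: rollout_cost(s0, seq, transition, cost))
```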

Table 1: Selected prediction methods.

Modeling Approach   Method         Output Type   Interaction Aware   Scene Aware   Map Aware
Model-based         CV [20]        ST            ✗                   ✗             ✗
Model-based         CA [20]        ST            ✗                   ✗             ✗
Model-based         KNN [5]        MT            ✗                   ✗             ✗
Model-based         S-KNN [5]      MT            ✓                   ✗             ✗
Data-driven         LSTM           ST            ✗                   ✗             ✗
Data-driven         S-LSTM [1]     ST            ✓                   ✗             ✗
Data-driven         HiVT [27]      MT            ✓                   ✓             ✓
Data-driven         LaneGCN [14]   MT            ✓                   ✓             ✓
Data-driven         HOME [8]       OM            ✓                   ✓             ✓
Data-driven         DSP [26]       MT            ✓                   ✓             ✓

*Abbreviations: ST: Single-Trajectory, MT: Multi-Trajectory, OM: Occupancy Map.

4 Study Design
Our aim is to identify the key factors involved in trajectory prediction for autonomous driving and how
they influence driving performance by simulating real-world scenarios that vehicles might encounter.
The ultimate goal is to introduce task-driven interactive evaluation metrics to assess future trajectory
prediction models. To achieve this objective, we need to determine four crucial components: 1)
motion prediction methods to be covered; 2) the planner, which employs various prediction models;
3) the simulator, which allows us to replicate interactive scenarios; and 4) the evaluation protocols, which
enable us to assess the effectiveness of the key factors and evaluation metrics involved in motion
prediction with respect to real-world driving performance.

4.1 Motion Prediction Methods

We select 10 representative prediction models to achieve comprehensive coverage of mainstream
approaches, ranging from simple model-based methods to complex data-driven approaches, as
presented in Table 1. Constant Velocity (CV) and Constant Acceleration (CA) [20] assume that the
predicted agent maintains a constant speed or acceleration within the prediction horizon. K-Nearest
Neighbor (KNN) predicts an agent's future trajectory based on the most similar historical trajectories,
while Social-KNN (S-KNN) [5] extends it by also considering the similarity of surrounding agents.
These methods are widely used as baselines given their effectiveness in simple prediction cases.
Social LSTM (S-LSTM) [1], HiVT [27], LaneGCN [14], and HOME [8] represent four distinct types
of neural networks: RNN, Transformer, GNN, and CNN. DSP [26] utilizes a hybrid design of neural
networks, representing state-of-the-art prediction models.
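As a reference for the simplest baselines, constant-velocity and constant-acceleration extrapolation can be written in a few lines; the fixed time step and the 2D-position input format below are assumptions for illustration.

```python
import numpy as np


def constant_velocity_predict(history: np.ndarray, horizon: int, dt: float = 0.1) -> np.ndarray:
    """history: (T_o, 2) past positions, most recent last; returns (horizon, 2) predictions."""
    velocity = (history[-1] - history[-2]) / dt            # finite-difference velocity
    steps = np.arange(1, horizon + 1).reshape(-1, 1)
    return history[-1] + steps * dt * velocity


def constant_acceleration_predict(history: np.ndarray, horizon: int, dt: float = 0.1) -> np.ndarray:
    """Same idea, but also extrapolates the change in velocity (needs >= 3 past positions)."""
    v_prev = (history[-2] - history[-3]) / dt
    v_last = (history[-1] - history[-2]) / dt
    accel = (v_last - v_prev) / dt
    t = np.arange(1, horizon + 1).reshape(-1, 1) * dt
    return history[-1] + v_last * t + 0.5 * accel * t ** 2
```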

4.2 Planners

An ideal planner should: 1) be able to handle state and action uncertainty; 2) take into account
multiple factors of driving performance, such as safety (collision avoidance), efficiency (timely goal
achievement), and comfort (smooth driving); 3) be aware of interactions with other agents; and 4)
support real-time execution. We select two motion planners based on these criteria: a simplistic
planner that only meets 2) and 4), and a sophisticated planner that satisfies all of 1) - 4). This allows
us to draw general conclusions and obtain planner-agnostic results.
RVO. The RVO planner [23] is a simplistic planner that solves the optimization problem in the
velocity space under collision avoidance constraints. The planner uses motion predictions to find
the future trajectories of other agents and avoid possible collisions, treating the predicted motions as
deterministic; thus, it does not consider state and action uncertainty. As the RVO planner does not maintain states
between consecutive timesteps, it also cannot optimize its planning with respect to interactions with
other agents. The objective function of the RVO planner involves safety and efficiency within a short
time window, and the RVO planner executes in real-time.
DESPOT. The DESPOT planner [22] is a state-of-the-art belief-space planning algorithm that can
address uncertainties near-optimally. Given a reference path, the planner controls the longitudinal
acceleration of the ego-agent through an action space that comprises three modes: {Accelerate,
Decelerate, Maintain}. To account for stochastic states and actions, we adopt the bicycle model, a
kinematic model with two degrees of freedom, and introduce Gaussian noise to the displacement.
DESPOT considers the system state s and the ego-agent's action a ∈ A to predict the future states of
other agents, excluding the context information C. This allows it to consider the interactions with
other agents. Moreover, DESPOT can efficiently execute with real-time prediction models, and its
objective function incorporates safety, efficiency, and comfort metrics, making it an ideal algorithm
for planning in complex and dynamic environments.
We use these two planners to control the speed of the ego-agent, while the pure-pursuit algorithm [6]
adjusts the steering angle. Details about the planners can be found in the supplementary materials.

Table 2: Trajectory prediction metrics.

Metric    Equation
ADE       $\frac{1}{T}\sum_{i=1}^{T}\sqrt{(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2}$
FDE       $\sqrt{(x_T-\hat{x}_T)^2+(y_T-\hat{y}_T)^2}$
minADE    $\min_{k\in K}\frac{1}{T}\sum_{i=1}^{T}\sqrt{(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2}$
minFDE    $\min_{k\in K}\sqrt{(x_T-\hat{x}_T)^2+(y_T-\hat{y}_T)^2}$

4.3 Simulator

In order to evaluate different prediction models, the ideal driving simulator should: 1) provide
real-world maps and agents; 2) model potential unregulated behaviors; 3) accurately mirror the
interactions between agents; and 4) provide realistic perception data for effective planning. We
choose the SUMMIT simulator for our experiments, as it meets all the criteria mentioned above.
SUMMIT is a sophisticated simulator based on the Carla framework, offering various real-world
maps and agents to create diverse and challenging scenarios. It uses a realistic motion model to
simulate interactions between agents and supports the simulation of crowded scenes and unregulated
behaviors. Therefore, it can deal with complex traffic conditions.
There are two distinct concepts of time in our experiments: simulation time and real time. The former
corresponds to the duration of actions in the simulation, while the latter represents the wall time
consumed by the planner. By default, the SUMMIT simulator runs in asynchronous mode. Under
this mode, the simulation environment and the planner run independently, and the simulation time is equal
to the real time. Even if the planner operates slowly, the simulator will not wait for the planner to
communicate with it. In this study, we employ the SUMMIT simulator in synchronous mode to
accommodate slow predictors. Under this mode, the simulator waits for the planner to communicate
with it before executing the next step. As the simulation time for each execution step is fixed at
0.03 s, the ratio between simulation time and real time can be set manually by varying the frequency at
which the planner communicates with the simulator. We refer to this communication frequency as the tick
rate. For example, when we set the tick rate to 30 Hz, the simulation time equals the real time; but
when we set the tick rate to 3 Hz, the ratio between simulation time and real time becomes 0.1.
This allows us to implement different experiment settings for the evaluation of prediction models.
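SUMMIT exposes the Carla Python client API, so enabling synchronous mode could look like the sketch below; the host, port, step length, and loop structure are assumptions for illustration rather than the exact configuration used in this study.

```python
import carla  # SUMMIT is built on the Carla framework and ships a compatible Python client

client = carla.Client("localhost", 2000)   # host/port are placeholder values
world = client.get_world()

settings = world.get_settings()
settings.synchronous_mode = True           # simulator waits for the planner's tick
settings.fixed_delta_seconds = 0.03        # 0.03 s of simulation time per step
world.apply_settings(settings)

# With a 0.03 s step, ticking at ~30 Hz keeps simulation time roughly equal to real
# time, while ticking at 3 Hz gives the planner about 10x more real time per simulated step.
for _ in range(1000):
    # ... run prediction and planning here, then apply the chosen control ...
    world.tick()                           # advance the simulation by one fixed step
```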

4.4 Evaluation Protocols

Motion Prediction Performance Metrics. Four commonly used prediction performance metrics
are employed in this study, as presented in Table 2. While ADE/FDE evaluate single-trajectory
prediction models, their probabilistic variants minADE and minFDE evaluate multi-trajectory
predictors.
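The metrics in Table 2 reduce to a few lines of NumPy; the sketch below assumes predictions and ground truth are given as arrays of (x, y) points.

```python
import numpy as np


def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average Displacement Error; pred and gt have shape (T, 2)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final Displacement Error: distance at the last predicted timestep."""
    return float(np.linalg.norm(pred[-1] - gt[-1]))


def min_ade(preds: np.ndarray, gt: np.ndarray) -> float:
    """minADE over K candidate trajectories; preds has shape (K, T, 2)."""
    return float(min(ade(p, gt) for p in preds))


def min_fde(preds: np.ndarray, gt: np.ndarray) -> float:
    """minFDE over K candidate trajectories."""
    return float(min(fde(p, gt) for p in preds))
```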
Driving Performance Metrics. Three factors are primarily considered for driving performance:
safety, comfort, and efficiency. Assume the total number of timesteps in each scenario is $H$.
Safety is typically evaluated in terms of the collision rate. A collision is counted whenever the
smallest distance between the ego-agent and the surrounding agents falls below a threshold $\epsilon$, that is:
$P_{\text{Safety}} = \frac{1}{H}\sum_{t=1}^{H} \mathbb{I}\left[\min\{\|s_A^t - s\| : s \in \{s_1^t, \ldots, s_n^t\}\} < \epsilon\right]$,
where $\|\cdot\|$ is the L2 distance between the ego-agent's bounding box and the exo-agent's bounding box,
and $\mathbb{I}$ is the boolean indicator function.
We set $\epsilon = 1$ m for our experiments since the DESPOT planner rarely causes real collisions.

[Figure 2 panels: ADE (K=1), FDE (K=1), minADE (K=6), and minFDE (K=6) for each predictor,
plotted as Argoverse values (x-axis) against SUMMIT Alignment values (y-axis) with linear fits.]

Figure 2: The prediction performance of all selected prediction methods is aligned between the
Argoverse and Alignment datasets. All data points fall within the 95% confidence interval and conform
well to linear regression.
The efficiency is evaluated by the average speed of the ego-agent over the whole scenario:
$P_{\text{Efficiency}} = \frac{1}{H}\sum_{t=1}^{H} v_A^t$. The comfort is represented by the jerk of the ego-agent, which is
the rate of change of acceleration with respect to time: $P_{\text{Comfort}} = \frac{1}{H}\sum_{t=1}^{H} \ddot{v}_A^t$.
Since these three metrics are not on the same scale, we normalize each metric to [0, 1] before
calculating the driving performance. Additionally, we align the direction of these three metrics so that
higher values represent better performance. The driving performance is obtained by averaging
the normalized safety, efficiency, and comfort. Details can be found in the supplementary materials.
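A per-scenario version of these driving performance terms could be computed as below; the array layout is an assumption, and the cross-scenario normalization to [0, 1] described above is left to a separate step.

```python
import numpy as np


def scenario_driving_metrics(dist_to_nearest: np.ndarray, speed: np.ndarray,
                             accel: np.ndarray, dt: float, eps: float = 1.0) -> dict:
    """Raw safety, efficiency, and comfort terms for one scenario of H timesteps.

    dist_to_nearest[t]: smallest distance between the ego-agent and any exo-agent.
    speed[t], accel[t]: ego speed and acceleration; dt: timestep length.
    """
    collision_rate = float((dist_to_nearest < eps).mean())   # safety term before normalization
    avg_speed = float(speed.mean())                          # efficiency term
    jerk = np.gradient(accel, dt)                            # rate of change of acceleration
    mean_abs_jerk = float(np.abs(jerk).mean())               # comfort term (absolute jerk here)
    return {"collision_rate": collision_rate, "avg_speed": avg_speed, "jerk": mean_abs_jerk}
```

After all scenarios are processed, each metric is rescaled to [0, 1], oriented so that higher is better, and the three normalized values are averaged into the driving performance score.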
Experimental Design. We conduct two types of experiments for both planners in the SUMMIT
simulator: Fixed Number of Predictions and Fixed Planning Time.
1. Fixed Number of Predictions: The planner is required to perform a fixed number of predic-
tions within an interactive simulation environment, regardless of the predictor’s execution
speed. Our aim is to investigate the impact of the dynamics gap on driving performance.
2. Fixed Planning Time: The planner is allocated a fixed time budget. Upon reaching the time
limit, the planner stops and outputs the ego-agent’s action. The tick rate is set to 30 Hz, 3
Hz, and 1 Hz to conduct three sub-experiments with different time budgets. Our purpose
is to provide an in-depth analysis of the trade-off between the predictor’s computational
efficiency and prediction accuracy.
We collect 50 scenarios for each predictor in each experiment. For each scenario, we randomly select
the start and end point for the ego-agent on one of the four real-world maps provided by the SUMMIT
simulator. A reference path of 50 meters is maintained between the two points, and the ego-agent is
instructed to follow this path. A certain number of exo-agents including pedestrians, cyclists, and
vehicles are randomly placed in the environment. We implement all selected predictors in Section 4.1,
except for HOME and DSP, whose significantly longer running times make closed-loop evaluation
infeasible. These two methods are only used to demonstrate the alignment between the
SUMMIT simulator and the real world. Notably, the RVO planner outputs the same results for the
above two experimental designs, as it performs prediction only once per timestep.

5 Experiment
The experiments are designed and organized to answer the following questions: 1) Do our experiments
on the SUMMIT simulator provide sufficient evidence to support our claims? 2) Can the
current prediction evaluation system accurately reflect driving performance? 3) How do we
suggest evaluating predictors in terms of driving performance?

5.1 Sim-Real Alignment

To demonstrate the alignment between the SUMMIT simulator and the real world, we train and
evaluate all selected motion prediction models on both the Argoverse dataset [5] and the Alignment
dataset collected from the SUMMIT simulator. We collect 59,944 scenarios and separate them
into two groups: 80% training and 20% validation. Each scenario consists of about 300 steps.
Subsequently, each scenario is filtered down to 50 steps by taking into account the number of agents
and their occurrence frequency. The three nearest agents are randomly selected as the agents of
interest for prediction.

[Figure 3 panels (a)-(c): Static ADE (x-axis) vs Driving Performance (y-axis) for each predictor,
with linear fits; reported values include R² = 0.00 (p-value 0.932) and R² = 0.42 (p-value 0.080);
colors in the fixed-planning-time panel denote tick rates of 30 Hz, 3 Hz, and 1 Hz.]


Figure 3: Relationship between Static ADE and Driving Performance. (a) Fixed number of predictions/
Fixed planning time for the RVO planner: the planner calls the predictor once per action,
so these two experiments result in the same output. (b) Fixed number of predictions for the
DESPOT planner. (c) Fixed planning time for the DESPOT planner. Overall, we found no strong
correlation between driving performance and Static ADE.

Figure 2 illustrates the comparison of prediction performance between the Argoverse and Alignment
datasets. The R-squared values of the four subplots are 0.798, 0.777, 0.855, and 0.844, respectively.
These values indicate that the majority of variation can be explained by the linear relationship between
the prediction performance in these two datasets. Furthermore, the p-values are all less than 0.01,
providing strong support for the statistical significance of the alignment. The consistent results
suggest that the Argoverse and Alignment datasets share similar underlying features. Therefore, the
SUMMIT simulator can be employed to evaluate real-world performance.
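The reported R-squared and p-values come from ordinary linear regression between the per-predictor scores on the two datasets; a sketch with scipy.stats.linregress and illustrative (not actual) numbers is shown below.

```python
from scipy import stats

# One value per predictor on each dataset; these numbers are illustrative only.
ade_argoverse = [1.7, 1.9, 2.1, 2.3, 2.5, 2.7, 2.9, 3.1, 3.3, 3.5]
ade_alignment = [0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.3, 1.4, 1.5, 1.6]

fit = stats.linregress(ade_argoverse, ade_alignment)
print(f"R^2 = {fit.rvalue ** 2:.3f}, p-value = {fit.pvalue:.4f}")
```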
In Section 5.2, we demonstrate the inadequacy of current static evaluation systems in reflecting driving
performance, which can be attributed to their neglect of the dynamics gap and the predictor's
computational efficiency. In Section 5.3, we suggest assessing predictors with task-driven interactive
evaluation metrics and provide the corresponding analysis.

5.2 Limitations of Current Prediction Evaluation


This section aims to illustrate the limitations of current prediction evaluation systems in accurately
reflecting realistic driving performance. We take ADE as an example, with FDE results provided in
the supplementary materials. We denote the ADE calculated from the Alignment dataset as Static
ADE, while the ADE derived from the simulation is referred to as Dynamic ADE. For the Dynamic
ADE, we log the simulation data after conducting the interactive simulation experiments and then
compute its ADE. We ensure that each scenario has at least 20 timesteps for calculation. The Dynamic
ADE is obtained by computing the average of the ADE values across all timesteps in a given scenario.
Results. Our primary focus is to discern whether Static ADE serves as a reliable indicator of driving
performance. Figure 3 presents the results of experiments conducted with fixed planning time and
fixed number of predictions for both RVO and DESPOT planners. The results suggest that, in both
experiments, there is no significant correlation between Static ADE and driving performance. Specifi-
cally, in the DESPOT experiment, we observed a counterintuitive positive relationship. According to
this finding, higher Static ADE would imply better driving performance. However, this hypothesis
lacks a realistic basis and should be rejected since it exceeds the 95% confidence interval.
The disparity between static evaluation and driving performance can be primarily attributed to the
neglect of two factors: the dynamics gap and the computational efficiency of predictors. The following
paragraphs explicate these factors and how they influence driving performance.
Dynamics Gap. The future trajectories of agents are predetermined in the datasets. However, in
AD systems, the motion of the ego-agent is determined by a planner that takes into account other
agents’ predictions, which consequently impacts the future movements of other agents. Different
predictors will lead to various environment dynamics that differ from those represented in the datasets,
resulting in a disparity between Static and Dynamic ADE and, ultimately, in their correlation with
driving performance. To support our claim, the relationship between Dynamic ADE and driving performance
is depicted in Figure 4. Compared to Static ADE, Dynamic ADE displays a significantly stronger
correlation with driving performance for both planners in all experiments, by accounting for the
dynamics gap. We can conclude that the dynamics gap is one of the main factors that causes the
disparity between static evaluation and realistic driving performance. The Dynamic ADE, which is
evaluated through interactive simulation environments, is capable of incorporating the dynamics gap
and displaying a significant correlation with driving performance.

[Figure 4 panels (a)-(c): Dynamic ADE (x-axis) vs Driving Performance (y-axis) for each predictor,
with linear fits; reported values: R² = 0.59 (p-value 0.026), R² = 0.26 (p-value 0.202), R² = 0.68
(p-value 0.000); colors in the fixed-planning-time panel denote tick rates of 30 Hz, 3 Hz, and 1 Hz.]
Figure 4: Relationship between Dynamic ADE and Driving Performance. (a) Fixed number of
predictions/Fixed planning time for the RVO planner. (b) Fixed number of predictions for the
DESPOT planner. (c) Fixed planning time for the DESPOT planner. A much stronger correlation
between Dynamic ADE and driving performance is shown for both RVO and DESPOT planners,
which can be attributed to the inclusion of dynamics gap in (a), (b), as well as computational efficiency
in (c). The correlation is weaker when the planning time budget is tight.

Table 3: Correlation between computation efficiency and driving performance.

                                     Driving Performance (↑)
Method      Inference Time (s)    30 Hz      3 Hz       1 Hz
CV          0.001                 0.753      0.771      0.761
CA          0.001                 0.739      0.774      0.779
LSTM        0.010                 0.644      0.753      0.766
S-LSTM      0.014                 0.649      0.715      0.733
HiVT        0.024                 0.616      0.730      0.753
LaneGCN     0.024                 0.637      0.711      0.742
KNN         0.224                 0.526      0.648      0.692
S-KNN       0.248                 0.530      0.633      0.677
and displaying a significant correlation with driving performance.
Computational Efficiency of Predictors. All planners require a certain number of predictions
to output an action. The RVO planner requires only one prediction per step, whereas the
DESPOT planner requires hundreds of predictions for initialization. Slow prediction methods require
the planner to take a longer time to plan. When the planning time budget is tight, the planner would
be incapable of providing proper actions to achieve good driving performance. On the other hand,
even if the prediction method is fast enough to support the planner to plan properly, the computational
efficiency of predictors still exerts a significant influence on the driving performance, as shown in
Figure 3c. When the DESPOT planner is granted additional time, it can explore more nodes and
conduct deeper searches, leading to a substantial improvement in driving performance. We can
conclude that the computational efficiency of predictors is also one of the main factors that causes the
disparity between static evaluation and real-world driving performance.
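The inference times reported in Table 3 correspond to the wall-clock cost of a single prediction call; a sketch of how such timing could be measured is given below, where `predictor` is any callable and the warm-up count is arbitrary.

```python
import time
import numpy as np


def mean_inference_time(predictor, inputs: list, n_warmup: int = 5) -> float:
    """Average wall-clock time per prediction call over a set of inputs."""
    for x in inputs[:n_warmup]:              # warm-up calls (e.g., lazy GPU initialization)
        predictor(x)
    elapsed = []
    for x in inputs:
        start = time.perf_counter()
        predictor(x)
        elapsed.append(time.perf_counter() - start)
    return float(np.mean(elapsed))
```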
It should be noted that there is a trade-off between computational efficiency and dynamic prediction
accuracy in determining driving performance. As shown in Figure 4c, the correlation between
Dynamic ADE and driving performance becomes weaker when the tick rate is set higher. This
is indicated by the data points deviating further from the best-fit line at higher tick rates. In this regime,
it is the computational efficiency rather than the dynamic prediction accuracy that decides the
driving performance, as shown in Table 3, where all prediction methods are arranged by their
computational efficiency. When the tick rate is set to 30 Hz, the planner cannot generate an optimal
solution, so the ranking of driving performance is determined by computational efficiency.
When the tick rate is set to 3 Hz, CA outperforms CV, since both have obtained near-optimal solutions.
When the tick rate is set to 1 Hz, LSTM also outperforms CV. The driving performance is thus
determined by both the dynamic prediction accuracy and the computational efficiency of predictors in a
trade-off manner.

Table 4: Correlation between different ADE and driving performance.

Metric   Planner   Multi-modal   Dynamic   Full Observation   Closest   Correlation Coefficient
ADE      RVO                                                            -0.00
ADE      RVO       ✓                                                    -0.70
ADE      RVO                     ✓                                      -0.61
ADE      RVO                     ✓         ✓                            -0.39
ADE      RVO                     ✓         ✓                  ✓         -0.77
ADE      DESPOT                                                         +0.65
ADE      DESPOT    ✓                                                    +0.20
ADE      DESPOT                  ✓                                      -0.31
ADE      DESPOT                  ✓         ✓                            -0.47
ADE      DESPOT                  ✓         ✓                  ✓         -0.51

In summary, current prediction evaluation systems cannot fully reflect the overall driving performance
due to their neglect of the dynamics gap and the computational efficiency of predictors. There is also a
trade-off between computational efficiency and dynamic prediction accuracy in determining driving
performance. When predictors are not fast enough, those with higher computational efficiency
yield higher driving performance. But when there is enough time for the planner to
execute the prediction methods, it is the Dynamic ADE/FDE that decides the driving performance.

5.3 Suggestions for Future Prediction Evaluation


In this section, we first analyze factors that may affect ADE along with their correlation coefficients
with driving performance. Besides the metrics mentioned above, we also explore the impact of
two commonly considered factors: incomplete observation and relative distance to the ego-agent.
The purpose is to establish a metric that reflects driving performance most effectively when the
time budget is relaxed. Static ADE is presented as the baseline in this study. The factors analyzed
include: "Multi-modal," which refers to the minADE, "Dynamic," indicating ADE evaluated in
closed-loop simulation, "Full Observation," which only considers agents with complete observations,
and "Closest," analyzing only the nearest agents.
Results. The comparison is shown in Table 4. The correlation between driving performance and Static
minADE is stronger than with Static ADE. However, their correlations become too weak to accurately
infer driving performance when a complex planner, such as DESPOT, is used in conjunction. For
Dynamic ADE, the correlation coefficient remains superior to that of Static ADE on both planners.
For other factors, depending solely on agents with full observations might be impractical, since agents
in close proximity can significantly impact the ego vehicle. The most reliable correlation is obtained
when only the closest agents are considered in the calculation of Dynamic ADE.
In summary, while evaluating prediction models in closed-loop testing is not always possible, it is
recommended to use minADE and minFDE for evaluation rather than ADE and FDE. However, once
closed-loop testing is feasible, the optimal choice is to evaluate predictors with a Dynamic ADE
that considers only the closest agents. It is also essential to use computational efficiency rather than
Dynamic ADE to evaluate prediction models when the planning time budget is tight. This makes it
possible to consider both the impact of the dynamics gap and the computational efficiency of predictors.

6 Conclusion
In conclusion, this study highlights the limitations of current prediction evaluation systems for
assessing motion prediction models in consideration of driving performance. We have identified the
dynamics gap and computational efficiency of predictors as critical factors that contribute to this
disparity. Moreover, we recommend further research incorporating task-driven interactive evaluation
metrics in assessing motion predictors, specifically, Dynamic ADE/FDE and computational efficiency.
This enables the development of reliable and comprehensive motion prediction models. Despite these
insights, there is still work to be done in the future regarding planner selection and simulation scope.
Our study focused on sampling-based and reactive planners. Incorporating other planners such as
optimization-based or geometric planners could further substantiate our conclusions. Additionally,
we use oracle perception from the simulator, which cannot fully represent the complexities of
real-world situations. It would be advantageous to use raw sensor data in future work to understand
the comprehensive interplay within the entire AD system.

References
[1] Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Li Fei-Fei, and
Silvio Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 961–971, 2016.
[2] Ivo Batkovic, Mario Zanon, Nils Lubbe, and Paolo Falcone. A computationally efficient model
for pedestrian motion prediction. In 2018 European control conference (ECC), pages 374–379.
IEEE, 2018.
[3] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu,
Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal
dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, pages 11621–11631, 2020.
[4] Vinton G Cerf. A comprehensive self-driving car test. Communications of the ACM, 61(2),
2018.
[5] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew
Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and
forecasting with rich maps. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 8748–8757, 2019.
[6] R Craig Coulter. Implementation of the pure pursuit path tracking algorithm. Technical report,
Carnegie Mellon University, Robotics Institute, 1992.
[7] Alec Farid, Sushant Veer, Boris Ivanovic, Karen Leung, and Marco Pavone. Task-relevant failure
detection for trajectory predictors in autonomous vehicles. 2022.
[8] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde.
Home: Heatmap output for future motion estimation. In 2021 IEEE International Intelligent
Transportation Systems Conference (ITSC), pages 500–507. IEEE, 2021.
[9] Thomas Gilles, Stefano Sabatini, Dzmitry Tsishkou, Bogdan Stanciulescu, and Fabien Moutarde.
Gohome: Graph-oriented heatmap output for future motion estimation. In 2022 International
Conference on Robotics and Automation (ICRA), pages 9107–9114. IEEE, 2022.
[10] Mahir Gulzar, Yar Muhammad, and Naveed Muhammad. A survey on motion prediction of
pedestrians and vehicles for autonomous driving. IEEE Access, 9:137957–137969, 2021.
[11] Agrim Gupta, Justin Johnson, Li Fei-Fei, Silvio Savarese, and Alexandre Alahi. Social gan:
Socially acceptable trajectories with generative adversarial networks. In Proceedings of the
IEEE conference on computer vision and pattern recognition, pages 2255–2264, 2018.
[12] Xin Huang, Guy Rosman, Ashkan Jasour, Stephen G McGill, John J Leonard, and Brian C
Williams. Tip: Task-informed motion prediction for intelligent vehicles. In 2022 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 11432–11439. IEEE,
2022.
[13] Boris Ivanovic and Marco Pavone. Injecting planning-awareness into prediction and detection
evaluation. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 821–828. IEEE, 2022.
[14] Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, and Raquel Urtasun.
Learning lane graph representations for motion forecasting. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages
541–556. Springer, 2020.
[15] Rowan McAllister, Blake Wulfe, Jean Mercat, Logan Ellis, Sergey Levine, and Adrien Gaidon.
Control-aware prediction objectives for autonomous driving. In 2022 International Conference
on Robotics and Automation (ICRA), pages 01–08. IEEE, 2022.
[16] Sajjad Mozaffari, Omar Y Al-Jarrah, Mehrdad Dianati, Paul Jennings, and Alexandros Mouza-
kitis. Deep learning-based vehicle behavior prediction for autonomous driving applications: A
review. IEEE Transactions on Intelligent Transportation Systems, 23(1):33–47, 2020.
[17] Andreas Møgelmose, Mohan M. Trivedi, and Thomas B. Moeslund. Trajectory analysis and
prediction for improved pedestrian safety: Integrated framework and evaluations. In 2015 IEEE
Intelligent Vehicles Symposium (IV), pages 330–335, June 2015.

[18] Jonah Philion, Amlan Kar, and Sanja Fidler. Learning to evaluate perception models using
planner-centric metrics. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 14055–14064, 2020.
[19] Andrey Rudenko, Luigi Palmieri, Michael Herman, Kris M Kitani, Dariu M Gavrila, and Kai O
Arras. Human motion trajectory prediction: A survey. The International Journal of Robotics
Research, 39(8):895–935, 2020.
[20] Christoph Schöller, Vincent Aravantinos, Florian Lay, and Alois Knoll. What the constant
velocity model can teach us about pedestrian motion prediction. IEEE Robotics and Automation
Letters, 5(2):1696–1703, 2020.
[21] Marcel Schreiber, Vasileios Belagiannis, Claudius Gläser, and Klaus Dietmayer. Dynamic
Occupancy Grid Mapping with Recurrent Neural Networks. In 2021 IEEE International
Conference on Robotics and Automation (ICRA), pages 6717–6724, May 2021.
[22] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. Despot: Online pomdp planning with
regularization. Advances in neural information processing systems, 26, 2013.
[23] Jur Van den Berg, Ming Lin, and Dinesh Manocha. Reciprocal velocity obstacles for real-time
multi-agent navigation. In 2008 IEEE international conference on robotics and automation,
pages 1928–1935. IEEE, 2008.
[24] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information
processing systems, 30, 2017.
[25] Nick Webb, Dan Smith, Christopher Ludwick, Trent Victor, Qi Hommes, Francesca Favaro,
George Ivanov, and Tom Daniel. Waymo’s safety methodologies and safety readiness determi-
nations. arXiv preprint arXiv:2011.00054, 2020.
[26] Lu Zhang, Peiliang Li, Jing Chen, and Shaojie Shen. Trajectory prediction with graph-based
dual-scale context fusion. In 2022 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), pages 11374–11381. IEEE, 2022.
[27] Zikang Zhou, Luyao Ye, Jianping Wang, Kui Wu, and Kejie Lu. HiVT: Hierarchical Vector
Transformer for Multi-Agent Motion Prediction. In 2022 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 8813–8823, June 2022.
