ShaSTA: Modeling Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking

Tara Sadjadpour, Stanford University, Stanford, CA 94305 (tsadja@stanford.edu)
Jie Li, NVIDIA, Santa Clara, CA 95051 (jieli@nvidia.com)
Rares Ambrus, Toyota Research Institute, Los Altos, CA 94022 (rares.ambrus@tri.global)
Jeannette Bohg, Stanford University, Stanford, CA 94305 (bohg@stanford.edu)

arXiv:2211.03919v2 [cs.CV] 7 Feb 2023

Fig. 1: Examples for tracking scenarios (top) and associated ground-truth affinity matrices (bottom) that our algorithm, ShaSTA, must estimate. In all affinity matrices, the rows
correlate with previous frame tracks that are represented with single-history bounding box detections, while the columns correlate with current frame detections. The affinity matrices
are augmented with two rows for newborn track (NB) and false-positive (FP) anchors that the current frame detections can match with, and two columns for dead track (DT) and
false-negative (FN) anchors that the previous frame tracks can match with. These four augmentations are learned representations that capture the essence of these four detection
and track types for each frame. The bounding box color corresponds to the track ID. At time step t = 0, both figures (a) and (b) start with affinity matrix A0 , where the yellow
car is detected and is matched with NB, thus initializing Track 1. For Figure (a), at t = 1, A1 shows that the yellow car is detected again and matched with the previous frame’s
Track 1, while the red car is detected as a newborn with Track ID 2. Then, at t = 2, there is only one detection for the red car, so it is matched with the previous frame’s Track
2, while Track 1 is matched with DT in A2 . In Figure (b), at time step t = 1, the red car is detected as a newborn with Track ID 2, while the yellow car is not detected at all so
Track 1 from t = 0 is matched with FN. Finally, at t = 2, the red car is detected again and matched with Track 2, while a detection is found in a region with no object so it is
matched with FP in A2 .

Abstract—Multi-object tracking (MOT) is a cornerstone capability of any robotic system. The quality of tracking is largely dependent on the quality of the detector used. In many applications, such as autonomous vehicles, it is preferable to over-detect objects to avoid catastrophic outcomes due to missed detections. As a result, current state-of-the-art 3D detectors produce high rates of false-positives to ensure a low number of false-negatives. This can negatively affect tracking by making data association and track lifecycle management more challenging. Additionally, occasional false-negative detections due to difficult scenarios like occlusions can harm tracking performance. To address these issues in a unified framework, we propose to learn shape and spatio-temporal affinities between tracks and detections in consecutive frames. Our affinity provides a probabilistic matching that leads to robust data association, track lifecycle management, false-positive elimination, false-negative propagation, and sequential track confidence refinement. Though past 3D MOT approaches address a subset of components in this problem domain, we offer the first self-contained framework that addresses all these aspects of the 3D MOT problem. We quantitatively evaluate our method on the nuScenes tracking benchmark where we achieve 1st place amongst LiDAR-only trackers using CenterPoint detections. Our method estimates accurate and precise tracks, while decreasing the overall number of false-positive and false-negative tracks and increasing the number of true-positive tracks. Unlike past works, we analyze our performance with 5 metrics, including AMOTA, the most common tracking accuracy metric. Thereby, we give a comprehensive overview of our approach to indicate how our tracking framework may impact the ultimate goal of an autonomous mobile agent. We also present ablative experiments, as well as qualitative results that demonstrate our framework's capabilities in complex scenarios. Please see our project website for source code and demo videos: sites.google.com/view/shasta-3d-mot/home.

I. INTRODUCTION

In this work, we address the problem of 3D multi-object tracking using LiDAR point cloud data as input. 3D multi-object tracking is an essential capability for autonomous agents to navigate effectively in their environments. For example, autonomous cars need to understand the motion of surrounding traffic agents to drive safely. Online tracking is achieved by (1) accurately matching uncertain detections to existing tracks and (2) determining when to birth and kill tracks of unmatched observations. These two processes are referred to as data association and track lifecycle management, respectively.

Most recent approaches [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] toward 3D multi-object tracking in autonomous driving apply the tracking-by-detection paradigm, whereby off-the-shelf detections are available from a state-of-the-art 3D detector. Though this framework has been successful, there exists a practical gap between the objectives of 3D detection and tracking. Since missing a detection can lead to catastrophic events in self-driving, 3D detectors are evaluated on the full recall spectrum of detections (e.g., mAP [12]) and are deployed with the goal of having high recall. As a result, they tend to generate high rates of false-positives to ensure a low number of false-negatives [7, 13]. These redundant detections lead to highly cluttered outputs that make data association and track lifecycle management more challenging. Additionally, occasional false-negative detections due to factors such as heavy occlusion can cause significant fragmentations in the tracks or lead to the premature termination of tracks.

In this work, we address the task of LiDAR-only tracking. In contrast to camera-only and camera-fusion tracking [14, 11, 3, 9, 10], LiDAR tracking offers many practical advantages, balancing high accuracy with a less complex system configuration and lower computational demands. Still, LiDAR-only systems face significant challenges due to the sparse, unstructured nature of LiDAR point clouds. Recent works in LiDAR-only tracking-by-detection [4, 5, 6, 8] have attempted to develop learning-based and statistical approaches to better handle data association, false-positive elimination, false-negative propagation, and track lifecycle management. However, these approaches do not effectively leverage spatio-temporal and shape information from the LiDAR sensor input, and in some cases, only rely on detected bounding box parameters for spatio-temporal information.

Our main contribution is a 3D multi-object tracking framework called ShaSTA that models shape and spatio-temporal affinities between tracks and detections in consecutive frames. By better understanding objects' shapes and spatio-temporal contexts, ShaSTA improves data association, false-positive (FP) elimination, false-negative (FN) propagation, newborn (NB) initialization, dead track (DT) termination, and track confidence refinement. At its core, ShaSTA learns an affinity matrix that encodes shape and spatio-temporal information to obtain a probabilistic matching between detections and tracks. ShaSTA also learns representations of FP, NB, FN, and DT anchors for each frame. By computing the affinity of detections and tracks to these learned representations, we can classify detections as newborn tracks or false-positives, while tracks can be classified as dead tracks or false-negatives. Once we use these matches to form tracks, ShaSTA continues to leverage the affinity estimation with a novel sequential track confidence refinement technique that represents track score refinement as a time-dependent problem and leverages information extracted from shape and spatio-temporal encodings to obtain significant improvements in overall tracking accuracy.

This overall approach not only aids in tracking performance but also addresses concerns for real-world deployment and downstream decision-making tasks by eliminating false-positive tracks and minimizing false-negative tracks. ShaSTA offers an efficient and practical implementation that can re-use the LiDAR backbone of most off-the-shelf LiDAR-based 3D detectors. Without loss of generality, we choose the LiDAR backbone from CenterPoint [7] for a fair comparison against most previous works. We demonstrate that our approach achieves state-of-the-art results on the nuScenes tracking benchmark [15], where we achieve significant gains in overall tracking accuracy and precision, while improving the raw number of true-positive, false-positive, and false-negative tracks. We also include ablation studies to analyze the contribution of each component.
tracking. Since missing a detection can lead to catastrophic
events in self-driving, 3D detectors are evaluated on the tracking benchmark [15] where we achieve significant gains
full recall spectrum of detections (e.g., mAP [12]) and are in overall tracking accuracy and precision, while improving
deployed with the goal of having high recall. As a result, they the raw number of true-positive, false-positive, and false-
tend to generate high rates of false-positives to ensure a low negative tracks. We also include ablation studies to analyze
number of false-negatives [7, 13]. These redundant detections the contribution of each component.
lead to highly cluttered outputs that make data association and
track lifecycle management more challenging. Additionally, II. R ELATED W ORK
occasional false-negative detections due to factors such as A. LiDAR-based 3D Detection
heavy occlusion can cause significant fragmentations in the Most recent 3D multi-object tracking (3D MOT) works
tracks or lead to the premature termination of tracks. follow the tracking-by-detection paradigm [1, 16, 3, 8, 10, 5],
In this work, we address the task of LiDAR-only track- making the quality of 3D detectors a key component to track-
ing. In contrast to camera-only and camera-fusion track- ing performance [3, 17, 18]. LiDAR-based 3D detectors extend
ing [14, 11, 3, 9, 10], LiDAR tracking offers many practical the image-based 2D detection algorithms [19, 20] to 3D. The
advantages, balancing high accuracy with a less complex typical detector starts with an encoder [21, 22, 23] applied
system configuration and lower computational demands. Still, to unstructured point cloud data to extract an intermediate 3D
LiDAR-only systems face significant challenges due to the feature grid [13] or bird’s-eye-view (BEV) feature map [24, 7].
sparse, unstructured nature of LiDAR point clouds. Recent Then, a decoder is applied to extract objectness [25] and
works in LiDAR-only tracking-by-detection [4, 5, 6, 8] have detailed object attributes [26, 7]. Our approach proposes to
attempted to develop learning-based and statistical approaches leverage the intermediate BEV feature map for robust multi-
to better handle data association, false-positive elimination, object tracking, which can be generalized to most modern 3D
false-negatives propagation, and track lifecycle management. detectors. CenterPoint detections have become a standardized
However, these approaches do not effectively leverage spatio- 3D detection in the 3D MOT community, since the authors of
temporal and shape information from the LiDAR sensor input, the work released the official detection files for all data splits.
and in some cases, only rely on detected bounding box As of this writing, CenterPoint has the highest mAP of all
parameters for spatio-temporal information. publicly-released 3D detections. Thus, in this work, we apply
Our main contribution is a 3D multi-object tracking frame- CenterPoint [7] for fair comparison against existing 3D MOT
work called ShaSTA that models shape and spatio-temporal techniques.
affinities between tracks and detections in consecutive frames.
By better understanding objects’ shapes and spatio-temporal B. 3D Multi-Object Tracking
contexts, ShaSTA improves data association, false-positive With the advances in 3D detection, the 3D MOT commu-
(FP) elimination, false-negative (FN) propagation, newborn nity has been focusing on four directions to establish robust
(NB) initialization, dead track (DT) termination, and track tracking within the tracking-by-detection paradigm: motion
confidence refinement. At its core, ShaSTA learns an affinity prediction, data association, track lifecycle management, and
matrix that encodes shape and spatio-temporal information to confidence refinement.
obtain a probabilistic matching between detections and tracks. In 3D MOT, abstracted detection predictions are often used
ShaSTA also learns representations of FP, NB, FN, and DT for motion prediction and data association, which is completed
anchors for each frame. By computing the affinity between using pair-wise feature distance between tracks and detec-
detections and tracks to these learned representations, we can tions [1, 2, 7]. AB3DMOT [1] provides a baseline to combine
classify detections as newborn tracks or false-positives, while Kalman Filters and Intersection Over Union (IoU) association
tracks can be classified as dead tracks or false-negatives. Once for 3D MOT. Subsequent works in this line explore variations
we use these matches to form tracks, ShaSTA continues to of data association metrics, such as the Mahalanobis distance
leverage the affinity estimation with a novel sequential track to leverage uncertainty measurements in the Kalman Filter [2].
confidence refinement technique that represents track score As an alternative, CenterPoint [7] uses high-fidelity instance
refinement as a time-dependent problem and leverages infor- velocity predictions as a motion model to complete data
mation extracted from shape and spatio-temporal encodings to association. Unlike these works that only use low-dimensional
obtain significant improvements in overall tracking accuracy. bounding boxes and explicit motion models for tracking,
This overall approach not only aids in tracking perfor- we propose to leverage intermediate representations from a
mance but also addresses concerns for real-world deployment detection network for richer information about each object,
and downstream decision-making tasks by eliminating false- while also modeling the global relationship of all the objects
positive tracks and minimizing false-negative tracks. ShaSTA in a scene to improve data association.
offers an efficient and practical implementation that can re- Additionally, track lifecycle management has been reason-
use the LiDAR backbone of most off-the-shelf LiDAR-based ably successful with heuristics that are tuned to each dataset [1,
3D detectors. Without loss of generality, we choose the 2, 7], but these hand-tuned techniques have shortcomings
LiDAR backbone from CenterPoint [7] for a fair compari- in more complex situations. Recently, OGR3MOT [8] has
son against most previous works. We demonstrate that our proposed combining tracking and predictive modeling through
approach achieves state-of-the-art results on the nuScenes a unified graph representation that learns data associations
III. SHASTA

Our algorithm operates under the tracking-by-detection paradigm and is visualized in Fig. 2. ShaSTA extracts shape and spatio-temporal relationships between consecutive frames to learn affinity matrices that capture associations between detections and tracks. ShaSTA learns bounding box and shape representations for FP, NB, FN, and DT anchors, where each anchor captures the common features of the set of detections and tracks that falls under each of these four categories in a given frame. Thus, ShaSTA's affinity matrix not only estimates the probabilities of previous frame tracks matching with current frame detections, but also considers previous frame tracks matching with DT or FN anchors and current frame detections matching with NB or FP anchors. Unlike past techniques that only use low-dimensional bounding box representations, ShaSTA leverages shape and spatio-temporal information from the LiDAR sensor input, resulting in a robust data association technique that effectively accounts for FP elimination, FN propagation, NB initialization, and DT termination.

A. Affinity Matrix Usage for Online Tracking

Allowing for up to $N_{max}$ detections per frame, ShaSTA estimates the affinity matrix $A \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$. As shown in Figure 1, the rows of our affinity matrix represent the previous frame tracks and the columns represent our current frame detections. The two augmented columns are for the DT and FN anchors, so previous frame tracks can match with DT or FN. Similarly, because current frame detections can match with NB or FP, the affinity matrix has two augmented rows for the NB and FP anchors. Though there can be no more than one match for each bounding box in the current and previous frames, the same is not true for our learned anchors, i.e., more than one current frame bounding box can match with the NB or FP anchors. To handle this situation, we create a forward matching affinity matrix $A_{fm} \in \mathbb{R}^{N_{max}\times(N_{max}+2)}$ by removing the two row augmentations from $A$ and applying a row-wise softmax so that more than one previous frame track can have sufficient probability to match with the DT and FN anchors, respectively:

$A_{fm} = \mathrm{softmax}_{row}(A_1).$  (1)

Using similar logic, we create a backward matching affinity matrix $A_{bm} \in \mathbb{R}^{(N_{max}+2)\times N_{max}}$ by removing the two column augmentations from $A$ and applying a column-wise softmax so that more than one current frame detection can have sufficient probability to match with the NB and FP anchors, respectively:

$A_{bm} = \mathrm{softmax}_{col}(A_2).$  (2)

Here, forward matching indicates that we want to find the best current frame match for each previous frame track, while backward matching refers to finding the best previous frame match for each current frame detection.
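To make Equations 1 and 2 concrete, the following is a minimal PyTorch-style sketch of the row- and column-wise normalization. The tensor names and the assumption that the last two rows and columns of A hold the anchor scores follow the notation above; this is an illustrative sketch, not the authors' released implementation.

    import torch

    def forward_backward_matching(A: torch.Tensor):
        """Split the augmented affinity logits A ((N+2) x (N+2)) into the
        forward and backward matching matrices of Eqs. 1 and 2. Assumes the
        last two rows are the NB/FP anchor rows and the last two columns are
        the DT/FN anchor columns (an illustrative convention)."""
        A_1 = A[:-2, :]   # drop anchor rows, keep anchor columns -> (N x (N+2))
        A_2 = A[:, :-2]   # drop anchor columns, keep anchor rows -> ((N+2) x N)
        # Row-wise softmax: each previous-frame track gets a distribution over
        # current-frame detections plus the DT and FN anchors (Eq. 1).
        A_fm = torch.softmax(A_1, dim=1)
        # Column-wise softmax: each current-frame detection gets a distribution
        # over previous-frame tracks plus the NB and FP anchors (Eq. 2).
        A_bm = torch.softmax(A_2, dim=0)
        return A_fm, A_bm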
Fig. 2: Algorithm Flowchart. Going from top to bottom and left to right, ShaSTA uses Nmax low-dimensional bounding boxes from the current and previous frames to learn
bounding box representations for the FP, NB, FN, and DT anchors. The FP and NB anchors are appended to previous frame bounding boxes to form the augmented bounding
boxes B̂t−1 , while the FN and DT anchors are added to the current frame detections to form the augmented bounding boxes B̂t . Using the pre-trained frozen LiDAR backbone
from our off-the-shelf detector, we extract shape descriptors that are then used to learn shape representations for our anchors. The learned shape representations for the anchors
are appended to create augmented shape descriptors. These augmented bounding boxes and shape descriptors are then used to find a residual that captures the spatio-temporal and
shape similarities between current and previous frame detections, as well as between detections and anchors. The residual is then used to predict the affinity matrix for probabilistic
data associations, track lifecycle management, FP elimination, FN propagation, and track confidence refinement.

To form tracks, we combine the affinity matrix outputs and the matching algorithm from [7] as follows. We take $A_{fm}$ and label any previous frame track that has a probability above the threshold $\tau_{dt}$ in the DT column as a DT, and we forward propagate any previous frame track into the current frame if it has a probability above the threshold $\tau_{fn}$ in the FN column. Therefore, in addition to the provided detections, forward propagation creates a new detection for the current frame to handle occluded objects that the off-the-shelf detections missed. We create a box in the current frame $b'_t := (x', y', z, w, l, h, r_y)$ by moving the 2D center coordinate of the existing track $b_{t-1} := (x, y, z, w, l, h, r_y)$ to the next frame based on the estimated velocity provided by our off-the-shelf 3D detector and the time difference $\Delta t$ between the two frames, i.e., $x' = x + v_x \Delta t$ and $y' = y + v_y \Delta t$. All previous frame tracks that do not qualify as FN or DT are kept as is. Analogously, we take $A_{bm}$ and remove any current frame detection that has a probability above the threshold $\tau_{fp}$ in the FP row, and we label any current frame detection that has a value above the threshold $\tau_{nb}$ in the NB row as an NB. All current frame detections that do not qualify as FP or NB are kept as is. See Section V-B for details on the threshold values we choose.

With these detections, we then run the greedy algorithm from [7]. In its original form, the greedy algorithm takes all unmatched detections and initializes NBs with them. However, we check whether an unmatched detection is labeled as an NB in the affinity matrix based on the threshold $\tau_{nb}$ and whether it is outside the maximum distance threshold with respect to other tracks before initializing it as an NB. Otherwise, it is discarded. Additionally, we check whether an unmatched track is labeled as a DT based on the threshold $\tau_{dt}$ and whether it is outside the maximum distance threshold with respect to other detections before terminating it as a DT.
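The sketch below illustrates the thresholding and forward-propagation decisions described above. The array layouts, the anchor index convention, and the helper name are assumptions for illustration only; the thresholds default to the values reported in Section V-B.

    import numpy as np

    def apply_affinity_decisions(A_fm, A_bm, prev_boxes, prev_vel, cur_boxes, dt,
                                 tau_dt=0.5, tau_fn=0.5, tau_fp=0.7, tau_nb=0.5):
        """Label DT/FN among previous-frame tracks and FP/NB among current-frame
        detections, and forward-propagate occluded tracks with the detector's
        velocity estimate (a sketch, not the authors' exact implementation)."""
        n_prev, n_cur = prev_boxes.shape[0], cur_boxes.shape[0]
        # Assume the last two columns of A_fm are the DT and FN anchor columns.
        dead = A_fm[:n_prev, -2] > tau_dt
        occluded = A_fm[:n_prev, -1] > tau_fn
        # FN propagation: advance the 2D center with the velocity estimate.
        propagated = prev_boxes[occluded].copy()
        propagated[:, 0] += prev_vel[occluded, 0] * dt   # x' = x + vx * dt
        propagated[:, 1] += prev_vel[occluded, 1] * dt   # y' = y + vy * dt
        # Assume the last two rows of A_bm are the NB and FP anchor rows.
        newborn = A_bm[-2, :n_cur] > tau_nb
        false_pos = A_bm[-1, :n_cur] > tau_fp
        kept_detections = np.concatenate([cur_boxes[~false_pos], propagated])
        return kept_detections, newborn, dead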

B. Learning to Predict Affinity Matrices

ShaSTA uses off-the-shelf 3D detections, as well as the pre-trained LiDAR backbone used to generate the detections. ShaSTA first learns bounding box and shape representations for the anchors so we can later use this information to learn a residual that encodes spatio-temporal and shape similarities not only between current frame detections and previous frame tracks, but also between current frame detections and the FP and NB anchors, and between previous frame tracks and the FN and DT anchors. The residual is then used to predict affinity matrices that assign probabilities for matching detections to tracks, eliminating FPs, propagating FNs, initializing NBs, and terminating DTs.

1) Learning Bounding Box Representations for Anchors: Our goal is to learn bounding box representations for each of the FP, FN, NB, and DT anchors in a given frame pair. The aim of the bounding box representation of each anchor is to capture the commonalities between the detection bounding boxes that fall under each of these four categories.

We use off-the-shelf 3D detections, which provide us with 3D bounding box representations that include each box's center coordinate, dimensions, and yaw rotation angle: $b := (x, y, z, w, l, h, r_y)$. We take the set of bounding boxes for the current frame $B_t \in \mathbb{R}^{N_{max}\times 7}$ and previous frame $B_{t-1} \in \mathbb{R}^{N_{max}\times 7}$. We fix the maximum number of bounding boxes we take per frame to be $N_{max}$; we zero-pad the matrix if there are fewer than $N_{max}$ bounding boxes, and we sample the top $N_{max}$ detection bounding boxes if there are more than $N_{max}$ boxes.

Then, we create a learned bounding box representation at time step $t$ for the FP and NB anchors using the current frame detections $B_t$ as follows, where each $\sigma$ represents an MLP:

$b_{fp} = \sigma^b_{fp}(B_t)$  (3)

$b_{nb} = \sigma^b_{nb}(B_t).$  (4)

Similarly, we find learned bounding box representations for the FN and DT anchors using the previous frame detections $B_{t-1}$:

$b_{fn} = \sigma^b_{fn}(B_{t-1})$  (5)

$b_{dt} = \sigma^b_{dt}(B_{t-1}).$  (6)

For all four cases, we apply the absolute value to the MLP outputs corresponding to $(w, l, h)$, since dimensions need to be non-negative values. We then concatenate $b_{fp} \in \mathbb{R}^7$ and $b_{nb} \in \mathbb{R}^7$ to $B_{t-1}$ to get $\hat{B}_{t-1} \in \mathbb{R}^{(N_{max}+2)\times 7}$, as well as $b_{fn} \in \mathbb{R}^7$ and $b_{dt} \in \mathbb{R}^7$ to $B_t$ to get $\hat{B}_t \in \mathbb{R}^{(N_{max}+2)\times 7}$. The reason we append the FP and NB anchors to $B_{t-1}$ is so that current frame detections can match to them, and the same logic applies for appending the FN and DT anchors to $B_t$.
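One simple way to realize each anchor MLP $\sigma^b$ as a map from a padded box set to a single 7-d anchor box is a per-box MLP followed by mean pooling. The paper only states that each anchor representation comes from an MLP, so the pooling choice, layer sizes, and class name below are assumptions.

    import torch
    import torch.nn as nn

    class AnchorBoxMLP(nn.Module):
        """Maps a padded set of boxes (N_max x 7) to one 7-d anchor box.
        The mean-pooling set encoder is an assumed design choice."""
        def __init__(self, hidden=64):
            super().__init__()
            self.per_box = nn.Sequential(nn.Linear(7, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden), nn.ReLU())
            self.head = nn.Linear(hidden, 7)

        def forward(self, boxes):                    # boxes: (N_max, 7)
            feat = self.per_box(boxes).mean(dim=0)   # permutation-invariant pool
            anchor = self.head(feat)
            # Force (w, l, h) to be non-negative, as stated in the text.
            return torch.cat([anchor[:3], anchor[3:6].abs(), anchor[6:]])

    # Augmentation example: append the FP/NB anchors to B_{t-1}.
    # B_prev_aug = torch.cat([B_prev, b_fp[None], b_nb[None]], dim=0)  # (N_max+2, 7)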
2) Learning Shape Representations for Anchors: In this section, we extract shape descriptors for each 3D detection using the pre-trained LiDAR backbone of our off-the-shelf detector to leverage spatio-temporal and shape information from the raw LiDAR data. In similar fashion to the previous section, our goal is to learn shape representations for the FP, FN, NB, and DT anchors using the shape descriptors from existing detections. These shape descriptors will be used to match detections to each other, or to one of the anchors if they share similar shape information.

We pass the current frame's 4D LiDAR point cloud with an added temporal dimension into the frozen pre-trained VoxelNet [13] LiDAR backbone used in the CenterPoint [7] detector. This outputs a bird's-eye-view (BEV) map for the current frame, where each voxelized region encodes a high-dimensional volumetric representation of the shape information in that region. Then, for each current frame detection bounding box, we extract a shape descriptor using bilinear interpolation from the BEV map as in [7], using the bounding box center, left face center, right face center, front face center, and back face center. We concatenate these 5 shape features to create the overall shape descriptor for each bounding box. Note that in BEV, the box center, bottom face center, and top face center all project to the same point, so we forgo extracting the latter two centers. We accumulate all of the shape features extracted for the current frame detections and call this cumulative shape descriptor $S_t \in \mathbb{R}^{N_{max}\times F}$. We repeat the same procedure for the previous frame's LiDAR point cloud and bounding boxes to get the overall shape descriptor $S_{t-1} \in \mathbb{R}^{N_{max}\times F}$.

Using the extracted shape features for the current frame $S_t$, we obtain our learned shape descriptors for the FP and NB anchors with MLPs $\sigma^s_{fp}$ and $\sigma^s_{nb}$, respectively:

$s_{fp} = \sigma^s_{fp}(S_t)$  (7)

$s_{nb} = \sigma^s_{nb}(S_t).$  (8)

Likewise, we use our previous frame's shape features $S_{t-1}$ to learn shape descriptors for the FN and DT anchors as such:

$s_{fn} = \sigma^s_{fn}(S_{t-1})$  (9)

$s_{dt} = \sigma^s_{dt}(S_{t-1}).$  (10)

Just as we did in the previous section, we concatenate $s_{fp} \in \mathbb{R}^F$ and $s_{nb} \in \mathbb{R}^F$ to $S_{t-1}$ to get the augmented shape descriptor $\hat{S}_{t-1} \in \mathbb{R}^{(N_{max}+2)\times F}$, as well as $s_{fn} \in \mathbb{R}^F$ and $s_{dt} \in \mathbb{R}^F$ to $S_t$ to get $\hat{S}_t \in \mathbb{R}^{(N_{max}+2)\times F}$.
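The sketch below shows one way to bilinearly sample a BEV feature map at the five box points (center and the four side-face centers) and concatenate them into a single descriptor. The BEV extent, the assumption that the map's width/height axes align with x/y, and the heading convention are illustrative choices, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def bev_shape_descriptors(bev, boxes, bev_extent=54.0):
        """bev: (C, H, W) feature map assumed to cover [-bev_extent, bev_extent]
        meters in x and y. boxes: (N, 7) boxes (x, y, z, w, l, h, ry)."""
        C, H, W = bev.shape
        n = boxes.shape[0]
        x, y, w, l, ry = boxes[:, 0], boxes[:, 1], boxes[:, 3], boxes[:, 4], boxes[:, 6]
        cos, sin = torch.cos(ry), torch.sin(ry)
        # Center, front/back (+-l/2 along heading), left/right (+-w/2 across it).
        offsets = torch.stack([
            torch.zeros_like(x), torch.zeros_like(y),
            0.5 * l * cos, 0.5 * l * sin,
            -0.5 * l * cos, -0.5 * l * sin,
            -0.5 * w * sin, 0.5 * w * cos,
            0.5 * w * sin, -0.5 * w * cos,
        ], dim=1).view(-1, 5, 2)
        pts = torch.stack([x, y], dim=1)[:, None, :] + offsets        # (N, 5, 2)
        grid = (pts / bev_extent).view(1, -1, 1, 2)                   # normalize to [-1, 1]
        feats = F.grid_sample(bev[None], grid, align_corners=False)   # (1, C, N*5, 1)
        feats = feats[0, :, :, 0].view(C, n, 5)
        return feats.permute(1, 2, 0).reshape(n, 5 * C)               # (N, 5*C)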
3) Residual Prediction: Using the augmented 3D bounding boxes and shape descriptors, we aim to find three residuals that measure the similarities between the current and previous frames' bounding box and shape representations. Since the boxes are low-dimensional abstractions, we aim to maximize the amount of spatio-temporal information we extract from them. Thus, we obtain two residuals for the augmented 3D bounding boxes, called the VoxelNet and bounding box residuals, $R_v$ and $R_b$ respectively. We also learn one shape residual $R_s$ between the augmented shape descriptors. We obtain our overall residual $R$ that captures the spatio-temporal and shape similarities by taking a weighted sum of the three residuals, where the weights are also learned.

VoxelNet Residual. Our first residual $R_v \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ is a variation of the VoxelNet [13] bounding box residual. We adapt the VoxelNet residual for the end-goal of data association between $\hat{B}_{t-1}$ and $\hat{B}_t$ to capture their similarities based on 3D box center, dimension, and rotation. We define each entry $(i, j)$ in $R_v$ as follows:

$L_c(i, j) = \lVert c^i_{t-1} - c^j_t \rVert_2^2$  (11)

$L_d(i, j) = \left|\log\tfrac{w^i_{t-1}}{w^j_t}\right| + \left|\log\tfrac{l^i_{t-1}}{l^j_t}\right| + \left|\log\tfrac{h^i_{t-1}}{h^j_t}\right|$  (12)

$L_r(i, j)^2 = \left(\cos(r^i_{y,t-1}) - \cos(r^j_{y,t})\right)^2 + \left(\sin(r^i_{y,t-1}) - \sin(r^j_{y,t})\right)^2$  (13)

$R_v(i, j) = \hat{L}_c(i, j) + L_d(i, j) + L_r(i, j)$  (14)

where $c := (x, y, z)$ is the 3D box center. For the overall residual in Equation 14, $L_c(i, j)$ is normalized to make all the normalized elements $\hat{L}_c(i, j)$ dimensionless like the terms in $L_d$ and $L_r$.

Bounding Box Residual. The learned bounding box residual $R_b \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ is found by first expanding $\hat{B}_{t-1}$ and $\hat{B}_t$ and concatenating them to get a matrix $\hat{B} \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times 6}$. Note that for this step we only use the box centers from $\hat{B}_{t-1}$ and $\hat{B}_t$. Then, we obtain our residual $R_b \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ with an MLP $\sigma^b_r$ as such:

$R_b = \sigma^b_r(\hat{B}).$  (15)

Shape Residual. We use $\hat{S}_t$ and $\hat{S}_{t-1}$ to obtain the learned shape residual $R_s$ between the two frames. We expand $\hat{S}_{t-1}$ and $\hat{S}_t$ and concatenate them to get a matrix $\hat{S} \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times 2F}$. Then, we obtain our residual $R_s \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ with an MLP:

$R_s = \sigma^s_r(\hat{S}).$  (16)

Overall Residual. The overall residual $R$ is a weighted sum of the three residuals we previously obtained: $R_v$, $R_b$, and $R_s$. We concatenate $\hat{B}$ and $\hat{S}$ that we previously obtained in the earlier steps to create our input $\hat{W} \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times(2F+6)}$. Then, we pass this through an MLP $\sigma_\alpha$ to get the learned weights $\alpha \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)\times 3}$ as follows:

$\alpha = \sigma_\alpha(\hat{W}).$  (17)

We split $\alpha$ into matrices $\alpha_v, \alpha_b, \alpha_s \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ and obtain our overall residual $R \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$:

$R = \alpha_v \odot R_v + \alpha_b \odot R_b + \alpha_s \odot R_s,$  (18)

where $\odot$ is the Hadamard product.
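A compact sketch of the pairwise VoxelNet-style residual and the weighted combination is given below. The center-term normalization is an assumed, simple choice since the paper does not specify it, and the clamps only guard against zero-padded boxes.

    import torch

    def voxelnet_residual(Bp, Bc):
        """Pairwise residual (Eqs. 11-14) between augmented boxes
        Bp (previous frame, (N+2) x 7) and Bc (current frame, (N+2) x 7)."""
        cp, cc = Bp[:, None, :3], Bc[None, :, :3]
        L_c = ((cp - cc) ** 2).sum(-1)                      # squared L2 center distance
        L_c_hat = L_c / (L_c.mean() + 1e-6)                 # assumed dimensionless scaling
        dp, dc = Bp[:, None, 3:6], Bc[None, :, 3:6]
        L_d = (dp.clamp(min=1e-3) / dc.clamp(min=1e-3)).log().abs().sum(-1)
        rp, rc = Bp[:, None, 6], Bc[None, :, 6]
        L_r = (rp.cos() - rc.cos()) ** 2 + (rp.sin() - rc.sin()) ** 2
        return L_c_hat + L_d + L_r                          # (N+2) x (N+2)

    # Overall residual (Eq. 18): elementwise learned weights gate the three
    # residual maps of identical shape:
    # R = alpha_v * R_v + alpha_b * R_b + alpha_s * R_s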
4) Affinity Matrix Estimation: Given our overall residual, we have information about the pairwise similarities between current frame detections and previous frame tracks, as well as between each of them and the four augmented anchors. We want to use these spatio-temporal and shape relationships to learn probabilities for matching the detections and tracks to each other or to the anchors. This ultimately allows us to leverage global relationships for learning our matching probabilities, as opposed to using greedy local searches on residuals for matching [7, 2]. We apply an MLP to get our overall affinity matrix $A$:

$A = \sigma_{aff}(R).$  (19)

C. Log Affinity Loss

Inspired by [28, 29, 30], we train our model with the log affinity loss $L_{la}$. We define the ground-truth affinity matrix $A_{gt}$, the estimated affinity matrix $A$, and the Hadamard product $\odot$ to get:

$L_{la} = \dfrac{\sum_i \sum_j \left(A_{gt} \odot (-\log(A))\right)}{\sum_i \sum_j A_{gt}}.$  (20)

The affinity matrix prediction has corresponding ground-truth matrices $A_{gt,fm}$ and $A_{gt,bm}$ for forward matching and backward matching, respectively. In total, our overall loss function $L$ is defined as follows:

$L_{fm} = \dfrac{\sum_i \sum_j \left(A_{gt,fm} \odot (-\log(A_{fm}))\right)}{\sum_i \sum_j A_{gt,fm}}$  (21)

$L_{bm} = \dfrac{\sum_i \sum_j \left(A_{gt,bm} \odot (-\log(A_{bm}))\right)}{\sum_i \sum_j A_{gt,bm}}$  (22)

$L = \tfrac{1}{2}\left(L_{fm} + L_{bm}\right).$  (23)
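A minimal sketch of Equations 20-23, written here in PyTorch with a small epsilon for numerical stability (the epsilon is an assumption, not part of the paper's definition):

    import torch

    def log_affinity_loss(A_pred, A_gt, eps=1e-8):
        """Eq. 20: ground-truth-weighted negative log of the predicted matching
        probabilities, normalized by the number of ground-truth matches."""
        return (A_gt * -(A_pred + eps).log()).sum() / (A_gt.sum() + eps)

    def total_loss(A_fm, A_bm, A_gt_fm, A_gt_bm):
        # Average of the forward- and backward-matching losses (Eqs. 21-23).
        return 0.5 * (log_affinity_loss(A_fm, A_gt_fm) + log_affinity_loss(A_bm, A_gt_bm))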
D. Sequential Track Confidence Refinement

We propose a sequential track confidence refinement technique that is used to assess the quality of tracks in real-time. We first offer an intuition behind our sequential track confidence refinement technique before formalizing our approach in Equation 24.

As stated earlier, track confidence should accurately reflect a track's quality relative to the quality of tracks for objects of the same class in a given time step. Thus, to obtain more accurate confidence scores to characterize our tracks, we want to leverage our affinity matrix estimation as follows: (1) If the affinity matrix indicates that the detection matched to our track has a high probability of being true-positive, we want to take a weighted average between the confidence of our matched detection and the existing confidence of our track. (2) If the affinity matrix indicates that the detection matched to our track almost got eliminated as an FP, only marginally missing the elimination threshold $\tau_{FP}$, then we want to downscale the existing confidence for our track, because it likely should not be kept alive with this matched detection. This approach is captured in Equation 24:

$c^{(t)}_{trk,i} \leftarrow \mathbb{1}[P^{(t)}_{FP,i} < \beta_1]\, \beta_2\, c^{(t)}_{det,i} + (1 - \beta_2)\, c^{(t-1)}_{trk,i}$  (24)

$\text{s.t.}\quad 0 \le c^{(t)}_{det,i} \le 1\ \forall i, t,\quad 0 \le c^{(t)}_{trk,i} \le 1\ \forall i, t,\quad 0 \le \beta_1 \le 1,\quad 0 \le \beta_2 \le 1.$

According to Equation 24, for track ID $i$ at time step $t$, we have a confidence of $c^{(t)}_{trk,i}$. The affinity matrix gets leveraged in the indicator function with $P^{(t)}_{FP,i}$, which is the probability that the detection got matched with the false-positive anchor at time step $t$. We fix $\beta_1$ to be a value slightly less than $\tau_{FP}$ to account for detections that are very close to getting eliminated as FP. In other words, we view detections with $P^{(t)}_{FP,i} \in [\beta_1, \tau_{FP}]$ as very uncertain detections, and we take this into account by reducing the overall track confidence for tracks matching to such detections. However, if the matched detection has a very high probability of being true-positive, then the track confidence $c^{(t)}_{trk,i}$ becomes a weighted average between $c^{(t)}_{det,i}$ and $c^{(t-1)}_{trk,i}$. Based on cross-validation, we found that this refinement technique is not very sensitive to the values of $\beta_1$ and $\beta_2$. In fact, we set $\beta_1 = 0.5$ for all object classes, and $\beta_2 = 0.5$ for all object classes except for bicycle ($\beta_2 = 0.4$), bus ($\beta_2 = 0.7$), and trailer ($\beta_2 = 0.4$). In the case of bicycle and trailer, we choose $\beta_2$ to be slightly lower, because we want to give more weight to the fact that the object has been tracked, since 3D detectors tend not to be very good on these object types and thus skew towards lower detection confidence scores. For the converse reason, we give bus a higher $\beta_2$.

In the special case of a newborn track, the track confidence is only the first term of Equation 24, i.e.,

$c^{(t)}_{trk,i} \leftarrow \mathbb{1}[P^{(t)}_{FP,i} < \beta_1]\, \beta_2\, c^{(t)}_{det,i}.$

This works well, because the detection confidences are also created in relative terms. Thus, if all newly initialized track confidences are the detection confidences scaled with $\beta_2$, their relative confidence rankings will be preserved.
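A minimal sketch of the per-track update in Equation 24, using the default beta values reported above (per-class overrides would apply on top of this):

    def refine_track_confidence(c_trk_prev, c_det, p_fp, beta1=0.5, beta2=0.5,
                                newborn=False):
        """Sequential track confidence refinement (Eq. 24); the indicator zeroes
        the detection term when the matched detection is nearly a false-positive."""
        keep = 1.0 if p_fp < beta1 else 0.0
        if newborn:
            # Newborn tracks have no history: only the scaled detection term.
            return keep * beta2 * c_det
        return keep * beta2 * c_det + (1.0 - beta2) * c_trk_prev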
IV. DATA PREPARATION

We evaluate our tracking performance on the nuScenes benchmark [15]. The nuScenes dataset contains 1000 scenes, where each scene is 20 seconds long and is generated with a 2Hz frame-rate. For the tracking task, nuScenes has 7 classes: bicycle, bus, car, motorcycle, pedestrian, trailer, and truck. In this section, we provide an overview of the LiDAR preprocessing and ground-truth affinity matrix formation we complete to deploy ShaSTA for the nuScenes dataset.

A. LiDAR Point Cloud Preprocessing

Since CenterPoint [7] provides detections at 2Hz, but the LiDAR has a sampling rate of 20Hz, we use the nuScenes [15] feature for accumulating multiple LiDAR sweeps with motion compensation. This provides us with a denser 4D point cloud with an added temporal dimension, and makes the LiDAR input match the detection sampling rate. We use 10 LiDAR sweeps to generate the point cloud.

B. Ground-Truth Affinity Matrix

We find the true-positive, false-positive, and false-negative detections using the matching algorithm from the nuScenes [15] dev-kit. For each frame, we define a ground-truth affinity matrix $A_t \in \mathbb{R}^{(N_{max}+2)\times(N_{max}+2)}$ between the current frame and previous frame off-the-shelf detections. For all true-positive previous and current frame detections, $B^{tp}_{t-1}$ and $B^{tp}_t$, we say they are matched if the ground-truth boxes they each correspond to have the same ground-truth tracking IDs. If there is a match between true-positive boxes $b^{tp}_{t-1,i} \in B^{tp}_{t-1}$ and $b^{tp}_{t,j} \in B^{tp}_t$, then $A_t(i, j) = 1$. For all false-positive previous frame detections and true-positive previous frame detections whose tracking IDs do not appear in any of the current frame ground-truth tracking IDs, we call them dead tracks $B^{dt}_{t-1}$. For all $b^{dt}_{t-1,i} \in B^{dt}_{t-1}$, we set $A_t(i, N_{max}+1) = 1$. For all true-positive previous frame detections that do not have a match in the current frame but whose tracking IDs do appear in the current frame ground-truth tracking IDs, we call them false-negatives $B^{fn}_{t-1}$. For all $b^{fn}_{t-1,i} \in B^{fn}_{t-1}$, we set $A_t(i, N_{max}+2) = 1$. Additionally, for all true-positive current frame detections whose tracking IDs do not appear in the previous frame ground-truth tracking IDs, we call them newborn tracks $B^{nb}_t$, and we set $A_t(N_{max}+1, j) = 1$ for all $b^{nb}_{t,j} \in B^{nb}_t$. Finally, we define the set of false-positive current frame detections as $B^{fp}_t$, and we set $A_t(N_{max}+2, j) = 1$ for all $b^{fp}_{t,j} \in B^{fp}_t$. Unmatched $(i, j)$ entries are set to 0.
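The following sketch illustrates the ground-truth affinity construction described above. The inputs (per-detection matched ground-truth track IDs, with None for false-positives, plus the set of IDs annotated in each frame) and the 0-based placement of the anchor rows/columns are assumed conventions for illustration.

    import numpy as np

    def build_gt_affinity(prev_det_ids, cur_det_ids, prev_gt_ids, cur_gt_ids, n_max):
        """Ground-truth affinity: DT/FN in the last two columns, NB/FP in the
        last two rows (an assumed indexing convention)."""
        A = np.zeros((n_max + 2, n_max + 2))
        cur_lookup = {tid: j for j, tid in enumerate(cur_det_ids) if tid is not None}
        for i, tid in enumerate(prev_det_ids):
            if tid is None or tid not in cur_gt_ids:
                A[i, n_max] = 1.0            # dead track (FP or object left the scene)
            elif tid in cur_lookup:
                A[i, cur_lookup[tid]] = 1.0  # true-positive association
            else:
                A[i, n_max + 1] = 1.0        # false-negative: object present, not detected
        for j, tid in enumerate(cur_det_ids):
            if tid is None:
                A[n_max + 1, j] = 1.0        # false-positive detection
            elif tid not in prev_gt_ids:
                A[n_max, j] = 1.0            # newborn track
        return A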
V. EXPERIMENTAL RESULTS

This section provides an overview of evaluation metrics, training details, comparisons against state-of-the-art 3D multi-object tracking techniques, and ablation studies to analyze our technique.

A. Evaluation Metrics

The primary metrics for evaluating nuScenes are Average Multi-Object Tracking Accuracy (AMOTA) and Average Multi-Object Tracking Precision (AMOTP) [31]. AMOTA measures the ability of a tracker to track objects correctly, while AMOTP measures the quality of the estimated tracks. AMOTA averages the recall-weighted MOTA, known as MOTAR, at $n$ evenly spaced recall thresholds. Note that MOTA is a metric that penalizes FPs, FNs, and ID switches (IDS). $GT$ indicates the number of ground-truth tracklets.

$AMOTA = \dfrac{1}{n-1} \sum_{r \in \{\frac{1}{n-1}, \frac{1}{n-2}, \ldots, 1\}} MOTAR$  (25)

$MOTAR = \max\left(0,\ 1 - \dfrac{IDS_r + FP_r + FN_r - (1 - r)\cdot GT}{r \cdot GT}\right)$  (26)

Additionally, AMOTP averages MOTP over $n$ evenly spaced recall thresholds. Note that MOTP measures the misalignment between ground-truth and predicted bounding boxes. For the definition of AMOTP below, $d_{i,t}$ indicates the distance error for track $i$ at time step $t$ and $TP_t$ indicates the number of true-positive (TP) matches at time step $t$:

$AMOTP = \dfrac{1}{n-1} \sum_{r \in \{\frac{1}{n-1}, \frac{1}{n-2}, \ldots, 1\}} \dfrac{\sum_{i,t} d_{i,t}}{\sum_t TP_t}$  (27)

The nuScenes benchmark [15] defines TP tracks as estimated tracks that are within an L2 distance of 2m from ground-truth tracks. This allows for a margin of error that can become unsafe in certain driving scenarios where small distances can make a dramatic difference in safety. Thus, even though the official nuScenes leaderboard uses AMOTA solely to rank trackers, we emphasize AMOTP as well due to its safety implications.

We also present secondary metrics, including the total number of TP, FP, and FN tracks, to better analyze the effectiveness of our approach for downstream decision-making tasks.
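A minimal sketch of Equations 25-26, assuming the per-recall-threshold error counts have already been accumulated; it is not the official nuScenes dev-kit implementation.

    def motar(ids, fp, fn, gt, r):
        return max(0.0, 1.0 - (ids + fp + fn - (1.0 - r) * gt) / (r * gt))

    def amota(per_recall_errors, gt, n):
        """per_recall_errors: list of (r, IDS_r, FP_r, FN_r) tuples for the
        n - 1 evaluated recall thresholds."""
        return sum(motar(ids, fp, fn, gt, r)
                   for r, ids, fp, fn in per_recall_errors) / (n - 1)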
B. Training Specifications

We train a different network for each object category, following [3, 4]. The value of $N_{max}$ depends on the object type since some classes are more common than others. We choose the value for $N_{max}$ based on the per-object detection frequency we gather in the training set. The smallest is $N_{max} = 20$ and the largest is $N_{max} = 90$. We use the Adam optimizer with a constant learning rate of $10^{-4}$ and L2 weight regularization of $10^{-2}$. We use the pre-trained VoxelNet weights from CenterPoint [7] and freeze them for our LiDAR backbone
so that the shape information correlates with our detection bounding boxes from [7]. Additionally, there is a significant class imbalance between FPs and TPs since state-of-the-art detectors function in low-precision, high-recall regimes. Thus, we downsample the number of FP detections during training. Finally, we fix the thresholds $\tau_{fp}$ for FP elimination, $\tau_{fn}$ for FN propagation, $\tau_{nb}$ for NB initialization, and $\tau_{dt}$ for DT termination with cross-validation. We found that $\tau_{fp} = 0.7$, $\tau_{fn} = 0.5$, $\tau_{nb} = 0.5$, and $\tau_{dt} = 0.5$ work best for all classes.

C. Comparison with State-of-the-Art LiDAR Tracking

Our nuScenes test results can be seen in Tables I and II. Because previous works have shown that the standard AMOTA tracking metric heavily depends on the upstream 3D detections [3, 17, 18], all tracking methods reported use CenterPoint detections to ensure a fair comparison.¹ For more information on this evaluation protocol and the AMOTA metric, please refer to Section V-A.

¹At the time of this writing, CenterPoint detections achieve the highest mAP of all publicly-available 3D detections. Higher-scoring detections appear on the nuScenes leaderboard, but are proprietary.

TABLE I: nuScenes Test Results with evaluation in terms of overall and individual AMOTA and AMOTP. Our method achieves the highest AMOTA, while effectively balancing AMOTP to get the second lowest AMOTP, only 0.005m shy of the first spot. We only compare LiDAR-only methods that use CenterPoint detections to ensure a fair comparison. The blue entry is our LiDAR-only method ShaSTA.

Metric  Method  Detector  Input  Overall  Bicycle  Bus  Car  Motorcycle  Pedestrian  Trailer  Truck
AMOTA ↑  ShaSTA (Ours)  CenterPoint  LiDAR  69.6  41.0  73.3  83.8  72.7  81.0  70.4  65.0
AMOTA ↑  NEBP [6]  CenterPoint  LiDAR  68.3  44.7  70.8  83.5  69.8  80.2  69.0  59.8
AMOTA ↑  OGR3MOT [8]  CenterPoint  LiDAR  65.6  38.0  71.1  81.6  64.0  78.7  67.1  59.0
AMOTA ↑  CenterPoint [7]  CenterPoint  LiDAR  65.0  33.1  71.5  81.8  58.7  78.0  69.3  62.5
AMOTP ↓  ShaSTA (Ours)  CenterPoint  LiDAR  0.540  0.674  0.629  0.383  0.504  0.369  0.742  0.650
AMOTP ↓  NEBP [6]  CenterPoint  LiDAR  0.624  1.026  0.677  0.391  0.557  0.393  0.781  0.541
AMOTP ↓  OGR3MOT [8]  CenterPoint  LiDAR  0.620  0.899  0.675  0.395  0.615  0.383  0.790  0.585
AMOTP ↓  CenterPoint [7]  CenterPoint  LiDAR  0.535  0.561  0.636  0.391  0.519  0.409  0.729  0.5

TABLE II: nuScenes Test Results with evaluation in terms of overall TP, FP, and FN. We only compare LiDAR-only methods that use CenterPoint detections to ensure a fair comparison. The blue entry is our LiDAR-only method ShaSTA.

Method  Detector  Input  TP ↑  FP ↓  FN ↓
ShaSTA (Ours)  CenterPoint  LiDAR  97,799  16,746  21,293
NEBP [6]  CenterPoint  LiDAR  97,367  16,773  21,971
OGR3MOT [8]  CenterPoint  LiDAR  95,264  17,877  24,013
CenterPoint [7]  CenterPoint  LiDAR  94,324  17,355  24,557

The CenterPoint tracking algorithm [7] is our baseline, and we include recent, peer-reviewed works cited in Section II that tackle data association, track lifecycle management, FPs, and FNs in the LiDAR-only tracking domain. Most notably, before ShaSTA was developed, NEBP [6] was the #1 LiDAR-only tracker using CenterPoint detections.

In Table I, we can see that ShaSTA balances correct tracking (AMOTA) with precise, high-quality tracking (AMOTP). Most notably, ShaSTA not only achieves the highest overall AMOTA, but also does so for 6 out of 7 object classes. The only class we are not first in AMOTA is the bicycle class, with NEBP taking the lead. However, it is interesting to note that our precision is much better for bicycles than NEBP's. NEBP averages over 1 meter in distance from the ground-truth tracks on bicycles, which can be very dangerous in real-world traffic scenes. Compared to NEBP, ShaSTA has a 34% improvement in AMOTP while only having an 8% reduction in AMOTA. Thus, the trade-off for bicycles is clearly in ShaSTA's favor. For the overall AMOTP, we achieve the second best value, only trailing CenterPoint by 0.005 meters, which is a favorable trade-off when compared to our significant boost in overall AMOTA.

Examining further metric breakdowns in Table II, it becomes clear that ShaSTA is effective in de-cluttering environments suffering from FPs and compensating for missed detections that lead to missing TPs and rising FNs. The two most observed classes, which constitute over 80% of the test set, are cars and pedestrians, and we get highly competitive results in both of those categories, which tend to suffer the most from crowded scenes. Thus, our FP numbers reflect our effectiveness at dealing with these sorts of scenes. Another interesting interplay between these metrics is that oftentimes decreasing FPs comes at the risk of increasing FNs and decreasing TPs. Thus, it is quite challenging to tackle the FP, FN, and TP problem simultaneously, yet our approach does so most effectively out of all existing tracking methods.

D. Ablation Studies

We complete four ablative experiments on the nuScenes validation set to measure the impacts of (1) the sequential track confidence refinement, (2) the number of shape descriptors from the BEV map, (3) each of the residuals used to make the overall residual prediction, and (4) including FP elimination, FN propagation, NB initialization, and DT termination in the learned affinity matrix. In total, our ablative studies demonstrate that all of the components in ShaSTA contribute to the overall success of this tracking framework.

TABLE III: Ablation on Confidence Refinement. We assess the effect of confidence refinement on tracking performance for each object. The version of ShaSTA without refinement indicates that we apply the CenterPoint detection confidence scores to our tracks. Evaluation is on the nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.

Metric: AMOTA ↑
Method  Overall  Bicycle  Bus  Car  Motorcycle  Pedestrian  Trailer  Truck
Without Refinement  70.2  49.4  85.2  85.3  70.1  80.9  50.1  70.1
With Refinement  72.8  58.8  85.6  85.6  74.5  81.4  53.1  70.3
Difference  +2.6  +9.4  +0.4  +0.3  +4.4  +0.5  +3.0  +0.2

TABLE IV: Ablation on Shape Descriptors. We assess the effect of the number of 3D bounding box points used to extract shape descriptors. Evaluation is on the nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.

Box Center Used  Face Centers Used  AMOTA ↑
Yes  No  68.8
No  Yes  70.2
Yes  Yes  72.8

TABLE V: Ablation on Residuals. We assess the effect of isolating each residual on the tracking algorithm's performance. Evaluation is on the nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.

Residual Used  AMOTA ↑
VoxelNet Only  68.9
Bounding Box Only  69.1
Shape Only  69.5
All (ShaSTA)  72.8

TABLE VI: Ablation on Learned Affinity Matrix. We assess the effect of each augmentation from our learned affinity matrix on track formation. Evaluation is on the nuScenes validation set in terms of overall AMOTA. The blue entry is our full method.

Affinity Augmentation Method  AMOTA ↑
False-Positive Elimination Only  72.7
False-Negative Propagation Only  70.4
Newborn Initialization Only  72.6
Dead Track Termination Only  70.4
All (ShaSTA)  72.8
Fig. 3: Visualization of tracking results for the car object class projected onto the front camera: (a) CenterPoint, Scene 270; (b) ShaSTA, Scene 270. We show three consecutive frames from Scene 270 of the validation set for CenterPoint [7] and ShaSTA. In frame 29, CenterPoint has 3 false-positive tracks, and one of the false-positives persists into frames 30 and 31. However, ShaSTA perfectly tracks the car in this sequence without any false-positives or false-negatives. Note that both methods start with the CenterPoint detections as input.

First, we analyze the effect of the sequential track confidence refinement, as seen in Table III. When we do not use our refinement technique, we apply the CenterPoint detection confidence scores to our tracks as in past works [1, 6, 2, 3, 7, 8]. We note a significant boost in the overall AMOTA for the validation set when using the proposed refinement method. All object classes benefited, but the ones that improved the least are large objects like trucks, cars, and buses. This makes sense since these types of objects tend to provide the best LiDAR point clouds and are less likely to suffer from full occlusion, allowing the detection confidence to be a good indicator for tracking. Conversely, smaller objects like bicycles and motorcycles, which tend to suffer the most from LiDAR point cloud sparsity due to increased distance from the ego vehicle and occlusion in traffic scenes, benefit the most from this technique. In these cases, the pure detection confidences can drop dramatically in scenarios such as full occlusion of these small objects, but with our sequential refinement we take into account the existing confidence that has accumulated for a track before such scenarios occur.

Additionally, in Table IV, we can see that as we add more bounding box face centers for extracting shape descriptors, the accuracy of our model monotonically increases. This result supports our argument that leveraging raw LiDAR point clouds to get shape information is integral to the success of our tracking framework. Most notably, when we have less shape information, more objects within the same class are more likely to appear similar, leading to more ambiguity in distinguishing between TPs and FPs. Such ambiguity makes the track confidence refinement less effective, specifically in the indicator function in Equation 24.

Furthermore, our ablation on the individual residuals in Table V demonstrates that even though we can achieve competitive tracking accuracy solely with spatio-temporal (VoxelNet Only or Bounding Box Only) or shape (Shape Only) information, we achieve the best results when leveraging both types of information.

Lastly, in Table VI, we see that the most impactful augmentations are FP elimination and NB initialization. Eliminating FPs is expected to boost our accuracy since off-the-shelf detections suffer from such high rates of FP detections. NB initialization can be seen as a second reinforcement on eliminating FP detections, because unmatched detections that do not have a probability of at least $\tau_{nb}$ for being NBs do not get initialized as tracks and are discarded. Finally, even though FNs and DT termination have not been as detrimental to tracking algorithms in the past, these augmentations alone give us competitive accuracies, showing that every augmentation contributes to effective tracking.

E. Qualitative Results

As indicated in Table I, there are significant increases in test accuracy over the CenterPoint [7] baseline, particularly for objects like cars and pedestrians that are often found in crowded scenes and thus have increased susceptibility to experiencing high rates of FP tracks.

At high recall thresholds, FP tracks tend to appear more frequently, but the AMOTA metric does not strongly penalize FP tracks at higher recall thresholds, diminishing their significant effect on real-world deployment for downstream decision-making tasks. To showcase ShaSTA's ability to eliminate FPs effectively, we present tracking results in Figure 3. We only show the tracking results on the car object class from validation scene 270. (a) shows the CenterPoint [7] baseline, which has FP tracks in all three frames. On the other hand, (b) demonstrates ShaSTA's ability to declutter the scene and
accurately track the car, while ignoring background information that should not be tracked. Typically, past works [4] use non-maximal suppression heuristics such as intersection over union (IoU) to remove FP tracks, but in more complex settings like scene 270, such heuristics would fail because none of the FPs overlap with other track predictions. As a result, ShaSTA proves to be especially effective at using shape and spatio-temporal information from LiDAR point clouds to localize and track actual objects of interest in the scene.

VI. CONCLUSION

We present ShaSTA, a 3D multi-object tracking framework that leverages shape and spatio-temporal information from LiDAR sensors to learn affinities for robust data association, track lifecycle management, FP elimination, FN propagation, and track confidence refinement. Our approach effectively addresses false-positives and missing detections in cluttered scenes and under occlusion, yielding state-of-the-art performance. Since ShaSTA offers a flexible framework, we can easily extend it in future work to fuse various sensor modalities. Other interesting future directions include the usage of our learned affinities in prediction and planning.

VII. ACKNOWLEDGMENT

Toyota Research Institute provided funds to support this work.

REFERENCES

[1] X. Weng, J. Wang, D. Held, and K. Kitani, "3D multi-object tracking: A baseline and new evaluation metrics," IROS, 2020.
[2] H.-k. Chiu, A. Prioletti, J. Li, and J. Bohg, "Probabilistic 3d multi-object tracking for autonomous driving," arXiv preprint arXiv:2001.05673, 2020.
[3] H.-k. Chiu, J. Li, R. Ambruş, and J. Bohg, "Probabilistic 3d multi-modal, multi-object tracking for autonomous driving," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 14227–14233.
[4] C. Stearns, D. Rempe, J. Li, R. Ambrus, S. Zakharov, V. Guizilini, Y. Yang, and L. J. Guibas, "Spot: Spatiotemporal modeling for 3d object tracking," arXiv preprint arXiv:2207.05856, 2022.
[5] F. Meyer, T. Kropfreiter, J. L. Williams, R. Lau, F. Hlawatsch, P. Braca, and M. Z. Win, "Message passing algorithms for scalable multitarget tracking," Proceedings of the IEEE, vol. 106, no. 2, pp. 221–259, 2018.
[6] M. Liang and F. Meyer, "Neural enhanced belief propagation for data association in multiobject tracking," in 2022 25th International Conference on Information Fusion (FUSION). IEEE, 2022, pp. 1–7.
[7] T. Yin, X. Zhou, and P. Krahenbuhl, "Center-based 3d object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[8] J.-N. Zaech, A. Liniger, D. Dai, M. Danelljan, and L. Van Gool, "Learnable online graph representations for 3d multi-object tracking," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5103–5110, 2022.
[9] A. Kim, A. Ošep, and L. Leal-Taixé, "Eagermot: 3d multi-object tracking via sensor fusion," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11315–11321.
[10] X. Wang, C. Fu, Z. Li, Y. Lai, and J. He, "Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association," arXiv preprint arXiv:2202.12100, 2022.
[11] X. Weng, Y. Wang, Y. Man, and K. Kitani, "Gnn3dmot: Graph neural network for 3d multi-object tracking with multi-feature learning," in CVPR, 2020.
[12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[13] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[14] S. Chen, X. Wang, T. Cheng, Q. Zhang, C. Huang, and W. Liu, "Polar parametrization for vision-based surround-view 3d detection," arXiv preprint arXiv:2206.10965, 2022.
[15] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[16] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, "Pnpnet: End-to-end perception and prediction with tracking in the loop," in CVPR, 2020.
[17] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, "Hota: A higher order metric for evaluating multi-object tracking," International Journal of Computer Vision, vol. 129, pp. 548–578, 2021.
[18] L. Wang, X. Zhang, W. Qin, X. Li, L. Yang, Z. Li, L. Zhu, H. Wang, J. Li, and H. Liu, "Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion," arXiv preprint arXiv:2209.02540, 2022.
[19] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[20] X. Zhou, D. Wang, and P. Krähenbühl, "Objects as points," arXiv preprint arXiv:1904.07850, 2019.
[21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[22] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1355–1361.
[23] B. Yang, W. Luo, and R. Urtasun, "Pixor: Real-time 3d object detection from point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[24] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[25] H. Choi, M. Kang, Y. Kwon, and S.-e. Yoon, "An objectness score for accurate and fast detection during navigation," arXiv preprint arXiv:1909.05626, 2019.
[26] C. R. Qi, O. Litany, K. He, and L. J. Guibas, "Deep hough voting for 3d object detection in point clouds," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9277–9286.
[27] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder, "Disentangling monocular 3d object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1991–1999.
[28] X. Guo and Y. Yuan, "Joint class-affinity loss correction for robust medical image segmentation with noisy labels," arXiv preprint arXiv:2206.07994, 2022.
[29] M. Hayat, S. Khan, W. Zamir, J. Shen, and L. Shao, "Max-margin class imbalanced learning with gaussian affinity," arXiv preprint arXiv:1901.07711, 2019.
[30] S. Sun, N. Akhtar, H. Song, A. Mian, and M. Shah, "Deep affinity network for multiple object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 104–119, 2019.
[31] X. Weng and K. Kitani, "A baseline for 3d multi-object tracking," arXiv preprint arXiv:1907.03961, vol. 1, no. 2, p. 6, 2019.
