
ShaSTA-Fuse: Camera-LiDAR Sensor Fusion to Model Shape and Spatio-Temporal Affinities for 3D Multi-Object Tracking

Tara Sadjadpour¹, Rares Ambrus², and Jeannette Bohg¹

*Toyota Research Institute provided funds to support this work.
¹Stanford University, {tsadja, bohg}@stanford.edu
²Toyota Research Institute, rares.ambrus@tri.global

Abstract— 3D multi-object tracking (MOT) is essential for an autonomous mobile agent to safely navigate a scene. In order to maximize the perception capabilities of the autonomous agent, we aim to develop a 3D MOT framework that fuses camera and LiDAR sensor information. Building on our prior LiDAR-only work [1] that models shape and spatio-temporal affinities for 3D MOT, we propose a novel camera-LiDAR fusion approach for learning affinities. At its core, this work proposes a fusion technique that generates a rich sensory signal incorporating information about depth and distant objects to enhance affinity estimation for improved data association, track lifecycle management, false-positive elimination, false-negative propagation, and track confidence score refinement. Our main contributions include a novel fusion approach for combining camera and LiDAR sensory signals to learn affinities, and a first-of-its-kind multimodal sequential track confidence refinement technique that fuses 2D and 3D detections. Additionally, we perform an ablative analysis on each fusion step to demonstrate the added benefits of incorporating the camera sensor, particularly for small, distant objects that tend to suffer from the depth-sensing limits and sparsity of LiDAR sensors. In sum, our technique achieves state-of-the-art performance on the nuScenes benchmark amongst multimodal 3D MOT algorithms using CenterPoint detections.

I. INTRODUCTION

3D multi-object tracking is an essential component of any autonomous mobile agent's perception module, allowing the agent to perform safe, well-informed motion planning and navigation by localizing surrounding objects in 3D space and time. We address the problem of leveraging multimodal perception to accomplish 3D multi-object tracking for autonomous driving. In particular, we focus on camera-LiDAR sensor fusion due to the prevalence of both sensors on autonomous mobile agents, as well as the complementary nature of these two sensor modalities. Though LiDAR point clouds offer depth perception, they are limited by the LiDAR sensor's depth-sensing range and the sparse, unstructured nature of point clouds. On the other hand, cameras offer rich, structured signals and often capture distant objects, but they lack the depth information that is necessary for high-functioning perception systems. Though many 3D multi-object tracking algorithms focus on LiDAR-only perception, numerous recent works in this field focus on camera-LiDAR sensor fusion due to the increased advantages of using both sensors. In fact, the top 5 trackers for the nuScenes benchmark [2] are all camera-LiDAR trackers. As a result, we aim to extend our previous LiDAR-only work ShaSTA [1] with ShaSTA-Fuse, which leverages the complementary nature of the camera and LiDAR sensors to benefit from their combined ability to perceive depth, distant objects, and rich sensory signals.

The vast majority of state-of-the-art trackers rely on the tracking-by-detection paradigm. In this model for approaching the 3D multi-object tracking problem, tracking algorithms use off-the-shelf, state-of-the-art 3D detectors. Though this is a practical choice due to the high computational cost of training detectors, it is often difficult to compare the quality of trackers when they use different detectors to achieve their results. This is particularly relevant, since the primary tracking accuracy metric called Average Multi-Object Tracking Accuracy (AMOTA) is strongly biased toward detection quality [3–5]. Even though CenterPoint [6] detections tend to be the standard for LiDAR-only tracking [1, 6–9], there is no clear standard in prior multimodal tracking works [5, 10–15]. Nonetheless, due to the standardization of CenterPoint detections, we use them in this work.

Besides detection quality, the keys to successful 3D multi-object tracking are accurate data association, track lifecycle management, false-positive elimination, false-negative propagation, and track confidence score assignment. Data association is the process of matching existing tracks to incoming detections. When tracks and detections remain unmatched, there are a wide range of outcomes we must investigate to correctly handle them. First, unmatched tracks representing objects that are no longer in the scene must be terminated, while unmatched detections representing new objects in the scene must be initialized as newborn tracks. This is referred to as track lifecycle management. Even though these possibilities exist, unmatched tracks and detections can also be a residual effect of poor detection quality. Some unmatched tracks are not meant to be terminated, but rather, the detector has failed to capture a detection for the track due to difficult conditions such as occlusion, so we must interpolate that missing detection with false-negative propagation. Moreover, unmatched detections can be a result of false-positive detections that must be eliminated. Though tracking algorithms tend to focus on a subset of these problems, our prior work ShaSTA [1] is the first tracking algorithm that addresses all of these issues through an affinity-based framework. Thus, we aim to extend that approach in this work.

Finally, track confidence scores indicate the quality of the track relative to other tracks in a given time step. ShaSTA [1] is the first-ever tracking algorithm that proposes a sequential track confidence refinement technique, rather than directly using detection confidence scores for tracks. This proves to work exceptionally well, and for this reason, we extend
this technique to leverage camera and LiDAR information through a multimodal track confidence refinement.

Amongst multimodal camera-LiDAR trackers, there are two general approaches for integrating camera information into the 3D multi-object tracking algorithm. The first is a late fusion technique that extracts features from images produced by camera sensors and combines them with LiDAR features [5, 10, 14], while the second uses abstracted information, such as 2D bounding boxes in the image plane, from 2D camera-based detections [12, 15, 16]. Though each direction has proved to be successful, this work differs from past approaches by leveraging both types of camera-based information.

Thus, we propose a multimodal 3D multi-object tracking algorithm called ShaSTA-Fuse, which fuses camera and LiDAR sensor signals to model shape and spatio-temporal affinities. Our core contributions can be summarized in three main points: (i) we propose a novel late fusion approach for combining camera and LiDAR features in an affinity-based framework; (ii) we offer a simple yet effective multimodal sequential track confidence refinement technique that improves confidence score prediction by fusing 2D and 3D detections; (iii) we perform a thorough analysis on the most effective ways to fuse camera and LiDAR information. In sum, ShaSTA-Fuse achieves state-of-the-art performance on the nuScenes tracking benchmark.

II. RELATED WORK

A. 2D Object Detection

Image-based 2D object detection has rapidly advanced in recent years with the rise of deep learning. The vast majority of state-of-the-art 2D detectors use deep learning networks to extract features from input images in order to classify and localize objects of interest in the scene. There are two main branches of image-based object detectors: two-stage and one-stage detectors.

The most well-known iterations of two-stage detectors are based on R-CNN [17], which was followed by improvements known as Fast R-CNN [18], Faster R-CNN [19], Cascade R-CNN [20], and many more. In this paradigm, we begin with feature extraction followed by two-stage detection. In the first stage, the Region Proposal Network (RPN) proposes candidate object bounding boxes located in Regions of Interest (RoI). Then, in the second stage, features are extracted for each candidate bounding box using RoI Pooling and these features are passed into one or more Bounding Box Heads to further improve the initial bounding box prediction by regression on the object classification and bounding box.
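To make the RoI-based feature extraction above concrete, here is a minimal sketch that pools a fixed-size feature for each box with torchvision's roi_align and passes it through a stand-in Bounding Box Head. The tensor shapes, the stride-16 backbone, and the feature dimension are assumptions for illustration, not the exact Cascade R-CNN configuration used later in this paper.

```python
import torch
from torchvision.ops import roi_align

# Assumed shapes: a backbone feature map for one camera image and a set of
# boxes in (batch_idx, x1, y1, x2, y2) image coordinates.
feature_map = torch.randn(1, 256, 50, 88)            # (batch, channels, H/16, W/16)
boxes = torch.tensor([[0., 100., 40., 220., 130.],
                      [0., 400., 60., 520., 180.]])

# Pool a fixed-size feature for each box; spatial_scale maps image pixels to
# feature-map cells (1/16 for the stride-16 backbone assumed in this sketch).
roi_feats = roi_align(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)

# A stand-in for the initial layers of a Bounding Box Head that flatten the
# pooled features into a fixed-dimensional appearance vector per box.
bbox_head = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(256 * 7 * 7, 1024),
    torch.nn.ReLU(),
)
appearance_features = bbox_head(roi_feats)           # (num_boxes, 1024)
print(appearance_features.shape)
```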
Unlike two-stage detectors, one-stage detectors directly predict object classifications and bounding boxes from input images. The advent of this class of detectors came with the YOLO [21] (You Only Look Once) detector, which was followed by improved iterations much like R-CNN [17].

The choice between these two approaches can be encapsulated as a speed-accuracy tradeoff. While the two-stage detectors maximize object classification and localization accuracy, the one-stage detectors achieve faster inference speed.

Since the tracking task is largely dependent on detection quality, we opt for a two-stage detector in Cascade R-CNN [20] for our algorithm. Additionally, we leverage a Cascade R-CNN network that has been pre-trained on nuScenes image data to extract salient appearance features from our camera data, freezing the entire network except for the Bounding Box Head which we fine-tune. Past 3D multi-object tracking works have not characterized objects localized by their projected 3D detection locations using features extracted in the Bounding Box Head.

B. 3D Object Detection

There is a vast array of approaches in the 3D object detection domain, which faces increased complexity compared to 2D object detection. For 3D object detection we see detectors of various modalities, including LiDAR-only, camera-LiDAR, and the lesser used camera-only.

1) LiDAR-Only 3D Detectors: In the LiDAR-only domain, older works [13, 22–25] hand-craft dense feature representations from the sparse LiDAR point cloud. Using these feature maps, the 3D detectors extend the image-based R-CNN [17] to 3D with an RPN followed by regression heads to improve the initial 3D bounding box estimates. Building on the momentum of works that learn point-wise features from point clouds using deep learning [26, 27], VoxelNet [28] was the first LiDAR-only 3D detector to unify feature extraction and bounding box prediction into an end-to-end learned, one-stage detection framework. Following this work, it became typical for 3D detectors to start with a deep learning-based point cloud encoder [26, 29, 30] to extract an intermediary 3D feature grid or BEV feature map [5, 6, 31, 32]. Then, a decoder is applied to extract object attributes including classification, localization, and detection confidence score. In this work, we explore fusing information from 2D detections produced by Cascade R-CNN [20] with 3D detections. We explore this fusion approach with the LiDAR-only detector CenterPoint [6]. Additionally, we leverage the pre-trained VoxelNet [28] LiDAR backbone from [6] to extract an intermediate BEV feature map which we process for late fusion with our image-based appearance features.

2) Camera-Based 3D Detectors: For improved results to compensate for the sparse signal from LiDAR sensors, the 3D detection community has developed increased interest in camera-LiDAR fusion. Proposal-level fusion methods [13, 33–37] are object-centric techniques that try to fuse object proposals in either image or 3D space. These methods tend not to generalize to other tasks as well as point-level fusion approaches [11, 38–40], which are both object-centric and geometry-centric. In this framework, background image features are painted onto foreground LiDAR points and LiDAR-based detection is completed on these augmented point clouds. In this work, we do not use such detectors, since there is no standard multimodal detector used in 3D multi-object tracking. For example, the best camera-LiDAR 3D detector to date is BEVFusion [11], which also produces state-of-the-art results on the nuScenes leaderboard [2].
Based on how past works have performed, it is expected that the tracking results on the validation set are at least as accurate as those on the test set, but the authors' open-source code does not match their reported tracking results, with a significant drop in accuracy (AMOTA) of 3.7 points. The authors have not explained this discrepancy, and for this reason, we avoid using these detections altogether, since there is no fair way to assess our tracking results with these detections¹.

¹Link to GitHub issue about discrepancies in BEVFusion tracking results. We tried one of the remedies suggested by a GitHub user to filter out low-scoring detections (< 0.025), but this is still 1.8 points lower in AMOTA than the reported test set results. In the CAMO-MOT paper [5], the authors suggest that the BEVFusion detections used to generate the BEVFusion tracking results originate from "a much more powerful detector" than the ones provided on the authors' GitHub page.

Finally, there is a line of work dedicated to camera-only 3D detectors in which 3D bounding boxes are estimated from 2D images. However, these detections yield poor accuracy due to poor depth estimation, and we do not conduct experiments with camera-only 3D detections.

C. 3D Multi-Object Tracking

In the 3D multi-object tracking domain, there have been significant efforts in LiDAR-only tracking [1, 6–9, 41, 42]. Many of these works use Kalman Filters, explicit motion models, and various distance metrics, including Intersection Over Union (IoU) and Mahalanobis distance, to complete data association. Additionally, many of these algorithms use low-dimensional bounding boxes as abstracted object representations, since it is challenging to extract expressive features from sparse LiDAR point clouds. However, recent works in this field attempt to better leverage the LiDAR data for improved results. SpOT [42] maintains an extended temporal history beyond the Markov assumption with LiDAR point clouds accumulated over several key frames. This proves to be especially beneficial for objects that experience occlusion like pedestrians. Additionally, ShaSTA [1] leverages intermediate representations of LiDAR data from a detection network's LiDAR backbone for richer information about each object, while also modeling the global relationship of all the objects in a scene to improve data association. We build on this paradigm by leveraging intermediate representations of both camera and LiDAR data from 2D and 3D detection networks' backbones, respectively, with a novel late fusion approach to get richer object-level representations.

In the multimodal 3D multi-object tracking domain for camera-LiDAR fusion, we see two main paradigms that are followed. On one hand, the tracking algorithms in this area process camera images to extract appearance features that are fused with LiDAR features for better tracking [5, 10, 14, 43]. On the other hand, an under-explored, yet effective method for developing such algorithms is by fusing information from 2D detection bounding boxes and 3D detection bounding boxes to improve tracking accuracy [12, 16]. The latter line of work often yields significantly higher accuracy gains for small objects like bicycles without incurring the high computational expense of image feature extraction. In this work, we do not treat these two approaches as mutually exclusive. This work is the first to our knowledge that fuses camera and LiDAR features in a neural network to enrich object-level geometric representations, while fusing information from 2D and 3D detection bounding boxes for a novel multimodal track confidence refinement technique during track formation.

III. SHASTA-FUSE

In this section, we begin by defining and offering intuition of the general affinity-based 3D multi-object tracking (MOT) framework developed in ShaSTA [1]. Then, we describe how we have effectively extended this affinity-based framework for multimodal 3D MOT.

A. Review of ShaSTA [1]: Affinity Estimation for LiDAR-Only 3D MOT

Given a set of unordered point sets generated by a LiDAR sensor from the current and previous time steps L_t and L_{t-1}, and 3D detection bounding boxes estimated by LiDAR-only 3D detectors from the current and previous time steps B_t and B_{t-1}, one can define a function f_i for each object class i that learns an affinity matrix A_t:

    f_i(L_{t-1}, L_t, B_{t-1}, B_t) = A_t    (1)

For N_max objects in a frame, the affinity matrix A_t ∈ R^{(N_max+2)×(N_max+2)} not only matches existing tracks and current frame detections to complete data association, but also has augmented columns and rows to handle special case situations. The two additional columns are for dead-track (DT) and false-negative (FN) anchors. This allows for existing tracks to match with either of these anchors to terminate the track or propagate it forward using its estimated velocity to compensate for an FN detection. Additionally, the two additional rows are for false-positive (FP) and newborn (NB) anchors, so extraneous detections can be eliminated as FPs or incoming detections can be initialized as NB tracks when they do not match with existing tracks. Instead of applying heuristics, we are matching to frame-specific anchors that have been learned by understanding the distribution of FPs, FNs, DTs, and NBs over the entire training set. For example, the FP anchor's bounding box center that ShaSTA learns for frame t is a centroid of where most FP detections are in that frame. Therefore, we expect the current frame's FP detections to match with the FP anchor in the affinity matrix.

While there is a unique one-to-one correspondence between bounding boxes in two consecutive frames, more than one current frame bounding box can match with the NB and FP anchors, and more than one existing track can match with the DT and FN anchors. To handle this situation, we define the forward and backward matching affinity matrices. Here, forward matching indicates that we want to find the best current frame match for each previous frame track, while backward matching refers to finding the best previous frame match for each current frame detection.
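To make the anchor-augmented affinity matrix concrete, the sketch below builds a toy A_t and reads off forward and backward matches. Placing the FP/NB anchors in the last two rows, the DT/FN anchors in the last two columns, and decoding with an argmax are illustrative assumptions rather than the exact ShaSTA decoding.

```python
import torch

n_max = 4  # illustrative cap on objects per frame (the paper uses 20-90 per class)

# Toy affinity matrix: rows = previous-frame tracks plus FP/NB anchor rows,
# columns = current-frame detections plus DT/FN anchor columns.
a_t = torch.rand(n_max + 2, n_max + 2)

# Forward matching: for each previous-frame track, drop the two anchor rows and
# normalize over current detections plus the DT/FN anchors.
a_fm = torch.softmax(a_t[:n_max, :], dim=1)          # (n_max, n_max + 2)

# Backward matching: for each current detection, drop the two anchor columns and
# normalize over previous tracks plus the FP/NB anchors.
a_bm = torch.softmax(a_t[:, :n_max], dim=0)          # (n_max + 2, n_max)

# Illustrative decoding of lifecycle decisions from the two matrices.
for track_idx in range(n_max):
    best = a_fm[track_idx].argmax().item()
    if best < n_max:
        print(f"track {track_idx}: matched to detection {best}")
    elif best == n_max:
        print(f"track {track_idx}: matched to DT anchor -> terminate")
    else:
        print(f"track {track_idx}: matched to FN anchor -> propagate with velocity")

for det_idx in range(n_max):
    best = a_bm[:, det_idx].argmax().item()
    if best < n_max:
        print(f"detection {det_idx}: associated with track {best}")
    elif best == n_max:
        print(f"detection {det_idx}: matched to FP anchor -> discard")
    else:
        print(f"detection {det_idx}: matched to NB anchor -> start new track")
```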
Fig. 1: Algorithm Flowchart. We first describe the two modules at the top of the figure before describing the main flowchart that utilizes these modules. In the Shape
Descriptor Extraction module, we show how we extract shape descriptors from the LiDAR point cloud using the pre-trained frozen LiDAR backbone from our off-the-shelf 3D
detector. This technique is also used in [1]. In the Appearance Cue Extraction module, we show how appearance cues are extracted for each projected 3D detection in an image.
Each blue bounding box represents a projected 3D CenterPoint [6] detection in the image. A pre-trained feature extraction backbone is used from our off-the-shelf 2D detector
to get the feature map, from which we do RoI pooling for each projected bounding box. Then we pass the RoI features into the Bounding Box Head, which we fine-tune to
get the intermediary feature representation before the final bounding box prediction. These intermediary features are used to represent the appearance of each detected object.
In the main flowchart, we use Nmax low-dimensional bounding boxes from the current and previous frames to learn bounding box representations for the FP, NB, FN, and
DT anchors. The FP and NB anchors are appended to previous frame bounding boxes to form the augmented bounding boxes B̂t−1 , while the FN and DT anchors are added
to the current frame detections to form the augmented bounding boxes B̂t . Similarly, we extract shape descriptors and appearance cues that are then used to learn shape and
appearance representations, respectively, for our anchors. Using the augmented bounding boxes, shape descriptors, and appearance cues, we find a residual that captures the
spatio-temporal and shape similarities between current and previous frame detections, as well as between detections and anchors. This residual is used to predict the affinity
matrix for probabilistic data associations, track lifecycle management, FP elimination, FN propagation, and track confidence refinement.
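As a concrete illustration of the augmented bounding boxes B̂_{t−1} and B̂_t described in the caption, the sketch below appends placeholder anchor boxes to each frame and forms the pairwise input that a residual MLP would consume. The 7-dimensional box parameterization and the random anchor values are assumptions for illustration.

```python
import torch

n_max, box_dim = 4, 7   # illustrative sizes; a 3D box as (x, y, z, w, l, h, yaw)

# Zero-padded per-frame detection boxes, as described in Section III.
boxes_prev = torch.randn(n_max, box_dim)
boxes_curr = torch.randn(n_max, box_dim)

# Learned anchor boxes for this frame pair (placeholders for the anchor MLP outputs).
fp_anchor, nb_anchor = torch.randn(1, box_dim), torch.randn(1, box_dim)
fn_anchor, dt_anchor = torch.randn(1, box_dim), torch.randn(1, box_dim)

# FP and NB anchors are appended to the previous frame, FN and DT to the current
# frame, giving the augmented sets used to form the (N_max+2)x(N_max+2) residual.
boxes_prev_aug = torch.cat([boxes_prev, fp_anchor, nb_anchor], dim=0)  # (n_max+2, box_dim)
boxes_curr_aug = torch.cat([boxes_curr, fn_anchor, dt_anchor], dim=0)  # (n_max+2, box_dim)

# Pairwise expansion: every augmented previous box paired with every augmented
# current box, ready for an MLP that predicts residuals / affinity logits.
pairwise = torch.cat([
    boxes_prev_aug.unsqueeze(1).expand(-1, n_max + 2, -1),
    boxes_curr_aug.unsqueeze(0).expand(n_max + 2, -1, -1),
], dim=-1)                                                             # (n_max+2, n_max+2, 2*box_dim)
print(pairwise.shape)
```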

Thus, we create a forward matching affinity matrix A_{t,fm} ∈ R^{N_max×(N_max+2)} by removing the two row augmentations from A_t and applying a row-wise softmax so that more than one previous frame track can have sufficient probability to match with the DT and FN anchors, respectively:

    A_{t,fm} = softmax_row(A_{t,1}).    (2)

Using similar logic, we create a backward matching affinity matrix A_{t,bm} ∈ R^{(N_max+2)×N_max} by removing the two column augmentations from A_t and applying a column-wise softmax so that more than one current frame detection can have sufficient probability to match with the NB and FP anchors, respectively:

    A_{t,bm} = softmax_col(A_{t,2}).    (3)

In order to learn the affinity matrix, we use supervised learning with the log affinity loss defined in [1]. For each time step t, we define the ground-truth affinity matrices A_{gt,fm} and A_{gt,bm} for forward matching and backward matching, respectively. Thus, our loss function L is defined as follows:

    L_fm = ( Σ_i Σ_j A_{gt,fm} ⊙ (−log A_fm) ) / ( Σ_i Σ_j A_{gt,fm} )    (4)

    L_bm = ( Σ_i Σ_j A_{gt,bm} ⊙ (−log A_bm) ) / ( Σ_i Σ_j A_{gt,bm} )    (5)

    L = (1/2) (L_fm + L_bm).    (6)
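A minimal sketch of the log affinity loss in Eqs. (4)-(6), assuming binary ground-truth matrices of the same shape as the softmaxed predictions and a small epsilon for numerical stability; it mirrors the formulas above rather than the released training code.

```python
import torch

def log_affinity_loss(a_fm, a_bm, a_gt_fm, a_gt_bm, eps=1e-8):
    """Average of the forward- and backward-matching log affinity losses.

    a_fm:    (N_max,   N_max+2) row-softmaxed forward-matching affinities.
    a_bm:    (N_max+2, N_max)   column-softmaxed backward-matching affinities.
    a_gt_*:  binary ground-truth matrices of matching shapes.
    """
    l_fm = (a_gt_fm * -torch.log(a_fm + eps)).sum() / a_gt_fm.sum()
    l_bm = (a_gt_bm * -torch.log(a_bm + eps)).sum() / a_gt_bm.sum()
    return 0.5 * (l_fm + l_bm)

# Toy usage with random predictions and a diagonal ground truth.
n_max = 4
a_fm = torch.softmax(torch.rand(n_max, n_max + 2), dim=1)
a_bm = torch.softmax(torch.rand(n_max + 2, n_max), dim=0)
a_gt_fm = torch.eye(n_max, n_max + 2)
a_gt_bm = torch.eye(n_max + 2, n_max)
print(log_affinity_loss(a_fm, a_bm, a_gt_fm, a_gt_bm))
```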
Finally, the information encoded in the affinity matrix prediction is used for sequential track confidence score refinement.

B. Camera-LiDAR Sensor Fusion to Estimate Affinities for Multimodal 3D MOT

We create ShaSTA-Fuse, the camera-LiDAR variation of ShaSTA, by integrating appearance cues into our affinity matrix estimation and using camera-based 2D detection information to create a multimodal track confidence refinement technique. These two additions constitute ShaSTA-Fuse, and we describe them below. A visualization of ShaSTA-Fuse can be found in Figure 1.

Appearance Cue Integration. We integrate camera sensor information into the ShaSTA framework by extracting appearance cues from images.

The autonomous vehicles in the nuScenes benchmark are equipped with 6 camera sensors and 1 LiDAR sensor. Therefore, for each time step, we have 1 LiDAR point cloud and 6 images. Since the 3D detections are obtained with the LiDAR sensor in the world coordinate system, we map each 3D detection to all 6 image planes using each of the 6 camera transforms to see which camera the 3D detection corresponds to. If a projected 3D detection appears in more than one image plane, we choose the image in which it has the largest area with its projected bounding box defined by b_proj := (x_img, y_img, w, h). For any time step t, we define the set of projected 3D detections as B_proj,t ∈ R^{N_max×4}, where we have up to N_max objects per frame. If there are fewer than N_max objects in a frame, then we zero-pad the remaining entries, and if there are more than N_max objects then we sample the top N_max detections.
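The sketch below illustrates the camera-selection rule just described: project the 8 corners of a 3D box into each camera, clip to the image, and keep the view with the largest projected area. The pinhole model, the corner-based projection, and the behind-camera check are simplifying assumptions; nuScenes supplies the actual calibrated transforms.

```python
import numpy as np

def project_corners(corners_3d, cam_from_world, intrinsics, img_w, img_h):
    """Project 3D box corners (8, 3) into one camera; return (x, y, w, h) or None."""
    pts = cam_from_world[:3, :3] @ corners_3d.T + cam_from_world[:3, 3:4]  # (3, 8)
    if (pts[2] <= 0.1).any():            # corner behind the camera -> skip this view
        return None
    uv = intrinsics @ pts
    uv = uv[:2] / uv[2]
    x1, y1 = uv[0].min(), uv[1].min()
    x2, y2 = uv[0].max(), uv[1].max()
    x1, y1 = max(x1, 0.0), max(y1, 0.0)
    x2, y2 = min(x2, img_w), min(y2, img_h)
    if x2 <= x1 or y2 <= y1:             # projection falls entirely outside the image
        return None
    return (x1, y1, x2 - x1, y2 - y1)

def best_camera_projection(corners_3d, cameras):
    """Pick the camera where the projected box has the largest area."""
    best, best_area = None, 0.0
    for cam_id, (cam_from_world, intrinsics, img_w, img_h) in cameras.items():
        proj = project_corners(corners_3d, cam_from_world, intrinsics, img_w, img_h)
        if proj is not None and proj[2] * proj[3] > best_area:
            best, best_area = (cam_id, proj), proj[2] * proj[3]
    return best   # (camera id, b_proj = (x_img, y_img, w, h)) or None
```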
For each time step, we pass the 6 images into a pre-trained Cascade R-CNN detector [20]. After the feature map is created for each image, we forgo the RPN, and treat the projected 3D detections B_proj,t as our RoIs. Thus, we perform RoI pooling with the projected 3D detections B_proj,t, and pass the extracted features corresponding to each RoI into our Bounding Box Head. Since the Bounding Box Head outputs bounding box predictions, we only use its initial layers that process the RoIs to get our appearance cues, which are F_C-dimensional. We do not freeze the Bounding Box Head layers, and allow them to be fine-tuned, since the RoIs we pass in are projections from the 3D detector and not Cascade R-CNN detections. For each pair of frames at the current and previous time steps t and t−1, we obtain appearance cues C_t ∈ R^{N_max×F_C} and C_{t−1} ∈ R^{N_max×F_C}, respectively.

Then, we create a learned appearance cue representation at time step t for FP and NB anchors using the current frame appearance cues C_t as follows, where each σ represents an MLP:

    c_fp = σ^c_fp(C_t)    (7)
    c_nb = σ^c_nb(C_t).    (8)

Similarly, we find learned appearance cue representations for FN and DT anchors using previous frame appearance cues C_{t−1}:

    c_fn = σ^c_fn(C_{t−1})    (9)
    c_dt = σ^c_dt(C_{t−1}).    (10)

We then concatenate c_fp ∈ R^{F_C} and c_nb ∈ R^{F_C} to C_{t−1} to get Ĉ_{t−1} ∈ R^{(N_max+2)×F_C}, as well as c_fn ∈ R^{F_C} and c_dt ∈ R^{F_C} to C_t to get Ĉ_t ∈ R^{(N_max+2)×F_C}.

We use Ĉ_t and Ĉ_{t−1} to obtain the learned appearance cue residual R_c between the two frames. The residual measures the similarities between current and previous frames' appearance cue representations. We expand Ĉ_{t−1} and Ĉ_t and concatenate them to get a matrix Ĉ ∈ R^{(N_max+2)×(N_max+2)×2F_C}. Then, we obtain our residual R_c ∈ R^{(N_max+2)×(N_max+2)} with an MLP:

    R_c = σ^c_r(Ĉ).    (11)

In addition to the appearance cue residual R_c, we use the VoxelNet, bounding box, and shape residuals R_v, R_b, and R_s that encode shape and spatio-temporal information from the LiDAR sensor. These residuals are formed in an analogous way to the appearance cue residual R_c. We refer readers to [1] for further details, as well as Figure 1 which shows how we incorporate all of the residuals.

The overall residual R is a weighted sum of the VoxelNet, bounding box, and shape residuals we obtained in ShaSTA, known as R_v, R_b, and R_s, and the new appearance cue residual we obtain R_c. We concatenate B̂, Ŝ, and Ĉ to create our input Ŵ ∈ R^{(N_max+2)×(N_max+2)×(2F_S+2F_C+6)}. Then, we pass this through an MLP σ_α to get the learned weights α ∈ R^{(N_max+2)×(N_max+2)×4} as follows:

    α = σ_α(Ŵ).    (12)

We split α into matrices α_v, α_b, α_s, α_c ∈ R^{(N_max+2)×(N_max+2)}, and obtain our overall residual R ∈ R^{(N_max+2)×(N_max+2)}:

    R = α_v ⊙ R_v + α_b ⊙ R_b + α_s ⊙ R_s + α_c ⊙ R_c.    (13)

Note that ⊙ is the Hadamard product.

Given our overall residual, we have information about the pairwise similarities between current frame detections and previous frame tracks, as well as each of the four augmented anchors. In order to leverage these spatio-temporal and shape relationships to learn probabilities for matching the detections and tracks to each other or the anchors, we apply an MLP σ_aff to get our overall affinity matrix A_t for time step t:

    A_t = σ_aff(R).    (14)

Multimodal Track Confidence Refinement. We build off of the success of the LiDAR-only sequential track confidence refinement from [1] by integrating both camera and LiDAR information at this stage.

For the sake of completeness, we remind readers that track confidence is a relative measure of track quality with respect to objects of the same class in a given time step.
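A minimal sketch of the learned residual weighting in Eqs. (12)-(14): an MLP predicts four per-pair weights that gate the VoxelNet, bounding-box, shape, and appearance residuals before the affinity head. The layer sizes and the use of plain nn.Sequential MLPs for σ_α and σ_aff are assumptions for illustration.

```python
import torch
import torch.nn as nn

n_max, f_s, f_c = 4, 32, 64   # illustrative sizes standing in for N_max, F_S, F_C
pair_shape = (n_max + 2, n_max + 2)

# Residuals from Eq. (11) and its LiDAR analogues: VoxelNet, box, shape, appearance.
r_v, r_b, r_s, r_c = (torch.randn(*pair_shape) for _ in range(4))

# Pairwise input W_hat built from concatenated box, shape, and appearance features.
w_hat = torch.randn(*pair_shape, 2 * f_s + 2 * f_c + 6)

sigma_alpha = nn.Sequential(nn.Linear(2 * f_s + 2 * f_c + 6, 64), nn.ReLU(), nn.Linear(64, 4))
sigma_aff = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

alpha = sigma_alpha(w_hat)                            # (N_max+2, N_max+2, 4), Eq. (12)
alpha_v, alpha_b, alpha_s, alpha_c = alpha.unbind(dim=-1)

# Eq. (13): Hadamard-weighted sum of the four residuals.
r = alpha_v * r_v + alpha_b * r_b + alpha_s * r_s + alpha_c * r_c

# Eq. (14): per-pair affinity logits from the fused residual.
a_t = sigma_aff(r.unsqueeze(-1)).squeeze(-1)          # (N_max+2, N_max+2)
print(a_t.shape)
```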
Just as was done in [1], we treat track confidence score refinement as a sequential process that reflects changing environmental factors. For example, if a track has been maintained for an extended period of time but the object suddenly experiences occlusion, we do not let the confidence abruptly drop; rather, we take into account that history and use it to inform our confidence scores.

In order to extend this technique to camera-LiDAR sensor fusion, we not only leverage information encoded in our learned affinity matrix just as in [1], but we also use two detectors. For each time step t, we use the 2D detections from Cascade R-CNN [20], B_2D,t, and the projected 3D detections B_proj,t that we created in the previous section. Additionally, for each object class, we choose an Intersection over Union (IoU) threshold τ_iou. For each camera frame, we get the pairwise IoUs between B_2D,t and B_proj,t. We greedily match 2D detections and projected 3D detections that achieve an IoU over the threshold τ_iou. Thus, each detection from B_2D,t is matched with at most one detection from B_proj,t, and vice versa.

For each 3D detection b_i^(t) at time step t that matches with an existing track, we add the index i to the set I_matched for matched detections. Similarly, for each 3D detection b_i^(t) that is designated as a newborn, we add the index i to the set I_newborn for detections that should be initialized as newborn tracks. For each 3D detection b_i^(t), we have the probability that it matched to the false-positive anchor, P_FP,i, which is learned in the affinity matrix A_bm. Note that the probability that a 3D detection is true-positive is defined as P_TP,i = 1 − P_FP,i.

We then perform our multimodal track confidence refinement as shown in Algorithm 1. Note that we fix β_1 to be a value slightly less than τ_fp to account for very uncertain detections that have been matched to tracks, and we effectively reduce the overall track confidence for tracks matching to such detections. However, if the matched detection has a very high probability of being true-positive, then the track confidence c_trk,i^(t) becomes a weighted average between c_det,i^(t) and c_trk,i^(t−1). We set β_1 = 0.5 for all object classes, and β_2 = 0.5 for all object classes except for bicycles and cars (β_2 = 0.4), bus (β_2 = 0.7), and trailer (β_2 = 0.4). Additionally, we define the function f̂ as follows:

    f̂(x, IoU_i, τ_iou) = min( (IoU_i / τ_iou) · x, 1 ).    (15)

This function is meant to reward 3D detections that match with a 2D detection based on the IoU thresholding by increasing their confidence scores, increasing the weight they receive in the weighted average, and increasing their probability of being true-positive to avoid reducing the overall confidence of the tracks they match to.

Algorithm 1: Multimodal Track Confidence Refinement
    Input: I_newborn, I_matched, τ_iou, β_1, β_2
    for i ∈ I_matched do
        β_2,i ← β_2
        if IoU_i^(t) > τ_iou then
            c_det,i^(t) ← f̂( max(c_3D-det,i^(t), c_2D-det,i^(t)), IoU_i^(t), τ_iou )
            β_2,i ← f̂( β_2, IoU_i^(t), τ_iou )
            P_FP,i^(t) ← 1 − f̂( P_TP,i, IoU_i^(t), τ_iou )
        end if
        c_trk,i^(t) ← 1[P_FP,i^(t) < β_1] β_2,i c_det,i^(t) + (1 − β_2,i) c_trk,i^(t−1)
    end for
    for i ∈ I_newborn do
        β_2,i ← β_2
        if IoU_i^(t) > τ_iou then
            c_det,i^(t) ← f̂( max(c_3D-det,i^(t), c_2D-det,i^(t)), IoU_i^(t), τ_iou )
            β_2,i ← f̂( β_2, IoU_i^(t), τ_iou )
            P_FP,i^(t) ← 1 − f̂( P_TP,i, IoU_i^(t), τ_iou )
        end if
        c_trk,i^(t) ← 1[P_FP,i^(t) < β_1] β_2,i c_det,i^(t)
    end for
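A hedged sketch of the per-detection update in Algorithm 1 and Eq. (15); it assumes the matched/newborn bookkeeping and the 2D-3D IoU are given, falls back to the 3D detection confidence when there is no 2D match, and follows the indicator placement exactly as written above.

```python
def f_hat(x, iou, tau_iou):
    """Eq. (15): reward detections whose 2D and 3D boxes agree, capped at 1."""
    return min(iou / tau_iou * x, 1.0)

def refine_track_confidence(c_3d_det, c_2d_det, iou, p_tp, c_trk_prev,
                            tau_iou, beta1, beta2, is_newborn):
    """One step of the multimodal track confidence refinement (Algorithm 1 sketch)."""
    beta2_i = beta2
    c_det = c_3d_det          # assumption: no 2D match -> use the 3D detection confidence
    p_fp = 1.0 - p_tp         # P_FP from the backward-matching affinity matrix
    if iou > tau_iou:
        c_det = f_hat(max(c_3d_det, c_2d_det), iou, tau_iou)
        beta2_i = f_hat(beta2, iou, tau_iou)
        p_fp = 1.0 - f_hat(p_tp, iou, tau_iou)
    keep = 1.0 if p_fp < beta1 else 0.0
    if is_newborn:
        return keep * beta2_i * c_det
    return keep * beta2_i * c_det + (1.0 - beta2_i) * c_trk_prev

# Toy usage: a matched track whose projected 3D box overlaps a 2D detection well.
print(refine_track_confidence(c_3d_det=0.6, c_2d_det=0.8, iou=0.75, p_tp=0.9,
                              c_trk_prev=0.7, tau_iou=0.6, beta1=0.5, beta2=0.5,
                              is_newborn=False))
```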
IV. DATA PREPARATION

Our tracking performance is evaluated using the nuScenes benchmark [2]. This dataset is comprised of 700 training scenes, 150 validation scenes, and 150 test scenes. Each scene is 20 seconds long. Additionally, there are 7 object classes that we evaluate: bicycles, buses, cars, motorcycles, pedestrians, trailers, and trucks.

In the nuScenes benchmark, the key frames from the camera are generated with a 2Hz frame-rate. We only use the key frames for our image data, as the input detections from CenterPoint [6] and Cascade R-CNN [20] are provided for the key frames only. Since the LiDAR sensor uses a sampling rate of 20Hz, we accumulate 10 LiDAR sweeps to generate the point cloud in order to match the camera sampling rate. This also makes our LiDAR point clouds become a denser 4D point cloud with an added temporal dimension.

Moreover, we form our ground-truth affinity matrices for training using the procedure described in [1].

V. EXPERIMENTAL RESULTS

This section provides an overview of evaluation metrics, training details, comparisons against state-of-the-art 3D multi-object tracking techniques, and an ablation study to analyze our technique.
TABLE I: Comparison with Multimodal Trackers. We compare our results with other multimodal trackers that use CenterPoint detections. Evaluation is on the nuScenes validation set in terms of overall AMOTA and AMOTP for all object classes. The last row is our method.

Tracking Method     | Feature Modalities | Detection Modalities | Detector(s)                 | AMOTA ↑ | AMOTP ↓
MultimodalTracking  | 2D+3D              | 3D                   | CenterPoint                 | 68.7    | -
InterTrack          | 2D+3D              | 3D                   | CenterPoint                 | 72.1    | 0.566
EagerMOT            | N/A                | 2D+3D                | CenterPoint + Cascade R-CNN | 71.2    | 0.569
ShaSTA-Fuse (Ours)  | 2D+3D              | 2D+3D                | CenterPoint + Cascade R-CNN | 75.6    | 0.548

TABLE II: Ablation on Camera-LiDAR Fusion. We assess the effect of different camera-LiDAR fusion techniques on tracking accuracy. First, we begin with LiDAR-only shape and spatio-temporal information without any form of confidence refinement. In the second row, we fuse appearance features extracted from the camera's image data with the LiDAR information from the previous row without any confidence refinement. Then, in the third row, we have the fused camera-LiDAR features but only use information from the LiDAR-based 3D detections for the confidence refinement. Finally, in the last row we integrate 2D and 3D detection information for the full multimodal confidence refinement with multimodal feature extraction. Evaluation is on the nuScenes validation set in terms of overall AMOTA for a subset of object classes. The last row is our full method.

Method                                | Detector(s)                 | Bicycle | Car  | Motorcycle | Pedestrian
LiDAR-only w/o Confidence Refinement  | CenterPoint                 | 49.4    | 85.3 | 70.1       | 80.9
+ Appearance Residual                 | CenterPoint                 | 49.8    | 85.3 | 71.7       | 80.9
+ LiDAR-only Confidence Refinement    | CenterPoint                 | 57.7    | 85.7 | 73.6       | 81.4
+ Multimodal Confidence Refinement    | CenterPoint + Cascade R-CNN | 72.3    | 85.6 | 78.4       | 82.8

A. Evaluation Metrics

The primary metrics for evaluating nuScenes are Average Multi-Object Tracking Accuracy (AMOTA) and Average Multi-Object Tracking Precision (AMOTP) [3].

AMOTA averages the recall-weighted MOTA, known as MOTAR, at n evenly spaced recall thresholds. Note that MOTA is a metric that penalizes FPs, FNs, and ID switches (IDS). GT indicates the number of ground-truth tracklets. In essence, AMOTA measures the tracker's ability to correctly track objects of interest.

    AMOTA = (1 / (n−1)) Σ_{r ∈ {1/(n−1), 1/(n−2), ..., 1}} MOTAR    (16)

    MOTAR = max( 0, 1 − (IDS_r + FP_r + FN_r − (1 − r) · GT) / (r · GT) )    (17)

AMOTP averages MOTP over n evenly spaced recall thresholds. For the definition of AMOTP below, d_{i,t} indicates the distance error for track i at time step t and TP_t indicates the number of true-positive (TP) matches at time step t. Since MOTP measures the misalignment between ground-truth and predicted bounding boxes, AMOTP is effectively measuring the quality or precision of our estimated tracks in terms of their distance from the ground-truth tracks.

    AMOTP = (1 / (n−1)) Σ_{r ∈ {1/(n−1), 1/(n−2), ..., 1}} ( Σ_{i,t} d_{i,t} / Σ_t TP_t )    (18)

In the nuScenes benchmark [2], estimated tracks within an L2 distance of 2m from ground-truth tracks are defined as TP. This margin of error is impractical for deployment in real-world driving scenarios. Thus, we emphasize AMOTP as well due to its safety implications even though the nuScenes leaderboard ranking is purely based on AMOTA.
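The sketch below evaluates Eqs. (16)-(17) from per-recall-threshold error counts; the toy numbers are invented, and the official nuScenes devkit should be used for any real evaluation.

```python
def motar(ids, fp, fn, gt, r):
    """Eq. (17): recall-normalized MOTA at a single recall threshold r."""
    return max(0.0, 1.0 - (ids + fp + fn - (1.0 - r) * gt) / (r * gt))

def amota(counts_per_recall, gt):
    """Eq. (16): average MOTAR over the evaluated recall thresholds.

    counts_per_recall maps each recall threshold r to its (IDS_r, FP_r, FN_r) counts;
    with n thresholds in total, the average uses the 1/(n-1) normalization of Eq. (16).
    """
    n = len(counts_per_recall) + 1
    total = sum(motar(ids, fp, fn, gt, r)
                for r, (ids, fp, fn) in counts_per_recall.items())
    return total / (n - 1)

# Toy example: three recall thresholds for a class with 100 ground-truth tracklets.
counts = {0.5: (2, 10, 60), 0.75: (4, 25, 30), 1.0: (6, 80, 5)}
print(amota(counts, gt=100))
```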

B. Training Specifications

We train a different network for each object category following [1, 10, 42]. The value of N_max depends on the object type since some classes are more common than others; this value ranges from 20 to 90 depending on the object class. We use the pre-trained VoxelNet LiDAR backbone from [6] and the pre-trained Cascade R-CNN detector [20]. The LiDAR backbone is frozen, and the Cascade R-CNN is frozen except for the Bounding Box Head that we fine-tune for this task. Additionally, due to the class imbalance between FPs and TPs, we downsample the number of FP detections during training. Finally, we fix the thresholds to τ_fp = 0.7 for FP elimination, τ_fn = 0.5 for FN propagation, τ_nb = 0.5 for NB initialization, and τ_dt = 0.5 for DT termination across all object classes. For τ_iou, we use 0.4 for small objects (bicycle, motorcycle, pedestrian), 0.6 for cars and trucks, and 0.9 for buses and trailers.

C. Comparison with State-of-the-Art Multimodal Tracking

Our nuScenes validation results can be seen in Table I. Evidently, our technique gives significant boosts for tracking accuracy, while retaining the best precision. Thus, ShaSTA-Fuse is effective in identifying correct tracks, while ensuring precise track localization for increased safety.

D. Ablation Studies

In this section, we conduct ablative analysis on our tracking algorithm to assess the benefits of camera-LiDAR sensor fusion. In Table II, we examine the effects of adding different types of camera information. It is evident that fusing LiDAR and camera information through combining 2D and 3D detectors yields far more gains in accuracy than extracting appearance cues from the camera images in a neural network. This is consistent with results we have seen in past literature on this topic, where fusing LiDAR and camera features in a neural network can yield as little as less than 0.1% improvement in accuracy for each object [5, 10], while bounding box fusion yields significant gains, particularly for objects like bicycles [12, 16].

VI. CONCLUSION

In summary, ShaSTA-Fuse extends the state-of-the-art affinity-based 3D MOT framework to accommodate multimodal camera-LiDAR tracking. This work explores pertinent questions about the benefits of camera-LiDAR sensor fusion, particularly the tradeoff between accuracy and computational efficiency. Based on our experimental results and the outcomes of past works in this field, we have found that the inexpensive fusion of abstracted camera-based 2D detections and LiDAR-based 3D detections is far more computationally efficient, while yielding significantly higher gains in accuracy. Thus, we encourage the community to focus more on developing such techniques in this under-explored aspect of multimodal fusion.
REFERENCES

[1] T. Sadjadpour, J. Li, R. Ambrus, and J. Bohg, "Shasta: Modeling shape and spatio-temporal affinities for 3d multi-object tracking," arXiv preprint arXiv:2211.03919, 2022.
[2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "nuscenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[3] X. Weng and K. Kitani, "A baseline for 3d multi-object tracking," arXiv preprint arXiv:1907.03961, vol. 1, no. 2, p. 6, 2019.
[4] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe, "Hota: A higher order metric for evaluating multi-object tracking," International Journal of Computer Vision, vol. 129, pp. 548–578, 2021.
[5] L. Wang, X. Zhang, W. Qin, X. Li, L. Yang, Z. Li, L. Zhu, H. Wang, J. Li, and H. Liu, "Camo-mot: Combined appearance-motion optimization for 3d multi-object tracking with camera-lidar fusion," arXiv preprint arXiv:2209.02540, 2022.
[6] T. Yin, X. Zhou, and P. Krahenbuhl, "Center-based 3d object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
[7] H.-k. Chiu, A. Prioletti, J. Li, and J. Bohg, "Probabilistic 3d multi-object tracking for autonomous driving," arXiv preprint arXiv:2001.05673, 2020.
[8] M. Liang and F. Meyer, "Neural enhanced belief propagation for data association in multiobject tracking," in 2022 25th International Conference on Information Fusion (FUSION). IEEE, 2022, pp. 1–7.
[9] J.-N. Zaech, A. Liniger, D. Dai, M. Danelljan, and L. Van Gool, "Learnable online graph representations for 3d multi-object tracking," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5103–5110, 2022.
[10] H.-k. Chiu, J. Li, R. Ambruş, and J. Bohg, "Probabilistic 3d multi-modal, multi-object tracking for autonomous driving," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 14227–14233.
[11] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, "Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation," arXiv preprint arXiv:2205.13542, 2022.
[12] A. Kim, A. Ošep, and L. Leal-Taixé, "Eagermot: 3d multi-object tracking via sensor fusion," in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11315–11321.
[13] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3d object detection network for autonomous driving," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1907–1915.
[14] J. Willes, C. Reading, and S. L. Waslander, "Intertrack: Interaction transformer for 3d multi-object tracking," arXiv preprint arXiv:2208.08041, 2022.
[15] A. Osep, W. Mehner, M. Mathias, and B. Leibe, "Combined image- and world-space tracking in traffic scenes," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1988–1995.
[16] X. Wang, C. Fu, Z. Li, Y. Lai, and J. He, "Deepfusionmot: A 3d multi-object tracking framework based on camera-lidar fusion with deep association," arXiv preprint arXiv:2202.12100, 2022.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[18] R. Girshick, "Fast r-cnn," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[19] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[20] Z. Cai and N. Vasconcelos, "Cascade r-cnn: Delving into high quality object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6154–6162.
[21] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[22] D. Z. Wang and I. Posner, "Voting for voting in online point cloud object detection," in Robotics: Science and Systems, vol. 1, no. 3, Rome, Italy, 2015, pp. 10–15.
[23] S. Song and J. Xiao, "Sliding shapes for 3d object detection in depth images," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13. Springer, 2014, pp. 634–651.
[24] S. Song and J. Xiao, "Deep sliding shapes for amodal 3d object detection in rgb-d images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 808–816.
[25] B. Li, "3d fully convolutional network for vehicle detection in point cloud," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1513–1518.
[26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "Pointnet: Deep learning on point sets for 3d classification and segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 652–660.
[27] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "Pointnet++: Deep hierarchical feature learning on point sets in a metric space," Advances in Neural Information Processing Systems, vol. 30, 2017.
[28] Y. Zhou and O. Tuzel, "Voxelnet: End-to-end learning for point cloud based 3d object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
[29] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 1355–1361.
[30] B. Yang, W. Luo, and R. Urtasun, "Pixor: Real-time 3d object detection from point clouds," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7652–7660.
[31] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "Pointpillars: Fast encoders for object detection from point clouds," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.
[32] Y. Wang and J. M. Solomon, "Object dgcnn: 3d object detection using dynamic graphs," Advances in Neural Information Processing Systems, vol. 34, pp. 20745–20758, 2021.
[33] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, "Transfusion: Robust lidar-camera fusion for 3d object detection with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1090–1099.
[34] X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, "Futr3d: A unified sensor fusion framework for 3d detection," arXiv preprint arXiv:2203.10642, 2022.
[35] R. Nabati and H. C. Qi, "Center-based radar and camera fusion for 3d object detection," arXiv preprint arXiv:2011.04841, 2020.
[36] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum pointnets for 3d object detection from rgb-d data," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 918–927.
[37] Z. Wang and K. Jia, "Frustum convnet: Sliding frustums to aggregate local point-wise features for amodal 3d object detection," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 1742–1749.
[38] S. Vora, A. H. Lang, B. Helou, and O. Beijbom, "Pointpainting: Sequential fusion for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4604–4612.
[39] C. Wang, C. Ma, M. Zhu, and X. Yang, "Pointaugmenting: Cross-modal augmentation for 3d object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11794–11803.
[40] T. Yin, X. Zhou, and P. Krähenbühl, "Multimodal virtual point 3d detection," Advances in Neural Information Processing Systems, vol. 34, pp. 16494–16507, 2021.
[41] X. Weng, J. Wang, D. Held, and K. Kitani, "3D multi-object tracking: A baseline and new evaluation metrics," IROS, 2020.
[42] C. Stearns, D. Rempe, J. Li, R. Ambrus, S. Zakharov, V. Guizilini, Y. Yang, and L. J. Guibas, "Spot: Spatiotemporal modeling for 3d object tracking," arXiv preprint arXiv:2207.05856, 2022.
[43] X. Weng, Y. Wang, Y. Man, and K. Kitani, "Gnn3dmot: Graph neural network for 3d multi-object tracking with multi-feature learning," in CVPR, 2020.
