Fast and Furious
1. Introduction
Modern approaches to self-driving divide the problem into four steps: detection, object tracking, motion forecasting and motion planning. A cascade approach is typically used, where the output of the detector is used as input to the tracker, and its output is fed to a motion forecasting algorithm that estimates where traffic participants are going to move in the next few seconds. This is in turn fed to the motion planner that estimates the final trajectory of the ego-car. These modules are usually learned independently, and uncertainty is rarely propagated between them. This can result in catastrophic failures, as downstream processes cannot recover from errors that appear at the beginning of the pipeline.

In contrast, in this paper we propose an end-to-end fully convolutional approach that performs simultaneous 3D detection, tracking and motion forecasting by exploiting spatio-temporal information captured by a 3D sensor. We argue that this is important, as tracking and prediction can help object detection. For example, leveraging tracking and prediction information can reduce detection false negatives when dealing with occluded or far away objects. False positives can also be reduced by accumulating evidence over time. Furthermore, our approach is very efficient, as it shares computations between all these tasks. This is extremely important for autonomous driving, where latency can be fatal.

We take advantage of 3D sensors and design a network that operates on a bird's-eye view (BEV) of the 3D world. This representation respects the 3D nature of the sensor data, making the learning process easier as the network can exploit priors about the typical sizes of objects. Our approach is a one-stage detector that takes a 4D tensor created from multiple consecutive temporal frames as input and performs 3D convolutions over space and time to extract accurate 3D bounding boxes. Our model produces bounding boxes not only at the current frame but also multiple timestamps into the future. We decode the tracklets from these predictions with a simple pooling operation that combines the evidence from past and current predictions.

We demonstrate the effectiveness of our model on a very large scale dataset captured from multiple vehicles driving in North America, and show that our approach significantly outperforms the state-of-the-art. Furthermore, all tasks take as little as 30 ms.
2. Related Work

2D Object Detection: Over the last few years, many methods that exploit convolutional neural networks to produce accurate 2D object detections, typically from a single image, have been developed. These approaches typically fall into two categories depending on whether they exploit a first step dedicated to creating object proposals. Modern two-stage detectors [24, 8, 4, 11] utilize region proposal networks (RPNs) to learn the regions of interest (RoIs) where potential objects are located. In a second stage, the final bounding box locations are predicted from features that are average-pooled over the proposal RoI. Mask R-CNN [8] also took this approach, but used RoI-aligned features, addressing the boundary and quantization effects of RoI pooling. Furthermore, they added an additional segmentation branch to take advantage of dense pixel-wise supervision, achieving state-of-the-art results on both 2D image detection and instance segmentation. On the other hand, one-stage detectors skip the proposal generation step and instead learn a network that directly produces object bounding boxes. Notable examples are YOLO [23], SSD [17] and RetinaNet [16]. One-stage detectors are computationally very appealing and are typically real-time, especially with the help of recently proposed architectures, e.g., MobileNet [10] and SqueezeNet [12]. One-stage detectors were outperformed significantly by two-stage approaches until Lin et al. [16] showed state-of-the-art results by exploiting a focal loss and dense predictions.

3D Object Detection: In robotics applications such as autonomous driving, we are interested in detecting objects in 3D space. The ideas behind modern 2D image detectors can be transferred to 3D object detection. Chen et al. [2] used stereo images to perform 3D detection. Li [15] used 3D point cloud data and proposed to use 3D convolutions on a voxelized representation of point clouds. Chen et al. [3] combined images and 3D point clouds with a fusion network. They exploited 2D convolutions in BEV; however, they used hand-crafted height features as input. They achieved promising results on KITTI [6] but only ran at 360 ms per frame due to heavy feature computation on both 3D point clouds and images. This is very slow, particularly if we are interested in extending these techniques to handle temporal data.

Object Tracking: Over the past few decades, many approaches have been developed for object tracking. In this section, we briefly review the use of deep learning in tracking. Pretrained CNNs were used to extract features and perform tracking with correlation [18] or regression [29, 9]. Wang and Yeung [30] used an autoencoder to learn a good feature representation that helps tracking. Tao et al. [27] used siamese matching networks to perform tracking. Nam and Han [21] finetuned a CNN at inference time to track objects within the same video.

Motion Forecasting: This is the problem of predicting where each object will be in the future given multiple past frames. Lee et al. [14] proposed to use recurrent networks for long-term prediction. Alahi et al. [1] used LSTMs to model the interactions between pedestrians and perform prediction accordingly. Ma et al. [19] proposed to utilize concepts from game theory to model the interactions between pedestrians while predicting future trajectories. Some work has also focused on short-term prediction of dynamic objects [7, 22]. [28] performed prediction of dense pixel-wise short-term trajectories using variational autoencoders. [26, 20] focused on predicting the next future frames given a video, without explicitly reasoning about per-pixel motion.

Multi-task Approaches: Feichtenhofer et al. [5] proposed to do detection and tracking jointly from video. They model the displacement of corresponding objects between two input images during training and decode them into object tubes at inference time.

Different from all the above work, in this paper we propose a single network that takes advantage of temporal information and tackles the problems of 3D detection, short-term motion forecasting and tracking in the scenario of autonomous driving. While temporal information provides us with important cues for motion forecasting, holistic reasoning allows us to better propagate uncertainty throughout the network, improving our estimates. Importantly, our model is highly efficient and runs in real time at 33 FPS.

3. Joint 3D Detection, Tracking and Motion Forecasting

In this work, we focus on detecting objects by exploiting a sensor which produces 3D point clouds. Towards this goal, we develop a one-stage detector which takes multiple frames as input and produces detections, tracking and short-term motion forecasting of the objects' trajectories into the future. Our input representation is a 4D tensor encoding an occupancy grid of the 3D space over several time frames. We exploit 3D convolutions over space and time to produce fast and accurate predictions. As point cloud data is inherently sparse in 3D space, our approach saves a lot of computation as compared to doing 4D convolutions over 3D space and time. We name our approach Fast and Furious (FaF), as it is able to create very accurate estimates in as little as 30 ms.
In the following, we first describe our data parameterization in Sec. 3.1, including voxelization and how we incorporate temporal information. In Sec. 3.2 we present our model's architecture, followed by the objective we use for training the network (Sec. 3.3).

3.1. Data Parameterization

In this section, we first describe our single-frame representation of the world. We then extend our representation to exploit multiple frames.

Voxel Representation: In contrast to image detection, where the input is a dense RGB image, point cloud data is inherently sparse and provides geometric information about the 3D scene. In order to get a representation where convolutions can be easily applied, we quantize the 3D world to form a 3D voxel grid. We then assign a binary indicator to each voxel encoding whether the voxel is occupied; we say a voxel is occupied if there exists at least one point in the voxel's 3D space. As the grid is a regular lattice, convolutions can be directly used. We do not utilize 3D convolutions on our single-frame representation, as this operation would waste most of the computation since the grid is very sparse, i.e., most of the voxels are not occupied. Instead, we perform 2D convolutions and treat the height dimension as the channel dimension. This allows the network to learn to extract information along the height dimension. This contrasts with approaches such as MV3D [3], which quantize on the x-y plane and generate a representation of the z-dimension by computing hand-crafted height statistics. Note that if our grid's resolution is high, our approach is equivalent to applying a convolution on every single point without losing any information. We refer the reader to Fig. 2 for an illustration of how we construct the 3D tensor from 3D point cloud data.

Figure 2: Voxel representation: using height directly as the input feature.
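To make the voxelization concrete, below is a minimal sketch of the binary occupancy encoding described above. The grid extents, the 0.2 m resolution, and the function name are illustrative assumptions rather than the paper's exact values.

```python
import numpy as np

def voxelize(points, x_range=(0.0, 144.0), y_range=(-40.0, 40.0),
             z_range=(-2.0, 3.5), resolution=0.2):
    """Quantize an (N, 3) point cloud into a binary occupancy grid.

    Returns a (Z, Y, X) uint8 tensor; the height (Z) axis is later
    treated as the channel dimension so that 2D convolutions can be
    applied on the x-y plane.
    """
    mins = np.array([x_range[0], y_range[0], z_range[0]])
    maxs = np.array([x_range[1], y_range[1], z_range[1]])
    # Keep only the points that fall inside the region of interest.
    mask = np.all((points >= mins) & (points < maxs), axis=1)
    idx = ((points[mask] - mins) / resolution).astype(np.int64)
    dims = np.ceil((maxs - mins) / resolution).astype(np.int64)
    grid = np.zeros((dims[2], dims[1], dims[0]), dtype=np.uint8)
    # A voxel is occupied if it contains at least one point.
    grid[idx[:, 2], idx[:, 1], idx[:, 0]] = 1
    return grid
```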
Adding Temporal Information: In order to perform motion forecasting, it is crucial to consider temporal information. Towards this goal, we take all the 3D points from the past n frames and perform a change of coordinates to represent them in the current vehicle coordinate system. This is important in order to undo the ego-motion of the vehicle where the sensor is mounted. After performing this transformation, we compute the voxel representation for each frame. Now that each frame is represented as a 3D tensor, we can append multiple frames along a new temporal dimension to create a 4D tensor. This not only provides more 3D points as a whole, but also gives cues about each vehicle's heading and velocity, enabling us to do motion forecasting. As shown in Fig. 3, where for visualization purposes we overlay multiple frames, static objects are well aligned while dynamic objects have 'shadows' which represent their motion.

Figure 3: Overlay of temporal and motion forecasting data. Green: bounding boxes with 3D points; grey: bounding boxes without 3D points.
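A sketch of how the past n frames could be assembled into the 4D tensor, under the assumption that each frame comes with a 4×4 ego-to-world pose; it reuses the hypothetical `voxelize` helper from the previous sketch.

```python
import numpy as np

def build_input_tensor(point_clouds, poses, current_pose):
    """Stack the past n frames into a (T, Z, Y, X) binary tensor.

    point_clouds: list of (N_i, 3) arrays, oldest frame first.
    poses: list of 4x4 ego-to-world transforms, one per frame.
    current_pose: 4x4 ego-to-world transform of the current frame.
    """
    world_to_current = np.linalg.inv(current_pose)
    grids = []
    for pts, pose in zip(point_clouds, poses):
        # Undo ego-motion: past ego frame -> world -> current ego frame.
        hom = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coords
        pts_current = (world_to_current @ pose @ hom.T).T[:, :3]
        grids.append(voxelize(pts_current))
    # New leading temporal dimension: one 3D occupancy grid per frame.
    return np.stack(grids, axis=0)
```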
3.2. Model Formulation

Our single-stage detector takes a 4D input tensor and regresses directly to object bounding boxes at different timestamps without using region proposals. We investigate two different ways to exploit the temporal dimension of our 4D tensor: early fusion and late fusion. They represent a trade-off between accuracy and efficiency, and they differ in the level at which the temporal dimension is aggregated.
Early Fusion: Our first approach aggregates temporal information at the very first layer. As a consequence, it runs as fast as the single-frame detector. However, it might lack the ability to capture complex temporal features, as this is equivalent to producing a single point cloud from all frames while weighting the contribution of the different timestamps differently. In particular, as shown in Fig. 4, given a 4D input tensor, we first use a 1D convolution with kernel size n on the temporal dimension to reduce it from n to 1. We share the weights among all feature maps, i.e., a group convolution. We then perform convolution and max-pooling following VGG16 [25].
Figure 4: (a) Early fusion; (b) late fusion.
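A minimal PyTorch sketch of the early-fusion step, assuming a (batch, height-channels, time, H, W) layout: a single length-n kernel is shared by every feature map, realizing the shared-weight group convolution, and collapses the temporal dimension from n to 1. The tensor sizes in the usage example are made up for illustration.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Collapse n input frames into one with a shared temporal kernel."""

    def __init__(self, n_frames: int):
        super().__init__()
        # One 1D temporal kernel (width n), shared across all feature maps.
        self.temporal = nn.Conv3d(1, 1, kernel_size=(n_frames, 1, 1), bias=False)

    def forward(self, x):
        # x: (batch, height_channels, n_frames, H, W)
        b, c, t, h, w = x.shape
        x = x.reshape(b * c, 1, t, h, w)   # fold channels into the batch
        x = self.temporal(x)               # -> (b*c, 1, 1, H, W)
        return x.reshape(b, c, h, w)       # temporal dimension reduced to 1

# Hypothetical sizes: 5 past frames, 29 height bins, a 200 x 200 BEV grid.
x = torch.rand(1, 29, 5, 200, 200)
out = EarlyFusion(n_frames=5)(x)
print(out.shape)  # torch.Size([1, 29, 200, 200])
```

Folding the channels into the batch is one simple way to enforce exact weight sharing across feature maps; a grouped convolution with tied per-group weights would be equivalent.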
Table 1: Detection performance on a 144 × 80 m region, for objects with at least three 3D points.

Table 2: Ablation study on a 144 × 80 m region, for vehicles with at least three 3D points.
References

[3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3D object detection network for autonomous driving. arXiv preprint arXiv:1611.07759, 2016.

[4] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.

[5] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. arXiv preprint arXiv:1710.03958, 2017.

[6] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[7] H. Gong, J. Sim, M. Likhachev, and J. Shi. Multi-hypothesis motion planning for visual object tracking. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 619–626. IEEE, 2011.

[8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.

[9] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 FPS with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.

[10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[11] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.

[12] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.

[13] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. DESIRE: Distant future prediction in dynamic scenes with interacting agents. arXiv preprint arXiv:1704.04394, 2017.

[15] B. Li. 3D fully convolutional network for vehicle detection in point cloud. arXiv preprint arXiv:1611.08069, 2016.

[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.

[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[18] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 3074–3082, 2015.

[19] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani. Forecasting interactive dynamics of pedestrians with fictitious play. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 774–782, 2017.

[20] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.

[21] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4293–4302, 2016.

[22] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In Computer Vision, 2009 IEEE 12th International Conference on, pages 261–268. IEEE, 2009.

[23] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.

[24] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS), 2015.

[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[26] N. Srivastava, E. Mansimov, and R. Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852, 2015.

[27] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1420–1429, 2016.

[28] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pages 835–851. Springer, 2016.

[29] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3119–3127, 2015.

[30] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pages 809–817, 2013.

[31] B. Wu, F. Iandola, P. H. Jin, and K. Keutzer. SqueezeDet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. arXiv preprint arXiv:1612.01051, 2016.