CVPRW 2022 AICity
Fei Li* Zhen Wang* Ding Nie* Shiyi Zhang Xingqun Jiang Xingxing Zhao Peng Hu
BOE Technology Group, China
{lifei-cto,wangzhen-cto,nied,zhangshiyi,hupeng-hq}@boe.com.cn
Abstract
1. Introduction
Multi-Target Multi-Camera (MTMC) tracking [1, 9, 11, 12, 14, 21, 27, 29] plays an important role in extracting actionable insights from the fast-expanding sensors around the world. Among its branches, city-scale Multi-Camera Vehicle Tracking (MCVT) is attracting an increasing number of researchers. Its primary goal is to calculate the trajectories of vehicles across multiple cameras, as shown in Fig. 1. Following a general approach, MCVT can be broken down into three sub-tasks: Single-Camera Tracking (SCT), vehicle re-identification (Re-ID), and Multi-Camera Tracklet Matching (MCTM). As the initial condition, SCT has a significant impact on the correctness of the successive steps. A few broken tracklets or ID switches in SCT are enough to cause a chain reaction in multi-camera tracklet matching, forming an invisible bottleneck on recall and precision scores. We can observe the challenges of traditional MCVT in some actual scenarios:

1. Most tracking-by-detection SCT algorithms face uncertainty caused by detection noise such as false positives or false negatives, which in turn affects the matching process and ultimately results in unstable or broken tracks.

2. Most tracking-by-detection SCT algorithms underestimate broken tracklets caused by long traffic jams, as shown in Fig. 6.

3. Most tracking-by-detection SCT algorithms extract impure features when a vehicle is occluded by another object, such that the bounding box may contain many pixels from the adjacent object.

4. In multi-camera tracklet matching, many vehicles with similar appearance cannot be distinguished, thus creating ID switches across cameras.

Because of these problems, we design a system that labels and handles three types of detection results using confidence scores and Intersection over Union (IOU) ratios: low-quality-occluded, low-quality-non-occluded, and high-quality-non-occluded vehicles.

* Equal contribution.
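The three detection classes can be sketched as a simple rule on confidence and overlap. The threshold values, and the choice to treat every heavily overlapped box as low quality, are illustrative assumptions rather than the paper's exact settings:

```python
def label_detection(conf, max_iou, conf_thresh=0.5, iou_thresh=0.3):
    """Sort one detection into the three classes used by the handler.

    conf: detector confidence score; max_iou: largest IOU with any other box.
    Threshold values are hypothetical.
    """
    if max_iou >= iou_thresh:              # overlaps another box -> occluded
        return "low-quality-occluded"      # its Re-ID feature will be dropped
    if conf < conf_thresh:
        return "low-quality-non-occluded"  # lower matching priority
    return "high-quality-non-occluded"     # highest priority; may start new tracks
```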
Among these, high-quality-non-occluded vehicles have the highest priority when matching and are used for track initialization. Low-quality ones, on the other hand, have lower priority when matching. Moreover, the Re-ID features of low-quality-occluded vehicles are dropped. By sorting them out, vehicle tracking can be performed more accurately, which enables a high recall rate, and the filtered Re-ID features make later tracklet matching easier.

For the missing pieces of a vehicle, we introduce two more SOT algorithms besides the Kalman Filter, so that a stable vehicle trajectory can be synthesised even if one of them is uncertain. Then, the results go through multi-level matching and clustering, within and across cameras. As for trajectory disruptions caused by heavy occlusion or appearance changes, we propose a zone-based tracklet merging strategy, piecing together most tracklet fragments within cameras.

As for multi-camera matching, we propose a direction-based spatial-temporal strategy that significantly reduces the search space and an aggregation strategy that solves edge cases like U-turns. The main contributions of this paper are summarized as follows:

• We propose a multi-level detection handler, an augmented tracks prediction method, and multi-level association to cope with broken tracklets and ID switches.

• We propose a zone-based single-camera tracklet merging strategy, which not only links tracklet fragments but also enriches vehicle features in order to increase the recall rate.

• We propose a spatial-temporal matching and aggregation strategy that significantly reduces the search space and solves edge cases like U-turns.

2. Related Work

2.1. Vehicle Detection

Object detection is a traditional assignment of computer vision and image processing which deals with detecting instances of semantic objects of certain classes, such as humans, bikes, or cars, in digital images and videos.

Object detection algorithms typically leverage machine learning and deep learning to output reasonable results. Commonly, they are classified into two categories: two-stage and one-stage object detection.

Faster R-CNN is a representative two-stage object detection algorithm [5, 25]. Its architecture contains two networks, the region proposal network (RPN) and the object detection network. The RPN first generates a series of anchors on the input feature map. Then, Region of Interest (ROI) pooling takes the output of the RPN as input and generates a series of fixed-size feature maps. Finally, the softmax and bounding box regression layers flatten the feature maps and output the location of the object in the image.

Single Shot MultiBox Detector (SSD) [18] and You Only Look Once (YOLO) [4, 22–24, 30] are both famous one-stage object detection algorithms. SSD [18] predicts object classes and locations with a series of bounding boxes of various aspect ratios and sizes. YOLO tackles real-time object detection with anchors derived from Faster R-CNN, ingeniously turning the detection task into a classification task while balancing prediction accuracy and inference performance.

2.2. Multiple Object Tracking

2.2.1 Single-Target Single-Camera Tracking

Single object tracking (SOT) is one of the foremost assignments in computer vision, with numerous applications such as intelligent video analysis, robotics, and autonomous vehicle tracking. There are generally three kinds of methods to approach it.

The first class of algorithms is based on correlation filters, such as Kernelized Correlation Filters (KCF) [10], Continuous Convolution Operators for Visual Tracking (C-COT) [8], and Efficient Convolution Operators (ECO) [7]. Correlation filters were commonly used in signal processing to describe the correlation or similarity of two signals. The general procedure for applying a correlation filter to SOT has three steps. First, obtain the initial target position from the first frame of the stream. Then, extract features of the target position, such as Histogram of Oriented Gradients (HOG), as the filter template. Finally, obtain the predicted target position in the current frame by executing a correlation operation against the previous filter template.

The second class of algorithms is based on optical flow, such as MedianFlow [13]. This method extracts target feature points with Harris Corner or SIFT in the first frame. Then, it predicts the corresponding positions of the feature points in the following frames by optical flow algorithms, and finally gives the predicted target position based on the predicted feature-point positions. Optical flow algorithms mainly focus on local feature predictions, which leads to robust results under occlusion.

The last class of tracking algorithms is based on CNN Siamese networks, a kind of offline object tracking, such as SiamFC [2], SiamRPN [16], and SiamRPN++ [15]. The main principle is like the second method, but the Harris Corner or SIFT feature extractor is replaced with a CNN. These methods have all achieved satisfactory results in the field of SOT.
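The three-step correlation-filter procedure can be illustrated with a toy tracker based on normalized cross-correlation. This is a didactic sketch only; KCF, C-COT, and ECO learn filters in the Fourier domain rather than sliding a raw template:

```python
import numpy as np

def ncc(template, patch):
    """Normalized cross-correlation between the filter template and a patch."""
    t = (template - template.mean()) / (template.std() + 1e-8)
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    return float((t * p).mean())

def track_step(frame, template, prev_xy, search=8):
    """Scan a small window around the previous position and return the
    top-left corner of the best-matching patch (the predicted target)."""
    h, w = template.shape
    best_score, best_xy = -np.inf, prev_xy
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = prev_xy[0] + dx, prev_xy[1] + dy
            if x < 0 or y < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                continue
            score = ncc(template, frame[y:y + h, x:x + w])
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy
```

A real correlation-filter tracker would additionally update the template online after each frame.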
2.2.2 Multi-Target Single-Camera Tracking

Multiple Object Tracking (MOT) is a challenging task. It not only involves the difficulties inherited from SOT, such as occlusion, illumination variation, and object deformation, but is also concerned with the relative positions of the targets, in order to reduce the possibility of ID switches.

Tracking-by-detection is the main paradigm for MOT. Simple Online and Realtime Tracking (SORT) [3] is a typical bare-bones implementation of an MOT framework. It predicts the object locations in the following frames with a Kalman Filter and then computes the overlaps with the detections. In the end, it uses the Hungarian algorithm to assign detections to tracklets. However, it often yields unsatisfactory tracking results due to the lack of feature extraction, especially in crowded and fast-motion scenes. To solve this problem, Simple Online and Realtime Tracking with a Deep Association Metric (DeepSort) [31] was proposed in 2017, integrating appearance information to improve the performance of SORT. It effectively reduces the number of ID switches at an acceptable increase in computational complexity.

Beyond SORT and DeepSort, there are more works that jointly perform detection and tracking. For example, FairMOT [33] integrates object detection and Re-ID in the same backbone to improve inference speed.

2.3. Multi-Camera Vehicle Tracking

Recently, MCVT [19] has become a fascinating research area due to the increasing demands of city-scale traffic management. With the progression of multi-object tracking techniques, vehicle Re-ID techniques, and integrated frameworks, MCVT can achieve better results. Liu et al. [17] propose an integrated framework for MCVT guided by crossroad zones, which achieved the top performance in the AI City Challenge 2021.

3. Method

3.1. Overview

The proposed MCVT system is shown in Fig. 2. It includes vehicle detection, Re-ID feature extraction, a feature-dropout-based multi-level detection handler, a single-camera multi-level tracking and merging strategy, and a multi-camera spatial-temporal matching and aggregation strategy.

3.2. Vehicle Detection

Vehicle detection is the first and essential step in MTMC tracking. As in most MTMC tracking methods, we follow the tracking-by-detection paradigm, using the state-of-the-art network YOLOv5 [30], and more specifically the YOLOv5x6 model, which is pre-trained on the COCO dataset, to detect vehicles. We restrict the detection classes to only cars, trucks, and buses by setting the classes parameter. The agnostic parameter is used to perform non-maximum suppression (NMS) over all detected vehicles in the inference stage.

3.3. Vehicle Re-ID

Following existing Re-ID work [17], we use ResNet50-IBN-a, ResNet101-IBN-a, and ResNeXt101-IBN-a models pre-trained on the CityFlow dataset to extract vehicle features, without introducing external data. Each Re-ID model outputs a 2048-dimensional feature vector, and the final feature of each detected car is the average output of the three models.

3.4. Single-Camera Vehicle Tracking

For the single-camera vehicle tracker, we follow the general framework of Simple Online and Realtime Tracking (SORT) [3]. To deal with the limitations of SORT, we propose further improvements to the tracking method. First, relying on predictions from the Kalman Filter alone often produces ID switches when the direction of movement changes. Therefore, we utilize two more SOT methods, namely Efficient Convolution Operators (ECO) [6] and MedianFlow, and propose an augmented tracks prediction method. Next, inspired by DeepSort, we include vehicle appearance features, which go through a feature dropout filter and a multi-level matching process. Finally, to ensure the completeness of tracklets, we add another post-process for tracklet merging within a single camera.

3.4.1 Vehicle Track Prediction

To address the limitations of the Kalman Filter, we first include MedianFlow, which uses the current location of a vehicle to sample pixels and then predicts the next frame's location based on optical flow. MedianFlow can effectively locate vehicles that are occluded by another vehicle moving in parallel, and is thus more occlusion-resilient. Secondly, when a vehicle moves extremely fast or makes sharp turns, it usually undergoes dramatic appearance changes. In this case, MedianFlow might not work well, so we can adjust our prediction using ECO. Thereafter, every vehicle detection box will have a much better match in our multi-level association method, as shown in Fig. 3.

3.4.2 Multi-Level Detection Handler

In our method, vehicle Re-ID features play an important role in both single- and multi-camera vehicle tracking. Therefore, it is imperative to extract accurate features from the Re-ID models, which act as the enabler for the rest of the processes down the pipeline. We propose a multi-level detection handler.
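To make the Re-ID feature extraction of Sec. 3.3 concrete, the three-model ensemble amounts to a per-detection feature average. This is a minimal sketch; the final L2 normalization, common before cosine matching, is our assumption rather than a detail stated in the paper:

```python
import numpy as np

def ensemble_feature(model_outputs):
    """Fuse the 2048-d vectors produced by the three Re-ID backbones
    (ResNet50-IBN-a, ResNet101-IBN-a, ResNeXt101-IBN-a) for one detection."""
    f = np.mean(np.stack(model_outputs, axis=0), axis=0)  # element-wise average
    return f / np.linalg.norm(f)  # L2-normalize (assumed) for cosine matching
```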
Figure 2. Pipeline of our MCVT system. The system first uses the detector to obtain all bounding boxes of vehicles from the video set; then uses Re-ID models to extract features, with some Re-ID features being dropped out; then single-camera algorithms generate tracklets; finally, tracklets across cameras are synchronized by matching and clustering strategies.

[Figure 3: multi-level association flow. Low-quality (occluded and non-occluded) and high-quality detections cascade through four match levels; unmatched tracks are predicted by ECO/MedianFlow, with the track age incremented on failure, and matched tracks are passed to the tracker.]
3. Associate the still-unmatched detections with tracks whose age is greater than 1, yielding:

   F = A    (3)

   where A represents the feature cosine cost matrix. This step is inspired by DeepSort, assuming that tracks with lower age should have higher priority when matching.

4. Match the remaining low-quality vehicles (occluded or not) with tracks of age 1, with a matrix calculation similar to step 2. This step aims to save the boxes predicted by tracks from the low-quality detections, in order to ensure the completeness of our tracklets.

3.4.4 Tracks Life-cycle

After the four rounds of association, if there are still unmatched high-quality detection boxes, they are considered new, and new tracks are initialized, including the Kalman Filter model, ECO tracks, and sampled pixels for optical flow. For matched detections and tracks, the tracks are updated accordingly. First, the type of the detection box is determined: if the vehicle is occluded, only the Kalman Filter is updated; otherwise, if the IOU distance between the ECO prediction and the matched detection is lower than a threshold, the track is reinitialized with the matched detection box rather than being updated, and the optical flow sampling points as well as the features are updated accordingly. For tracks not matched with any detection box, we try to salvage them by using predictions from MedianFlow or ECO, while updating the Kalman Filter model with those predictions to compensate for the missing parts.

3.4.5 Zone-Based Tracklet Merging

Although we make many attempts to capture the fragments of every tracklet to ensure its completeness, in real-world scenarios there are still split pieces caused by unexpected objects such as traffic lights, as can be seen in Fig. 6. For these cases, we propose a zone-based tracklet merging method.

Tracklet Selection We divide crossroad images into 9 effective zones and 1 traffic zone, which can be determined case by case, as shown in Fig. 7. The 9 zones can be sorted into starting zones (1, 3, 5), the middle zone (10), and ending zones (2, 4, 6). Before merging, we pick tracklets under the following criteria:

• tracklets that start normally and end in either the same zone or the middle zone;

• tracklets that start in either the middle zone or the traffic zone.

These tracklets are considered abnormal and become candidates for merging. They then go through a filter that removes noise such as negligibly short tracklets (less than or equal to 4 frames), stationary ones, and ones with small pixel counts. From there, the candidates go to tracklet merging in the next step.

Tracklet Merging Considering that a tracklet might have multiple fragments within one camera, this paper copes with abnormal tracklet fragments using hierarchical clustering. Assuming there are n fragmented tracklets, the mean feature of each tracklet is first calculated; then we have:

$$H_{n\times n} = \begin{bmatrix} \cos(T_1, T_1) & \cdots & \cos(T_1, T_n) \\ \vdots & \ddots & \vdots \\ \cos(T_n, T_1) & \cdots & \cos(T_n, T_n) \end{bmatrix} \tag{4}$$

where T_i represents the tracklet fragments to be merged, and H_{n×n} represents the cost matrix of the tracklets. Because one tracklet cannot merge with itself, we set the diagonal entries of H accordingly.
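A minimal sketch of the cost matrix in Eq. (4). Treating the entries as cosine similarities of the per-tracklet mean features, and masking the diagonal so a fragment can never pair with itself, reflect our reading of the text:

```python
import numpy as np

def tracklet_similarity(mean_features):
    """H[i, j] = cos(T_i, T_j) over per-tracklet mean Re-ID features (Eq. 4)."""
    f = np.asarray(mean_features, dtype=float)
    f = f / np.linalg.norm(f, axis=1, keepdims=True)  # unit-normalize each row
    h = f @ f.T                                       # pairwise cosine similarity
    np.fill_diagonal(h, -np.inf)  # a tracklet cannot merge with itself
    return h
```

Hierarchical clustering would then repeatedly merge the pair of fragments with the highest similarity above a threshold.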
[Figure: (a) Vehicles stop for a traffic light and then move under occlusion; after the vehicle reappears, the partly occluded feature differs from the one extracted afterwards, so the two features cannot be reliably matched. (b) The track is deleted due to long-time occlusion; when the vehicle finally shows up again, a new ID is assigned.]
3.5.3 Tracklet Clustering

Inspired by [17], our method contains two rounds of multi-camera tracklet clustering. The first round is direction-based. For instance, tracklets going from camera 41 to 42 and the other way around could be clustered together; but with reliable single-camera tracking results, we can compare only those tracklets going in the same direction, reducing the search space.

The AIC22 benchmark (CityFlowV2) [20, 28] is captured by 46 cameras in a real-world traffic surveillance environment. A total of 880 vehicles are annotated in 6 different scenarios: 3 scenarios are used for training, 2 for validation, and the last one for testing. There are 215.03 minutes of video in total. The length of the training videos is 58.43 minutes, that of the validation videos is 136.60 minutes, and that of the testing videos is
20.00 minutes. For MTMC tracking, we use the IDF1 [26] score as the evaluation indicator. IDF1 measures the ratio of correctly identified detections over the average number of ground-truth and computed detections. The evaluation system of the AI City Challenge displays IDF1, IDP, IDR, Precision (detection), and Recall (detection).

4.2. Results

We use the evaluation opportunity provided by the evaluation system to verify the effect of our algorithm, and optimize the algorithm according to the IDF1, IDP, IDR, Precision, and Recall results. We record the changes in IDF1 after several optimizations, as shown in Tab. 1.

Method                    IDF1    IDP     IDR
Baseline                  0.7522  0.8173  0.6967
+Feature Dropout Filter   0.8192  0.8871  0.7610
+Multi-level matching     0.8312  0.8799  0.7877
+Tracklet Merging         0.8362  0.8732  0.8022
+Tracklet Selection       0.8437  0.8900  0.8020

Table 1. IDF1 score changes after several optimizations.

In the final ranking of Track 1 of the AI City Challenge 2022, we rank second among all participating teams. The comparison of our method with other teams' on the evaluation system is shown in Tab. 2. Our code will be released later.

Team ID   IDF1 score
28        0.8486
59        0.8437
37        0.8371
50        0.8348
70        0.8251

Table 2. Comparison of our method with other teams.

5. Conclusion

In this paper, we propose an accurate multi-camera vehicle tracking system. For single-camera tracking, we incorporate three mutually reinforcing methods for augmented track prediction, together with multi-level association and a zone-based single-camera tracklet merging strategy. For multi-camera tracking, we develop a spatial-temporal strategy that reduces the search space when matching, and improve hierarchical clustering to capture U-turn tracklets. Our results show that these methods improve both IDR and IDP, with a final IDF1 score of 0.8437.

References

[1] Pedro Antonio Marin-Reyes, Andrea Palazzi, Luca Bergamini, Simone Calderara, Javier Lorenzo-Navarro, and Rita Cucchiara. Unsupervised vehicle re-identification using triplet networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 166–171, 2018.

[2] Luca Bertinetto, Jack Valmadre, Joao F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pages 850–865. Springer, 2016.

[3] Alex Bewley, Zongyuan Ge, Lionel Ott, Fabio Ramos, and Ben Upcroft. Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), 2016.

[4] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. YOLOv4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.

[5] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6154–6162, 2018.

[6] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[7] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6638–6646, 2017.

[8] Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, and Michael Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision, pages 472–488. Springer, 2016.

[9] Zhiqun He, Yu Lei, Shuai Bai, and Wei Wu. Multi-camera vehicle tracking with powerful visual features and spatial-temporal cue. In CVPR Workshops, pages 203–212, 2019.

[10] João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2014.

[11] Yunzhong Hou, Heming Du, and Liang Zheng. A locality aware city-scale multi-camera vehicle tracking system. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 167–174, 2019.

[12] Hung-Min Hsu, Tsung-Wei Huang, Gaoang Wang, Jiarui Cai, Zhichao Lei, and Jenq-Neng Hwang. Multi-camera tracking of vehicles based on deep features re-id and trajectory-based camera link models. In CVPR Workshops, pages 416–424, 2019.
[13] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2011.

[14] Young-Gun Lee, Jenq-Neng Hwang, and Zhijun Fang. Combined estimation of camera link models for human tracking across nonoverlapping cameras. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2254–2258. IEEE, 2015.

[15] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.

[16] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.

[17] Chong Liu, Yuqi Zhang, Hao Luo, Jiasheng Tang, Weihua Chen, Xianzhe Xu, Fan Wang, Hao Li, and Yi-Dong Shen. City-scale multi-camera vehicle tracking guided by crossroad zones. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021.

[18] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[19] Pedro Antonio Marín-Reyes, Luca Bergamini, Javier Lorenzo-Navarro, Andrea Palazzi, Simone Calderara, and Rita Cucchiara. Unsupervised vehicle re-identification using triplet networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 166–171, 2018.

[20] Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Xiaodong Yang, Yue Yao, Liang Zheng, Pranamesh Chakraborty, Christian E. Lopez, Anuj Sharma, Qi Feng, Vitaly Ablavsky, and Stan Sclaroff. The 5th AI City Challenge. In Proc. CVPR Workshops, pages 4263–4273, Virtual, 2021.

[21] Yijun Qian, Lijun Yu, Wenhe Liu, and Alexander G. Hauptmann. Electricity: An efficient multi-camera vehicle tracking system for intelligent city. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 588–589, 2020.

[22] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 779–788, 2016.

[23] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.

[24] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[25] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.

[26] Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pages 17–35. Springer, 2016.

[27] Jakub Španhel, Vojtech Bartl, Roman Juránek, and Adam Herout. Vehicle re-identification and multi-camera tracking in challenging city-scale environment. In Proc. CVPR Workshops, volume 2, page 1, 2019.

[28] Zheng Tang, Milind Naphade, Ming-Yu Liu, Xiaodong Yang, Stan Birchfield, Shuo Wang, Ratnesh Kumar, David Anastasiu, and Jenq-Neng Hwang. CityFlow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In Proc. CVPR, pages 8797–8806, Long Beach, CA, USA, 2019.

[29] Zheng Tang, Gaoang Wang, Hao Xiao, Aotian Zheng, and Jenq-Neng Hwang. Single-camera and inter-camera vehicle tracking and 3D speed estimation based on fusion of visual and semantic features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 108–115, 2018.

[30] Ultralytics. YOLOv5. https://github.com/ultralytics/yolov5. Accessed April 2022.

[31] Nicolai Wojke, Alex Bewley, and Dietrich Paulus. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pages 3645–3649, 2017.

[32] Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Zehuan Yuan, Ping Luo, Wenyu Liu, and Xinggang Wang. ByteTrack: Multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864, 2021.

[33] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, and Wenyu Liu. FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision, 129(11):3069–3087, 2021.