Towards Robust Real-Time Visual Tracking for UAV with Joint Scale and Aspect Ratio Optimization


Fangqiang Ding¹, Changhong Fu¹,*, Yiming Li¹, Jin Jin¹ and Chen Feng²

Abstract— Current unmanned aerial vehicle (UAV) visual tracking algorithms are primarily limited with respect to: (i) the kind of size variation they can deal with, and (ii) the implementation speed, which hardly meets the real-time requirement. In this work, a real-time UAV tracking algorithm with powerful size estimation ability is proposed. Specifically, the overall tracking task is allocated to two 2D filters: (i) a translation filter for location prediction in the space domain, and (ii) a size filter for scale and aspect ratio optimization in the size domain. Besides, an efficient two-stage re-detection strategy is introduced for long-term UAV tracking tasks. Large-scale experiments on four UAV benchmarks demonstrate the superiority of the presented method, which is computationally feasible on a low-cost CPU.

Fig. 1. Comparison of the overall flow in the baseline [13] and JSAR. The baseline only trains a translation filter for translation estimation in the detection phase and updates the object scale by a brute-force multi-scale search strategy. JSAR proposes to train a size filter in the size domain by multi-size sampling and jointly estimates the scale and aspect ratio of the object in the detection stage.

I. INTRODUCTION

Equipped with visual perception capability, robots can have flourishing real-world applications, e.g., visual object tracking has stimulated broad practical utilities like human-robot collaboration [1], robotic arm manipulation [2] and aerial filming [3]. Tracking onboard an unmanned aerial vehicle (UAV) has many advantages over general object tracking, for instance, broad view scope, high flexibility and mobility. Yet more difficulties are introduced, such as aspect ratio change (ARC)¹, out-of-view, and exiguous calculation resources. Hence, a robust-against-ARC, low-cost and energy-efficient tracking algorithm applicable in both long- and short-term tasks is highly desirable in UAV tracking.

In the literature, although deep features [4]–[6] or deep architectures [7]–[9] can exceedingly boost tracking robustness, their complex convolution operations have hampered practical utility. Another research direction in visual tracking is discriminative correlation filters (DCF) [10]–[12]. With only hand-crafted features, DCF-based trackers mostly run at real-time speed on a single CPU thanks to transforming the intractable spatial convolution into element-wise multiplication in the Fourier domain. While most research focuses on location and scale estimation, scarcely any of it addresses the aspect ratio. Current DCF-based trackers commonly fix the aspect ratio of the object during tracking. Consequently, in UAV tracking scenarios with extensive ARC, erroneous appearances are frequently introduced in filter training because of the imprecise size estimation, leading to filter degradation.

Inspired by the 1D scale filter [14] aiming to handle the inefficiency of brute-force multi-scale search [12], [15], [16], this work proposes a joint scale and aspect ratio optimization tracker (JSAR) to achieve accurate scale and aspect ratio estimation. As displayed in Fig. 1, the training procedure is two-fold: (i) training a translation filter with a single patch [17] for location prediction, and (ii) training a size filter with exponentially-distributed samples for scale and aspect ratio estimation. Consequently, location, scale and aspect ratio are calculated simultaneously, i.e., the object bounding box can be estimated in the 4-DoF (degree of freedom) space, promoting the tracking accuracy without losing much speed.

Recently, combining visual tracking with a re-detection framework has raised precision in long-term tracking scenarios where objects frequently suffer from out-of-view or full occlusion [18], [19]. Yet the speed is mostly sacrificed due to the intractable object detection methods. In this work, a CPU-friendly re-detection strategy is proposed to enable long-term tracking. An effective tracking failure monitoring mechanism and an efficient re-detection method based on EdgeBoxes [20] collaboratively contribute to smooth long-term tracking. Our main innovations are three-fold:

• A novel robust tracking method with real-time speed is proposed with joint scale and aspect ratio optimization.
• A new CPU-friendly re-detection framework is developed to accomplish long-term tracking tasks efficiently.
• Large-scale experiments conducted on three short-term UAV benchmarks and one long-term benchmark validate the excellent performance of the proposed method.

¹Fangqiang Ding, Changhong Fu, Yiming Li and Jin Jin are with the School of Mechanical Engineering, Tongji University, 201804 Shanghai, China. changhongfu@tongji.edu.cn
²Chen Feng is with the Tandon School of Engineering, New York University, NY 11201 New York, United States. cfeng@nyu.edu
The code and tracking video are respectively released on http://github.com/vision4robotics/JSAR-Tracker and https://youtu.be/ME2KtMgHKhc.
¹Caused by rapid attitude alteration and strong motion of the UAV, ARC generally arises in the form of large viewpoint variation, intense rotation and deformation, to name a few.
II. RELATED WORKS

A. Discriminative correlation filter tracking algorithm

In the literature, discriminative tracking algorithms train a classifier to differentiate the tracked object from the background by maximizing the classification score. Recent investigations focus on discriminative correlation filters since D. S. Bolme et al. [21] proposed to learn robust filters by mapping the training samples to a desired output. J. F. Henriques et al. [10] presented to solve the ridge regression equation in the Fourier domain, and established the basic structure of modern DCF methods. Afterwards, several attempts were made to promote tracking performance within the DCF framework, e.g., spatial penalization [16], [22], multi-feature fusion [11], [23] and real negative sampling [12], [24]. However, most research highlights improvement of localization accuracy rather than amelioration of size estimation.

B. Prior works in object size estimation

Pioneer DCF trackers [10], [21], [25] fix the object size and only estimate the trajectory in the 2D space. Presetting a scaling pool, [15] sampled on different scales to find the optimal one in the detection phase. [26] proposed a separate scale correlation filter to estimate scale variance in the 1D scale domain. To enable aspect ratio estimation, [27] tackled scale and aspect ratio variation by embedding a detection proposal generator in the tracking pipeline, and [5] enforced a near-orthogonality constraint on center and boundary filters. Despite bringing more freedom in object tracking, these two methods impose a heavy computation burden on DCF trackers, and are hence not satisfactory alternatives for UAV tracking.

C. Re-detection in object tracking

Tracking-learning-detection (TLD) [28] was proposed to validate tracking results and decide whether to enable learning and detection. Among DCF trackers, [19] introduced an online random fern to generate candidates and score each of them for re-detection. In spite of its effectiveness, it is time-consuming due to the scanning window strategy. [18] presented a novel multi-threading framework in which an offline-trained Siamese network is used as a verifier. However, the speed is largely decreased. This work utilizes EdgeBoxes [20] to quickly generate proposals, and then a decision filter is applied to select the most probable bounding box for the tracker's re-initialization. The proposed two-stage re-detection strategy is more efficacious and light-weight.

D. UAV tracking

In UAV tracking scenarios, the tracked objects possess higher motion flexibility than in tracking based on hand-held or fixed surveillance cameras. Therefore, UAV tracking is confronted with more difficulties. In the literature, aberrance repression [29], intermittent context learning [30] and multi-frame verification [31] have been proposed to improve tracking precision. Despite obtaining appealing results, they cannot estimate aspect ratio variation. Adaptive to ARC, JSAR has better robustness and real-time tracking speed.
III. PROPOSED METHOD

A. Discriminative correlation filter

In frame t, a multi-channel correlation filter W_t ∈ R^{M×N×D} is trained by restricting its correlation result with the training sample X_t ∈ R^{M×N×D} to the given target label g ∈ R^{M×N}. The minimized objective E(W_t) is formulated as the sum of a least-squares term and a regularization term:

E(\mathbf{W}_t) = \Big\| \sum_{d=1}^{D} \mathbf{w}_t^d \star \mathbf{x}_t^d - \mathbf{g} \Big\|_2^2 + \lambda \sum_{d=1}^{D} \big\| \mathbf{w}_t^d \big\|_2^2 ,   (1)

where x_t^d ∈ R^{M×N} and w_t^d ∈ R^{M×N} respectively indicate the d-th channel of the feature representation and of the filter, and ⋆ denotes the cyclic correlation operator. M and N denote the width and height of a single-channel sample, while D denotes the number of feature channels. λ is a hyper-parameter for avoiding over-fitting. Minimizing the objective in the Fourier domain, a closed-form solution of the filter W_t is obtained:

\tilde{\mathbf{w}}_t^d = \frac{\tilde{\mathbf{g}} \odot \tilde{\mathbf{x}}_t^{d*}}{\sum_{d=1}^{D} \big( \tilde{\mathbf{x}}_t^d \odot \tilde{\mathbf{x}}_t^{d*} \big) + \lambda} ,   (2)

where ⊙ and the fraction bar denote element-wise multiplication and division, respectively, ~· denotes the discrete Fourier transform (DFT) and ·* denotes complex conjugation. The appearance model is updated by linear interpolation with a predefined learning rate θ. Using F^{-1} to denote the inverse discrete Fourier transform (IDFT) and m_t^d as the d-th feature representation of the search region, the response map R_t is obtained by:

\mathbf{R}_t = \mathcal{F}^{-1}\Big( \sum_{d=1}^{D} \tilde{\mathbf{w}}_{t-1}^{d*} \odot \tilde{\mathbf{m}}_t^d \Big) .   (3)
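For readers who prefer code to symbols, the following is a minimal NumPy sketch of the closed-form training in Eq. (2) and the detection response in Eq. (3). It is an illustrative reimplementation rather than the authors' released MATLAB code; the function names (train_dcf, response), the Gaussian label construction and the toy array shapes are assumptions made for the example.

```python
import numpy as np

def train_dcf(x, g, lam=1e-3):
    """Closed-form multi-channel DCF training in the Fourier domain, cf. Eq. (2).

    x   : (M, N, D) feature representation of one training patch
    g   : (M, N) desired target label (e.g., a Gaussian peak)
    lam : regularization weight (lambda in Eq. (1))
    Returns the filter in the Fourier domain, shape (M, N, D).
    """
    X = np.fft.fft2(x, axes=(0, 1))                 # per-channel DFT
    G = np.fft.fft2(g)
    num = G[..., None] * np.conj(X)                 # numerator: g~ . x~*
    den = np.sum(X * np.conj(X), axis=2) + lam      # denominator: sum_d x~ . x~* + lambda
    return num / den[..., None]

def response(w_fft, m):
    """Detection response of Eq. (3) for a search-region feature map m of shape (M, N, D)."""
    M_ = np.fft.fft2(m, axes=(0, 1))
    return np.fft.ifft2(np.sum(np.conj(w_fft) * M_, axis=2)).real

if __name__ == "__main__":
    # toy run: train on a random patch and detect on the same patch
    rng = np.random.default_rng(0)
    M, N, D = 64, 64, 31
    x = rng.standard_normal((M, N, D))
    yy, xx = np.mgrid[0:M, 0:N]
    g = np.exp(-((yy - M // 2) ** 2 + (xx - N // 2) ** 2) / (2.0 * 2.0 ** 2))
    w = train_dcf(x, g)
    r = response(w, x)
    print("peak position:", np.unravel_index(r.argmax(), r.shape))
```

The peak location of the response map plays the role of the translation estimate; the same closed form is reused for the size filter below, only with samples drawn from the size domain instead of the space domain.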
B. Translation estimation

For translation estimation, most trackers [10], [11], [23] learn a 2D translation filter W_{t,trans} in the space domain by Eq. (2). In the training phase, the region of interest (ROI) is cropped centered at the object location with a fixed proportion to the object scale. In the detection phase, the feature map M_{t,trans} of the ROI centered at the location of the last frame is extracted. The object is localized by finding the peak position of the response map generated by Eq. (3). For scale estimation, the classical brute-force search in a multi-scale hierarchical structure is inefficient due to repetitive feature extraction on large image patches.

C. Size estimation

Motivated by [26], which trains a 1D scale filter in the scale domain for scale-adaptive tracking, we propose to train a 2D size filter W_{t,size}. Different from the 2D translation filter learned in the space domain, which is composed of the horizontal and vertical axes, the samples here are extracted in the size domain consisting of the scale axis and the aspect ratio axis.

1) Sampling in size domain: During size filter training, we crop S × A patches centered at the current object location, where S and A represent the number of scales and aspect ratios in the training sample. The size of these patches is calculated by:

\{ W_t^{s,a}, H_t^{s,a} \} = \{ W_t\,\gamma^{N_s}\phi^{N_a},\; H_t\,\gamma^{N_s}/\phi^{N_a} \}, \qquad s = 1, 2, \cdots, S,\; a = 1, 2, \cdots, A ,   (4)

where W_t and H_t are the object width and height in frame t, and {s, a} denotes the index of a patch with a specific scale and aspect ratio. To maintain sample symmetry, we set N_s = −(S+1)/2 + s and N_a = −(A+1)/2 + a. Besides, γ and φ are hyper-parameters controlling the sampling steps. To ensure the dimension consistency of the cropped patches and reduce the computation burden, all patches are downsampled to a preset model size {W_model, H_model}. Afterwards, a feature map V_t^{s,a} ∈ R^{(W_model/C) × (H_model/C) × K} is extracted on each patch with K feature channels, where C denotes the side length of a single cell for feature extraction. Different from the translation filter, which utilizes the original feature map for training, the extracted feature representation of each patch is vectorized into a 1D vector v_t^{s,a} = vec(V_t^{s,a}) ∈ R^{W_model H_model K / C²}. In this process, the number of feature channels changes from the original K to C′ = W_model H_model K / C². By stacking the vectors from different patches, the final sample X_{t,size} ∈ R^{S×A×C′} can be denoted by:

X_{t,size} = \begin{pmatrix} v_t^{1,1} & v_t^{1,2} & v_t^{1,3} & \cdots & v_t^{1,A} \\ v_t^{2,1} & v_t^{2,2} & v_t^{2,3} & \cdots & v_t^{2,A} \\ v_t^{3,1} & v_t^{3,2} & v_t^{3,3} & \cdots & v_t^{3,A} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ v_t^{S,1} & v_t^{S,2} & v_t^{S,3} & \cdots & v_t^{S,A} \end{pmatrix} .   (5)

2) Size estimation: After sample extraction, Eq. (2) is applied to learn the size filter W_{t,size}. In the estimation stage, when a new frame comes, we assume the size is unchanged and estimate the location translation first. Centering at the predicted location, the feature representation of the search region in the size domain, M_{t,size}, is extracted for size estimation. It is noted that M_{t,size} has the same dimension as the training sample X_{t,size}. By Eq. (3), the current scale and aspect ratio are obtained by maximizing the response score, and then the object size is updated. In a word, our method is generally applicable in the DCF framework and works in the 4-DoF (degree of freedom) space.
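The size-domain sampling of Eqs. (4)–(5) can be sketched as follows. This is a simplified illustration, not the released implementation: a cell-averaged grayscale feature stands in for the HOG features actually used, and the helper names (size_domain_sizes, build_size_sample) as well as the assumption that every crop stays inside the frame are choices made for the example.

```python
import numpy as np

def size_domain_sizes(W, H, S=13, A=13, gamma=1.03, phi=1.02):
    """Patch sizes of Eq. (4): one (width, height) pair for every (scale, aspect ratio) index."""
    Ns = np.arange(1, S + 1) - (S + 1) / 2.0        # symmetric scale exponents
    Na = np.arange(1, A + 1) - (A + 1) / 2.0        # symmetric aspect-ratio exponents
    widths  = W * np.power(gamma, Ns)[:, None] * np.power(phi, Na)[None, :]
    heights = H * np.power(gamma, Ns)[:, None] / np.power(phi, Na)[None, :]
    return np.stack([widths, heights], axis=-1)     # shape (S, A, 2)

def build_size_sample(frame, cx, cy, sizes, model_wh=(16, 32), cell=4):
    """Stack the vectorized patch features into X_{t,size} of Eq. (5).

    `frame` is a 2D grayscale array; crops are assumed to stay inside the frame
    (a real tracker would pad near the image border).
    """
    Wm, Hm = model_wh
    S, A, _ = sizes.shape
    rows = []
    for s in range(S):
        row = []
        for a in range(A):
            w, h = sizes[s, a]
            x0, y0 = int(round(cx - w / 2)), int(round(cy - h / 2))
            patch = frame[y0:y0 + int(h), x0:x0 + int(w)]
            yi = np.linspace(0, patch.shape[0] - 1, Hm).astype(int)   # nearest-neighbour
            xi = np.linspace(0, patch.shape[1] - 1, Wm).astype(int)   # resize to model size
            resized = patch[yi][:, xi]
            cells = resized.reshape(Hm // cell, cell, Wm // cell, cell).mean(axis=(1, 3))
            row.append(cells.ravel())                                  # v_t^{s,a}
        rows.append(row)
    return np.asarray(rows)                          # shape (S, A, C')

if __name__ == "__main__":
    frame = np.random.default_rng(0).random((480, 640))
    sizes = size_domain_sizes(W=60.0, H=120.0)
    X_size = build_size_sample(frame, cx=320, cy=240, sizes=sizes)
    print(sizes.shape, X_size.shape)                 # (13, 13, 2) (13, 13, 32)
```

With the parameter values used later in the experiments (S = A = 13, γ = 1.03, φ = 1.02), this grid spans roughly ±6 multiplicative steps of 3% in scale and 2% in aspect ratio around the current object size, so the size filter only ever searches a narrow neighborhood of the previous size.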

Fig. 2. Overall flow of the re-detection strategy. When the maximum value of the response map in frame t (ζ_t) is larger than the threshold value (ζ_e), the tracking procedure is implemented normally; if not, the re-detection mechanism is activated. When the peak value η_b of the response map generated by the selected proposal and the decision filter exceeds the descending threshold η_d, the bounding box is re-initialized and re-starts to be tracked normally. It is noted that the values marked above the proposals are confidence scores k^i and the search scope increases during continuous re-detection.

D. Re-detection strategy

As displayed in Fig. 2, re-detection is implemented when tracking failure is observed. The proposed re-detection strategy has two stages, i.e., object proposal generation and candidate scoring. The object proposal method EdgeBoxes [20] is applied to generate candidates, and a decision filter scores them for object re-initialization. The detailed illustration is as follows.

1) Tracking failure monitoring mechanism: Ideally, re-detection is enabled when the object is lost or the output deviates greatly from the real object location. Related to the tracking confidence, the peak value ζ_t = max(R_{t,trans}) of the response map generated in the translation estimation of frame t is adopted to decide whether to activate re-detection. Presetting a threshold ζ_e, the re-detection mechanism is enabled when ζ_t < ζ_e.

2) Object proposals generation: When re-detection begins, EdgeBoxes [20] is utilized to generate class-agnostic object proposals within a surrounding square area in the first stage. The side length of this surrounding area is ω√(W_t H_t) in this work. Each proposal b^i generated by EdgeBoxes has five variables, i.e., b^i = [x^i, y^i, w^i, h^i, k^i]. The first four variables denote the location and size of the proposal while the last value k^i is the confidence score. Depending on the confidence score, we choose the top N_e proposals for re-detection.
3) Object proposals scoring: During the second stage, for each proposal, the feature of its ROI in frame t is extracted with K feature channels, which is denoted by P_t^i (i = 1, 2, ..., N_e). To make a final decision for re-initialization, a decision filter M_deci is trained alongside the translation filter with selected pure samples². After feature extraction of the proposals P_t^i (i = 1, 2, ..., N_e), the corresponding N_e response maps are calculated through the correlation of the decision filter and the feature maps by Eq. (3), and then the proposal with the largest peak value is selected. However, in scenarios of out-of-view and full occlusion, the selected proposal is generally fallacious. To this end, we set a threshold η_d to decide whether to re-initialize: if the selected proposal's peak value η_b > η_d, the re-initialization is enabled; or else, re-detection is continued.

In this work, the scale of the search area is increased and the re-initialization threshold η_d is reduced frame-by-frame in re-detection failure cases to make sure the re-initialization ultimately works. The overall flow of the proposed method is presented in Algorithm 1.

²The selection of pure samples also depends on the peak value of the response map in frame t: if the peak value ζ_t > ζ_s (ζ_s is a predefined threshold), the sample for translation filter training is adopted to update the decision filter, because a larger peak value indicates better tracking quality.

Algorithm 1: JSAR-Re
Input: Object location and size at the first frame; subsequent images in the video sequence
Output: Location and size of the object in frame t
1   if t = 1 or re-detection enabled then
2       Extract training samples X_{i,trans} and X_{i,size}
3       Use Eq. (2) to initialize W_{i,trans} and W_{i,size}
4       Initialize M_deci by X_{i,trans}, disable re-detection
5   else
6       Extract search region feature maps M_{t,trans}
7       Generate R_{t,trans} by Eq. (3) and find ζ_t
8       if ζ_t > ζ_e then
9           Estimate object translation and extract M_{t,size}
10          Estimate object size using Eq. (3)
11          Use Eq. (2) to update W_{t,trans} and W_{t,size}
12          if ζ_t > ζ_s then
13              Update filter M_deci using M_{t,trans}
14          end
15      else
16          Generate proposals within the search area
17          Score proposals using M_deci by Eq. (3)
18          Find the largest peak value η_b
19          if η_b > η_d then
20              Enable re-detection, initialize the object
21          else
22              Increase ω and reduce η_d
23              Continue to re-detect in the next frame
24          end
25      end
26  end

IV. EXPERIMENTS

In this section, the proposed method is evaluated on three challenging short-term UAV benchmarks, i.e., UAVDT [32], UAV123@10fps [33] and DTB70 [13], and one long-term dataset, UAV20L [33], including over 149K images overall captured by drone cameras in all kinds of harsh aerial scenarios. The experimental results of our method are compared with 30 state-of-the-art (SOTA) approaches, including 14 deep trackers, i.e., SiameseFC [7], DSiam [9], IBCCF [5], ECO [6], C-COT [4], GOTURN [36], PTAV [18], DeepSTRCF [17], CFNet [37], ASRCF [22], MCCT [23], ADNet [38], TADT [8], UDT+ [39], and 16 hand-crafted trackers, i.e., MCCT-H [23], KCF [10], DSST [26], fDSST [14], ECO-HC [6], DCF [10], BACF [12], ARCF [29], SRDCF [16], STAPLE-CA [24], ARCF-H [29], STAPLE [11], SRDCFdecon [35], CSR-DCF [34], KCC [40], STRCF [17].

A. Implementation details

To test the size estimation ability of JSAR, experiments are first conducted on three short-term UAV benchmarks [13], [32], [33], compared with both deep and hand-crafted trackers. Then, the re-detection module is added to cope with tracking failure, generating JSAR-Re. We evaluate JSAR-Re against SOTA trackers on the UAV20L dataset [33].

1) Platform: All experiments are implemented with MATLAB R2017a and all experimental results are obtained on a computer with a single i7-8700K (3.70GHz) CPU, 32GB RAM, and an NVIDIA RTX 2080 GPU for fair comparison.

2) Baseline: In this work, the spatial-temporal regularized correlation filter (STRCF) [17] is selected as our baseline tracker, which adds a spatio-temporal regularization term to the training objective for improving robustness and adopts a multi-scale search strategy for scale adaptivity. Discarding the hierarchical scale search, JSAR separately trains a size filter using Eq. (2) to estimate the scale and aspect ratio variations, and follows the translation estimation in [17].

3) Features: To guarantee real-time performance on a low-cost CPU, we only apply hand-crafted features in the experiments. Gray-scale, histogram of oriented gradient (HOG) [10] and color names (CN) [41] are applied in the translation filter, while the size filter only uses HOG.

4) Hyper parameters: The main parameters in this work are listed in Table I. For impartial comparison, all the parameters are fixed in the experiments.

TABLE I
FOR IMPARTIAL COMPARISON, THESE PARAMETERS ARE FIXED IN ALL EVALUATIONS OF OUR TRACKERS

Symbol     Value    Meaning
S          13       The number of scales
A          13       The number of aspect ratios
γ          1.03     Scale sampling step
φ          1.02     Aspect ratio sampling step
θ_size     0.014    The learning rate of the size filter
W_model    16       The width of the model size
H_model    32       The height of the model size
C          4        The side length of a feature cell
ζ_e        0.0105   Re-detection enablement threshold
ζ_s        0.013    Decision filter update threshold
η_d        0.02     Re-initialization threshold
ω          5        The side-length factor of the re-detection area
N_e        30       The number of proposals for re-detection

5) Criteria: Following the one-pass evaluation protocol (OPE) [42], we evaluate all trackers by two measures, i.e., precision and success rate. Precision plots exhibit the percentage of all input images in which the distance between the predicted location and the ground-truth one is smaller than various thresholds, and success plots reflect the proportion of frames in which the intersection over union (IoU) between the estimated bounding box and the ideal one is greater than distinctive thresholds. The score at 20 pixels and the area under the curve (AUC) are respectively used to rank the trackers.
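To make the two criteria concrete, the snippet below computes the precision score at the 20-pixel threshold and the success-plot AUC from per-frame predicted and ground-truth boxes. It is a minimal sketch assuming an [x, y, w, h] box convention and approximating the AUC by the mean success rate over sampled overlap thresholds; it is not the official benchmark toolkit.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (N, 4) arrays of [x, y, w, h]."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def iou(pred, gt):
    """Per-frame intersection over union of [x, y, w, h] boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def ope_scores(pred, gt):
    """Precision at 20 px and success-plot AUC, the two numbers used to rank the trackers."""
    prec_20 = float(np.mean(center_error(pred, gt) <= 20.0))
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, 21)                  # sampled overlap thresholds
    success = np.array([np.mean(overlaps > th) for th in thresholds])
    return prec_20, float(success.mean())                   # mean over thresholds ~= AUC

# toy usage with two frames
pred = np.array([[10, 10, 40, 80], [12, 14, 42, 78]], dtype=float)
gt   = np.array([[11, 12, 40, 82], [30, 40, 38, 80]], dtype=float)
print(ope_scores(pred, gt))
```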
Fig. 3. Overall performance (precision and success plots) of hand-crafted real-time trackers on (a) UAVDT [32], (b) UAV123@10fps [33] and (c) DTB70 [13]. JSAR has a notable improvement of 8.3% and 4.3% in terms of AUC on UAVDT and UAV123@10fps compared with the second best trackers, respectively.

TABLE II
AVERAGE PRECISION, AUC AND SPEED COMPARISON OF THE TOP 10 HAND-CRAFTED TRACKERS ON UAVDT [32], UAV123@10FPS [33] AND DTB70 [13]. RED, GREEN AND BLUE RESPECTIVELY MEAN THE FIRST, SECOND AND THIRD PLACE.

              |                      Real-time                                        |                   Non-real-time
Tracker       | JSAR      MCCT-H [23]  STRCF [17]  ARCF-H [29]  BACF [12]  ECO-HC [6] | CSR-DCF [34]  SRDCF [16]  ARCF-HC [29]  SRDCFdecon [35]
AUC           | 0.465     0.413        0.435       0.421        0.416      0.442      | 0.426         0.416       0.468         0.397
Precision     | 0.689     0.622        0.635       0.641        0.616      0.653      | 0.654         0.616       0.693         0.577
Speed (fps)   | 32.2      59.7         28.5        51.2         56.0       69.3       | 12.1          14.0        15.3          7.5
Conference    | This work CVPR'18      CVPR'18     ICCV'19      CVPR'17    CVPR'17    | CVPR'17       ICCV'15     ICCV'19       CVPR'16

TABLE III
PRECISION, AUC AND SPEED COMPARISON BETWEEN JSAR AND 14 RECENT DEEP TRACKERS ON UAVDT [32]. RED, GREEN, BLUE AND ORANGE RESPECTIVELY MEAN THE FIRST, SECOND, THIRD AND FOURTH PLACE.

Tracker           AUC     Precision   Speed (fps)   CPU/GPU   Conference
JSAR              0.469   0.721       35            CPU       This work
GOTURN [36]       0.451   0.702       17            GPU       ECCV'16
IBCCF [5]         0.388   0.603       3             GPU       CVPR'17
TADT [8]          0.431   0.677       35            GPU       CVPR'19
DSiam [9]         0.457   0.704       16            GPU       ICCV'17
PTAV [18]         0.384   0.675       27            GPU       ICCV'17
ECO [6]           0.454   0.700       16            GPU       CVPR'17
ASRCF [22]        0.437   0.700       24            GPU       CVPR'19
MCCT [23]         0.437   0.671       9             GPU       CVPR'18
CFNet [37]        0.428   0.680       41            GPU       CVPR'17
C-COT [4]         0.406   0.656       1             GPU       ECCV'16
ADNet [38]        0.429   0.683       8             GPU       CVPR'17
UDT+ [39]         0.416   0.697       60            GPU       CVPR'19
SiameseFC [7]     0.465   0.708       38            GPU       ECCV'16
DeepSTRCF [17]    0.437   0.667       6             GPU       CVPR'18

B. JSAR vs. deep trackers

We first compare the tracking performance of JSAR with 14 recently proposed SOTA deep trackers, i.e., deep-feature-based trackers and deep convolutional neural network (DCNN) based trackers, on the UAVDT benchmark [32]. As shown in Table III, JSAR takes the first place in both precision and AUC, while coming fourth in tracking speed running on a low-cost CPU. Without robust deep features, the remarkable improvement (7.3% and 8.1% over DeepSTRCF in terms of AUC and precision, respectively) can be attributed to the ARC adaptation, because UAVDT mainly addresses vehicle tracking and the viewpoint change can easily lead to ARC of the tracked vehicle in the image, as shown in Fig. 4.

C. JSAR vs. hand-crafted trackers

1) Overall evaluation: Restricted by scarce computation resources, deep trackers have difficulties meeting real-time tracking speed on a UAV. Hand-crafted trackers, i.e., trackers using hand-crafted features in the DCF framework, are ideal choices in UAV tracking for their calculation efficiency. In this subsection, twelve SOTA real-time hand-crafted trackers are first used for comparison with JSAR. JSAR outperforms the other real-time trackers in terms of precision and AUC. Notably, compared with the baseline STRCF [17], JSAR improves the AUC by 14.1% and 5.5% and the precision by 14.6% and 7.7% on UAVDT and UAV123@10fps, respectively.
Fig. 4. Display of tracking results from eight hand-crafted trackers (JSAR, MCCT-H, STRCF, STAPLE-CA, SRDCF, BACF, ECO-HC and ARCF-H) on twelve UAV videos, i.e., S0301, S0602, S0701 of UAVDT [32], boat3, boat4, wakeboard1, wakeboard5 of UAV123@10fps [33] and ChasingDrones, SnowBoarding4, Surfing03, Surfing06, SpeedCar4 of DTB70 [13].

Fig. 5. Attribute-oriented comparison with hand-crafted real-time trackers. Precision plots of four attributes, i.e., scale variation, partial occlusion, aspect ratio variation and in-plane rotation, and success plots of four attributes, i.e., scale variation, small object, viewpoint change and low resolution, are presented.

We further compare the average performance of the best 10 hand-crafted trackers on the three benchmarks [13], [32], [33], as shown in Table II. It can be seen that JSAR obtains the second place in both AUC and precision; however, JSAR has only a tiny gap compared with the best tracker ARCF-HC [29] (0.003 and 0.007 in AUC and precision) while remarkably improving the speed by 110%. Hence, compared to ARCF-HC, JSAR performs comparably with much higher efficiency. On average, JSAR gains an improvement of 6.9% in AUC and 8.5% in precision compared with the baseline method, i.e., STRCF.

2) Attribute-oriented evaluation: Fig. 5 exhibits the precision and success plots of real-time trackers on eight challenging attributes from UAV123@10fps [33], UAVDT [32] and DTB70 [13]. It can be seen that JSAR respectively improves the precision by 8%, 12.5% and 5.3% compared with the second best trackers in the attributes of scale variation, aspect ratio variation and in-plane rotation. As for AUC, JSAR gains a 7.6% improvement in scale variation and a 5.0% improvement in viewpoint change. The remarkable improvements demonstrate the effectiveness of the size filter in scale and aspect ratio change cases. Besides, in partial occlusion, small object and low resolution, JSAR still outperforms the other real-time trackers dramatically, exhibiting its excellent generality in various aerial scenarios.

D. Hyper parameters analysis

We analyze the impacts of five core hyper parameters of the proposed size filter, including the sampling steps γ and φ, the learning rate of the size filter θ, and the number of scales and aspect ratios (S/A). The impacts of the first three parameters on AUC and precision are exhibited in Fig. 6, from which it can be seen that they have a relatively small influence on tracking performance (with precision from 0.672 to 0.721 and AUC from 0.418 to 0.469), which demonstrates the strong robustness of JSAR. The comparison of tracking performance and speed under various S/A configurations is displayed in Table IV. From 7 to 21, the number of scales/aspect ratios has little influence on both AUC (from 0.452 to 0.469) and precision (from 0.704 to 0.721). Yet the results fall off rapidly when the value of S/A is 5. This situation can be explained by insufficient samples for size filter training.

Fig. 6. Sensitivity analysis of four parameters (γ, φ, θ and S/A) on UAVDT [32]. It is noted that we fixed the untested parameters in the analysis.

TABLE IV
TRACKING PERFORMANCE AND SPEED COMPARISON UNDER DIFFERENT S/A CONFIGURATIONS.

S/A           5      7      9      11     13     15     17     19     21
AUC           0.373  0.397  0.452  0.465  0.469  0.454  0.455  0.452  0.454
Precision     0.696  0.685  0.715  0.719  0.721  0.704  0.712  0.709  0.713
Speed (fps)   50.2   46.9   43.1   38.5   35.1   31.1   27.7   24.0   21.7

E. Re-detection evaluation

To verify the effectiveness of our proposed re-detection strategy, we conduct experiments with JSAR-Re and JSAR against eleven SOTA trackers on the long-term UAV20L benchmark, which consists of 20 long-term image sequences with over 2.9K frames per sequence on average. The precision plot is reported in Fig. 8. JSAR-Re ranks No. 1 and improves the tracking precision by 11.3% compared with JSAR, with a speed of 20 fps on a low-cost CPU. Some qualitative results are exhibited in Fig. 7.

Fig. 7. Qualitative tracking performance of JSAR-Re and seven SOTA trackers on person14, person7 and group2 of the UAV20L dataset [33].

Fig. 8. Precision plots with tracking speed of JSAR-Re, JSAR and eleven SOTA trackers on the UAV20L dataset [33]. * denotes that the tracker is tested on a GPU.
V. CONCLUSIONS

In this work, a novel UAV tracking framework with joint scale and ARC estimation is proposed. Also, an object-proposal-based re-detection algorithm is introduced to achieve long-term tracking. Experimental comparison with 30 SOTA trackers exhibits the superiority of our method. Most tellingly, our method can outperform SOTA deep trackers on UAVDT [32] with only hand-crafted features. A C++ implementation can further raise the tracking speed for real-world as well as real-time UAV applications.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (No. 61806148).
REFERENCES

[1] O. Palinko, F. Rea, G. Sandini, and A. Sciutti, "Robot reading human gaze: Why eye tracking is better than head tracking for human-robot collaboration," in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2016, pp. 5048–5054.
[2] M. Hofer, L. Spannagl, and R. D'Andrea, "Iterative Learning Control for Fast and Accurate Position Tracking with an Articulated Soft Robotic Arm," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov 2019, pp. 6602–6607.
[3] R. Bonatti, C. Ho, W. Wang, S. Choudhury, and S. Scherer, "Towards a Robust Aerial Cinematography Platform: Localizing and Tracking Moving Targets in Unstructured Environments," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov 2019, pp. 229–236.
[4] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond correlation filters: Learning continuous convolution operators for visual tracking," in European Conference on Computer Vision. Springer, 2016, pp. 472–488.
[5] F. Li, Y. Yao, P. Li, D. Zhang, W. Zuo, and M. Yang, "Integrating Boundary and Center Correlation Filters for Visual Tracking with Aspect Ratio Variation," in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Oct 2017, pp. 2001–2009.
[6] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ECO: Efficient Convolution Operators for Tracking," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6931–6939.
[7] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, "Fully-convolutional siamese networks for object tracking," in European Conference on Computer Vision. Springer, 2016, pp. 850–865.
[8] X. Li, C. Ma, B. Wu, Z. He, and M. Yang, "Target-Aware Deep Tracking," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 1369–1378.
[9] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, "Learning Dynamic Siamese Network for Visual Object Tracking," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 1781–1789.
[10] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-Speed Tracking with Kernelized Correlation Filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, March 2015.
[11] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, "Staple: Complementary Learners for Real-Time Tracking," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1401–1409.
[12] H. K. Galoogahi, A. Fagg, and S. Lucey, "Learning Background-Aware Correlation Filters for Visual Tracking," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 1144–1152.
[13] S. Li and D.-Y. Yeung, "Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[14] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Discriminative Scale Space Tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, pp. 1561–1575, Aug 2017.
[15] Y. Li and J. Zhu, "A scale adaptive kernel correlation filter tracker with feature integration," in European Conference on Computer Vision. Springer, 2014, pp. 254–265.
[16] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, "Learning Spatially Regularized Correlation Filters for Visual Tracking," in 2015 IEEE International Conference on Computer Vision (ICCV), Dec 2015, pp. 4310–4318.
[17] F. Li, C. Tian, W. Zuo, L. Zhang, and M. Yang, "Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp. 4904–4913.
[18] H. Fan and H. Ling, "Parallel Tracking and Verifying: A Framework for Real-Time and High Accuracy Visual Tracking," in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 5487–5495.
[19] C. Ma, X. Yang, C. Zhang, and M. Yang, "Long-term correlation tracking," in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5388–5396.
[20] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in European Conference on Computer Vision. Springer, 2014, pp. 391–405.
[21] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544–2550.
[22] K. Dai, D. Wang, H. Lu, C. Sun, and J. Li, "Visual Tracking via Adaptive Spatially-Regularized Correlation Filters," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 4665–4674.
[23] N. Wang, W. Zhou, Q. Tian, R. Hong, M. Wang, and H. Li, "Multi-cue Correlation Filters for Robust Visual Tracking," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2018, pp. 4844–4853.
[24] M. Mueller, N. Smith, and B. Ghanem, "Context-Aware Correlation Filter Tracking," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 1387–1395.
[25] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "Exploiting the circulant structure of tracking-by-detection with kernels," in European Conference on Computer Vision. Springer, 2012, pp. 702–715.
[26] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in British Machine Vision Conference, 2014.
[27] D. Huang, L. Luo, W. Wen, Z. Chen, and C. Zhang, "Enable scale and aspect ratio adaptability in visual tracking with detection proposals," in British Machine Vision Conference. BMVA Press, 2015.
[28] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409–1422, July 2012.
[29] Z. Huang, C. Fu, Y. Li, F. Lin, and P. Lu, "Learning Aberrance Repressed Correlation Filters for Real-time UAV Tracking," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2891–2900.
[30] Y. Li, C. Fu, Z. Huang, Y. Zhang, and J. Pan, "Keyfilter-Aware Real-Time UAV Object Tracking," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020.
[31] C. Fu, Z. Huang, Y. Li, R. Duan, and P. Lu, "Boundary Effect-Aware Visual Tracking for UAV with Online Enhanced Background Learning and Multi-Frame Consensus Verification," in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Nov 2019, pp. 4415–4422.
[32] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian, "The unmanned aerial vehicle benchmark: Object detection and tracking," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 370–386.
[33] M. Mueller, N. Smith, and B. Ghanem, "A benchmark and simulator for UAV tracking," in European Conference on Computer Vision. Springer, 2016, pp. 445–461.
[34] A. Lukežič, T. Vojíř, L. Č. Zajc, J. Matas, and M. Kristan, "Discriminative Correlation Filter with Channel and Spatial Reliability," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 4847–4856.
[35] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Adaptive Decontamination of the Training Set: A Unified Formulation for Discriminative Visual Tracking," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1430–1438.
[36] D. Held, S. Thrun, and S. Savarese, "Learning to track at 100 fps with deep regression networks," in European Conference on Computer Vision. Springer, 2016, pp. 749–765.
[37] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr, "End-to-end representation learning for Correlation Filter based tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2805–2813.
[38] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi, "Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 1349–1358.
[39] N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li, "Unsupervised Deep Tracking," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019, pp. 1308–1317.
[40] C. Wang, L. Zhang, L. Xie, and J. Yuan, "Kernel cross-correlator," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[41] M. Danelljan, F. S. Khan, M. Felsberg, and J. van de Weijer, "Adaptive Color Attributes for Real-Time Visual Tracking," in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1090–1097.
[42] Y. Wu, J. Lim, and M. Yang, "Online Object Tracking: A Benchmark," in 2013 IEEE Conference on Computer Vision and Pattern Recognition, June 2013, pp. 2411–2418.
