
Knowledge-Based Systems 194 (2020) 105554


Visual object tracking with adaptive structural convolutional network



Di Yuan a,1, Xin Li a,1, Zhenyu He a,b,∗, Qiao Liu a, Shuwei Lu a

a School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China
b Peng Cheng Laboratory, Shenzhen 518055, China

∗ Corresponding author at: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. E-mail address: zhenyuhe@hit.edu.cn (Z. He).
1 These authors contributed equally.

Article info

Article history:
Received 2 October 2019
Received in revised form 20 January 2020
Accepted 22 January 2020
Available online 24 January 2020

Keywords:
Visual tracking
Convolution neural network
Structural filters
Adaptive weighting

Abstract

Convolutional Neural Networks (CNNs) have been demonstrated to achieve state-of-the-art performance in the visual object tracking task. However, existing CNN-based trackers usually use holistic target samples to train their networks. Once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model (named ASCT) to enhance the robustness of deep regression trackers. Specifically, we first design a mask set to generate local filters that capture the local structures of the target. Meanwhile, we adopt an adaptive weighting fusion strategy for these local filters to adapt to changes in the target appearance, which effectively enhances the robustness of the tracker. Besides, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against state-of-the-art trackers.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Visual object tracking plays a crucial role in the computer vision community and finds numerous applications such as human–computer interaction, motion analysis, action recognition, video surveillance, and autonomous driving, to name only a few. Given the initial information in the first frame, the tracking task aims to locate the target accurately in the subsequent frames. Although many algorithms have been proposed and much progress has been made in recent years, tracking remains a challenging problem due to realistic and complicated tracking environments.

Recently, discriminative correlation filters (DCFs) have been introduced into the target tracking task and have shown good tracking speed and accuracy [1–8]. Existing DCFs-based methods generate candidates through cyclic shifts of a search image patch. According to the Convolution Theorem, the convolution in the time domain can be computed as an element-wise multiplication in the Fourier domain, which greatly improves the computational efficiency. However, due to the boundary effects caused by the Fourier transform and the synthetic samples generated by the cyclic shift, the tracking performance of these DCFs-based trackers is still far from the actual demand [9–14]. With the emergence of large-scale datasets, CNN-based trackers have shown their great capacity in the tracking task and have greatly improved tracking performance [15–22]. In general, CNN-based methods pre-train their networks on a large-scale dataset (e.g., ImageNet [23]) and fine-tune the networks with the information of the first frame of the tracking sequence. This effectively avoids the boundary effects and also improves the quality of the tracking results. Meanwhile, owing to the efficiency of DCFs-based trackers, some CNN-based trackers also attempt to embed correlation filters as a layer in their convolutional neural networks to improve the tracking efficiency [24–26]. In [25], Chen and Tao proposed a convolutional regression network for visual tracking and solved the regression problem by optimizing a one-channel-output convolution layer. Song et al. [26] proposed to reformulate DCFs as a one-layer convolutional neural network for tracking. However, convolutional filters trained with holistic samples cannot accurately model the local structures of the tracking target, which are crucial for robust and accurate tracking performance.

On the other hand, the importance of each local structure changes along with the variations of the target. It is crucial to adjust the weights of the local parts to constantly focus on the more discriminative parts. Ma et al. [28] utilized features from three convolutional layers to exploit both semantic information and spatial details for visual tracking.

Fig. 1. Comparisons of the proposed tracker (ASCT) and other representative trackers (DeepSRDCF [15], C-COT [16], CREST [26] and SiamFC [27]) on several challenging sequences. These trackers perform differently as various features, network structures, and scale estimation strategies are used. The proposed ASCT tracker performs favorably on these challenging sequences.

The weights of these correlation filters are fixed during the whole tracking process. In [26], Song et al. developed a spatiotemporal residual network for target tracking, but in their network the base layer, the spatial residual layer, and the temporal residual layer have the same weight. Several other tracking methods [3,29] generate attention maps to adjust the weights of different areas; however, attention without exact boundaries may corrode the local structures.

To address the above-mentioned issues, we propose an adaptive structural convolutional filter network for visual target tracking. The local filters are generated by adding binary masks to the original filter, whose size is the same as the target size. Specifically, each local filter is obtained by setting the mask value of the region of interest to 1 and all other values to 0. Considering the variations of the target, we design an adaptive weighting mechanism to focus on the more discriminative local filters. For each local filter, we combine the peak sidelobe ratio and the Laplacian distribution to adaptively determine the corresponding weight. We then integrate the local filter layer and the original filter layer to give the tracker better robustness and tracking performance. We find that the integrated adaptive structural convolutional filter network usually predicts the scale accurately and produces compact bounding boxes, as shown in Fig. 1.

The main contributions of this paper are as follows:

• We propose an adaptive structural convolutional filter network for visual target tracking. The local filter layer can effectively capture the structural patterns of the target and can easily be integrated with the original filter layer, which yields more accurate and robust tracking results.
• We develop an adaptive weighting strategy to improve the stability of the tracking framework, which determines the weight of each local filter by applying the peak sidelobe ratio and the Laplacian distribution to its response.
• We conduct extensive experiments on the OTB-2015, VOT-2016, UAV-123 and TC-128 datasets. The experimental results demonstrate that our ASCT tracker performs favorably against state-of-the-art trackers.

The rest of this paper is structured as follows. We first introduce related work in Section 2. Next, in Section 3 we present the adaptive structural convolutional filter network for visual object tracking, including the basic tracker, the structural convolutional filter model, the adaptive weighting strategy, and the whole tracking process. Subsequently, in Section 4 we introduce the implementation details and the evaluation criteria, and evaluate and discuss our approach on several comprehensive benchmark datasets. Finally, we briefly present the conclusion of our work in Section 5.

2. Related works

In this section, we introduce the tracking methods closely related to our work. A comprehensive review of tracking methods is beyond the scope of this paper; some survey papers can be found in [30–32].

DCFs-based trackers. DCFs-based trackers have been widely used in recent years. As a pioneering work, Bolme et al. [9] proposed the MOSSE filter for visual tracking. Although the MOSSE tracker runs at a fast speed of nearly 670 frames per second, its tracking performance is not satisfactory. Subsequently, several variants of the MOSSE tracker have been proposed to improve the tracking performance [1,4,10–13,33,34]. Henriques et al. [1,10] proposed to adopt kernel methods and HOG features in the correlation filter framework for accurate target tracking. In [13], Danelljan et al. used color features instead of grayscale features for target representation, which improves the tracking performance. For fast and robust tracking, Zhang et al. [33] exploited dense spatio-temporal context information in their STC tracker. In the correlation filter framework, generating training samples by cyclic shifts introduces boundary effects, which degrade the tracking performance to a certain extent. To handle these boundary effects, Galoogahi et al. [4,11] took advantage of the intrinsic computational redundancy in the Fourier domain and proposed the BACF tracker, which can efficiently model the variations of a target in the foreground and background. Some spatially regularized DCF trackers [2,15] adopt large spatial support to learn their correlation filters, which can effectively reduce the boundary effect.

Fig. 2. The framework of the proposed algorithm. We generate structural responses with local convolution filters. Then, we fuse the structural responses with the
base response from a holistic filter to generate the final response map. The weights of the local filters are adaptively updated to focus on more discriminative parts.

Although these trackers have an acceptable tracking speed, there is still a problem in using holistic target samples to train their models in complex tracking scenarios. To address these drawbacks, we propose an adaptive structural convolutional network to improve the tracking performance, which adaptively weights local target information to strengthen the representational ability of the model.

Part-based trackers. The trackers described above are susceptible to noise interference, which reduces their stability. To make trackers more robust, several patch-based correlation filter trackers have been proposed [35–42]. Liu et al. [35] proposed a patch-based tracking method with multiple correlation filter models; the combination of multiple parts can effectively deal with the effects of noise. Li et al. [43] introduced a reliable part-based tracking algorithm, which attempts to evaluate and utilize the reliable parts throughout the whole tracking process. In [44], a deformable part-based correlation filter tracking approach was proposed to deal with long-term tracking tasks, relying on coupled interactions between a global filter and several part filters. Deformable part models show great potential in tracking by principally addressing nonrigid object deformations and self-occlusions [45], but a potentially large number of degrees of freedom have to be estimated for object localization, and simplifications of the constellation topology are often assumed to make the inference tractable. Lukezic et al. [46] presented a new formulation of the constellation model with correlation filters that treats the geometric and visual constraints within a single convex cost function, and derived a highly efficient optimization for maximum a posteriori inference of a fully connected constellation. For different tracking targets, Rasmussen et al. [47] proposed a framework that combines and shares information among several state estimation processes operating on the same underlying visual targets. These part-based tracking algorithms all improve the robustness of the tracker to varying degrees. Nevertheless, their integration strategies often use a fixed weight for each patch, which cannot fully exploit the advantages of part-based tracking. In this paper, we develop an adaptive weighting strategy to mitigate this defect, which adaptively applies the peak sidelobe ratio and the Laplacian distribution to determine the corresponding weight of each local filter.

CNN-based trackers. In recent years, deep learning-based tracking methods have been gaining popularity. The convolutional neural network (CNN) [19,23,48–52] is the most popular deep learning model in visual tracking due to its formidable capability for feature extraction and representation. As demonstrated empirically in [53–55], features play the most important role in a tracking method. Since hand-crafted features show natural defects in the tracking process, Huang et al. [56] proposed an adaptive approach to select cheap features and deep features and formulated this adaptive tracking task as a decision-making process. In [29], Choi et al. presented a context-aware deep feature compression framework to achieve real-time tracking performance. The CNT [57] tracker uses a two-layer feed-forward convolutional network to generate an effective representation of the tracking target, achieving high-speed performance even on a CPU. Li et al. [50] introduced the DeepTrack method, which employs a CNN architecture to evaluate the similarity between the target and the candidate samples. The representation ability of features extracted from a single CNN layer can be improved by extracting features from multiple layers. Inspired by this discovery, Ma et al. [28] proposed a hierarchical convolutional features framework for visual object tracking. Gao et al. [58] proposed an end-to-end network model to learn a reinforced attentional representation for accurate target tracking. Qi et al. [17] proposed an adaptive Hedge method to combine several CNN trackers into a strong one. In [59], Gao et al. designed a lightweight Siamese network that can capture more global and local contextual information at multiple scales. In [60], the TFCR tracking framework was proposed to balance the disequilibrium of positive and negative samples. All of these trackers train their networks with the holistic target, and when the target is occluded or disturbed by noise, their tracking results degrade greatly. Different from the aforementioned trackers, our ASCT tracker uses an adaptive structural convolutional network to adaptively enhance the target representation, which can improve the tracking performance effectively.

3. Adaptive structural convolutional networks

In this section, we introduce the adaptive structural convolutional network for the visual tracking task. First, we give a brief introduction to the baseline tracker. Then, we present the structural convolutional filter network. Furthermore, we describe the adaptive weighting strategy that effectively adjusts the weights of the different local filters. Finally, we show the tracking pipeline of the proposed tracker, which is illustrated in Fig. 2.

3.1. Base framework

CREST [26] reformulates DCFs as a one-layer convolutional neural network and integrates feature extraction, response map generation, and model updating into the network for end-to-end training. We use the base layer of the CREST network as our base framework.

Fig. 3. Example of the implementation of the local part convolution layer.

A DCFs-based tracking method usually learns a discriminative classifier and predicts the target center by searching for the maximum response value [1,4]. Formally, learning the correlation filter amounts to solving the following minimization problem:

W = arg min_W ‖W ∗ X − Y‖² + λ‖W‖²,  (1)

where W denotes the correlation filter, X denotes the input samples, Y is the corresponding Gaussian label, and λ denotes the regularization parameter.

The CREST tracker reformulates the learning process of DCFs as the loss minimization of a convolutional neural network with a weight decay term given by an ℓ2 regularization:

L(W) = L_W(X) + λ‖W‖²,  (2)

where L_W(X) = ‖F(X) − Y‖², F(X) is the network output, and Y is the ground-truth label. This is the base layer of the CREST network. Our ASCT tracker builds on the base layer of the CREST network without residual learning. More details about the CREST tracker can be found in [26].
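The base layer can be written down directly as a single convolutional layer trained with the ℓ2 regression loss of Eq. (2). The following is a minimal PyTorch sketch of this formulation; the class and function names, kernel handling, and optimizer settings are illustrative assumptions rather than the exact CREST/ASCT configuration.

```python
import torch
import torch.nn as nn

class BaseLayer(nn.Module):
    """One-layer convolutional reformulation of a DCF (Eqs. (1)-(2))."""
    def __init__(self, in_channels: int, filter_h: int, filter_w: int):
        super().__init__()
        # W in Eq. (1): a single filter whose spatial size matches the target.
        self.conv = nn.Conv2d(in_channels, 1, (filter_h, filter_w),
                              padding=(filter_h // 2, filter_w // 2), bias=False)

    def forward(self, x):
        return self.conv(x)  # response map F(X)

def train_base_layer(model, features, gaussian_label, weight_decay=1e-4, steps=100):
    """Minimise L(W) = ||F(X) - Y||^2 + lambda*||W||^2 (Eq. (2)).

    `features` and `gaussian_label` are tensors of matching spatial size,
    e.g. (1, C, H, W) and (1, 1, H, W)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,
                                 weight_decay=weight_decay)  # lambda*||W||^2 term
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(features), gaussian_label)      # ||F(X) - Y||^2
        loss.backward()
        optimizer.step()
    return model
```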
3.2. Structural convolutional filters

Although the CREST tracker achieves good performance, its convolutional filters trained with holistic samples are less effective for capturing the discriminative local structures of the target. The local structures play a significant role in tracking performance, especially in complicated tracking scenes (e.g., background clutter, occlusion, deformation). In order to enhance the robustness and accuracy of the deep regression tracking framework, we propose an adaptive structural convolutional filter network for visual tracking.

Intuitively, the target can be divided into several local parts, and the response output of each local filter can be calculated separately. If occlusion or interference occurs in some local areas, the target position can still be located accurately through the other unobstructed or undisturbed parts. Therefore, based on defining DCFs as a convolutional layer, we decompose the convolutional filter into M local parts and then use a part filter set to calculate the corresponding response map for each local region. The part filters are combined with the base convolution layer to improve the robustness of the tracker.

Specifically, the tracker needs to predict the position of the target in subsequent frames based on the given target position in the first frame. After extracting the features, the tracker feeds them into the base layer and the local filter layer. These two layers calculate their responses correspondingly. Then the final response is computed as:

W^t = φ ∗ B^t + ϕ ∗ V^t,  (3)

where φ and ϕ denote the weight factors of the base layer and the part-fused layer, W^t denotes the final response output at the tth frame, B^t denotes the base layer response output at the tth frame, and V^t denotes the fused response output of the M local filters at the tth frame:

V^t = Σ_{m=1}^{M} γ_m^t ∗ P_m^t,  (4)

where γ_m^t denotes the weight of the mth local filter at the tth frame, and P_m^t denotes the response output of the mth local filter at the tth frame.

For each local filter, the size of the convolution kernel is the same as the target size. We use binary masks that set the region of interest to 1 and the other parts to 0, dividing the target into M local parts by performing M mask operations. Fig. 3 shows an example of dividing the target into M local parts with the binary masks. During training and updating, we only update the weights of the local part with a nonzero mask value. During updating, the size of the response output of each local filter equals that of the base layer, which makes the fusion of the different response outputs convenient.
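To make the decomposition concrete, the sketch below builds M = 4 binary masks over the filter support (cf. Fig. 3), derives local filters by masking a target-sized base filter, and fuses the local responses with the base response as in Eqs. (3) and (4). It is a NumPy illustration with assumed helper names (make_masks, local_filters, fuse_responses); in ASCT the local filters are trained convolutional layers rather than masked copies of a single array.

```python
import numpy as np

def make_masks(filter_h, filter_w, grid=(2, 2)):
    """Binary masks: the region of interest is set to 1 and everything else to 0."""
    masks = []
    rows = np.array_split(np.arange(filter_h), grid[0])
    cols = np.array_split(np.arange(filter_w), grid[1])
    for r in rows:
        for c in cols:
            m = np.zeros((filter_h, filter_w), dtype=np.float32)
            m[np.ix_(r, c)] = 1.0
            masks.append(m)
    return masks  # M = grid[0] * grid[1] local masks

def local_filters(base_filter, masks):
    """Local filters obtained by masking a target-sized filter (cf. Fig. 3)."""
    return [base_filter * m for m in masks]

def fuse_responses(base_resp, local_resps, gammas, phi=1.0, varphi=1.0):
    """Eq. (4): V^t = sum_m gamma_m^t * P_m^t, then Eq. (3): W^t = phi*B^t + varphi*V^t."""
    v_t = sum(g * p for g, p in zip(gammas, local_resps))
    return phi * base_resp + varphi * v_t

# Minimal usage with random stand-ins for the response maps.
masks = make_masks(40, 60)                                   # 4 parts of a 40x60 filter
parts = [np.random.rand(31, 31) for _ in range(len(masks))]  # local response maps P_m^t
fused = fuse_responses(np.random.rand(31, 31), parts, gammas=[0.75] * len(masks))
```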
3.3. Adaptive weighting strategy

During tracking, the local parts of the target appearance change inconsistently across different scenes, for example under occlusion or deformation. If the algorithm uses fixed weights to directly add the response outputs of the M local filters into the integrated output, the reliability of a local filter may be inconsistent with its weight, thereby reducing the tracking performance. The response value should be suppressed if the local part is occluded, and vice versa. Therefore, we propose an adaptive weighting strategy to achieve this adaptation.

For DCFs-based trackers, the peak sidelobe ratio can be used to quantify the sharpness of the correlation peak, and it is calculated as:

psr = (g_max − µ_s1) / σ_s1,  (5)

where g_max denotes the peak value of the response map, and µ_s1 and σ_s1 denote the mean value and standard deviation of the response map, respectively.
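Eq. (5) translates into a few lines of NumPy. Computing the mean and standard deviation over the whole response map, rather than over an explicitly masked sidelobe region, is an assumption consistent with the notation above.

```python
import numpy as np

def peak_sidelobe_ratio(response):
    """Eq. (5): psr = (g_max - mu_s1) / sigma_s1 of a correlation response map."""
    g_max = response.max()
    mu = response.mean()
    sigma = response.std()
    return (g_max - mu) / (sigma + 1e-12)  # small epsilon guards against a flat map
```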
Fig. 4 illustrates the peak sidelobe ratio of a moving target in different image frames. As can be seen from Fig. 4, the target in the upper left sub-figure is in the center of the bounding box, and the corresponding response map in the lower left sub-figure is single-peaked and smooth; the peak sidelobe ratio is 7.0. Meanwhile, in the upper right sub-figure the target deviates from the center of the bounding box, and the corresponding response map is multi-peaked and ambiguous; in this case the peak sidelobe ratio is 5.5.

Fig. 4. An illustration of the peak sidelobe ratio in different image frames of a moving target.

Therefore, the larger the peak sidelobe ratio is, the closer the target center located by our tracker is to the center of the real target bounding box.

If the target is not occluded, the response map is usually unimodal. However, when the target is partially occluded, the response map becomes multi-peaked. In this case, the peak sidelobe ratio cannot effectively enhance the response of the real target region. Since the target's motion distance between two frames follows a Laplacian distribution [61], we can use this observation to compensate for the shortcoming of the peak sidelobe ratio. The Laplacian distribution is computed as:

f(x, µ, b) = (1 / 2b) e^(−|x−µ| / b),  (6)

where µ is the position parameter and b is the scale parameter. As is apparent from Fig. 5, µ determines the center position of the target in the current frame, b determines the extent to which the target may move, and x is the distance the target moves between two frames. Regardless of how the values of µ and b change, the center position of the target always moves within a small range.

Fig. 5. Laplace distribution under different parameters. As can be seen from the area under the curve, the target center position mainly changes within a small range.

Based on the peak sidelobe ratio and the Laplacian distribution, the weight γ at the tth frame can be calculated as:

γ_m^t = psr_m^t ∗ α_m^t,
α_m^t = α,  if abs(dist(max(g_m^t) − max(g^{t−1}))) ≥ 1/5,
α_m^t = 1,  if abs(dist(max(g_m^t) − max(g^{t−1}))) < 1/5,  (7)

where psr_m^t denotes the peak sidelobe ratio of the mth local filter in the tth frame, and abs(dist(max(g_m^t) − max(g^{t−1}))) denotes the distance between the maximum of the local filter response in the current frame and the maximum of the response in the previous frame, measured with the Euclidean distance.
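Putting Eqs. (5)–(7) together, the weight of each local filter can be computed as sketched below (reusing peak_sidelobe_ratio from the previous sketch). The Euclidean distance between consecutive peak locations plays the role of dist(·), and interpreting the 1/5 threshold as a fraction of the response-map size is our assumption, since the unit is not stated explicitly.

```python
import numpy as np

def peak_location(response):
    """(row, col) of the maximum response value."""
    return np.unravel_index(np.argmax(response), response.shape)

def adaptive_weight(local_resp, prev_resp, alpha=0.75, threshold=0.2):
    """Eq. (7): gamma_m^t = psr_m^t * alpha_m^t, with alpha_m^t switched by peak motion."""
    p_t = np.array(peak_location(local_resp), dtype=np.float32)
    p_prev = np.array(peak_location(prev_resp), dtype=np.float32)
    # Normalised Euclidean displacement of the peak between consecutive frames.
    disp = np.linalg.norm(p_t - p_prev) / max(local_resp.shape)
    alpha_m = alpha if disp >= threshold else 1.0
    return peak_sidelobe_ratio(local_resp) * alpha_m
```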
3.4. Tracking via ASCT

The tracking process of our ASCT algorithm includes four stages: model initialization, online detection, scale estimation, and model update.

Model initialization. Given the first frame with the ground-truth, we extract training samples centered on the target location. We use VGG-16 [62] with fixed network parameters for feature extraction. All parameters of the base layer and the local part layers are randomly initialized from a zero-mean Gaussian distribution.

Online detection. When a new image frame arrives, the search area is extracted based on the target center position of the previous frame. The search area is fed into the network to generate a response map, and the position with the largest response value is the center of the target in the current frame. Specifically, in our ASCT network the search area is sent to the M + 1 convolutional layers, and the response outputs are adaptively added to form the final response map. The peak point of the final response map is the location of the target center.
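The detection step then reduces to taking the arg-max of the fused response map and mapping it back to image coordinates. A minimal sketch, assuming the response map is spatially aligned with the search region, is given below.

```python
import numpy as np

def locate_target(fused_response, search_top_left, stride=1):
    """Target centre = position of the largest value in the fused response (Section 3.4)."""
    row, col = np.unravel_index(np.argmax(fused_response), fused_response.shape)
    cx = search_top_left[0] + col * stride
    cy = search_top_left[1] + row * stride
    return cx, cy
```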
Scale estimation. Based on the determined target center location, search windows of different sizes are extracted and sent to our ASCT network to obtain the corresponding response maps. The width w_t and height h_t of the target at the tth frame are updated as (w_t, h_t) = β(w_t*, h_t*) + (1 − β)(w_{t−1}, h_{t−1}), where w_t*, h_t* are the width and height of the sample with the maximum response value in the current frame. The factor β enables a smooth change of the target size. In general, β takes a small value to avoid target drift.
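The size update is a simple exponential smoothing of the best-scoring scale; β = 0.6 below matches the value reported in Section 4.1.

```python
def update_scale(w_best, h_best, w_prev, h_prev, beta=0.6):
    """(w_t, h_t) = beta*(w_t*, h_t*) + (1 - beta)*(w_{t-1}, h_{t-1})."""
    w_t = beta * w_best + (1.0 - beta) * w_prev
    h_t = beta * h_best + (1.0 - beta) * h_prev
    return w_t, h_t
```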
Model update. Throughout the tracking process, the algorithm continuously generates training data for online updates. For each frame, after predicting the target position, the search window samples are fed directly into our ASCT network as the training set for an online update.
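A hedged sketch of the online update loop is shown below, assuming the PyTorch objects from the base-layer sketch: each tracked frame contributes a (features, Gaussian label) pair, and the layers are fine-tuned for a few iterations. The buffer size and iteration count are illustrative choices, not values reported in the paper.

```python
def online_update(model, optimizer, loss_fn, sample_buffer, max_samples=20, iters=2):
    """Fine-tune the ASCT layers on recently collected search-window samples."""
    recent = sample_buffer[-max_samples:]          # keep only the most recent frames
    for _ in range(iters):
        for features, gaussian_label in recent:
            optimizer.zero_grad()
            loss = loss_fn(model(features), gaussian_label)
            loss.backward()
            optimizer.step()
```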
4. Experiments

We evaluate the proposed ASCT tracker against other state-of-the-art trackers, including SRDCF [2], BACF [4], DeepSRDCF [15], HDT [17], C-COT [16], MDNet_N [19], CREST_base, CREST [26], SiamAN, SiamRN, SiamFC [27], ACFN [3], TRACA [29], MetaCREST [63], SRDCFdecon [64], CNT [57], Staple [65] and EBT [66], on four widely used benchmarks: OTB-2015, TC-128, UAV-123 and VOT-2016 [67–70].

4.1. Implementation details and evaluation criterion

Implementation details: The tracking task provides the target ground-truth in the first image frame. We obtain labeled training samples from the given frame in an area 5 times the target width and height. The local part number M is set to 4. The feature extraction network adopts VGG-16 [62] with the first two pooling layers. The ASCT tracker extracts the feature map from the conv4-3 layer and reduces the feature map channels from 512 to 64 by principal component analysis. The regression response map is generated using a two-dimensional Gaussian function with a peak value of 1.0. The scale estimation parameter β is set to 0.6 and the local filter fusion parameter α is set to 0.75. Except in the discussion part, the weight factors φ and ϕ are set to 1, which means the base layer and the part-fused layer have the same importance in the target representation. The default setting of the base layer follows CREST [26], and we do not use residual learning. Our experiments are performed on a PC with an i7 4.2 GHz CPU, 32 GB RAM and an Nvidia GTX Titan X GPU using the MatConvNet toolbox.
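The 512-to-64 channel reduction can be realized as a PCA projection fitted on the first-frame features. The NumPy sketch below illustrates the idea; it is not the authors' MatConvNet implementation.

```python
import numpy as np

def fit_pca_projection(feature_map, out_channels=64):
    """feature_map: (C, H, W) array, e.g. conv4-3 features with C = 512."""
    c, h, w = feature_map.shape
    x = feature_map.reshape(c, h * w).T              # (H*W, C) samples
    x = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:out_channels]                         # (out_channels, C) projection

def reduce_channels(feature_map, projection):
    """Project a (C, H, W) feature map down to (out_channels, H, W)."""
    c, h, w = feature_map.shape
    x = feature_map.reshape(c, h * w)
    return (projection @ x).reshape(projection.shape[0], h, w)
```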
Evaluation criterion: We evaluate the proposed ASCT tracker on the OTB-2015, TC-128, UAV-123 and VOT-2016 benchmarks. For the OTB-2015, UAV-123 and TC-128 benchmarks, we use one-pass evaluation (OPE) as the evaluation protocol. OPE has two parts: precision plots and success plots. The precision plots show the percentage of frames in which the distance between the predicted position and the ground-truth falls within different thresholds, while the success plots are measured by the average overlap, which accounts for both size and position. For the VOT-2016 benchmark, we use the expected average overlap (EAO), robustness and accuracy as the evaluation indexes. Accuracy measures the overlap between the predicted and ground-truth bounding boxes during successful tracking. Robustness is a stability indicator; the tracker becomes more unstable as the value increases. EAO estimates the average overlap a tracker is expected to achieve over a large number of short-term sequences with the same visual attributes as the given dataset; it combines the raw values of per-frame accuracies and failures in a principled manner, has a clear practical interpretation, and measures the expected no-reset overlap of a tracker run on a short-term sequence.
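For reference, the two OPE curves can be computed from per-frame center errors and overlaps as sketched below; precision is reported at the 20-pixel threshold and the success plot is summarized by its area under the curve (AUC), here approximated by the mean success rate over the overlap thresholds.

```python
import numpy as np

def center_error(pred_box, gt_box):
    """Boxes are (x, y, w, h); distance between box centres in pixels."""
    pc = np.array([pred_box[0] + pred_box[2] / 2, pred_box[1] + pred_box[3] / 2])
    gc = np.array([gt_box[0] + gt_box[2] / 2, gt_box[1] + gt_box[3] / 2])
    return np.linalg.norm(pc - gc)

def overlap(pred_box, gt_box):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[0] + pred_box[2], gt_box[0] + gt_box[2])
    y2 = min(pred_box[1] + pred_box[3], gt_box[1] + gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred_box[2] * pred_box[3] + gt_box[2] * gt_box[3] - inter
    return inter / union if union > 0 else 0.0

def ope_scores(pred_boxes, gt_boxes):
    """Precision at 20 px and an AUC-style success score over a sequence."""
    errs = np.array([center_error(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    ious = np.array([overlap(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    precision_at_20 = np.mean(errs <= 20.0)
    thresholds = np.linspace(0.0, 1.0, 21)
    success = np.array([np.mean(ious >= t) for t in thresholds])
    return precision_at_20, success.mean()
```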
4.2. Experiment on OTB-2015

In this section, to validate the effectiveness of the proposed tracker, we compare our ASCT tracker with several state-of-the-art trackers, including DeepSRDCF [15], HDT [17], CREST_base, CREST [26], ACFN [3], TRACA [29], BACF [4], SRDCF [2], MetaCREST [63], SRDCFdecon [64], CNT [57] and SiamFC [27], on the OTB-2015 [67] dataset with 100 different video sequences.

Fig. 6 shows the precision and success plots of the OPE results of our ASCT tracker and the 12 state-of-the-art trackers on the OTB-2015 dataset. Our tracker (0.860/0.644) is the best in both the precision and the success plots, which is a large improvement over the baseline CREST_base tracker. Compared to the convolutional residual learning tracker CREST [26], with a precision score of 0.838 and a success score of 0.623, the proposed ASCT tracker obtains improvements of approximately 2.63% and 3.37%. Moreover, compared to the offline meta-learning-based MetaCREST tracker [63] (0.857/0.637), the proposed ASCT tracker is better in both the precision and success scores. This result demonstrates that adaptive structural convolutional networks for visual tracking are effective and promising in practice.

To further verify the effectiveness of the proposed algorithm, we analyze and compare the performance of the tracker on various attributes, as shown in Table 1. The proposed tracker is among the top three on almost all attributes. For the attributes of fast motion (FM), illumination variation (IV), out-of-plane rotation (OPR), and scale variation (SV), the proposed ASCT tracker achieves the best performance, which can be attributed to the adaptive structural convolutional networks. For other attributes such as background clutter (BC), deformation (DEF), in-plane rotation (IPR), and occlusion (OCC), our tracker is very close to the MetaCREST tracker [63], which benefits from incorporating the actual tracking scenarios into meta-learning. For the remaining attributes, low resolution (LR), out of view (OV), and motion blur (MB), our ASCT tracker is also very close to the best tracker. All of these results show that the adaptive structural convolutional networks can improve the tracking performance.

4.3. Experiment on TC-128

We use the TC-128 [68] dataset to evaluate the tracking performance of the proposed ASCT tracker against several state-of-the-art tracking methods, using the published tracking results of ECO [24], BACF [4], SRDCF [2], DeepSRDCF [15], Staple [65], MEEM [71], SiamFC [27], MCPF [72], SRDCFdecon [64], CNT [57] and HDT [17].

The comparison results are shown in Table 2. Among the compared trackers, the BACF, SRDCF, DeepSRDCF, Staple, MEEM, SiamFC, MCPF, SRDCFdecon, HDT and CNT tracking methods achieve precision and AUC scores of (66.0%, 49.6%), (69.6%, 51.6%), (74.0%, 54.1%), (66.8%, 50.9%), (70.8%, 50.0%), (69.4%, 50.5%), (76.9%, 55.2%), (72.9%, 54.3%), (68.6%, 48.0%) and (44.9%, 33.5%), respectively. In contrast, our ASCT tracker performs well on both metrics (77.1%, 55.8%). The ASCT method obtains a performance gain of 11.1% and 6.2% in precision and AUC scores over the BACF method and outperforms the CNN-based correlation filter trackers (DeepSRDCF, SRDCFdecon, etc.) by a significant margin. The ASCT method performs almost on par with the best tracker, ECO, and significantly outperforms the other trackers (Staple, MEEM and SiamFC). Compared to the HDT tracker with hierarchical CNN feature representation, our ASCT tracker achieves a performance gain of 8.5% and 7.8% in terms of precision and AUC scores. Overall, the proposed ASCT tracker performs favorably against the state-of-the-art trackers in both the precision and AUC metrics.

Fig. 6. The precision plots and success plots of OPE on OTB-2015 over 100 standard benchmark video sequences. The legend contains the average distance precision score at 20 pixels and the area-under-curve (AUC) score for each tracker. For clarity, only the top ten trackers are plotted.

Table 1
AUC scores of the BACF, ACFN, SRDCFdecon, SRDCF, DeepSRDCF, HDT, CNT, SiamFC, TRACA, MetaCREST, CREST_base, CREST and the proposed ASCT on OTB-2015 for
different attributes. The first, second and third best scores are highlighted in red, blue and green colors, respectively.
Trackers FM BC MB DEF IV IPR LR OCC OPR OV SV
ASCT(Ours) 64.9 63.9 64.5 61.8 66.9 61.9 59.8 60.0 63.6 56.3 62.0
CREST_base [26] 52.8 57.4 52.7 51.7 60.9 56.9 52.4 54.3 57.8 53.5 52.7
CREST [26] 62.7 61.8 65.5 56.9 64.4 61.7 47.3 59.2 61.5 56.6 57.2
TRACA [29] 58.1 60.1 59.8 56.1 62.2 58.5 50.2 57.6 59.3 56.6 55.8
MetaCREST [63] 62.7 67.4 65.4 62.2 63.5 63.5 47.2 61.2 62.7 56.0 58.2
BACF [4] 59.8 64.1 58.7 59.9 63.1 58.2 51.7 57.4 58.4 51.6 57.0
ACFN [3] 56.3 53.8 56.4 53.5 56.7 54.4 51.5 53.9 54.3 50.0 54.9
SRDCFdecon [64] 60.6 64.1 63.9 55.3 64.6 57.3 51.7 58.9 59.1 51.0 60.7
SRDCF [2] 59.7 58.3 59.4 54.4 61.3 54.4 51.4 55.9 55.0 46.0 56.1
DeepSRDCF [15] 62.8 62.7 64.2 56.6 64.6 58.9 56.1 60.1 60.7 55.3 60.5
SiamFC [27] 56.8 52.3 55.0 50.6 56.8 55.7 61.8 54.3 58.8 50.6 55.2
HDT [17] 56.8 57.8 57.5 54.3 53.5 55.5 40.1 52.8 53.3 47.2 48.6
CNT [57] 30.6 49.0 32.6 39.8 46.2 41.3 40.6 43.4 43.6 47.5 41.0

Table 2
Precision and AUC scores of the ECO, BACF, SRDCF, DeepSRDCF, Staple, MEEM, SiamFC, MCPF, SRDCFdecon, HDT, CNT and the proposed ASCT on TC-128 dataset. The
first, second and third best scores are highlighted in red, blue and green colors, respectively.
Trackers ASCT ECO BACF SRDCF DeepSRDCF Staple MEEM SiamFC MCPF SRDCFdecon HDT CNT
Ours [24] [4] [2] [15] [65] [71] [27] [72] [64] [17] [57]
Precision scores 77.1 80.0 66.0 69.6 74.0 66.8 70.8 69.4 76.9 72.9 68.6 44.9
AUC scores 55.8 60.5 49.6 51.6 54.1 50.9 50.0 50.5 55.2 54.3 48.0 33.5

Table 3
Precision and AUC scores of the ECO, CFNet, SRDCF, MUSTer, SAMF, MEEM, SiamFC, DSST, KCF, ASLA, BACF, CNT and the proposed ASCT on UAV-123 dataset. The
first, second and third best scores are highlighted in red, blue and green colors, respectively.
Trackers ASCT ECO CFNet SRDCF MUSTer SAMF MEEM SiamFC DSST KCF ASLA BACF CNT
Ours [24] [6] [2] [73] [74] [71] [27] [12] [1] [75] [4] [57]
Precision scores 71.6 74.1 65.1 67.6 59.1 59.2 62.7 72.6 58.6 52.3 57.1 65.4 52.4
AUC scores 50.6 52.5 43.6 46.4 39.1 39.6 39.2 49.8 35.6 33.1 40.7 45.7 36.9

4.4. Experiment on UAV-123

We evaluate our ASCT tracker on the UAV-123 [69] dataset and compare it with state-of-the-art tracking methods using the published tracking codes or results of ECO [24], CFNet [6], SRDCF [2], MUSTer [73], SAMF [74], MEEM [71], SiamFC [27], DSST [12], KCF [1], BACF [4], CNT [57] and ASLA [75].

The comparison results are shown in Table 3. Among the compared trackers, our ASCT tracker is second or third best on both metrics (71.6%, 50.6%) and is very close to the best tracker (74.1%, 52.5%). Compared to DCFs-based trackers (e.g., BACF, DSST), the proposed ASCT method obtains significant improvements in both precision and AUC scores. Compared to deep learning-based trackers (e.g., CFNet, CNT), our ASCT tracker also obtains some improvement in both precision and AUC scores. In general, our ASCT tracker performs favorably against the state-of-the-art trackers in both precision and AUC scores.

In Fig. 7, we further analyze the tracking performance under different challenge attributes (e.g., full occlusion, background clutter and scale variation) annotated in the UAV-123 [69] dataset. We show the AUC score under six challenge attributes. The results indicate that our ASCT tracker is more effective than the ECO tracker and the SiamFC tracker in handling out-of-view and full occlusion.

Fig. 7. The success plots over six tracking challenges, including out-of-view, full occlusion, background clutter, similar object, illumination variation, and scale variation on the UAV-123 dataset. For clarity, only the top ten trackers are plotted.

This is because the adaptively weighted integration layer can capture changes in appearance and update the model effectively. Our ASCT performs slightly worse than ECO and SiamFC when similar targets appear in the tracking scene, mainly because our tracker focuses on some local information. When the target undergoes illumination variation and scale variation, the powerful representation capability of the ECO tracker makes its tracking performance exceed that of our ASCT tracker and the SiamFC tracker.

4.5. Experiment on VOT-2016

For a more thorough evaluation, we use the VOT-2016 [70] benchmark to validate the performance of the proposed ASCT tracker against 9 state-of-the-art trackers, including DeepSRDCF [15], C-COT [16], MDNet_N [19], CREST [26], SRDCF [2], SiamAN, SiamRN [27], Staple [65] and EBT [66]. The tracking performance is measured in terms of expected average overlap, accuracy and robustness, as in [70].

Table 4
Performance comparison for 10 state-of-the-art algorithms on the VOT-2016 dataset. The evaluation metrics include expected average overlap (EAO), accuracy and robustness value. The first, second and third best scores are highlighted in red, blue and green colors, respectively.
Trackers EAO Accuracy Robustness
ASCT(Ours) 0.29 0.55 1.16
Staple [65] 0.30 0.54 1.42
EBT [66] 0.29 0.46 1.05
C-COT [16] 0.33 0.53 0.89
DeepSRDCF [15] 0.28 0.52 1.23
SRDCF [2] 0.25 0.53 1.50
MDNet_N [19] 0.26 0.54 0.91
SiamAN [27] 0.24 0.53 1.36
SiamRN [27] 0.28 0.55 1.37
CREST [26] 0.28 0.51 1.08

The comparison results are shown in Table 4. Among the ten compared trackers, C-COT [16] obtains the best expected average overlap (EAO) score (0.33) and the best robustness value (0.89). Our ASCT and SiamRN achieve the best accuracy value (0.55). Meanwhile, the performance of our ASCT tracker is similar to those of Staple and EBT in EAO score and robustness value. Besides, these trackers perform better than the CREST, DeepSRDCF, SRDCF, SiamRN, SiamAN, and MDNet_N trackers. According to the analysis in the VOT-2016 benchmark report and the definition of the state-of-the-art bound, our ASCT tracker performs favorably against the state-of-the-art tracking methods.

4.6. Qualitative comparison

Our proposed tracker (ASCT) significantly improves the tracking performance compared to other representative trackers, including BACF [4], DeepSRDCF [15], CREST [26], SiamFC [27], TRACA [29] and MetaCREST [63], in the visual object tracking task. Fig. 8 shows a qualitative comparison of these seven trackers on some challenging tracking sequences. The tracking results of the BACF [4] tracker show that in scenes with illumination variation, fast motion, and background clutter it is easily disturbed, which reduces the tracking performance. This may be because it adopts the HOG feature, which cannot model the target appearance well in complex tracking scenes. In comparison, the DeepSRDCF [15] tracker investigates the impact of CNN-based features on the DCFs-based framework. It performs well on occlusion and background clutter (e.g., freeman1). However, direct fusion limits the processing potential of the model under illumination variation (e.g., matrix, skating1) and fast motion (e.g., motorRolling, skiing). Thanks to its attentional correlation filter network, the TRACA [29] tracker achieves good results under illumination variation (e.g., matrix) and fast motion (e.g., skiing); on other sequences, its tracking performance still has much room for improvement. From these tracking results, we can see that when the tracked target is affected by complex scenes with fast motion, background clutter, motion blur, and other challenges, the compared trackers may quickly lose the target (e.g., motorRolling, matrix, skating1), but our ASCT tracker can still locate it faultlessly. Unlike the compared trackers, and benefiting from the adaptive structural convolutional network, our ASCT tracker can handle fast motion, scale variation, illumination variation, deformation, and other challenges very well.

Fig. 8. Qualitative comparison of our ASCT tracker and other representative trackers (BACF [4], DeepSRDCF [15], CREST [26], SiamFC [27], TRACA [29] and
MetaCREST [63]) on some visual object tracking sequences (bolt2, matrix, car4, motorRolling, carScale, ironman, freeman1, freeman4, human3, human6, human8,
human9, skating1, skating2-2, skiing and walking) with fast motion, scale variation, illumination variation, background clutter, deformation and other challenges.

Table 5
Precision scores and AUC scores of our ASCT tracker under different weight factors φ and ϕ on the OTB-2015 dataset. The first, second and third best scores are highlighted in red, blue and green colors, respectively.
φ ϕ Precision scores AUC scores
1.0 0.0 83.4 62.1
0.9 0.1 82.4 61.9
0.8 0.2 82.3 61.5
0.7 0.3 83.9 62.7
0.6 0.4 82.4 61.5
0.5 0.5 86.0 64.4
0.4 0.6 83.1 62.2
0.3 0.7 82.3 61.7
0.2 0.8 82.7 61.5
0.1 0.9 82.4 60.5
0.0 1.0 76.4 55.9

4.7. Discussion

In this section, we mainly discuss the weight factors of the base layer and the local layer, which are essential to our tracking performance. Since our tracker combines the advantages of both the base-layer target representation and the local-layer target representation, it is extremely important to select the weight factors φ and ϕ. Because the weight factors mainly reflect the importance of the different layers, we assume that the sum of φ and ϕ is 1. We successively choose φ as [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0] and ϕ as [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] correspondingly.

Table 5 shows the comparison results on the OTB-2015 [67] dataset. When the factor φ is 1.0, only the base layer is used to track the moving target; conversely, when the factor ϕ is 1.0, only the local layer is used. From Table 5 we can see that a moderate tracking performance can be obtained by using the base layer independently, while the tracking performance obtained by only using the local layer is relatively poor. Fusing the two parts with different weights also leads to different tracking performance, and when the weights of the two parts are the same, a relatively ideal tracking performance is obtained. This also shows that although the target representation of the local layer can work for the tracker, it still needs a reasonable weight allocation. Meanwhile, our ASCT tracker performs well against the base-layer-only scheme (by 2.4%/2.3%) and the local-layer-only scheme (by 9.6%/8.5%) in terms of average precision and success scores. Overall, these results show that the adaptive structural convolutional network helps CNN-based tracking methods obtain a competitive tracking performance.

Meanwhile, we explore the impact of the adaptive weighting strategy on the proposed ASCT tracker. Table 6 shows the performance comparison of the ASCT tracker with and without the adaptive weighting strategy on the OTB-2015 dataset [67].

Table 6
Precision scores and AUC scores of our ASCT tracker with the adaptive weighting strategy (ASCT_waws) and without the adaptive weighting strategy (ASCT_woaws) on the OTB-2015 dataset.
Indexes Trackers FM BC MB DEF IV IPR LR OCC OPR OV SV Overall
AUC scores ASCT_waws 64.9 63.9 64.5 61.8 66.9 61.9 59.8 60.0 63.6 56.3 62.0 64.4
AUC scores ASCT_woaws 63.1 59.5 63.1 57.1 64.5 59.9 59.4 60.5 61.3 58.1 60.9 62.5
Prec. scores ASCT_waws 82.5 86.8 79.6 85.6 87.8 86.2 94.6 79.9 87.2 69.6 83.2 86.0
Prec. scores ASCT_woaws 81.3 79.4 77.4 78.5 84.8 82.6 94.2 81.2 83.7 74.2 81.8 83.7

We can see that the tracker with the adaptive weighting strategy (ASCT_waws) is significantly better than the tracker without it (ASCT_woaws), which fully demonstrates the effectiveness of our adaptive weighting.

5. Conclusions

In this paper, we propose an adaptive structural convolutional filter network for visual tracking. We generate a set of local filters to capture the local structures of the target. The response outputs of these filters are combined with the response output of the base layer to enhance the final response output of our ASCT tracker. Besides, we propose an adaptive weighting strategy to adjust the weights of the different local filters based on the peak sidelobe ratio and the Laplacian distribution. The response output of each local filter is adaptively weighted to improve the stability of the proposed algorithm. Extensive experimental results on the OTB-2015, VOT-2016, UAV-123 and TC-128 benchmarks validate the effectiveness and stability of the proposed ASCT tracker.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (Grant No. 61672183), by the Natural Science Foundation of Guangdong Province (Grant No. 2015A030313544), by the Shenzhen Research Council (Grant No. JCYJ2017041310455226946, JCYJ20170815113552036, JCYJ20160226201453085), partially by the projects "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (PCL2018KP001)" and "The Verification Platform of Multi-tier Coverage Communication Network for Oceans (PCL2018KP002)", and by the Shenzhen Medical Biometrics Perception and Analysis Engineering Laboratory. Di Yuan is supported by a scholarship from the China Scholarship Council (CSC).

References

[1] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2014) 583–596.
[2] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: International Conference on Computer Vision, 2015, pp. 4310–4318.
[3] J. Choi, H.J. Chang, S. Yun, T. Fischer, Y. Demiris, Y.C. Jin, Attentional correlation filter network for adaptive visual tracking, in: Computer Vision and Pattern Recognition, 2017, pp. 4828–4837.
[4] H.K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware correlation filters for visual tracking, in: International Conference on Computer Vision, 2017, pp. 1135–1143.
[5] X. Li, Q. Liu, Z. He, H. Wang, C. Zhang, W.S. Chen, A multi-view model for visual tracking via correlation filters, Knowl.-Based Syst. 113 (2016) 88–99.
[6] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P.H.S. Torr, End-to-end representation learning for correlation filter based tracking, in: Computer Vision and Pattern Recognition, 2017, pp. 2085–2813.
[7] K. Zhang, X. Li, H. Song, Q. Liu, L. Wei, Visual tracking using spatio-temporally nonlocally regularized correlation filter, Pattern Recognit. 83 (2018) 185–195.
[8] D. Yuan, X. Lu, D. Li, Y. Liang, X. Zhang, Particle filter re-detection for visual tracking via correlation filters, Multimedia Tools Appl. 78 (11) (2019) 14277–14301.
[9] D.S. Bolme, J.R. Beveridge, B.A. Draper, Y.M. Lui, Visual object tracking using adaptive correlation filters, in: Computer Vision and Pattern Recognition, 2010, pp. 2544–2550.
[10] J.F. Henriques, C. Rui, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: European Conference on Computer Vision, 2012, pp. 702–715.
[11] H.K. Galoogahi, T. Sim, S. Lucey, Correlation filters with limited boundaries, in: Computer Vision and Pattern Recognition, 2015, pp. 4630–4638.
[12] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Discriminative scale space tracking, IEEE Trans. Pattern Anal. Mach. Intell. 39 (8) (2017) 1561–1575.
[13] M. Danelljan, F.S. Khan, M. Felsberg, J.V.D. Weijer, Adaptive color attributes for real-time visual tracking, in: Computer Vision and Pattern Recognition, 2014, pp. 1090–1097.
[14] N. Fan, J. Li, Z. He, C. Zhang, X. Li, Region-filtering correlation tracking, Knowl.-Based Syst. 172 (2019) 95–103.
[15] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Convolutional features for correlation filter based visual tracking, in: International Conference on Computer Vision Workshops, 2015, pp. 621–629.
[16] M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond correlation filters: Learning continuous convolution operators for visual tracking, in: European Conference on Computer Vision, 2016, pp. 472–488.
[17] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, M.H. Yang, Hedged deep tracking, in: Computer Vision and Pattern Recognition, 2016, pp. 4303–4311.
[18] X. Li, Q. Liu, N. Fan, Z. He, H. Wang, Hierarchical spatial-aware siamese network for thermal infrared object tracking, Knowl.-Based Syst. 166 (2019) 71–81.
[19] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.
[20] X. Li, C. Ma, B. Wu, Z. He, M.-H. Yang, Target-aware deep tracking, in: Computer Vision and Pattern Recognition, 2019, pp. 1369–1378.
[21] K. Zhang, Q. Liu, W. Yi, M.H. Yang, Robust visual tracking via convolutional networks without training, IEEE Trans. Image Process. 25 (4) (2015) 1779–1792.
[22] B. Liu, Q. Liu, Z. Zhu, T. Zhang, Y. Yang, Msst-resnet: Deep multi-scale spatiotemporal features for robust visual object tracking, Knowl.-Based Syst. 164 (2019) 235–252.
[23] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
[24] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, Eco: Efficient convolution operators for tracking, in: Computer Vision and Pattern Recognition, 2017, pp. 6638–6646.
[25] K. Chen, W. Tao, Convolutional regression for visual tracking, IEEE Trans. Image Process. 27 (7) (2018) 3611–3620.
[26] Y. Song, C. Ma, L. Gong, J. Zhang, R.W. Lau, M.-H. Yang, Crest: Convolutional residual learning for visual tracking, in: International Conference on Computer Vision, 2017, pp. 2574–2583.
[27] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision Workshop, 2016, pp. 850–865.
[28] C. Ma, J.B. Huang, X. Yang, M.H. Yang, Hierarchical convolutional features for visual tracking, in: International Conference on Computer Vision, 2015, pp. 3074–3082.
[29] J. Choi, H.J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, Y.C. Jin, Context-aware deep feature compression for high-speed visual tracking, in: Computer Vision and Pattern Recognition, 2018, pp. 479–488.
[30] Q. Liu, Z. He, X. Li, Y. Zheng, Ptb-tir: A thermal infrared pedestrian tracking benchmark, IEEE Trans. Multimed. (2019) http://dx.doi.org/10.1109/TMM.2019.2932615.
[31] P. Li, D. Wang, L. Wang, H. Lu, Deep visual tracking: Review and experimental comparison, Pattern Recognit. 76 (2018) 323–338.
[32] H. Lu, P. Li, D. Wang, Visual object tracking: A survey, Pattern Recognit. Artif. Intell. 31 (1) (2018) 61–76.
[33] K. Zhang, L. Zhang, Q. Liu, D. Zhang, M.-H. Yang, Fast visual tracking via dense spatio-temporal context learning, in: European Conference on Computer Vision, 2014, pp. 127–141.
[34] K. Zhang, J. Fan, Q. Liu, J. Yang, W. Lian, Parallel attentive correlation tracking, IEEE Trans. Image Process. 28 (1) (2019) 479–491.
[35] T. Liu, G. Wang, Q. Yang, Real-time part-based visual tracking via adaptive correlation filters, in: Computer Vision and Pattern Recognition, 2015, pp. 4902–4912.
[36] Q. Peng, Y. Cheung, X. You, Y. Tang, A hybrid of local and global saliencies for detecting image salient region and appearance, IEEE Trans. Syst. Man Cybern.: Syst. 47 (1) (2017) 86–97.
[37] S. Liu, T. Zhang, X. Cao, C. Xu, Structural correlation filter for robust visual tracking, in: Computer Vision and Pattern Recognition, 2016, pp. 4312–4320.
[38] D. Yuan, X. Zhang, J. Liu, D. Li, A multiple feature fused model for visual object tracking via correlation filters, Multimedia Tools Appl. 78 (19) (2019) 27271–27290.
[39] T. Liu, X. Cao, J. Jiang, Visual object tracking with partition loss schemes, IEEE Trans. Intell. Transp. Syst. 18 (3) (2017) 633–642.
[40] Z. He, S. Yi, Y.M. Cheung, X. You, Y.Y. Tang, Robust object tracking via key patch sparse representation, IEEE Trans. Cybern. 47 (2) (2017) 354–364.
[41] S. Xin, N.M. Cheung, H. Yao, Y. Guo, Non-rigid object tracking via deformable patches using shape-preserved kcf and level sets, in: IEEE International Conference on Computer Vision, 2017, pp. 5496–5504.
[42] S. Zhang, H. Zhou, J. Feng, X. Li, Robust visual tracking using structurally random projection and weighted least squares, IEEE Trans. Circuits Syst. Video Technol. 25 (11) (2015) 1749–1760.
[43] Y. Li, J. Zhu, S.C.H. Hoi, Reliable patch trackers: Robust visual tracking by exploiting reliable patches, in: Computer Vision and Pattern Recognition, 2015, pp. 353–361.
[44] O. Akin, E. Erdem, A. Erdem, K. Mikolajczyk, Deformable part-based tracking by coupled global and local correlation filters, J. Vis. Commun. Image Represent. 38 (2016) 763–774.
[45] R. Yao, Q. Shi, C. Shen, Y. Zhang, A.V.D. Hengel, Part-based visual tracking with online latent structural learning, in: Computer Vision and Pattern Recognition, 2013, pp. 2363–2370.
[46] A. Lukezic, L. Cehovin, M. Kristan, Deformable parts correlation filters for robust visual tracking, IEEE Trans. Cybern. 48 (6) (2018) 1849–1861.
[47] C. Rasmussen, G.D. Hager, Joint probabilistic techniques for tracking multi-part objects, in: Computer Vision and Pattern Recognition, 1998, pp. 16–21.
[48] C. Tian, Y. Xu, Z. Li, W. Zuo, L. Fei, H. Liu, Attention-guided cnn for image denoising, Neural Netw. (2020) http://dx.doi.org/10.1016/j.neunet.2019.12.024.
[49] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[50] H. Li, Y. Li, F. Porikli, Deeptrack: Learning discriminative feature representations by convolutional neural networks for visual tracking, in: British Machine Vision Conference, 2014, pp. 1–12.
[51] Q. Liu, X. Lu, Z. He, C. Zhang, W.S. Chen, Deep convolutional neural networks for thermal infrared object tracking, Knowl.-Based Syst. (2017) 189–198.
[52] C. Tian, Y. Xu, W. Zuo, Image denoising using deep cnn with batch renormalization, Neural Netw. 121 (2019) 461–473.
[53] N. Wang, J. Shi, D.Y. Yeung, J. Jia, Understanding and diagnosing visual tracking systems, in: International Conference on Computer Vision, 2015, pp. 3101–3109.
[54] K. Zhang, Q. Liu, Y. Jian, M.H. Yang, Visual tracking via boolean map representations, Pattern Recognit. 81 (2018) 147–160.
[55] S. Zhang, X. Lan, H. Yao, H. Zhou, D. Tao, X. Li, A biologically inspired appearance model for robust visual tracking, IEEE Trans. Neural Netw. Learn. Syst. 25 (11) (2015) 1749–1760.
[56] C. Huang, S. Lucey, D. Ramanan, Learning policies for adaptive tracking with deep feature cascades, in: International Conference on Computer Vision, 2017, pp. 105–114.
[57] K. Zhang, Q. Liu, Y. Wu, M.H. Yang, Robust visual tracking via convolutional networks without training, IEEE Trans. Image Process. 25 (4) (2016) 1779–1792.
[58] P. Gao, Q. Zhang, F. Wang, L. Xiao, H. Fujita, Y. Zhang, Learning reinforced attentional representation for end-to-end visual tracking, Inform. Sci. 517 (2020) 52–67.
[59] P. Gao, R. Yuan, F. Wang, L. Xiao, H. Fujita, Y. Zhang, Siamese attentional keypoint network for high performance visual tracking, Knowl.-Based Syst. (2019) http://dx.doi.org/10.1016/j.knosys.2019.105448.
[60] D. Yuan, N. Fan, Z. He, Learning target-focusing convolutional regression model for visual object tracking, Knowl.-Based Syst. (2020) http://dx.doi.org/10.1016/j.knosys.2020.105526.
[61] D. Held, S. Thrun, S. Savarese, Learning to track at 100 fps with deep regression networks, in: European Conference on Computer Vision, 2016, pp. 749–765.
[62] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015, pp. 1–14.
[63] E. Park, A.C. Berg, Meta-tracker: Fast and robust online adaptation for visual object trackers, in: European Conference on Computer Vision, 2018, pp. 1–17.
[64] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking, in: Computer Vision and Pattern Recognition, 2016, pp. 1430–1438.
[65] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P. Torr, Staple: Complementary learners for real-time tracking, in: Computer Vision and Pattern Recognition, 2016, pp. 1401–1409.
[66] G. Zhu, F. Porikli, H. Li, Beyond local search: Tracking objects everywhere with instance-specific proposals, in: Computer Vision and Pattern Recognition, 2016, pp. 943–951.
[67] Y. Wu, J. Lim, M.H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848.
[68] P. Liang, E. Blasch, H. Ling, Encoding color information for visual tracking: Algorithms and benchmark, IEEE Trans. Image Process. 24 (12) (2015) 5630–5644.
[69] M. Mueller, N. Smith, B. Ghanem, A benchmark and simulator for uav tracking, in: European Conference on Computer Vision, 2016, pp. 445–461.
[70] M. Kristan, A. Leonardis, J. Matas, et al., The visual object tracking vot2016 challenge results, in: European Conference on Computer Vision, 2016, pp. 191–217.
[71] J. Zhang, S. Ma, S. Sclaroff, Meem: Robust tracking via multiple experts using entropy minimization, in: European Conference on Computer Vision, 2014, pp. 188–203.
[72] T. Zhang, C. Xu, M.H. Yang, Multi-task correlation particle filter for robust object tracking, in: Computer Vision and Pattern Recognition, 2017, pp. 4819–4827.
[73] Z. Hong, C. Zhe, C. Wang, M. Xue, D. Prokhorov, D. Tao, Multi-store tracker (muster): a cognitive psychology inspired approach to object tracking, in: Computer Vision and Pattern Recognition, 2015, pp. 749–758.
[74] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in: European Conference on Computer Vision Workshops, 2014, pp. 254–265.
[75] X. Jia, H. Lu, M.H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: Computer Vision and Pattern Recognition, 2012, pp. 1822–1829.
