Knowledge-Based Systems xxx (xxxx) xxx
journal homepage: www.elsevier.com/locate/knosys

Learning target-focusing convolutional regression model for visual object tracking✩

Di Yuan a,1, Nana Fan a,1, Zhenyu He a,b,∗

a School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
b Peng Cheng Laboratory, Shenzhen 518055, China

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2020.105526.
∗ Corresponding author at: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China. E-mail address: zhenyuhe@hit.edu.cn (Z. He).
1 D. Yuan and N. Fan have equal contribution to this paper.

Article info

Article history:
Received 27 June 2019
Received in revised form 10 January 2020
Accepted 13 January 2020
Available online xxxx

Keywords:
Visual object tracking
Discriminative correlation filters
Target-focusing model
Convolutional regression

Abstract

Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circularly shifting an image patch to train a ridge regression model, and estimate the target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to solve the aforementioned drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking tasks (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current tracking image frame, which effectively improves the tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples that act on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers. The code is available at: https://github.com/deasonyuan/TFCR.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Visual object tracking is a significant topic in the computer vision community. The task of tracking is to sequentially locate a target, specified by the ground-truth in the first image frame, in the remaining frames of a video sequence. Certain complex tracking environmental factors make the tracking task quite challenging, such as occlusion, deformation, background clutter, and scale variation.

Recently, regression-based methods have attracted significant attention from visual object tracking researchers, such as discriminative correlation filters (DCFs)-based methods [1–6] and convolutional regression (CR)-based methods [7–10]. DCFs-based trackers can be densely sampled by a convolution operation, after extracting the features from a search image patch only once. According to the Convolution Theorem, the convolution calculation can be transformed into element-wise multiplication in the frequency domain. This obviously reduces the computational complexity, which enables these trackers to learn efficiently from additional training samples. However, compared with real samples in real-world tracking scenes, the synthetic samples generated by the convolution operation simultaneously produce the boundary effect, which makes them less elegant and causes poor performance in the visual object tracking task [1,11–15]. In order to effectively alleviate the negative effect introduced by synthetic samples, the correlation filter model can be reformulated as a one-layer convolutional neural network (CNN) in these trackers [7,9].

Although CR-based algorithms have achieved favorable tracking performance, they are also limited by the imbalance between positive and negative samples. Similar to the class imbalance in classification [16–21], regression-based tracking methods also suffer from it: the number of background samples is far greater than the number of target samples used in the training process for each frame [5,8,9,22–24]. Thus, the training process is excessively dominated by background samples. In addition, a trained object appearance model must estimate the unknown appearances of the target in the tracking task. Therefore, the appearance model should place more emphasis on the target sample than on the background samples.


Fig. 1. The comparison results of our proposed TFCR tracker and other representative trackers (KCF [1], BACF [5] and CREST [7]) on some challenging tracking sequences. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

In order to solve the above-mentioned problems, we propose a Target-Focusing Convolutional Regression (TFCR) model for the visual object tracking task. Different from other CR model-based trackers, our CR model uses a target-focusing strategy, which allows the model to focus more on the target samples than on the background samples. Specifically, we use an improved target-focused regression model to train the convolutional neural network, which tends to pay more attention to the target sample and reduces the influence of the background samples on the target appearance model. Because there are 10⁴–10⁵ background samples but only a few target samples during the training process in each frame, the appearance model can be overwhelmed by the background samples, which decreases the tracking performance. When we put the target-focusing strategy into the convolutional regression model, it places more focus on the target than on the background. Meanwhile, the target-focusing loss function can effectively balance the proportion of positive and negative samples, and prevent overfitting the appearance model to the background samples. In practice, we found that the TFCR model usually produced accurate tracking results, as shown in Fig. 1.

The main contributions of this work are as follows:

• We propose a target-focusing loss function to place more focus on target samples, and simultaneously reduce the effect of background samples acting on the object appearance model.
• We integrate the target-focusing loss function with a convolutional regression model for visual tracking, which can effectively improve the quality of the response map and improve the tracking performance directly.
• Extensive experiments on the standard benchmarks show that the proposed TFCR tracker performs favorably against state-of-the-art trackers.

The rest of this paper is structured as follows. We first give a brief survey of recent related works in Section 2. Next, we provide an exhaustive characterization of our target-focusing convolutional regression model, including the introduction of the convolutional regression model, the target-focusing loss function and the tracking process of our TFCR tracker, in Section 3. Subsequently, we present the implementation details, the evaluation criterion and the evaluation of our proposed TFCR tracker on some comprehensive benchmark datasets in Section 4. Finally, we give a concise conclusion about our work in Section 5.

2. Related work

DCFs-based trackers. Discriminative correlation filters (DCFs) have been widely used in the visual object tracking task in recent years. In [25], a MOSSE filter was proposed for the tracking task, which can use a single image frame to produce steady correlation filters, and its tracking speed is nearly 670 frames per second. However, the tracking accuracy of the MOSSE tracker cannot meet the actual demand. In order to improve the tracking accuracy, Henriques et al. [1] proposed the KCF tracker, which used kernel methods and HOG features with correlation filters. Danelljan et al. [26] improved the single-channel gray value feature used in the tracking task by adopting color attributes. The trackers described above are susceptible to noise interference, which reduces their stability. In order to make the tracker more robust, several patch-based correlation filter trackers have been proposed [27–31]. Liu et al. [27] proposed a patch-based tracking method with multi-CFs models; the combination of multiple parts can effectively manage the effects of noise. Li et al. [28] give a reliable part-based tracking algorithm that attempts to evaluate and utilize reliable parts to track an object through the entire tracking process. These trackers produce training samples using a cyclic shift of the image patch, which causes some boundary effects. To address the boundary effects, the CFLB [32] tracker takes advantage of the intrinsic computational redundancy in the Fourier domain, which obviously limits the circular boundary effects in the tracking task. Galoogahi et al. [5] proposed a Background-Aware Correlation Filter (BACF) based on HOG features, which efficiently models the variation of an object in both the foreground and background. The spatially regularized DCFs trackers [11,13,33] adopt large spatial support to learn the correlation filters, which effectively reduces the boundary effect but at a high computational cost. Although these trackers have high tracking speed, the tracking task is still a thorny problem due to the imbalanced training samples and because the response map is easily disturbed by noise. As an alternative to these trackers, we propose a target-focusing convolutional regression model, which effectively reduces the influence of background samples acting on the object appearance model and improves the quality of the response map.

CNN-based trackers. In recent research, deep learning has obtained impressive results in areas of computer vision, such as segmentation, classification, detection, and tracking. The convolutional neural network (CNN) [34–41] is the most fashionable deep learning model in visual object tracking, because of its formidable capability for feature extraction and representation.


Fig. 2. The framework of our proposed approach. In the training stage, we use the target-focusing loss function to reduce the influences of background samples and
maximize the response of the object target sample. After that, the response map in the next image frame will be improved to locate the object target accurately.

Because hand-crafted features have a significant impact on tracking performance, Nam et al. [42] proposed the MDNet tracker, built on a discriminatively trained CNN model with many tracking videos, and a pre-trained model with some shared layers and a binary classification layer to obtain a generic target representation. Liu et al. [37] presented a multi-scale spatiotemporal features model to accurately estimate the size of the tracking target. Li et al. [40] analyzed a DeepTrack method, which selected samples through cluster sampling and employed a CNN architecture with a structural loss function to evaluate the similarity of these samples. The features extracted from a single CNN layer were further improved with multiple layers. Inspired by this, Qi et al. [43] proposed taking advantage of the abundance of features from different CNN layers to train several weak trackers and fuse them into a strong one. In [44], Tao et al. tried to use a similarity function to select the image patch that was the most similar to the initialized target; the strength of the similarity matching function came from a Siamese network. Valmadre et al. [45] decoded the DCFs learner as a differentiable CNN layer and tracked the target in an end-to-end way. Different from these CNN-based trackers, our TFCR tracker uses a target-focusing strategy to decrease the influence of background samples and maximize the response of the target sample, which effectively improves tracking performance.

Attention loss. The attention strategy has been used in tracking tasks for a considerable time [46–52]. Fan et al. [46] proposed a spatial attention method for robust object tracking. In order to mitigate model drift, Cui et al. [47] proposed a recurrent target-attending tracking method that attempts to evaluate and utilize the trustworthy parts in the overall tracking process. Choi et al. [48] proposed an attention-modulated tracking method that decomposes a target into multiple parts, and multiple elementary trackers were trained to adjust the attention distribution over these parts. They then adaptively selected the subset of correlation filters using a deep attention network based on the dynamic characteristics of the tracking target [50]. The spatial-temporal attention mechanism was also used in the tracking task for robust and effective tracking performance [51,52]. Zhu et al. [51] proposed using the flow information across consecutive frames to improve feature representation and tracking performance. In [52], Wang et al. reconstructed the correlation filters in a Siamese network and added three different attention mechanisms, which effectively alleviated the over-fitting problem in deep learning and enhanced the algorithm's discriminative ability and adaptability. Different from these attention strategies, we use a target-focusing convolutional regression model to balance the effect of positive and negative samples on the trained object appearance model. The target-focusing loss function effectively improves the object appearance model by weakening the influence of the negative samples.

3. Target-focusing convolutional regression

In this section, we present the proposed Target-Focusing Convolutional Regression. Firstly, we introduce the standard convolutional regression. Then, we propose a target-focusing loss function on convolutional regression. Furthermore, we show the tracking process of the proposed TFCR tracker. Fig. 2 shows the TFCR tracking pipeline and the details are discussed below.

3.1. Convolutional regression

The Convolutional Regression (CR) model reformulates the DCFs as a one-layer CNN network. The CR model is learned from a training patch and estimates the target location using the response map. We denote the training patch by X and the CR model by w. The cost function can be written as follows,

J(w) = J(w; ϕ(X)) + λ r(w),  (1)

where ϕ(X) denotes the feature map extracted from the training image patch X, J(w; ϕ(X)) is the error term, and r(w) is the weight decay term. λ denotes the importance of the weight decay term r(w).

Similar to the DCFs, CREST [7] and CRT [9] adopt the L2 norm as the weight decay term r(w), and adopt the least squares method for the error term J(w; ϕ(X)). The cost function in Eq. (1) can then be specifically written as follows,

J(w) = ‖w ∗ ϕ(X) − y‖² + λ‖w‖²,  (2)

where ∗ indicates the convolution operation, and y is the desired response map, a Gaussian distribution centered at the target location.

3.2. Target-focusing loss

For the visual tracking task, the target samples are obtained from the tracking results. These samples are very limited in number, and spatially overlapped. Compared with the target samples, the background samples are massive and varied. Obviously, the training procedure is dominated by the background samples. We propose a target-focusing loss on convolutional regression to improve the influence of the target samples on the appearance model.

Because a trained appearance model is used for estimating the candidates in a new image frame, the maximum value in the response map is always lower than the desired value. In other words, the appearance model in standard convolutional regression is insufficient. Different from the least squares method, we focus on the relative relationship between target samples and background samples. It can be formulated as follows,

J(w) = ‖w ∗ ϕ(X) − y‖² − η‖w ∗ ϕ(X)‖² + λ‖w‖²,  (3)

where η controls the relative importance, and ‖w ∗ ϕ(X)‖² is the target-focusing part, which is used to increase the role of the target part in the regression model. The desired responses of the background samples are close to zero, as shown in Fig. 3(a). Therefore, Eq. (3) is committed to increasing the relative difference between the responses of the target samples and the background samples; at the same time, it maximizes the responses of the target samples.

In order to explain the problem more concisely, we denote an extra variable y′,

y′ = w ∗ ϕ(X).  (4)


Fig. 3. (a) The desired response map: there is a clear dividing line between the target sample and the background samples, and the response map is continuous, smooth and unimodal. (b) A visualization example of the derivatives of y′. It can be clearly seen that the influence of the background samples is reduced more by the target-focusing loss than by the L2 loss.

Fig. 4. The response maps and tracking results on frames #370 and #375 of the Lemming sequence. The response map is improved by the target-focusing (TF) loss function, which allows the object target to be located accurately. The rightmost column shows the tracking results; the red box represents the object target location with the TF loss and the green box represents the object target location without the TF loss.

During the training stage, the derivative of the loss function J with respect to y′ can be obtained from Eq. (3) as follows,

∂J/∂y′ = 2(y′ − y) − 2ηy′ = 2[(1 − η)y′ − y].  (5)

To further illustrate the difference between the effects of the L2 loss and the target-focusing loss on convolutional regression, we visualize an example of the derivative with respect to y′ in Fig. 3(b). To clearly show the difference, we vectorize the gradients, in which the target samples correspond to the center locations. From the figure, we can see that compared with the L2 loss, the target-focusing loss reduces the influence of the background samples and maximizes the response of the target sample. In other words, it increases the relative gap between the background samples and the target samples.

The target-focusing loss function enables the response map to focus on the target region rather than the background region in image frames. Fig. 4 shows an example to explain how the target-focusing loss function affects the response map in practice. The response maps and tracking results on the tth image frame with (or without) the target-focusing loss function can be seen in Fig. 4. It clearly demonstrates that the response map with the target-focusing loss function has high confidence and accurately locates the target. On the contrary, the response map without the target-focusing loss function is disturbed and loses the target.

3.3. Tracking via TFCR

In this section, we briefly illustrate the detailed procedure of our TFCR tracker for visual object tracking. The tracking process involves four parts:

Model initialization. Model initialization is the first and an important step in the tracking task [14,53,54]. After the first image frame with the ground-truth is captured, we can obtain a training image patch with labels from it. The training patch is centered on the target but is larger than the target. Then, we utilize the proposed tracking framework to extract features and acquire the response map from the training image patch. Our TFCR, like most trackers, adopts the VGGNet [55] for feature extraction. Meanwhile, all the parameters in the convolutional regression layer are randomly initialized and follow a zero-mean two-dimensional Gaussian distribution.

Online detection. Online detection is directly related to tracking performance [56,57]. When the object target in the tth frame is located, the tracker can use it as the central location to extract a search image patch. In the (t + 1)th frame, the image patch is fed into the proposed tracking framework to generate a response map, which is improved by the target-focusing loss function. After the improved response map is obtained, the tracker can obtain the target center from the maximum value point on the response map.
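As a small illustration of the online detection step just described, the sketch below reads the new target center from the peak of the response map. It assumes a single-channel response map aligned with the search patch (ignoring the feature-map stride of the network), and the helper name locate_target is hypothetical.

```python
# Minimal sketch: take the peak of the response map as the target center,
# as in the online detection step above. The alignment between response
# map and search patch is a simplifying assumption.
import numpy as np

def locate_target(response, patch_top_left):
    """Return the target center in image coordinates.

    response:       2-D response map of the search patch, shape (H, W)
    patch_top_left: (row, col) of the search patch in the full image
    """
    peak = np.unravel_index(np.argmax(response), response.shape)  # (row, col) of max
    return (patch_top_left[0] + peak[0], patch_top_left[1] + peak[1])

# Example: a synthetic response whose peak sits at (20, 30) inside the patch.
resp = np.zeros((64, 64))
resp[20, 30] = 1.0
print(locate_target(resp, patch_top_left=(100, 200)))  # -> (120, 230)
```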


Scale estimation. Because the distance between the moving target and the camera is always changing, the moving target's scale estimation is highly significant to the visual object tracking task [58,59]. After the target location has been obtained, we can extract some search patches at different scales with the same central location and feed them into the proposed feature extractor. After that, the corresponding prediction maps can be acquired and we can select the optimal scale factor by searching for the maximum value in these prediction maps. The width wt and height ht of the target in the tth frame are updated with the following formula,

(wt, ht) = β(wt−1, ht−1),  (6)

where wt−1 and ht−1 are the width and height of the target in the previous frame, and β is the selected scale factor.

Model update. Model updating is an essential component in the tracking task [1,5,7,9]. In the course of the movement, the shape, scale, and appearance of the object target will vary. If the model is not updated, it has difficulty fitting the changes in the target, and thus cannot track it very well over the entire tracking process. In the online visual object tracking process, we generate training samples consistently from beginning to end. For each image frame, after the object target location is determined, we can obtain its ground-truth. After that, we can use the search patch as a training patch. The training patch and response map are collected every T frames and sent to the proposed tracker for online model updating.

The integral tracking framework is given in Algorithm 1.

Algorithm 1 Tracking via TFCR
Input: The ground-truth of the target in the first frame, the tracking sequence {Xt} (t = 1, ..., T).
Output: Tracking result in each image frame.
1: Crop the training patch P1 centered at the target T1 in the first image frame I1;
2: Generate a Gaussian label Y1 centered at the target T1;
3: Train the TFCR model with Eq. (3) using the training patch P1 and the Gaussian label Y1;
4: for t = 2 to T (T is the total number of frames in the sequence) do
5:   Crop the training patch Pt centered at the target Tt in the tth image frame It;
6:   Extract the target features from It within the area Pt;
7:   Use the TFCR model to determine the center location of the target;
8:   Use Eq. (6) to determine the size of the target;
9:   if t % 2 == 0 then
10:    Update the TFCR model with Eq. (3).
11:  end if
12: end for

4. Experiments

We evaluated the proposed TFCR tracker by comparing it with some state-of-the-art trackers, including MDNet [42], BACF [5], CREST [7], SRDCF [11], DeepSRDCF [13], SRDCFdecon [33], HDT [43], SINT [44], CFNet-conv5 [45], ACFN [50], SiamFC [60], STResNet_CF [61], DSST [58], KCF [1], CNT [62], ADNet [63] and C-COT [64], on three widely used datasets: OTB-2013, OTB-2015 and TC-128 [65–67].

4.1. Implementation details

In the visual object tracking task, the target's ground-truth is given in the first image frame. We can obtain the training patches with labels from the first image frame; the training patches are 5 times larger than the target in terms of both width and height. The search image patch and the training image patch are of the same size, which is useful in the tracking process. The feature extraction network that we adopted was a VGG-16 [55] network with only the first two max-pooling layers, and we extracted the feature maps from the conv4-3 layer. To improve the tracking speed, a PCA method was used to decrease the feature channels to 64 dimensions. In the training stage, we iteratively used the ADMM optimizer to update the coefficients, and the learning rate was set to 5e-8. We update the network every T = 2 frames with a learning rate of 2e-8. The parameter η in Eq. (3) was set to 0.5. In the scale estimation stage, the scale factors β were set to [0.95, 1.00, 1.05]. Our experiments were performed on a PC with an i7 4.2 GHz CPU, 32 GB RAM and an Nvidia GTX 1080Ti GPU, using the MatConvNet toolbox. The average tracking speed was approximately 2 fps. In order to ensure fairness, the tracking results of the compared trackers were taken from the authors' homepages or from https://github.com/foolwood/benchmark_results/blob/master/README.md.

4.2. Evaluation criterion

To evaluate the tracking performance of our TFCR tracker, the One-Pass Evaluation (OPE), which was proposed in the OTB-2013 [65] benchmark, was used as the evaluation index. The OPE strategy has two parts: precision plots and success plots. The precision plots show the percentage of frames in which the distance between the predicted position and the ground-truth is within different thresholds [65,67,68]. The success plots measure an average overlap, which accounts for both size and position. As with multitudinous tracking algorithms, we used the Pascal VOC Overlap Ratio (VOR) [69] to express it. Given the resulting bounding box rb and the ground-truth bounding box gb, the VOR score (Vs) can be computed as:

Vs = S{rb ∩ gb} / S{rb ∪ gb},  (7)

where ∩ denotes the intersection of the two regions, ∪ denotes the union of the two regions, and S{·} denotes the area of the corresponding region. A frame whose Vs is greater than the threshold is considered a successful frame, and the ratio of successful frames is plotted in the success plots as the threshold ranges from 0 to 1.

4.3. Ablation studies on OTB-2013

In this section, we analyze the proposed TFCR tracker on the OTB-2013 benchmark by showing the effect of our contributions. We first build a baseline tracker without the target-focusing loss function. Then, considering the relative relationship between the target samples and the background samples, we use the target-focusing loss function to improve the baseline tracker (called TFCR). Experimental results on the OTB-2013 [65] dataset verify the improvement.

Fig. 5 demonstrates the OPE results on the OTB-2013 dataset. We can observe that using the target-focusing loss mechanism (TFCR) obviously improves the tracking performance. Specifically, the TFCR tracker achieves 0.671 as a success plot score and 0.871 as a precision plot score. Compared with our baseline tracker (0.636/0.831), the proposed TFCR tracker achieved improvements of about 5.5% and 4.8%, respectively. This is because the target-focusing loss function reduces the influence of the background samples and maximizes the response of the target sample, which effectively improves the tracking performance.
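As a concrete illustration of the VOR score in Eq. (7) that underlies the success plots of Section 4.2, a minimal implementation for axis-aligned boxes could look like the following; the (x, y, w, h) box format and the helper name vor are assumptions for this sketch.

```python
# Minimal sketch of the Pascal VOC overlap ratio (VOR) of Eq. (7) for
# axis-aligned boxes given as (x, y, w, h); the box format is an assumption.
def vor(rb, gb):
    """Intersection area over union area of the result box rb and ground-truth box gb."""
    x1, y1 = max(rb[0], gb[0]), max(rb[1], gb[1])
    x2 = min(rb[0] + rb[2], gb[0] + gb[2])
    y2 = min(rb[1] + rb[3], gb[1] + gb[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = rb[2] * rb[3] + gb[2] * gb[3] - inter
    return inter / union if union > 0 else 0.0

# A frame counts as successful when vor(result, ground_truth) exceeds the
# threshold; sweeping the threshold from 0 to 1 yields the success plot.
print(vor((0, 0, 10, 10), (5, 5, 10, 10)))  # 25 / 175 ≈ 0.143
```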


Fig. 5. Precision and success plots of OPE on the OTB2013 dataset. The numbers in the legend indicate the distance precision rate (DPR) at 20 pixels for precision
plots and the average area-under-the-curve (AUC) scores for success plots.

Fig. 6. The precision plots and success plots of OPE on OTB-2015 over 100 standard benchmark video sequences. For clarity, we only plot the top ten trackers.

Table 1
The success (AUC, %) scores of the baseline tracker and the TFCR tracker on OTB-2013 for different attributes: fast motion (FM), background clutter (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV).
Attributes  FM    BC    MB    DEF   IV    IPR   LR    OCC   OPR   OV    SV
Baseline    60.4  59.4  63.7  62.3  58.7  61.3  52.4  61.6  60.9  47.5  60.8
TFCR        63.2  64.2  62.6  65.2  64.5  62.1  43.3  66.1  66.1  60.5  66.4

The OPE performance of these two trackers in terms of the AUC score for each attribute is shown in Table 1, which further verifies the effectiveness of the target-focusing loss function on the visual object tracking task. It can be clearly seen that under almost all 11 attributes, the proposed tracker (TFCR) achieved significant performance improvement over the baseline tracker. On the contrary, the performance of our TFCR was lower than the baseline for the attributes of motion blur (MB) and low resolution (LR). This was probably because inaccurate tracking of previous image frames interfered with our tracker. In general, these results directly reflect the importance of the target-focusing loss function in the visual tracking process.

4.4. Experiment on OTB-2015

In this section, to validate the effectiveness of our TFCR tracker, we make comparisons with some state-of-the-art trackers, including MDNet [42], BACF [5], CREST [7], SRDCF [11], DeepSRDCF [13], SRDCFdecon [33], HDT [43], SINT [44], CFNet-conv5 [45], ACFN [50], SiamFC [60], CNT [62] and ADNet [63], on the OTB-2015 [67] dataset.

Fig. 6 demonstrates the OPE results of the proposed TFCR tracker and the 13 state-of-the-art trackers on the OTB-2015 dataset. Our tracker (0.876/0.665) is the third-best tracker in the precision plots and the second-best tracker in the success plots. Compared to the best multi-domain CNN-based MDNet tracker, the proposed TFCR tracker showed slightly worse tracking accuracy; this is mainly because our TFCR tracker does not use multi-domain information. Compared with the precision score (0.838) and success score (0.623) of the convolutional residual learning tracker CREST [7], the proposed TFCR tracker obtained improvements of approximately 3.42% and 6.58%. Compared with the ADNet tracker [63] (0.880/0.646), the proposed TFCR tracker showed very nearly the same precision score, and obtained an improvement of approximately 3.25% in success score. These results demonstrate that the target-focusing loss strategy is very suitable for the visual object tracking task.

Fig. 7 shows the tracking accuracy and tracking speed of our TFCR tracker and other state-of-the-art trackers on the OTB-2015 dataset. From this figure, we can see that our tracker achieves favorable tracking accuracy, but the tracking speed still needs to be improved.

The success plots of these different trackers on some attributes are illustrated in Fig. 8. We can see that our TFCR tracker has the best or the second-best tracking performance for almost all attributes. For the attributes of fast motion, motion blur, out of view, low resolution, occlusion and scale variation, the convolutional residual learning tracker CREST [7] and other trackers do not perform well in such scenes, and there is an obvious performance gap compared with our proposed tracker.

Fig. 7. The comparison of tracking speed and tracking accuracy on the OTB-2015 dataset. The horizontal and vertical coordinates correspond to tracking speed and AUC overlap ratio score, respectively. The proposed TFCR tracker achieves a favorable accuracy against the state-of-the-art trackers.

For other attributes, such as deformation, illumination variation and scale variation, the performance of the ADNet [63] tracker is very close to that of our TFCR tracker. For the background clutter attribute, the performance of ADNet [63] is better than that of our TFCR tracker; it is probable that the ADNet [63] tracker benefits from its action-decision network, which was trained with supervised learning and reinforcement learning, whereas our TFCR tracker mainly benefits from the target-focusing strategy, which is affected when the background is cluttered. For the low resolution attribute, the CFNet-conv5 [45] tracker ranks first and is 3.3% higher than our TFCR tracker. This is probably due to the low resolution of the image frames, which makes our target-focusing loss function fail to work very well.

Our proposed tracker (TFCR) showed significantly better tracking performance than the other representative trackers, including BACF [5], CFNet-conv5 [45], ACFN [50] and DeepSRDCF [13], on the visual object tracking sequences. Fig. 9 shows a qualitative comparison of these five trackers on some challenging tracking sequences. The tracking results of the BACF [5] tracker show that in scenes with illumination variation, fast motion, and background clutter, tracking is easily interrupted and the tracking performance is reduced. This may be because it uses HOG features, which cannot model the appearance of the target very well in complex tracking scenes. In comparison, the DeepSRDCF [13] tracker investigates the impact of CNN-based features in a DCFs-based framework. It performs well for occlusion and background clutter (basketball). However, direct fusion limits the potential of the model for illumination variation (matrix and skating1) and fast motion (motorRolling and skiing). The CFNet-conv5 [45] tracker interprets the DCFs-based model as a differentiable layer in a CNN-based framework, which can manage occlusion and background clutter (basketball) effectively.

Fig. 8. The success plots of some selected attributes (fast motion, background clutter, motion blur, deformation, illumination variation, out of view, low resolution, occlusion and scale variation) on the OTB-2015 dataset. For clarity, we only plot the top ten trackers.


Fig. 9. Qualitative comparison of our TFCR tracker and other representative trackers (BACF [5], DeepSRDCF [13], CFNet-conv5 [45] and ACFN [50]) on some visual
object tracking sequences (motorRolling, skiing, matrix, basketball and skating1) with fast motion, scale variation, illumination variation, deformation and other
challenges. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 2
Precision and AUC scores of C-COT, BACF, SRDCF, DeepSRDCF, SRDCFdecon, STResNet_CF, CFNet, HDT, DSST, KCF and our TFCR on the TC-128 [66] dataset. The first, second and third best scores are highlighted in red, blue and green, respectively.
Trackers TFCR C-COT BACF SRDCF DeepSRDCF SRDCFdecon STResNet_CF CFNet HDT DSST KCF
Prec. scores 77.6 78.3 66.0 69.6 74.0 72.9 73.3 60.7 68.6 53.4 54.9
AUC scores 56.4 57.3 49.6 50.9 53.6 53.4 50.5 45.6 48.0 40.5 38.7

Because it uses an attentional correlation filter network, the ACFN [50] tracker achieved good tracking results for illumination variation (matrix) and fast motion (skiing); in other sequences, its tracking performance still had considerable room for improvement. From these tracking results, we can see that when the tracking target is affected by complex scenes with fast motion, background clutter, motion blur, etc., other trackers may quickly lose the target (motorRolling, matrix, and skating1), but our tracker is still able to locate it faultlessly. Unlike the compared trackers, benefiting from the target-focusing loss function, our TFCR tracker can handle fast motion, scale variation, illumination variation, deformation, and other challenges very well.

4.5. Experiment on TC-128

In this section, we use the TC-128 [66] dataset to validate the performance of our TFCR tracker. The comparison with several state-of-the-art trackers (C-COT [64], BACF [5], SRDCF [11], DeepSRDCF [13], SRDCFdecon [33], STResNet_CF [61], CFNet [70], HDT [43], DSST [58] and KCF [1]) is shown in Table 2. Among the 10 compared trackers, C-COT [64] obtains the best precision score (78.3%) and the best success score (57.3%). By comparison, our TFCR tracker acquires the second-best results (very close to C-COT), with a distance precision rate of 77.6% and an overlap success rate of 56.4%. Compared with DeepSRDCF [13], which obtained the third-best precision score (74.0%) and success score (53.6%), our proposed TFCR achieved significant improvements, which clearly shows the benefits of using a target-focusing loss function.

5. Conclusions

In this paper, we propose a target-focusing convolutional regression model for object tracking. In our proposed TFCR tracker, a novel target-focusing loss strategy is proposed to balance the disequilibrium of the positive and negative samples by reducing the effect of the negative samples and maximizing the response of the target sample. Compared with the DCFs-based trackers, our proposed TFCR tracker can incorporate essentially limitless authentic negative samples from the video images and acquires a cleaner response map through the correlation filters. Extensive experimental results showed that our TFCR tracker was more effective and robust than DCFs-based trackers and even some CNN-based trackers. Moreover, the target-focusing strategy showed attractive potential for visual tracking tasks, such as further improving tracking accuracy with more effective deep tracking frameworks.


CRediT authorship contribution statement

Di Yuan: Conceptualization, Methodology, Writing - original draft, Writing - review & editing, Formal analysis, Software, Investigation. Nana Fan: Software, Validation, Visualization, Investigation, Resources, Data curation. Zhenyu He: Writing - review & editing, Supervision, Project administration.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61672183), by the Natural Science Foundation of Guangdong Province (Grant No. 2015A030313544), by the Shenzhen Research Council (Grant Nos. JCYJ20170413104556946, JCYJ20170815113552036), and by the projects "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications" (PCL2018KP001) and "The Verification Platform of Multi-tier Coverage Communication Network for Oceans" (PCL2018KP002). Di Yuan is supported by a scholarship from the China Scholarship Council (CSC).

References

[1] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2014) 583–596.
[2] M. Danelljan, G. Bhat, F.S. Khan, M. Felsberg, Eco: Efficient convolution operators for tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6638–6646.
[3] N. Fan, J. Li, Z. He, C. Zhang, X. Li, Region-filtering correlation tracking, Knowl.-Based Syst. 172 (2019) 95–103.
[4] F. Li, C. Tian, W. Zuo, L. Zhang, M.-H. Yang, Learning spatial-temporal regularized correlation filters for visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4904–4913.
[5] H.K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware correlation filters for visual tracking, in: IEEE International Conference on Computer Vision, 2017, pp. 1135–1143.
[6] X. Li, C. Ma, B. Wu, Z. He, M.-H. Yang, Target-aware deep tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1369–1378.
[7] Y. Song, C. Ma, L. Gong, J. Zhang, R.W. Lau, M.-H. Yang, Crest: Convolutional residual learning for visual tracking, in: IEEE International Conference on Computer Vision, 2017, pp. 2574–2583.
[8] C. Sun, H. Lu, M.-H. Yang, Learning spatial-aware regressions for visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8962–8970.
[9] K. Chen, W. Tao, Convolutional regression for visual tracking, IEEE Trans. Image Process. 27 (7) (2018) 3611–3620.
[10] D. Held, S. Thrun, S. Savarese, Learning to track at 100 fps with deep regression networks, in: European Conference on Computer Vision, 2016, pp. 749–765.
[11] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: IEEE International Conference on Computer Vision, 2015, pp. 4310–4318.
[12] J.F. Henriques, C. Rui, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: European Conference on Computer Vision, 2012, pp. 702–715.
[13] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Convolutional features for correlation filter based visual tracking, in: IEEE International Conference on Computer Vision Workshops, 2015, pp. 621–629.
[14] X. Li, Q. Liu, Z. He, H. Wang, C. Zhang, W.S. Chen, A multi-view model for visual tracking via correlation filters, Knowl.-Based Syst. 113 (2016) 88–99.
[15] D. Yuan, X. Lu, D. Li, Y. Liang, X. Zhang, Particle filter re-detection for visual tracking via correlation filters, Multimedia Tools Appl. 78 (11) (2019) 14277–14301.
[16] S. Yi, Z. He, Y.M. Cheung, W.S. Chen, Unified sparse subspace learning via self-contained regression, IEEE Trans. Circuits Syst. Video Technol. 28 (10) (2018) 2537–2550.
[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: IEEE International Conference on Computer Vision, 2017, pp. 2999–3007.
[18] Y. Zhao, X. You, S. Yu, C. Xu, W. Yuan, X.Y. Jing, T. Zhang, D. Tao, Multi-view manifold learning with locality alignment, Pattern Recognit. 78 (2018) 154–166.
[19] A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 761–769.
[20] S. Yi, Y. Liang, Z. He, Y. Li, Y.-m. Cheung, Dual pursuit for subspace learning, IEEE Trans. Multimedia 21 (6) (2018) 1399–1411.
[21] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, Cascade object detection with deformable part models, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2241–2248.
[22] Z. He, X. Li, D. Tao, X. You, Y.Y. Tang, Connected component model for multi-object tracking, IEEE Trans. Image Process. 25 (8) (2016) 3698–3711.
[23] P. Zhang, S. Yu, J. Xu, X. You, X. Jiang, X.-Y. Jing, D. Tao, Robust visual tracking using multi-frame multi-feature joint modeling, IEEE Trans. Circuits Syst. Video Technol. 29 (12) (2019) 3673–3686.
[24] M. Mueller, N. Smith, B. Ghanem, Context-aware correlation filter tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1387–1395.
[25] D.S. Bolme, J.R. Beveridge, B.A. Draper, Y.M. Lui, Visual object tracking using adaptive correlation filters, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544–2550.
[26] M. Danelljan, F.S. Khan, M. Felsberg, J.V.D. Weijer, Adaptive color attributes for real-time visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1090–1097.
[27] T. Liu, G. Wang, Q. Yang, Real-time part-based visual tracking via adaptive correlation filters, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4902–4912.
[28] Y. Li, J. Zhu, S.C.H. Hoi, Reliable patch trackers: Robust visual tracking by exploiting reliable patches, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 353–361.
[29] S. Liu, T. Zhang, X. Cao, C. Xu, Structural correlation filter for robust visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4312–4320.
[30] T. Liu, X. Cao, J. Jiang, Visual object tracking with partition loss schemes, IEEE Trans. Intell. Transp. Syst. 18 (3) (2017) 633–642.
[31] D. Yuan, X. Zhang, J. Liu, D. Li, A multiple feature fused model for visual object tracking via correlation filters, Multimedia Tools Appl. 78 (19) (2019) 27271–27290.
[32] H.K. Galoogahi, T. Sim, S. Lucey, Correlation filters with limited boundaries, in: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4630–4638.
[33] M. Danelljan, G. Hager, F.S. Khan, M. Felsberg, Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1430–1438.
[34] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.
[35] Y. Liu, X. Chen, J. Cheng, H. Peng, Z. Wang, Infrared and visible image fusion with convolutional neural networks, Int. J. Wavelets Multiresolut. Inf. Process. 16 (3) (2018) 1850018.
[36] C. Tian, Y. Xu, W. Zuo, Image denoising using deep cnn with batch renormalization, Neural Netw. 121 (2020) 461–473.
[37] B. Liu, Q. Liu, Z. Zhu, T. Zhang, Y. Yang, Msst-resnet: Deep multi-scale spatiotemporal features for robust visual object tracking, Knowl.-Based Syst. 164 (2019) 235–252.
[38] K. Wang, J. An, X. Zhao, J. Zou, Accurate landmarking from 3d facial scans by cnn and cascade regression, Int. J. Wavelets Multiresolut. Inf. Process. 16 (6) (2018) 1840007.
[39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[40] H. Li, Y. Li, F. Porikli, Deeptrack: Learning discriminative feature representations by convolutional neural networks for visual tracking, in: British Machine Vision Conference, 2014, pp. 1–12.
[41] B. Liu, Q. Liu, T. Zhang, Y. Yang, Msstresnet-tld: A robust tracking method based on tracking-learning-detection framework by using multi-scale spatio-temporal residual network feature model, Neurocomputing 362 (2019) 175–194.
[42] H. Nam, B. Han, Learning multi-domain convolutional neural networks for visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.
[43] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, M.H. Yang, Hedged deep tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4303–4311.


[44] R. Tao, E. Gavves, A.W.M. Smeulders, Siamese instance search for tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1420–1429.
[45] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P.H.S. Torr, End-to-end representation learning for correlation filter based tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2805–2813.
[46] J. Fan, Y. Wu, S. Dai, Discriminative spatial attention for robust tracking, in: European Conference on Computer Vision, 2010, pp. 480–493.
[47] Z. Cui, S. Xiao, J. Feng, S. Yan, Recurrently target-attending tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1449–1458.
[48] J. Choi, H.J. Chang, J. Jeong, Y. Demiris, Y.C. Jin, Visual tracking using attention-modulated disintegration and integration, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4321–4330.
[49] H. Fan, H. Ling, Sanet: Structure-aware network for visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 2217–2224.
[50] J. Choi, H.J. Chang, S. Yun, T. Fischer, Y. Demiris, Y.C. Jin, Attentional correlation filter network for adaptive visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4828–4837.
[51] Z. Zhu, W. Wu, W. Zou, J. Yan, End-to-end flow correlation tracking with spatial-temporal attention, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 548–557.
[52] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, S. Maybank, Learning attentions: Residual attentional siamese network for high performance online visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4854–4863.
[53] Q. Liu, X. Lu, Z. He, C. Zhang, W.S. Chen, Deep convolutional neural networks for thermal infrared object tracking, Knowl.-Based Syst. (2017) 189–198.
[54] Z. He, S. Yi, Y.M. Cheung, X. You, Y.Y. Tang, Robust object tracking via key patch sparse representation, IEEE Trans. Cybern. 47 (2) (2017) 354–364.
[55] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015, pp. 1–14.
[56] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422.
[57] W. Ou, D. Yuan, Q. Liu, Y. Cao, Object tracking based on online representative sample selection via non-negative least square, Multimedia Tools Appl. 77 (9) (2018) 10569–10587.
[58] M. Danelljan, G. Häger, F.S. Khan, M. Felsberg, Discriminative scale space tracking, IEEE Trans. Pattern Anal. Mach. Intell. 39 (8) (2016) 1561–1575.
[59] L. Chen, X. Hu, T. Xu, H. Kuang, Q. Li, Turn signal detection during nighttime by cnn detector and perceptual hashing tracking, IEEE Trans. Intell. Transp. Syst. 18 (12) (2017) 3303–3314.
[60] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H.S. Torr, Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision, 2016, pp. 850–865.
[61] Z. Zhu, B. Liu, Y. Rao, Q. Liu, R. Zhang, Stresnet_cf tracker: The deep spatiotemporal features learning for correlation filter based robust visual object tracking, IEEE Access 7 (2019) 30142–30156.
[62] K. Zhang, Q. Liu, Y. Wu, M.H. Yang, Robust visual tracking via convolutional networks without training, IEEE Trans. Image Process. 25 (4) (2016) 1779–1792.
[63] S. Yun, J. Choi, Y. Yoo, K. Yun, Y.C. Jin, Action-decision networks for visual tracking with deep reinforcement learning, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1349–1358.
[64] M. Danelljan, A. Robinson, F.S. Khan, M. Felsberg, Beyond correlation filters: Learning continuous convolution operators for visual tracking, in: European Conference on Computer Vision, 2016, pp. 472–488.
[65] Y. Wu, J. Lim, M.H. Yang, Online object tracking: A benchmark, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2411–2418.
[66] P. Liang, E. Blasch, H. Ling, Encoding color information for visual tracking: Algorithms and benchmark, IEEE Trans. Image Process. 24 (12) (2015) 5630–5644.
[67] Y. Wu, J. Lim, M.H. Yang, Object tracking benchmark, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1834–1848.
[68] Q. Liu, Z. He, X. Li, Y. Zheng, Ptb-tir: A thermal infrared pedestrian tracking benchmark, IEEE Trans. Multimed. (2019) http://dx.doi.org/10.1109/TMM.2019.2932615.
[69] M. Everingham, J. Winn, The pascal visual object classes challenge 2010 (voc2010) development kit contents, in: International Conference on Machine Learning, 2011, pp. 117–176.
[70] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P.H. Torr, End-to-end representation learning for correlation filter based tracking, in: IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2805–2813.
