Robust Visual Tracking Revisited: From Correlation Filter to Template Matching
Fig. 3. The diagram of the proposed TM3 tracker contains four main steps, namely candidate generation, fast candidate selection, similarity computation, and template updating. Numerous potential candidate regions R_t and E_t are produced in the candidate generation step, and then processed by the fast candidate selection step to form the refined R̃_t and Ẽ_t. In the similarity computation step, MBS is used to evaluate the similarities between these refined potential regions and the templates Tmpl_r and Tmpl_e, respectively. This step results in three tracking cues, namely I_r^* = argmax MBS(R̃_t, Tmpl_r), I_e^* = argmax MBS(Ẽ_t, Tmpl_e) with the green box, and I_e^dist = argmin dist(Ẽ_t, I_r^*) with the yellow box (the definition of the distance metric dist can be found in Section III-A). The final tracking result I^* is jointly decided by the above three cues. Finally, the two types of templates Tmpl_r and Tmpl_e are updated via the memory filtering strategy. The above process iterates until all the frames of a video sequence have been processed.
these candidate regions when they are extremely similar. Moreover, such discriminative ability inherited by MBS is also theoretically demonstrated in our paper.

For template updating, two types of templates Tmpl_r and Tmpl_e are designed in a comprehensive updating scheme. The template Tmpl_r is established by carefully selecting both "representative" and "reliable" tracking results over a long period. Herein, "representative" denotes that a template should well represent the target under the various appearances it takes during the previous video frames. The terminology "reliable" implies that the selected template should be accurate and stable. To this end, a memory filtering strategy is proposed in our TM3 tracker to carefully pick up the representative and reliable tracking results from past frames, so that an accurate template Tmpl_r can be properly constructed. By doing so, the memory filtering strategy is able to "recall" some stable results and "forget" some results produced under abnormal situations. Different from Tmpl_r, the template Tmpl_e is frequently updated from the latest frames to timely adapt to the target's appearance changes within a short period. Hence, compared to the single template used in CFTs, the combination of Tmpl_e and Tmpl_r is beneficial for our tracker to capture the target appearance variations over different periods.

Extensive evaluations on the Object Tracking Benchmark (OTB) [8] (including OTB-50 and OTB-100) and the Princeton Tracking Benchmark (PTB) [23] suggest that in most cases our method significantly improves the performance of template matching based trackers and also performs favorably against the state-of-the-art trackers.

In the Flow-r process, at the t-th frame, the N_r target regions R_t = {I(c_r^i, s_r^i)}_{i=1}^{N_r} are randomly sampled from a Gaussian distribution, where I(c_r^i, s_r^i) is the i-th image region with center coordinate c_r^i = [c_r^{ix}, c_r^{iy}] and scale factor s_r^i. To efficiently reduce the computational complexity, a fast k-NN selection algorithm [24] is used to form the refined candidate regions R̃_t = {I(c_r^i, s_r^i)}_{i=1}^{Ñ_r}, composed of the Ñ_r nearest neighbors of the tracking result at the (t − 1)-th frame among R_t. After that, MBS evaluates the similarities between these target regions and the template Tmpl_r, and then outputs I(c_r^*, s_r^*) with the highest similarity score.

In the Flow-e process, at the t-th frame, the EdgeBox approach [25] is utilized to generate a set E_t = {I(c_e^j, s_e^j)}_{j=1}^{N_e} with N_e target proposals, where I(c_e^j, s_e^j) (I_e^j for simplicity) is the j-th image region with center coordinate c_e^j = [c_e^{jx}, c_e^{jy}] and scale factor s_e^j. Subsequently, most non-target proposals are filtered out by exploiting the geometry constraint. After that, only a small number Ñ_e of potential proposals Ẽ_t = {I(c_e^j, s_e^j)}_{j=1}^{Ñ_e} are evaluated by MBS to indicate how similar they are to the template Tmpl_e. This process generates two tracking cues, I_e^* and I_e^dist, which will be further described in Section III-A. Based on the three tracking cues generated by the above two flows, the final tracking result I^* at the t-th frame is obtained by fusing I_r^*, I_e^*, and I_e^dist, with details introduced in Section III-B.

Note that in the entire tracking process, illustrated in Fig. 3, similarity computation and template updating are critical for our tracker to achieve satisfactory performance, so we will detail them in Sections II-A and II-B, respectively.
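To make the two-flow candidate stage concrete, the following is a minimal Python sketch of Gaussian candidate sampling, k-NN refinement, and cue fusion. It is our illustration, not the authors' released code: the function names, the default deviations, and the max-score fusion rule are placeholder assumptions (the paper's actual fusion rule is the one described in Section III-B).

```python
import numpy as np

def flow_r_candidates(prev_cx, prev_cy, prev_scale, n_r=700,
                      sigmas=(8.0, 8.0, 0.15), rng=np.random.default_rng()):
    """Flow-r: draw candidate states (cx, cy, s) around the previous target
    state from a Gaussian with diagonal covariance."""
    mean = np.array([prev_cx, prev_cy, prev_scale])
    return rng.normal(mean, np.array(sigmas), size=(n_r, 3))

def knn_refine(candidates, feats, prev_feat, n_keep=50):
    """Fast candidate selection: keep the n_keep candidates whose features
    are nearest to the previous tracking result (the k-NN step of [24]).
    `feats` holds one feature vector per candidate; `prev_feat` is the
    feature of the (t-1)-th result."""
    d = np.linalg.norm(feats - prev_feat, axis=1)
    return candidates[np.argsort(d)[:n_keep]]

def fuse_cues(I_r_star, I_e_star, I_e_dist, score_r, score_e):
    """Placeholder fusion of the three cues; a simple max-score rule stands
    in for the paper's actual scheme (Section III-B), which also consults
    the distance cue I_e^dist."""
    return I_r_star if score_r >= score_e else I_e_star
```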
1) The Definition of MBS: In our TM3 tracker, every image region is split into a set of non-overlapped 3×3 image patches. Without loss of generality, we denote a candidate region as a set P = {p_i}_{i=1}^{N}, where p_i is a small patch and N is the number of such small patches. Likewise, a template (Tmpl_r or Tmpl_e) is represented by the set Q = {q_j}_{j=1}^{M} of size M. The objective of MBS is to reasonably evaluate the similarity between P and Q, so that a faithful and discriminative similarity score can be assigned to the candidate region P.

To design MBS, we begin with the definition of a similarity metric MBP between two patches {p_i ∈ P, q_j ∈ Q}. Assuming that q_j is the r-th nearest neighbor of p_i in the set Q (denoted as q_j = NN_r(p_i, Q)), and meanwhile p_i is the s-th nearest neighbor of q_j in P (denoted as p_i = NN_s(q_j, P)), the similarity MBP of the two patches p_i and q_j is:

$$\mathrm{MBP}(\mathbf{p}_i, \mathbf{q}_j) = e^{-\frac{rs}{\sigma_1}}, \quad \text{if } \mathbf{q}_j = \mathrm{NN}_r(\mathbf{p}_i, Q) \wedge \mathbf{p}_i = \mathrm{NN}_s(\mathbf{q}_j, P), \tag{1}$$

where σ_1 is a tuning parameter; in our experiments, we empirically set it to 0.5. Such a similarity metric evaluates the closeness of two patches by the scheme of multiple reciprocal nearest neighbors. Therefore, MBS between P and Q is defined as (in the experimental setting, the set sizes of P and Q are set to the same value):

$$\mathrm{MBS}(P, Q) = \frac{1}{\min\{M, N\}} \sum_{i=1}^{N} \sum_{j=1}^{M} \mathrm{MBP}(\mathbf{p}_i, \mathbf{q}_j). \tag{2}$$

One can see that MBS is the statistical average of MBP. For comparison, the similarity metric BBP used in BBS [22] is defined as:

$$\mathrm{BBP}(\mathbf{p}_i, \mathbf{q}_j) = \begin{cases} 1 & \text{if } \mathbf{q}_j = \mathrm{NN}_1(\mathbf{p}_i, Q) \wedge \mathbf{p}_i = \mathrm{NN}_1(\mathbf{q}_j, P), \\ 0 & \text{otherwise.} \end{cases}$$

Herein, the metric BBP can be viewed as the special case of the proposed similarity MBP in Eq. (1) in which r and s are set to 1. As aforementioned, BBS shows less discriminative ability than MBS for object tracking, and next we theoretically explain the reason. Note that the factor 1/min{M, N} in Eq. (2) has no influence on the theoretical result, so we investigate the relationship between BBS and MBS without this factor to verify the effectiveness of the multiple nearest neighbor scheme. To this end, we introduce the first-order statistic (mean) and the second-order statistic (variance) of the two metrics to analyse their respective discriminative (or "scatter") ability. We begin with the following statistical definition:

Definition 1: Suppose two image patches p_i and q_j are randomly sampled from two given distributions Pr{P} and Pr{Q} corresponding to the sets P and Q, respectively (this general definition does not rely on a specific form of Pr{P} and Pr{Q}; the details of E_BBS(P, Q) can be found in [22, eq. (4)]), and let E_MBS(P, Q) be the expectation of the similarity score between a pair of patches {p_i, q_j} computed by MBP over all possible pairwise patches in P and Q; then we have:

$$E_{\mathrm{MBS}}(P, Q) = \int_P \int_Q \mathrm{MBP}(\mathbf{p}_i, \mathbf{q}_j)\, \Pr\{P\} \Pr\{Q\}\, dP\, dQ, \qquad E_{\mathrm{BBS}}(P, Q) = \int_P \int_Q \mathrm{BBP}(\mathbf{p}_i, \mathbf{q}_j)\, \Pr\{P\} \Pr\{Q\}\, dP\, dQ. \tag{3}$$

By this definition, the variances V_MBS(P, Q) and V_BBS(P, Q) can be easily computed. Formally, we have the following three lemmas. We begin with the simplification of E_MBS(P, Q) in Lemma 1, next compute E_{MBS²}(P, Q) and E²_MBS(P, Q) to obtain V_MBS(P, Q) in Lemma 2, and lastly seek the relationship between E_{MBS²}(P, Q) and E_MBS(P, Q) in Lemma 3. The proofs of these three lemmas are given in Appendices A, B, and C, respectively.

Lemma 1: Let F_P(x), F_Q(x) be the cumulative distribution functions (CDFs) of Pr{P} and Pr{Q}, respectively, and assume that each patch is independent of the others; then the multivariate integral in Eq. (3) can be represented by Eq. (5) below, where p^+ = p + d(p, q), p^− = p − d(p, q), q^+ = q + d(p, q), and q^− = q − d(p, q).

Lemma 2: Given E_MBS(P, Q) obtained in Lemma 1, the variance V_MBS(P, Q) can be computed by Eq. (6) below.

Lemma 3: The relationship between E_{MBS²}(P, Q) and E_MBS(P, Q) satisfies:

$$E_{\mathrm{MBS}^2}(P, Q) > E_{\mathrm{MBS}}(P, Q). \tag{4}$$

By introducing the above three auxiliary lemmas, we formally present Theorem 1 as follows.

Theorem 1: The relationship between V_MBS(P, Q) and V_BBS(P, Q) satisfies:

$$V_{\mathrm{MBS}}(P, Q) > V_{\mathrm{BBS}}(P, Q), \quad \text{if } E_{\mathrm{MBS}}(P, Q) = E_{\mathrm{BBS}}(P, Q). \tag{7}$$

Proof: We first obtain V_BBS(P, Q) and then prove that, under the condition E_MBS(P, Q) = E_BBS(P, Q), the variance of MBS(P, Q) is larger than that of BBS(P, Q). Since BBP(p_i, q_j) = BBP²(p_i, q_j), it follows from Eq. (3) that E_{BBS²}(P, Q) = E_BBS(P, Q) by Definition 1 (hereafter we write E_BBS for E_BBS(P, Q), dropping "(P, Q)" for simplicity). Based on this, under the condition E_MBS = E_BBS, we have:

$$\begin{aligned} V_{\mathrm{MBS}} - V_{\mathrm{BBS}} &= E_{\mathrm{MBS}^2} - [E_{\mathrm{MBS}}]^2 - E_{\mathrm{BBS}^2} + [E_{\mathrm{BBS}}]^2 \\ &= E_{\mathrm{MBS}^2} - E_{\mathrm{BBS}^2} \quad (\text{using } E_{\mathrm{MBS}} = E_{\mathrm{BBS}}) \\ &= E_{\mathrm{MBS}^2} - E_{\mathrm{BBS}} \quad (\text{using } E_{\mathrm{BBS}^2} = E_{\mathrm{BBS}}) \\ &= E_{\mathrm{MBS}^2} - E_{\mathrm{MBS}} > 0, \end{aligned} \tag{8}$$

where the last inequality holds by Lemma 3. Finally, we obtain V_MBS(P, Q) > V_BBS(P, Q) as claimed, thereby completing the proof.

Theorem 1 theoretically demonstrates that under the condition E_MBS(P, Q) = E_BBS(P, Q), the variance of MBS(P, Q) is larger than that of BBS(P, Q), which indicates that MBS is able to produce more disperse similarity scores over P and Q than BBS when the two metrics have the same expectation.
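As a concrete reference for Eqs. (1)–(2), the following Python sketch computes MBS from mutual nearest-neighbor ranks, and BBS [22] as its r = s = 1 special case. This is our illustration rather than the authors' released implementation; the rank truncation `c` mirrors the c ≤ 4 acceleration discussed later in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mbs(P, Q, sigma1=0.5, c=4):
    """Sketch of the MBS score of Eq. (2). P: (N, d) candidate patches,
    Q: (M, d) template patches. For a pair (p_i, q_j), r is the rank of q_j
    among p_i's neighbors in Q, and s is the rank of p_i among q_j's
    neighbors in P; MBP(p_i, q_j) = exp(-r*s / sigma1) as in Eq. (1).
    Pairs with rank beyond c are dropped, as their contribution decays
    to nearly zero."""
    D = cdist(P, Q)                              # pairwise patch distances
    r = np.argsort(np.argsort(D, axis=1), axis=1) + 1   # rank of q_j for p_i
    s = np.argsort(np.argsort(D, axis=0), axis=0) + 1   # rank of p_i for q_j
    mbp = np.exp(-(r * s) / sigma1)
    mbp[(r > c) | (s > c)] = 0.0                 # truncation used for speed
    return mbp.sum() / min(len(P), len(Q))

def bbs(P, Q):
    """Best-Buddies Similarity [22]: counts mutual (rank-1) nearest
    neighbors, i.e., the r = s = 1 special case of MBP."""
    D = cdist(P, Q)
    nn_q = D.argmin(axis=1)                      # p_i's best buddy in Q
    nn_p = D.argmin(axis=0)                      # q_j's best buddy in P
    mutual = sum(1 for i, j in enumerate(nn_q) if nn_p[j] == i)
    return mutual / min(len(P), len(Q))
```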
$$E_{\mathrm{MBS}}(P,Q) = 1 - \frac{MN}{\sigma_1} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq + \frac{M^2 N^2}{2\sigma_1^2} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]^2 \left[F_Q(p^+) - F_Q(p^-)\right]^2 f_P(p) f_Q(q)\, dp\, dq. \tag{5}$$

$$V_{\mathrm{MBS}}(P,Q) = \frac{M^2 N^2}{\sigma_1^2} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]^2 \left[F_Q(p^+) - F_Q(p^-)\right]^2 f_P(p) f_Q(q)\, dp\, dq - \frac{M^2 N^2}{\sigma_1^2} \left( \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq \right)^2. \tag{6}$$
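To get a feel for the dispersion claim of Theorem 1, one can run a small Monte-Carlo experiment like the sketch below. This is our construction with arbitrary Gaussian patch distributions: the crude mean-matching rescaling only loosely emulates the condition E_MBS = E_BBS, so the experiment probes the spirit of the theorem rather than reproducing its exact setting.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

def pair_scores(n=16, m=16, d=3, trials=2000, sigma1=0.5):
    """Draw random patch sets P ~ N(0, I) and Q ~ N(0.5, I) and record the
    unnormalized MBS and BBS scores for each draw."""
    mbs_s, bbs_s = [], []
    for _ in range(trials):
        P = rng.normal(0.0, 1.0, (n, d))
        Q = rng.normal(0.5, 1.0, (m, d))
        D = cdist(P, Q)
        r = np.argsort(np.argsort(D, axis=1), axis=1) + 1
        s = np.argsort(np.argsort(D, axis=0), axis=0) + 1
        mbs_s.append(np.exp(-(r * s) / sigma1).sum())     # MBP summed
        bbs_s.append(float(((r == 1) & (s == 1)).sum()))  # mutual 1-NN count
    return np.array(mbs_s), np.array(bbs_s)

mbs_s, bbs_s = pair_scores()
# Rescale the BBS scores so both populations share the same mean, crudely
# emulating E_MBS = E_BBS, then compare the dispersions of the two metrics.
bbs_s = bbs_s * (mbs_s.mean() / bbs_s.mean())
print("var(MBS) =", mbs_s.var(), " var(BBS) =", bbs_s.var())
```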
Algorithm 2: The Proposed TM3 Tracking Algorithm.

value are retained, so that they can be used in the following step of the Flow-e process. Therein, two tracking cues are picked up for the fusion step: I_e^dist with the smallest dist, and I_e^* with the highest similarity score.
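A minimal sketch of this cue selection follows (ours; `proposals`, `template_scores`, and `dist_to_Ir` are assumed inputs holding the refined Flow-e proposals, their MBS scores against Tmpl_e, and their distances to I_r^* under the metric of Section III-A).

```python
import numpy as np

def select_flow_e_cues(proposals, template_scores, dist_to_Ir):
    """Pick the two Flow-e cues: I_e* (highest MBS score w.r.t. Tmpl_e) and
    I_e^dist (proposal closest to the Flow-r output I_r*)."""
    i_star = int(np.argmax(template_scores))   # I_e*
    i_dist = int(np.argmin(dist_to_Ir))        # I_e^dist
    return proposals[i_star], proposals[i_dist]
```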
Fig. 7. Success and precision plots of our color-based tracker TM3-color and various compared trackers. (a) shows the results on OTB-2013 with 36 colored sequences, and (b) presents the results on OTB-2015 with 77 colored sequences. For clarity, we only show the curves of the top 10 trackers in (a) and (b).
neighbors of each image patch in P based on the similarity matrix. Due to the exponential decay operator in Eq. (1), there is no sense in considering large r and s. Hence, we consider only the c ≤ 4 nearest neighbors of an image patch to accelerate the computation in our experiments. As described in [22], such an operation can be further reduced to O(Mc) on average. Finally, the overall MBS complexity is O(M²cd).

IV. EXPERIMENTS

In this section, we compare the proposed TM3 tracker with other recent algorithms on two benchmarks, OTB [8] and PTB [23]. Moreover, the results of an ablation study and of a parametric sensitivity analysis are also provided.

A. Experimental Setup

In our experiments, the proposed TM3 tracker is tested with both a color feature (denoted as "TM3-color") and a deep feature (denoted as "TM3-deep"). Our TM3-color tracker is implemented in MATLAB on a PC with an Intel i5-6500 CPU (3.20 GHz) and 8 GB memory, and runs at about 5 fps (frames per second). The proposed TM3-deep tracker is based on the MatConvNet toolbox [32] with an Intel Xeon E5-2620 CPU @ 2.10 GHz and an NVIDIA GTX1080 GPU, and runs at almost 4 fps.

Parameter Settings: In our TM3-color tracker, every image region is normalized to 36 × 36 pixels, and then split into a set of non-overlapped 3 × 3 image patches. In this case, M = N = 36²/3² = 144. In the TM3-deep tracker, each image region is represented by a 4096-dimensional feature vector, namely M = N = 4096. In the Flow-r process, to generate target regions in the t-th frame, we draw N_r = 700 samples in the translation and scale dimensions, x_i^t = (c_r^i, s_r^i), i = 1, 2, ..., N_r, from a Gaussian distribution whose mean is the previous target state x*_{t−1} and whose covariance is a diagonal matrix diag(σ_x, σ_y, σ_s), of which the diagonal elements are the standard deviations of the sampling parameter vector [σ_x, σ_y, σ_s] representing the target state. In our experiments, the sampling parameter vector is set to [σ_x, σ_y, 0.15], where σ_x = min{w/4, 15} and σ_y = min{h/4, 15} are fixed for all test sequences, and w, h have been defined in Eq. (14). The numbers of potential proposals R̃_t and Ẽ_t are set to Ñ_r = Ñ_e = 50. In the Flow-e process, we use the same parameters in EdgeBox as described in [25]. The trade-off parameters δ and β in Eq. (9) are fixed to 5 and 10, respectively; the number of atoms in the target dictionary D is set to N_D = 12.

B. Results on OTB

1) Dataset Description and Evaluation Protocols: OTB includes two versions, i.e., OTB-2013 and OTB-2015. OTB-2013 contains 51 sequences with precise bounding-box annotations, and 36 of them are colored sequences. In OTB-2015, there are 77 colored video sequences among all the 100 sequences. Specifically, considering that the proposed TM3-color tracker operates on the CIE Lab and RGB color spaces, it is only compared with the other baseline trackers on the colored sequences to ensure a fair comparison. In contrast, our TM3-deep tracker can handle both colored and gray-level sequences, so it is evaluated on all the sequences in the above two benchmarks.

The quantitative analysis on OTB is demonstrated with two evaluation plots under the one-pass evaluation (OPE) protocol: the success plot and the precision plot. In the success plot, the target in a frame is declared to be successfully tracked if its overlap rate exceeds a certain threshold, and the plot shows the percentage of successful frames as the overlap threshold varies from 0 to 1. In the precision plot, the tracking result in a frame is considered successful if the center location error (CLE) falls below a pre-defined threshold, and the plot shows the ratio of successful frames as the CLE threshold ranges from 0 to 50 pixels. Based on these two evaluation plots, two ranking metrics are used to evaluate all compared trackers: one is the Area Under the Curve (AUC) metric for the success plot, and the other is the precision score at a threshold of 20 pixels for the precision plot. For details about the OTB protocol, refer to [8].

Apart from the 29 and 37 trackers included in OTB-2013 and OTB-2015, respectively, we also compare our tracker with several state-of-the-art methods, including ACFN [1], RaF [2], TrSSI-TDT [33], DLSSVM [14], Staple [7], DST [34], DSST [35], MEEM [36], TGPR [37], KCF [6], IMT [38], LNLT [39], DAT [40], and CNT [41] (the implementations of several algorithms, i.e., TrSSI-TDT and LNLT, are not public, and hence we report their results on OTB as provided by the authors for fair comparison).
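As a reference for the OPE metrics just described, the following sketch (ours, not an official OTB toolkit routine) computes the success-plot AUC and the 20-pixel precision score from per-frame overlaps and center location errors.

```python
import numpy as np

def success_auc(overlaps, thresholds=np.linspace(0.0, 1.0, 101)):
    """Success plot: fraction of frames whose overlap (IoU) exceeds each
    threshold; the AUC ranking metric is the mean of this curve."""
    overlaps = np.asarray(overlaps)
    curve = [(overlaps > t).mean() for t in thresholds]
    return float(np.mean(curve)), curve

def precision_at(cle, threshold=20.0):
    """Precision plot value at a given center-location-error threshold;
    20 pixels is the OTB ranking metric."""
    cle = np.asarray(cle)
    return float((cle <= threshold).mean())
```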
Fig. 8. Success and precision plots of our deep feature based tracker TM3-deep and various compared trackers. (a) shows the results on OTB-2013 with 51 sequences, and (b) presents the results on OTB-2015 containing 100 sequences. For clarity, we only show the curves of the top 10 trackers.
Fig. 9. Attribute-based analysis of our TM3-deep tracker on four main attributes of OTB-2015 (100 sequences). For clarity, we only show the top 10 trackers in the legends. The title of each plot indicates the number of videos labelled with the respective attribute.
Specifically, two state-of-the-art template matching based trackers, ELK [18] and BBT [42], are also incorporated for comparison.

2) Overall Performance: Fig. 7 shows the performance of all compared trackers on the OTB-2013 and OTB-2015 datasets. On OTB-2013, our TM3 tracker with the color feature achieves a 57.1% average overlap rate, which is higher than the 56.7% produced by Staple, a very competitive correlation filter based algorithm. On OTB-2015, it can be observed that the performance of all trackers decreases. The proposed TM3-color tracker and Staple still provide the best results, with AUC scores of 54.5% and 55.4%, respectively. Specifically, on these two benchmarks, the competitive template matching based tracker BBT obtains 50.0% and 45.2% average overlap rates, respectively. Comparatively, our TM3 tracker significantly improves upon BBT by noticeable margins of 7.1% and 9.3% on OTB-2013 and OTB-2015, accordingly.

We also test our tracker with the deep feature, and the corresponding performance is shown in Fig. 8. Not surprisingly, the TM3-deep tracker boosts the performance of TM3-color. It achieves 61.2% and 58.0% success rates on the above two benchmarks, both of which rank first among all compared trackers. On the precision plots, the proposed TM3-deep tracker yields precision rates of 82.3% and 79.0% on the two benchmarks, respectively. The overall plots on the two benchmarks demonstrate that our TM3 tracker (with color and deep features) comes in first or second place among the trackers with comparable performance evaluated by the success rate. It is able to outperform CNN based trackers, correlation filter based algorithms, template matching based approaches, and other representative methods. The favorable performance of our TM3 tracker benefits from the fact that the discriminative similarity metric, the memory filtering strategy, and the rich features help our TM3 tracker to accurately separate the target from its cluttered background and to effectively capture the target appearance variations.

3) Attribute Based Performance Analysis: To analyze the strengths and weaknesses of the proposed algorithm, we provide an attribute based performance analysis to illustrate the superiority of our tracker on four key attributes in Fig. 9. All video sequences in OTB have been manually annotated with several challenging attributes, including Occlusion (OCC), Illumination Variation (IV), Scale Variation (SV), Deformation (DEF), Motion Blur (MB), Fast Motion (FM), In-Plane Rotation (IPR), Out-of-Plane Rotation (OPR), Out-of-View (OV), Background Clutter (BC), and Low Resolution (LR). As illustrated in Fig. 9, our TM3-deep tracker performs the best on the OPR, DEF, IPR, and FM attributes when compared to some representative trackers. The favorable performance of our tracker under appearance variations (e.g., OPR, IPR, and DEF) demonstrates the effectiveness of the discriminative similarity metric and the memory filtering strategy.

C. Results on PTB

The PTB benchmark database contains 100 video sequences with both RGB and depth data under highly diverse circumstances. These sequences are grouped into the following aspects: target type (human, animal and rigid), target size
TABLE I
RESULTS ON THE PRINCETON TRACKING BENCHMARK WITH 95 VIDEO SEQUENCES: SUCCESS RATES AND RANKINGS (IN PARENTHESES) UNDER DIFFERENT SEQUENCE CATEGORIZATIONS. THE BEST THREE RESULTS ARE HIGHLIGHTED.
and the fusion scheme. In this case, "Random Sampling" directly outputs I_r^* as the final tracking result without any fusion scheme, and then produces only one template Tmpl_r for updating. As a consequence, the average overlap rate dramatically decreases from the original 57.1% to 47.2% if only the Flow-r process is used. Likewise, in the "Edgebox" setting, the Flow-e process is retained and the Flow-r process is removed, so only Tmpl_e is involved in template updating. Such a setting achieves a 49.4% success rate and a 64.3% precision rate, which are much lower than the original result.

2) Parametric Sensitivity: Here we investigate the parametric sensitivity of the number of atoms in D, the kernel width σ_1 in Eq. (1), and the two regularization parameters δ and β in Eq. (9). Fig. 11 illustrates that the proposed TM3 tracker is robust to variations of these parameters, so they can be easily tuned for practical use.

V. CONCLUSION

This paper proposes a novel template matching based tracker named TM3 to address the limitations of existing CFTs. The proposed MBS notably improves the discriminative ability of our tracker, as revealed by both empirical and theoretical analyses. Moreover, the memory filtering strategy is incorporated into the tracking framework to select "representative" and "reliable" previous tracking results to construct the current trustable templates, which greatly enhances the robustness of our tracker to appearance variations. Experimental results on two benchmarks indicate that our TM3 tracker, equipped with the multiple reciprocal nearest neighbor scheme and the memory filtering strategy, achieves better performance than other state-of-the-art trackers.

APPENDIX A
THE PROOF OF LEMMA 1

This section aims to simplify the formulation of E_MBS(P, Q) defined by Definition 1 (our simplification process is similar to that of E_BBS(P, Q); please refer to [22, Sec. 3.1]).

Due to the independence among image patches, the integral in Eq. (3) can be decoupled as:

$$E_{\mathrm{MBS}}(P, Q) = \int_{p_1} \cdots \int_{p_N} \int_{q_1} \cdots \int_{q_M} \mathrm{MBP}(\mathbf{p}_i, \mathbf{q}_j) \prod_{k=1}^{N} f_P(p_k) \prod_{l=1}^{M} f_Q(q_l)\, dP\, dQ, \tag{16}$$

which shares a similar formulation with that of BBP(p_i, q_j) in [22] (see Eq. (7) therein). Next, by defining:

$$C_{p_k} = \int_{-\infty}^{\infty} \mathbb{I}\left[d(\mathbf{p}_k, \mathbf{q}_j) \le d(\mathbf{p}_i, \mathbf{q}_j)\right] f_P(p_k)\, dp_k, \tag{18}$$

and assuming d(p, q) = √((p − q)²) = |p − q|, we can rewrite Eq. (18) as:

$$C_{p_k} = \int_{-\infty}^{\infty} \mathbb{I}\left[q_j^- \le p_k \le q_j^+\right] f_P(p_k)\, dp_k = F_P(q_j^+) - F_P(q_j^-). \tag{19}$$

Similarly, C_{q_l} is:

$$C_{q_l} = \int_{-\infty}^{\infty} \mathbb{I}\left[d(\mathbf{q}_l, \mathbf{p}_i) \le d(\mathbf{p}_i, \mathbf{q}_j)\right] f_Q(q_l)\, dq_l = F_Q(p_i^+) - F_Q(p_i^-). \tag{20}$$

Note that C_{p_k} and C_{q_l} only depend on p_i, q_j, and the underlying distributions f_P(p) and f_Q(q). Therefore, E_MBS(P, Q) can be reformulated as:

$$E_{\mathrm{MBS}}(P, Q) = \int_{p_i} \int_{q_j} f_P(p_i) f_Q(q_j)\, \mathrm{MBP}(\mathbf{p}_i, \mathbf{q}_j)\, dp_i\, dq_j.$$

Using the Taylor expansion with the second-order approximation exp(−x/σ_1) ≈ 1 − x/σ_1 + x²/(2σ_1²), and with
$NC_{p_k} = \sum_{k=1, k \ne i}^{N} \mathbb{I}\left[d(\mathbf{p}_k, \mathbf{q}_j) \le d(\mathbf{p}_i, \mathbf{q}_j)\right]$ evaluated by Eq. (19) and $MC_{q_l} = \sum_{l=1, l \ne i}^{M} \mathbb{I}\left[d(\mathbf{q}_l, \mathbf{p}_i) \le d(\mathbf{p}_i, \mathbf{q}_j)\right]$ evaluated by Eq. (20), we have:

$$E_{\mathrm{MBS}}(P, Q) = 1 - \frac{1}{\sigma_1} \int_{p_i} \int_{q_j} (MC_{q_l}) \cdot (NC_{p_k})\, f_P(p_i) f_Q(q_j)\, dp_i\, dq_j + \frac{1}{2\sigma_1^2} \int_{p_i} \int_{q_j} (MC_{q_l})^2 \cdot (NC_{p_k})^2\, f_P(p_i) f_Q(q_j)\, dp_i\, dq_j. \tag{21}$$

Finally, after some straightforward algebraic manipulations, the expression of E_MBS in Eq. (5) is obtained.

APPENDIX B
THE PROOF OF LEMMA 2

This section begins with the formulation of E_{MBS²}(P, Q), and then computes the variance V_MBS(P, Q) = E_{MBS²}(P, Q) − E²_MBS(P, Q) by Lemma 1.

First, the similarity metric MBP²(p_i, q_j) is obtained from Eq. (17), namely:

$$\mathrm{MBP}^2(\mathbf{p}_i, \mathbf{q}_j) = \exp\left( -\frac{2}{\sigma_1} \sum_{k=1, k \ne i}^{N} \mathbb{I}\left[d(\mathbf{p}_k, \mathbf{q}_j) \le d(\mathbf{p}_i, \mathbf{q}_j)\right] \cdot \sum_{l=1, l \ne i}^{M} \mathbb{I}\left[d(\mathbf{q}_l, \mathbf{p}_i) \le d(\mathbf{p}_i, \mathbf{q}_j)\right] \right).$$

Similar to the above derivation of E_MBS(P, Q) in Lemma 1, E_{MBS²}(P, Q) can be computed as:

$$E_{\mathrm{MBS}^2}(P, Q) = 1 - \frac{2MN}{\sigma_1} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq \tag{22}$$
$$\qquad + \frac{2M^2N^2}{\sigma_1^2} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]^2 \left[F_Q(p^+) - F_Q(p^-)\right]^2 f_P(p) f_Q(q)\, dp\, dq. \tag{23}$$

Next, E²_MBS(P, Q) can be obtained by Lemma 1 with its second-order approximation, namely:

$$E^2_{\mathrm{MBS}}(P, Q) = 1 + \frac{M^2N^2}{\sigma_1^2} \left( \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq \right)^2 - \frac{2MN}{\sigma_1} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq + \frac{M^2N^2}{\sigma_1^2} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]^2 \left[F_Q(p^+) - F_Q(p^-)\right]^2 f_P(p) f_Q(q)\, dp\, dq. \tag{24}$$

As a result, Lemma 2 can be easily proved after some straightforward algebraic manipulations on Eq. (22) and Eq. (24).

APPENDIX C
THE PROOF OF LEMMA 3

By Lemma 1 and Eq. (24), E_{MBS²}(P, Q) can be represented by:

$$E_{\mathrm{MBS}^2}(P, Q) = 4E_{\mathrm{MBS}}(P, Q) - 3 + \frac{3MN}{\sigma_1} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq.$$

Therefore, by Lemma 2 and Eq. (21), we have:

$$E_{\mathrm{MBS}^2}(P, Q) - E_{\mathrm{MBS}}(P, Q) = 3E_{\mathrm{MBS}}(P, Q) - 3 + \frac{3MN}{\sigma_1} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]\left[F_Q(p^+) - F_Q(p^-)\right] f_P(p) f_Q(q)\, dp\, dq = \frac{3M^2N^2}{2\sigma_1^2} \int_{-\infty}^{\infty} \left[F_P(q^+) - F_P(q^-)\right]^2 \left[F_Q(p^+) - F_Q(p^-)\right]^2 f_P(p) f_Q(q)\, dp\, dq > 0, \tag{25}$$

which completes the proof.

ACKNOWLEDGEMENTS

The authors would like to thank Cheng Peng from Shanghai Jiao Tong University for his work on algorithm comparisons, and also sincerely appreciate the anonymous reviewers for their insightful comments.

REFERENCES

[1] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi, "Attentional correlation filter network for adaptive visual tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4828–4837.
[2] L. Zhang, J. Varadarajan, P. N. Suganthan, N. Ahuja, and P. Moulin, "Robust visual tracking using oblique random forests," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5825–5834.
[3] F. Liu, C. Gong, T. Zhou, K. Fu, X. He, and J. Yang, "Visual tracking via nonnegative multiple coding," IEEE Trans. Multimedia, vol. 19, no. 12, pp. 2680–2691, Dec. 2017.
[4] X. Lan, A. J. Ma, and P. C. Yuen, "Multi-cue visual tracking using robust feature-level fusion based on joint sparse representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1194–1201.
[5] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, "Visual object tracking using adaptive correlation filters," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 2544–2550.
[6] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[7] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr, "Staple: Complementary learners for real-time tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1401–1409.
[8] Y. Wu, J. Lim, and M.-H. Yang, "Object tracking benchmark," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1834–1848, Sep. 2015.
[9] A. Li, M. Lin, Y. Wu, M.-H. Yang, and S. Yan, "NUS-PRO: A new visual tracking challenge," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 335–349, Feb. 2016.
[10] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparse collaborative appearance model," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356–2368, May 2014.
[11] X. Jia, H. Lu, and M. Yang, "Visual tracking via coarse and fine structural local sparse appearance models," IEEE Trans. Image Process., vol. 25, no. 10, pp. 4555–4564, Oct. 2016.
[12] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 125–141, 2008.
[13] X. Li, W. Hu, Z. Zhang, and X. Zhang, "Robust visual tracking based on an effective appearance model," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2008, pp. 396–408.
[14] J. Ning, J. Yang, S. Jiang, L. Zhang, and M.-H. Yang, "Object tracking via dual linear structured SVM and explicit feature map," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4266–4274.
[15] P. Sebastian and Y. V. Voon, "Tracking using normalized cross correlation and color space," in Proc. Int. Conf. Intell. Adv. Syst., Nov. 2007, pp. 770–774.
[16] K. Briechle and U. D. Hanebeck, "Template matching using fast normalized cross correlation," in Proc. SPIE Opt. Pattern Recognit. XII, 2001, pp. 95–102.
[17] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[18] S. Oron, A. Bar-Hille, and S. Avidan, "Extended Lucas-Kanade tracking," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 142–156.
[19] G. Nebehay and R. Pflugfelder, "Consensus-based matching and tracking of keypoints for object tracking," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2014, pp. 862–869.
[20] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao, "Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 749–758.
[21] G. Nebehay and R. Pflugfelder, "Clustering of static-adaptive correspondences for deformable object tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 2784–2791.
[22] T. Dekel, S. Oron, M. Rubinstein, S. Avidan, and W. T. Freeman, "Best-buddies similarity for robust template matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 2021–2029.
[23] S. Song and J. Xiao, "Tracking revisited using RGBD camera: Unified benchmark and baselines," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 233–240.
[24] T. Zhou, H. Bhaskar, F. Liu, and J. Yang, "Graph regularized and locality-constrained coding for robust visual tracking," IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 10, pp. 2153–2164, Oct. 2017.
[25] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 391–405.
[26] A. Krause and V. Cevher, "Submodular dictionary selection for sparse representation," in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 567–574.
[27] N. Parikh and S. Boyd, "Proximal algorithms," Found. Trends Opt., vol. 1, no. 3, pp. 127–239, 2013.
[28] C. Bailer, A. Pagani, and D. Stricker, "A superior tracking approach: Building a strong tracker through fusion," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 170–185.
[29] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
[30] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1440–1448.
[31] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 248–255.
[32] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proc. ACM Int. Conf. Multimedia, 2015, pp. 689–692.
[33] W. Hu, J. Gao, J. Xing, C. Zhang, and S. Maybank, "Semi-supervised tensor-based graph embedding learning and its application to visual discriminant tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 172–188, Jan. 2017.
[34] J. Xiao, L. Qiao, R. Stolkin, and A. Leonardis, "Distractor-supported single target tracking in extremely cluttered scenes," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 121–136.
[35] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, "Accurate scale estimation for robust visual tracking," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2014, pp. 1–5.
[36] J. Zhang, S. Ma, and S. Sclaroff, "MEEM: Robust tracking via multiple experts using entropy minimization," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 188–203.
[37] J. Gao, H. Ling, W. Hu, and J. Xing, "Transfer learning based visual tracking with Gaussian processes regression," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2014, pp. 188–203.
[38] J. H. Yoon, M.-H. Yang, and K.-J. Yoon, "Interacting multiview tracker," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 5, pp. 903–917, May 2015.
[39] B. Ma, H. Hu, J. Shen, Y. Zhang, and F. Porikli, "Linearization to nonlinear learning for visual tracking," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 4400–4407.
[40] H. Possegger, T. Mauthner, and H. Bischof, "In defense of color-based model-free tracking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 2113–2120.
[41] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang, "Robust visual tracking via convolutional networks without training," IEEE Trans. Image Process., vol. 25, no. 4, pp. 1779–1792, Apr. 2016.
[42] S. Oron, D. Suhanov, and S. Avidan. (2016). "Best-buddies tracking." [Online]. Available: https://arxiv.org/abs/1611.00148
[43] S. Hare, A. Saffari, and P. H. S. Torr, "Struck: Structured output tracking with kernels," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 263–270.
[44] J. Kwon and K. Mu Lee, "Visual tracking decomposition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 1269–1276.
[45] K. Zhang, L. Zhang, and M. Yang, "Fast compressive tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 10, pp. 2002–2015, Oct. 2014.
[46] B. Babenko, M.-H. Yang, and S. Belongie, "Robust object tracking with online multiple instance learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, Aug. 2011.
[47] H. Grabner, C. Leistner, and H. Bischof, "Semi-supervised on-line boosting for robust tracking," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2008, pp. 234–247.

Fanghui Liu received the B.S. degree in control science and engineering from the Harbin Institute of Technology, Harbin, China, in 2014. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, under the supervision of Prof. Jie Yang. His research areas mainly include computer vision and machine learning with respect to kernel learning, visual tracking, and Bayesian learning.

Chen Gong received the dual Ph.D. degree from Shanghai Jiao Tong University (SJTU) and the University of Technology Sydney in 2016, under the supervision of Prof. Jie Yang and Prof. Dacheng Tao, respectively. He is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. He has published over 30 technical papers at prominent journals and conferences such as the IEEE T-NNLS, the IEEE T-IP, the IEEE T-CYB, CVPR, AAAI, IJCAI, and ICDM. His research interests mainly include machine learning and data mining. He was a recipient of the Excellent Doctoral Dissertation award from SJTU and the Chinese Association for Artificial Intelligence. He was also enrolled in the Summit of the Six Top Talents Program of Jiangsu Province, China.

Xiaolin Huang (S'10–M'12) received the B.S. degree in control science and engineering and the B.S. degree in applied mathematics from Xi'an Jiaotong University, Xi'an, China, in 2006, and the Ph.D. degree in control science and engineering from Tsinghua University, Beijing, China, in 2012. From 2012 to 2015, he was a Post-Doctoral Researcher with ESAT-STADIUS, KU Leuven, Leuven, Belgium. After that, he was selected as an Alexander von Humboldt Fellow and was with the Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany, where he was appointed as a Group Head. Since 2016, he has been an Associate Professor with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China. His current research areas include machine learning and optimization and their applications. In 2017, he was awarded the 1000-Talent (Young Program).

Tao Zhou received the M.S. degree in computer application technology from Jiangnan University in 2012 and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, in 2016. His current research interests include object detection, visual tracking, and machine learning.

Jie Yang received the Ph.D. degree from the Department of Computer Science, Hamburg University, Germany, in 1994. He is currently a Professor with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China. He has led many research projects (e.g., National Science Foundation and 863 National High Tech. Plan), had one book published in Germany, and authored over 300 journal papers. His major research interests are object detection and recognition, data fusion and data mining, and medical image processing.

Dacheng Tao (F'15) is currently a Professor of computer science and an ARC Laureate Fellow with the School of Information Technologies and the Faculty of Engineering and Information Technologies, and the Inaugural Director of the UBTECH Sydney Artificial Intelligence Centre, The University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science. His research interests spread across computer vision, data science, image processing, machine learning, and video surveillance. His research results have been expounded in one monograph and over 500 publications at prestigious journals and prominent conferences, such as the IEEE T-PAMI, T-NNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR, ICCV, ECCV, ICDM, and ACM SIGKDD. He is a fellow of the AAAS, OSA, IAPR, and SPIE. He was a recipient of several best paper awards, such as the best theory/algorithm paper runner-up award at the IEEE ICDM'07, the best student paper award at the IEEE ICDM'13, the distinguished student paper award at the 2017 IJCAI, the 2014 ICDM 10-year highest-impact paper award, and the 2017 IEEE Signal Processing Society Best Paper Award. He was also a recipient of the 2015 Australian Scopus-Eureka Prize, the 2015 ACS Gold Disruptor Award, and the 2015 UTS Vice-Chancellor's Medal for Exceptional Research.