Multi-Cue Visual Tracking
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 12, DECEMBER 2015
I. INTRODUCTION
VISUAL tracking is an important and active research topic in the computer vision community because of its wide range of applications, e.g., intelligent video surveillance, human-computer interaction, and robotics. Although it has been extensively studied over the last two decades, it remains a challenging problem due to the many appearance variations caused by occlusion, pose, illumination, and so on. Therefore, effective modeling of the object's appearance is one of the key issues for the success of a visual tracker [1], and many visual features have been proposed to account for the different variations [2]-[5]. However, since the appearance of a target and its environment changes dynamically, especially in long-term videos, a single feature is insufficient to deal with all of these variations.
1057-7149 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted,
but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
A. Particle Filter

Our tracking algorithm is developed within the framework of sequential Bayesian inference. Let $l_t$ and $z_t$ denote the latent state and the observation at time $t$, respectively. Given a set of observations $Z_T = \{z_t,\ t = 1, \ldots, T\}$ up to frame $T$, the true posterior $P(l_t \mid Z_t)$ is approximated by a particle filter with a set of particles $l_t^i$, $i = 1, \ldots, n$. The latent state variable $l_t$, which describes the motion state of the target, is estimated by

$$\hat{l}_t = \arg\max_{l_t^i} P(l_t^i \mid Z_t) \tag{1}$$

The tracking problem is thus formulated as recursively estimating the posterior probability $P(l_t \mid Z_t)$:

$$P(l_t \mid Z_t) \propto P(z_t \mid l_t) \int P(l_t \mid l_{t-1})\, P(l_{t-1} \mid Z_{t-1})\, dl_{t-1} \tag{2}$$

where $P(l_t \mid l_{t-1})$ denotes the motion model and $P(z_t \mid l_t)$ the observation model. We use the same motion model as in [5] and define the observation model using the proposed tracking algorithm, which is described in the following subsection.

B. Feature-Level Fusion Based on Joint Sparse Representation

$$\min_{W}\ \sum_{k=1}^{K} \frac{1}{2}\big\|y^k - X^k w^k\big\|_2^2 + \lambda\, \Omega(W) \tag{4}$$

$$\Omega(W) = \sum_{n=1}^{N} \|w_n\|_2 \tag{6}$$

An unreliable visual cue $k$ is not well represented by a small number of target templates, i.e.,

$$y^k = w_1^k x_1^k + \cdots + w_N^k x_N^k + \epsilon^k \tag{7}$$

where $w_1^k, \ldots, w_N^k$ are non-zero weights. In this case, we cannot obtain a robust fusion result by minimizing optimization problem (4) with the regularization function (6). Although unreliable features cannot satisfy (5), reliable features can still be sparsely represented by (5) and used to choose the most informative target templates for reconstruction. With the selected templates of indices $n_1, \ldots, n_c$, we rewrite (7) as

$$y^k - \sum_{i=1}^{c} w_{n_i}^k x_{n_i}^k = \sum_{j=1}^{N-c} w_{m_j}^k x_{m_j}^k + \epsilon^k \tag{8}$$
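As a concrete illustration of the inference scheme in (1) and (2) (not the authors' implementation), the loop below is a toy particle filter with a Gaussian random-walk motion model and a hypothetical distance-based likelihood; the state dimension, particle count, and noise levels are all arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def propagate(particles, motion_std=2.0):
    # Motion model P(l_t | l_{t-1}): a simple Gaussian random walk, an
    # illustrative stand-in for the motion model of [5].
    return particles + rng.normal(0.0, motion_std, particles.shape)

def likelihood(particles, observation):
    # Observation model P(z_t | l_t): a hypothetical placeholder that scores
    # each particle by its distance to the observed position.
    d2 = np.sum((particles - observation) ** 2, axis=1)
    return np.exp(-0.5 * d2)

def track_step(particles, observation):
    particles = propagate(particles)
    w = likelihood(particles, observation)
    w /= w.sum()
    # Estimate of eq. (1): the particle with the highest posterior weight.
    estimate = particles[np.argmax(w)]
    # Resample to approximate the posterior P(l_t | Z_t) for the next frame.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], estimate

particles = rng.normal(0.0, 5.0, (200, 2))   # n = 200 particles, 2-D toy state
obs = np.array([1.0, -1.0])                  # synthetic observation
particles, est = track_step(particles, obs)
print(est.shape)  # (2,)
```

In the paper's setting the likelihood would instead come from the fusion model described next.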
To discourage the use of non-selected templates, the regularization function is augmented with a penalty on their coefficients:

$$\lambda_1 \sum_{n=1}^{N} \sqrt{\sum_{k=1}^{K} |w_n^k|^2} + \lambda_2 \sum_{k=1}^{K} \sqrt{\sum_{j=1}^{N-c} |w_{m_j}^k|^2} \tag{9}$$

For the unreliable cues, the representation over the non-selected templates becomes

$$y^k - \sum_{i=1}^{c} s_{n_i}^k x_{n_i}^k = \sum_{j=1}^{N-c} s_{m_j}^k x_{m_j}^k + \epsilon^k, \quad k = K - \tilde{K} + 1, \ldots, K \tag{11}$$

where, without loss of generality, the last $\tilde{K}$ cues are taken to be unreliable. Decomposing the coefficient matrix as $W = R + S$ leads to the regularizer

$$\lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \tag{12}$$

and the robust feature-level fusion problem

$$\min_{W,R,S}\ \sum_{k=1}^{K} \frac{1}{2}\big\|y^k - X^k w^k\big\|_2^2 + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2, \quad \text{s.t. } W = R + S \tag{13}$$
$$O = \big\{k \ \big|\ \|s^k\|_2 > T \big\} \tag{14}$$
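In code, (14) is just a column-norm threshold on $S$; the numpy sketch below uses toy matrix sizes of our choosing, with the default threshold set to the 0.0007 used later in the experiments:

```python
import numpy as np

def detect_unreliable(S, T=0.0007):
    # Eq. (14): cue k is flagged unreliable when the l2-norm of the k-th
    # column of S exceeds the threshold T.
    norms = np.linalg.norm(S, axis=0)          # ||s^k||_2 for each cue k
    return set(np.flatnonzero(norms > T).tolist()), norms

# Toy S: 10 templates x 4 cues; cue 2 carries a large "error" mass.
S = np.zeros((10, 4))
S[:, 2] = 0.05
S[0, 0] = 0.0001          # small spill-over from a reliable cue
O, norms = detect_unreliable(S)
print(O)  # {2}
```

Note that the small non-zero entry of the reliable cue stays below the threshold, which is exactly the trade-off the threshold discussion below is about.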
This scheme detects the unreliable visual cues when the norm of some column of the matrix $S$ is larger than a pre-defined threshold $T$. That is, the features that produce a strong response in the corresponding columns of $S$ are determined to be unreliable. Since we have no information about the selected templates and the unreliable features before learning the reconstruction coefficients, the matrix $S$ may also encode some small components of the reconstruction coefficients of reliable features (as can be seen in Fig. 1(d), where matrix $S$ shows some small non-zero components for reliable features). Therefore, if the threshold is too small, reliable features with small components in $S$ may be mistaken for unreliable features. On the other hand, if it is too large, unreliable features may not be successfully detected. In our experiments, we empirically set it to 0.0007. The representation coefficients of the different visual cues are estimated and the unreliable features are detected by solving the optimization problem (13). Based on $R$ and the detected set $O$, the observation likelihood function in (2) is then defined as
$$p(z_t \mid l_t) \propto \exp\Big(-\frac{1}{K - \tilde{K}} \sum_{j \notin O} \big\|y^j - X^j r^j\big\|_2^2\Big) \tag{15}$$

where $\tilde{K} = |O|$ is the number of detected unreliable features.

To solve (13), we separate the objective into a smooth term

$$F(R, S) = \sum_{k=1}^{K} f(r^k, s^k), \qquad f(r^k, s^k) = \frac{1}{2}\Big\|y^k - \sum_{n=1}^{N} x_n^k \big(r_n^k + s_n^k\big)\Big\|_2^2$$

and a non-smooth term

$$G(R, S) = \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \tag{16}$$

At iteration $t$, the smooth term is linearized at the aggregation point $(U^t, V^t)$:

$$\ell(R, S) = \sum_{k=1}^{K} \Big\{ f(u^{k,t}, v^{k,t}) + (\nabla u^{k,t})^T \big(r^k - u^{k,t}\big) + (\nabla v^{k,t})^T \big(s^k - v^{k,t}\big) \Big\} \tag{17}$$

1) Proximal Step: $R$ and $S$ are updated by solving

$$\min_{R}\ \sum_{k=1}^{K} \frac{1}{2}\big\|r^k - \big(u^{k,t} - \nabla u^{k,t}\big)\big\|_2^2 + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 \tag{18}$$

$$\min_{S}\ \sum_{k=1}^{K} \frac{1}{2}\big\|s^k - \big(v^{k,t} - \nabla v^{k,t}\big)\big\|_2^2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \tag{19}$$

which admit the closed-form solutions

$$r_n^{t+1} = \max\Big(0,\ 1 - \frac{\lambda_1}{\big\|r_n^{t+\frac{1}{2}}\big\|_2}\Big)\, r_n^{t+\frac{1}{2}}, \quad n = 1, \ldots, N \tag{20}$$

$$s^{k,t+\frac{1}{2}} = v^{k,t} - \nabla v^{k,t}, \qquad s^{k,t+1} = \max\Big(0,\ 1 - \frac{\lambda_2}{\big\|s^{k,t+\frac{1}{2}}\big\|_2}\Big)\, s^{k,t+\frac{1}{2}}, \quad k = 1, \ldots, K \tag{21}$$

It should be noted that the update schemes (20) for $R$ and (21) for $S$ differ from each other, since $R$ and $S$ have different sparsity properties, being grouped according to columns and rows, respectively.

2) Aggregation Step: the aggregation matrices are updated as

$$U^{t+1} = R^{t+1} + \frac{a_t - 1}{a_{t+1}}\big(R^{t+1} - R^t\big), \qquad V^{t+1} = S^{t+1} + \frac{a_t - 1}{a_{t+1}}\big(S^{t+1} - S^t\big) \tag{22}$$

where $a_{t+1} = \frac{1 + \sqrt{1 + 4 a_t^2}}{2}$ and $a_0 = 1$.

The overall optimization procedure is summarized in Algorithm 1.
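The shrinkage and aggregation steps can be sketched as follows. This is an illustrative accelerated proximal gradient solver for (13), with toy dimensions, synthetic data, and a hand-picked step size `eta`; it is not the authors' implementation:

```python
import numpy as np

def group_shrink_rows(A, tau):
    # Row-wise group soft-thresholding: prox of tau * sum_n ||a_n||_2,
    # cf. the eq. (20)-style update for R.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * A

def group_shrink_cols(A, tau):
    # Column-wise group soft-thresholding, cf. the eq. (21)-style update for S.
    return group_shrink_rows(A.T, tau).T

def apg_fusion(Y, X, lam1, lam2, eta, iters=200):
    """Accelerated proximal gradient sketch for (13):
    min_(R,S) sum_k 0.5*||y^k - X^k (r^k + s^k)||^2
              + lam1 * sum_n ||r_n||_2 + lam2 * sum_k ||s^k||_2.
    Y: d x K observations, X: list of K dictionaries (each d x N);
    eta is a step size chosen small enough for this toy problem."""
    K, N = Y.shape[1], X[0].shape[1]
    R = np.zeros((N, K)); S = np.zeros((N, K))
    U, V = R.copy(), S.copy()
    a = 1.0
    for _ in range(iters):
        # Gradient of the smooth term at the aggregation point (U, V);
        # it is identical w.r.t. r^k and s^k.
        G = np.stack([X[k].T @ (X[k] @ (U[:, k] + V[:, k]) - Y[:, k])
                      for k in range(K)], axis=1)
        R_new = group_shrink_rows(U - eta * G, eta * lam1)
        S_new = group_shrink_cols(V - eta * G, eta * lam2)
        a_new = (1.0 + np.sqrt(1.0 + 4.0 * a * a)) / 2.0
        U = R_new + (a - 1.0) / a_new * (R_new - R)   # eq. (22)-style momentum
        V = S_new + (a - 1.0) / a_new * (S_new - S)
        R, S, a = R_new, S_new, a_new
    return R, S

rng = np.random.default_rng(1)
X = [rng.normal(size=(8, 5)) for _ in range(3)]      # K = 3 cues, N = 5 templates
W_true = np.zeros((5, 3)); W_true[0, :] = 1.0        # one shared active template
Y = np.stack([X[k] @ W_true[:, k] for k in range(3)], axis=1)
R, S = apg_fusion(Y, X, lam1=0.01, lam2=0.5, eta=0.01)
print(R.shape, S.shape)  # (5, 3) (5, 3)
```

The momentum sequence `a` follows the standard FISTA recursion, matching (22).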
Here the element in the $i$-th row and $j$-th column of the kernel matrix $\mathbf{K}(X, Y)$ is defined as

$$\mathbf{K}_{i,j}(X, Y) = \phi(x_i)^T \phi(y_j)$$

With the kernel trick, the robust feature-level fusion problem is formulated in the kernel space as

$$\min_{W,R,S}\ \sum_{k=1}^{K} \frac{1}{2}\Big\{\mathbf{K}(y^k, y^k) - 2 (w^k)^T \mathbf{K}(X^k, y^k) + (w^k)^T \mathbf{K}(X^k, X^k)\, w^k\Big\} + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2, \quad \text{s.t. } W = R + S \tag{25}$$

After solving (25), the representation coefficients of the different visual cues are estimated and the unreliable features in the kernel space are detected. The observation likelihood function can then be re-defined as

$$p(z_t \mid l_t) \propto \exp\Big(-\frac{1}{K - \tilde{K}} \sum_{j \notin O} \big[\mathbf{K}(y^j, y^j) - 2 (w^j)^T \mathbf{K}(X^j, y^j) + (w^j)^T \mathbf{K}(X^j, X^j)\, w^j\big]\Big) \tag{27}$$
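To make the kernel computations concrete, here is a small numpy sketch; the RBF kernel, the `gamma` value, and all sizes are illustrative assumptions (this passage does not commit to a particular kernel), and `kernel_residual` expands $\frac{1}{2}\|\phi(y) - \Phi(X^k) w\|^2$ with the kernel trick as in the kernelized objective and likelihood:

```python
import numpy as np

def kernel_matrix(X, Y, gamma=0.5):
    # K_{i,j}(X, Y) = phi(x_i)^T phi(y_j); an RBF kernel is used here as one
    # possible implicit feature map phi. Columns of X and Y are data points.
    d2 = (np.sum(X**2, axis=0)[:, None] + np.sum(Y**2, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.exp(-gamma * d2)

def kernel_residual(y, Xk, w, gamma=0.5):
    # 0.5 * ||phi(y) - Phi(X^k) w||^2 expanded into the three kernel terms
    # that appear in the kernelized objective and likelihood.
    return 0.5 * (kernel_matrix(y, y, gamma)[0, 0]
                  - 2.0 * w @ kernel_matrix(Xk, y, gamma)[:, 0]
                  + w @ kernel_matrix(Xk, Xk, gamma) @ w)

rng = np.random.default_rng(2)
Xk = rng.normal(size=(6, 4))      # 6-dim features, N = 4 templates
y = Xk[:, [0]]                    # observation equal to template 0
w = np.array([1.0, 0.0, 0.0, 0.0])
print(kernel_residual(y, Xk, w) < 1e-8)  # True: exact reconstruction
```

Choosing the observation equal to a template and `w` as the matching indicator makes the residual vanish, which is a quick sanity check of the expansion.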
V. EXPERIMENTS
In this section, we evaluate the proposed robust joint sparse
representation-based feature-level fusion tracker (RJSRFFT)
and its kernelized method (K-RJSRFFT) using both synthetic
data and real videos from publicly available datasets.
A. Unreliable Feature Detection on Synthetic Data
TABLE I
THE CHALLENGING FACTOR AND ITS CORRESPONDING VIDEOS
Fig. 2. Graphical illustration of the unreliable feature detection scheme. The upper-left image in the purple box shows the tracking result in the 658-th frame of the video Occluded Face 1. The seven plots in the upper-right green box illustrate the reconstruction coefficients obtained by solving the $\ell_1$ minimization problem shown in the small green box. After solving (24), the resulting reconstruction coefficients of each feature encoded in matrix $R$ are illustrated in the bottom-left blue box, and the unreliable feature detection results encoded in matrix $S$ are illustrated in the bottom-right blue box. In all of these plots, the y-coordinate denotes the value of the coefficient and the x-coordinate denotes the index of the template in the template set.
TABLE II
CENTER LOCATION ERROR (IN PIXELS). THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN
utilize different features to handle different kinds of variations while preserving the advantages of sparse representation.
The proposed multi-cue trackers also perform better than
existing multi-cue trackers, i.e., OAB, MIL, and SemiB. This
TABLE III
VOC OVERLAPPING RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN
TABLE IV
SUCCESS RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN
Fig. 3. The survival curves based on the F-score for the entire set of thirty-three videos and for the subsets of videos containing different challenging factors; the average F-scores are shown in the corresponding legends in descending order. (a) Whole set of videos. (b) Occlusion subset of videos. (c) Cluttered background subset of videos. (d) Scale subset of videos. (e) Illumination subset of videos. (f) Pose subset of videos.
Fig. 4. Qualitative results on some typical frames including some challenging factors. (a) Occlusion. (b) Background. (c) Scale. (d) Illumination. (e) Pose.
$$B^k = \begin{bmatrix} \nabla^2_{r^k r^k} F & \nabla^2_{r^k s^k} F \\ \nabla^2_{s^k r^k} F & \nabla^2_{s^k s^k} F \end{bmatrix} \tag{29}$$

$$= \begin{bmatrix} \mathbf{K}(X^k, X^k) & \mathbf{K}(X^k, X^k) \\ \mathbf{K}(X^k, X^k) & \mathbf{K}(X^k, X^k) \end{bmatrix} \in \mathbb{R}^{2N \times 2N} \tag{30}$$
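One consequence of the block structure in (29) and (30), presumably relevant when bounding step sizes for the proximal gradient updates: since every block of $B^k$ equals $\mathbf{K}(X^k, X^k)$, the largest eigenvalue of $B^k$ is exactly twice that of $\mathbf{K}(X^k, X^k)$. A quick numpy check, with an arbitrary PSD matrix standing in for the kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
Kxx = A @ A.T                       # stand-in PSD kernel matrix K(X^k, X^k)

# Block Hessian with all four blocks equal to K(X^k, X^k), as in (30).
B = np.block([[Kxx, Kxx], [Kxx, Kxx]])

# B = K (x) [[1,1],[1,1]], so its spectrum is {2*eig(Kxx)} together with zeros;
# hence lambda_max(B) = 2 * lambda_max(K(X^k, X^k)).
lmax_B = np.linalg.eigvalsh(B).max()
lmax_K = np.linalg.eigvalsh(Kxx).max()
print(np.isclose(lmax_B, 2.0 * lmax_K))  # True
```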
Rama Chellappa (S'78–M'79–SM'83–F'92)
received the B.E. (Hons.) degree in electronics and
communication engineering from the University of
Madras, India, the M.E. (Hons.) degree from the
Indian Institute of Science, Bangalore, India, and the
M.S.E.E. and Ph.D. degrees in electrical engineering
from Purdue University, West Lafayette, IN. From
1981 to 1991, he was a Faculty Member with
the Department of Electrical Engineering-Systems,
University of Southern California (USC). Since
1991, he has been a Professor of Electrical and
Computer Engineering and an Affiliate Professor of Computer Science with
the University of Maryland (UMD), College Park. He is also affiliated
with the Center for Automation Research and the Institute for Advanced
Computer Studies (Permanent Member) and is serving as the Chair of the
Electrical and Computer Engineering Department. In 2005, he was named
Minta Martin Professor of Engineering. His current research interests span
many areas in image processing, computer vision, and pattern recognition.
He is a Golden Core Member of the IEEE Computer Society, served as
a Distinguished Lecturer of the IEEE Signal Processing Society and as
the President of the IEEE Biometrics Council. He is a Fellow of IAPR, the Optical Society of America, the American Association for the Advancement of Science, the Association for Computing Machinery, and the Association for the Advancement of Artificial Intelligence, and holds four patents. He is a recipient of an NSF
Presidential Young Investigator Award and four IBM Faculty Development
Awards. He received two paper awards and the K.S. Fu Prize from the
International Association for Pattern Recognition (IAPR). He is a recipient
of the Society, Technical Achievement, and Meritorious Service Awards
from the IEEE Signal Processing Society. He also received the Technical
Achievement and Meritorious Service Awards from the IEEE Computer
Society. He also received the Excellence in Teaching Award from
the School of Engineering at USC. At UMD, he received the college and
university level recognitions for research, teaching, innovation, and mentoring
undergraduate students. In 2010, he was recognized as an Outstanding
Electrical and Computer Engineer by Purdue University. He served as
the Editor-in-Chief of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE and the General and Technical Program
Chair/Co-Chair of several IEEE International and National Conferences and
Workshops.