

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 12, DECEMBER 2015

Joint Sparse Representation and Robust Feature-Level Fusion for Multi-Cue Visual Tracking

Xiangyuan Lan, Student Member, IEEE, Andy J. Ma, Pong C. Yuen, Senior Member, IEEE, and Rama Chellappa, Fellow, IEEE
Abstract: Visual tracking using multiple features has proven to be a robust approach because the features can complement each other. Since different types of variations, such as illumination, occlusion and pose, may occur in a video sequence, especially in long sequences, how to properly select and fuse appropriate features has become one of the key problems of this approach. To address this issue, this paper proposes a new joint sparse representation model for robust feature-level fusion. The proposed method dynamically removes unreliable features from the fusion used for tracking by exploiting the advantages of sparse representation. In order to capture the non-linear similarity of features, we extend the proposed method into a general kernelized framework, which is able to perform feature fusion in various kernel spaces. As a result, robust tracking performance is obtained. Both qualitative and quantitative experimental results on publicly available videos show that the proposed method outperforms both sparse representation-based and fusion-based trackers.

Index Terms: Visual tracking, feature fusion, joint sparse representation.

I. INTRODUCTION

VISUAL tracking is an important and active research topic in the computer vision community because of its wide range of applications, e.g., intelligent video surveillance, human-computer interaction and robotics. Although it has been extensively studied in the last two decades, it remains a challenging problem due to the many appearance variations caused by occlusion, pose, illumination and so on. Therefore, effective modeling of the object's appearance is one of the key issues for the success of a visual tracker [1], and many visual features have been proposed to account for the different variations [2]-[5]. However, since the appearance of a target and its environment change dynamically, especially in long-term videos, a single feature is insufficient to deal with all such variations.

Manuscript received March 2, 2015; revised July 26, 2015; accepted August 27, 2015. Date of publication September 23, 2015; date of current version October 28, 2015. This work was supported by the General Research Fund through the Research Grants Council, Hong Kong, under Grant HKBU 212313. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Ling Shao.
X. Lan and P. C. Yuen are with the Department of Computer Science, Hong Kong Baptist University, Hong Kong (e-mail: xylan@comp.hkbu.edu.hk; pcyuen@comp.hkbu.edu.hk).
A. J. Ma was with the Department of Computer Science, Hong Kong Baptist University, Hong Kong. He is now with the Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: andyjhma@cs.jhu.edu).
R. Chellappa is with the Center for Automation Research, Department of Electrical and Computer Engineering and the University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA (e-mail: rama@cfar.umd.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2015.2481325

As such, the use of multiple cues/features to model the object appearance has proven to be a robust approach for improved performance [6]-[11]. Along this line, many algorithms have been proposed for visual tracking in the past years. Generally speaking, existing multi-cue tracking algorithms can be roughly divided into two categories: score-level and feature-level. Score-level approaches combine the classification scores corresponding to different visual features to perform foreground/background classification. Methods such as online boosting [7], [12], multiple kernel boosting [6] and online multiple instance boosting [8] belong to this category. However, the Data Processing Inequality (DPI) [13] indicates that the feature level contains more information than the classifier level. Therefore, fusion should be performed at the feature level to take advantage of the more informative cues for tracking. A typical feature-level fusion method is to concatenate different feature vectors into a single vector [9], but such a method may result in a high-dimensional feature vector, which may degrade tracking efficiency; in addition, ignoring the incompatibility of heterogeneous features may cause performance degradation. Another feature-level fusion method is multiple kernel learning (MKL) [14], which aims to learn a weighted combination of different feature kernels according to their discriminative power. Since some features extracted from a target may be corrupted due to unexpected variations, the performance of a pre-learned classifier based on MKL may be affected by such features. Moreover, because not all cues/features are reliable, combining all the features may not improve the tracking performance. As such, dynamic selection/combination of visual cues/features is required for multi-cue tracking.
Recently, multi-task joint sparse representation (MTJSR) [15] has been proposed for feature-level fusion for object classification, and promising results have been reported. In MTJSR, the class-level joint sparsity patterns among multiple features are discovered by using a joint sparsity-inducing norm. Therefore, the relationship among different visual cues can be discovered through the joint sparsity constraint. Moreover, high-dimensional features are represented by low-dimensional reconstruction weights for efficient fusion. However, MTJSR was derived based on the assumption that all representation tasks are closely related and share the same sparsity pattern, which may not be valid in tracking applications, since unreliable features may exist due to appearance and other variations of a tracked target. For example, when a tracked object is partially occluded, its local features extracted from the occluded part(s) may


not be relevant to, or well represented by, the corresponding feature dictionary. In such cases, the reconstruction coefficients of these features may not exhibit the sparsity pattern which is an intrinsic characteristic of sparse representation, as indicated in [16] and [17]. In addition, since unreliable features are not relevant to their corresponding feature dictionaries, the reconstruction error may be large. Fusing such features with large reconstruction errors may not describe the object appearance well, which may degrade the tracking performance. In [15], a weighting scheme based on LPBoost [18] is adopted to measure the confidence of the different features on a validation set for fusion. But in visual tracking, the variations of the tracked target cannot be predicted, and it is not reasonable to measure the confidence of different features based on an offline validation set.
In order to address the above-mentioned issues in feature-level fusion-based trackers, we propose a new tracking algorithm called the Robust Joint Sparse Representation-based Feature-level Fusion Tracker (RJSRFFT).1 We integrate feature selection, fusion and representation into a unified optimization model. Within this model, unreliable features are detected and feature-level fusion is simultaneously performed on the selected reliable features. Therefore, the negative effect of unreliable features can be removed, and the relationship among reliable features can be exploited to select representative templates for a sparse representation of the tracked target, which preserves the advantages of sparse trackers. Based on the proposed model, we also develop a general tracking algorithm called the Robust Joint Kernel Sparse Representation-based Feature-level Fusion Tracker (K-RJSRFFT) to dynamically perform fusion of different features from different kernel spaces, which is able to utilize the non-linearity of features to enhance the tracking performance.
The contributions of this paper are as follows:
• We develop a new visual tracking algorithm based on feature-level fusion using joint sparse representation. The proposed method inherits the advantages of joint sparse representation and is able to fuse multiple features for object tracking.
• We present a method to detect unreliable visual features for feature-level fusion. By removing unreliable features (outliers), which introduce negative effects into the fusion, the tracking performance can be improved.
• In order to utilize the non-linearity in data samples to enhance the performance of the proposed tracking algorithm, we extend the algorithm into a general framework which is able to dynamically combine multiple kernels with multiple features.
The main differences between this journal version and our preliminary conference version [19] are as follows. First, the tracker reported in the conference version [19] is linear in the data samples, while this journal version considers a non-linear extension and is able to dynamically perform feature-level fusion in multiple kernel spaces. Second, the optimization procedure used in this journal version (the proposed K-RJSRFFT method) is derived based on kernel functions without knowing the exact mapping; therefore, the proposed tracker K-RJSRFFT is able to operate in any implicit feature space once the kernel matrix is given. Third, to increase the computational efficiency of the accelerated proximal gradient method, this journal version introduces a new method to adaptively and tightly estimate the Lipschitz constant, which is different from the conference version [19] that used a predefined value. Finally, comprehensive analysis and experimental evaluation of the proposed trackers and state-of-the-art trackers are presented in this journal version, which further demonstrates the effectiveness of the proposed robust feature-level fusion scheme.

1 Preliminary results of this work were presented in [19].
The rest of this paper is organized as follows. In Section II, we first review related work on tracking from two approaches, generative and discriminative; we also analyze sparse representation-based trackers and review joint sparse representation-based methods in computer vision. In Section III, we present the proposed feature-level fusion tracking algorithm. In Section IV, we discuss some implementation details of the proposed algorithm. Experimental results and conclusions are presented in Sections V and VI, respectively.
II. RELATED WORK

In this section, we first review some representative tracking algorithms and discuss sparse representation-based trackers. We also review existing joint sparse representation methods related to our proposed tracking algorithm. For a more comprehensive study of tracking methods, interested readers can refer to [1] and [20]-[24].
A. Generative Trackers

Generative trackers construct an appearance model to describe the target and then search for the image region that best matches the appearance model. In the last decade, much work has been done on object appearance modeling, including mixture models [25], [26], kernel-based methods [27]-[29], subspace learning [3], [30]-[33] and linear representations [34]-[40]. In [25], a WSL mixture model was constructed via the expectation-maximization algorithm to handle appearance variations and occlusion during tracking. In [27], a mean-shift procedure based on a color histogram representation was introduced to locate the object. Based on the work in [27], some variants of kernel-based trackers were proposed, e.g., [28], [29]. Since most kernel-based trackers do not have occlusion handling steps, a fragment-based tracker [41] was proposed to handle occlusion using a patch-based appearance model. Based on the assumption that object appearances under mild pose variations lie in a low-dimensional subspace, Black and Jepson [30] employed subspace learning to construct an object appearance model off-line within the optical flow framework. Instead of using a pre-learned appearance model, which may not account well for large appearance changes, incremental subspace learning methods, e.g., [3], [31], [32], attempt to learn adaptive, low-dimensional subspaces for appearance modeling. Kwon and Lee [33] exploited sparse PCA to construct multiple observation models to enhance robustness to combinatorial


appearance changes. Different from subspace learning approaches, which require learning subspace bases, sparse linear representation-based methods [34] directly use a sparse linear combination of object templates plus trivial templates to account for target appearance variations while modeling occlusion and image noise. Li et al. [35] proposed non-sparse metric-weighted least squares regression for efficient appearance modeling; such an adaptive metric can capture the correlations among feature dimensions, leading to the utilization of more information for object appearance modeling.
B. Discriminative Trackers

Discriminative trackers cast tracking as a foreground/background classification problem and locate the object's position by maximizing the inter-class separation. Recent efforts on discriminative trackers include SVM-based tracking [42], [43], online boosting [6], [7], [12], multiple instance learning [8], [44], compressive tracking [4], and metric learning trackers [45]-[47]. Avidan [42] employed an SVM based on optical flow for object tracking; however, since this tracking algorithm learned the classifier with off-line samples, it may not be able to handle unpredicted appearance variations. Different from a static appearance model, an adaptive model can handle appearance variation adaptively. In [7], an online boosting algorithm was developed for feature selection in visual tracking. Since this method regards the current tracker position as a positive sample for model updating, the tracker may drift owing to updating the model with potentially misaligned samples. In [12], a semi-supervised learning algorithm was proposed for tracking in which positive samples are collected only in the first frame to handle the ambiguity of sample labeling. Babenko et al. [8] developed an online tracking algorithm within the multiple instance learning framework in which samples are collected in the form of bags. With the goal of enhancing resistance to background distracters, Jiang et al. [45] exploited neighborhood component analysis metric learning to learn an adaptive metric for differential tracking. Different from classical subspace learning-based feature extraction methods in visual tracking [31], [48], which are data-dependent and may be sensitive to misaligned samples, Zhang et al. [4] exploited a data-independent sparse random measurement matrix to extract low-dimensional features for Naive Bayes classifier-based tracking. Recently, Hare et al. [49] proposed an online structured output SVM-based tracker to incorporate structured constraints into the classifier training stage. Henriques et al. [50] developed a fast Fourier transform-based tracker which exploits the circulant structure of the kernel in an SVM. In [51], a superpixel-based discriminative appearance model was introduced to handle occlusion and deformation.
C. Sparse Representation Based Trackers
Based on the intuition that appearances of a tracked object
can be sparsely represented by its appearance in previous
frames, a sparse representation-based tracker was introduced
in [17], which is robust to occlusion and noise. Subsequently,
many algorithms have been proposed to improve tracking

accuracy with reduced computational complexity [22].


Li et al. [52] exploited compressive sensing theory to
reduce the template dimension to improve the computational
efficiency. Liu et al. [53] developed a two-stage sparse
optimization-based tracker in which sparse discriminative
features are selected and temporal information is exploited for
target representation via a dynamic group sparsity algorithm.
Liu et al. [54] proposed a mean-shift tracker based on histograms of local sparse codes. Mei et al. [34] adopted $\ell_2$ minimization to bound the $\ell_1$ error in order to speed up particle resampling, and Bao et al. [55] utilized the accelerated proximal gradient method and introduced an $\ell_2$ norm regularization step on the trivial template coefficients to speed up and improve the performance of the $\ell_1$ tracker. Zhong et al. [56] combined a holistic sparsity-based discriminative classifier and a local sparse generative model to handle occlusion, cluttered background and model drift. Zhang et al. [57] proposed a multi-task joint sparse learning method to exploit the relationship between particles such that the accuracy of the $\ell_1$ tracker can be improved. Jia et al. [58] developed a local
sparse appearance model to enhance robustness to occlusion.
Wang et al. [59] introduced sparsity regularization into incremental subspace learning to account for the noise caused by occlusion. Zhuang et al. [60] constructed a discriminative similarity map for reverse sparse-based tracking. In order to exploit a visual prior for tracking, Wang et al. [61] constructed a codebook from SIFT descriptors learned from a general image data set for sparse coding, and a linear classifier is trained on the sparse codes for foreground/background classification. The trackers mentioned above utilize a single cue/feature for appearance modeling. To fuse multiple features, Wu et al. [9] concatenated multiple features into a high-dimensional feature vector to construct a template set for sparse representation. However, the high dimensionality of the combined feature vector increases the computational complexity of this method, and fusion via concatenation may not improve performance when some of the source data are corrupted. Wang et al. [62] introduced a kernel sparse representation-based tracking algorithm to fuse multiple features in kernel space for tracking, but this method assumes that all features are reliable and cannot adaptively select reliable features for tracking. Recently, Wu et al. [63] proposed a metric learning-based structural appearance model in which weighted local sparse codes are fused for appearance modeling by concatenation.
D. Multi-Task Joint Sparse Representation

Multi-task learning aims to improve the overall performance of related tasks by exploiting cross-task relationships. Yuan and Yan [15] formulated linear representation models from multiple visual features as a multi-task joint sparse representation problem in which multiple features are fused via class-level joint sparsity regularization. Zhang et al. [64] proposed a novel joint dynamic sparsity prior for multi-observation visual recognition. Shekhar et al. [65] proposed a novel multimodal multivariate sparse representation method for multimodal biometrics recognition. Recently, Bahrampour et al. [66] incorporated a tree-structured sparsity-inducing norm and a weighted scheme for multi-modality classification.
III. ROBUST FEATURE-LEVEL FUSION FOR MULTI-CUE TRACKING

This section presents the details of the proposed feature-level fusion tracking algorithm. The proposed method is developed within the particle filter framework and consists of two major components: feature-level fusion based on joint sparse representation, and detection of unreliable visual cues for robust fusion.

A. Particle Filter

Our tracking algorithm is developed under the framework of sequential Bayesian inference. Let $l_t$ and $z_t$ denote the latent state and observation at time $t$, respectively. Given a set of observations $Z_t = \{z_t, t = 1, \ldots, T\}$ up to frame $T$, the true posterior $P(l_t \mid Z_t)$ is approximated by a particle filter with a set of particles $l_t^i$, $i = 1, \ldots, n$. The latent state variable $l_t$, which describes the motion state of the target, is estimated using

$$l_t^* = \arg\max_{l_t^i} P(l_t^i \mid Z_t) \qquad (1)$$

The tracking problem is thus formulated as recursively estimating the posterior probability $P(l_t \mid Z_t)$,

$$P(l_t \mid Z_t) \propto P(z_t \mid l_t) \int P(l_t \mid l_{t-1}) P(l_{t-1} \mid Z_{t-1})\, dl_{t-1} \qquad (2)$$

where $P(l_t \mid l_{t-1})$ denotes the motion model and $P(z_t \mid l_t)$ the observation model. We use the same motion model as in [5], and define the observation model using the proposed tracking algorithm, which is described in the following subsections.

B. Feature-Level Fusion Based on Joint Sparse Representation

In the particle filter-based multi-cue tracking framework, we are given $K$ types of visual cues, e.g., color, shape and texture, to represent the tracking result in the current frame and the template images of the target object. Denote the $k$-th visual cue of the current tracking result and of the $n$-th template image as $y^k$ and $x_n^k$, respectively. Inspired by the sparse tracking algorithm [17], the tracking result in the current frame can be sparsely represented by a linear combination of the target templates plus an error vector $\epsilon^k$ for each visual cue, i.e.,

$$y^k = X^k w^k + \epsilon^k, \quad k = 1, \ldots, K \qquad (3)$$

where $w^k$ is a weight vector of dimension $N$ that reconstructs the current tracking result with visual cue $y^k$ from the template set $X^k = [x_1^k, \ldots, x_N^k]$, and $N$ is the number of templates.

In (3), the weight vectors $w^1, \ldots, w^K$ can be considered as an underlying representation of the tracking result in the current frame with visual cues $y^1, \ldots, y^K$. In other words, feature-level fusion is realized by discovering the relationship among the visual cues $y^1, \ldots, y^K$ to determine the weight vectors $w^1, \ldots, w^K$ dynamically. To learn the optimal fused representation, we define the objective function by minimizing the reconstruction error and a regularization term, i.e.,

$$\min_{W} \sum_{k=1}^{K} \frac{1}{2}\|y^k - X^k w^k\|_2^2 + \lambda\, \Omega(W) \qquad (4)$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm, $\lambda$ is a non-negative parameter, $W = (w^1, \ldots, w^K) \in \mathbb{R}^{N \times K}$ is the matrix of the weight vectors, and $\Omega$ is the regularization function on $W$.

To derive the regularization function $\Omega$, we assume that the current tracking result can be sparsely represented by the same set of chosen target templates with indices $n_1, \ldots, n_c$ for each visual cue, i.e.,

$$y^k = w_{n_1}^k x_{n_1}^k + \cdots + w_{n_c}^k x_{n_c}^k + \epsilon^k, \quad k = 1, \ldots, K \qquad (5)$$

Under the joint sparsity assumption, the number of chosen target templates $c = \|(\|w_1\|_2, \ldots, \|w_N\|_2)\|_0$ is small. Therefore, we can minimize this sparsity measure as the regularization term in optimization problem (4). Since the $\ell_0$ norm can be approximated by the $\ell_1$ norm to make the optimization problem tractable, we define $\Omega$ as the following equation, similar to that in [15] measuring the class-level sparsity for classification applications,

$$\Omega(W) = \|(\|w_1\|_2, \ldots, \|w_N\|_2)\|_1 = \sum_{n=1}^{N} \|w_n\|_2 \qquad (6)$$

where $w_n$ denotes the $n$-th row of matrix $W$, corresponding to the weights of the visual cues for the $n$-th target template. With this formulation, the joint sparsity across different visual cues can be discovered, i.e., $w_n$ becomes zero for a large number of target templates when minimizing optimization problem (4). This ensures that the selected templates (those with non-zero weights) play more important roles in reconstructing the current tracking result for all the visual cues.

C. Detecting Unreliable Visual Cues for Robust Fusion

Since some visual cues may be sensitive to certain variations, the assumption of shared sparsity may not be valid for tracking. Such unreliable visual cues of the target cannot be sparsely represented by the same set of selected target templates. That means, for an unreliable visual cue $y^{k'}$, all the target templates are likely to have non-zero weights in order to achieve a small reconstruction error, i.e.,

$$y^{k'} = w_1^{k'} x_1^{k'} + \cdots + w_N^{k'} x_N^{k'} + \epsilon^{k'} \qquad (7)$$

where $w_1^{k'}, \ldots, w_N^{k'}$ are non-zero weights. In this case, we cannot obtain a robust fusion result by minimizing optimization problem (4) with the regularization function (6).

Although unreliable features cannot satisfy (5), reliable features can still be sparsely represented by (5) and used to choose the most informative target templates for reconstruction. With the selected templates of indices $n_1, \ldots, n_c$, we rewrite (7) as follows,

$$y^{k'} - \sum_{i=1}^{c} w_{n_i}^{k'} x_{n_i}^{k'} = \sum_{j=1}^{N-c} w_{m_j}^{k'} x_{m_j}^{k'} + \epsilon^{k'} \qquad (8)$$

where $m_j$ denotes the index of a template which is not chosen to reconstruct the current tracking result.

Suppose we have $K'$ unreliable visual cues. Without loss of generality, let visual cues $1, \ldots, K-K'$ be reliable, while $K-K'+1, \ldots, K$ are unreliable. To detect the $K'$ unreliable visual cues, we employ a sparsity assumption on the unreliable features, i.e., the number of unreliable visual cues

$$K' = \Big\|\Big(\sum_{j=1}^{N-c} |w_{m_j}^1|^2, \ldots, \sum_{j=1}^{N-c} |w_{m_j}^K|^2\Big)\Big\|_0$$

is small, which can be used to define the regularization function. Similar to (6), the $\ell_1$ norm is used instead of the $\ell_0$ norm. Combining this with the regularization function for discovering the joint sparsity among reliable features, the regularization function $\Omega$ in (4) becomes
$$\Omega(W) = \lambda_1 \sum_{n=1}^{N} \sqrt{\sum_{k=1}^{K-K'} |w_n^k|^2} + \lambda_2 \sum_{k'=K-K'+1}^{K} \sqrt{\sum_{j=1}^{N-c} |w_{m_j}^{k'}|^2} \qquad (9)$$

where $\lambda_1$ and $\lambda_2$ are non-negative parameters that balance the joint sparsity across the selected target templates and the unreliable visual cues.
However, we have no information about the selected templates and the unreliable features before learning, so we cannot define a regularization function like (9) a priori. Inspired by robust multi-task feature learning [67], the weight matrix $W$ can be decomposed into two terms $R$ and $S$ with $W = R + S$. Suppose the non-zero weights of the reliable features are encoded in $R$, while the non-zero weights of the unreliable features are encoded in $S$. The current tracking result of a reliable visual cue $k$ can then be reconstructed by the information in $R$ only, i.e., (5) is revised as

$$y^k = r_{n_1}^k x_{n_1}^k + \cdots + r_{n_c}^k x_{n_c}^k + \epsilon^k, \quad k = 1, \ldots, K-K' \qquad (10)$$

On the other hand, (8) for an unreliable feature $k'$ becomes

$$y^{k'} - \sum_{i=1}^{c} s_{n_i}^{k'} x_{n_i}^{k'} = \sum_{j=1}^{N-c} s_{m_j}^{k'} x_{m_j}^{k'} + \epsilon^{k'}, \quad k' = K-K'+1, \ldots, K \qquad (11)$$
Based on the above analysis, the final regularization function can be defined analogously to (9), i.e.,

$$\Omega(W) = \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \qquad (12)$$
Absorbing $\lambda$ into the regularization parameters (i.e., replacing $\lambda\lambda_1$ and $\lambda\lambda_2$ by $\lambda_1$ and $\lambda_2$) and substituting $\Omega(W)$ in (12) into optimization problem (4), the proposed robust joint sparse representation based feature-level fusion tracker (RJSRFFT) is developed as

$$\min_{W,R,S} \sum_{k=1}^{K} \frac{1}{2}\|y^k - X^k w^k\|_2^2 + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \quad \text{s.t. } W = R + S \qquad (13)$$
The procedure to solve the optimization problem (13) will be given in the following section. The optimal fused representation is given by $R$ and $S$, which encode the sparsity patterns of the reliable features and the non-sparsity patterns of the unreliable visual cues, respectively. When the non-sparsity pattern of an unreliable feature is encoded in $S$, the corresponding column of coefficients in $S$ will be larger than zero. We therefore determine the index set $O$ of the unreliable features as

$$O = \{k' \;\text{s.t.}\; \|s^{k'}\|_2 \geq T\} \qquad (14)$$

Fig. 1. Graphical illustration of unreliable feature detection on synthetic data. (a) Original weight matrix. (b) Weight matrix by MTJSR [15]. (c) Matrix R by RJSRFFT. (d) Matrix S by RJSRFFT.

This scheme detects an unreliable visual cue when the norm of the corresponding column of matrix $S$ is larger than a pre-defined threshold $T$; that is, the features which produce a strong response in the corresponding columns of $S$ are determined to be unreliable. Since we have no information about the selected templates and the unreliable features before learning the reconstruction coefficients, the matrix $S$ may also encode some small components of the reconstruction coefficients of reliable features (as can be seen in Fig. 1(d), where matrix $S$ shows some small non-zero components of reliable features). Therefore, if the threshold is too small, reliable features which have small components in $S$ may be mistaken for unreliable features. On the other hand, if it is too large, unreliable features may not be successfully detected. In our experiments, we empirically set it to 0.0007. The likelihood function is then defined by $R$ and $S$ as follows. The representation coefficients of the different visual cues are estimated and the unreliable features are detected by solving the optimization problem (13). Then, the observation likelihood function in (2) is defined as

$$p(z_t \mid l_t) \propto \exp\Big(-\frac{1}{K-K'} \sum_{j \notin O} \|y^j - X^j r^j\|_2^2\Big) \qquad (15)$$
where the right side of this equation denotes the average


reconstruction error of reliable visual cues. Since the proposed
model can detect the unreliable cues, the likelihood function
can combine the reconstruction error of reliable cues to define
the final similarity between the target candidate and the target
templates.
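To make the detection rule (14) and the likelihood (15) concrete, the following Python sketch scores a set of candidate particles and returns the MAP estimate of (1). The data layout (lists of per-cue feature vectors and template matrices) and the solver callback for problem (13) are assumptions for illustration, not the authors' MATLAB implementation.

import numpy as np

def unreliable_cues(S, T=0.0007):
    """Eq. (14): cue k is flagged unreliable if the l2 norm of the
    k-th column of S exceeds the threshold T."""
    return {k for k in range(S.shape[1]) if np.linalg.norm(S[:, k]) >= T}

def log_likelihood(ys, Xs, R, outliers):
    """Eq. (15): average reconstruction error over the reliable cues only.
    ys[k]: observed feature of cue k; Xs[k]: template matrix of cue k;
    R[:, k]: reliable-part coefficients of cue k."""
    reliable = [k for k in range(len(ys)) if k not in outliers]
    err = sum(np.linalg.norm(ys[k] - Xs[k] @ R[:, k]) ** 2 for k in reliable)
    return -err / max(len(reliable), 1)

def select_particle(candidates, Xs, solver):
    """Particle filter step of (1)-(2): score every candidate state and keep
    the MAP one. candidates[i] is the list of per-cue features of particle i;
    solver is assumed to return (W, R, S) for problem (13)."""
    best, best_ll = None, -np.inf
    for i, ys in enumerate(candidates):
        W, R, S = solver(ys, Xs)
        ll = log_likelihood(ys, Xs, R, unreliable_cues(S))
        if ll > best_ll:
            best, best_ll = i, ll
    return best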
D. Optimization Procedure

The objective function in (13) consists of a smooth function and a non-smooth function. This kind of optimization problem
can be solved efficiently by employing the Accelerated Proximal Gradient (APG) method. By using first-order information, the APG method can obtain the globally optimal solution with a convergence rate of $O(\frac{1}{t^2})$, as shown in [68]. It has also been further studied in [69] and [70] and applied to solving multi-task sparse learning/representation problems [67], [71], [72]. We apply the APG method in a manner similar to [71] and derive the following algorithm. Let

$$f(r^k, s^k) = \frac{1}{2}\Big\|y^k - \sum_{n=1}^{N} x_n^k (r_n^k + s_n^k)\Big\|_2^2, \quad F(R, S) = \sum_{k=1}^{K} f(r^k, s^k), \quad G(R, S) = \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \qquad (16)$$

where $F$ is a differentiable convex function with Lipschitz continuous gradient, while $G$ is non-differentiable but convex. Then, by taking the first-order Taylor expansion of $F$ with a quadratic regularization term, we construct the objective function with the quadratic upper approximation of $F$ and the non-differentiable function $G$ at the aggregation matrices $U^t$ and $V^t$ as follows,

$$\Phi(R, S) = \sum_{k=1}^{K} \Big\{ f(u^{k,t}, v^{k,t}) + (\nabla_u^{k,t})^T (r^k - u^{k,t}) + (\nabla_v^{k,t})^T (s^k - v^{k,t}) + \frac{\eta}{2}\|r^k - u^{k,t}\|_2^2 + \frac{\eta}{2}\|s^k - v^{k,t}\|_2^2 \Big\} + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \qquad (17)$$

where $\eta$ is the Lipschitz constant [69], and $\nabla_u^{k,t}$ and $\nabla_v^{k,t}$ are the gradients of $F(U, V)$ with respect to $u^{k,t}$ and $v^{k,t}$, respectively. In the $(t+1)$-th iteration, given the aggregation matrices $U^t$ and $V^t$, the proximal matrices $R^{t+1}$ and $S^{t+1}$ are obtained by minimizing the following optimization problem,

$$\arg\min_{R,S} \Phi(R, S) \qquad (18)$$

With some algebraic manipulations, (18) can be separated into two independent sub-problems in $R$ and $S$, respectively, i.e.,

$$\min_{R} \sum_{k=1}^{K} \frac{1}{2}\Big\|r^k - \Big(u^{k,t} - \frac{1}{\eta}\nabla_u^{k,t}\Big)\Big\|_2^2 + \frac{\lambda_1}{\eta} \sum_{n=1}^{N} \|r_n\|_2, \qquad \min_{S} \sum_{k=1}^{K} \frac{1}{2}\Big\|s^k - \Big(v^{k,t} - \frac{1}{\eta}\nabla_v^{k,t}\Big)\Big\|_2^2 + \frac{\lambda_2}{\eta} \sum_{k=1}^{K} \|s^k\|_2 \qquad (19)$$

where the gradients are given by $\nabla_u^{k,t} = -(X^k)^T y^k + (X^k)^T X^k u^{k,t} + (X^k)^T X^k v^{k,t}$ and $\nabla_v^{k,t} = -(X^k)^T y^k + (X^k)^T X^k v^{k,t} + (X^k)^T X^k u^{k,t}$. As in [71], we solve the above subproblems using the following two steps iteratively:

1) Gradient Mapping Step: the proximal matrices $R^{t+1}$ and $S^{t+1}$ are updated by (20) and (21), respectively,

$$r^{k,t+\frac{1}{2}} = u^{k,t} - \frac{1}{\eta}\nabla_u^{k,t}, \quad k = 1, \ldots, K, \qquad r_n^{t+1} = \max\Big(0,\, 1 - \frac{\lambda_1/\eta}{\|r_n^{t+\frac{1}{2}}\|_2}\Big)\, r_n^{t+\frac{1}{2}}, \quad n = 1, \ldots, N \qquad (20)$$

$$s^{k,t+\frac{1}{2}} = v^{k,t} - \frac{1}{\eta}\nabla_v^{k,t}, \quad k = 1, \ldots, K, \qquad s^{k,t+1} = \max\Big(0,\, 1 - \frac{\lambda_2/\eta}{\|s^{k,t+\frac{1}{2}}\|_2}\Big)\, s^{k,t+\frac{1}{2}}, \quad k = 1, \ldots, K \qquad (21)$$

It should be noted that the update scheme (20) for $R$ and the scheme (21) for $S$ are different from each other, since $R$ and $S$ have different sparsity properties, grouping according to rows and columns, respectively.

2) Aggregation Step: the aggregation matrices are updated as follows,

$$U^{t+1} = R^{t+1} + \frac{a_t - 1}{a_{t+1}}(R^{t+1} - R^t), \qquad V^{t+1} = S^{t+1} + \frac{a_t - 1}{a_{t+1}}(S^{t+1} - S^t) \qquad (22)$$

where $a_{t+1} = \frac{1 + \sqrt{1 + 4a_t^2}}{2}$ and $a_0 = 1$. The overall optimization procedure is summarized in Algorithm 1.

Algorithm 1 Optimization Procedure for Problem (13)
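The following Python sketch outlines one possible implementation of the iteration (20)-(22) for problem (13). The variable names, the fixed iteration count, and the use of the eigenvalue bound of Proposition 1 (Section IV-B) for $\eta$ are illustrative choices rather than the authors' MATLAB implementation.

import numpy as np

def solve_rjsrfft(ys, Xs, lam1, lam2, n_iter=100):
    """Compact APG sketch for problem (13). ys[k] is the k-th cue of the
    candidate, Xs[k] the corresponding template matrix with N columns.
    Returns the reliable/unreliable parts R and S."""
    K, N = len(Xs), Xs[0].shape[1]
    R, S = np.zeros((N, K)), np.zeros((N, K))
    U, V = R.copy(), S.copy()
    # Lipschitz constant via the tight bound of Proposition 1:
    # eta >= max_k 2 * lambda_max(K(X^k, X^k)); linear kernel here.
    eta = max(2.0 * np.linalg.eigvalsh(X.T @ X).max() for X in Xs)
    a = 1.0
    for _ in range(n_iter):
        # Gradient of the smooth term F at the aggregation matrices (U, V);
        # it is the same for the u- and v-blocks.
        G = np.column_stack([Xs[k].T @ (Xs[k] @ (U[:, k] + V[:, k]) - ys[k])
                             for k in range(K)])
        # Gradient mapping step, eq. (20): row-wise group soft-thresholding of R.
        Rh = U - G / eta
        row_norm = np.linalg.norm(Rh, axis=1, keepdims=True)
        R_new = np.maximum(0.0, 1.0 - (lam1 / eta) / np.maximum(row_norm, 1e-12)) * Rh
        # Eq. (21): column-wise group soft-thresholding of S.
        Sh = V - G / eta
        col_norm = np.linalg.norm(Sh, axis=0, keepdims=True)
        S_new = np.maximum(0.0, 1.0 - (lam2 / eta) / np.maximum(col_norm, 1e-12)) * Sh
        # Aggregation step, eq. (22).
        a_new = (1.0 + np.sqrt(1.0 + 4.0 * a * a)) / 2.0
        U = R_new + (a - 1.0) / a_new * (R_new - R)
        V = S_new + (a - 1.0) / a_new * (S_new - S)
        R, S, a = R_new, S_new, a_new
    return R, S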

E. Kernelized Framework for Robust Feature-Level Fusion

In order to exploit the non-linearity of features, which has been shown to enhance performance in many computer vision tasks [6], [73], [74], we extend our algorithm into a general framework that can combine multiple visual features from different kernel spaces while detecting unreliable features for tracking. Let $\phi$ denote the mapping function to a kernel space. According to (3), the sparse representation with the template
set for the tracking result in linear space can be extended to a kernel space as follows,

$$\phi(y^k) = \phi(X^k) w^k + \epsilon^k, \quad k = 1, \ldots, K \qquad (23)$$

where $\phi(X^k) = [\phi(x_1^k), \ldots, \phi(x_N^k)]$ denotes the template set in kernel space via the mapping function $\phi$. Based on the derivation leading to (13), the robust joint kernel sparse representation based feature-level fusion (K-RJSRFFT) model for visual tracking is developed as,

$$\min_{W,R,S} \sum_{k=1}^{K} \frac{1}{2}\|\phi(y^k) - \phi(X^k) w^k\|_2^2 + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \quad \text{s.t. } W = R + S \qquad (24)$$

Expanding the reconstruction term in optimization problem (24), we can reformulate problem (24) as,

$$\min_{W,R,S} \sum_{k=1}^{K} \frac{1}{2}\big\{\mathcal{K}(y^k, y^k) - 2 (w^k)^T \mathcal{K}(X^k, y^k) + (w^k)^T \mathcal{K}(X^k, X^k) w^k\big\} + \lambda_1 \sum_{n=1}^{N} \|r_n\|_2 + \lambda_2 \sum_{k=1}^{K} \|s^k\|_2 \quad \text{s.t. } W = R + S \qquad (25)$$

Here the element in the $i$-th row and $j$-th column of the kernel matrix $\mathcal{K}(X, Y)$ is defined as,

$$\mathcal{K}_{i,j}(X, Y) = \phi(x_i)^T \phi(y_j) \qquad (26)$$

where $x_i$ and $y_j$ are the $i$-th and $j$-th columns of $X$ and $Y$, respectively. Moreover, $\mathcal{K}(y^k, y^k)$ in problem (25) is equal to 1 because of the normalization to unit length. Based on Algorithm 1, the optimization procedure for problem (25) can be derived and is summarized in Algorithm 2.

Algorithm 2 Optimization Procedure for Problem (25)

After solving (25), the representation coefficients of the different visual cues are estimated and the unreliable features in kernel space are detected. Then, the observation likelihood function can be re-defined as,

$$p(z_t \mid l_t) \propto \exp\Big(-\frac{1}{K - K'} \sum_{j \notin O} \big[\mathcal{K}(y^j, y^j) - 2 (w^j)^T \mathcal{K}(X^j, y^j) + (w^j)^T \mathcal{K}(X^j, X^j) w^j\big]\Big) \qquad (27)$$

where $O$ denotes the index set of unreliable features in kernel space as in (14), and $K$ and $K'$ denote the total number of features and the number of unreliable features in kernel space, respectively.

F. Computational Complexity

The major computation time of the proposed tracker is due to the following processes: feature extraction, kernel matrix computation, Lipschitz constant estimation, and the optimization procedure. The linear/non-linear kernel matrix for the feature dictionary is computed once the template update scheme is performed. Therefore, assuming that the flop count of the kernel mapping of two feature vectors is $p$, the computational complexity of the kernel matrix computation for the corresponding feature dictionary is $O(N^2 p)$, where $N$ is the number of templates in the template set. By using the tight Lipschitz constant estimation discussed in Section IV, the computational complexity of the Lipschitz constant estimation is reduced from $O(K^3 N^3)$ to $O(N^3)$, where $K$ is the number of visual cues. The optimization procedure illustrated in Algorithms 1 and 2 is dominated by the gradient computation in Steps 3 and 4, whose complexity is $O(K N^2 n)$, where $n$ is the number of particles. The proposed tracker is implemented in MATLAB without code optimization. As measured on an i7 quad-core machine with a 3.4 GHz CPU, the running times of K-RJSRFFT and RJSRFFT are 2 sec/frame and 1.5 sec/frame, respectively. We will explore ways to increase the computational efficiency in our future work so as to achieve real-time performance.

IV. IMPLEMENTATION DETAILS

A. Template Update Scheme

The proposed tracker is sparsity-based. We adopt the template update scheme in [17] with a minor modification to fit our proposed fusion-based tracker with an outlier detection scheme. Similar to [17], we associate each template in the different visual cues with a weight, and the weight is updated in each frame. Once the similarity between the template with the largest weight from a reliable visual cue and the target sample of the corresponding visual cue is less than a predefined threshold, the proposed tracker replaces the template which has the least weight with the target sample. The difference between [17] and the proposed method is that the update scheme in this paper is performed simultaneously for the template sets of the different visual cues. Once one of the templates in a visual cue is replaced, the corresponding template in the other visual cues is also replaced, because the proposed model performs multi-cue fusion at the feature level. As such, all the cues of the same template should be updated
simultaneously. The template update scheme is summarized in Algorithm 3.

Algorithm 3 Template Update Scheme

It should be noted that the similarity threshold determines the template update frequency: a higher similarity threshold results in a higher update frequency, while a lower similarity threshold leads to a lower update frequency. If the template set is updated too frequently, small errors accumulate and the tracker gradually drifts from the target. On the other hand, if the template set remains unchanged for a long time, it may not adapt well to appearance and background changes. In our experiments, we set the similarity threshold to $\cos(35^\circ) \approx 0.82$.
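A minimal Python sketch of this update logic is given below; it assumes templates are stored column-wise per cue with one shared weight per template, and the re-initialization of the weight of a replaced template is an assumption, since the exact weighting rule of [17] is not restated here.

import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def update_templates(Xs, weights, target, reliable_cue,
                     thr=np.cos(np.deg2rad(35))):
    """Xs[k]: template matrix of cue k (templates in columns); weights: one
    weight per template, shared across cues; target[k]: current tracking
    result in cue k; reliable_cue: index of a reliable cue used for the test."""
    best = int(np.argmax(weights))
    sim = cosine(Xs[reliable_cue][:, best], target[reliable_cue])
    if sim < thr:                       # appearance has changed enough
        worst = int(np.argmin(weights))
        for k in range(len(Xs)):        # replace the same template in every cue
            Xs[k][:, worst] = target[k]
        weights[worst] = np.median(weights)   # re-initialized weight (assumed)
    return Xs, weights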
B. Tight Lipschitz Constant Estimation

The Lipschitz constant $\eta$ is important to the above optimization algorithm: an improper value of $\eta$ results in either divergence or slow convergence. In this subsection, we derive a method for tight Lipschitz constant estimation, so that the Lipschitz constant is set automatically in the initial frame and updated whenever the template update scheme is performed. Proposition 1 gives the method for tight Lipschitz constant estimation.

Proposition 1: Let $F$ denote the function defined in (16) with $\{X^k \mid k = 1, \ldots, K\}$, where $X^k$ denotes the template set of the $k$-th visual cue. The Lipschitz constant $\eta$ for $F$ can be estimated via its tight lower bound as follows,

$$\eta \geq \max\{2\lambda_{\max}^k \mid k = 1, \ldots, K\} \qquad (28)$$

where $\lambda_{\max}^k$ is the largest eigenvalue of $\mathcal{K}(X^k, X^k)$.

The proof of Proposition 1 can be found in the Appendix. With the method in (28), the tight lower bound of the Lipschitz constant can be estimated from the maximum eigenvalues of the sub-blocks of the Hessian matrix only, which avoids the direct computation of the eigenvalues of the Hessian matrix in high dimension. In our experiments, we set the value of the Lipschitz constant to this lower bound.

V. EXPERIMENTS

In this section, we evaluate the proposed robust joint sparse representation-based feature-level fusion tracker (RJSRFFT) and its kernelized version (K-RJSRFFT) using both synthetic data and real videos from publicly available datasets.

A. Unreliable Feature Detection on Synthetic Data

To demonstrate that the proposed method can detect unreliable features, we compare RJSRFFT with the multi-task joint sparse representation (MTJSR) method [15], i.e., with the weight matrices obtained by solving (4) with the regularization term (6). In this experiment, we simulate the multi-cue tracking problem by randomly generating five kinds of ten-dimensional normalized features with 30 templates, i.e., $X^k \in \mathbb{R}^{10 \times 30}$, $k = 1, \ldots, 5$, are the template sets. Two kinds of features are set as unreliable with non-sparse patterns. For the other three kinds of reliable features, we divide the template sets into three groups and randomly generate the template weight vectors $w^k \in \mathbb{R}^{30}$ such that only the elements of $w^k$ corresponding to one group of templates are non-zero. The testing sample of the $k$-th feature, $y^k$, representing the current tracking result is computed as $X^k w^k$ plus a Gaussian noise vector with zero mean and variance 0.2 representing the reconstruction error $\epsilon^k$. For a fair comparison with MTJSR [15], we extend our model to impose the group lasso penalty by simply using a group sparsity term in optimization problem (13). We empirically set the parameters $\lambda$, $\lambda_1$ and $\lambda_2$ to 0.001 and the step size to 0.002, and repeat the experiment 100 times.
We use the average normalized mean square error between the original weight matrix and the recovered one for evaluation. Our method RJSRFFT achieves a much lower average recovery error of 4.69%, compared with 12.29% for MTJSR. This indicates that our method can better recover the underlying weight matrix by successfully detecting the unreliable features. To further demonstrate the ability of unreliable feature detection, we give a graphical illustration of one of the 100 experiments in Fig. 1. The original weight matrix is shown in Fig. 1(a), with each row representing a weight vector $w^k$; the horizontal axis records the sample indices, while the vertical axis gives the values of the weights. From Fig. 1(a), we can see that the first three features share the same sparsity pattern over the samples with indices in the middle range, while all the weights of the last two features are non-zero, and thus non-sparse. In this case, MTJSR cannot discover the sparsity patterns, as shown in Fig. 1(b), while the proposed RJSRFFT can discover the shared sparsity of the reliable features and detect the unreliable features, as shown in Fig. 1(c) and (d). This also explains why our method can better recover the underlying matrix shown in Fig. 1(a).
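The synthetic protocol described above can be sketched as follows; the random draws, the choice of the shared group, and the helper name are illustrative and do not reproduce the exact data used in the paper.

import numpy as np

def make_synthetic(n_cues=5, dim=10, n_templates=30, n_groups=3,
                   n_unreliable=2, noise_var=0.2,
                   rng=np.random.default_rng(0)):
    """Synthetic multi-cue data as in Section V-A: reliable cues share one
    active template group; unreliable cues get dense (non-sparse) weights."""
    Xs, ys, W = [], [], np.zeros((n_templates, n_cues))
    group = int(rng.integers(n_groups))            # shared group of templates
    size = n_templates // n_groups
    active = slice(group * size, (group + 1) * size)
    for k in range(n_cues):
        X = rng.standard_normal((dim, n_templates))
        X /= np.linalg.norm(X, axis=0)             # normalized templates
        w = np.zeros(n_templates)
        if k < n_cues - n_unreliable:
            w[active] = rng.standard_normal(size)  # reliable: group-sparse
        else:
            w = rng.standard_normal(n_templates)   # unreliable: dense
        y = X @ w + np.sqrt(noise_var) * rng.standard_normal(dim)
        Xs.append(X); ys.append(y); W[:, k] = w
    return Xs, ys, W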
B. Visual Tracking Experiments
To evaluate the performance of the proposed tracking
algorithms, i.e., RJSRFFT and K-RJSRFFT, we conduct

experiments on thirty-three publicly available video sequences.2

TABLE I
THE CHALLENGING FACTORS AND THEIR CORRESPONDING VIDEOS


The challenging factors in these video sequences include occlusion, cluttered background, change of scale, illumination and pose, and they are listed in Table I. We compare the proposed tracking algorithms, RJSRFFT and K-RJSRFFT, with state-of-the-art tracking methods, including multi-cue trackers: online boosting (OAB) [7], online multiple instance learning (MIL) [8] and semi-supervised online boosting (SemiB) [12]; sparse representation-based trackers: the $\ell_1$ tracker (L1APG) [55], the sparse collaborative model (SCM) [56] and the multi-task tracker (MTT) [57]; and other state-of-the-art methods: the structured output tracker (Struck) [49], the circulant structure tracker (CSK) [50], the fragment tracker (Frag) [41], the incremental learning tracker (IVT) [31], the distribution field tracker (DFT) [77] and the compressive tracker (CT) [4]. The source codes of these trackers are provided by the authors of the corresponding papers. For a fair comparison, all the trackers are initialized with the same parameters. To illustrate the effectiveness of the proposed unreliable feature detection scheme, which encodes the non-zero weights of the unreliable features in matrix $S$ and detects the unreliable features based on the column norms, we also implement the Joint Sparse Representation-based Feature-level Fusion Tracker (JSRFFT), i.e., the model discussed in Section III-B without the unreliable feature detection scheme. Both qualitative and quantitative evaluations are presented in Section V-B.
1) Experiment Settings: For the proposed K-RJSRFFT, we extract six kinds of local and global features with a total of three kinds of kernels for fusion from a normalized 32 × 32 image patch representing the target observation. For the local visual cues, we divide the tracking bounding box into 2 × 2 blocks and extract a covariance descriptor in each block with the log-Euclidean kernel function as in [78]. For the global visual cues, we use HOG [79] with linear and polynomial kernels, and GLF [5] with a linear kernel. For RJSRFFT and JSRFFT, we extract the same kinds of features but only with linear kernels.
We empirically set the parameters as follows: $\lambda_1$ and $\lambda_2$ are set to 0.57 and 0.87, respectively, and the threshold $T$ in (14) is 0.0007. The number of templates is 12, and 200 samples are drawn for particle filtering. We adopt the cosine function as the similarity function and empirically set the similarity threshold to $\cos(35^\circ) \approx 0.82$.

2 The sequences can be downloaded from http://www.cs.technion.ac.il/amita/fragtrack/fragtrack.htm [41], http://www.cs.toronto.edu/dross/ivt/ [31], http://personal.ee.surrey.ac.uk/Personal/Z.Kalal/ [75], [21], http://www.cs.toronto.edu/vis/projects/dudekfaceSequence.html [25], http://ice.dlut.edu.cn/lu/Project/TIP12SP/TIP12-SP.htm [59], http://cv.snu.ac.kr/research/vtd/ [33], http://lrs.icg.tugraz.at/research/houghtrack/ [76], and http://www.umiacs.umd.edu/fyang/spt.html [51].
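For the kernelized tracker, each cue contributes a Gram matrix as in (26). The following sketch shows one way to build normalized linear and polynomial Gram matrices so that the self-similarity equals 1, as assumed in problem (25); the polynomial degree and offset are illustrative, and the log-Euclidean kernel for covariance descriptors is omitted here.

import numpy as np

def linear_kernel(X, Y):
    return X.T @ Y

def poly_kernel(X, Y, degree=2, c=1.0):
    # polynomial kernel; degree and offset are illustrative choices
    return (X.T @ Y + c) ** degree

def normalized_gram(X, Y, kernel):
    """Eq. (26), with each column implicitly rescaled so that K(y, y) = 1,
    matching the unit-length normalization assumed in problem (25)."""
    Kxy = kernel(X, Y)
    dx = np.sqrt(np.diag(kernel(X, X)))
    dy = np.sqrt(np.diag(kernel(Y, Y)))
    return Kxy / np.outer(dx, dy)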
2) Visualization Experiment: To illustrate how unreliable features are detected by the proposed method, for the tracking result in the bounding box shown in the upper-left image of Fig. 2, we use the features of the tracking result and the template set to obtain the reconstruction coefficients by solving the kernel sparse representation problem for each feature (shown in the small green box), and we plot the values of the coefficients of each template for all features. A higher coefficient value means that the corresponding template is more representative for reconstructing the target. As we can see, the reconstruction coefficients of the GLF feature do not exhibit a sparsity pattern as the other features do, which means the GLF feature extracted from the tracking result cannot be well represented by the corresponding template set. We can also note that, for most features, the sixth template is selected as the most representative template with the highest coefficient value, which implies the underlying relationship between the different features. Through the proposed model, the sixth template is jointly selected as the representative template for target representation, as shown in the bottom-left figure, and the GLF feature is also detected as an unreliable feature, as shown in the bottom-right figure.
3) Quantitative Comparison: The performance of the compared trackers is evaluated quantitatively from two aspects: video-by-video comparison and video set-based comparison. Following [43], [56], [59]-[61], [80], three evaluation criteria are used: center location error, overlap ratio and success rate. The center location error is defined as the Euclidean distance between the center of a bounding box and that of the labeled ground truth. The overlap ratio is defined as $\mathrm{area}(B_T \cap B_G)/\mathrm{area}(B_T \cup B_G)$, where $B_T$ and $B_G$ are the bounding boxes of the tracker and the ground truth, respectively. A frame is considered successfully tracked if the overlap ratio is larger than 0.5.
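These three criteria can be computed as in the sketch below, assuming axis-aligned boxes given as (x, y, w, h); it illustrates the definitions above rather than the evaluation code used for the tables.

import numpy as np

def center_error(bt, bg):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ct = np.array([bt[0] + bt[2] / 2, bt[1] + bt[3] / 2])
    cg = np.array([bg[0] + bg[2] / 2, bg[1] + bg[3] / 2])
    return float(np.linalg.norm(ct - cg))

def overlap_ratio(bt, bg):
    """area(Bt intersect Bg) / area(Bt union Bg) for (x, y, w, h) boxes."""
    x1, y1 = max(bt[0], bg[0]), max(bt[1], bg[1])
    x2 = min(bt[0] + bt[2], bg[0] + bg[2])
    y2 = min(bt[1] + bt[3], bg[1] + bg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bt[2] * bt[3] + bg[2] * bg[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(tracked, truth, thr=0.5):
    """Fraction of frames whose overlap ratio exceeds thr."""
    ratios = [overlap_ratio(bt, bg) for bt, bg in zip(tracked, truth)]
    return float(np.mean([r > thr for r in ratios]))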
Tables II, III and IV show the quantitative comparison of the evaluated trackers in terms of center location error, overlap ratio and success rate, respectively, and the last row of Table IV shows the average frame-per-second running time of the compared trackers. The proposed trackers RJSRFFT and K-RJSRFFT rank in the top three among all trackers.

Fig. 2. Graphical illustration of the unreliable feature detection scheme. The upper-left image in the purple box shows the tracking result in the 658-th frame of the video Occluded Face 1. The seven figures in the upper-right green box illustrate the resulting reconstruction coefficients obtained by solving the $\ell_1$ minimization problem shown in the small green box. After solving (24), the resulting reconstruction coefficients of each feature encoded in matrix R are illustrated in the bottom-left blue box, and the unreliable feature detection results encoded in matrix S are illustrated in the bottom-right blue box. The y-coordinate of all the above figures denotes the value of the coefficient, and the x-coordinate denotes the index of the template in the template set.

TABLE II
CENTER LOCATION ERROR (IN PIXELS). THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN

The proposed
trackers achieve much better results than the sparse representation-based trackers using a single feature, i.e., L1APG, SCM and MTT. This is because our tracker is based on multi-feature sparse representations, which can dynamically utilize different features to handle different kinds of variations while preserving the advantages of sparse representation. The proposed multi-cue trackers also perform better than the existing multi-cue trackers, i.e., OAB, MIL and SemiB. This is because more informative cues are dynamically exploited when fusion is performed at the feature level rather than at the score level.

TABLE III
VOC OVERLAPPING RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN

TABLE IV
SUCCESS RATE. THE BEST THREE RESULTS ARE SHOWN IN RED, BLUE AND GREEN
However, the feature-level fusion-based tracker JSRFFT does not perform as well as RJSRFFT and K-RJSRFFT, which suggests that the unreliable feature detection scheme is essential for feature-level fusion.

Fig. 3. The survival curves based on the F-score for the entire set of thirty-three videos and for the subsets of videos containing different challenging factors; the average F-scores are shown in the corresponding legends in descending order. (a) Whole set of videos. (b) Occlusion subset. (c) Cluttered background subset. (d) Scale subset. (e) Illumination subset. (f) Pose subset.

Comparing the performance of RJSRFFT and JSRFFT suggests that unreliable feature detection can


improve the performance of joint sparse representation-based trackers. K-RJSRFFT outperforms RJSRFFT, which indicates that exploiting the non-linearity of features can improve the performance of a multi-cue tracker.
Recently, Smeulders et al. [23] presented a comprehensive survey and systematic evaluation of state-of-the-art visual tracking algorithms based on the Amsterdam Library of Ordinary Videos data set, named ALOV++. The evaluation protocol in [23] shows that the F-score is an effective metric for evaluating a tracker's accuracy, and the survival curve illustrated in [23] provides a cumulative rendition of the quality of a tracker on a set of videos, which avoids the risk of being trapped in the peculiarity of a single video instance. Therefore, to conduct the video set-based comparison, we compute the F-score from the tracking results of each tracker on each video, and plot the corresponding survival curve on the entire set of thirty-three videos and on the subsets of videos containing the different challenging factors listed in Table I. The F-score for each video is defined as $2 \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\mathrm{precision} + \mathrm{recall})$, where $\mathrm{precision} = n_{tp}/(n_{tp} + n_{fp})$, $\mathrm{recall} = n_{tp}/(n_{tp} + n_{fn})$, and $n_{tp}$, $n_{fp}$ and $n_{fn}$ denote the numbers of true positives, false positives and false negatives in a video. After computing the F-score of each video in the video set, the videos are sorted in descending order of F-score and the survival curve is plotted with respect to the sorted videos. Fig. 3 illustrates the survival curves based on the F-score on the whole set of thirty-three videos and on the subsets of videos containing different challenging factors. We can see that the K-RJSRFFT method achieves the best performance on the whole video set and on the subsets of videos containing different factors in terms of average F-score. By performing feature-level fusion in multiple kernel spaces, the K-RJSRFFT method performs better than the JSRFFT method. With the unreliable feature detection scheme, both the K-RJSRFFT and RJSRFFT methods perform better than the JSRFFT method, which shows the effectiveness of the unreliable feature detection scheme. By utilizing the complementarity of multiple reliable features, both the K-RJSRFFT and RJSRFFT methods show superior performance under different challenging factors, especially occlusion, illumination and cluttered background.
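The video set-based protocol can be sketched as follows; the per-video F-score is computed from per-frame overlaps under the single-target convention that a failed frame counts as both a false positive and a false negative, which is an assumption made here for illustration rather than a detail taken from [23].

import numpy as np

def f_score(overlaps, thr=0.5):
    """Per-video F-score from per-frame overlap ratios: a frame counts as a
    true positive when its overlap exceeds thr."""
    tp = sum(o > thr for o in overlaps)
    fp = fn = len(overlaps) - tp        # assumed: one box per frame, so fp = fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def survival_curve(video_scores):
    """Sort per-video F-scores in descending order; plotting the sorted values
    against video rank gives the survival curve of a tracker."""
    return np.sort(np.asarray(video_scores))[::-1]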
4) Qualitative Comparison: We qualitatively evaluate the
trackers in five aspects based on some typical video sequences
shown in Fig. 4 as follows:
a) Occlusion: In the video Occluded Face 1, a woman's face undergoes partial and severe occlusion by a book in some frames. Except for the CT, DFT, JSRFFT and MIL methods, which drift slightly when the woman's face is severely occluded around Frames 730 and 873, all trackers can successfully track the woman's face through the whole video. In the video David Outdoor, a man is walking with body deformation and occlusion. The contrast between the target and the background is low, which makes the SemiB, MIL, CT, L1APG, Struck and JSRFFT methods lose the target around Frames 49, 108, 120, 195 and 238, respectively. The full-body occlusion also causes tracking drift for the MTT, OAB, SCM, CSK, JSRFFT, DFT and IVT methods around Frames 27, 83 and 188, respectively. Because the K-RJSRFFT method combines local and global features from non-linear kernel spaces to account for the variations in foreground and background, it performs well on the whole sequence. In the sequence Walking2, a woman is walking with a change of scale. Most trackers perform well, except that the Frag, OAB, SemiB, CT, MIL and JSRFFT methods drift away from the woman due to the occlusion caused by a man with similar appearance. As can be seen from the above comparisons, the proposed trackers RJSRFFT and K-RJSRFFT handle occlusion well. With a merit similar to that of [3], the fusion of local features makes them less sensitive to partial occlusion.

Fig. 4. Qualitative results on some typical frames including some challenging factors. (a) Occlusion. (b) Background. (c) Scale. (d) Illumination. (e) Pose.

Besides, different from the JSRFFT method, which directly performs fusion of all features, the unreliable feature detection scheme in the RJSRFFT and K-RJSRFFT methods can improve the tracking performance of joint sparse representation-based trackers, since the features extracted from an occluded object may not be reliable.

b) Cluttered background: In the video Crossing, the contrast between the walking person and the background is low, which leads to the drift of some trackers, i.e., the MTT, L1APG, DFT and Frag methods around Frames 55 and 83. In the video Mountain-bike, a man riding a mountain bike undergoes pose variation against a cluttered background. The cluttered background makes the CT, Frag, DFT and JSRFFT methods drift away from the tracked object around Frames 50 and 95, respectively. In the video Jumping, a man is jumping rope with motion blur on his face. The cluttered background makes trackers such as the IVT, SCM, L1APG, MTT, OAB, CT, DFT, CSK and SemiB methods drift from the blurred face, as shown in Frame 117. The proposed tracker K-RJSRFFT performs well under a cluttered background and shows improved results compared with the RJSRFFT method. This is because, different from the RJSRFFT method, the K-RJSRFFT method utilizes the non-linear similarity of features and performs feature fusion in a more discriminative kernel space, which enhances the discriminability of the appearance model and makes it less sensitive to cluttered backgrounds.

c) Scale: In the video Car4, there is a drastic change of scale and illumination when the car passes underneath the overpass. Only the CSK, SCM, IVT, RJSRFFT, K-RJSRFFT, Struck and JSRFFT methods perform well on the whole sequence, while the other trackers drift, slightly or severely, away from the car. In the David Indoor sequence, a man walks out of a dark room with gradual changes in scale and pose. The IVT, SCM, MIL, RJSRFFT, K-RJSRFFT and JSRFFT methods perform well on this sequence; the other trackers drift when the man undergoes pose and scale variations (e.g., Frame 166). In the Freeman3 sequence, a man walks in a classroom from the back to the front with a gradual change in scale. Due to the similar appearance of a human face in the background and the scale change of the target, the OAB, Struck, Frag, CT, MIL, CSK, IVT, L1APG and DFT methods lose track of the target. The K-RJSRFFT, RJSRFFT and JSRFFT methods perform well on this sequence.
d) Illumination: In the challenging video Trellis, the object appearance changes significantly because of illumination and pose. The OAB, SemiB, IVT, MTT, L1APG, CT and MIL methods lose the target due to the cast shadow on the face. When the object changes its pose, the SCM method drifts away from the target. Only the RJSRFFT, K-RJSRFFT and JSRFFT methods perform well on the whole sequence, as a result of the fusion of local illumination-insensitive features, which enables them to handle such variations.

LAN et al.: JOINT SPARSE REPRESENTATION AND ROBUST FEATURE-LEVEL FUSION

Video Skating1 is also challenging in which a woman is


skating under variations of pose, illumination and background.
Some partial occlusion also occurs as shown in Frame 164.
Except the proposed trackers K-RJSRFFT and RJSRFFT,
other trackers do not perform well in the whole sequence.
And the dramatic change of lighting in the background
let the RJSRFFT method drift away from target while the
K-RJSRFFT method still tracks the target stably which shows
that fusion features in kernel space can enhance the discrminability of feature fusion based trackers. In the sequence
Car11, the significant change of illumination and low resolution of object makes tracking difficult. Except the DFT, CT,
Frag, MIL methods, all others trackers are able to track the
target with high success rate.
e) Pose: The Shaking sequence is challenging because of significant pose variations. The proposed method K-RJSRFFT can successfully track the target. The RJSRFFT, OAB, MIL, and CSK methods have a small drift away from the target due to the pose variation. Other trackers, such as the JSRFFT, CT, IVT, SemiB, MTT, and L1APG methods, fail to track the target due to the combined variation of pose, occlusion, and illumination around Frame 59. For the Basketball sequence, the target is a basketball player running on the court who undergoes significant pose variation as well as some partial occlusion. The Struck, OAB, and JSRFFT methods drift from the target around Frame 64. When another player with a similar appearance comes close to the tracked player around Frame 302, the MIL, CT, and L1APG methods lose the target. Only the K-RJSRFFT, DFT, CSK, and Frag methods perform well in this sequence. In the sequence Sylvester, the object undergoes significant pose variation around Frame 1091, and the DFT, IVT, SCM, L1APG, Frag, and MIL methods drift away from the target. Our proposed trackers K-RJSRFFT and RJSRFFT are able to track the target with a high success rate.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have formulated a feature-level fusion visual tracker based on joint sparse representation. As shown in the experimental results, the proposed method outperforms other feature fusion-based trackers and sparse representation-based trackers under appearance variations such as occlusion, scale, illumination, and pose, which demonstrates that the proposed method for robust feature-level fusion is effective. On the other hand, since the proposed method is a sparsity-based tracker, it is difficult to achieve real-time tracking. Therefore, one of our future works will be developing more efficient methods to reduce the computational complexity for practical applications. Besides, how to extend the optimization model to other computer vision problems will be explored in the future. In addition, we will explore how to fuse more information, such as prior information and contextual information, in our multi-cue tracking framework.
APPENDIX
PROOF OF PROPOSITION 1
Proof: The Hessian matrix $\mathbf{H} \in \mathbb{R}^{(2NK) \times (2NK)}$ of the function $F$ in (16) is a block-diagonal matrix with $\mathbf{B}_k$ as the $k$-th block, where
$$
\mathbf{B}_k =
\begin{bmatrix}
\dfrac{\partial^2 F}{\partial \mathbf{r}_k^2} & \dfrac{\partial^2 F}{\partial \mathbf{r}_k \partial \mathbf{s}_k} \\[4pt]
\dfrac{\partial^2 F}{\partial \mathbf{s}_k \partial \mathbf{r}_k} & \dfrac{\partial^2 F}{\partial \mathbf{s}_k^2}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{K}(X_k, X_k) & \mathbf{K}(X_k, X_k) \\
\mathbf{K}(X_k, X_k) & \mathbf{K}(X_k, X_k)
\end{bmatrix}
\in \mathbb{R}^{(2N) \times (2N)}.
\tag{29}
$$
Assume that $\Lambda_k = \operatorname{diag}(\lambda_{k,1}, \ldots, \lambda_{k,2N})$ is a diagonal matrix whose elements are the eigenvalues of $\mathbf{B}_k$, and that $\mathbf{Z}_k$ is a matrix with the eigenvectors of $\mathbf{B}_k$ as column vectors, that is, $\mathbf{B}_k \mathbf{Z}_k = \mathbf{Z}_k \Lambda_k$. Let $\mathbf{Z} = \operatorname{diag}(\mathbf{Z}_1, \ldots, \mathbf{Z}_K)$ and $\Lambda = \operatorname{diag}(\Lambda_1, \ldots, \Lambda_K)$. Then $\mathbf{H}\mathbf{Z} = \operatorname{diag}(\mathbf{B}_1 \mathbf{Z}_1, \ldots, \mathbf{B}_K \mathbf{Z}_K) = \operatorname{diag}(\mathbf{Z}_1 \Lambda_1, \ldots, \mathbf{Z}_K \Lambda_K) = \operatorname{diag}(\mathbf{Z}_1, \ldots, \mathbf{Z}_K)\operatorname{diag}(\Lambda_1, \ldots, \Lambda_K) = \mathbf{Z}\Lambda$. That is to say, the eigenvalues of $\mathbf{H}$ are the same as those of its diagonal blocks $\mathbf{B}_1, \ldots, \mathbf{B}_K$. Hence,
$$
\lambda_{\max}(\mathbf{H}) = \max\{\lambda_{\max}(\mathbf{B}_k) \mid k = 1, \ldots, K\} = \max\{2\lambda_{\max}(\mathbf{K}(X_k, X_k)) \mid k = 1, \ldots, K\},
\tag{30}
$$
where $\lambda_{\max}(\mathbf{H})$ and $\lambda_{\max}(\mathbf{B}_k)$ are the maximum eigenvalues of $\mathbf{H}$ and $\mathbf{B}_k$, respectively. The proof is done.
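As a quick numerical sanity check on Proposition 1 (a minimal sketch, not part of the paper's implementation; the kernel matrices, sizes, and random data are made up), the snippet below builds a block-diagonal Hessian from blocks of the form in (29) and verifies that its largest eigenvalue equals the largest value of $2\lambda_{\max}(\mathbf{K}(X_k, X_k))$ over the blocks.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 5, 3  # hypothetical number of templates per feature and number of features

kernels, blocks = [], []
for _ in range(K):
    A = rng.standard_normal((N, N))
    Kk = A @ A.T                                  # symmetric PSD Gram (kernel) matrix
    kernels.append(Kk)
    blocks.append(np.block([[Kk, Kk], [Kk, Kk]]))  # block of the form in (29)

# Assemble the block-diagonal Hessian H of size (2NK) x (2NK).
H = np.zeros((2 * N * K, 2 * N * K))
for k, Bk in enumerate(blocks):
    H[2 * N * k:2 * N * (k + 1), 2 * N * k:2 * N * (k + 1)] = Bk

lam_H = np.linalg.eigvalsh(H).max()
lam_blocks = max(2 * np.linalg.eigvalsh(Kk).max() for Kk in kernels)
print(np.isclose(lam_H, lam_blocks))  # expected: True, matching (30)
```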
ACKNOWLEDGMENT
The authors would like to thank the editor and the reviewers for their helpful comments, which improved the quality of this paper.
REFERENCES
[1] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, and A. van den Hengel, A survey of appearance models in visual object tracking, ACM Trans. Intell. Syst. Technol., vol. 4, no. 4, p. 58, Sep. 2013.
[2] S. He, Q. Yang, R. W. H. Lau, J. Wang, and M.-H. Yang, Visual tracking via locality sensitive histograms, in Proc. CVPR, Jun. 2013, pp. 2427–2434.
[3] W. Hu, X. Li, W. Luo, X. Zhang, S. J. Maybank, and Z. Zhang, Single and multiple object tracking using log-Euclidean Riemannian subspace and block-division appearance model, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 12, pp. 2420–2440, Dec. 2012.
[4] K. Zhang, L. Zhang, and M.-H. Yang, Real-time compressive tracking, in Proc. ECCV, 2012, pp. 864–877.
[5] W. W. Zou, P. C. Yuen, and R. Chellappa, Low-resolution face tracker robust to illumination variations, IEEE Trans. Image Process., vol. 22, no. 5, pp. 1726–1739, May 2013.
[6] F. Yang, H. Lu, and M.-H. Yang, Robust visual tracking via multiple kernel boosting with affinity constraints, IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 2, pp. 242–254, Feb. 2014.
[7] H. Grabner and H. Bischof, On-line boosting and vision, in Proc. CVPR, Jun. 2006, pp. 260–267.
[8] B. Babenko, M.-H. Yang, and S. Belongie, Robust object tracking with online multiple instance learning, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1619–1632, Aug. 2011.
[9] Y. Wu, E. Blasch, G. Chen, L. Bai, and H. Ling, Multiple source data fusion via sparse representation for robust visual tracking, in Proc. Int. Conf. Inf. Fusion, Jul. 2011, pp. 1–8.
[10] A. J. Ma, P. C. Yuen, and J.-H. Lai, Linear dependency modeling for classifier fusion and feature combination, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 5, pp. 1135–1148, May 2013.
[11] A. J. Ma and P. C. Yuen, Reduced analytic dependency modeling: Robust fusion for visual recognition, Int. J. Comput. Vis., vol. 109, no. 3, pp. 233–251, Sep. 2014.
[12] H. Grabner, C. Leistner, and H. Bischof, Semi-supervised on-line boosting for robust tracking, in Proc. ECCV, 2008, pp. 234–247.
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley, 2006.
[14] X. Jia, D. Wang, and H. Lu, Fragment-based tracking using online multiple kernel learning, in Proc. ICIP, Sep./Oct. 2012, pp. 393–396.
[15] X.-T. Yuan and S. Yan, Visual classification with multi-task joint sparse representation, in Proc. CVPR, Jun. 2010, pp. 3493–3500.
[16] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[17] X. Mei and H. Ling, Robust visual tracking and vehicle classification via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.
[18] P. Gehler and S. Nowozin, On feature combination for multiclass object classification, in Proc. ICCV, Sep./Oct. 2009, pp. 221–228.
[19] X. Lan, A. J. Ma, and P. C. Yuen, Multi-cue visual tracking using robust feature-level fusion based on joint sparse representation, in Proc. CVPR, Jun. 2014, pp. 1194–1201.
[20] A. Yilmaz, O. Javed, and M. Shah, Object tracking: A survey, ACM Comput. Surv., vol. 38, no. 4, p. 13, 2006.
[21] Y. Wu, J. Lim, and M.-H. Yang, Online object tracking: A benchmark, in Proc. CVPR, Jun. 2013, pp. 2411–2418.
[22] S. Zhang, H. Yao, X. Sun, and X. Lu, Sparse coding based visual tracking: Review and experimental comparison, Pattern Recognit., vol. 46, no. 7, pp. 1772–1788, Jul. 2013.
[23] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, Visual tracking: An experimental survey, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, Jul. 2014.
[24] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, Recent advances and trends in visual tracking: A review, Neurocomputing, vol. 74, no. 18, pp. 3823–3831, Nov. 2011.
[25] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi, Robust online appearance models for visual tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1296–1311, Oct. 2003.
[26] S. K. Zhou, R. Chellappa, and B. Moghaddam, Visual tracking and recognition using appearance-adaptive models in particle filters, IEEE Trans. Image Process., vol. 13, no. 11, pp. 1491–1506, Nov. 2004.
[27] D. Comaniciu, V. Ramesh, and P. Meer, Kernel-based object tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[28] C. Shen, M. J. Brooks, and A. van den Hengel, Fast global kernel density mode seeking: Applications to localization and tracking, IEEE Trans. Image Process., vol. 16, no. 5, pp. 1457–1469, May 2007.
[29] I. Leichter, Mean shift trackers with cross-bin metrics, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 695–706, Apr. 2012.
[30] M. J. Black and A. D. Jepson, EigenTracking: Robust matching and tracking of articulated objects using a view-based representation, in Proc. ECCV, 1996, pp. 329–342.
[31] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, Incremental learning for robust visual tracking, Int. J. Comput. Vis., vol. 77, no. 1, pp. 125–141, May 2008.
[32] L. Wen, Z. Cai, Z. Lei, D. Yi, and S. Z. Li, Robust online learned spatio-temporal context model for visual tracking, IEEE Trans. Image Process., vol. 23, no. 2, pp. 785–796, Feb. 2014.
[33] J. Kwon and K. M. Lee, Visual tracking decomposition, in Proc. CVPR, Jun. 2010, pp. 1269–1276.
[34] X. Mei, H. Ling, Y. Wu, E. P. Blasch, and L. Bai, Efficient minimum error bounded particle resampling L1 tracker with occlusion detection, IEEE Trans. Image Process., vol. 22, no. 7, pp. 2661–2675, Jul. 2013.
[35] X. Li, C. Shen, Q. Shi, A. Dick, and A. van den Hengel, Non-sparse linear representations for visual tracking with online reservoir metric learning, in Proc. CVPR, Jun. 2012, pp. 1760–1767.
[36] S. Zhang, H. Zhou, F. Jiang, and X. Li, Robust visual tracking using structurally random projection and weighted least squares, to be published, doi: 10.1109/TCSVT.2015.2406194.
[37] S. Zhang, H. Yao, H. Zhou, X. Sun, and S. Liu, Robust visual tracking based on online learning sparse representation, Neurocomputing, vol. 100, pp. 31–40, Jan. 2013.
[38] S. Zhang, H. Yao, X. Sun, and S. Liu, Robust visual tracking using an effective appearance model based on sparse coding, ACM Trans. Intell. Syst. Technol., vol. 3, no. 3, p. 43, May 2012.
[39] Z. Hong, X. Mei, D. Prokhorov, and D. Tao, Tracking via robust multi-task multi-view joint sparse representation, in Proc. ICCV, Dec. 2013, pp. 649–656.
[40] X. Mei, Z. Hong, D. Prokhorov, and D. Tao, Robust multitask multiview tracking in videos, IEEE Trans. Neural Netw. Learn. Syst., to be published, doi: 10.1109/TNNLS.2015.2399233.

[41] A. Adam, E. Rivlin, and I. Shimshoni, Robust fragments-based tracking using the integral histogram, in Proc. CVPR, Jun. 2006, pp. 798–805.
[42] S. Avidan, Support vector tracking, IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 8, pp. 1064–1072, Aug. 2004.
[43] X. Li, A. R. Dick, C. Shen, Z. Zhang, A. van den Hengel, and H. Wang, Visual tracking with spatio-temporal Dempster–Shafer information fusion, IEEE Trans. Image Process., vol. 22, no. 8, pp. 3028–3040, Aug. 2013.
[44] K. Zhang and H. Song, Real-time visual tracking via online weighted multiple instance learning, Pattern Recognit., vol. 46, no. 1, pp. 397–411, Jan. 2013.
[45] N. Jiang, W. Liu, and Y. Wu, Learning adaptive metric for robust visual tracking, IEEE Trans. Image Process., vol. 20, no. 8, pp. 2288–2300, Aug. 2011.
[46] N. Jiang, W. Liu, and Y. Wu, Order determination and sparsity-regularized metric learning adaptive visual tracking, in Proc. CVPR, Jun. 2012, pp. 1956–1963.
[47] N. Jiang and W. Liu, Data-driven spatially-adaptive metric adjustment for visual tracking, IEEE Trans. Image Process., vol. 23, no. 4, pp. 1556–1568, Apr. 2014.
[48] W. Hu, X. Li, X. Zhang, X. Shi, S. J. Maybank, and Z. Zhang, Incremental tensor subspace learning and its applications to foreground segmentation and tracking, Int. J. Comput. Vis., vol. 91, no. 3, pp. 303–327, Feb. 2011.
[49] S. Hare, A. Saffari, and P. H. S. Torr, Struck: Structured output tracking with kernels, in Proc. ICCV, Nov. 2011, pp. 263–270.
[50] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in Proc. ECCV, 2012, pp. 702–715.
[51] F. Yang, H. Lu, and M.-H. Yang, Robust superpixel tracking, IEEE Trans. Image Process., vol. 23, no. 4, pp. 1639–1651, Apr. 2014.
[52] H. Li, C. Shen, and Q. Shi, Real-time visual tracking using compressive sensing, in Proc. CVPR, Jun. 2011, pp. 1305–1312.
[53] B. Liu, L. Yang, J. Huang, P. Meer, L. Gong, and C. Kulikowski, Robust and fast collaborative tracking with two stage sparse optimization, in Proc. ECCV, 2010, pp. 624–637.
[54] B. Liu, J. Huang, C. Kulikowski, and L. Yang, Robust visual tracking using local sparse appearance model and K-selection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 12, pp. 2968–2981, Dec. 2013.
[55] C. Bao, Y. Wu, H. Ling, and H. Ji, Real time robust L1 tracker using accelerated proximal gradient approach, in Proc. CVPR, Jun. 2012, pp. 1830–1837.
[56] W. Zhong, H. Lu, and M.-H. Yang, Robust object tracking via sparse collaborative appearance model, IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356–2368, May 2014.
[57] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, Robust visual tracking via structured multi-task sparse learning, Int. J. Comput. Vis., vol. 101, no. 2, pp. 367–383, Jan. 2013.
[58] X. Jia, H. Lu, and M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in Proc. CVPR, Jun. 2012, pp. 1822–1829.
[59] D. Wang, H. Lu, and M.-H. Yang, Online object tracking with sparse prototypes, IEEE Trans. Image Process., vol. 22, no. 1, pp. 314–325, Jan. 2013.
[60] B. Zhuang, H. Lu, Z. Xiao, and D. Wang, Visual tracking via discriminative sparse similarity map, IEEE Trans. Image Process., vol. 23, no. 4, pp. 1872–1881, Apr. 2014.
[61] Q. Wang, F. Chen, J. Yang, W. Xu, and M.-H. Yang, Transferring visual prior for online object tracking, IEEE Trans. Image Process., vol. 21, no. 7, pp. 3296–3305, Jul. 2012.
[62] L. Wang, H. Yan, K. Lv, and C. Pan, Visual tracking via kernel sparse representation with multikernel fusion, IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 7, pp. 1132–1141, Jul. 2014.
[63] Y. Wu, B. Ma, M. Yang, J. Zhang, and Y. Jia, Metric learning based structural appearance model for robust visual tracking, IEEE Trans. Circuits Syst. Video Technol., vol. 24, no. 5, pp. 865–877, May 2014.
[64] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, Multi-observation visual recognition via joint dynamic sparse representation, in Proc. ICCV, Nov. 2011, pp. 595–602.
[65] S. Shekhar, V. M. Patel, N. M. Nasrabadi, and R. Chellappa, Joint sparse representation for robust multimodal biometrics recognition, IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 113–126, Jan. 2014.

[66] S. Bahrampour, A. Ray, N. M. Nasrabadi, and K. W. Jenkins, Quality-based multimodal classification using tree-structured sparsity, in Proc. CVPR, Jun. 2014, pp. 4114–4121.
[67] P. Gong, J. Ye, and C. Zhang, Robust multi-task feature learning, in Proc. KDD, 2012, pp. 895–903.
[68] Y. Nesterov, Gradient methods for minimizing composite functions, Math. Program., vol. 140, no. 1, pp. 125–161, Aug. 2013.
[69] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[70] P. Tseng, On accelerated proximal gradient methods for convex–concave optimization, SIAM J. Optim., 2008.
[71] X.-T. Yuan, X. Liu, and S. Yan, Visual classification with multitask joint sparse representation, IEEE Trans. Image Process., vol. 21, no. 10, pp. 4349–4360, Oct. 2012.
[72] X. Chen, W. Pan, J. T. Kwok, and J. G. Carbonell, Accelerated gradient method for multi-task sparse learning problem, in Proc. ICDM, Dec. 2009, pp. 746–751.
[73] A. Shrivastava, V. M. Patel, and R. Chellappa, Multiple kernel learning for sparse representation-based classification, IEEE Trans. Image Process., vol. 23, no. 7, pp. 3013–3024, Jul. 2014.
[74] S. Gao, I. W. Tsang, and L.-T. Chia, Sparse representation with kernels, IEEE Trans. Image Process., vol. 22, no. 2, pp. 423–434, Feb. 2013.
[75] Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, Jul. 2012.
[76] M. Godec, P. M. Roth, and H. Bischof, Hough-based tracking of non-rigid objects, in Proc. ICCV, Nov. 2011, pp. 81–88.
[77] L. Sevilla-Lara and E. G. Learned-Miller, Distribution fields for tracking, in Proc. CVPR, Jun. 2012, pp. 1910–1917.
[78] P. Li, Q. Wang, W. Zuo, and L. Zhang, Log-Euclidean kernels for sparse representation and dictionary learning, in Proc. ICCV, Dec. 2013, pp. 1601–1608.
[79] N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, in Proc. CVPR, Jun. 2005, pp. 886–893.
[80] Q. Wang, F. Chen, W. Xu, and M.-H. Yang, Object tracking via partial least squares analysis, IEEE Trans. Image Process., vol. 21, no. 10, pp. 4454–4465, Oct. 2012.
Xiangyuan Lan (S'14) received the B.Eng. degree in computer science and technology from the South China University of Technology, China, in 2012. He is currently pursuing the Ph.D. degree with the Department of Computer Science, Hong Kong Baptist University, Hong Kong.
Mr. Lan was a recipient of the National Scholarship from the Ministry of Education of China, in 2010 and 2011, and the IBM China Excellent Student Scholarship from IBM and the China Scholarship Council in 2012. In 2015, he was a Visiting Scholar with the Computer Vision Laboratory, University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, USA.
Mr. Lan's current research interests include computer vision and pattern recognition, particularly feature fusion and sparse representation for visual tracking.
Andy J. Ma received the B.Sc. and M.Sc. degrees
in applied mathematics from Sun Yat-sen University, in 2007 and 2009, respectively, and the Ph.D.
degree from the Department of Computer Science,
Hong Kong Baptist University, in 2013. He is currently a Post-Doctoral Fellow with the Department
of Computer Science, Johns Hopkins University.
His current research interests focus on developing
machine learning algorithms for intelligent video
surveillance.


Pong C. Yuen (M'93–SM'11) received the B.Sc. (Hons.) degree in electronic engineering from City Polytechnic of Hong Kong, in 1989, and the Ph.D. degree in electrical and electronic engineering from The University of Hong Kong, in 1993. He joined Hong Kong Baptist University in 1993, where he is currently a Professor and the Head of the Department of Computer Science.
Dr. Yuen spent a six-month sabbatical leave with the University of Maryland Institute for Advanced Computer Studies, University of Maryland at College Park, in 1998. From 2005 to 2006, he was a Visiting Professor with the GRAphics, VIsion and Robotics Laboratory, INRIA, Rhone Alpes, France. He was the Director of the Croucher Advanced Study Institute on Biometric Authentication in 2004, and the Director of the Croucher ASI on Biometric Security and Privacy in 2007. He was a recipient of the University Fellowship to visit The University of Sydney in 1996.
Dr. Yuen has been actively involved in many international conferences as an Organizing Committee and/or a Technical Program Committee Member. He was the Track Co-Chair of the International Conference on Pattern Recognition in 2006 and the Program Co-Chair of the IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems in 2012. He is an Editorial Board Member of Pattern Recognition and an Associate Editor of the IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY and the SPIE Journal of Electronic Imaging. He is also serving as a Hong Kong Research Grants Council Engineering Panel Member.
Dr. Yuen's current research interests include video surveillance, human face recognition, biometric security, and privacy.

Rama Chellappa (S'78–M'79–SM'83–F'92) received the B.E. (Hons.) degree in electronics and communication engineering from the University of Madras, India, the M.E. (Hons.) degree from the Indian Institute of Science, Bangalore, India, and the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN. From 1981 to 1991, he was a Faculty Member with the Department of Electrical Engineering-Systems, University of Southern California (USC). Since 1991, he has been a Professor of Electrical and Computer Engineering and an Affiliate Professor of Computer Science with the University of Maryland (UMD), College Park. He is also affiliated with the Center for Automation Research and the Institute for Advanced Computer Studies (Permanent Member) and is serving as the Chair of the Electrical and Computer Engineering Department. In 2005, he was named Minta Martin Professor of Engineering. His current research interests span many areas in image processing, computer vision, and pattern recognition.
He is a Golden Core Member of the IEEE Computer Society, and has served as a Distinguished Lecturer of the IEEE Signal Processing Society and as the President of the IEEE Biometrics Council. He is a fellow of IAPR, the Optical Society, the American Association for the Advancement of Science, and the Association for Computing Machinery, and the Association for the Advancement of Artificial Intelligence, and holds four patents. He is a recipient of an NSF Presidential Young Investigator Award and four IBM Faculty Development Awards. He received two paper awards and the K. S. Fu Prize from the International Association of Pattern Recognition (IAPR). He is a recipient of the Society, Technical Achievement, and Meritorious Service Awards from the IEEE Signal Processing Society. He also received the Technical Achievement and Meritorious Service Awards from the IEEE Computer Society. He is also a recipient of the Excellence in Teaching Award from the School of Engineering at USC. At UMD, he received college- and university-level recognitions for research, teaching, innovation, and mentoring of undergraduate students. In 2010, he was recognized as an Outstanding Electrical and Computer Engineer by Purdue University. He served as the Editor-in-Chief of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE and the General and Technical Program Chair/Co-Chair of several IEEE international and national conferences and workshops.
