
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010

Coupled Prediction Classification for Robust Visual Tracking

Ioannis Patras, Member, IEEE, and Edwin R. Hancock

Abstract—This paper addresses the problem of robust template tracking in image sequences. Our work falls within the discriminative
framework in which the observations at each frame yield direct probabilistic predictions of the state of the target. Our primary
contribution is that we explicitly address the problem that the prediction accuracy for different observations varies, and in some cases,
can be very low. To this end, we couple the predictor to a probabilistic classifier which, when trained, can determine the probability that
a new observation can accurately predict the state of the target (that is, determine the “relevance” or “reliability” of the observation in
question). In the particle filtering framework, we derive a recursive scheme for maintaining an approximation of the posterior probability
of the state in which multiple observations can be used and their predictions moderated by their corresponding relevance. In this way,
the predictions of the “relevant” observations are emphasized, while the predictions of the “irrelevant” observations are suppressed.
We apply the algorithm to the problem of 2D template tracking and demonstrate that the proposed scheme outperforms classical
methods for discriminative tracking both in the case of motions which are large in magnitude and also for partial occlusions.

Index Terms—Regression, tracking, state estimation, relevance determination, probabilistic tracking.

1 INTRODUCTION

VISION-BASED tracking is one of the fundamental and most challenging low-level problems in computer vision. Formally, it is defined as the problem of estimating the state x (e.g., position, scale, rotation, or 3D pose) of a target, given a set of noisy image observations Y = {…, y⁻, y} up to the current frame.¹ Usually, an estimate of the state at each frame is the location of a minimum of a cost function or, in the probabilistic framework, the location of a maximum of the posterior p(x | Y). Alternatively, a representation of the posterior p(x | Y) can be maintained for each frame of the image sequence.

1. In this work, we denote with a⁻ an observation or (an instantiation of) a random variable a at the previous time instant. For example, we denote with y the observation at the current frame and with y⁻ the observation in the frame before.

I. Patras is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Road, E1 4NS London, UK. E-mail: i.patras@elec.qmul.ac.uk.
E.R. Hancock is with the Department of Computer Science, University of York, YO10 5DD York, UK. E-mail: erh@cs.york.ac.uk.
Manuscript received 16 May 2008; revised 21 Oct. 2008; accepted 20 Aug. 2009; published online 6 Oct. 2009.
Recommended for acceptance by P. Perez.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2008-05-0291.
Digital Object Identifier no. 10.1109/TPAMI.2009.175.
0162-8828/10/$26.00 © 2010 IEEE   Published by the IEEE Computer Society

1.1 Literature Review

The large number of methods that have been proposed in the last decades for maintaining a representation or for finding a maximum of the posterior p(x | Y) fall into two main categories. In the first category belong the generative methods (e.g., [11], [9]). Such methods require the inversion of the posterior p(x | Y) using the Bayes rule and the evaluation of the likelihood p(y | x) at certain sample states x. An important drawback is that at least some evaluations need to be performed at sample points in the state space that are close to the "true" state, and therefore, a large number of solutions need to be examined. Detection-based methods [27] that exhaustively search all image locations for the presence of a target fall into this category.

In order to reduce the computational complexity and to deal with likelihoods with multiple modes, a number of methods have been proposed for selecting candidate solutions in the generative tracking framework. A common choice utilizes a motion model (e.g., [11]) for the proposal distribution, that is, for the distribution from which the candidate solutions will be sampled. Typical choices for the motion model range from general constant velocity/acceleration models to higher order models whose parameters can be learned from training data [11], [12] or online [5]. A motion model assists the estimation process so long as the temporal evolution of the state follows it. However, the residual between the model-based temporal prediction and the true target state can be significant in the general case of irregular motion, novel motion, or a moving camera. Other methods utilize the fact that in certain domains, the true target state may lie on a manifold in the state space, and therefore, (probabilistic) priors may exist on the state of the target. This is the case when the state encodes the position of multiple interacting targets (such as facial points [21]) or the position of the components of a constrained articulated structure such as the human body [23]. Alternative methods utilize the observations in the current frame in order to sample from areas where the likelihood is expected to be higher. This may be done by performing a two-stage propagation [22], or by using mixtures of learned detectors and dynamic models (e.g., [18]). In the latter case, a target detector needs to be applied at every image location and at various scales.

The second category consists of the discriminative (or prediction-based) methods. In contrast to generative methods, in discriminative tracking, an observation y delivers a direct prediction of the hidden state x. This alleviates the need for a good proposal distribution and multiple evaluations.
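As a toy illustration of such direct prediction (and of the supervised linear mappings of the form δx = A δy reviewed in Section 1.1), the sketch below fits, by least squares, a matrix A that maps intensity differences to a known displacement. The 1D signal, sizes, and names are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D "template" signal; all names and sizes are illustrative.
template = rng.normal(size=64)

def observe(shift):
    """Intensity difference y between the shifted signal and the template."""
    return np.roll(template, shift) - template

# Training pairs (y, x): one residual vector per known displacement.
shifts = np.arange(-5, 6)
Y = np.stack([observe(s) for s in shifts])   # rows: residual vectors
X = shifts.astype(float)[:, None]            # targets: displacements

# Least-squares estimate of A such that Y @ A approximates X.
A, *_ = np.linalg.lstsq(Y, X, rcond=None)

# Direct prediction for an observation of a known in-range displacement.
pred = observe(3) @ A
```

For in-range displacements, the learned predictor recovers the shift directly from a single residual vector, with no search over candidate states.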
The predictor can be obtained in two ways. First, it can be derived analytically from modeling assumptions. This is the case with classical motion estimation schemes that utilize the optical flow equation, such as the method of Lucas and Kanade [16] and the work of Simoncelli et al. [24]. Many motion estimation schemes are posed as an optimization problem which is solved using the gradient ∂y/∂x of the observation y with respect to the state x, for example, within a gradient descent approach. While a point estimate is usually obtained [16], Simoncelli et al. [24] derive an estimate of the distribution of the posterior p(x | y) by explicitly modeling the distribution of the noise in the various terms that appear in the optical flow equation.

Second, the predictor may be learned in a supervised way from training data. In the learning framework, a number of researchers have recently proposed methods for deformable motion estimation, 2D template tracking, and 3D human pose estimation [8], [25], [2]. One of the first learning-based approaches is that of Cootes et al. [8], which estimates the parameters that optimally warp an image to an appearance model in an iterative way. The state update δx at each iteration is estimated from the intensity differences δy between the warped image and the appearance model; note that the intensity residual δy depends on the current state x. Instead of estimating δx using the gradient of the observation (i.e., ∂y/∂x), they learn in a supervised way a linear relation between them, that is, they learn a matrix A such that δx = A δy. For 3D human pose tracking, Sminchisescu et al. [25], [6] train a Bayesian Mixture of Experts in order to learn a multimodal posterior p(x | y). Agarwal and Triggs [2] use Relevance Vector Machines (RVMs) in order to learn mappings between vectors of image descriptors and the 3D pose of a human body. In [1], they use Nonnegative Matrix Factorization in order to remove the parts of the observation vector that are due to noise or occlusion. For 2D tracking, Williams et al. [30] use RVMs in order to learn the posterior of the location of a visual target (e.g., a human face) given an observation at a certain image location. Finally, for 2D tracking, Jurie and Dhome [15] learn in a supervised way a linear relation between the intensity differences between two templates and the corresponding motion transformation.

In order to deal with possibly large prediction errors, most of the previous methods rely mainly on temporal filtering. Sminchisescu et al. [25] and Agarwal and Triggs [2] use as observations features that are extracted from a single-object silhouette. They address prediction errors by adopting a multiple hypotheses tracking framework that performs temporal filtering. On the other hand, Williams et al. [30] couple the regression-based tracking to a detection-based scheme that is employed to validate that the target is at the predicted position or pose. In the case of a validation failure, a full-scale detection phase is initiated. A Kalman filter is used for temporal filtering and leads to a reduction of the error by an order of magnitude. However, none of these methods addresses explicitly the problem of assessing in advance how well the observation y can predict the state x. Nor do they use multiple observations in order to increase robustness. Regression-based methods are known to be sensitive to observations that do not belong to the space that is sampled by the training data set. Therefore, the accuracy of the prediction of the posterior p(x | y) can deteriorate sharply for observations y that are contaminated with noise or come from areas that are uninformative concerning the state of the visual target (e.g., occluded areas). In particular, in the case of 2D tracking, when the motion magnitude is larger than in the training set, the prediction error is likely to be large and tracking is likely to fail. In Fig. 1, we illustrate this effect by plotting the prediction error as a function of the true displacement in an artificial example for a regression-based predictor. Similar observations are reported in [15] and [30] for regression-based schemes and also hold true for gradient-based predictors such as [16].

Fig. 1. Prediction error as a function of the true horizontal and the true vertical displacement. The performance deteriorates sharply outside the training area. Here, a Bayesian Mixture of Experts was trained for displacements in the interval [−11 … 11].

1.2 Contribution

In this paper, we propose a coupled prediction-classification scheme for prediction-based 2D visual tracking. The method allows the use of multiple observations in a way such that each observation y(r) (r ∈ {r_1, …, r_R}) contributes to the prediction of the state of the target according to its relevance (or reliability). In our scheme, the corresponding reliability is determined by a probabilistic classifier. In this way, the contribution of predictions that originate from reliable observations is the most significant, while the contribution of predictions of observations that originate from occluded areas or of unreliable observations is largely suppressed.

In order to achieve this goal within a discriminative particle filtering framework [25], we introduce an additional random variable r that is used to obtain (or, in general, utilize) multiple observations denoted by y(r), together with a binary random variable z that is related to the relevance of the observation. We use a probabilistic classifier [26] in order to model the conditional probability p(z = 1 | y(r)) (the probability that the observation y(r) is relevant/reliable) and use a probabilistic predictor [29] in order to model p(x | y(r)) (the posterior probability of the state x given an observation y(r)). Both the predictor and the classifier are trained in a supervised way using data that are generated by applying synthetic transformations to the target template from the first frame. Alternatively, both may be trained using data from an annotated database. During tracking (Fig. 2), multiple observations are generated by sampling r, and the prediction of each observation (as given by the probabilistic predictor) is moderated by the corresponding relevance/reliability weight (as given by the probabilistic classifier).

Our overall contributions in this paper can be summarized as follows:
Fig. 2. Overview of the proposed method.

• We explicitly address the problem of the determination of the relevance/reliability of an observation to the state estimation process by learning in a supervised way the underlying conditional probability distribution.
• We devise a probabilistic framework that allows multiple observations y(r) to contribute to the prediction of the state of the target according to their corresponding relevance/reliability.
• We make explicit the relation between our framework and alternative discriminative and generative estimation/tracking schemes. More specifically, we show that under certain modeling assumptions (simplifications), our estimation scheme is practically equivalent to classical generative and discriminative estimation schemes.

The remainder of the paper is organized as follows: In Section 2, we provide an outline of the proposed discriminative tracking framework with data relevance determination. In Section 2.1, we briefly describe the Bayesian Mixture of Experts predictor, and in Section 2.2, we present our method for observation relevance determination. Section 2.3 presents a procedure which, given a predictor, selects an appropriate classifier, and in Section 2.4, we show the relation of the proposed scheme with alternative generative and discriminative tracking methods. Section 3 presents experimental results. Finally, in Section 4, we give some conclusions and directions for future work. An early version of the proposed scheme appears in [20].

2 PREDICTION-BASED TRACKING WITH RELEVANCE DETERMINATION

Filtering, such as Kalman filtering or particle filtering, has been the dominant framework for the recursive estimation of the conditional probability of the unknown state x given a set of observed random variables Y = {…, y⁻, y}. In the discriminative filtering framework (Fig. 3a), the filtered density can be derived as [25]:

    p(x | Y) = ∫ dx⁻ p(x⁻ | Y⁻) p(x | x⁻, y),    (1)

where y (y⁻) is the observation at the current (previous) frame and x (x⁻) is the state in the current (previous) frame, respectively. Similarly, Y is the set of observations up to the current frame and Y⁻ the set of observations up to the previous frame.

The derivation of (1) ignores the fact that for certain problems, different parts of the observation y can give different predictions of the state of the target. For example, in [30], for 2D tracking where the evidence y is an image frame, the prediction of the state of the target (e.g., its 2D location) is based on the data y(r) extracted from a single window centered at a position r. In the absence of a motion model, r is the estimated position of the target in the previous frame, that is, r = x̂⁻. However, using data from a single window disregards the information that is available at other positions r. Similarly, for 3D tracking [2], [25], a single feature vector is extracted from the object silhouette.

Fig. 3. Graphical models (a) for classical discriminative tracking and (b) for regression tracking with relevance determination.

On the other hand, in the generative particle filtering framework for 2D tracking, it is common practice that several parts of the observation are examined. This is achieved by using multiple samples (particles) r and by assigning to each particle a weight π(y, r) proportional to the likelihood p(y | r). The particles r are sampled from p(x | Y⁻) using the transition probability p(x | x⁻) and, most usually, the sampled r determines how the observation y will be utilized. The latter means that the likelihood p(y | r) is modeled as a function of y(r), that is, p(y | r) = p(y(r); c), where c are some model parameters. In the simplest case, a number of measurements y(r) at positions r (r ∈ {r_1, …, r_R}) around the location of the target in the previous frame are utilized.² Given the above, the posterior is empirically approximated using a set of weighted particles, that is, a set of pairs {(π(y, r_1), r_1), …, (π(y, r_R), r_R)}. Formally,

    p(x | Y) ≈ (1/Z) Σ_{r = r_1}^{r_R} π(y, r) δ(x − r),    (2)

where Z is a scaling parameter and δ(·) is the Kronecker delta function.

Here, we propose a discriminative particle filtering method that utilizes the fact that several parts of the observation can yield predictions of the state of the target. We do so by introducing a random variable r that determines which parts, or in general how, the observation y will be used. Without loss of generality, in the derivations that follow, we will assume that r has the dimensionality and the physical meaning of the hidden state x. For example, in the case of 2D template tracking where x ∈ R², the random variable r ∈ R² will determine the centers of the windows/patches at which we will extract the observations y(r) that will give predictions of x. In general, r will be used for obtaining a set of candidate observations y(r) and does not need to have the dimensionality of x. We will also condition r on x⁻, as we expect that the previous state can be sufficiently informative on how candidate observations can be obtained. Subsequently, we introduce a binary variable z and denote with p(z = 1 | y, r) the probability that the observation y(r) is relevant for the prediction of the unknown state x. The dependencies of the variables are depicted in Fig. 3b, where y is observed and the remainder are hidden.

2. In general, in the case that the state x is not only a 2D displacement, obtaining y(r) requires warping.
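For reference, the weighted-particle approximation of the posterior in (2) can be sketched in a few lines of Python. The Gaussian likelihood, the particle count, and all positions are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def empirical_posterior(particles, likelihood):
    """Approximate p(x | Y) by a set of weighted particles, with weights
    proportional to the likelihood p(y | r), as in the generative framework."""
    w = np.array([likelihood(r) for r in particles])
    w /= w.sum()                                 # normalization (the 1/Z factor)
    return particles, w

# Hypothetical likelihood, peaked at the "true" target position (3, 1).
true_pos = np.array([3.0, 1.0])
lik = lambda r: np.exp(-0.5 * np.sum((r - true_pos) ** 2))

particles = rng.normal(0.0, 3.0, size=(200, 2))  # samples of r
particles, weights = empirical_posterior(particles, lik)
estimate = weights @ particles                   # posterior mean estimate
```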
TABLE 1
Discriminative Filtering with Data Relevance Determination

TABLE 2
Modeling Choices

In this graphical model, the filtered density can be derived as

    p(x | Y) = ∫ dx⁻ p(x⁻ | Y⁻) [ ∫ dr p(r | x⁻) ∫ dz p(x | z, x⁻, y, r) p(z | y, r) ].    (3)

In what follows, we will describe our modeling choices and a computational scheme for maintaining a representation that approximates the above posterior. We will show that an approximation of the posterior p(x | Y) is

    p(x | Y) ≈ (1/R) [ Σ_{r = r_1}^{r_R} π(y, r) p(x | z = 1, x⁻, y, r) + Σ_{r = r_1}^{r_R} (1 − π(y, r)) p(x | z = 0, x⁻, y, r) ],    (4)

where π(y, r) ≜ p(z = 1 | y, r) is the relevance of the observation y(r), p(x | z = 1, x⁻, y, r) is the probabilistic prediction for the state x given that the observation y(r) is relevant, and p(x | z = 0, x⁻, y, r) is a probabilistic prediction of the state x given that y(r) is not relevant. The r_1, …, r_R are samples of the hidden variable r and need to be sampled properly (in the way that is described below) so that (4) indeed becomes an approximation of the posterior. Finally, notice the similarity in form between our approximation (4) and the approximation in the generative framework (2). We will make the relationship explicit in Section 2.4.

In order to complete the specification of our framework, we need to define its three main formal components, that is, π(y, r), p(x | z = 1, x⁻, y, r), and p(x | z = 0, x⁻, y, r). We assume that these probability distributions are either derived from modeling assumptions or learned in a training phase. For example, modeled probabilistic distributions have been used in the context of motion estimation [24]. Here, we opt for a learning approach (as explained in Sections 2.1 and 2.2) in which a probabilistic classifier determines the observation relevance/reliability π(y, r) and a Bayesian Mixture of Experts (BME) determines the prediction p(x | z = 1, x⁻, y, r). In the tracking phase, given a triple (y, r, x⁻), the trained BME yields a mixture of Gaussians that is our approximation of p(x | z = 1, x⁻, y, r). For p(x | z = 0, x⁻, y, r), that is, for the probabilistic predictions given that the observation is not relevant, we use a modeling approach and approximate it using a Gaussian with a large covariance matrix S_0 (alternatively, we could have used a uniform distribution). Our modeling choices lead to an approximation of p(x | Y) by a mixture of Gaussians. This allows us to deal with posteriors with multiple modes and also to recover from tracking failures. In order to keep the number of mixture components constant (equal to M), we devise, in the Appendix, a method for approximating an L-component Gaussian mixture with an M-component Gaussian mixture (M ≤ L). In Table 1, we describe the computational scheme that, given an approximation of the posterior p(x⁻ | Y⁻) of the state at the previous frame by an M-component mixture of Gaussians, yields an approximation of the state posterior p(x | Y) at the current frame by an M-component mixture of Gaussians. In Table 2, we summarize our modeling choices.

Note that in (5), the integral is approximated using K + 1 Gaussian components. In practice, in order to reduce the number of components, we use the approximation

    ∫ dz p(x | z, x⁻, y, r) p(z | y, r) ≈ Σ_{i=1}^{K} π(y, r) g_i N(μ_i + r, S_i),   if π(y, r) > θ_z;
                                       ≈ N(x⁻, S_0),                                otherwise.    (6)

As a result, we approximate the term ∫ dr p(r | x⁻) ∫ dz p(x | z, x⁻, y, r) p(z | y, r) with a mixture of L unnormalized (R ≤ L ≤ RK) Gaussians. We reduce it to an M-component mixture in step 4.

The computational complexity of this scheme is O(RK + RJ + RKM), where O(RK) is the complexity of the BME predictor and O(RKM) is the complexity of the EM algorithm for the reduction of the RK-component mixture of Gaussians to an M-component mixture of Gaussians. O(RJ) is the complexity of the kernel-based Relevance Vector Machine classifier (Section 2.2), where J is the number of the support vectors.
2.1 Bayesian Mixture of Experts for Regression

In what follows, we will describe a method that, given an observation y(r) and the target state at the previous frame x⁻, yields a probabilistic prediction of the state x at the current frame. For notational simplicity, let us here denote with y the Cartesian pair (y(r), x⁻).

Our method follows the work of Sminchisescu et al. [25] and uses the Bayesian Mixture of Experts (BME) for regression. Given an observation y, the BME delivers a probabilistic prediction of x as a mixture of Gaussians. The rationale behind this choice over alternative regression methods (e.g., RVMs [26]) is that the BME can successfully model predictive distributions that are multimodal in character. Such distributions often arise in the case of 3D tracking due to, for example, front/back and left/right ambiguities [25], [2], [23]. They are also expected to arise in the case of 2D tracking due to the aperture problem [10]. However, this choice is not restrictive, and any linear [32] or nonlinear regression method [8] could be used as an alternative. More generally, any method that can deliver a prediction of the state x given an observation y can be used. In the case of 2D tracking, the Lucas-Kanade method [16] could be used to make an estimate of the 2D target location x by delivering an estimate of the displacement vector δx with respect to the position at which the observation y(r) was extracted. Similarly, the method of Simoncelli et al. [24] could deliver a probabilistic prediction (a Gaussian) for x.

The (Hierarchical) Mixture of Experts, which was introduced by Jordan and Jacobs [14], is a method for regression and classification that relies on a soft probabilistic partitioning of the input space. This is determined by gating coefficients g_i(y) (one for each expert i) that are input dependent and have a probabilistic interpretation; that is, the coefficients of the siblings at each level of the hierarchy sum up to one. The prediction of each expert i is then moderated by the corresponding gating coefficient. Formally, in the simple case of a flat hierarchy for regression,

    p(x | y) = Σ_{i=1}^{K} g_i(y) f_i(x | y),    (7)

where f_i(x | y) is a probability density function, usually a Gaussian centered around the prediction of the expert i. In the simple case that linear experts are used,

    g_i(y) = exp(ν_i^T y) / Σ_j exp(ν_j^T y),    (8)

and

    f_i(x | y) = N(w_i^T y, S_i),    (9)

where the w_i and ν_i are the unknowns to be estimated. Jordan and Jacobs [14] proposed a Maximum Likelihood method for the estimation of w_i and ν_i, while in [29], a Bayesian approach is used. We adopt the approach in [29], in which a set of hyperparameters models the prior distributions of w_i and ν_i, and follow a variational approach for the estimation of their posterior distributions. As in [29], we make a Laplace approximation and estimate the mode and the variance of the posteriors, which (with a slight abuse of notation) we denote here as (w_i, Σ_{w_i}) and (ν_i, Σ_{ν_i}). In the process, we also estimate the optimal value for the hyperparameters α_i that are associated with the noise covariance S_i of the prediction of expert i (see [29] for details).

In [29], a procedure is described for scalar regression. In the case where the target is a vector x with dimensionality D, we may train D different Mixtures of Experts. Here, we have extended the methodology to experts that have multidimensional output (i.e., f_i(x | y) is a multidimensional Gaussian with diagonal noise covariance S_i). In this case, w_i is a matrix with the number of rows equal to the dimensionality of y and the number of columns equal to the dimensionality of x.

For prediction, we marginalize over the parameters and hyperparameters as in [29]. For a new observation y, the predictive distribution is a mixture of Gaussians given by

    p̂(x | y) = Σ_{i=1}^{K} g_i(y) N(w_i^T y, S′_i),    (10)

where the kth element of the diagonal covariance matrix S′_i, denoted with S′_ik, is given by

    S′_ik = y^T Σ_{w_ik} y + S_ik,    (11)

where S_ik is the corresponding element in the covariance matrix of the ith expert. Alternatively, we may straightforwardly use (7), or use only the prediction of the expert with the highest gating coefficient g_i, or approximate (10) with a single Gaussian.

For the problem of 2D visual tracking, we aim to estimate the transformation x (e.g., translation, rotation, and scaling) that a visual target undergoes in an image sequence. We train the BME in a supervised way with pairs (y(x), x) in which the observations y(x) are produced by synthetically transforming (e.g., translating) the visual target with the transformation x. In this case, we choose to ignore the state x⁻ at the previous frame when training the BME. Subsequently, in the test phase, an observation will give a probabilistic prediction according to (10).

2.2 Data Relevance Determination

For the determination of the relevance/reliability p(z | y, r) of an observation y(r), we use a classification scheme with RVMs. The goal is to obtain an a priori assessment of whether the probabilistic prediction p(x | y(r)) (10) of the state of the target is expected to be good. To this end, we train an RVM classifier in a supervised way with a set of positive examples that yield good predictions and with a set of negative examples that yield bad predictions. Let us denote with sigm the sigmoid function, with {ỹ_i} the training set of the classifier, and with φ(y_i, y_j) a kernel function (e.g., a Gaussian or a linear one).

Then, after training and when presented with a novel observation y(r), the RVM yields a prediction of the relevance of the observation y(r) as

    p(z = 1 | y(r)) = sigm( Σ_i w_i^{rvm} φ(y(r), ỹ_i) ),    (12)

where w^{rvm} is a sparse weight vector that is learned in the training phase.
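A minimal numerical sketch of the flat Mixture of Experts prediction in (7)-(9), with softmax gates and linear Gaussian experts; the parameter values are illustrative and assumed to have been estimated already (e.g., with the Bayesian treatment of [29]):

```python
import numpy as np

def bme_predict(y, V, W, S):
    """Predictive mixture of a flat Mixture of Experts:
    softmax gates g_i(y) = exp(v_i^T y) / sum_j exp(v_j^T y), eq. (8),
    and linear Gaussian experts f_i(x|y) = N(w_i^T y, S_i), eq. (9)."""
    a = V @ y
    gates = np.exp(a - a.max())
    gates /= gates.sum()              # normalized gating coefficients
    means = W @ y                     # per-expert predictions w_i^T y
    return gates, means, S

# Toy setup: two experts, scalar state, 3D (augmented) observation.
V = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])   # gate parameters v_i
W = np.array([[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0]])   # expert weights w_i
S = np.array([0.1, 0.1])                            # expert noise variances
gates, means, covs = bme_predict(np.array([1.0, 0.0, 1.0]), V, W, S)
point_estimate = gates @ means                      # mean of the mixture
```

The gates weight each expert's linear prediction, so the mixture mean interpolates between the experts according to how well each one matches the input.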
Fig. 4. Positions r of the vectors selected by the RVM-based classifier.


The inner (outer) window illustrates the range from which the training set
of the BME-based predictor (RVM-based classifier) is constructed. Fig. 5. (a) Prediction error and (b) pðz j y; rÞ as functions of the true
displacement. Here, a BME was trained to predict displacements in the
where wrvm is a sparse weight vector that is learned in the interval ½11 . . . 11 for a template of size 11  11. An RVM was trained to
classify as “relevant” observations in the interval ½22 . . . 22 that can
training phase. deliver accurate (þ= 2 pixels) predictions of the true displacement.
The training set f~ yðrÞg is constructed as follows: A
candidate observation y~ðrÞ is generated by artificially 2.3 Classifier Selection
transforming (e.g., translating) the visual target with a
transformation which we denote here with r. Then, for each The classification scheme of Section 2.2 depends on the
of the candidate observations, a probabilistic prediction is choice of the parameter r (13) and a number of internal
made using (10). We place in the set of positive examples parameters that need to be set. The former can be
candidate observations for which an appropriate norm of interpreted as the desired level of prediction accuracy. Its
the difference between the true transformation r and the selection has direct implications to the complexity and
expected value (i.e., the mean) of the prediction pðx j yðrÞÞ is accuracy of the classifier since certain levels of prediction
less than a threshold. That is, accuracy might be difficult or impossible to achieve. The
first consideration is the ability of the regression scheme to
kr  E pðxjyðrÞÞ ðxÞk < r : ð13Þ learn sufficiently well the true posterior pðx j z ¼ 1; y; x ; rÞ,
As pðx j yðrÞÞ is a mixture of Gaussians, the expected value that is, to learn a pdf with most of its mass concentrated
of x in the above equation can be obtained in closed form. Alternative schemes for constructing the positive training set, such as thresholding the distance between the true transformation r and the mode of p(x | y(r)), or by thresholding the probability of the ground truth transformation r (i.e., p(r | y(r)) > θ_r), are also possible. The set of the negative examples is composed of the observations for which (13) is not satisfied. Other examples, such as observations from regions in the background, could also be added in the negative training set. Clearly, the transformations r that generate the candidate training set need to explore larger parts of the state space than the ones used to construct the training set of the BME.

In Fig. 4, we illustrate the 2D case in which y(r) is an observation taken at 2D windows around each position r. In Fig. 4, superimposed on the original frame, we depict the range from which the r is taken for constructing: 1) the training set of the BME-based predictor (inner window) and 2) the training set of the RVM-based classifier (outer window). In the same figure, we superimpose the r for the vectors that have been chosen by the RVM classifier.

In Fig. 5, and for the toy example that we used in Fig. 1, we illustrate the true prediction error of a BME with eight experts that have been trained to predict 2D displacements in the interval [-11...11]. Also shown is the corresponding 2D plot of p(z | y, r). Note that we test with observations that result from displacements from both inside and outside the training interval. The RVM has been trained on positive examples that have been selected by thresholding the L1 error norm. It is clear that we can predict reasonably well which observations are associated with a low prediction error. Note that not all observations that fall outside the training range give high prediction errors. This indicates that, in this case, the BME is capable of extrapolating.

A first consideration is the accuracy of the predictions around the true target state x. This depends on the size of the template, the range of x in the training set, and a number of parameters (such as the number of linear predictors and the data y itself). A second consideration is that the classification problem at a certain level of accuracy (i.e., for a certain θ_r) might be very difficult to solve, while, for a different value of θ_r, it might be considerably easier. A small θ_r can lead to an empty positive training set, while a large value of θ_r can lead to low accuracy.

The threshold θ_r and the remaining parameters, such as the internal parameters of the classifier, can be selected in a cross-validation scheme. In what follows, we propose an alternative way of determining the classification scheme. The main idea is that, ideally, the probabilistic classifier should rank the observations according to their prediction accuracy. In other words, the rank order of 1 - p(z = 1 | y, r) should coincide with the rank order of the error of the corresponding prediction. The divergence between the ideal ranking and the ranking obtained by a specific probabilistic classifier can therefore give a measure of the goodness of the classifier in question.

Formally, let us assume that the observations are ranked in order according to some criterion. Let q vary between 0 and 1, and let e(q) be the mean error of the predictions of the observations that are in the upper fraction q according to the ranking used. Let e_o be the mean error under the "ideal" ranking, i.e., under the ranking according to the error itself, and e be the mean error under the ranking of a certain probabilistic classifier. That is, e(q) is the mean error of the top q fraction of observations ranked according to the learned p(z = 1 | x, y). We vary the threshold θ_r and the size of the kernel of the RVM classifiers (σ_φ) and, in Fig. 6, we depict e_o(q) and e(q) for some of them, where the differences in the performance are clearly visible.
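To make the ranking criterion concrete, the two curves can be computed as below. This is a minimal pure-Python sketch on synthetic relevance scores and errors; the function and variable names are ours, not the paper's:

```python
def mean_error_curve(scores, errors, q):
    """e(q): mean prediction error of the top-q fraction of observations,
    ranked by decreasing classifier score p(z = 1 | y, r)."""
    ranked = [e for _, e in sorted(zip(scores, errors), key=lambda t: -t[0])]
    k = max(1, round(q * len(ranked)))
    return sum(ranked[:k]) / k

def ideal_error_curve(errors, q):
    """e_o(q): mean error of the top-q fraction under the ideal ranking,
    i.e., the observations ranked by the prediction error itself."""
    ranked = sorted(errors)
    k = max(1, round(q * len(ranked)))
    return sum(ranked[:k]) / k
```

A classifier whose scores reproduce the error ordering attains e(q) = e_o(q) for every q; the gap between the two curves is the goodness measure discussed above.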
Fig. 6. Mean error versus the fraction q of positives for different classifiers. The curve e_o corresponds to the ideal classifier.

We make the selection according to the χ² test between e_o and e, that is, we select a curve that lies on the lower right part of the plot. Such a classifier generates a large number of positives for a given error level, a property that is important in our estimation scheme, which might rely on few observations only. As Fig. 6 reveals, a number of classifiers (solid lines) have similar ranking properties. Among classifiers with similar ranking properties (within a 5 percent margin), we favor the one that delivers, on average, the lower weighted error under the weighing of the positives according to the probabilistic weighing scheme.

2.4 Relation to Generative and Discriminative Tracking

In this section, we will make explicit the relation between the proposed tracking framework on the one hand, and both discriminative and generative tracking methods on the other hand. More specifically, we will show that, under certain modeling assumptions, we derive estimation schemes that are practically equivalent to classical generative and discriminative estimation schemes.

The relation to classical discriminative methods (e.g., [25]) is rather straightforward. In the case that a single observation y(r) is used and the data relevance π = p(z = 1 | y, r) is set to one (i.e., the single observation is considered relevant/reliable), our framework reduces to the discriminative tracking framework of [25]. Formally, (1) can be derived from (3) when the following three modeling choices are made:

. First, p(r | x⁻) = δ(r − x⁻). That is, the auxiliary variable r that controls multiple observations coincides with x⁻ and, therefore, effectively is not used.
. Second, p(z = 1 | y, r) = 1 (and, therefore, p(z = 0 | y, r) = 0). That is, the single observation that is utilized is considered relevant/reliable.
. Third, p(x | z, x⁻, y, r) = p(x | x⁻, y). That is, the prediction does not depend on the auxiliary variables z, r.

We now show that, under suitable modeling choices, the proposed framework reduces to an estimation scheme that is equivalent to particle filtering in the generative framework. To commence, we note that, using the Bayes rule, the posterior is given by

p(x | Y) = [p(y | x) / p(y | Y⁻)] p(x | Y⁻)                                (14)
         = [p(y | x) / p(y | Y⁻)] ∫ dx⁻ p(x | x⁻) p(x⁻ | Y⁻).              (15)

In the generative framework, a number of candidate solutions r_i are sampled from p(x | Y⁻) and they are subsequently weighted using the likelihood³ p(y | r_i). If we denote with π(y, r) the weight that is assigned to sample r, then an approximation of the posterior is given by (2).

3. In general, a number of candidate solutions r ∈ {r_1, ..., r_R} are sampled from a proposal distribution g(r) and they are subsequently weighted by p(y | r) p(r | Y⁻) / g(r).

Recall that, in our framework, the approximation of the posterior is given by (4). Also, observe the relation between (2) and (4): In the generative case, the mass of the posterior is on the samples r_i, while, in the discriminative case, the mass of the posterior is on the predictions p(x | z = 1, x⁻, y, r_i) (let us for the moment ignore the "outlier predictions" p(x | z = 0, x⁻, y, r_i)). Therefore, if we choose the predictors such that p(x | z = 1, x⁻, y, r_i) is equal to δ(x − r), and let r be sampled from p(x | Y⁻), then the estimation schemes of the two frameworks will be equivalent. There are just two differences in the methods. First, in our framework, once a sample r is obtained in this way, it should be assigned a weight π equal to p(z = 1 | r, y). By contrast, in the generative framework, the sample r should be assigned a weight equal to p(y | r) / p(y | Y⁻), or a weight equal to p(y | r), since the term p(y | Y⁻) is independent of r and, therefore, is canceled in the normalization. Second, in our case, the prediction of the outliers (i.e., observations that are irrelevant/unreliable) is made explicit in the form of p(x | z = 0, x⁻, y, r_i).

Formally, under the modeling choice p(x | z = 1, x⁻, y, r) = δ(x − r), (3) becomes

p(x | Y)                                                                    (16)
  = ∫ dx⁻ p(x⁻ | Y⁻) ∫ dr p(r | x⁻) ∫ dz p(x | z, x⁻, y, r) p(z | y, r)     (17)
  = p(z = 1 | y, x) ∫ dx⁻ p(x | x⁻) p(x⁻ | Y⁻)
    + ∫ dx⁻ p(x⁻ | Y⁻) ∫ dr p(r | x⁻) p(x | z = 0, x⁻, y, r) p(z = 0 | y, r)  (18)
  = p(z = 1 | y, x) p(x | Y⁻)
    + ∫ dx⁻ p(x⁻ | Y⁻) ∫ dr p(r | x⁻) p(x | z = 0, x⁻, y, r) p(z = 0 | y, r). (19)

Let us now comment on the main differences between (18) and the filtering equation in the generative framework (15). The first difference is that, instead of using the likelihood p(y | x), our framework uses the term p(z = 1 | y, x) in order to weight samples r that are sampled from p(x | Y⁻). In our
framework, the variable z takes the interpretation of the class of the observation y, and p(z = 1 | y, x) is the probability that the observation y belongs to the target class. Essentially, a classification scheme is used within the particle filtering framework. This is similar to other works in classification-based tracking [3], [7] and to more classical target detectors [27]. Therefore, our scheme offers a formal framework in which classification-based approaches can be used for recursive estimation of the posterior density.

The second main difference of the degenerate case of our scheme with the generative particle filtering tracking (15) is that, in our case, the prediction of the outliers becomes explicit in the form of p(x | z = 0, x⁻, y, r_i), that is, the second term of (18). This bears similarities to generative methods that introduce an occlusion process and condition the likelihood on it. In [31], two likelihood models are defined, one given that the target is occluded and one given that the target is visible. However, while in [31] the occlusion state is inferred, in our case the relevance of the observation is determined by the probabilistic classifier. Also, note that, in the general case of our method, the relevance of an observation is not necessarily related to the degree of occlusion but rather to the degree at which a reliable prediction can be obtained from the observation in question. Three possible choices for the prediction given that the observation is irrelevant/unreliable are the following. The first is to use a Gaussian with large variance around x⁻. A second choice is to use a uniform distribution ε, where ε << 1. The third is to use a prior p(x) in case that it is available [19]. Since p(x | z = 0, x⁻, y, r_i) depends on y, we can easily derive classical outlier processes based on robust statistics [17], or use a Gaussian with large variance as in [13]. Formally, the second term, which models the prediction density of the "unreliable observations," under two reasonable models of p(x | z = 0, x⁻, y, r), becomes

(1/Z) p(z = 0 | y, x) p(x) p(x | Y⁻)   for p(x | z = 0, x⁻, y, r) = (1/Z) p(x) δ(x − r),   (20)

and

ε p(z = 0 | y, x) p(x | Y⁻)   for p(x | z = 0, x⁻, y, r) = ε δ(x − r),   (21)

where Z is a scaling parameter and ε << 1.

3 EXPERIMENTAL RESULTS

We have performed a number of experiments in order to illustrate the performance of the proposed method under different conditions, including occlusions, fast motion, and moderate deformations. Here, we present both quantitative and qualitative results for image sequences that are annotated by hand, as well as comparative results with alternative state-of-the-art methods. More specifically, we compare our Coupled Prediction-Classification algorithm (CPredC) to discriminative tracking when a single observation is used (e.g., [25], [30]) and to the simplified version of the proposed algorithm in which the data relevance determination mechanism is discarded. We do not use any dynamic model, nor temporal filtering, in order to judge the performance when large deviations from the motion model are present. In addition, we have compared our method with a particle filtering algorithm in the generative framework.

For each of our experiments, and in order to reduce the computational complexity during training, we reduce the data dimensionality by applying Principal Component Analysis (PCA) to the data with which the BMEs are trained. PCA gives some marginal improvements, and omitting it does not lead to a significant degradation in the performance. Before being used, the training data for the BME, the training data for the RVM, and all test data are projected to the new space that is spanned by the leading N eigenvectors that were extracted using PCA. We choose N such that 95 percent of the variance is retained, a choice that leads to values of N between 40 and 50 for the "CD" and the "Head" sequences.

In order to deal with illumination changes, we normalize each observation by the average intensity of the window at which it was extracted. The normalization is performed before learning the PCA transform (i.e., during training) and before applying it (i.e., during tracking). In our experiments, we tracked windows of size ranging from 15 × 15 to 25 × 25 pixels. For training the BME, we used 900 Cartesian pairs (y(x), x) in which the observations y(x) are produced by artificially transforming (e.g., translating) the visual target with the transformation x. The examples that were used for training the RVM were generated using transformations with a range two to three times that used for training the BME. In order to reduce the complexity of RVM learning, we apply k-means clustering to the set of candidate observations and train the RVM using the cluster centers. The class identity of a cluster (positive or negative) is determined by the class identity of the majority of the examples that belong to it. The number of clusters is considered as a parameter of the classification scheme and is determined automatically using the criterion of Section 2.3. For all the experiments, we track five Gaussians (i.e., M = 5) and use 50 samples r (i.e., R = 50), unless stated otherwise.

3.1 Artificial Displacements and Noise

We first present results for estimating the location of a facial feature (the corner of an eye) under artificially generated displacements. Both the regressor and the classifier were trained on data from the first frame of the sequence. Here, 1) a 17 × 17 target window was used, 2) the BME was trained with displacements of up to 17 pixels, and 3) the RVM was trained with displacements of up to 34 pixels. The tests were performed on frames 157 and 358 (Fig. 11). In frame 157, some degree of both deformation and illumination change is present. In frame 358, the target is partially occluded. For these frames, artificial displacements were simulated by sampling x⁻ at a distance equal to the "true displacement" from the true target location. During the estimation phase, 50 samples of r are sampled from a 2D uniform distribution with mean x⁻ and width S at each dimension. As we have demonstrated in Section 2.4, for a sampling range of width S = 1 and when a single r is sampled, we are effectively using classical regression-based tracking methods (e.g., [30], [2]) that use a single observation.

In Fig. 7, we summarize the experimental results for artificially generated motions of different magnitudes by plotting the Root Mean Square (RMS) error as a function of both the true displacement and of the sampling range for
Fig. 7. "Head" sequence: Error (RMS) as a function of both the true displacement and the sampling range. (a) Frame 157. (b) Frame 358.

Fig. 9. "Head" sequence: Estimation error at different noise levels for frames 1 and 157. (a) Frame 1. (b) Frame 157.
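The sampling scheme used in these experiments can be sketched as follows (a pure-Python illustration; the names are ours, not the paper's):

```python
import random

def sample_positions(x_prev, S, R=50):
    """Draw R auxiliary positions r from a 2D uniform distribution with
    mean x_prev and width S in each dimension, as in the artificial
    displacement experiments. Setting S = 1 and R = 1 recovers classical
    single-observation, regression-based tracking."""
    half = S / 2.0
    return [(x_prev[0] + random.uniform(-half, half),
             x_prev[1] + random.uniform(-half, half)) for _ in range(R)]
```

Each sampled r determines where an observation y(r) is extracted; the coupled classifier then decides how much each of those observations should be trusted.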
various frames of the "Head" sequence. In Fig. 7a, we show the error for frame 157, a frame in which the target is slightly deformed. As expected, for the successful estimation of larger displacements, r needs to be sampled using a 2D distribution with larger width S. Also, there exists a range of values of S for which the true displacement can be reliably estimated. In Fig. 8a, for the same frame (157), the solid line depicts the error for S = 36. This demonstrates that we are able to reliably estimate motions larger in magnitude than the range over which the regressor was trained. In the same figure, we demonstrate the importance of using both multiple observations and observation relevance determination by presenting comparative results for two simplified versions of our method. The cases considered are: 1) a simplified version (nPred) in which the relevance determination is not used (i.e., π = 1) and 2) a simplified version (sPred) in which only a single observation is used (i.e., π = 1, R = 1). It is clear that, when compared with the proposed CPredC algorithm, both simplifications lead to reduced performance in terms of the range of motions (displacements) that can be reliably estimated. This difference in the performance becomes even larger in the case of partial occlusions, as evidenced by the results for frame 358 (Fig. 11, row 4) that are depicted in Figs. 7b and 8b. In this frame, for some displacements and sampling ranges S, it is the case that x⁻ and some of the samples r are likely to fall within the occluded area. This results in fewer relevant observations that can deliver good predictions. However, while the performance is worse than for frame 157 under large displacements, there seems to be little difference in the performance for displacements of up to 15 pixels. This indicates that the RVM-based classification scheme manages to detect observations that fall within the occlusions. The importance of relevance determination is clearly illustrated in Fig. 8a, where we show the RMS error for the case in which the relevance is not used (i.e., π = 1) and for the sampling range with S = 36. Similarly, when a single observation is used, the performance deteriorates rapidly as the magnitude of the displacement increases.

In Fig. 9, we present results that illustrate the estimation accuracy at different levels of additive Gaussian noise with zero mean. Here, σ_n denotes the noise standard deviation. In Fig. 9a, we present results for the first frame of the sequence, at which the training is performed. In Fig. 9b, we show results for the 157th frame, where the target appearance is slightly affected by lighting and there is also some mild deformation and compression. The performance seems to degrade gracefully as the standard deviation of the noise increases, and the technique is quite robust, especially for low displacements.

3.2 2D Template Tracking

Here, we present results for tracking 2D templates in a number of image sequences under changes in the illumination, large motion, rotation, and nonrigid deformations. In Fig. 11, we present some characteristic frames for the "Head" image sequence. A 17 × 17 window is tracked using 50 observations at a rate of 3.1 frames per second using an unoptimized Matlab implementation (excluding time for input/output). The tracking is consistently good throughout the image sequence, even in the presence of large motion (as a fraction of the window size), occlusion, and some deformations. In Fig. 10a, we show the horizontal and vertical components of the error in pixels, Fig. 10c shows the ground truth velocity for TS = 3, and Fig. 10b shows the ground truth displacement of the target from its position in the first frame. In the first column of Fig. 11, we depict in red (blue) the location at which relevant (unreliable) observations were extracted, in the second column, we give the corresponding probabilistic predictions (each ellipse represents a 2D

Fig. 8. "Head" sequence. Error as a function of the true displacement. Blue line: proposed method (CPredC), red line: multiple observations without relevance determination (nPred), black line: single observation without relevance determination (sPred). (a) Frame 157. (b) Frame 358.

Fig. 10. "Head" sequence (SS = 1, TS = 3): (a) Error, (b) true target displacement, and (c) true velocity (pixels per frame).
Fig. 11. Tracking results for frames 70, 118, 154, 202, 358, and 580 of the "Head" sequence (SS = 1 and TS = 3). (a) Location of relevant (red) and unreliable (blue) observations. (b) Relevant (red) and unreliable (blue) predictions. (c) Probabilistic prediction (red) and point estimation (black box).
Gaussian), and in the third column, we give the probabilistic prediction (each ellipse represents one of the M 2D Gaussians) as well as the final estimate of the target location (black box). It is clear that the classification scheme reduces (or completely discards) the influence of inaccurate predictions. This is apparent both in the case of larger motion (rows 1-3), in which the prediction of observations further away from the target (in the direction opposite from that of the motion) is discarded, as well as in the case of occlusions (row 4).

In order to illustrate the benefits of using both multiple observations and data relevance determination, we present comparative results with the two simplified versions of our algorithm (sPred and nPred), together with two alternative methods reported in the literature, namely, the CONDENSATION algorithm and [32] when a single target is tracked. The simplified version sPred is similar to classical regression-based tracking methods (e.g., [30], [2]) that use a single observation. Recall that we arrive at this simplification by setting the number of samples of r equal to 1 (i.e., R = 1) and the width of the sampling range very small (i.e., S = 1). The
TABLE 3
RMS Error and Percentage of Target Losses (in Parentheses) at Various Spatial (SS) and Temporal (TS) Subsamplings for Various Sequences. For "Towel," the RMedS error is reported.

Fig. 13. Examples of tracking under different Occlusion Types (OTs). From left to right: OT = 4, OT = 2, OT = 1.

algorithm nPred is a simplified version of our method in which relevance determination is not used. Instead, the probabilistic predictions for all candidate observations y(r) are weighted equally (i.e., π(y, r) = 1). For the CONDENSATION algorithm, we 1) use 500 particles, 2) use a Gaussian observation model (for the CD sequence, we use an observation model that relies on the Normalized Cross Correlation), and 3) use as a state transition model p(x | x⁻) of the 2D state x a Gaussian with mean equal to x⁻. In Table 3, we present results for the "Head," "CD," and "Towel" sequences. In the case of the "Head" and "CD" sequences, the location of the tracked template is manually annotated every six frames. In the case of the "Towel" sequence, we have, for each frame, a manual annotation of four points of a large planar object of which the tracked target (a 21 × 21 template) is part. The ground truth location of the target is estimated using an affine transform. We report the averages over five runs.

For the proposed CPredC scheme and the Sequence of Linear Predictors (SLiPs) [32], we report the RMS error and the percentage of frames for which the error is larger than the dimensions of the tracked template (in the case of [32], the percentage of frames for which the error is larger than the training range). In the case of large errors, all other algorithms are also reinitialized and, for them, we report the number of reinitializations as a percentage of the tracked frames. Note that the number of reinitializations reported for sPred (i.e., a scheme that utilizes a single observation) is the number of times that the error was larger than the template size. It is very likely that a scheme that validates a prediction, for example, by classifying an observation extracted at the predicted location [30], would signal that the target has been lost. In comparison to SLiPs, both the error (RMS, or RMedS for the Towel) and the number of frames at which the target is lost are consistently lower for CPredC. For the Towel sequence and for SS = 4, while the tracking is significantly more accurate for CPredC (as reflected in the very low RMedS), at frames where the target is significantly rotated, such as frame 384, at which the rotation is around 80 degrees (Fig. 12), the tracking becomes locked on clutter. This is largely expected and illustrates the need to obtain observations at scales and orientations that are similar to the ones used in the training set. In such cases, maintaining a (multimodal) estimate of the scale and rotation components, for example, by tracking multiple points and compensating for it in order to obtain warped observations, is needed.

Fig. 12. Frames 175, 346, and 494 of the Towel sequence.

For all sequences tested, both sPred and nPred perform significantly worse than CPredC and, in general, sPred is consistently worse than nPred. Let us also note that all of the algorithms deteriorate at higher temporal subsampling, that is, as TS increases. Finally, note that, in the second row of Fig. 11, a multimodal posterior is maintained by CPredC. This is due to the fact that some predictions are locked on the other eye, as some observations fall in its vicinity. This is a point where the other trackers occasionally fail, and is a point where CPredC also fails occasionally in the case of larger temporal subsampling (i.e., larger apparent motion).

3.3 Occlusions and Motion Large in Magnitude

In what follows, we present results for experiments performed in order to test the behavior of our algorithm under large and persistent occlusions. In order to do so, we have artificially occluded part of the target at frames in which the ground truth is known. We have used different Occlusion Types and patterns (denoted by OT), examples of which are presented in Figs. 13, 14, and 15. Both uniform and textured patterns are used, and either half (for odd OT) or a quarter (for even OT) of the target is occluded. Again, different spatial and temporal subsamplings are examined. Since, for the given sequences, the ground truth is available every sixth frame, subsampling with TS = 3 results in the target being occluded every second frame, and subsampling with TS = 2 results in the target being occluded every third frame. For the CD sequence, we used R = 100 observations and tracked a 15 × 15 template at a rate of 5.8 frames/sec in an unoptimized Matlab implementation (excluding time for input/output).

The results are summarized in Table 4, in which the RMS error is reported. While the performance of the alternative algorithms clearly deteriorates, the proposed CPredC method is capable of tracking robustly under partial occlusions. In some cases, the occlusions are severe (e.g., in the last two rows of Table 4, half of the target is occluded) and, in the case of Occlusion Type 5 (OT = 5), the appearance of the occluding area changes. The results of CPredC in comparison to sPred and nPred indicate the significance of using multiple observations as well as weighing their predictions in terms of their expected relevance. For these sequences, the difference in the performance with SLiPs is more pronounced in terms of the RMS, especially in the case
Fig. 14. Tracking results for the "Head" sequence. OT = 5, SS = 2, TS = 3 (see Table 4).

Fig. 15. Tracking results for frames 49, 115, and 253 of the "CD cover" sequence (SS = 2 and TS = 3). A quarter (QRT) of the target is artificially occluded every other frame.
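The artificial occlusions can be generated along the following lines — a pure-Python sketch of the half/quarter convention described in the text. The constant occluding value and the top-left placement are our assumptions for illustration; the actual patterns and positions used in the experiments may differ:

```python
def occlude(window, ot, value=0):
    """Occlude a square target window (given as a list of rows): half of
    the target for odd Occlusion Types (OT), a quarter for even ones.
    A constant `value` stands in for a uniform occluding pattern; a
    textured pattern would copy pixels from another image instead."""
    h, w = len(window), len(window[0])
    out = [row[:] for row in window]
    rows = h if ot % 2 == 1 else h // 2  # half: full height x half width
    for i in range(rows):                # quarter: top-left quadrant
        for j in range(w // 2):
            out[i][j] = value
    return out
```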
of larger occlusions, such as the last two rows of Table 4, in which half of the target is occluded. The results in comparison to generative tracking (i.e., CONDENSATION) illustrate the ability of prediction-based methods to utilize an observation near the target, rather than on the target itself, in order to predict the target location. This is the case even when the target is partially occluded, so long as the observation is relevant. In the case of partial occlusions, it is also the case that methods that rely on the validation of a prediction, for example, by classifying the observation that is extracted at the predicted location [30], are likely to fail even when the prediction is accurate.

In Fig. 16, we give some insight into the ability of the algorithm to deal with partial occlusions. More specifically, in the case of partial occlusions, it is those observations that are extracted in regions neighboring the true target position that are deemed relevant by the classifier. These observations are therefore used to deliver reliable predictions of the target state. In Fig. 16a, we show the locations at which relevant (unreliable) observations were extracted in red (blue), in Fig. 16b, we show the corresponding probabilistic predictions (each ellipse represents a Gaussian), and in Fig. 16c, we show the final probabilistic prediction using five Gaussians. Note that our relevance determination scheme suppressed most of the observations that were close to the true target location. This indicates that a validation scheme using the trained RVM classifier would also be likely to fail. Similar results are obtained for alternative types of occlusions.

3.4 4D Template Tracking

Finally, we present results for the sequential estimation of the location, scale, and rotation of visual targets in image sequences. In Figs. 17 and 18, we show some representative frames for two image sequences in which the 4D state (translation, scale, and rotation) of a primary target is estimated and the position of a secondary target is inferred. For each of the sequences, a single Bayesian Mixture of Experts is trained for predicting the 4D state vector. Here, r has the dimensionality of the state and determines the location, scale, and orientation parameters of the warping required in order to obtain the observation y(r). For these experiments, 1,500 examples were used for training the BME-based predictor. The results for the "Soda can" [4] (Fig. 17) and the "Hand Held" (Fig. 18) image sequences

TABLE 4
Errors for Various Occlusion Types (OTs) at Various Spatial (SS) and Temporal (TS) Subsamplings

Fig. 16. Coupled prediction classification in occlusions (frame 469, Fig. 14, last column). (a) Location of relevant/unreliable observations (red/blue). (b) Corresponding probabilistic predictions. (c) Final prediction.
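The way Fig. 16 combines relevant and suppressed predictions can be illustrated with a 1D sketch of the mixture in (18): each observation contributes its (here Gaussian) prediction with weight p(z = 1 | y, r_i), while the remaining mass goes to a broad outlier component — here the large-variance Gaussian around the previous state, one of the outlier models discussed in Section 2.4. The concrete densities and parameter values are illustrative only:

```python
import math

def gauss(x, mu, sigma):
    """1D Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def moderated_posterior(x, predictions, relevances, x_prev=0.0, outlier_sigma=50.0):
    """Evaluate a relevance-moderated posterior at x. `predictions` is a
    list of (mean, std) pairs, one per observation; `relevances` holds the
    corresponding p(z = 1 | y, r_i). Irrelevant observations contribute a
    broad Gaussian around the previous state instead of their prediction."""
    p = 0.0
    for (mu, sigma), rel in zip(predictions, relevances):
        p += rel * gauss(x, mu, sigma)                   # relevant prediction
        p += (1.0 - rel) * gauss(x, x_prev, outlier_sigma)  # outlier model
    return p / len(predictions)
```

With high relevance, the posterior peaks at the prediction; as the relevance drops, the mass flattens toward the outlier component — which is how predictions from occluded observations are suppressed rather than allowed to drag the estimate away.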
TABLE 5
Mean Absolute and RMS (in Parentheses) Errors
for the “Head” and the “Soda Can” Sequence (SS ¼ 4)

Fig. 17. Soda Sequence [4] (ST ¼ 3): Estimate of the translation/scale
and rotation of an 25  25 window. The position of the lower corner of the
can is inferred from the estimation of the state of the target (central
square).
observations are obtained at scales and orientations similar
to those used in the training phase.
give an illustration of the accuracy at which the scale and
the rotation of the primary target are estimated. In order to
create larger motions, we have temporally subsampled the 4 CONCLUSIONS
“Soda can” sequence by a factor of 3. The “Hand Held” In this paper, we have presented a method for efficient and
image sequence (Fig. 18) contains significant motion blur at robust visual tracking. We propose a discriminative frame-
certain frames (e.g., third row and second column) as well
work in which multiple observations provide predictions of
as occasional partial occlusions of the target. Quantitative
the state of the target. Each prediction is moderated by the
results are presented in Table 5 which give the estimation
relevance of the corresponding observation as this is
error of the translation, scale, and rotation components.
determined by a probabilistic classification scheme. This is
The results in the first row of Table 5 refer to the “Head”
image sequence where the secondary target is the right eye of the depicted person.

3.5 Synopsis

In summary, the experimental results demonstrate the efficiency of the proposed scheme in both 2D and 4D tracking. In the case of 2D template tracking, a clear improvement is demonstrated in comparison to: 1) alternative methods that use a single observation (sPred and [32]), 2) alternative methods that use multiple observations without relevance determination (nPred), and 3) a generative particle filtering method [11]. The improvements were more pronounced in the case of partial occlusions and rapid motion. This is due to the ability of the proposed scheme to discard the predictions of observations that originate at occlusions or from areas too far from the target. The results in sequences with large rotations and scale changes indicate the need to maintain an estimate of the transform so that warped [...]

[...] the first work that utilizes multiple observations for discriminative tracking or uses a classification scheme to assess in advance the relevance of an observation (as opposed to the a posteriori validation of the prediction). We have illustrated the efficiency of our approach in a number of image sequences for the problem of 2D tracking and, in particular, its ability to deal with large motion and with partial occlusions. For future work, we intend to extend the proposed scheme for tracking in higher dimensional spaces, for example, for tracking 3D human pose under occlusions and background clutter.

The estimation scheme that we propose is rather general and is neither limited to the problem of tracking nor tied to the specific regression-based state prediction or RVM-based observation relevance determination. In principle, any estimation scheme that employs a regression-based predictor can be tied to an observation relevance/reliability estimator in order to utilize multiple observations and moderate their predictions according to their expected accuracy (the latter being determined by the observation relevance/reliability estimator). It would therefore be interesting to investigate the properties of the proposed framework in domains other than visual tracking.

Finally, in order to focus on the properties of the coupled regressor-classifier scheme, we restricted our analysis to the problem of single-target tracking without a motion model. In future work, we intend to extend the scheme to the problem of multiple interacting targets [21] and to use it with learned motion models.
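As a toy illustration of this general principle of moderating multiple regression-based predictions by their estimated relevance, consider the following minimal sketch (all names, such as `fuse_predictions`, are hypothetical; this is not the paper's actual estimator, which maintains a full posterior rather than a point estimate):

```python
import numpy as np

def fuse_predictions(predictions, relevances, eps=1e-12):
    """Combine per-observation state predictions, weighting each by the
    probability (supplied by a classifier) that its observation is relevant.

    predictions : (N, D) array, one predicted state per observation
    relevances  : (N,)   array of relevance probabilities in [0, 1]
    Returns the relevance-weighted mean state, shape (D,).
    """
    w = np.asarray(relevances, dtype=float)
    w = w / (w.sum() + eps)  # normalize; near-irrelevant observations get ~0 weight
    return w @ np.asarray(predictions, dtype=float)

# Three observations predict a 2D displacement; the last one is occluded,
# so the classifier assigns it low relevance and its prediction is suppressed.
preds = np.array([[1.0, 0.5], [1.2, 0.4], [9.0, -7.0]])
rel = np.array([0.9, 0.8, 0.05])
print(fuse_predictions(preds, rel))  # ≈ [1.32, 0.24]
```

In the paper's particle-filtering formulation, the moderated predictions enter a mixture approximation of the posterior rather than a single weighted mean; the snippet only illustrates how low relevance suppresses unreliable observations.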
Fig. 18. "Hand Held" sequence: Estimate of the translation/scale and rotation of a window centered around the center of the picture. The position of the lower corner of the picture is inferred from the estimate of the state of the target (central square).

APPENDIX A
FROM AN L-COMPONENT TO AN M-COMPONENT MIXTURE OF GAUSSIANS

In this Appendix, we will briefly outline a method for approximating a mixture of L Gaussians with a reduced M-component mixture. Our derivation builds on the method of Vlassis and Verbeek [28] for learning a Gaussian mixture from noisy data.
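For concreteness, the reduction procedure of this appendix can be sketched in NumPy as follows. This is an illustrative implementation of the variational updates given below (the function name and the simple deterministic initialization are this sketch's own choices, not the authors' code):

```python
import numpy as np

def reduce_mixture(p, X, C, M, n_iter=20):
    """Approximate an L-component Gaussian mixture (weights p, means X[l],
    covariances C[l]) by an M-component one, iterating the variational
    updates: responsibilities q_l(m), then weights, means, and covariances."""
    L, D = X.shape
    # Simple deterministic initialization: M components spread over the inputs.
    idx = np.linspace(0, L - 1, M).astype(int)
    alpha, mu, S = np.full(M, 1.0 / M), X[idx].copy(), C[idx].copy()
    for _ in range(n_iter):
        logq = np.empty((L, M))
        for m in range(M):
            Sinv = np.linalg.inv(S[m])
            diff = X - mu[m]                                   # (L, D)
            maha = np.einsum('ld,de,le->l', diff, Sinv, diff)  # Mahalanobis terms
            blur = np.einsum('de,led->l', Sinv, C)             # Tr(S_m^{-1} C_l)
            logdet = np.linalg.slogdet(S[m])[1]
            # Eq. (23): q_l(m) ∝ alpha_m N(x_l; mu_m, S_m) exp(-Tr(S_m^{-1} C_l)/2)
            logq[:, m] = np.log(alpha[m]) - 0.5 * (
                maha + blur + logdet + D * np.log(2 * np.pi))
        q = np.exp(logq - logq.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)                      # normalize over m
        w = p[:, None] * q                                     # p_l q_l(m)
        Nm = w.sum(axis=0)
        alpha = Nm / Nm.sum()                                  # eq. (24)
        mu = (w.T @ X) / Nm[:, None]                           # eq. (25)
        xxT = np.einsum('ld,le->lde', X, X) + C                # x_l x_l^T + C_l
        for m in range(M):
            # Eq. (26): weighted second moment minus mu_m mu_m^T
            S[m] = (np.einsum('l,lde->de', w[:, m], xxT) / Nm[m]
                    - np.outer(mu[m], mu[m]))
    return alpha, mu, S

# Reduce a 4-component mixture (two tight clusters) to 2 components.
p = np.full(4, 0.25)
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
C = np.tile(0.01 * np.eye(2), (4, 1, 1))
alpha, mu, S = reduce_mixture(p, X, C, M=2)
```

Setting all C_l to zero and p_l = 1/L in this sketch recovers the classical EM updates, in line with the remark at the end of the appendix.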
Let us denote by

f(x) = \sum_{l=1}^{L} p_l \, f(x \mid l)

the given L-component mixture, where f(x \mid l) = \mathcal{N}(x_l, C_l), for l = 1, \ldots, L, is a Gaussian with known mean x_l and covariance C_l. Let us also denote by

p(x) = \sum_{m=1}^{M} \alpha_m \, p(x \mid m)

the unknown M-component mixture, where p(x \mid m) = \mathcal{N}(\mu_m, S_m), for m = 1, \ldots, M, is a Gaussian whose mean \mu_m, covariance S_m, and mixing coefficient \alpha_m we seek to estimate.

As in [28], we minimize the Kullback-Leibler divergence between p(x) and f(x) by maximizing an objective function that is a lower bound of the negative of the KL-divergence. Formally, we maximize

F = \sum_{l=1}^{L} \int dx \, f(x \mid l) \left\{ \log p(x) - \mathrm{KL}_m \left[ q_l(m) \,\|\, p(m \mid x) \right] \right\},   (22)

where q_l(m), for l = 1, \ldots, L, are the auxiliary variational distributions that are introduced for bounding from below the negative of the KL-divergence between p(x) and f(x). The update equations are very similar to those of the EM algorithm. More specifically, the variational distributions q_l are updated as

q_l(m) \propto \alpha_m \, p(x_l \mid m) \exp\left\{ -\frac{1}{2} \mathrm{Tr}\left( S_m^{-1} C_l \right) \right\},   (23)

while the mixture components are updated as

\alpha_m = \frac{\sum_{l=1}^{L} p_l q_l(m)}{\sum_{m=1}^{M} \sum_{l=1}^{L} p_l q_l(m)},   (24)

\mu_m = \frac{\sum_{l=1}^{L} p_l q_l(m) \, x_l}{\sum_{l=1}^{L} p_l q_l(m)},   (25)

S_m = \frac{\sum_{l=1}^{L} p_l q_l(m) \left( x_l x_l^T + C_l \right)}{\sum_{l=1}^{L} p_l q_l(m)} - \mu_m \mu_m^T.   (26)

In the case that the Gaussians have zero covariance matrices (i.e., C_l = 0) and p_l = 1/L, we obtain the update equations of the classical EM algorithm for learning the mixture parameters from the L data points x_l. For arbitrary C_l and p_l = 1/L, we obtain the update equations of [28] for learning the mixture parameters from L noisy data points (with C_l being the covariance matrix of the lth point).

ACKNOWLEDGMENTS

The work of Ioannis Patras is partially supported by the Engineering and Physical Sciences Research Council, research grant EP/G033935/1. The work of Edwin Hancock was supported by a Royal Society Wolfson Research Merit Award and the EU FET project SIMBAD (213250).

REFERENCES

[1] A. Agarwal and B. Triggs, "A Local Basis Representation for Estimating Human Pose from Cluttered Images," Proc. Asian Conf. Computer Vision, pp. 50-59, 2006.
[2] A. Agarwal and B. Triggs, "Recovering 3D Human Pose from Monocular Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 44-58, Jan. 2006.
[3] S. Avidan, "Support Vector Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 1064-1072, Aug. 2004.
[4] M. Black and A. Jepson, "Eigentracking: Robust Matching and Tracking of Articulated Objects Using a View-Based Representation," Proc. European Conf. Computer Vision, pp. 329-342, Apr. 1996.
[5] A.M. Buchanan and A.W. Fitzgibbon, "Combining Local and Global Motion Models for Feature Point Tracking," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[6] C. Sminchisescu and A. Kanaujia, "BM³E: Discriminative Density Propagation for Visual Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 11, pp. 2030-2044, Nov. 2007.
[7] R. Collins, Y. Liu, and M. Leordeanu, "Online Selection of Discriminative Tracking Features," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1631-1643, Oct. 2005.
[8] T. Cootes, G. Edwards, and C. Taylor, "Active Appearance Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 681-685, June 2001.
[9] J. Deutscher, A. Davison, and I. Reid, "Automatic Partitioning of High Dimensional Search Spaces Associated with Articulated Body Motion Capture," Proc. Int'l Conf. Computer Vision and Pattern Recognition, Dec. 2001.
[10] B. Horn and B. Schunck, "Determining Optical Flow," Artificial Intelligence, vol. 17, nos. 1-3, pp. 185-203, Aug. 1981.
[11] M. Isard and A. Blake, "Condensation—Conditional Density Propagation for Visual Tracking," Int'l J. Computer Vision, vol. 29, no. 1, pp. 5-28, 1998.
[12] M. Isard and A. Blake, "A Mixed-State Condensation Tracker with Automatic Model-Switching," Proc. IEEE Int'l Conf. Computer Vision, pp. 107-112, 1998.
[13] A.D. Jepson, D.J. Fleet, and T.F. El-Maraghi, "Robust Online Appearance Models for Visual Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1296-1311, Oct. 2003.
[14] M. Jordan and R.A. Jacobs, "Hierarchical Mixtures of Experts and the EM Algorithm," Neural Computation, vol. 6, pp. 181-214, 1994.
[15] F. Jurie and M. Dhome, "Hyperplane Approximation for Template Matching," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 996-1000, July 2002.
[16] B. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 121-130, 1981.
[17] P. Meer, D. Mintz, and A. Rosenfeld, "Robust Regression Methods for Computer Vision: A Review," Int'l J. Computer Vision, vol. 6, no. 1, pp. 59-70, 1991.
[18] K. Okuma, A. Taleghani, N.D. Freitas, J.J. Little, and D.G. Lowe, "A Boosted Particle Filter: Multitarget Detection and Tracking," Proc. European Conf. Computer Vision, pp. 28-39, May 2004.
[19] M. Pantic and I. Patras, "Dynamics of Facial Expression: Recognition of Facial Actions and Their Temporal Segments from Face Profile Image Sequences," IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 36, no. 2, pp. 433-449, Apr. 2006.
[20] I. Patras and E. Hancock, "Regression Tracking with Data Relevance Determination," Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2007.
[21] I. Patras and M. Pantic, "Particle Filtering with Factorized Likelihoods for Tracking Facial Features," Proc. IEEE Int'l Conf. Face and Gesture Recognition, pp. 97-102, May 2004.
[22] M. Pitt and N. Shephard, "Filtering via Simulation: Auxiliary Particle Filtering," J. Am. Statistical Assoc., vol. 94, no. 446, pp. 590-599, 1999.
[23] L. Sigal, S. Bhatia, S. Roth, M.J. Black, and M. Isard, "Tracking Loose-Limbed People," Proc. Int'l Conf. Computer Vision and Pattern Recognition, 2004.
[24] E. Simoncelli, E. Adelson, and D. Heeger, "Probability Distributions of Optical Flow," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 310-315, June 1991.
[25] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas, "Discriminative Density Propagation for 3D Human Motion Estimation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[26] M. Tipping, "The Relevance Vector Machine," Advances in Neural Information Processing Systems, Morgan Kaufmann, 2000.
[27] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[28] N. Vlassis and J. Verbeek, "Gaussian Mixture Learning from Noisy Data," Technical Report IAS-UVA-04-01, Informatics Inst., Univ. of Amsterdam, 2004.
[29] S. Waterhouse, D. MacKay, and T. Robinson, "Bayesian Methods for Mixtures of Experts," Advances in Neural Information Processing Systems, vol. 8, pp. 351-357, MIT Press, 1996.
[30] O. Williams, A. Blake, and R. Cipolla, "Sparse Bayesian Regression for Efficient Visual Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1292-1304, Aug. 2005.
[31] Y. Wu, G. Hua, and T. Yu, "Switching Observation Models for Contour Tracking in Clutter," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 295-302, June 2003.
[32] K. Zimmermann, J. Matas, and T. Svoboda, "Tracking by an Optimal Sequence of Linear Predictors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 677-692, Apr. 2009.

Ioannis (Yiannis) Patras received the BSc and MSc degrees in computer science from the Computer Science Department, University of Crete, Heraklion, Greece, in 1994 and 1997, respectively, and the PhD degree from the Department of Electrical Engineering, Delft University of Technology, The Netherlands, in 2001. He has been a postdoctoral researcher in the area of multimedia analysis at the University of Amsterdam, and a postdoctoral researcher in the area of vision-based human-machine interaction at TU Delft. Between 2005 and 2007, he was a lecturer in computer vision in the Department of Computer Science, University of York, United Kingdom. Since 2007, he has been a lecturer in computer vision in the Department of Electronic Engineering, Queen Mary, University of London. He has served on the organizing committees of IEEE SMC '04 and Face and Gesture Recognition '08, and was the general chair of WIAMIS '09. He is an associate editor of the Image and Vision Computing Journal and the Journal of Multimedia. His research interests lie in the areas of computer vision and pattern recognition, with emphasis on motion analysis and its applications in multimedia data management, multimodal human-computer interaction, and visual communications. Currently, he is interested in the analysis of human motion, including the detection, tracking, and understanding of facial and body gestures. He is a member of the IEEE and the IEEE Computer Society.

Edwin R. Hancock received the BSc degree in physics, the PhD degree in high-energy physics, and the DSc degree from the University of Durham in 1977, 1981, and 2008, respectively. From 1981 to 1991, he worked as a researcher in the fields of high-energy nuclear physics and pattern recognition at the Rutherford Appleton Laboratory (now the Central Research Laboratory of the Research Councils). During this period, he also held adjunct teaching posts at the University of Surrey and the Open University. In 1991, he moved to the University of York as a lecturer in the Department of Computer Science, where he has held a chair in computer vision since 1998. He leads a group of some 25 faculty, research staff, and PhD students working in the areas of computer vision and pattern recognition. His main research interests are in the use of optimization and probabilistic methods for high and intermediate-level vision. He is also interested in the methodology of structural and statistical pattern recognition. He is currently working on graph matching, shape-from-X, image databases, and statistical learning theory. His work has found applications in areas such as radar terrain analysis, seismic section analysis, remote sensing, and medical imaging. He has published about 135 journal papers and 500 refereed conference publications. He was awarded the Pattern Recognition Society Medal in 1991 and an Outstanding Paper Award in 1997 by the journal Pattern Recognition. He has also received best paper prizes at CAIP '01, ACCV '02, ICPR '06, BMVC '07, and ICIAP '09. In 1998, he became a fellow of the International Association for Pattern Recognition. He is also a fellow of the Institute of Physics, the Institution of Engineering and Technology, and the British Computer Society. He has been a member of the editorial boards of the journals IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Recognition, Computer Vision and Image Understanding, and Image and Vision Computing. In 2006, he was appointed the founding editor-in-chief of the IET Computer Vision Journal. He was the conference chair for BMVC '94, the track chair for ICPR '04, and an area chair at ECCV '06 and CVPR '08, and in 1997, he established the EMMCVPR workshop series. In 2009, he was awarded a Royal Society Wolfson Research Merit Award.