Coupled Prediction Classification for Robust Visual Tracking
Abstract—This paper addresses the problem of robust template tracking in image sequences. Our work falls within the discriminative
framework in which the observations at each frame yield direct probabilistic predictions of the state of the target. Our primary
contribution is that we explicitly address the problem that the prediction accuracy for different observations varies, and in some cases,
can be very low. To this end, we couple the predictor to a probabilistic classifier which, when trained, can determine the probability that
a new observation can accurately predict the state of the target (that is, determine the “relevance” or “reliability” of the observation in
question). In the particle filtering framework, we derive a recursive scheme for maintaining an approximation of the posterior probability
of the state in which multiple observations can be used and their predictions moderated by their corresponding relevance. In this way,
the predictions of the “relevant” observations are emphasized, while the predictions of the “irrelevant” observations are suppressed.
We apply the algorithm to the problem of 2D template tracking and demonstrate that the proposed scheme outperforms classical
methods for discriminative tracking, both for motions of large magnitude and for partial occlusions.
1 INTRODUCTION

VISION-BASED tracking is one of the fundamental and most challenging low-level problems in computer vision. Formally, it is defined as the problem of estimating the state x (e.g., position, scale, rotation, or 3D pose) of a target, given a set of noisy image observations Y = {…, y⁻, y} up to the current frame.¹ Usually, an estimate of the state at each frame is the location of a minimum of a cost function or, in the probabilistic framework, the location of a maximum of the posterior p(x | Y). Alternatively, a representation of the posterior p(x | Y) can be maintained for each frame of the image sequence.

1.1 Literature Review
The large number of methods that have been proposed in the last decades for maintaining a representation or for finding a maximum of the posterior p(x | Y) fall into two main categories. In the first category belong the generative methods (e.g., [11], [9]). Such methods require the inversion of the posterior p(x | Y) using the Bayes rule and the evaluation of the likelihood p(y | x) at certain sample states x. An important drawback is that at least some evaluations need to be performed at sample points in the state space that are close to the "true" state, and therefore, a large number of solutions need to be examined. Detection-based methods [27] that exhaustively search all image locations for the presence of a target fall into this category.

In order to reduce the computational complexity and to deal with likelihoods with multiple modes, a number of methods have been proposed for selecting candidate solutions in the generative tracking framework. A common choice utilizes a motion model (e.g., [11]) for the proposal distribution, that is, for the distribution from which the candidate solutions will be sampled. Typical choices for the motion model range from general constant velocity/acceleration models to higher order models whose parameters can be learned from training data [11], [12] or online [5]. A motion model assists the estimation process so long as the temporal evolution of the state follows it. However, the residual between the model-based temporal prediction and the true target state can be significant in the general case of irregular motion, novel motion, or a moving camera. Other methods utilize the fact that in certain domains, the true target state may lie on a manifold in the state space and therefore (probabilistic) priors may exist on the state of the target. This is the case when the state encodes the position of multiple interacting targets (such as facial points [21]) or the position of the components of a constrained articulated structure such as the human body [23]. Alternative methods utilize the observations in the current frame in order to sample from areas where the likelihood is expected to be higher. This may be done by performing a two-stage propagation [22], or by using mixtures of learned detectors and dynamic models (e.g., [18]). In the latter case, a target detector needs to be applied at every image location and at various scales.

The second category consists of the discriminative (or prediction-based) methods. In contrast to generative methods, in discriminative tracking, an observation y delivers a direct prediction of the hidden state x. This alleviates the need for a good proposal distribution and multiple

---
1. In this work, we denote with a⁻ an observation or (an instantiation of) a random variable a in the previous time instant. For example, we denote with y the observation at the current frame and with y⁻ the observation in the frame before.

. I. Patras is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Road, E1 4NS London, UK. E-mail: i.patras@elec.qmul.ac.uk.
. E.R. Hancock is with the Department of Computer Science, University of York, YO10 5DD York, UK. E-mail: erh@cs.york.ac.uk.

Manuscript received 16 May 2008; revised 21 Oct. 2008; accepted 20 Aug. 2009; published online 6 Oct. 2009.
Recommended for acceptance by P. Perez.
For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2008-05-0291.
Digital Object Identifier no. 10.1109/TPAMI.2009.175.

0162-8828/10/$26.00 © 2010 IEEE  Published by the IEEE Computer Society
1554 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010
. We explicitly address the problem of the determination of the relevance/reliability of an observation to the state estimation process by learning in a supervised way the underlying conditional probability distribution.
. We devise a probabilistic framework that allows multiple observations y(r) to contribute to the prediction of the state of the target according to their corresponding relevance/reliability.
. We make explicit the relation between our framework and alternative discriminative and generative estimation/tracking schemes. More specifically, we show that under certain modeling assumptions (simplifications), our estimation scheme is practically equivalent to classical generative and discriminative estimation schemes.

The remainder of the paper is organized as follows: In Section 2, we provide an outline of the proposed discriminative tracking framework with data relevance determination. In Section 2.1, we briefly describe the Bayesian Mixture of Experts predictor, and in Section 2.2, we present our method for observation relevance determination. Section 2.3 presents a procedure which, given a predictor, selects an appropriate classifier, and in Section 2.4, we show the relation of the proposed scheme with alternative generative and discriminative tracking methods. Section 3 presents experimental results. Finally, in Section 4, we give some conclusions and directions for future work. An early version of the proposed scheme appears in [20].

2 PREDICTION-BASED TRACKING WITH RELEVANCE DETERMINATION

Filtering, such as Kalman filtering or particle filtering, has been the dominant framework for recursive estimation of the conditional probability of the unknown state x given a set of observed random variables Y = {…, y⁻, y}. In the discriminative filtering framework (Fig. 3a), the filtered density can be derived as [25]:

  p(x | Y) = ∫ dx⁻ p(x⁻ | Y⁻) p(x | x⁻, y),   (1)

where y (y⁻) is the observation at the current (previous) frame and x (x⁻) is the state in the current (previous) frame, respectively. Similarly, Y is the set of observations up to the current frame and Y⁻ the set of observations up to the previous frame.

The derivation of (1) ignores the fact that for certain problems, different parts of the observation y can give different predictions of the state of the target. For example, in [30], for 2D tracking where the evidence y is an image frame, the prediction of the state of the target (e.g., its 2D location) is based on the data y(r) extracted from a single window centered at a position r. In the absence of a motion model, r is the estimated position of the target in the previous frame, that is, r = x̂⁻. However, using data from a single window disregards the information that is available at other positions r. Similarly, for 3D tracking [2], [25], a single feature vector is extracted from the object silhouette.

Fig. 3. Graphical models (a) for classical discriminative tracking and (b) for regression tracking with relevance determination.

On the other hand, in the generative particle filtering framework for 2D tracking, it is common practice that several parts of the observation are examined. This is achieved by using multiple samples (particles) r and by assigning to each particle a weight π(y, r) proportional to the likelihood p(y | r). The particles r are sampled from p(x | Y⁻) using the transition probability p(x | x⁻) and, most usually, the sampled r determines how the observation y will be utilized. The latter means that the likelihood p(y | r) is modeled as a function of y(r), that is, p(y | r) = p(y(r); c), where c are some model parameters. In the simplest case, a number of measurements y(r) at positions r (r ∈ {r₁, …, r_R}) around the location of the target in the previous frame are utilized.² Given the above, the posterior is empirically approximated using a set of weighted particles, that is, a set of pairs {(π(y, r₁), r₁), …, (π(y, r_R), r_R)}. Formally,

  p(x | Y) ≈ (1/Z) Σ_{r=r₁}^{r_R} π(y, r) δ(x − r),   (2)

where Z is a scaling parameter and δ(·) is the Kronecker delta function.

Here, we propose a discriminative particle filtering method that utilizes the fact that several parts of the observation can yield predictions of the state of the target. We do so by introducing a random variable r that determines which parts, or in general how, the observation y will be used. Without loss of generality, in the derivations that follow, we will assume that r has the dimensionality and the physical meaning of the hidden state x. For example, in the case of 2D template tracking where x ∈ R², the random variable r ∈ R² will determine the centers of the windows/patches at which we will extract observations y(r) that will give predictions of x. In general, r will be used for obtaining a set of candidate observations y(r) and does not need to have the dimensionality of x. We will also condition r on x⁻ as we expect that the previous state can be sufficiently informative on how candidate observations can be obtained. Subsequently, we introduce a binary variable z and denote with p(z = 1 | y, r) the probability that the observation y(r) is relevant for the prediction of the unknown state x. The dependencies of the variables are depicted in Fig. 3b, where y is observed and

---
2. In general, in the case that the state x is not only a 2D displacement, obtaining y(r) requires warping.
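The relevance-moderated filtering just outlined can be sketched in a few lines. This is a toy illustration, not the authors' implementation: `predict_state` and `relevance` are hypothetical stand-ins for the trained BME regressor and RVM classifier described in Sections 2.1 and 2.2, and the posterior is summarized by a weighted mean rather than a full mixture.

```python
import random

TRUE_STATE = (5.0, 5.0)  # hypothetical ground-truth target position

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def predict_state(r):
    # Stand-in for the BME regressor: windows extracted near the target
    # predict its state accurately; far-away windows predict poorly.
    noise = 0.1 if dist(r, TRUE_STATE) < 2.0 else 3.0
    return tuple(t + random.gauss(0.0, noise) for t in TRUE_STATE)

def relevance(r):
    # Stand-in for the RVM classifier p(z = 1 | y, r): high for windows
    # near the target, low otherwise.
    return 1.0 if dist(r, TRUE_STATE) < 2.0 else 0.05

def track_step(x_prev, n_samples=50, width=4.0):
    """One relevance-moderated filtering step: sample window centers r
    around the previous state, let each one predict the state, and
    weight every prediction by its relevance."""
    total_w, est = 0.0, [0.0, 0.0]
    for _ in range(n_samples):
        r = tuple(xp + random.uniform(-width, width) for xp in x_prev)
        w = relevance(r)          # p(z = 1 | y, r)
        pred = predict_state(r)   # mean of the prediction from y(r)
        est[0] += w * pred[0]
        est[1] += w * pred[1]
        total_w += w
    return (est[0] / total_w, est[1] / total_w)

random.seed(0)
estimate = track_step(x_prev=(4.0, 6.0))
```

Predictions from irrelevant windows still enter the sum, but with weight 0.05 rather than 1.0, so the estimate is dominated by the relevant observations.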
TABLE 1: Discriminative Filtering with Data Relevance Determination
TABLE 2: Modeling Choices

of Gaussians to an M-component mixture of Gaussians. O(RJ) is the complexity of the kernel-based Relevance Vector Machine classifier (Section 2.2), where J is the number of the support vectors.

2.1 Bayesian Mixture of Experts for Regression
In what follows, we will describe a method that, given an observation y(r) and the target state at the previous frame x⁻, yields a probabilistic prediction of the state x at the current frame. For notational simplicity, let us here denote with y the Cartesian pair (y(r), x⁻).

Our method follows the work of Sminchisescu et al. [25] and uses the Bayesian Mixtures of Experts (BME) for regression. Given an observation y, the BME delivers a probabilistic prediction of x as a mixture of Gaussians. The rationale behind this choice over alternative regression methods (e.g., RVMs [26]) is that the BME can successfully model predictive distributions that are multimodal in character. Such distributions often arise in the case of 3D tracking due to, for example, front/back and left/right ambiguity [25], [2], [23]. They are also expected to arise in the case of 2D tracking due to the aperture problem [10]. However, this choice is not restrictive, and any linear [32] or nonlinear regression method [8] could be used as an alternative. More generally, any method that can deliver a prediction of the state x given an observation y can be used. In the case of 2D tracking, the Lucas-Kanade method [16] could be used to make an estimate of the 2D target location x by delivering an estimate of the displacement vector x with respect to the position at which the observation y(r) was extracted. Similarly, the method of Simoncelli et al. [24] could deliver a probabilistic prediction (a Gaussian) for x.

The (Hierarchical) Mixtures of Experts, which was introduced by Jordan and Jacobs [14], is a method for regression and classification that relies on a soft probabilistic partitioning of the input space. This is determined by gating coefficients g_i(y) (one for each expert i) that are input dependent and have a probabilistic interpretation; that is, the coefficients of the siblings at each level of the hierarchy sum up to one. The prediction of each expert i is then moderated by the corresponding gating coefficient. Formally, in the simple case of a flat hierarchy for regression,

  p(x | y) = Σ_{i=1}^{K} g_i(y) f_i(x | y),   (7)

where f_i(x | y) is a probability density function, usually a Gaussian centered around the prediction of the expert i. In the simple case that linear experts are used:

  g_i(y) = exp(λ_iᵀ y) / Σ_j exp(λ_jᵀ y),   (8)

and follow a variational approach for the estimation of their posterior distributions. As in [29], we make a Laplace approximation and estimate the mode and the variance of the posteriors, which (with a slight abuse of notation) we denote here as (w_i, Σ_{w_i}) and (λ_i, Σ_{λ_i}). In the process, we also estimate the optimal value for the hyperparameters that are associated with the noise covariance S_i of the prediction of expert i (see [29] for details).

In [29], a procedure is described for scalar regression. In the case where the target is a vector x with dimensionality D, we may train D different Mixtures of Experts. Here, we have extended the methodology to experts that have multidimensional output (i.e., f_i(x | y) is a multidimensional Gaussian with diagonal noise covariance S_i). In this case, w_i is a matrix with the number of rows equal to the dimensionality of y and the number of columns equal to the dimensionality of x.

For prediction, we marginalize over the parameters and hyperparameters as in [29]. For a new observation y, the predictive distribution is a mixture of Gaussians given by

  p̂(x | y) = Σ_{i=1}^{K} g_i(y) N(x; w_iᵀ y, S′_i),   (10)

where the kth element of the diagonal covariance matrix S′_i, denoted with S′_ik, is given by

  S′_ik = yᵀ Σ_{w_ik} y + S_ik,   (11)

where S_ik is the corresponding element in the covariance matrix of the ith expert. Alternatively, we may straightforwardly use (7), or use only the prediction of the expert with the highest gating coefficient g_i, or approximate (10) with a single Gaussian.

For the problem of 2D visual tracking, we aim to estimate the transformation x (e.g., translation, rotation, and scaling) that a visual target undergoes in an image sequence. We train the BME in a supervised way with pairs (y(x), x) in which the observations y(x) are produced by synthetically transforming (e.g., translating) the visual target with the transformation x. In this case, we choose to ignore the state x⁻ at the previous frame when training the BME. Subsequently, in the test phase, an observation will give a probabilistic prediction according to (10).

2.2 Data Relevance Determination
For the determination of the relevance/reliability p(z | y, r) of an observation y(r), we use a classification scheme with the RVMs. The goal is to obtain an a priori assessment of whether the probabilistic prediction p(x | y(r)) (10) of the state of the target is expected to be good. To this end, we train an RVM classifier in a supervised way with a set of positive examples that yield good predictions and with a set of negative examples that yield bad predictions. Let us denote with sigm the sigmoid function, with {ỹ_i} the
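A minimal sketch of the mixture-of-experts predictive density of (7), (8), and (10): the gates are a softmax over linear scores and each expert contributes a Gaussian around its linear prediction. The expert parameters below are made up for illustration, and the variance correction of (11) is folded into fixed per-expert variances.

```python
import math

def softmax_gates(y, lambdas):
    # Gating coefficients g_i(y) of (8): a softmax over linear scores.
    scores = [sum(l_k * y_k for l_k, y_k in zip(lam, y)) for lam in lambdas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gauss_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bme_predictive(x, y, lambdas, weights, variances):
    """Predictive density p(x | y) of (10) for a scalar target:
    each linear expert predicts mean w_iᵀ y; the gates mix them."""
    gates = softmax_gates(y, lambdas)
    density = 0.0
    for g, w, var in zip(gates, weights, variances):
        mean = sum(w_k * y_k for w_k, y_k in zip(w, y))
        density += g * gauss_pdf(x, mean, var)
    return density

# Two made-up experts over a 2D input y, e.g., one per motion mode.
lambdas = [(1.0, 0.0), (-1.0, 0.0)]   # gate parameters (hypothetical)
weights = [(2.0, 0.0), (-2.0, 0.0)]   # expert regression weights (hypothetical)
variances = [0.5, 0.5]

p = bme_predictive(2.0, (1.0, 1.0), lambdas, weights, variances)
```

Because the gates depend on the input, different observations activate different experts, which is what lets the mixture represent multimodal predictive densities.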
  p(x | Y) = [p(y | x) / p(y | Y⁻)] p(x | Y⁻)   (14)
           = [p(y | x) / p(y | Y⁻)] ∫ dx⁻ p(x | x⁻) p(x⁻ | Y⁻).   (15)
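The generative recursion (15) is the update behind the standard bootstrap particle filter: propagate samples through the transition p(x | x⁻) and weight them by the likelihood p(y | x), with the term p(y | Y⁻) canceling in the normalization. Below is a sketch under an assumed 1D random-walk transition and a synthetic Gaussian likelihood, not any particular tracker from the paper.

```python
import math
import random

def bootstrap_pf_step(particles, observation, trans_std=1.0, obs_std=0.5):
    """One step of (15): sample from the transition p(x | x-) and
    weight each propagated particle by the likelihood p(y | x)."""
    new_particles, weights = [], []
    for x_prev in particles:
        x = x_prev + random.gauss(0.0, trans_std)   # transition sample
        w = math.exp(-(observation - x) ** 2 / (2 * obs_std ** 2))
        new_particles.append(x)
        weights.append(w)
    total = sum(weights)
    weights = [w / total for w in weights]          # p(y | Y-) cancels here
    # Multinomial resampling so the particle set represents p(x | Y).
    return random.choices(new_particles, weights=weights, k=len(particles))

random.seed(1)
particles = [random.uniform(-5.0, 5.0) for _ in range(500)]
for y in [0.4, 0.6, 0.5]:
    particles = bootstrap_pf_step(particles, y)
posterior_mean = sum(particles) / len(particles)
```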
Fig. 6. Mean error versus the fraction q of positives for different classifiers. The curve e_o corresponds to the ideal classifier.

We make the selection according to the 2 test between e_o and e, that is, we select a curve that lies on the lower right part of the plot. Such a classifier generates a large number of positives for a given error level, a property that is important in our estimation scheme, which might rely on few observations only. As Fig. 6 reveals, a number of classifiers (solid lines) have similar ranking properties. Among classifiers with similar ranking properties (within a 5 percent margin), we favor the one that delivers, on average, the lower weighted error under the weighing of the positives according to the probabilistic weighing scheme.

2.4 Relation to Generative and Discriminative Tracking
In this section, we will make explicit the relation between the proposed tracking framework on the one hand, and both discriminative and generative tracking methods on the other hand. More specifically, we will show that under certain modeling assumptions, we derive estimation schemes that are practically equivalent to classical generative and discriminative estimation schemes.

The relation to classical discriminative methods (e.g., [25]) is rather straightforward. In the case that a single observation y(r) is used and the data relevance p(z = 1 | y, r) is set to one (i.e., the single observation is considered relevant/reliable), our framework reduces to the discriminative tracking framework of [25]. Formally, (1) can be derived from (3) when the following three modeling choices are made:

In the generative framework, a number of candidate solutions r_i are sampled from p(x | Y⁻) and they are subsequently weighted using the likelihood³ p(y | r_i). If we denote with π(y, r) the weight that is assigned to sample r, then an approximation of the posterior is given by (2). Recall that in our framework, the approximation of the posterior is given by (4). Also, observe the relation between (2) and (4): In the generative case, the mass of the posterior is on the samples r_i, while, in the discriminative case, the mass of the posterior is on the predictions p(x | z = 1, x⁻, y, r_i) (let us for the moment ignore the "outlier predictions" p(x | z = 0, x⁻, y, r_i)). Therefore, if we choose the predictors such that p(x | z = 1, x⁻, y, r_i) is equal to δ(x − r_i), and let r be sampled from p(x | Y⁻), then the estimation schemes of the two frameworks will be equivalent. There are just two differences in the methods. First, in our framework, once a sample r is obtained in this way, it should be assigned a weight equal to p(z = 1 | r, y). By contrast, in the generative framework, the sample r should be assigned a weight equal to p(y | r)/p(y | Y⁻), or a weight equal to p(y | r), since the term p(y | Y⁻) is independent of r and therefore is canceled in the normalization. Second, in our case, the prediction of the outliers (i.e., observations that are irrelevant/unreliable) is made explicit in the form of p(x | z = 0, x⁻, y, r_i).

Formally, under the modeling choice p(x | z = 1, x⁻, y, r) = δ(x − r), (3) becomes

  p(x | Y)   (16)
    = ∫ dx⁻ p(x⁻ | Y⁻) ∫ dr p(r | x⁻) ∫ dz p(x | z, x⁻, y, r) p(z | y, r)   (17)
    = p(z = 1 | y, x) ∫ dx⁻ p(x | x⁻) p(x⁻ | Y⁻)
      + ∫ dx⁻ p(x⁻ | Y⁻) ∫ dr p(r | x⁻) p(x | z = 0, x⁻, y, r) p(z = 0 | y, r).   (18)
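The correspondence between the two weighting schemes can be made concrete in a few lines: under the delta-prediction choice, both frameworks place the posterior mass on the same samples and differ only in the per-sample weight, p(y | r) versus p(z = 1 | y, r). Both the likelihood and the classifier below are synthetic stand-ins, not quantities from the paper.

```python
import math
import random

def likelihood(y_obs, r):
    # Synthetic generative likelihood p(y | r).
    return math.exp(-(y_obs - r) ** 2 / 2.0)

def relevance(y_obs, r):
    # Synthetic classifier p(z = 1 | y, r): a logistic in the
    # observation-to-sample discrepancy rather than a density.
    return 1.0 / (1.0 + math.exp(4.0 * (abs(y_obs - r) - 1.0)))

def normalize(ws):
    total = sum(ws)
    return [w / total for w in ws]

random.seed(2)
y_obs = 0.0
samples = [random.uniform(-3.0, 3.0) for _ in range(200)]  # from p(x | Y-)

gen_w = normalize([likelihood(y_obs, r) for r in samples])  # generative
dis_w = normalize([relevance(y_obs, r) for r in samples])   # discriminative

# Both approximations put their mass on the same samples; only the
# weighting differs, so the resulting point estimates are comparable.
gen_mean = sum(w * r for w, r in zip(gen_w, samples))
dis_mean = sum(w * r for w, r in zip(dis_w, samples))
```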
framework, the variable z takes the interpretation of the class of the observation y, and p(z = 1 | y, x) is the probability that the observation y belongs to the target class. Essentially, a classification scheme is used within the particle filtering framework. This is similar to other works in classification-based tracking [3], [7] and to more classical target detectors [27]. Therefore, our scheme offers a formal framework in which classification-based approaches can be used for recursive estimation of the posterior density.

The second main difference of the degenerate case of our scheme with the generative particle filtering tracking (15) is that in our case, the prediction of the outliers becomes explicit in the form of p(x | z = 0, x⁻, y, r_i), that is, the second term of (18). This bears similarities to generative methods that introduce an occlusion process and condition the likelihood on it. In [31], two likelihood models are defined, one given that the target is occluded and one given that the target is visible. However, while in [31] the occlusion state is inferred, in our case the relevance of the observation is determined by the probabilistic classifier. Also, note that in the general case of our method, the relevance of an observation is not necessarily related to the degree of occlusion but rather to the degree at which a reliable prediction can be obtained from the observation in question. Three possible choices for the prediction given that the observation is irrelevant/unreliable are the following. The first is to use a Gaussian with large variance around x⁻. A second choice is to use a uniform distribution ε, where ε << 1. The third is to use a prior p(x) in case that it is available [19]. Since p(x | z = 0, x⁻, y, r_i) depends on y, we can easily derive classical outlier processes based on robust statistics [17], or use a Gaussian with large variance as in [13]. Formally, the second term, which models the prediction density of the "unreliable observations" under two reasonable models of p(x | z = 0, x⁻, y, r), becomes

  (1/Z) p(z = 0 | y, x) p(x) p(x | Y⁻)   for p(x | z = 0, x⁻, y, r) = (1/Z) p(x) δ(x − r),   (20)

and

  ε p(z = 0 | y, x) p(x | Y⁻)   for p(x | z = 0, x⁻, y, r) = ε δ(x − r),   (21)

where Z is a scaling parameter and ε << 1.

3 EXPERIMENTAL RESULTS

We have performed a number of experiments in order to illustrate the performance of the proposed method under different conditions, including occlusions, fast motion, and moderate deformations. Here, we present both quantitative and qualitative results for image sequences that are annotated by hand, as well as comparative results with alternative state-of-the-art methods. More specifically, we compare our Coupled Prediction-Classification algorithm (CPredC) to discriminative tracking when a single observation is used (e.g., [25], [30]) and to the simplified version of the proposed algorithm in which the data relevance determination mechanism is discarded. We do not use any dynamic model or temporal filtering, in order to judge the performance when large deviations from the motion model are present. In addition, we have compared our method with a particle filtering algorithm in the generative framework.

For each of our experiments, and in order to reduce the computational complexity during training, we reduce the data dimensionality by applying Principal Component Analysis (PCA) to the data with which the BMEs are trained. PCA gives some marginal improvements, and omitting it does not lead to a significant degradation in the performance. Before being used, the training data for the BME, the training data for the RVM, and all test data are projected to the new space that is spanned by the leading N eigenvectors that were extracted using PCA. We choose N such that 95 percent of the variance is retained, a choice that leads to values of N between 40 and 50 for the "CD" and the "Head" sequences.

In order to deal with illumination changes, we normalize each observation by the average intensity of the window at which it was extracted. The normalization is performed before learning the PCA transform (i.e., during training) and before applying it (i.e., during tracking). In our experiments, we tracked windows of size ranging from 15×15 to 25×25 pixels. For training the BME, we used 900 Cartesian pairs (y(x), x) in which the observations y(x) are produced by artificially transforming (e.g., translating) the visual target with the transformation x. The examples that were used for training the RVM were generated using transformations with a range two to three times that used for training the BME. In order to reduce the complexity of RVM learning, we apply k-means clustering to the set of candidate observations and train the RVM using the cluster centers. The class identity of a cluster (positive or negative) is determined by the class identity of the majority of the examples that belong to it. The number of clusters is considered as a parameter of the classification scheme and is determined automatically using the criterion of Section 2.3. For all the experiments, we use five Gaussians (i.e., M = 5) and 50 samples r (i.e., R = 50), unless stated otherwise.

3.1 Artificial Displacements and Noise
We first present results for estimating the location of a facial feature (the corner of an eye) under artificially generated displacements. Both the regressor and the classifier were trained on data from the first frame of the sequence. Here, 1) a 17×17 target window was used, 2) the BME was trained with displacements of up to 17 pixels, and 3) the RVM was trained with displacements of up to 34 pixels. The tests were performed on frames 157 and 358 (Fig. 11). In frame 157, some degree of both deformation and illumination change is present. In frame 358, the target is partially occluded. For these frames, artificial displacements were simulated by sampling x⁻ at a distance equal to the "true displacement" from the true target location. During the estimation phase, 50 samples of r are sampled from a 2D uniform distribution with mean x⁻ and width S at each dimension. As we have demonstrated in Section 2.4, for a sampling range of width S = 1 and when a single r is sampled, we are effectively using classical regression-based tracking methods (e.g., [30], [2]) that use a single observation.

In Fig. 7, we summarize the experimental results for artificially generated motions of different magnitudes by plotting the Root Mean Square (RMS) error as a function of both the true displacement and the sampling range for
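The cluster-based reduction of the RVM training set described above can be sketched as follows. This is a toy 1D version (the actual observations are PCA-projected image windows), with hypothetical data in which the positive examples happen to cluster apart from the negative ones.

```python
import random

def kmeans_1d(points, k, iters=20):
    # A minimal 1D k-means: returns the centers and the assignment of
    # each point to its nearest center.
    centers = random.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(p - centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, assign

def cluster_training_set(examples, k):
    """Replace the full example set by cluster centers, labeling each
    center by the majority class of the examples assigned to it."""
    points = [x for x, _ in examples]
    centers, assign = kmeans_1d(points, k)
    labeled = []
    for c, center in enumerate(centers):
        labels = [lab for (_, lab), a in zip(examples, assign) if a == c]
        if labels:
            majority = 1 if sum(labels) * 2 >= len(labels) else 0
            labeled.append((center, majority))
    return labeled

random.seed(3)
# Toy data: positives (good predictions) near 0, negatives near 5.
examples = [(random.gauss(0.0, 0.3), 1) for _ in range(30)] + \
           [(random.gauss(5.0, 0.3), 0) for _ in range(30)]
reduced = cluster_training_set(examples, k=2)
```

The classifier is then trained on `reduced` instead of the full example set, which is what keeps RVM learning tractable.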
Fig. 11. Tracking results for frames 70, 118, 154, 202, 358, and 580 of the "Head" sequence (SS = 1 and TS = 3). (a) Location of relevant (red) and unreliable (blue) observations. (b) Relevant (red) and unreliable (blue) predictions. (c) Probabilistic prediction (red) and point estimation (black box).
Gaussian), and in the third column, we give the probabilistic prediction (each ellipse represents one of the M 2D Gaussians) as well as the final estimate of the target location (black box). It is clear that the classification scheme reduces (or completely discards) the influence of inaccurate predictions. This is apparent both in the case of larger motion (rows 1-3), in which the prediction of observations further away from the target (in the direction opposite from that of the motion) is discarded, as well as in the case of occlusions (row 4).

In order to illustrate the benefits of using both multiple observations and data relevance determination, we present comparative results with the two simplified versions of our algorithm (sPred and nPred), together with two alternative methods reported in the literature, namely, the CONDENSATION algorithm and [32], when a single target is tracked. The simplified version sPred is similar to classical regression-based tracking methods (e.g., [30], [2]) that use a single observation. Recall that we arrive at this simplification by setting the number of samples of r equal to 1 (i.e., R = 1) and the width of the sampling range very small (i.e., S = 1). The
TABLE 3: RMS Error and Percentage of Target Losses (in Parentheses) at Various Spatial (SS) and Temporal (TS) Subsamplings for Various Sequences
Fig. 14. Tracking results for the “Head” sequence. OT = 5, SS = 2, TS = 3 (see Table 4).
Fig. 15. Tracking results for frames 49, 115, and 253 of the "CD cover" sequence (SS = 2 and TS = 3). A quarter (QRT) of the target is artificially occluded every other frame.
of larger occlusions, such as the last two rows of Table 4, in which half of the target is occluded. The results in comparison to generative tracking (i.e., CONDENSATION) illustrate the ability of prediction-based methods to utilize an observation near the target, rather than on the target itself, in order to predict the target location. This is the case even when the target is partially occluded, so long as the observation is relevant. In the case of partial occlusions, it is also the case that methods that rely on the validation of a prediction, for example, by classifying the observation that is extracted at the predicted location [30], are likely to fail even when the prediction is accurate.

In Fig. 16, we give some insight into the ability of the algorithm to deal with partial occlusions. More specifically, in the case of partial occlusions, it is those observations that are extracted in regions neighboring the true target position that are deemed relevant by the classifier. These observations are therefore used to deliver reliable predictions of the target state. In Fig. 16a, we show the locations at which relevant (unreliable) observations were extracted in red (blue), in Fig. 16b, we show the corresponding probabilistic predictions (each ellipse represents a Gaussian), and in Fig. 16c, we show the final probabilistic prediction using five Gaussians. Note that our relevance determination scheme suppressed most of the observations that were close to the true target location. This indicates that a validation scheme using the trained RVM classifier would also be likely to fail. Similar results are obtained for alternative types of occlusions.

3.4 4D Template Tracking
Finally, we present results for the sequential estimation of the location, scale, and rotation of visual targets in image sequences. In Figs. 17 and 18, we show some representative frames for two image sequences in which the 4D state (translation, scale, and rotation) of a primary target is estimated and the position of a secondary target is inferred. For each of the sequences, a single Bayesian Mixture of Experts is trained for predicting the 4D state vector. Here, r has the dimensionality of the state and determines the location, scale, and orientation parameters of the warping required in order to obtain the observation y(r). For these experiments, 1,500 examples were used for training the BME-based predictor. The results for the "Soda can" [4] (Fig. 17) and the "Hand Held" (Fig. 18) image sequences
TABLE 4: Errors for Various Occlusion Types (OTs) at Various Spatial (SS) and Temporal (TS) Subsamplings
TABLE 5: Mean Absolute and RMS (in Parentheses) Errors for the "Head" and the "Soda Can" Sequence (SS = 4)

Fig. 17. Soda Sequence [4] (TS = 3): Estimate of the translation/scale and rotation of a 25×25 window. The position of the lower corner of the can is inferred from the estimation of the state of the target (central square).
observations are obtained at scales and orientations similar
to those used in the training phase.
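Obtaining an observation y(r) for a 4D state amounts to warping a patch out of the image. Below is a minimal sketch with nearest-point sampling from a synthetic continuous image; the function names and parameter layout are assumptions for illustration, not the authors' code.

```python
import math

def image(px, py):
    # Synthetic "image": a smooth intensity pattern over the plane.
    return math.sin(0.3 * px) + math.cos(0.2 * py)

def extract_patch(r, size=5):
    """Sample a size x size observation y(r), where r = (tx, ty, s, theta)
    gives the translation, scale, and rotation of the warp."""
    tx, ty, s, theta = r
    c, si = math.cos(theta), math.sin(theta)
    half = size // 2
    patch = []
    for v in range(-half, half + 1):
        row = []
        for u in range(-half, half + 1):
            # Map patch coordinates (u, v) into image coordinates.
            px = tx + s * (c * u - si * v)
            py = ty + s * (si * u + c * v)
            row.append(image(px, py))
        patch.append(row)
    return patch

# A 5x5 patch centered at (10, 20), at scale 1.5, rotated by 30 degrees.
patch = extract_patch((10.0, 20.0, 1.5, math.pi / 6.0))
```

The center sample is the intensity at the translation itself; scale and rotation only affect how the surrounding samples are spread over the image.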
give an illustration of the accuracy at which the scale and the rotation of the primary target are estimated. In order to create larger motions, we have temporally subsampled the "Soda can" sequence by a factor of 3. The "Hand Held" image sequence (Fig. 18) contains significant motion blur at certain frames (e.g., third row and second column) as well as occasional partial occlusions of the target. Quantitative results are presented in Table 5, which gives the estimation error of the translation, scale, and rotation components. The results in the first row of Table 5 refer to the "Head" image sequence, where the secondary target is the right eye of the depicted person.

3.5 Synopsis
In summary, the experimental results demonstrate the efficiency of the proposed scheme in both 2D and 4D tracking. In the case of 2D template tracking, a clear improvement is demonstrated in comparison to: 1) alternative methods that use a single observation (sPred and [32]), 2) alternative methods that use multiple observations without relevance determination (nPred), and 3) a generative particle filtering method [11]. The improvements were more

4 CONCLUSIONS
In this paper, we have presented a method for efficient and robust visual tracking. We propose a discriminative framework in which multiple observations provide predictions of the state of the target. Each prediction is moderated by the relevance of the corresponding observation, as this is determined by a probabilistic classification scheme. This is the first work that utilizes multiple observations for discriminative tracking or uses a classification scheme to assess in advance the relevance of an observation (as opposed to the a posteriori validation of the prediction). We have illustrated the efficiency of our approach in a number of image sequences for the problem of 2D tracking and, in particular, its ability to deal with large motion and with partial occlusions. For future work, we intend to extend the proposed scheme for tracking in higher dimensional spaces, for example, for tracking 3D human pose under occlusions and background clutter.

The estimation scheme that we propose is rather general
pronounced in the case of partial occlusions and rapid and neither limited to the problem of tracking nor tied to
motion. This is due to the ability of the proposed scheme to
the specific regression-based state prediction or RVM-based
discard the predictions of observations that originate at
observation relevance determination. In principle, any
occlusions or from areas too far from the target. The results in
sequences with large rotations and scale changes indicate the estimation scheme that employs a regression-based pre-
need to maintain an estimate of the transform so that warped dictor can be tied to an observation relevance/reliability
estimator in order to utilize multiple observations and
moderate their predictions according to their expected
accuracy (the latter being determined by the observation
relevance/reliability estimator). It would therefore be
interesting to investigate the properties of the proposed
framework in domains other than visual tracking.
Finally, in order to focus on the properties of the coupled
regressor-classifier scheme, we focused our analysis on the
problem of a single target tracking without a motion model.
In future work, we intend to extend the scheme on the
problem of multiple interacting target tracking [21], and
using our scheme with learned motion models.
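The central idea of moderating each observation's prediction by its relevance can be sketched in a few lines. The following is an illustrative simplification, not the paper's full particle-filtering recursion: each observation yields a Gaussian prediction of the state together with a relevance probability from the classifier, and the moderated estimate is the corresponding relevance-weighted Gaussian mixture. All function and variable names here are our own.

```python
import numpy as np

def moderated_mixture(means, covs, relevance):
    """Combine per-observation Gaussian predictions (mean, cov) into a
    mixture whose weights are the classifier's relevance probabilities,
    so that the predictions of irrelevant observations are suppressed."""
    w = np.asarray(relevance, dtype=float)
    w = w / w.sum()  # normalize relevance values into mixture weights
    mean = sum(wi * m for wi, m in zip(w, means))
    # Mixture covariance: within-component spread + between-component spread.
    cov = sum(wi * (C + np.outer(m - mean, m - mean))
              for wi, m, C in zip(w, means, covs))
    return w, mean, cov

# Two predictions of a 2D state: one relevant, one (e.g., from an
# occluded region) irrelevant and therefore down-weighted.
means = [np.array([10.0, 5.0]), np.array([40.0, -20.0])]
covs = [np.eye(2), 25.0 * np.eye(2)]
weights, mean, cov = moderated_mixture(means, covs, relevance=[0.95, 0.05])
print(weights)  # the irrelevant observation contributes only weight 0.05
print(mean)     # pulled strongly toward the relevant prediction
```

In the full scheme, these weighted predictions enter the particle-filter recursion rather than being collapsed to a single mixture per frame; the sketch only shows how relevance suppresses unreliable predictions.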
Fig. 18. "Hand Held" sequence: Estimate of the translation, scale, and rotation of a window centered around the center of the picture. The position of the lower corner of the picture is inferred from the estimate of the state of the target (central square).

APPENDIX A
FROM AN L-COMPONENT TO AN M-COMPONENT MIXTURE OF GAUSSIANS

In this Appendix, we will briefly outline a method for approximating a mixture of L Gaussians with a reduced M-component mixture. Our derivation builds on the method of Vlassis and Verbeek [28] for learning a Gaussian Mixture from noisy data.
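To make the goal of the appendix concrete, the sketch below reduces an L-component Gaussian mixture to M components. It does not reproduce the EM-based method of Vlassis and Verbeek [28] outlined here; instead it uses a simpler greedy moment-matching strategy (repeatedly merging the pair of components with the closest means) purely for illustration. All names are our own.

```python
import numpy as np

def merge_pair(w1, m1, C1, w2, m2, C2):
    """Moment-matching merge of two Gaussian components: the merged
    component preserves the total weight, mean, and covariance."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    C = (w1 * (C1 + np.outer(m1 - m, m1 - m))
         + w2 * (C2 + np.outer(m2 - m, m2 - m))) / w
    return w, m, C

def reduce_mixture(weights, means, covs, M):
    """Greedily reduce an L-component mixture to M components by
    repeatedly merging the pair of components with the closest means
    (a crude criterion; [28] fits the reduced mixture by EM instead)."""
    comps = list(zip(weights, means, covs))
    while len(comps) > M:
        i, j = min(((a, b) for a in range(len(comps))
                    for b in range(a + 1, len(comps))),
                   key=lambda p: np.linalg.norm(comps[p[0]][1] - comps[p[1]][1]))
        wj, mj, Cj = comps.pop(j)  # pop the larger index first
        wi, mi, Ci = comps.pop(i)
        comps.append(merge_pair(wi, mi, Ci, wj, mj, Cj))
    return comps

# Reduce a 4-component 1D mixture with two well-separated clusters
# of components down to 2 components.
weights = [0.25, 0.25, 0.25, 0.25]
means = [np.array([0.0]), np.array([0.2]), np.array([5.0]), np.array([5.1])]
covs = [np.eye(1) * 0.1] * 4
reduced = reduce_mixture(weights, means, covs, M=2)
print(len(reduced))  # prints 2
```

Moment matching keeps the reduced mixture's first two moments exact for each merge; the EM formulation in [28] instead optimizes the fit of the whole M-component mixture at once.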
1566 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 9, SEPTEMBER 2010
[26] M. Tipping, "The Relevance Vector Machine," Advances in Neural Information Processing Systems, Morgan Kaufmann, 2000.
[27] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 511-518, 2001.
[28] N. Vlassis and J. Verbeek, "Gaussian Mixture Learning from Noisy Data," Technical Report IAS-UVA-04-01, Informatics Inst., Univ. of Amsterdam, 2004.
[29] S. Waterhouse, D. MacKay, and T. Robinson, "Bayesian Methods for Mixtures of Experts," Advances in Neural Information Processing Systems, vol. 8, pp. 351-357, MIT Press, 1996.
[30] O. Williams, A. Blake, and R. Cipolla, "Sparse Bayesian Regression for Efficient Visual Tracking," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1292-1304, Aug. 2005.
[31] Y. Wu, G. Hua, and T. Yu, "Switching Observation Models for Contour Tracking in Clutter," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 295-302, June 2003.
[32] K. Zimmermann, J. Matas, and T. Svoboda, "Tracking by an Optimal Sequence of Linear Predictors," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 4, pp. 677-692, Apr. 2009.

Ioannis (Yiannis) Patras received the BSc and MSc degrees in computer science from the Computer Science Department, University of Crete, Heraklion, Greece, in 1994 and in 1997, respectively, and the PhD degree from the Department of Electrical Engineering, Delft University of Technology, The Netherlands, in 2001. He has been a postdoctoral researcher in the area of multimedia analysis at the University of Amsterdam, and a postdoctoral researcher in the area of vision-based human machine interaction at TU Delft. Between 2005 and 2007, he was a lecturer in computer vision in the Department of Computer Science, University of York, United Kingdom. Since 2007, he has been a lecturer in computer vision in the Department of Electronic Engineering, Queen Mary, University of London. He is/has been on the organizing committees of IEEE SMC '04 and Face and Gesture Recognition '08, and was the general chair of WIAMIS '09. He is an associate editor of the Image and Vision Computing Journal and the Journal of Multimedia. His research interests lie in the areas of computer vision and pattern recognition, with emphasis on motion analysis and its applications in multimedia data management, multimodal human computer interaction, and visual communications. Currently, he is interested in the analysis of human motion, including the detection, tracking, and understanding of facial and body gestures. He is a member of the IEEE and the IEEE Computer Society.

Edwin R. Hancock received the BSc degree in physics, the PhD degree in high-energy physics, and the DSc degree from the University of Durham in 1977, 1981, and 2008, respectively. From 1981 to 1991, he worked as a researcher in the fields of high-energy nuclear physics and pattern recognition at the Rutherford-Appleton Laboratory (now the Central Research Laboratory of the Research Councils). During this period, he also held adjunct teaching posts at the University of Surrey and the Open University. In 1991, he moved to the University of York as a lecturer in the Department of Computer Science, where he has held a chair in computer vision since 1998. He leads a group of some 25 faculty, research staff, and PhD students working in the areas of computer vision and pattern recognition. His main research interests are in the use of optimization and probabilistic methods for high and intermediate-level vision. He is also interested in the methodology of structural and statistical pattern recognition. He is currently working on graph matching, shape-from-X, image databases, and statistical learning theory. His work has found applications in areas such as radar terrain analysis, seismic section analysis, remote sensing, and medical imaging. He has published about 135 journal papers and 500 refereed conference publications. He was awarded the Pattern Recognition Society Medal in 1991 and an Outstanding Paper Award in 1997 by the journal Pattern Recognition. He has also received the best paper prizes at CAIP '01, ACCV '02, ICPR '06, BMVC '07, and ICIAP '09. In 1998, he became a fellow of the International Association for Pattern Recognition. He is also a fellow of the Institute of Physics, the Institute of Engineering and Technology, and the British Computer Society. He has been a member of the editorial boards of the journals IEEE Transactions on Pattern Analysis and Machine Intelligence, Pattern Recognition, Computer Vision and Image Understanding, and Image and Vision Computing. In 2006, he was appointed the founding editor-in-chief of the IET Computer Vision Journal. He was the conference chair for BMVC '94, the track chair for ICPR '04, and the area chair at ECCV '06 and CVPR '08, and in 1997, he established the EMMCVPR workshop series. In 2009, he was awarded a Royal Society Wolfson Research Merit Award.