Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

Probabilistic Identity Characterization for Face Recognition∗

Shaohua Kevin Zhou and Rama Chellappa


Center for Automation Research (CfAR) and
Department of Electrical and Computer Engineering
University of Maryland, College Park, MD 20742
{shaohua, rama}@cfar.umd.edu

Abstract video sequences, we mean dependent groups of images.


Approaches that use I-groups can be rough divided into
We present a general framework for characterizing the ob- two categories. The first category is based on manifold
ject identity in a single image or a group of images with matching. In [11], hypothetical identity surfaces are con-
each image containing a transformed version of the ob- structed by computing the linear coefficients of view space.
ject, with applications to face recognition. In terms of the Illumination variations are not accounted for. Discriminant
transformation, the group is made of either many still im- features are then extracted to overcome other variations. In
ages or frames of a video sequence. The object identity [6], manifolds are formed for every I-group. Recognition is
is either discrete- or continuous-valued. This probabilistic performed by computing the shortest distance between two
framework integrates all the evidence of the set and handles manifolds. The manifold takes a certain parameterized form
the localization problem, illumination and pose variations and the parameters are directly learned from the visual ap-
through subspace identity encoding. Issues and challenges pearances. Robustness to pose and illumination variations
arising in this framework are addressed and efficient com- are not reported. The second category is based on statistical
putational schemes are presented. Good face recognition learning. In [14], a multi-variate Gaussian density is fit-
results using the PIE database are reported. ted for every I-group. Recognition is achieved by comput-
ing the Kullback-Leibler distance [4] between two Gaussian
densities. However, the Gaussian assumption is easily vio-
1 Introduction lated if pose and illumination variations exist. In [17], prin-
cipal subspaces are learned for each I-group and principal
Visual face recognition is an important task. Even though a angel between two principal subspace are used for recog-
lot of research has been carried out, current state-of-the-art nition. The computation of principal angle is also carried
recognizers still yield unsatisfactory results especially when on the feature space embedded by kernel functions. One
confronted with pose and illumination variations. In addi- common disadvantage of the above approaches is that they
tion, the recognizers are further complicated by the registra- also assume that the face regions have already been cropped
tion requirement as the images that the recognizers process beforehand, using either a detector or a tracker.
contain transformed appearances of the object. Below, we
simply use the term ‘transformation’ to model the variations Approaches using video sequences utilize temporal in-
involved, be it registration, pose and/or illumination varia- formation for recognition as well. In [20], simultaneous
tions. tracking and recognition is implemented in a probabilistic
While most recognizers process a single image, there is a framework. The joint posterior probability of the track-
growing interest in using a group of images [11, 20, 14, 12, ing parameter and the identity variable is approximated us-
10, 17, 6]. In terms of the transformations embedded in the ing the sequential important sampling (SIS) algorithm and
group or the temporal continuity between the transforma- the marginal posterior probability of the identity variable is
tions, the group can be either independent or not. Examples used for recognition. However, only affine localization pa-
of the independent group (I-group) are face databases that rameter is used for tracking and handling pose and illumina-
store multiple appearances for one object. Examples of the tion variations are not reported. In addition, exemplars are
dependent group are video sequence with temporal continu- learned from the gallery videos to cover pose and illumina-
ity. If the temporal information is stripped, video sequences tion variations. In [12], hidden Markov models are used to
reduce to I-groups. In this paper, whenever we mention learn the dynamics before successive appearances. In [10],
pose variations are handled by learning the view-discretized
∗ Partially supported by the DARPA/ONR Grant N00014-03-1-0520. appearance manifolds from the training ensemble. Transi-

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE

tion probabilities from one view to another view are used = π(α) p(y1:N |θ1:N , α)p(θ1:N )dθ1:N
to regularize the search space. However, in [12, 10], the θ1:N
cropped images are used for testing.  
N
In this paper, we propose a generic framework which = π(α) p(yt |θt , α)p(θt |θ1:t−1 )dθ1:N , (1)
possesses the following features: (i) It processes either a θ1:N t=1

single image or a group of images (including I-group and


where the following rules, namely (a) observational condi-
video sequence). (ii) It handles the localization problem,
tional independence and (b) chain rule, are applied:
illumination and pose variations. (iii) The identity descrip-
tion could be either discrete or continuous. The continuous 
N
identity encoding typically arises from a subspace model- (a) p(y1:N |θ1:N , α) = p(yt |θt , α); (2)
ing. (iv) It is probabilistic and integrates all the available t=1
evidence.
In Section 2 we introduce the generic framework which 
N
.
provides a probabilistic characterization of the object iden- (b) p(θ1:N ) = p(θt |θ1:t−1 ); p(θ1 |θ0 ) = p(θ1 ). (3)
t=1
tity. In Section 3 we address issues and challenges arising
in this framework. In Section 4 we focus on how to achieve Equation (1) involves two key quantities: the observa-
an identity encoding which is invariant to localization, illu- tion likelihood p(yt |θt , α) and the state transition proba-
mination and pose variations. In Section 5, we present some bility p(θt |θ1:t−1 ). The former is essential to a recognition
efficient computational methods. In Section 6, we present task, the ideal case being that it possesses a discriminative
experimental results. In Section 7, we conclude our paper power in the sense that it always favors the correct identity
and briefly summarize potential future works. and disfavors the others; the latter is also very helpful espe-
cially when processing video sequences, which constrains
the search space.
2 Probabilistic Identity Characteri- We now study two special cases of p(θt |θ1:t−1 ).
zation
2.1 Independent group (I-group)
Suppose α is the identity signature, which represents the
identity in an abstract manner. It can be either be discrete- In this case, the transformations {θt ; t = 1, . . . , N } are in-
or continuous- valued. If we have a C-class problem, α is dependent of each other, i.e.
discrete taking value in {1, 2, ..., C}. If we associate the
p(θt |θ1:t−1 ) = p(θt ). (4)
identity with image intensity or feature vectors derived from
say subspace projections, α is continuous-valued. Given a Eq. (1) becomes
.
group of images y1:N = {y1 , ..., yN } containing the ap-
pearances of the same but unknown identity, probabilistic N 

identity characterization is equivalent to finding the poste- p(α|y1:N ) ∝ π(α) p(yt |θt , α)p(θt )dθt . (5)
rior probability p(α|y1:N ). Probabilistic modeling is com- t=1 θt

monly used in computer vision and applications. See [9].


In this context, the probability p(θt ) can be regarded as a
As the image only contains a transformed version of the
prior for θt , which is often assumed to be Gaussian with
object, we also need to associate it a transformation param-
mean θ̂ or non-informative.
eter θ, which lies in a transformation space Θ. The trans-
The most widely studied case in the literature is N = 1,
formation space Θ is usually application dependent. Affine
i.e. there is only a single image in the group. Due to its im-
transformation is often used to compensate for the localiza-
portance, sometime we will distinguish it from the I-group
tion problem. To handle illumination variation, the lighting
(with N > 1) depending on the context. We will present
direction is used. If pose variation is involved, 3D transfor-
in Section 3 the shortcomings of many contemporary ap-
mation is needed or a discrete set is used if we quantize the
proaches.
continuous view space.
It all boils down to how to compute the integral in (5) in
We assume that the prior probability of α is π(α), which
real application. In the sequel, we will show how to approx-
is assumed to be, in practice, a non-informative prior. A
imate it efficiently.
non-informative prior is uniform in the discrete case and
treated as a constant, say 1, in the continuous case.
The key to our probabilistic identity characterization is 2.2 Video sequence
as follows:
In the case of video sequence, temporal continuity between
p(α|y1:N ) ∝ π(α)p(y1:N |α) successive video frames implies that the transformations

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE
{θt ; t = 1, . . . , N } follow a Markov chain. Without loss a single image (an I-group with N = 1), an I-group with
of generality, we assume a first-order Markov chain, i.e. N ≥ 2, or a video sequence; (ii) the identity signature is ei-
ther discrete- or continuous-valued; and (iii) the transforma-
p(θt |θ1:t−1 ) = p(θt |θt−1 ). (6) tion space takes into account all available variations, such as
Eq. (1) becomes localization and variations in illumination and pose.
 
N
p(α|y1:N ) ∝ π(α) p(yt |θt , α)p(θt |θt−1 )dθ1:N . (7) 3.1 Discrete identity signature
θ1:N t=1
In a typical pattern recognition scenario, say a C-class prob-
The difference between (5) and (7) is whether the prod- lem, the identity signature for y1:N , α̂, is determined by the
uct lies inside or outside the integral. In (5), the product lies Bayesian decision rule:
outside the integral, which divides the quantity of interest
into ‘small’ integrals that can be computed efficiently; while α̂ = arg min p(α|y1:N ). (10)
{1,2,...,C}
(7) does not have such a decomposition, causing computa-
tional difficulty. Usually p(y|θ, α) is a class-dependent density, either pre-
specified or learned. This is a well studied problem and we
will not focus on this.
2.3 Difference from Bayesian estimation
Our framework is very different from the traditional 3.2 Continuous identity signature
Bayesian parameter estimation setting, where a certain pa-
rameter β should be estimated from the i.i.d. observa- If the identity signature is continuous-valued, two recog-
tions {x1 , x2 , ..., xN } generated from a parametric density nition schemes are possible. The first is to derive a point
p(x|β). If we assume that β has a prior probability π(β), estimate α̂ (e.g. conditional mean, mode) from p(α|y1:N )
then the posterior probability p(β|x1:N ) is computed as to represent the identity of image group y1:N . Recognition
is performed by matching α̂’s belonging to different groups

N
of images using a metric k(., .). Say, α̂1 is for group 1 and
p(β|x1:N ) ∝ π(β)p(x1:N |β) = π(β) p(xt |β) (8) α̂2 for group 2, the point distance
t=1
.
k̂1,2 = k(α̂1 , α̂2 )
and used to derive the parameter estimate β̂. One should not
confuse our transformation parameter θ with the parameter is computed to characterize the difference between groups
β. Notice that β is fixed in p(xt |β) for different t’s. How- 1 and 2.
ever, each yt is associates with a θt . Also, α is different Instead of comparing the point estimates, the second
from β in the sense that α describes the identity and β helps scheme directly compares different distributions that char-
to describe the parametric density. acterize the identities for different groups of images. There-
To make our framework more general, we can also incor- fore, for two groups 1 and 2 with the corresponding poste-
porate the β parameter by letting the observation likelihood rior probabilities p(α1 ) and p(α2 ), we use the following
be p(y|θ, α, β). Equation (1) then becomes expected distance [18]
 
p(α|y1:N ) ∝ π(α)p(y1:N |α) (9) .
k̄1,2 = k(α1 , α2 )p(α1 )p(α2 )dα1 dα2 .

α1 α2
= π(α) p(y1:N |θ1:N , α, β)p(θ1:N )π(β)dθ1:N dβ
β,θ1:N Ideally, we wish to compare the two probability distribu-
 
N tions using quantities such as the Kullback-Leibler distance
= π(α) p(yt |θt , α, β)p(θt |θ1:t−1 )π(β)dθ1:N dβ, [4]. However, computing such quantities is numerically
t=1 prohibitive when α is of high dimensionality.
where θ1:N and β are assumed to be statistically indepen- The second scheme is preferred as it utilizes the com-
dent. In this paper, we will focus only on (1) as if we already plete statistical information, while in the first one, point es-
know the true parameter β in (9). This greatly simplifies our timates use partial information. For examples, if only the
computation. conditional mean is used, the covariance structure or higher-
order statistics is thrown away. However, there are circum-
stances when the first scheme makes sense: the posterior
3 Recognition Setting and Issues distribution p(α|Y1:N ) is highly peaked or even degenerate
at α̂. This might occur when (i) the variance parameters are
Equation (1) lays a theoretical foundation, which is univer- taken to be very small; or (ii) we let N go to ∞, i.e. keep
sal for all recognition settings: (i) recognition is based on observing the same object for a long time.

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE
3.3 The effects of the transformation 4 Subspace Identity Encoding
Even though recognition based on single images has been The main challenge is to specify the likelihood p(y|θ, α).
conducted for a long time, most efforts assume only Practical considerations require that (i) the identity encod-
one alignment parameter θ̂ and compute the probability ing coefficient α is compact so that our target space where
p(y|θ̂, α). Any recognition algorithm computing some dis- α resides is of low dimensional; and (ii) α should be invari-
tance measures can be thought of as using a properly de- ant to transformations and tightly clustered so that we can
fined Gibbs distribution. The underlying assumption is that safely focus on a small portion of the spaces.
Inspired by the popularity of subspace analysis, we as-
p(θ) = δ(θ − θ̂), (11) sume that the observation y can be well explained by a sub-
space, whose basis vectors are encoded in a matrix denoted
where δ(.) is an impulse function. Using (11), (5) becomes by B, i.e. there exists linear coefficients α such that y ≈ Bα.
 Clearly, α naturally encodes the identity. However, the
observation under the transformation condition (parameter-
p(α|y) ∝ π(α) p(y|θ, α)δ(θ − θ̂)dθ = π(α)p(y|θ̂, α).
θ ized by θ) deviates from the canonical condition (parame-
(12) terized by say θ̄) under which the B matrix is defined. To
Incidentally, if the Laplace’s method is used to approx- achieve an identity encoding that is invariant to the transfor-
imate the integral (refer to the Appendix I for details) and mation, there are two possible ways. One way is to inverse-
the maximizer θ̂α = arg minθ p(y|θ, α)p(θ) does not de- warp the observation y from the transformation condition θ
pend on α, say θ̂α = θ̂, then to the canonical condition θ̄ and the other way is to warp
the basis matrix B from the canonical condition θ̄ to the

transformation condition θ. In practice, inverse-warping is
p(α|y) ∝ π(α) p(y|θ, α)p(θ)dθ typically difficult. For example, we cannot easily warp an
θ
 off-frontal view to a frontal view without explicit 3D depth
 π(α)p(y|θ̂, α)p(θ̂) (2π)r /|I(θ̂)|. (13) information that is unavailable. Hence, we follow the sec-
ond approach, which is also known as analysis-by-synthesis
This gives rise to the same decision rule as implied by (12) approach. We denote the basis matrix under the transforma-
and also partly explains why the simple assumption (11) can tion condition θ by Bθ .
work in practice.
The alignment parameter is therefore very crucial for a 4.1 Invariant to localization, illumination,
good recognition performance. Even a slightly erroneous
and pose
θ̂ may affect the recognition system significantly. It is
very beneficial to have a continuous density p(θ) such as Localization parameter, denoted by ε, includes the face lo-
a Gaussian or even a non-informative since marginalization cation, scale and in-plane rotation. Typically, an affine
of p(θ, α|y) over θ yields a robust estimate of p(α|y). transformation is used. We absorb the localization param-
In addition, our Bayesian framework also provides a way eter ε in the observation using Tε {y}, where the Tε is a lo-
to estimate the best alignment parameter through the poste- calization operator, cropping the patch of interest and nor-
rior probability: malizing it match with the size of the basis.
 The illumination parameter, denoted by λ, is a vec-
tor specifying the illuminant direction (and intensity if re-
p(θ|y) ∝ p(y|θ, α)π(α)dα. (14)
α quired). The pose parameter, denoted by υ, is a continuous-
valued random variable. However, practical systems [3,
8] often discretize this due to the difficulty in handling
3.4 Asymptotic behaviors 3D to 2D projection. Suppose the quantized pose set
is {1, . . . , V }. To achieve pose invariance, we concate-
When we have an I-group or a video sequence, we are often nate all the images [8] {y1 , . . . , yV } under all the views
interested in discovering the asymptotic (or large-sample) and a fixed illumination λ to form a ’huge’ vector Yλ =
behaviors of the posterior distribution p(α|y1:N ) when N is [y1,λ , ..., yV,λ ]T . To further achieve invariance to illumina-
large. In [20], the discrete case of α in a video sequence is tion, we invoke the Lambertian reflectance model, ignoring
studied. However it is very challenging to extend this study shadow pixels. Now, λ is actually a 3-D vector describing
to a continuous case. Experimentally (refer to Section 6), the illuminant. Since all yv ’s are illuminated by the same λ,
we find that p(α|y1:N ) becomes more and more peaked as the Lambertian model gives,
N increase, which seems to suggest a degenerancy in the
true value αtrue . Yλ = Wλ. (15)

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE
Following [19] we assume that Monte Carlo simulation. The underlying principle is the
law of large number (LLN). If {x1 , x2 , . . . , xK } are K i.i.d.

m
samples of the density p(x), for any bounded function h(x),
W= αi Wi , (16)

1 
i=1 K
lim h(xk ) = h(x)p(x)dx = Ep [h]. (21)
we have K→∞ K x
k=1
λ

m
Y = αi Wi λ, (17) Alternatively, when drawing i.i.d. samples from p(x) is
i=1 difficult, we can use importance sampling [13]. Suppose
that the importance function q(x) has i.i.d. realizations
where Wi ’s are illumination-invariant bilinear basis and
{x1 , x2 , . . . , xK }. The pdf p(x) can be represented by a
α = [α1 , . . . , αm ]T provides an illuminant-invariant iden-
weighted set samples {(xk , wp k
)}K
k=1 , where the weight for
tity signature. Those bilinear basis can be easily learned as k
shown in [7, 19]. Thus α is also pose-invariant because, for the sample x is
a given view υ, we take the part in Y corresponding to this k
wp = p(xk )/q(xk ), (22)
view and still have
in the sense that for any bounded function h(x),

m
yλ,υ
= αi Wυi λ. (18) 
K 
K
p(xk )
i=1 lim k
wp h(xk ) = h(xk ) = Ep [h]. (23)
K→∞ q(xk )
k=1 k=1
In summary, the basis matrix Bθ for θ = (ε, λ, υ) with ε
Laplace’s method [2, 13]. The general approach of this
absorbed in y is expressed as Bλ,υ = [Wυ1 λ, . . . , Wυm λ].
method is presented in Appendix-I. This is a good approx-
We focus on the following likelihood:
imation to the integral, only if the integrand is uniquely
peaked and reasonably mimics the Gaussian function.
p(y|θ) = p(y|ε, λ, υ, α)
In our context, we use importance sampling (or i.i.d sam-
= Z−1 λ,υ,α exp{−D(Tε {y}, Bλ,υ α)}, (19) pling if possible) for ε and the Laplace’s method for λ and
enumerate υ. We draw i.i.d. samples {ε1 , ε2 , . . . , εK }
where D(y, Bθ α) is some distance measure and Zλ,υ,α is from q(ε) and, for each sample εk , compute the weight
the so-called partition function which plays a normalization wεk = p(εk )/q(εk ). If the i.i.d. sampling is used, the
role. In particular, if we take D as weights are always ones. Putting things together, we have
(assuming π(α) is a non-informative prior)
D(Tε {y}, Bλ,υ α) = (Tε {y}−Bλ,υ α)T Σ−1 (Tε {y}−Bλ,υ α)/2, 
(20) p(α|y) ∝ p(y|ε, λ, υ, α)p(ε)p(λ)p(υ)dεdλdυ
with a given Σ (say Σ = σ 2 I where I is an identity ma- ε,λ,υ
trix), then (19) becomes a multivariate Gaussian and the
1  1 
K V
partition function Zλ,υ,α does not depend on the parame-  wεk p(y|εk , λ̂εk ,υ,α , υ, α) ×
ters any more. However, even though (19) is a multivariate K V υ=1
k=1
Gaussian, the posterior distribution p(α|y1:N ) is no longer 
p( λ̂ k ,υ,α ) (2π)r /|I(λ̂εk ,υ,α )|, (24)
Gaussian. ε

where λ̂εk ,υ,α is the maximizer


5 Computational Issues λ̂εk ,υ,α = arg min p(y|εk , λ, υ, α)p(λ), (25)
λ

5.1 The integral r is the dimensionality of λ, and I(λ̂ε,υ,α ) is a properly de-


fined matrix. Refer to Appendix-II for computing λ̂ε,υ,α
 space Θ is discrete, it is easy to eval-
If the transformation
uate the integral1 θ p(y|θ, α)p(θ)dθ, which becomes a and I(λ̂ε,υ,α ) if the likelihood is given as (19) and (20) and
sum. If Θ is continuous, in general, computing integral
a non-informative prior p(λ) is assumed. Similar deriva-
p(y|θ, α)p(θ)dθ is a difficult task. Many techniques tions can be conducted for an I-group of observations y1:N .
θ
are available in the literature. Here we mainly focus on
two techniques: Monte Carlo simulation [13] and Laplace’s 5.2 The distances k̄ and k̂
method [2, 13].
To evaluate the expected distance k̄, we resort to Monte
1 We drop the subscript [.]t notation as this is a general treatment. Carlo method. In our context, the target distribution is

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE
which cover from the left profile to the frontal to the right
c22 profile. In total, we have 68 ∗ 12 ∗ 9 = 7344 images. Fig
1 displays one PIE object under the illumination and pose
c02
variations.
c37 We randomly divide the 68 subjects into two parts. The
first 34 subjects are used in the training set and the remain-
c05 ing 34 subjects are used in the gallery and probe sets. It
is guaranteed that there is no identity overlap between the
c27 training set and the gallery and probe sets.
During training, the images are pre-preprocessed by
c29 aligning the eyes and mouth to desired positions. No flow
computation is carried on for further alignment. After the
c11
pre-processing step, the used face image is of size 48 by 40,
c14 i.e. d = 48 ∗ 40 = 1920. Also, we only study gray images
by taking the average of the red, green, and blue channels
c34 of their color versions.
f16 f15 f13 f21 f12 f11 f08 f06 f10 f18 f04 f02 The training set is used to learn the basis matrix Bθ or
the bilinear basis Wi ’s. As mentioned before, θ includes
Figure 1: Examples of the face images of one PIE object under the illumination direction λ and the view pose υ, where s
the selected illumination and poses actually used in recognition. is a continuous-valued random vector and υ is a discrete
random variable taking values in {1, . . . , V } with p = 9
(corresponding to C).
p(α|y1:N ). Based on the above derivations, we know how
The images belonging to the remaining 34 subjects are
to evaluate the target distribution, but not to draw sample
used in the gallery and probe sets. The construction of the
from it. Therefore, we use the importance sampling. Other
gallery and probe sets conforms the following: To form a
sampling techniques such as Monte Carlo Markov chain
gallery set of the 34 subjects, for each subject, we use an
[13] can be applied too.
I-group of 12 images under all the illumination under one
Suppose that, say for group 1, the importance function
pose υp ; to form a probe set, we use I-groups under the other
is q1 (α1 ), and weighted sample set is {α1i , w1i }Ii=1 , the ex-
pose υg . We mainly concentrate on the case with υp = υg .
pected distance [18] is approximated as
Thus, we have 9 ∗ 8 = 72 tests, with each test giving rise
I J to a recognition score. The 1-NN (nearest neighbor) rule is
i=1 w1i w2j k(α1i , α2j )
j=1
k̄1,2  I J j
. (26) applied to find the identity for a probe I-group.
i
i=1 w1 j=1 w2 During testing, we no longer use the pre-processed im-
ages and therefore the unknown transformation parameter
The point distance is approximated as includes the affine localization parameter, the light direc-
I J tion, and the discrete view pose. The prior distribution
i i
i=1 w1 α1 j=1 w2j α2j p(εt ) is assumed to be a Gaussian, whose mean is found
k̂1,2  k( I , J ). (27)
i w2j by a background subtraction algorithm and whose covari-
i=1 w1 j=1
ance matrix is manually specified. We use i.i.d. sampling
from p(εt ) since it is Gaussian. The metric k(., .) actually
used in our experiments is the correlation coefficient:
6 Experimental Results
k(x, y) = {(xT y)2 }/{(xT x)(yT y)}.
We use the ‘illum’ subset of the PIE database [15] in our
experiments. This subset has 68 subjects under 21 illumi- Fig. 2 shows the marginal posterior distribution of the
nation and 13 poses. Out of the 21 illumination, we select first element α1 of the identity variable α, i.e., p(α1 |y1:N ),
12 of them denoted by F , with different N ’s. From Fig. 2, we notice that (i) the pos-
terior probability p(α1 |y1:N ) has two modes, which might
F = {f16 , f15 , f13 , f21 , f12 , f11 , f08 , f06 , f10 , f18 , f04 , f02 }, fail those algorithms using the point estimate, and (ii) it be-
comes more peaked and tightly-supported as N increases,
which typically span the set of variations. Out of the 13 which empirically supports the asymptotic behavior men-
poses, we select 9 of them denoted by C, tioned in Section 3.
Fig. 3 shows the recognition rates for all the 72 tests.
C = {c22 , c02 , c37 , c05 , c27 , c29 , c11 , c14 , c34 }, In general, when the poses of the gallery and probe sets are

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE
0.1 0.1 0.1
0.14
0.09 0.09 0.09

0.13
0.08 0.08 0.08

0.07 0.07 0.07 0.12

0.06 0.06 0.06 0.11

)
)

1:12
1:6
p(α |y1)

p(α |y

p(α |y
0.05 0.05 0.05 0.1

1
0.04 0.04 0.04 0.09

0.03 0.03 0.03


0.08

0.02 0.02 0.02


0.07
0.01 0.01 0.01
0.06
0 0 0
−5 −2.5 0
α1
2.5 5
(a) −5 −2.5 0
α1
2.5 5
(b) −5 −2.5 0
α1
2.5 5
(c) 0.05
c22 c02 c37 c05 c27 c29 c11 c14 c34 (d)

Figure 2: The posterior distributions p(α1 |y1:N ) with different N ’s: (a) p(α1 |y1 ); (b) p(α1 |y1:6 ); and (c) p(α1 |y1:12 ), and (d) the
posterior distribution p(υ|y1:12 ). Notice that p(α1 |y1:N ) has two modes and becomes more peaked as N increases.

c22 c22 c22 c22

c02 c02 c02 c02

c37 c37 c37 c37

c05 c05 c05 c05

c27 c27 c27 c27

c29 c29 c29 c29

c11 c11 c11 c11

c14 c14 c14 c14

c34 c34 c34 c34

c22 c02 c37 c05 c27 c29 c11 c14 c34 c22 c02 c37 c05 c27 c29 c11 c14 c34 c22 c02 c37 c05 c27 c29 c11 c14 c34 c22 c02 c37 c05 c27 c29 c11 c14 c34

55 60 65 70 75 80 85 90 95 100 (a) 50 55 60 65 70 75 80 85 90 95 100 (b) 10 20 30 40 50 60 70 80 90 100 (c) 0 10 20 30 40 50 60 70 80 90 100 (d)

Figure 3: The recognition rates of all tests. (a) Our method based on k̄. (b) Our method based on k̂. (c) The PCA approach [16]. (d) The
KL approach. Notice the different ranges of values for different methods and the diagonal entries should be ignored.

far apart, the recognition rates decrease. The best gallery one probe I-group. In this case, the actual pose is υ = 5
sets for recognition are those in frontal poses and the worst (i.e. camera c27 ), which has the maximum probability in
gallery sets are those in profile views. For a comparison, Fig. 2(d). Similarly, we can find an estimation for ε, which
Table 1 shows the average recognition rates for four differ- is quite accurate as the back ground subtraction algorithm
ent methods: our two probabilistic approaches using k̄ and already provides a clean position.
k̂, respectively, the PCA approach [16], and the statistical
approach [14] using the KL distance. When implement-
ing the PCA approach, we learned a generic face subspace 7 Conclusions and Discussions
from all the training images, stripping their illumination and
pose conditions; while implementing the KL approach, we We presented a generic framework of modeling human
fit a Gaussian density on every I-group and the learning identity for a group of images. This framework provides
set is not used. Our approaches outperform the other two a complete statistic description of the identity. We also pro-
approaches significantly due to the transformation-invariant posed a subspace identity encoding that is invariant to lo-
subspace modeling. The KL approach [14] performs even cation, illumination and pose variations. Our experimental
worse than the PCA approach simply because no illumina- results confirm the effectiveness of this approach.
tion and pose learning is used in the KL approach while We are now investigating the following: (i) Robustness
the PCA approach has a learning algorithm based on im- to occlusions. We currently take input images without oc-
age ensembles taken under different illumination and poses clusions. The existence of occlusions will degrade our per-
(though this specific information is stripped). The state-of- formances. Robust statistics can be used to handle occlu-
the-art face recognition algorithm using the PIE database sions. (ii) Without too much difficulty, we can extend our
is [1]. However, they used the ‘lights’ portion of the PIE approach to perform recognition from video sequences with
database with an ambient light always present and the color localization, illumination, and pose variations. We plan to
images, which make the recognition task relatively easy. use Sequential Monte Carlo methods [5] to exploit temporal
continuity.
Method k̄ k̂ PCA KL [14]
Rec. Rate (top 1) 82% 76% 36% 6%
Rec. Rate (top 3) 94% 91% 56% 15% References
Table 1: Recognition rates of different methods. [1] V. Blanz and T. Vetter. Face recognition base on fitting a 3D
morphable model. IEEE Trans. PAMI, 25:1063–1074, 2003.
[2] R. Bolle and D. Cooper. On optimally combining pieces
As earlier mentioned in Section 3.3, we can infer the of information with application to estimating 3-d complex-
transformation parameters using the posterior probability object position from range data. IEEE Trans. on Pattern
p(θ|y1:N ). Fig. 2 also shows the obtained p(υ|y1:12 ) for Analysis and Machine Intelligence, 8:619–638, 1986.

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE
[3] T. Cootes, K. Walker, and C. Taylor. View-base active ap- θ̂ is the maximizer of p(θ) or equivalently log p(θ) which
pearance models. Proc. Intl. Conf. on Automatic Face and satisfies
Gesture Recognition, 2000. ∂p(θ) ∂ log p(θ)
[4] T. M. Cover and J. A. Thomas. Elements of Information |θ̂ = 0 or |θ̂ = 0. (28)
Theory. Wiley, 1991. ∂θ ∂θ
[5] A. Doucet, N. d. Freitas, and N. Gordon. Sequential Monte We expand log p(θ) around θ̂ using a Taylor series:
Carlo Methods in Practice. Springer, 2001.
1
[6] A. Fitzgibbon and A. Zisserman. Joint manifold distance: a log p(θ)  log p(θ̂) − (θ − θ̂)T I(θ̂)(θ − θ̂), (29)
new approach to appearance based clustering. Proc. IEEE 2
Conference on Computer Vision and Pattern Recognition, where I(θ) is an r × r matrix whose ij th element is
2003.
[7] W. T. Freeman and J. B. Tenenbaum. Learning bilinear mod- ∂ 2 log p(θ)
els for two-factor problems in vision. Proc. of IEEE Conf. Iij (θ) = − . (30)
∂θi ∂θj
on Computer Vision and Pattern Recognition, 1997.
[8] R. Gross, I. Matthews, and S. Baker. Eigen light-fields and Note that the first-order term in (29) is cancelled using (28).
face recognition across pose. Proc. of Intl. Conf. on Auto- If p(θ) is a pdf function with parameter θ, then I(θ) is the
matic Face and Gesture Recognition, 2002. famous Fisher information matrix [13]. Substituting (29)
[9] J. Hornegger, D. Paulus, and N. H. Probabilistic modeling into J gives
in computer vision. Handbook of Computer Vision and Ap- 
1
plications (Eds. Jaehne and Haussecker), 1999. J  p(θ̂) exp{− (θ − θ̂)T I(θ̂)(θ − θ̂)}dθ
[10] K. Lee, J. Ho, M. Yang, and D. Kriegman. Video-based face 2

recognition using probabilistic appearance manifolds. Proc.
IEEE Conference on Computer Vision and Pattern Recogni- = p(θ̂) (2π)r /|I(θ̂)|. (31)
tion, 2003.
[11] Y. Li, S. Gong, and H. Liddell. Constructing facial identity
surfaces in a nonlinear discriminating space. Proc. IEEE Appendix II – About λ̂ε,υ,α
Conference on Computer Vision and Pattern Recognition,
2001. If a non-information prior p(λ) is assumed2 , the maximizer
[12] X. Liu and T. Chen. Video-based face recognition using λ̂ε,υ,α satisfies
adaptive hidden markov models. Proc. IEEE Conference on
Computer Vision and Pattern Recognition, 2003. λ̂ε,υ,α = arg max p(y|ε, λ, υ, α) (32)
[13] C. Robert and G. Casella. Monte Carlo Statistical Methods. λ
Springer, 1999. = arg min(Tε {y} − Bλ,υ α)T (Tε {y} − Bλ,υ α)
[14] G. Shakhnarovich, J. Fisher, and T. Darrell. Face recognition λ
from long-term observations. Proc. European Conference = arg min L(ε, υ, λ, α)
λ
on Computer Vision, 2002.
.
[15] T. Sim, S. Baker, and M. Bsat. The CMU pose, illuminatin, where L(ε, υ, λ, α) = (Tε {y} − Bλ,υ α)T (Tε {y} − Bλ,υ α).
and expression (PIE) database. Proc. Intl. Conf. on Face and Using the fact that
Gesture Recognition, 2002.
. 
m
[16] M. Turk and A. Pentland. Eigenfaces for recognition. Jour-
nal of Cognitive Neuroscience, 3:72–86, 1991. Bλ,υ α = [Wυ1 λ, . . . , Wυm λ]α = Bα,υ λ; Bα,υ = αi Wυi ,
[17] L. Wolf and A. Shashua. Kernel principal angles for classi- i=1
fication machines with applications to image sequence inter- (33)
pretation. Proc. IEEE Conference on Computer Vision and The term L(ε, υ, λ, α) becomes
Pattern Recognition, 2003. L(ε, υ, λ, α) = (Tε {y} − Bα,υ λ)T (Tε {y} − Bα,υ λ), (34)
[18] C. Yang, R. Duraiswami, A. Elgammal, and D. L. Real-time
kernel-based tracking in joint feature-spatial spaces. Tech which is quadratic in λ. The optimum λ̂ε,υ,α is unique and
Report CS-TR-4567, Univ. of Maryland, 2004. its value is
[19] S. Zhou and R. Chellappa. Rank constrained recognition un-
der unknown illuminations. IEEE Intl. Workshop on Analy- λ̂ε,υ,α = (Bα,υ T Bα,υ )−1 Bα,υ T y = Bα,υ † Tε {y}. (35)
sis and Modeling of Faces and Gestures, 2003.
[20] S. Zhou, V. Krueger, and R. Chellappa. Probabilistic recog- where [.]† is the pseudo-inverse. Substituting (35) into
nition of human faces from video. Computer Vision and L(ε, υ, λ, α) yields
Image Understanding, 91:214–245, 2003.
L(ε, υ, λ̂ε,υ,α , α) = Tε {y}T (Id −Bα,υ Bα,υ † )Tε {y}. (36)
It is easy to show that I(λ) is no longer a function of λ and
Appendix I – Laplace’s method [2, 13] equals to
I = σ −2 Bα,υ T Bα,υ . (37)
We are interested in computing the following quantity, for
θ = [θ1 , θ2 , . . . , θr ]T ∈ Rr , J = p(θ)dθ. Suppose that 2 If a Gaussian prior is assumed, a similar derivation can be carried.

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04)
1063-6919/04 $20.00 © 2004 IEEE

You might also like