Learning Mixtures for Image Matching (Frey, 2001)
Abstract

By representing images and image prototypes by linear subspaces spanned by "tangent vectors" (derivatives of an image with respect to translation, rotation, etc.), impressive invariance to known types of uniform distortion can be built into feedforward discriminators. We describe a new probability model that can jointly cluster data and learn mixtures of nonuniform, smooth deformation fields. Our fields are based on low-frequency wavelets, so they use very few parameters to model a wide range of smooth deformations (unlike, e.g., factor analysis, which uses a large number of parameters to model deformations). We give results on handwritten digit recognition and face recognition.
1 Introduction

Many computer vision and image processing tasks benefit from invariances to spatial deformations in the image. Examples include handwritten character recognition, face recognition and motion estimation in video sequences. When the input images are subjected to possibly large transformations from a known finite set of transformations (e.g., translations in images), it is possible to model the transformations using a discrete latent variable and perform transformation-invariant clustering and dimensionality reduction using EM (Frey and Jojic 1999a; Jojic and Frey 2000). Although this method produces excellent results on practical problems, the amount of computation grows linearly with the total number of possible transformations in the input.
In many cases, we can assume the deformations are small, e.g., due to dense temporal sampling of a video sequence, from blurring the input, or because of well-behaved handwriters. Suppose (δ_x, δ_y) is a deformation field (a vector field that specifies where to shift pixel intensity), where (δ_xi, δ_yi) is the 2-D real vector associated with pixel i. Given a vector of pixel intensities f for an image, and assuming the deformation vectors are small, we can approximate the deformed image by the first-order expansion

f̃ = f + (∂f/∂x) ∘ δ_x + (∂f/∂y) ∘ δ_y,   (1)

where ∘ denotes the element-wise product.
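The first-order approximation in (1) is easy to check numerically. The toy sketch below (not the authors' code) uses finite-difference image gradients and small, hand-picked shift fields:

```python
import numpy as np

# Toy illustration of equation (1): deform an image to first order using
# its finite-difference gradients and small shift fields dx, dy.
rng = np.random.default_rng(0)
f = rng.random((8, 8))             # toy image of pixel intensities
dx = np.full_like(f, 0.3)          # hypothetical small x-shifts, one per pixel
dy = np.full_like(f, -0.2)         # hypothetical small y-shifts

gy, gx = np.gradient(f)            # derivatives along rows (y) and columns (x)
f_tilde = f + gx * dx + gy * dy    # first-order deformed image
print(float(np.abs(f_tilde - f).max()))
```

For small fields the deformed image stays close to f, which is exactly the regime where the linearization is valid.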
The deformation fields are parameterized by low-frequency wavelets: collecting the wavelet basis vectors as the columns of a matrix R, we write

δ_x = R a_x,   δ_y = R a_y,   (2)

so each field is described by a number of wavelet coefficients that is a small fraction of the number of pixels in the image. (In contrast, each factor in factor analysis has a number of coefficients that is equal to the number of pixels.)
An advantage of wavelets is their space/frequency localization. The global trends in the image can be captured in the low-frequency coefficients while, at the same time, deformations localized in smaller regions of the image can be expressed by more spatially localized wavelets.
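As an illustration of this parameterization, the sketch below builds a small smooth basis R. Low-frequency 2-D cosines stand in for the paper's wavelets, and the function name `smooth_basis` is our own:

```python
import numpy as np

# Sketch: a smooth, low-frequency deformation basis R (low-frequency cosine
# products as a stand-in for wavelets). Each column of R is one smooth basis
# image flattened to a vector, so a handful of coefficients parameterizes a
# whole deformation field.
def smooth_basis(n, k):
    """Columns: the k*k lowest-frequency 2-D cosine products on an n x n grid."""
    t = (np.arange(n) + 0.5) * np.pi / n
    modes = [np.cos(i * t)[:, None] * np.cos(j * t)[None, :]
             for i in range(k) for j in range(k)]
    R = np.stack([m.ravel() for m in modes], axis=1)   # shape (n*n, k*k)
    return R / np.linalg.norm(R, axis=0)               # unit-norm columns

R = smooth_basis(16, 3)        # 9 coefficients describe a 256-pixel field
a_x = np.array([0.5, -0.2, 0.1, 0.0, 0.3, 0.0, 0.0, 0.0, 0.1])
delta_x = R @ a_x              # smooth x-deformation field, one value per pixel
print(R.shape, delta_x.shape)
```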
The deformed image can be expressed as

f̃ = f + (G_x f) ∘ (R a_x) + (G_y f) ∘ (R a_y),   (3)

where the derivatives in (1) are approximated by sparse matrices G_x and G_y that operate on f to compute finite differences.

(3) is bilinear in the deformation coefficients a and the original image f, i.e., it is linear in f given a and it is linear in a given f. To rewrite the element-wise product as a matrix product, we convert either the vector Gf or the vector Ra to a diagonal matrix using the diag() function:

f̃ = f + D(f) a,   where D(f) = [diag(G_x f) R   diag(G_y f) R],   (4)
f̃ = T(a) f,   where T(a) = I + diag(R a_x) G_x + diag(R a_y) G_y.   (5)
The first equation shows that, by applying a simple pseudoinverse, we can estimate the coefficients of the image deformation that transforms f into f̃: a = D(f)⁺(f̃ − f). This low-dimensional vector of coefficients minimizes the distance ||f̃ − f − D(f)a||. Under easily satisfied conditions on the differencing matrices G_x and G_y, T(a) in (5) can be made invertible regardless of the image f, so that f = T(a)⁻¹ f̃.
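Equations (4)-(5) and the pseudoinverse estimate can be checked on a toy image. In this sketch (our own, with a random matrix standing in for the wavelet basis R and simple forward differences for G_x and G_y), the coefficients a are recovered exactly:

```python
import numpy as np

n = 8                                  # image is n x n, flattened row-major
N = n * n
I = np.eye(N)

# Forward-difference matrices: (Gx f)[i] = f[r, c+1] - f[r, c], zero at the
# last column; (Gy f)[i] = f[r+1, c] - f[r, c], zero at the last row.
Gx = np.roll(I, 1, axis=1) - I
Gx[np.arange(n - 1, N, n)] = 0.0       # kill wrap-around at the last column
Gy = np.roll(I, n, axis=1) - I
Gy[N - n:] = 0.0                       # kill wrap-around at the last row

rng = np.random.default_rng(0)
f = rng.random(N)                      # flattened toy image
K = 4                                  # few coefficients per dimension
R = rng.standard_normal((N, K))        # stand-in for a low-frequency basis
a_true = 0.01 * rng.standard_normal(2 * K)   # small stacked [a_x; a_y]

# Equation (4): f~ = f + D(f) a with D(f) = [diag(Gx f) R  diag(Gy f) R].
D = np.hstack([np.diag(Gx @ f) @ R, np.diag(Gy @ f) @ R])
f_tilde = f + D @ a_true

# Equation (5): the same deformation written as a linear operator on f.
T = I + np.diag(R @ a_true[:K]) @ Gx + np.diag(R @ a_true[K:]) @ Gy

# Recover the coefficients with a pseudoinverse: a = D(f)^+ (f~ - f).
a_hat = np.linalg.pinv(D) @ (f_tilde - f)
print(np.allclose(a_hat, a_true))      # exact when D has full column rank
```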
Given a test image g, we could match f to g by computing the deformation coefficients, a = D(f)⁺(g − f), that minimize ||f̃ − g||. However, more extreme deformations can be successfully matched by deforming g as well:

g̃ = g + (G_x g) ∘ (R b_x) + (G_y g) ∘ (R b_y),   (6)

where b are the deformation coefficients for g. The difference between the two deformed images is

f̃ − g̃ = f − g + [D(f)   −D(g)] [a; b].   (7)
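A hypothetical numerical version of this joint match: stack the coefficients as [a; b] and solve the least-squares problem implied by (7). Random matrices stand in for D(f) and D(g), which would be built as in equation (4):

```python
import numpy as np

# Joint match per equation (7): choose [a; b] to minimize ||f~ - g~||.
rng = np.random.default_rng(1)
N, K = 64, 4
Df = rng.standard_normal((N, 2 * K))   # stand-in for D(f)
Dg = rng.standard_normal((N, 2 * K))   # stand-in for D(g)
f, g = rng.random(N), rng.random(N)

M = np.hstack([Df, -Dg])               # [D(f)  -D(g)]
c, *_ = np.linalg.lstsq(M, -(f - g), rcond=None)
a, b = c[:2 * K], c[2 * K:]            # deformation coefficients for f and g
residual = f - g + M @ c               # f~ - g~ at the optimum
print(np.linalg.norm(residual) <= np.linalg.norm(f - g))
```

Since c = 0 is always feasible, deforming both images can only reduce the matching residual relative to the undeformed difference f − g.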
Figure 2: (a) A Bayes net for deformable image matching. (b) A generative version of the net conditioned on e = 0.
[Figure panels: f, Ra, f̃, g̃, Rb, g (a), and a matrix visualization (b); axis ticks omitted.]
Figure 3: Estimating the image deformation due to a change in facial expression, and a subset of the learned parameters for the model of handwritten digits.
CDROM (Hull, 1994). To compare our method with other generative models, we used a training set of 2000 images to learn 10 digit models using the EM algorithm and tested the algorithms on a test set of 1000 digit images.
Deformable image matching. In Fig. 3a we estimate the optimal deformation fields necessary to match two images of a face of the same person but with different facial expressions. We set the matrix Ψ to the identity and set Λ by hand to allow a couple of pixels of deformation. See Section 2 for nomenclature.
Comparison with the mixture of diagonal Gaussians (MDG). MDG needs 10-20 classes per digit to achieve the optimal error rate of only about 8% (Frey and Jojic 1999a) on the handwritten digit recognition task. Note that our network reduces to MDG when Λ is set to zero. To demonstrate the effectiveness of adding a deformation model to MDG, we trained our model with 15 classes per digit and only a single transformation model (L = 1) for all digits, with a total of 64 deformation coefficients (8 for each dimension in the latent and the observed images). In Fig. 3b we show one of the learned cluster means, the components in the corresponding deformation matrix D and the learned covariance matrix Λ, which shows anticorrelation among the deformation coefficients for the latent and the observed image, as the network usually applies opposite deformations to these two images to achieve the match. However, there is also strong correlation between b_x and b_y and less correlation between a_x and a_y, as the network uses mostly a rotational adjustment on the input image, while the latent image is more freely deformed (Fig. 1e). Our model achieved an error rate of 3.6%. Even if we keep only the diagonal elements of Λ, the model achieves a 5% error rate.
Comparison with factor analysis. In factor analysis (FA) or in a mixture of factor analyzers (MFA), the deformation matrix D is called the factor loading matrix and is not tied to the mean as in our model (Fig. 3b). The factor covariance matrix is set to the identity matrix, as the extra freedom in the choice of the factor variances can be captured in the factor loading matrix. So, while FA/MFA try to capture the variability in the data by learning the components in the factor loading matrix and keeping the distribution over the factors fixed, our model does the opposite by tying the factor loading matrix to the mean image and learning the distribution over the factors (deformation coefficients). By doing this, we are able to expand other images using the same deformation model. This allows us to share the deformation model across clusters and also to deform the input images. The comparable error rates in classification of handwritten digits for FA/MFA (3.3%) and our model (3.6%) indicate that most of the variability in images of handwritten digits can be captured by modeling smooth, non-uniform deformations without allowing full FA learning.
Our deformable image matching network could be used for a variety of computer vision tasks such as optical flow estimation, deformation-invariant recognition and modeling correlations in deformations. For example, our learning algorithm could learn to jointly deform the mouth and eyes when modeling facial expressions.
References

A. P. Dempster, N. M. Laird and D. B. Rubin 1977. Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society B-39, 1-38.

B. J. Frey and N. Jojic 1999a. Estimating mixture models of images and inferring spatial transformations using the EM algorithm. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Ft. Collins, CO. IEEE Computer Society Press, Los Alamitos, CA.

Z. Ghahramani and G. E. Hinton 1997. The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1. Available at www.gatsby.ucl.ac.uk/zoubin.

G. E. Hinton, P. Dayan and M. Revow 1997. Modeling the manifolds of images of handwritten digits. IEEE Trans. on Neural Networks 8, 65-74.

J. J. Hull 1994. A database for handwritten text recognition research. IEEE Trans. on Pattern Analysis and Machine Intelligence 16, 550-554.

N. Jojic and B. J. Frey 2000. Topographic transformation as a discrete latent variable. In S. A. Solla, T. K. Leen and K.-R. Muller (eds) Advances in Neural Information Processing Systems 12, MIT Press, Cambridge, MA.

P. Y. Simard, Y. Le Cun and J. Denker 1993. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan and C. L. Giles (eds) Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, CA.

N. Vasconcelos and A. Lippman 1998. Multiresolution tangent distance for affine invariant classification. In M. I. Jordan, M. J. Kearns and S. A. Solla (eds) Advances in Neural Information Processing Systems 10, MIT Press, Cambridge, MA.
Appendix: EM for the deformable image matching network

To fit the network to a set of training data, we assume that the error images for the training cases are zero and estimate the maximum likelihood parameters using EM (Dempster et al. 1977). In deriving the M-step, both forms of the deformation equations, (4) and (5), are useful, depending on which parameters are being optimized. Using ⟨·⟩ to denote an average over the training set, the update equations are:
P_{c,ℓ} = ⟨P(c, ℓ | e_t = 0, g_t)⟩   (15)

μ̂_c = ⟨Σ_ℓ P(c, ℓ | e_t = 0, g_t) E[T(a)′ Ψ⁻¹ T(a) | c, ℓ, e_t = 0, g_t]⟩⁻¹ ⟨Σ_ℓ P(c, ℓ | e_t = 0, g_t) E[T(a)′ Ψ⁻¹ T(b) g_t | c, ℓ, e_t = 0, g_t]⟩   (16)

Λ̂_ℓ = ⟨Σ_c P(c, ℓ | e_t = 0, g_t) E[[a; b][a; b]′ | c, ℓ, e_t = 0, g_t]⟩ / ⟨Σ_c P(c, ℓ | e_t = 0, g_t)⟩   (17)

Ψ̂ = diag( ⟨Σ_{c,ℓ} P(c, ℓ | e_t = 0, g_t) E[(f̃ − g̃_t) ∘ (f̃ − g̃_t) | c, ℓ, e_t = 0, g_t]⟩ / ⟨Σ_{c,ℓ} P(c, ℓ | e_t = 0, g_t)⟩ )   (18)

The expectations needed to evaluate the above update equations are given by:

Φ_{c,ℓ} = cov([a; b] | c, ℓ, e_t = 0, g_t) = (Λ_ℓ⁻¹ + M′ Ψ⁻¹ M)⁻¹,
ω_{c,ℓ} = E[[a; b] | c, ℓ, e_t = 0, g_t] = Φ_{c,ℓ} M′ Ψ⁻¹ (g_t − μ_c),   (19)

where M = [D(μ_c)   −D(g_t)] as in (7), with f replaced by the cluster mean μ_c and g by the training image g_t,

E[[a; b][a; b]′ | c, ℓ, e_t = 0, g_t] = Φ_{c,ℓ} + ω_{c,ℓ} ω_{c,ℓ}′   (20)

E[(f̃ − g̃_t) ∘ (f̃ − g̃_t) | c, ℓ, e_t = 0, g_t] = (μ_c − g_t + M ω_{c,ℓ}) ∘ (μ_c − g_t + M ω_{c,ℓ}) + diag(M Φ_{c,ℓ} M′).

Then, the expectations E[a] and E[b] are the two halves of the vector ω_{c,ℓ}, while E[a_{d1} a_{d2}′] and E[a_{d1} b_{d2}′], for d1, d2 ∈ {x, y}, are square blocks of the matrix in (20).
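The posterior moments in (19)-(20) amount to standard Gaussian conditioning. A small sketch under assumed shapes (random stand-ins for M, μ_c and g_t; not the authors' code):

```python
import numpy as np

# E-step sketch for equations (19)-(20): posterior covariance
# Phi = (Lambda^-1 + M' Psi^-1 M)^-1 and posterior mean
# omega = Phi M' Psi^-1 (g - mu) for the stacked coefficients [a; b].
rng = np.random.default_rng(2)
N, K2 = 64, 8                                # pixels, stacked coefficient dim
M = rng.standard_normal((N, K2))             # stand-in for [D(mu_c)  -D(g_t)]
Lambda = np.eye(K2) * 0.1                    # prior covariance of [a; b]
Psi = np.eye(N) * 0.01                       # diagonal pixel noise covariance
mu, g = rng.random(N), rng.random(N)         # stand-ins for mu_c and g_t

Psi_inv = np.linalg.inv(Psi)
Phi = np.linalg.inv(np.linalg.inv(Lambda) + M.T @ Psi_inv @ M)   # eq (19)
omega = Phi @ M.T @ Psi_inv @ (g - mu)                           # eq (19)
second_moment = Phi + np.outer(omega, omega)                     # eq (20)
print(Phi.shape, omega.shape)
```

The first and second halves of omega give E[a] and E[b], and square blocks of second_moment give the coefficient cross-moments used in the M-step.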