
Accepted for poster presentation, AISTATS 2001, Key West, FL.

Learning mixtures of smooth, nonuniform deformation models for probabilistic image matching

Nebojsa Jojic¹, Brendan J. Frey², Patrice Simard¹, David Heckerman¹

¹ Microsoft Research, Redmond, Washington    ² Computer Science, University of Waterloo

Abstract

By representing images and image prototypes by linear subspaces spanned by "tangent vectors" (derivatives of an image with respect to translation, rotation, etc.), impressive invariance to known types of uniform distortion can be built into feedforward discriminators. We describe a new probability model that can jointly cluster data and learn mixtures of nonuniform, smooth deformation fields. Our fields are based on low-frequency wavelets, so they use very few parameters to model a wide range of smooth deformations (unlike, e.g., factor analysis, which uses a large number of parameters to model deformations). We give results on handwritten digit recognition and face recognition.

1 Introduction

Many computer vision and image processing tasks benefit from invariance to spatial deformations in the image. Examples include handwritten character recognition, face recognition and motion estimation in video sequences. When the input images are subjected to possibly large transformations from a known finite set of transformations (e.g., translations in images), it is possible to model the transformations using a discrete latent variable and perform transformation-invariant clustering and dimensionality reduction using EM (Frey and Jojic 1999a; Jojic and Frey 2000). Although this method produces excellent results on practical problems, the amount of computation grows linearly with the total number of possible transformations in the input.
In many cases, we can assume the deformations are small, e.g., due to dense temporal sampling of a video sequence, from blurring the input, or because of well-behaved handwriters. Suppose $(\delta_x, \delta_y)$ is a deformation field (a vector field that specifies where to shift pixel intensity), where $(\delta_{xi}, \delta_{yi})$ is the 2-D real vector associated with pixel $i$. Given a vector of pixel intensities $f$ for an image, and assuming the deformation vectors are small, we can approximate the deformed image by

$$\tilde{f} = f + \frac{\partial f}{\partial x} \circ \delta_x + \frac{\partial f}{\partial y} \circ \delta_y, \qquad (1)$$

where $\circ$ denotes the element-wise product and $\partial f/\partial x$ is a gradient image computed by shifting the original image to the right by a small amount and then subtracting off the original image.

Figure 1: (a) An image of a hand-written digit. (b) A smooth, non-uniform deformation field. (c) The resulting deformed image. (d) Rotation and translation deformation fields. (e) Examples of deformed images produced by learned distributions over wavelet-based fields.

Suppose $\delta_y = 0$ and $\delta_x = \alpha \mathbf{1}$, where $\alpha$ is a scalar. Then, (1) shifts the image to the right by an amount proportional to $\alpha$. Fig. 1 shows some more complex examples of deformations computed in this way.
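As a quick illustration of the approximation in (1), the sketch below (NumPy; the variable names are ours, not from the paper) builds the gradient image by shifting an image one pixel to the right and subtracting the original, then applies the uniform field $\delta_y = 0$, $\delta_x = \alpha\mathbf{1}$ to shift a toy image right by an amount proportional to $\alpha$. Wraparound at the image border is a simplification for brevity.

```python
import numpy as np

def gradient_images(f):
    """Gradient images for eq. (1): shift the image a small amount (one pixel,
    with wraparound for simplicity) and subtract the original."""
    df_dx = np.roll(f, 1, axis=1) - f   # content shifted right, minus original
    df_dy = np.roll(f, 1, axis=0) - f   # content shifted down, minus original
    return df_dx, df_dy

def deform(f, delta_x, delta_y):
    """First-order deformation of eq. (1): f~ = f + df/dx o delta_x + df/dy o delta_y."""
    df_dx, df_dy = gradient_images(f)
    return f + df_dx * delta_x + df_dy * delta_y

# Uniform rightward shift: delta_y = 0, delta_x = alpha * 1
f = np.zeros((8, 8))
f[2:6, 2:4] = 1.0                       # toy "image"
alpha = 0.5
f_tilde = deform(f, alpha * np.ones_like(f), np.zeros_like(f))
```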
Simard et al. (1992, 1993) considered a deformation field that is a linear combination of the uniform fields for translation, rotation, scaling and shearing, plus the nonuniform field for line thickness. When the deformation field is parameterized by a scalar $\alpha$ (e.g., x-translation), $\frac{\partial f}{\partial x} \circ \delta_x + \frac{\partial f}{\partial y} \circ \delta_y$ can be viewed as the gradient of $f$ with respect to $\alpha$. Since the above approximation holds for small $\alpha$, this gradient is tangent to the true 1-D deformation manifold of $f$.
By processing the input from coarse to fine resolution, this tangent-based construction of a deformation field has also been used to model large deformations in an approximate manner (Vasconcelos and Lippman 1998).
The tangent approximation can also be included in generative models, including linear factor analyzer models (Hinton et al. 1997) and nonlinear generative models (Jojic and Frey 2000).
Another approach to modeling small deformations is to jointly cluster the data and learn a locally linear deformation model for each cluster, e.g., using EM in a factor analyzer (Ghahramani and Hinton 1997). An advantage of this approach over the tangent approach is that the types of deformation need not be specified beforehand. So, unknown, nonuniform types of deformation can be learned. However, a large amount of data is needed to accurately model the deformations, and learning is susceptible to local optima that confuse deformed data from one cluster with data from another cluster. (Some factors tend to "erase" parts of the image and "draw" new parts, instead of just perturbing the image.)
We describe a new probability model that can jointly cluster data and learn mixtures of nonuniform, smooth deformation fields. In contrast to the tangent approach, where the deformation field is a linear combination of prespecified uniform deformation fields (such as translation), in our model the deformation field is a linear combination of low-frequency wavelets. A mixture model of these wavelet coefficients is learned from the data, so our model can capture multiple types of nonuniform, smooth image deformations. In contrast to factor analysis, using a low-frequency wavelet basis allows our model to use significantly fewer parameters to represent a wide range of realistic deformations. For example, our model is much less likely to use a deformation field to "erase" part of an image and "draw" a new part, since the necessary field is usually not smooth.
Our generative model also incorporates the idea of "symmetric tangent distance" (Simard et al. 1993) by including deformations of the observed image. This allows the linear model for deformations to hold for larger transformations, as the prototype image and the observed image are both deformed to achieve a match.

2 Smooth, wavelet-based deformation fields

We ensure the deformation field $(\delta_x, \delta_y)$ is smooth by constructing it from low-frequency wavelets,

$$\delta_x = R a_x, \qquad \delta_y = R a_y, \qquad (2)$$
where the columns of $R$ contain low-frequency wavelet basis vectors, and $a = \begin{bmatrix} a_x \\ a_y \end{bmatrix}$ are the deformation coefficients. We use a number of deformation coefficients that is a small fraction of the number of pixels in the image. (In contrast, each factor in factor analysis has a number of coefficients that is equal to the number of pixels.)
An advantage of wavelets is their space/frequency localization. The global trends in the image can be captured in the low-frequency coefficients, while at the same time deformations localized in smaller regions of the image can be expressed by more spatially localized wavelets.
The deformed image can be expressed as

$$\tilde{f} = f + (G_x f) \circ (R a_x) + (G_y f) \circ (R a_y), \qquad (3)$$

where the derivatives in (1) are approximated by sparse matrices $G_x$ and $G_y$ that operate on $f$ to compute finite differences.
Equation (3) is bilinear in the deformation coefficients $a$ and the original image $f$, i.e., it is linear in $f$ given $a$ and it is linear in $a$ given $f$. To rewrite the element-wise product as a matrix product, we convert either the vector $Gf$ or the vector $Ra$ to a diagonal matrix using the diag() function:

$$\tilde{f} = f + D(f)a, \quad \text{where} \quad D(f) = [\,\mathrm{diag}(G_x f) R \;\; \mathrm{diag}(G_y f) R\,], \qquad (4)$$
$$\tilde{f} = T(a)f, \quad \text{where} \quad T(a) = I + \mathrm{diag}(R a_x) G_x + \mathrm{diag}(R a_y) G_y. \qquad (5)$$

The first equation shows that, by applying a simple pseudo-inverse, we can estimate the coefficients of the image deformation that transforms $f$ into $\tilde{f}$: $a = D(f)^{+}(\tilde{f} - f)$. This low-dimensional vector of coefficients minimizes the distance $\|\tilde{f} - f - D(f)a\|$. Under easily satisfied conditions on the differencing matrices $G_x$ and $G_y$, $T(a)$ in (5) can be made invertible regardless of the image $f$, so that $f = T(a)^{-1}\tilde{f}$.
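The identities (2)-(5) are easy to check numerically. The sketch below is our own construction, not the authors' code: it builds $G_x$ and $G_y$ as shift-minus-identity matrices acting on flattened images, uses a low-frequency 2-D cosine basis as a stand-in for the paper's wavelet basis $R$, verifies that $f + D(f)a$ and $T(a)f$ agree, and ends with the pseudo-inverse recovery of $a$ mentioned in the text.

```python
import numpy as np

H, W = 8, 8
N = H * W

def shift_op(axis, step):
    """N x N matrix that shifts a flattened H x W image by `step` pixels
    along `axis` (wraparound at the border, for simplicity)."""
    M = np.zeros((N, N))
    for j in range(N):
        e = np.zeros(N)
        e[j] = 1.0
        M[:, j] = np.roll(e.reshape(H, W), step, axis=axis).ravel()
    return M

# Finite-difference operators used in eq. (3): G f = (shifted image) - image
Gx = shift_op(axis=1, step=1) - np.eye(N)
Gy = shift_op(axis=0, step=1) - np.eye(N)

# Low-frequency basis R; a 2-D cosine basis stands in for the paper's wavelets.
def low_freq_basis(k_per_dim=2):
    yy, xx = np.mgrid[0:H, 0:W]
    cols = []
    for ky in range(k_per_dim):
        for kx in range(k_per_dim):
            b = np.cos(np.pi * ky * (yy + 0.5) / H) * np.cos(np.pi * kx * (xx + 0.5) / W)
            cols.append(b.ravel())
    R = np.array(cols).T
    return R / np.linalg.norm(R, axis=0)

R = low_freq_basis()                      # N x K, with K << N
K = R.shape[1]

rng = np.random.default_rng(0)
f = rng.random(N)                         # a toy image, flattened
a_x, a_y = 0.1 * rng.standard_normal(K), 0.1 * rng.standard_normal(K)
a = np.concatenate([a_x, a_y])

# Eq. (4): D(f) = [diag(Gx f) R   diag(Gy f) R]
D = np.hstack([np.diag(Gx @ f) @ R, np.diag(Gy @ f) @ R])
# Eq. (5): T(a) = I + diag(R a_x) Gx + diag(R a_y) Gy
T = np.eye(N) + np.diag(R @ a_x) @ Gx + np.diag(R @ a_y) @ Gy

assert np.allclose(f + D @ a, T @ f)      # the two forms of eq. (3) agree

# Estimating the coefficients with a pseudo-inverse, as in the text
a_hat = np.linalg.pinv(D) @ (T @ f - f)   # recovers a when D has full column rank
```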
Given a test image $g$, we could match $f$ to $g$ by computing the deformation coefficients, $a = D(f)^{+}(g - f)$, that minimize $\|\tilde{f} - g\|$. However, more extreme deformations can be successfully matched by deforming $g$ as well:

$$\tilde{g} = g + (G_x g) \circ (R b_x) + (G_y g) \circ (R b_y), \qquad (6)$$

where $b$ are the deformation coefficients for $g$. The difference between the two deformed images is

$$\tilde{f} - \tilde{g} = f - g + [\,D(f) \;\; -D(g)\,]\begin{bmatrix} a \\ b \end{bmatrix}. \qquad (7)$$
Figure 2: (a) A Bayes net for deformable image matching. (b) A generative version of the net conditioned on e = 0.

Again, minimizing $\|\tilde{f} - \tilde{g}\|$ is a simple quadratic optimization with respect to the deformation coefficients $a$, $b$. To favor some deformation fields over others, we can include a cost term that depends on the deformation coefficients.
Finally, a versatile image distance can be defined as

$$d(f, g) = \min_{a,b}\left\{ (\tilde{f} - \tilde{g})' \Psi^{-1} (\tilde{f} - \tilde{g}) + [\,a' \;\; b'\,]\, \Gamma^{-1} \begin{bmatrix} a \\ b \end{bmatrix} \right\}. \qquad (8)$$

The matrix $\Psi$ is diagonal and its non-zero elements contain the variances of the appropriate pixels. This distance allows different pixels to have different importance. For example, if we are matching two images of a tree in the wind, the deformation coefficients should be capable of aligning the trunk and large branches, while the variability in the appearance of the leaves would be captured in $\Psi$. $\Gamma$ captures the covariance structure of the wavelet coefficients of the allowed deformations. This distance can be used in the same applications as tangent distance, but being Bayesians (Patrice excluded!), we proceed with a probabilistic model.
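Because the objective in (8) is an unconstrained quadratic in the stacked coefficient vector $[a;\, b]$, the minimizing coefficients have a closed form. The sketch below assumes $D(f)$ and $D(g)$ have been built as in the previous sketch and that $\Psi$ is supplied by its diagonal; the function name and argument layout are ours, not the authors'.

```python
import numpy as np

def deformation_distance(f, g, D_f, D_g, psi_diag, Gamma):
    """Closed-form evaluation of the distance d(f, g) in eq. (8).

    f, g     : flattened images (length N)
    D_f, D_g : D(f) and D(g) from eq. (4), each N x 2K
    psi_diag : length-N vector of pixel variances (diagonal of Psi)
    Gamma    : 4K x 4K covariance of the stacked coefficients [a; b]
    """
    M = np.hstack([D_f, -D_g])                    # f~ - g~ = (f - g) + M [a; b]
    r = f - g
    psi_inv = 1.0 / psi_diag
    Gamma_inv = np.linalg.inv(Gamma)
    # Setting the gradient of the quadratic objective to zero gives a linear system
    A = M.T @ (psi_inv[:, None] * M) + Gamma_inv
    z = -np.linalg.solve(A, M.T @ (psi_inv * r))  # optimal [a; b]
    resid = r + M @ z
    return resid @ (psi_inv * resid) + z @ Gamma_inv @ z
```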

3 Bayes net for deformable image matching

In Fig. 2a we show a Bayes net that can be used to compute the likelihood that the input image matches the images modeled by the network. For classification, we learn one of these networks for each class of data.
The generative matching process begins by clamping the test image $g$. Then, an image cluster index $c$ is drawn from $P(c)$ and, given $c$, a latent image $f$ is drawn from a Gaussian, $N(f; \mu_c, \Phi_c)$. In this paper, we assume $\Phi_c = 0$, so $p(f|c) = \delta(f - \mu_c)$. This allows us to use exact EM to learn the parameters of the model. We are investigating techniques which would allow us to learn $\Phi_c$ as well.
Next, a deformation type index $\ell$ is picked according to $P(\ell|c)$. This index determines the covariance $\Gamma_\ell$ of the deformation coefficients for both the latent image $f$ and the test image $g$:

$$p\!\left(\begin{bmatrix} a \\ b \end{bmatrix} \middle|\, \ell\right) = N\!\left(\begin{bmatrix} a \\ b \end{bmatrix}; 0, \Gamma_\ell\right). \qquad (9)$$

$\Gamma_\ell$ could be a diagonal matrix with larger elements corresponding to lower-frequency basis functions, to capture a wide range of smooth non-uniform deformations. However, $\Gamma_\ell$ could also capture correlations among deformations in different parts of the image. The deformation coefficients $a$ for the latent image and $b$ for the observed image should be strongly correlated, so we model their joint distribution instead of modeling $a$ and $b$ separately.
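For concreteness, here is one way to realize the diagonal choice of $\Gamma_\ell$ suggested above: variances that decay with the spatial frequency of the corresponding basis column, repeated for the x and y fields of both the latent coefficients $a$ and the observed-image coefficients $b$. The function and its parameters are illustrative, not taken from the paper.

```python
import numpy as np

def diagonal_gamma(freqs, base_var=1.0, decay=2.0):
    """Diagonal Gamma_l whose entries shrink with the spatial frequency of the
    corresponding basis column of R, duplicated for the x/y fields of both the
    latent-image coefficients (a) and the observed-image coefficients (b)."""
    per_field = base_var / (1.0 + np.asarray(freqs, dtype=float)) ** decay
    diag = np.concatenate([per_field, per_field,      # a_x, a_y
                           per_field, per_field])     # b_x, b_y
    return np.diag(diag)

# e.g., four basis columns with frequency indices 0, 1, 1, 2:
Gamma_l = diagonal_gamma([0, 1, 1, 2])
```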
Once the deformation coefficients $a$, $b$ have been generated, the deformed latent image $\tilde{f}$ and the deformed test image $\tilde{g}$ are produced from $f$ and $g$ according to (3) and (6). Using the functions $D(\cdot)$ and $T(\cdot)$ introduced above, we have

$$p(\tilde{f}|f, a) = \delta(\tilde{f} - f - D(f)a) = \delta(\tilde{f} - T(a)f), \qquad (10)$$
$$p(\tilde{g}|g, b) = \delta(\tilde{g} - g - D(g)b) = \delta(\tilde{g} - T(b)g). \qquad (11)$$
As an illustration of the generative process up to this point, in Fig. 1 we show several images produced by randomly selecting 8 deformation coefficients from a unit-covariance Gaussian and applying the resulting deformation field to an image.
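A sketch of that illustration, reusing $R$, $G_x$ and $G_y$ from the earlier sketch and drawing unit-covariance coefficients as described for Fig. 1e (the function is our own, shown only to make the generative step concrete):

```python
import numpy as np

def sample_deformed_images(f, R, Gx, Gy, n_samples=4, scale=1.0, seed=0):
    """Draw deformation coefficients from a unit-covariance Gaussian and apply
    T(a) from eq. (5), mimicking generative steps (9)-(10) with Gamma_l = I."""
    rng = np.random.default_rng(seed)
    N, K = R.shape
    samples = []
    for _ in range(n_samples):
        a_x = scale * rng.standard_normal(K)
        a_y = scale * rng.standard_normal(K)
        T = np.eye(N) + np.diag(R @ a_x) @ Gx + np.diag(R @ a_y) @ Gy
        samples.append(T @ f)              # a deformed version of f
    return samples
```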
The last random variable in the model is an error image $e$ (called a "reference signal" in control theory), which is formed by adding a small amount of diagonal Gaussian noise to the difference between the deformed images $\tilde{f}$ and $\tilde{g}$:

$$p(e|\tilde{f}, \tilde{g}, c) = N(e;\, \tilde{f} - \tilde{g},\, \Psi_c). \qquad (12)$$

For good model parameters, it is likely that one of the cluster means can be slightly deformed to match a slightly deformed observed image. However, due to the constrained nature of these deformations, an exact match may not be achievable. Thus, to allow an exact match, the model absorbs the residual image difference with a small amount of non-uniform, cluster-dependent noise. $\Psi_c$ is diagonal and its non-zero elements contain the pixel variances. A natural place to include cluster dependence is in fact in the cluster noise $\Phi_c$. Since we have chosen to collapse this noise model to zero, it is helpful to add cluster dependence into $\Psi_c$.
This model can now be used to evaluate how likely it is to achieve a zero error image $e$ by randomly selecting hidden variables conditioned on their parents in the fashion described above. If the model has the right cluster means, the right noise levels and the right variability in the deformation coefficients, then the likelihood $p(e = 0|g)$ will be high. Thus, this likelihood can be used for classification of images when the parameters of the models for different classes are known. Also, we can use the EM algorithm to estimate the parameters of the model that maximize this likelihood for all observed images $g_t$ in a training data set (see the Appendix).
By conditioning on $e = 0$, we can transform the network into the generative network shown in Fig. 2b (to do so in a straightforward fashion, we assume that $|T(b)| = 1$).
After collapsing the deterministic nodes in the network, the joint distribution conditioned on the input $g$ is

$$p(c, \ell, a, b, e|g) = P_{c,\ell}\, N\!\left(\begin{bmatrix} a \\ b \end{bmatrix}; 0, \Gamma_\ell\right) N\big(e;\, \mu_c + D(\mu_c)a - g - D(g)b,\, \Psi_c\big). \qquad (13)$$

By integrating out the deformation coefficients we obtain $p(c, \ell, e|g) = P_{c,\ell}\, N\big(e;\, \mu_c - g,\, [\Psi_c^{-1} - \Psi_c^{-1} M_c \Omega_{c,\ell} M_c' \Psi_c^{-1}]^{-1}\big)$, where $M_c = [\,D(\mu_c) \;\; -D(g)\,]$ and $\Omega_{c,\ell} = (\Gamma_\ell^{-1} + M_c' \Psi_c^{-1} M_c)^{-1}$. This density function can be normalized over $c$, $\ell$ to obtain $P(c, \ell|e, g)$. The likelihood can be computed by summing over the class and transformation indices:

$$p(e|g) = \sum_{c=1}^{C} \sum_{\ell=1}^{L} P_{c,\ell}\, N\big(e;\, \mu_c - g,\, [\Psi_c^{-1} - \Psi_c^{-1} M_c \Omega_{c,\ell} M_c' \Psi_c^{-1}]^{-1}\big). \qquad (14)$$
By using this likelihood instead of the distance measure in (8), we are integrating over all possible deformations instead of finding the optimal deformation (which is given by (19) in the Appendix).
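The likelihood in (14) can be evaluated directly by noting that, by the matrix inversion lemma, the covariance $[\Psi_c^{-1} - \Psi_c^{-1} M_c \Omega_{c,\ell} M_c' \Psi_c^{-1}]^{-1}$ equals $\Psi_c + M_c \Gamma_\ell M_c'$. A minimal sketch follows; the `model` container and the `D_of` helper are hypothetical names of our own, not from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood_e0(g, model, D_of):
    """log p(e = 0 | g) of eq. (14) for one class-conditional model.

    model : dict with 'P' (C x L array of priors P_{c,l}), 'mu' (cluster means),
            'Psi' (per-cluster pixel-variance vectors) and 'Gamma' (per-l
            coefficient covariances) -- a hypothetical container.
    D_of  : function building D(.) as in eq. (4).
    """
    C, L = model['P'].shape
    log_terms = []
    for c in range(C):
        mu = model['mu'][c]
        Psi = np.diag(model['Psi'][c])
        M = np.hstack([D_of(mu), -D_of(g)])        # M_c = [D(mu_c)  -D(g)]
        for l in range(L):
            # Marginal covariance of e (equivalent to the bracketed form in (14))
            S = Psi + M @ model['Gamma'][l] @ M.T
            log_terms.append(np.log(model['P'][c, l]) +
                             multivariate_normal.logpdf(np.zeros_like(g),
                                                        mean=mu - g, cov=S))
    return np.logaddexp.reduce(log_terms)
```

For classification, this quantity would be computed under each class's model and the class with the largest value chosen.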

4 Experiments and conclusions

We tested our algorithm on 20×28 greyscale images of people with different facial expressions and on 8×8 greyscale images of handwritten digits from the CEDAR CDROM (Hull, 1994).
Figure 3: (a) Estimating the image deformation due to a change in facial expression (panels show $f$, $Ra$, $\tilde{f}$, $\tilde{g}$, $Rb$ and $g$). (b) A subset of the learned parameters for the model of handwritten digits: the mean $\mu_c$ and the components of $D(\mu_c)$ for one cluster, and the matrix $\Gamma$.

To compare our method with other generative models, we used a training set of 2000 images to learn 10 digit models using the EM algorithm, and tested the algorithms on a test set of 1000 digit images.
Deformable image matching. In Fig. 3a we estimate the optimal deformation fields necessary to match two images of the same person's face with different facial expressions. We set the matrix $\Psi$ to the identity and we set $\Gamma$ by hand to allow a couple of pixels of deformation. See Section 2 for nomenclature.
Comparison with the mixture of diagonal Gaussians (MDG). MDG needs 10-20 classes per digit to achieve its optimal error rate of about 8% (Frey and Jojic 1999a) on the handwritten digit recognition task. Note that our network reduces to MDG when $\Gamma_\ell$ is set to zero. To demonstrate the effectiveness of adding a deformation model to MDG, we trained our model with 15 classes per digit and only a single transformation model ($L = 1$) for all digits, with a total of 64 deformation coefficients (8 for each dimension in the latent and the observed images). In Fig. 3b we show one of the learned cluster means, the components of the corresponding deformation matrix $D$ and the learned covariance matrix $\Gamma$. $\Gamma$ shows anticorrelation between the deformation coefficients for the latent and the observed image, as the network usually applies opposite deformations to these two images to achieve the match. However, there is also strong correlation between $b_x$ and $b_y$ and less correlation between $a_x$ and $a_y$, as the network uses mostly a rotational adjustment on the input image, while the latent image is more freely deformed (Fig. 1e). Our model achieved an error rate of 3.6%. Even if we keep only the diagonal elements of $\Gamma$, the model achieves a 5% error rate.
Comparison with factor analysis. In factor analysis (FA) or in a mixture of factor analyzers (MFA), the deformation matrix $D$ is called the factor loading matrix and is not tied to the mean $\mu$ as in our model (Fig. 3b). The factor covariance matrix is set to the identity matrix, as the extra freedom in the choice of the factor variances can be captured in the factor loading matrix. So, while FA/MFA try to capture the variability in the data by learning the components of the factor loading matrix and keeping the distribution over the factors fixed, our model does the opposite: it ties the factor loading matrix to the mean image and learns the distribution over the factors (the deformation coefficients). By doing this, we are able to expand other images using the same deformation model. This allows us to share the deformation model across clusters and also to deform the input images.
The comparable error rates in classification of handwritten digits for FA/MFA (3.3%) and our model (3.6%) indicate that most of the variability in images of handwritten digits can be captured by modeling smooth, non-uniform deformations without allowing full FA learning.
Our deformable image matching network could be used for a variety of computer vision tasks such as optical flow estimation, deformation-invariant recognition and modeling correlations in deformations. For example, our learning algorithm could learn to jointly deform the mouth and eyes when modeling facial expressions.
References

A. P. Dempster, N. M. Laird and D. B. Rubin 1977. Maximum likelihood from incomplete data via the EM algorithm. Proceedings of the Royal Statistical Society B-39, 1-38.

B. J. Frey and N. Jojic 1999a. Estimating mixture models of images and inferring spatial transformations using the EM algorithm. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Ft. Collins, CO. IEEE Computer Society Press, Los Alamitos, CA.

Z. Ghahramani and G. E. Hinton 1997. The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1. Available at www.gatsby.ucl.ac.uk/zoubin.

G. E. Hinton, P. Dayan and M. Revow 1997. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks 8, 65-74.

N. Jojic and B. J. Frey 2000. Topographic transformation as a discrete latent variable. In S. A. Solla, T. K. Leen and K.-R. Muller (eds), Advances in Neural Information Processing Systems 12, MIT Press, Cambridge, MA.

P. Y. Simard, Y. Le Cun and J. Denker 1993. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan and C. L. Giles (eds), Advances in Neural Information Processing Systems 5, Morgan Kaufmann, San Mateo, CA.

N. Vasconcelos and A. Lippman 1998. Multiresolution tangent distance for affine invariant classification. In M. I. Jordan, M. J. Kearns and S. A. Solla (eds), Advances in Neural Information Processing Systems 10, MIT Press, Cambridge, MA.
Appendix: EM for the deformable image matching network

To fit the network to a set of training data, we assume that the error images for the training cases are zero and estimate the maximum likelihood parameters using EM (Dempster et al. 1977). In deriving the M-step, both forms of the deformation equations, (4) and (5), are useful, depending on which parameters are being optimized. Using $\langle \cdot \rangle$ to denote an average over the training set, the update equations are:
$$P_{c,\ell} = \big\langle P(c, \ell\,|\,e_t = 0, g_t) \big\rangle \qquad (15)$$

$$\hat{\mu}_c = \Big\langle \sum_{\ell} P(c, \ell\,|\,e_t = 0, g_t)\, E\big[T(a)' \Psi_c^{-1} T(a)\,\big|\,c, \ell, e_t = 0, g_t\big] \Big\rangle^{-1} \Big\langle \sum_{\ell} P(c, \ell\,|\,e_t = 0, g_t)\, E\big[T(a)' \Psi_c^{-1} T(b)\, g_t\,\big|\,c, \ell, e_t = 0, g_t\big] \Big\rangle \qquad (16)$$

$$\hat{\Gamma}_\ell = \frac{\Big\langle \sum_{c} P(c, \ell\,|\,e_t = 0, g_t)\, E\Big[\begin{bmatrix} a \\ b \end{bmatrix} [\,a' \;\; b'\,] \,\Big|\, c, \ell, e_t = 0, g_t\Big] \Big\rangle}{\Big\langle \sum_{c} P(c, \ell\,|\,e_t = 0, g_t) \Big\rangle} \qquad (17)$$

$$\hat{\Psi}_c = \mathrm{diag}\!\left( \frac{\Big\langle \sum_{\ell} P(c, \ell\,|\,e_t = 0, g_t)\, E\big[(\tilde{f} - \tilde{g}_t) \circ (\tilde{f} - \tilde{g}_t)\,\big|\,c, \ell, e_t = 0, g_t\big] \Big\rangle}{\Big\langle \sum_{\ell} P(c, \ell\,|\,e_t = 0, g_t) \Big\rangle} \right) \qquad (18)$$
The expectations needed to evaluate the above update equations are given by:

$$\Omega_{c,\ell} = \mathrm{cov}\!\left(\begin{bmatrix} a \\ b \end{bmatrix} \Big|\, c, \ell, e_t = 0, g_t\right) = (\Gamma_\ell^{-1} + M_c' \Psi_c^{-1} M_c)^{-1}$$

$$\omega_{c,\ell} = E\!\left[\begin{bmatrix} a \\ b \end{bmatrix} \Big|\, c, \ell, e_t = 0, g_t\right] = -\Omega_{c,\ell}\, M_c' \Psi_c^{-1} (\mu_c - g_t) \qquad (19)$$

$$E\!\left[\begin{bmatrix} a \\ b \end{bmatrix} [\,a' \;\; b'\,] \,\Big|\, c, \ell, e_t = 0, g_t\right] = \Omega_{c,\ell} + \omega_{c,\ell}\, \omega_{c,\ell}' \qquad (20)$$

$$E\big[(\tilde{f} - \tilde{g}_t) \circ (\tilde{f} - \tilde{g}_t)\,\big|\,c, \ell, e_t = 0, g_t\big] = \big(\mu_c - g_t + M_c\, \omega_{c,\ell}\big) \circ \big(\mu_c - g_t + M_c\, \omega_{c,\ell}\big) + \mathrm{diag}\big(M_c\, \Omega_{c,\ell}\, M_c'\big)$$

The expectations in (16) are computed using

$$T(a)' \Psi_c^{-1} T(a) = \Psi_c^{-1} + \sum_{d \in \{x,y\}} G_d' \,\mathrm{diag}(R a_d)\, \Psi_c^{-1} + \sum_{d \in \{x,y\}} \Psi_c^{-1}\, \mathrm{diag}(R a_d)\, G_d + \sum_{d_1, d_2 \in \{x,y\}} G_{d_1}' \,\Psi_c^{-1}\, \mathrm{diag}(R\, a_{d_1} a_{d_2}' R')\, G_{d_2}, \qquad (21)$$

$$T(a)' \Psi_c^{-1} T(b)\, g_t = \Psi_c^{-1} g_t + \sum_{d \in \{x,y\}} G_d' \,\mathrm{diag}(R a_d)\, \Psi_c^{-1} g_t + \sum_{d \in \{x,y\}} \Psi_c^{-1}\, \mathrm{diag}(R b_d)\, G_d\, g_t + \sum_{d_1, d_2 \in \{x,y\}} G_{d_1}' \,\Psi_c^{-1}\, \mathrm{diag}(R\, a_{d_1} b_{d_2}' R')\, G_{d_2}\, g_t. \qquad (22)$$

Then, the expectations $E[a]$ and $E[b]$ are the two halves of the vector $\omega_{c,\ell}$, while $E[a_{d_1} a_{d_2}']$ and $E[a_{d_1} b_{d_2}']$, for $d_1, d_2 \in \{x, y\}$, are square blocks of the matrix in (20).
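For reference, the posterior moments in (19)-(20) are straightforward to compute for a diagonal $\Psi_c$; a minimal sketch (the function and argument names are ours):

```python
import numpy as np

def posterior_moments(mu_c, g_t, M_c, psi_c_diag, Gamma_l):
    """E-step quantities of eqs. (19)-(20): covariance Omega_{c,l}, mean
    omega_{c,l} and second moment of the stacked coefficients [a; b],
    conditioned on c, l and e_t = 0."""
    psi_inv = 1.0 / psi_c_diag
    Omega = np.linalg.inv(np.linalg.inv(Gamma_l) + M_c.T @ (psi_inv[:, None] * M_c))
    omega = -Omega @ (M_c.T @ (psi_inv * (mu_c - g_t)))
    second_moment = Omega + np.outer(omega, omega)        # eq. (20)
    return Omega, omega, second_moment
```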
