Clustering with
Gaussian Mixtures
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Copyright © 2001, 2004, Andrew W. Moore
Gaussian Bayes Classifier
Reminder
$$P(y_i \mid x_k) = \frac{p(x_k \mid y_i)\, P(y_i)}{p(x_k)} = \frac{\dfrac{1}{(2\pi)^{m/2}\, \|\Sigma_i\|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i)\right) p_i}{p(x_k)}$$
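To make the reminder concrete, here is a minimal sketch of that posterior computation (not from the original slides; `mus`, `Sigmas`, and `priors` are illustrative names for the per-class parameters):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gauss_bayes_posterior(x, mus, Sigmas, priors):
    """Return P(y_i | x) for every class i, where p(x | y_i) = N(x; mu_i, Sigma_i)."""
    # Numerators: p(x | y_i) * P(y_i) for each class
    joint = np.array([multivariate_normal.pdf(x, mean=mu, cov=S) * p
                      for mu, S, p in zip(mus, Sigmas, priors)])
    # Denominator p(x) is the sum of the numerators over all classes
    return joint / joint.sum()
```

Predicting the most probable class is then just `np.argmax(gauss_bayes_posterior(x, mus, Sigmas, priors))`.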
Predicting wealth from age
[Figure: 2-d scatter of wealth against age]
Learning modelyear, mpg ---> maker

General: O(m²) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$
Aligned: O(m) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$
Spherical: O(1) cov parameters

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I$$
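To make the trade-off concrete, here is a tiny sketch (not from the original slides) that counts the free covariance parameters for each of the three choices:

```python
def n_cov_params(m: int, kind: str) -> int:
    """Number of free parameters in an m-by-m covariance matrix Sigma."""
    if kind == "general":
        # Symmetric matrix: m variances plus m(m-1)/2 covariances -> O(m^2)
        return m * (m + 1) // 2
    if kind == "aligned":
        # Diagonal matrix: one variance per attribute -> O(m)
        return m
    if kind == "spherical":
        # Sigma = sigma^2 I: a single shared variance -> O(1)
        return 1
    raise ValueError(f"unknown kind: {kind}")

for kind in ("general", "aligned", "spherical"):
    print(kind, n_cov_params(16, kind))  # e.g. m = 16 attributes: 136, 16, 1
```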
Making a Classifier from a
Density Estimator
Inputs -> Classifier -> Predict category (e.g. Dec Tree, Naïve BC)
Inputs -> Density Estimator -> Probability (e.g. Joint DE, Naïve DE, Gauss DE)
Inputs -> Regressor -> Predict real no.
Next… back to Density Estimation
The GMM assumption

• There are k components. The i'th component is called w_i.
• Component w_i has an associated mean vector μ_i.
• Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.

[Figure: 2-d data with the three component means μ₁, μ₂, μ₃ marked]
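That generative assumption is easy to sketch in code (not from the slides; the means, priors, and σ below are made-up example values): pick component i with probability P(w_i), then draw the datapoint from N(μ_i, σ²I).

```python
import numpy as np

rng = np.random.default_rng(0)

mus = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])  # k = 3 component means
priors = np.array([0.5, 0.3, 0.2])                    # P(w_1) .. P(w_k)
sigma = 1.0                                           # shared standard deviation

def sample_gmm(n: int) -> np.ndarray:
    """Draw n unlabeled 2-d datapoints from the mixture."""
    comps = rng.choice(len(mus), size=n, p=priors)    # pick a component per point
    return mus[comps] + sigma * rng.standard_normal((n, 2))  # x ~ N(mu_i, sigma^2 I)

X = sample_gmm(500)  # data like the diagrams on these slides
```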
Sometimes easy. Sometimes impossible. And sometimes in between.

(In case you're wondering what these diagrams are: they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.)
Computing likelihoods in the unsupervised case

We have x₁, x₂, … x_N.
We know P(w₁), P(w₂), …, P(w_k).
We know σ.

We can define, for any x, P(x | w_i, μ₁, μ₂, …, μ_k).
Unsupervised Learning:
Mediumly Good News
We now have a procedure such that if you give me a guess at μ₁, μ₂, …, μ_k, I can tell you the probability of the unlabeled data given those μ's.

[Figure: graph of log P(x₁, x₂, …, x₂₅ | μ₁, μ₂) against μ₁ and μ₂]
Duda & Hart’s Example
We can graph the probability density function of the data given our μ₁ and μ₂ estimates.
Expectation Maximization
The E.M. Algorithm (a detour)
Silly Example
Let events be “grades in a class”
w1 = Gets an A P(A) = ½
w2 = Gets a B P(B) = μ
w3 = Gets a C P(C) = 2μ
w4 = Gets a D P(D) = ½-3μ
(Note: 0 ≤ μ ≤ 1/6.)
Assume we want to estimate μ from data. In a given class there were
a A’s
b B’s
c C’s
d D’s
What’s the maximum likelihood estimate of μ given a,b,c,d ?
Trivial Statistics

P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ

$$P(a,b,c,d \mid \mu) = K \left(\tfrac{1}{2}\right)^{a} \mu^{b} (2\mu)^{c} \left(\tfrac{1}{2} - 3\mu\right)^{d}$$

$$\log P(a,b,c,d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log 2\mu + d \log\!\left(\tfrac{1}{2} - 3\mu\right)$$

For the max-likelihood μ, set ∂ log P / ∂μ = 0:

$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{1/2 - 3\mu} = 0$$

This gives the max-likelihood estimate

$$\mu = \frac{b + c}{6(b + c + d)}$$

So if the class got 14 A's, 6 B's, 9 C's, and 10 D's, the max-likelihood μ = 1/10. Boring, but true!
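A two-line check of that estimate (scratch code, not from the notes); note that the number of A's drops out of the formula, just as it dropped out of the derivative:

```python
a, b, c, d = 14, 6, 9, 10          # grade counts from the slide
mu = (b + c) / (6 * (b + c + d))   # max-likelihood estimate
print(mu)                          # 0.1, i.e. 1/10 -- boring, but true!
```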
Same Problem with Hidden Information

Someone tells us that:
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d

REMEMBER: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ

What is the max-likelihood estimate of μ now?
For a Gaussian mixture, the likelihood of the unlabeled data is

$$\prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j) \;=\; \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right) P(w_j)$$

where R is the number of datapoints and K is the Gaussian normalizing constant.
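In code, the log of that likelihood for 1-d data might look like this (a sketch, not the notes' code; `mus`, `priors`, and `sigma` are the known quantities from the setup above):

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, mus, priors, sigma):
    """x: (R,) datapoints; mus: (k,) means; priors: (k,) P(w_j); sigma: known std."""
    # dens[i, j] = P(w_j) * p(x_i | w_j, mu_j); norm.pdf supplies the constant K
    dens = priors * norm.pdf(x[:, None], loc=mus[None, :], scale=sigma)
    # log of the product over i of the sum over j
    return np.log(dens.sum(axis=1)).sum()
```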
E.M. for GMMs

For max likelihood we know

$$\frac{\partial}{\partial \mu_i} \log P(\text{data} \mid \mu_1 \ldots \mu_k) = 0$$

Some wild 'n' crazy algebra turns this into: "For max likelihood, for each j,

$$\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\, x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)}$$"

(See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf for the algebra.)
Since the right-hand side itself depends on the μ's, this suggests an iterative procedure:

E-step: compute the "expected" classes of all datapoints for each class:

$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p(x_k \mid w_i, \mu_i(t), \sigma^2 I)\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \sigma^2 I)\, p_j(t)}$$

M-step: compute the max-likelihood μ given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}$$
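Putting the two steps together for this known-σ, known-priors case (a sketch, not the notes' code; the names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def em_means(x, mus, priors, sigma, iters=50):
    """Re-estimate only the means mu_j of a 1-d mixture. x: (R,), mus: (k,)."""
    for _ in range(iters):
        # E-step: responsibilities P(w_j | x_i, lambda_t), shape (R, k)
        resp = priors * norm.pdf(x[:, None], loc=mus[None, :], scale=sigma)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: each mean becomes a responsibility-weighted average of the data
        mus = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mus
```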
E.M. Convergence

• Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works.
• As with all E.M. procedures, convergence to a local optimum is guaranteed.
• This algorithm is REALLY USED. And in high-dimensional state spaces, too. E.g. Vector Quantization for Speech Data.
E.M. for General GMMs

(p_i(t) is shorthand for our estimate of P(w_i) on the t'th iteration.)

Iterate. On the t'th iteration let our estimates be

$$\lambda_t = \{\, \mu_1(t) \ldots \mu_c(t),\ \Sigma_1(t) \ldots \Sigma_c(t),\ p_1(t) \ldots p_c(t) \,\}$$

E-step: compute the "expected" classes of all datapoints for each class:

$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \mu_i(t), \Sigma_i(t))\, p_i(t)}{\sum_{j=1}^{c} p(x_k \mid w_j, \mu_j(t), \Sigma_j(t))\, p_j(t)}$$

M-step: compute the max-likelihood parameters given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)} \qquad \Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, (x_k - \mu_i(t+1))(x_k - \mu_i(t+1))^T}{\sum_k P(w_i \mid x_k, \lambda_t)}$$

$$p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R} \quad \text{where } R = \#\text{records}$$
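A sketch of one such iteration in code (not from the notes; the array shapes in the docstring are assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, mus, Sigmas, ps):
    """One E.M. iteration. X: (R, m) data; mus: (c, m); Sigmas: (c, m, m); ps: (c,)."""
    R = len(X)
    # E-step: resp[k, i] = P(w_i | x_k, lambda_t)
    resp = np.column_stack([p * multivariate_normal.pdf(X, mean=mu, cov=S)
                            for mu, S, p in zip(mus, Sigmas, ps)])
    resp /= resp.sum(axis=1, keepdims=True)
    weights = resp.sum(axis=0)            # sum over k of P(w_i | x_k, lambda_t)
    # M-step: weighted means, weighted covariances, updated priors
    mus = (resp.T @ X) / weights[:, None]
    Sigmas = np.stack([(resp[:, i, None] * (X - mus[i])).T @ (X - mus[i]) / weights[i]
                       for i in range(len(mus))])
    ps = weights / R
    return mus, Sigmas, ps
```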
Gaussian Mixture Example

[Figures: the fitted mixture at the start, then after the 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]
[Figure: some bio assay data]
[Figure: GMM clustering of the assay data]
[Figure: the resulting density estimator]
Where are we now?

Inputs -> Inference Engine -> Learn P(E₁|E₂): Joint DE, Bayes Net Structure Learning
Inputs -> Classifier -> Predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, …
Inputs -> Density Estimator -> Probability: … DE, Bayes Net Structure Learning, GMMs
The old trick…

(Same learning-engine diagram as above.)
Three classes of assay
(each learned with its own mixture model)
(Sorry, this will again be semi-useless in black and white.)
Resulting Bayes Classifier
Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness: yellow means anomalous, cyan means ambiguous.
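One plausible way to implement those alerts (a sketch; the thresholds and names are made up, not from the slides): flag a point as anomalous when its overall density p(x) is tiny, and as ambiguous when its top two class posteriors are nearly tied.

```python
import numpy as np

def flag_points(px, posteriors, density_thresh=1e-3, margin_thresh=0.1):
    """px: (R,) mixture density p(x) per point; posteriors: (R, c) P(class | x)."""
    top2 = np.sort(posteriors, axis=1)[:, -2:]              # two largest posteriors
    anomalous = px < density_thresh                         # the "yellow" points
    ambiguous = (top2[:, 1] - top2[:, 0]) < margin_thresh   # the "cyan" points
    return anomalous, ambiguous
```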
Unsupervised learning with symbolic attributes

[Figure: a model over the symbolic attributes # KIDS, NATION, MARRIED, with values missing]
Final Comments
• Remember, E.M. can get stuck in local optima, and empirically it DOES.
• Our unsupervised learning example assumed the P(w_i)'s known, and the variances fixed and known. It is easy to relax these assumptions.
• It's possible to do Bayesian unsupervised learning instead of max. likelihood.
• There are other algorithms for unsupervised learning. We'll visit K-means soon. Hierarchical clustering is also interesting.
• Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the E.M. method we saw.
What you should know
• How to "learn" maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
• Be happy with this kind of probabilistic analysis.
• Understand the two examples of E.M. given in these notes.
Other unsupervised learning methods

• K-means (see next lecture)
• Hierarchical clustering (e.g. minimum spanning trees) (see next lecture)
• Principal Component Analysis: a simple, useful tool
• Non-linear PCA: neural auto-associators, locally weighted PCA, others…