CS 189 Introduction to Machine Learning

Spring 2018 DIS10

1 One Dimensional Mixture of Two Gaussians


Suppose we have a mixture of two Gaussians in R that can be described by a pair of random
variables (X, Z), where X takes values in R and Z takes values in the set {1, 2}. The joint
distribution of the pair (X, Z) is given to us as follows:

\[
Z \sim \mathrm{Bernoulli}(\beta), \qquad (X \mid Z = k) \sim \mathcal{N}(\mu_k, \sigma_k^2), \quad k \in \{1, 2\}.
\]

We use θ to denote the set of all parameters {β, µ1, σ1, µ2, σ2}.

(a) Write down the expressions for the joint likelihoods pθ(X = xi, Zi = 1) and pθ(X = xi, Zi = 2). What is the marginal likelihood pθ(X = xi)?
Solution:
Joint likelihood:

\[
\begin{aligned}
p_\theta(X = x_i, Z_i = 1) &= p_\theta(X = x_i \mid Z_i = 1)\, p(Z_i = 1) = \beta\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2), \\
p_\theta(X = x_i, Z_i = 2) &= p_\theta(X = x_i \mid Z_i = 2)\, p(Z_i = 2) = (1 - \beta)\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2).
\end{aligned}
\]

Marginal likelihood:
\[
p_\theta(X = x_i) = \sum_k p_\theta(X = x_i, Z_i = k) = \sum_k p_\theta(X = x_i \mid Z_i = k)\, p(Z_i = k) = \beta\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + (1 - \beta)\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2).
\]
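
As a quick sanity check, here is a minimal NumPy/SciPy sketch (the parameter values and the point x_i are made up for illustration) that evaluates these joint and marginal likelihoods:

```python
from scipy.stats import norm

# Made-up values for theta = {beta, mu1, sigma1, mu2, sigma2}, for illustration only.
beta, mu1, sigma1, mu2, sigma2 = 0.3, -2.0, 1.0, 3.0, 1.5

def joint_likelihood(x, k):
    """p_theta(X = x, Z = k) for k in {1, 2}."""
    if k == 1:
        return beta * norm.pdf(x, loc=mu1, scale=sigma1)
    return (1 - beta) * norm.pdf(x, loc=mu2, scale=sigma2)

def marginal_likelihood(x):
    """p_theta(X = x) = sum over k of p_theta(X = x, Z = k)."""
    return joint_likelihood(x, 1) + joint_likelihood(x, 2)

x_i = 0.5
print(joint_likelihood(x_i, 1), joint_likelihood(x_i, 2), marginal_likelihood(x_i))
```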

(b) What is the log-likelihood ℓθ(x)? Why is this hard to optimize?


Solution:
Log-likelihood:

\[
\ell_\theta(x) = \log p_\theta(X = x_1, \ldots, X = x_n) = \sum_{i=1}^{n} \log p_\theta(X = x_i) = \sum_{i=1}^{n} \log\!\left[\beta\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + (1 - \beta)\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2)\right]
\]

Taking the derivative with respect to µ1 , for example, would give:

\[
\frac{\partial \ell_\theta(x)}{\partial \mu_1} = \sum_{i=1}^{n} \frac{\beta\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2)}{\beta\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2) + (1 - \beta)\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2)} \left(\frac{x_i - \mu_1}{\sigma_1^2}\right)
\]

This ratio of exponential and linear terms makes it difficult to solve for a maximum-likelihood
expression in closed form. Recall that there is no rule for splitting up log(a + b), which prevents
us from applying the log directly to the exponentials.
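
For completeness, a short sketch (same made-up parameters as above, plus a made-up sample) that evaluates ℓθ(x) numerically; the sum inside the log is exactly the term that blocks a closed-form maximizer:

```python
import numpy as np
from scipy.stats import norm

beta, mu1, sigma1, mu2, sigma2 = 0.3, -2.0, 1.0, 3.0, 1.5   # made-up parameters
x = np.array([-2.3, -1.7, 0.4, 2.9, 3.8])                   # made-up sample

def log_likelihood(x):
    # l_theta(x) = sum_i log[ beta N(x_i | mu1, sigma1^2) + (1 - beta) N(x_i | mu2, sigma2^2) ]
    per_point = beta * norm.pdf(x, mu1, sigma1) + (1 - beta) * norm.pdf(x, mu2, sigma2)
    return np.sum(np.log(per_point))

print(log_likelihood(x))
```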

(c) (Optional) You’d like to optimize the log-likelihood ℓθ(x). However, we just saw that it is hard to obtain a closed-form MLE. Show that a lower bound for the log-likelihood is
\[
\ell_\theta(x_i) \ge \mathbb{E}_q\!\left[\log \frac{p_\theta(X = x_i, Z_i = k)}{q_\theta(Z_i = k \mid X = x_i)}\right].
\]

Solution:
\[
\begin{aligned}
\ell_\theta(x_i) &= \log\!\left[\sum_k p_\theta(X = x_i, Z_i = k)\right] && \text{marginalizing over the possible Gaussian labels} \\
&= \log\!\left[\sum_k q_\theta(Z_i = k \mid X = x_i)\, \frac{p_\theta(X = x_i, Z_i = k)}{q_\theta(Z_i = k \mid X = x_i)}\right] && \text{introducing an arbitrary distribution } q \\
&= \log\!\left(\mathbb{E}_q\!\left[\frac{p_\theta(X = x_i, Z_i = k)}{q_\theta(Z_i = k \mid X = x_i)}\right]\right) && \text{rewriting as an expectation} \\
&\ge \mathbb{E}_q\!\left[\log \frac{p_\theta(X = x_i, Z_i = k)}{q_\theta(Z_i = k \mid X = x_i)}\right] && \text{using Jensen's inequality}
\end{aligned}
\]

where Jensen's inequality says φ(E[X]) ≤ E[φ(X)] for a convex function φ; equivalently, φ(E[X]) ≥ E[φ(X)] for a concave φ, and since log is concave this gives the inequality above.
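
A numerical sanity check of this bound (a sketch with the same made-up parameters as before; q is an arbitrary distribution over the two labels):

```python
import numpy as np
from scipy.stats import norm

beta, mu1, sigma1, mu2, sigma2 = 0.3, -2.0, 1.0, 3.0, 1.5   # made-up parameters
x_i = 0.5                                                    # made-up data point

# Joint likelihoods p_theta(X = x_i, Z_i = k) for k = 1, 2.
joint = np.array([beta * norm.pdf(x_i, mu1, sigma1),
                  (1 - beta) * norm.pdf(x_i, mu2, sigma2)])
log_lik = np.log(joint.sum())                                # l_theta(x_i)

q = np.array([0.7, 0.3])                                     # any distribution over {1, 2}
elbo = np.sum(q * np.log(joint / q))                         # E_q[log(p/q)]

# Jensen's inequality: the lower bound never exceeds the log-likelihood.
assert elbo <= log_lik + 1e-12
print(log_lik, elbo)
```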


(d) (Optional) The EM algorithm starts with two randomly placed Gaussians (µ1, σ1) and (µ2, σ2), i.e. a particular initial realization of θ.

• E-step: \(q^{t+1}_{i,k} = p_{\theta^t}(Z_i = k \mid X = x_i)\). For each data point, probabilistically determine which Gaussian generated it, either (µ1, σ1) or (µ2, σ2).

• M-step: \(\theta^{t+1} = \operatorname{argmax}_\theta \sum_{i=1}^{n} \mathbb{E}_q\!\left[\log p_\theta(X = x_i, Z_i = k)\right]\). After (soft-)labeling all data points in the E-step, adjust (µ1, σ1) and (µ2, σ2).

Why does alternating between the E-step and M-step result in maximizing the lower bound?
Solution: To show the M-step (so-called because we are maximizing with respect to the parameters) is maximizing the lower bound:
\[
\mathbb{E}_q\!\left[\log \frac{p_\theta(X = x_i, Z_i = k)}{q_\theta(Z_i = k \mid X = x_i)}\right]
= \mathbb{E}_q\!\left[\log p_\theta(X = x_i, Z_i = k)\right] - \mathbb{E}_q\!\left[\log q_\theta(Z_i = k \mid X = x_i)\right].
\]

The M-step is maximizing the first term.


To show the E-step is maximizing the bound, we can rewrite the lower bound as:
\[
\mathbb{E}_q\!\left[\log \frac{p_\theta(X = x_i)\, p_\theta(Z_i = k \mid X = x_i)}{q_\theta(Z_i = k \mid X = x_i)}\right]
= \mathbb{E}_q\!\left[\log p_\theta(X = x_i)\right] - \mathbb{E}_q\!\left[\log \frac{q_\theta(Z_i = k \mid X = x_i)}{p_\theta(Z_i = k \mid X = x_i)}\right].
\]

The first term does not depend on q, and the second term is a KL divergence, which is nonnegative. The bound is therefore maximized (and equals the log-likelihood) when the second term is 0, which occurs when qθ(Zi = k|X = xi) = pθ(Zi = k|X = xi).
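
Numerically, the gap between the log-likelihood and the lower bound is exactly this second term (a KL divergence), and it vanishes when q is the posterior; a small sketch with the same made-up parameters:

```python
import numpy as np
from scipy.stats import norm

beta, mu1, sigma1, mu2, sigma2 = 0.3, -2.0, 1.0, 3.0, 1.5   # made-up parameters
x_i = 0.5

joint = np.array([beta * norm.pdf(x_i, mu1, sigma1),
                  (1 - beta) * norm.pdf(x_i, mu2, sigma2)])  # p(X = x_i, Z_i = k)
log_lik = np.log(joint.sum())
posterior = joint / joint.sum()                              # p(Z_i = k | X = x_i)

def elbo(q):
    return np.sum(q * np.log(joint / q))                     # lower bound for a given q

def kl(q, p):
    return np.sum(q * np.log(q / p))                         # KL(q || p) >= 0

q = np.array([0.7, 0.3])                                     # an arbitrary q
print(log_lik, elbo(q) + kl(q, posterior))                   # equal (up to float error): log p(x) = bound + KL
print(log_lik, elbo(posterior))                              # equal: setting q to the posterior makes the bound tight
```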
(e) E-step: What are the expressions for probabilistically imputing the classes for all the data points, i.e. \(q^{t+1}_{i,1}\) and \(q^{t+1}_{i,2}\)?
Solution:
\[
q^{t+1}_{i,1} = P(Z = 1 \mid X = x_i; \theta^t) = \frac{P(x_i \mid Z = 1; \theta^t)\, P(Z = 1)}{P(x_i \mid Z = 1; \theta^t)\, P(Z = 1) + P(x_i \mid Z = 2; \theta^t)\, P(Z = 2)}
\]
\[
q^{t+1}_{i,2} = P(Z = 2 \mid X = x_i; \theta^t) = \frac{P(x_i \mid Z = 2; \theta^t)\, P(Z = 2)}{P(x_i \mid Z = 1; \theta^t)\, P(Z = 1) + P(x_i \mid Z = 2; \theta^t)\, P(Z = 2)}
\]
where \(P(x_i \mid Z = 1) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\!\left(-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}\right)\).
To be clear, you would have to compute nC such \(q_{i,k}\) values at each time step, where C is the number of classes. Here, C = 2.
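
A vectorized sketch of this E-step (theta here is just a tuple packing (β, µ1, σ1, µ2, σ2); the names and values are illustrative):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, theta):
    """Return q[i, k] = P(Z = k | X = x_i; theta); column 0 is k = 1, column 1 is k = 2."""
    beta, mu1, sigma1, mu2, sigma2 = theta
    # Joint likelihoods P(x_i | Z = k) P(Z = k) for each component.
    joint = np.column_stack([beta * norm.pdf(x, mu1, sigma1),
                             (1 - beta) * norm.pdf(x, mu2, sigma2)])
    # Normalize each row by the marginal P(x_i), as in the expressions above.
    return joint / joint.sum(axis=1, keepdims=True)

# Made-up data and parameters for illustration:
x = np.array([-2.3, -1.7, 0.4, 2.9, 3.8])
q = e_step(x, (0.3, -2.0, 1.0, 3.0, 1.5))
print(q)   # n x 2 responsibilities; each row sums to 1
```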
(f) What is the expression for \(\mu_1^{t+1}\) for the M-step?
Solution: From Homework 10, we know that
\[
\mu_1^{t+1} = \frac{\sum_{i=1}^{n} q^{t+1}_{i,1} x_i}{\sum_{i=1}^{n} q^{t+1}_{i,1}} = \frac{q^{t+1}_{1,1} x_1 + q^{t+1}_{2,1} x_2 + \cdots + q^{t+1}_{n,1} x_n}{q^{t+1}_{1,1} + q^{t+1}_{2,1} + \cdots + q^{t+1}_{n,1}}
\]
\[
\mu_2^{t+1} = \frac{\sum_{i=1}^{n} q^{t+1}_{i,2} x_i}{\sum_{i=1}^{n} q^{t+1}_{i,2}} = \frac{q^{t+1}_{1,2} x_1 + q^{t+1}_{2,2} x_2 + \cdots + q^{t+1}_{n,2} x_n}{q^{t+1}_{1,2} + q^{t+1}_{2,2} + \cdots + q^{t+1}_{n,2}}
\]
\[
(\sigma_1^2)^{t+1} = \frac{\sum_{i=1}^{n} q^{t+1}_{i,1} (x_i - \mu_1^{t+1})^2}{\sum_{i=1}^{n} q^{t+1}_{i,1}}
\]
\[
(\sigma_2^2)^{t+1} = \frac{\sum_{i=1}^{n} q^{t+1}_{i,2} (x_i - \mu_2^{t+1})^2}{\sum_{i=1}^{n} q^{t+1}_{i,2}}
\]

Figure 1: EM examples in 1D for two clusters (yellow and blue). The shadings of the data points (circles) indicate the respective estimated probabilities of coming from either the yellow or blue cluster.

We show how to obtain \(\mu_1^{t+1}\) as an example:

\[
\begin{aligned}
\sum_{i=1}^{n} \mathbb{E}_q\!\left[\log p_\theta(X = x_i, Z_i = k)\right]
&= \sum_{i=1}^{n} \left[ q^{t+1}_{i,1} \log\!\big(\beta\, \mathcal{N}(x_i \mid \mu_1, \sigma_1^2)\big) + q^{t+1}_{i,2} \log\!\big((1 - \beta)\, \mathcal{N}(x_i \mid \mu_2, \sigma_2^2)\big)\right] \\
&= \sum_{i=1}^{n} \left[ q^{t+1}_{i,1}\!\left(\log \beta - \frac{(x_i - \mu_1)^2}{2\sigma_1^2} - \log \sigma_1\right) + q^{t+1}_{i,2}\!\left(\log(1 - \beta) - \frac{(x_i - \mu_2)^2}{2\sigma_2^2} - \log \sigma_2\right)\right] + \text{const}
\end{aligned}
\]

Taking the derivative with respect to µ1 and setting it to 0 to obtain the maximum gives:

\[
\begin{aligned}
\sum_{i=1}^{n} q^{t+1}_{i,1} \left(\frac{x_i - \mu_1}{\sigma_1^2}\right) &= 0 \\
\sum_{i=1}^{n} q^{t+1}_{i,1} x_i - \sum_{i=1}^{n} q^{t+1}_{i,1} \mu_1 &= 0 \\
\mu_1 &= \frac{\sum_{i=1}^{n} q^{t+1}_{i,1} x_i}{\sum_{i=1}^{n} q^{t+1}_{i,1}}
\end{aligned}
\]
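
A matching sketch of these M-step updates. The β update shown (the average responsibility for component 1) is not derived in this discussion and is included only as an assumption so that the function returns a full θ:

```python
import numpy as np

def m_step(x, q):
    """Weighted-mean / weighted-variance updates given responsibilities q (shape n x 2)."""
    n1, n2 = q[:, 0].sum(), q[:, 1].sum()           # effective counts per component
    mu1 = np.sum(q[:, 0] * x) / n1                  # mu_1^{t+1}
    mu2 = np.sum(q[:, 1] * x) / n2                  # mu_2^{t+1}
    var1 = np.sum(q[:, 0] * (x - mu1) ** 2) / n1    # (sigma_1^2)^{t+1}
    var2 = np.sum(q[:, 1] * (x - mu2) ** 2) / n2    # (sigma_2^2)^{t+1}
    beta = n1 / len(x)                              # assumed mixing-weight update (not derived here)
    return beta, mu1, np.sqrt(var1), mu2, np.sqrt(var2)
```

Alternating the e_step sketch from part (e) with m_step until the responsibilities stop changing gives the EM fit described in part (d).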

(g) Compare and contrast k-means, soft k-means, and mixture of Gaussians fit with EM.
Solution: For k-means, we implicitly assume clusters are spherical, so it does not work well for data with complex geometric shapes. Additionally, it uses hard assignments, meaning the qi,1 probabilities are 0 or 1. This can be easier to interpret, but it does not incorporate information from all data points when updating each centroid. K-means will also usually have trouble with clusters that have large overlap (see Figure 2).

Figure 2: K-means for two clusters in 1D. 'x' marks indicate points coming from µ1, while 'o' marks indicate points coming from µ2. The colors blue and green indicate the predicted clustering. Black dots indicate the true means, while red indicates the predicted means.
For soft k-means and EM we have soft assignments. For soft k-means, the soft assignments and the resulting weighted means amount to

\[
r_{i,1} = \frac{\exp\!\left(-B \|x_i - \mu_1\|^2\right)}{\exp\!\left(-B \|x_i - \mu_1\|^2\right) + \exp\!\left(-B \|x_i - \mu_2\|^2\right)}, \qquad
r_{i,2} = \frac{\exp\!\left(-B \|x_i - \mu_2\|^2\right)}{\exp\!\left(-B \|x_i - \mu_1\|^2\right) + \exp\!\left(-B \|x_i - \mu_2\|^2\right)}
\]

\[
\mu_1^{t+1} = \frac{\sum_{i=1}^{n} r^{t+1}_{i,1} x_i}{\sum_{i=1}^{n} r^{t+1}_{i,1}}, \qquad
\mu_2^{t+1} = \frac{\sum_{i=1}^{n} r^{t+1}_{i,2} x_i}{\sum_{i=1}^{n} r^{t+1}_{i,2}}
\]

where B is a stiffness parameter, which can be interpreted as an inverse variance. In cases where the clusters have different geometry, one might resort to EM. Note that EM is related to LDA/QDA: the setup is similar in that we compute the probability of each point coming from cluster k, but LDA/QDA performs hard classification, while EM performs soft classification.
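
For comparison, a sketch of a single soft k-means round in 1D with stiffness B (made-up data and initial centroids):

```python
import numpy as np

def soft_kmeans_step(x, mu1, mu2, B):
    """One round: soft assignments r_{i,k}, then weighted-mean centroid updates."""
    d1 = np.exp(-B * (x - mu1) ** 2)
    d2 = np.exp(-B * (x - mu2) ** 2)
    r1 = d1 / (d1 + d2)                         # r_{i,1}
    r2 = 1.0 - r1                               # r_{i,2}
    mu1_new = np.sum(r1 * x) / np.sum(r1)       # weighted mean for cluster 1
    mu2_new = np.sum(r2 * x) / np.sum(r2)       # weighted mean for cluster 2
    return mu1_new, mu2_new

x = np.array([-2.3, -1.7, 0.4, 2.9, 3.8])       # made-up data
print(soft_kmeans_step(x, mu1=-1.0, mu2=1.0, B=2.0))
```

Unlike EM, there is no variance or mixing-weight update here, which is one reason soft k-means struggles when clusters have different spreads.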
