Dsci303-19 GM - em


Gaussian Mixture Models
&
Expectation-Maximization
Clustering Methods
• Clustering
• Hard clustering: clusters do not overlap
• Assign each example to one cluster (e.g. K-means)
• Soft clustering: clusters do overlap
• Assign data to cluster with some probability (EM algorithm)
• Strength of association between clusters and instances

• Mixture Models
• Probability-based soft clustering
• Each cluster is a generative model (Gaussian or multinomial)
• Estimate the parameters (mean and covariance are unknown)
Gaussian Mixture Model
• Gaussian Mixture Models (GMMs) assume that there are a certain number of Gaussian distributions, and each of these distributions represents a cluster.

• A Gaussian Mixture Model tends to group together the data points belonging to a single distribution.
Introduction to GMM

• Gaussian: "Gaussian is a characteristic symmetric 'bell curve' shape that quickly falls off towards 0 (practically)."

• Mixture Model: "A mixture model is a probabilistic model which assumes the underlying data belong to a mixture distribution."
Normal (Gaussian) Distribution

• µ is the mean
• σ² is the variance
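For reference, the univariate Gaussian density is:

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$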
Mixture Models
• Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution, π.
Gaussian Mixture Models
• GMM: the weighted sum of a number of Gaussians, where the weights are determined by a distribution, π.
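Written out, a GMM with K components has the standard form:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0 $$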
[Figure: two Gaussian components (Component 1, Component 2) and the resulting mixture model, plotted as p(x) for x in [-5, 10].]
[Figure: a second example of two Gaussian components and the resulting mixture model, p(x) for x in [-5, 10].]
[Figure: component models (taller, weighted component densities) and the resulting mixture model, p(x) for x in [-5, 10].]
GMM example
Quiz:
The formula for the Gaussian density is:

Which of the following is the formula for the density of this figure?

Respond at www.pollev.com/akanesano925
Gaussian Mixture Models
• Rather than identifying clusters by "nearest" centroids
• Fit a set of k Gaussians to the data
• Maximum likelihood over a mixture model

1-D data: the points come from 2 Gaussian models.
Which source does each data point come from?

X1 X2 X3 … X10
EM algorithm
Find the best-fit models to maximize the likelihood of the data.

1) Start with two random Gaussians: N(mean(a), sigma(a)), N(mean(b), sigma(b))

2) Compute P(b | xi): does this point look like it comes from b? (E-step)

$$ P(x_i \mid b) = \frac{1}{\sqrt{2\pi\sigma_b^2}} \exp\!\left(-\frac{(x_i - \mu_b)^2}{2\sigma_b^2}\right) $$

3) Adjust N(mean(a), sigma(a)) and N(mean(b), sigma(b)) to fit the points (M-step): find the mean, sigma, and pi parameters that maximize the likelihood (set the derivative of the likelihood with respect to mean, sigma, and pi to 0).

4) Iterate until it converges (the change in likelihood is small).

[Figure: data points X1, X2, X3, …, X10 on a line, each to be attributed to Gaussian a or Gaussian b.]
EM: 1-d example
We want to discover two Gaussians, a and b.

Step 1: Define random Gaussians and compute responsibilities:

$$ p(x_i \mid b) = \frac{1}{\sqrt{2\pi\sigma_b^2}} \exp\!\left(-\frac{(x_i - \mu_b)^2}{2\sigma_b^2}\right) $$

$$ b_i = P(b \mid x_i) = \frac{P(x_i \mid b)\,P(b)}{P(x_i \mid b)\,P(b) + P(x_i \mid a)\,P(a)}, \qquad a_i = P(a \mid x_i) = 1 - b_i $$

Step 2: Update parameters:

$$ \mu_b = \frac{b_1 x_1 + b_2 x_2 + \dots + b_n x_n}{b_1 + b_2 + \dots + b_n}, \qquad \mu_a = \frac{a_1 x_1 + a_2 x_2 + \dots + a_n x_n}{a_1 + a_2 + \dots + a_n} $$

Step 3: Iterate the process.
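A minimal sketch of this two-Gaussian, 1-D EM loop in Python (illustrative only: the synthetic data and variable names are not from the slides, and the variances and mixing weights are updated with the standard responsibility-weighted formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussians (illustrative, not the slides' data)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 300)])

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Step 1: random initialization
mu_a, mu_b = rng.choice(x, 2)
sigma_a = sigma_b = x.std()
pi_a = pi_b = 0.5

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities b_i = P(b | x_i), a_i = 1 - b_i
    pa = pi_a * gauss_pdf(x, mu_a, sigma_a)
    pb = pi_b * gauss_pdf(x, mu_b, sigma_b)
    ll = np.log(pa + pb).sum()          # log-likelihood under current parameters
    b = pb / (pa + pb)
    a = 1.0 - b

    # M-step: responsibility-weighted means, standard deviations, and weights
    mu_a, mu_b = (a * x).sum() / a.sum(), (b * x).sum() / b.sum()
    sigma_a = np.sqrt((a * (x - mu_a) ** 2).sum() / a.sum())
    sigma_b = np.sqrt((b * (x - mu_b) ** 2).sum() / b.sum())
    pi_a, pi_b = a.mean(), b.mean()

    # Stop when the change in log-likelihood is small
    if ll - prev_ll < 1e-6:
        break
    prev_ll = ll

print(f"a: mu={mu_a:.2f}, sigma={sigma_a:.2f}, pi={pi_a:.2f}")
print(f"b: mu={mu_b:.2f}, sigma={sigma_b:.2f}, pi={pi_b:.2f}")
```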
Normal (Gaussian) Distribution (d-dim)
• Multivariate Gaussian
• x is now a vector
• µ is the mean vector
• Σ is the covariance matrix (d × d)
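For reference, the d-dimensional Gaussian density is:

$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) $$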
Quiz: Consider the following multivariate Gaussian.

Respond at www.pollev.com/akanesano925
Gaussian mixture models: d > 1
• Data with d attributes from k sources
• Each source c is Gaussian
• Iteratively estimate parameters to maximize likelihood
  • Prior: what % of instances come from source c?
  • Mean: expected value of attribute j from c
  • Covariance: how correlated are attributes j and k in source c?
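Written out, these are the standard EM updates for a GMM; the responsibility notation r_ic (the responsibility of source c for instance i) is introduced here for illustration and is not from the slides.

E-step (responsibilities):

$$ r_{ic} = \frac{\pi_c \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_c, \Sigma_c)}{\sum_{c'} \pi_{c'} \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_{c'}, \Sigma_{c'})} $$

M-step (prior, mean, and covariance of each source c):

$$ \pi_c = \frac{1}{n} \sum_{i=1}^{n} r_{ic}, \qquad \boldsymbol{\mu}_c = \frac{\sum_i r_{ic}\,\mathbf{x}_i}{\sum_i r_{ic}}, \qquad \Sigma_c = \frac{\sum_i r_{ic}\,(\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\top}}{\sum_i r_{ic}} $$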
Expectation Maximization
• The training of GMMs can be accomplished
using Expectation Maximization
– Step 1: Expectation (E-step)
• Evaluate the “responsibilities” of each cluster with the
current parameters => compute likelihood
– Step 2: Maximization (M-step)
• Re-estimate parameters using the existing
“responsibilities” => maximize likelihood
by optimizing mean, variance and weights
• Iterate step 1 & 2 until likelihood converges
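In practice this loop is available in standard libraries. A brief sketch using scikit-learn's GaussianMixture; the synthetic 2-D data and the parameter choices (n_components=2, n_init=5) are illustrative rather than taken from the lecture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Illustrative 2-D data drawn from two clusters
X = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
               rng.normal([4, 4], 0.5, size=(200, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, random_state=0)
gmm.fit(X)                       # runs the E/M iterations until convergence

resp = gmm.predict_proba(X)      # soft assignments ("responsibilities")
labels = gmm.predict(X)          # hard assignment to the most likely component
print(gmm.weights_)              # mixing weights (priors)
print(gmm.means_)                # component means
print(gmm.covariances_.shape)    # (2, 2, 2): one covariance matrix per component
```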
Quiz: Choose a reasonable criterion for stopping the EM algorithm

A) Parameters stabilized
B) Log-likelihood reached the predefined constant value
C) The prior probability weights in GMM should be non negative and sum up to one
D) Log-likelihood stabilized

Respond at www.pollev.com/akanesano925
Quiz: EM algorithm
Which of the following is (are) TRUE for the EM algorithm?

1) EM is prone to converging to a local optimum, so running the EM algorithm multiple times with different random initializations is good

2) In contrast to k-means clustering, the EM algorithm always converges to a global optimum, so we don't need to run the algorithm multiple times

3) Log-likelihood is monotonically increasing with # of iterations

4) Log-likelihood is monotonically decreasing with # of iterations

Respond at www.pollev.com/akanesano925
How to pick K?
• Probabilistic model
• Try to fit the model (maximize the likelihood)

• Select the best K that maximizes likelihood?
  • When K = n (each point is its own model), L is maximal

• Split the data into training and validation sets (T and V)
  • For each K, fit parameters on T and measure the likelihood of V

• Pick the simplest model that fits ("Occam's razor")
  • BIC (Bayesian Information Criterion)
  • AIC (Akaike Information Criterion)
  • L: likelihood (how well our model fits the data)
  • p: number of parameters (how simple the model is)
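Both criteria trade off fit against complexity; with n data points, the standard definitions (lower is better) are:

$$ \mathrm{AIC} = -2 \ln L + 2p, \qquad \mathrm{BIC} = -2 \ln L + p \ln n $$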
Quiz: What happens if we use too few k or too many k?

A) Too small k: underfitting, too large k: overfitting

B) Too small k: overfitting, too large k: underfitting

C) K does not affect underfitting/overfitting, but it does affect computational cost

Respond at www.pollev.com/akanesano925

Maximum Likelihood over a GMM
• As usual: identify a likelihood function

• And set the partial derivatives to zero to optimize the three parameters
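For reference, the likelihood function in question is the standard GMM log-likelihood:

$$ \ln p(X \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \Sigma) = \sum_{i=1}^{n} \ln\!\left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \Sigma_k) \right) $$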
MLE of a GMM
How can we compute these? (optional)

• To find the optimal µ: preparation, then set the derivative with respect to µ to zero.
• To get the optimal σ: preparation, then set the derivative with respect to Σ to zero.
• To optimize π: we have a constraint on π (the weights must sum to one); therefore, using (*4), solve for the optimal weights.

MLE of a GMM
Visual example of EM
Potential Problems
• Incorrect number of Mixture Components

• Singularities
Incorrect Number of Gaussians
Incorrect Number of Gaussians
Singularities
• A minority of the data can have a
disproportionate effect on the model
likelihood.
• For example…
GMM example
Singularities
• When a mixture component collapses on a
given point, the mean becomes the point, and
the variance goes to zero.
• Consider the likelihood function as the
covariance goes to zero.

• The likelihood approaches infinity.
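Concretely (a standard illustration of the collapse, not from the slide): if a component with spherical covariance σ²I has its mean sitting exactly on a data point x_n, that point's density under the component is

$$ \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu} = \mathbf{x}_n, \sigma^2 I) = \frac{1}{(2\pi)^{d/2}\,\sigma^{d}} \;\longrightarrow\; \infty \quad \text{as } \sigma \to 0 $$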


Relationship to K-means
• K-means makes hard decisions.
– Each data point gets assigned to a single cluster.
• GMM/EM makes soft decisions.
– Each data point can yield a posterior p(z|x)
• Soft K-means is a special case of EM.
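One standard way to make this precise (not spelled out on the slide): fix all components to a shared spherical covariance σ²I and equal mixing weights; the responsibilities then reduce to

$$ r_{ic} = \frac{\exp\!\left(-\lVert \mathbf{x}_i - \boldsymbol{\mu}_c \rVert^2 / 2\sigma^2\right)}{\sum_{c'} \exp\!\left(-\lVert \mathbf{x}_i - \boldsymbol{\mu}_{c'} \rVert^2 / 2\sigma^2\right)} $$

and as σ² → 0 each point's responsibility concentrates on its nearest centroid, recovering the hard assignments of K-means.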
[Figure: anemia patients and controls; Red Blood Cell Volume (x axis, 3.3 to 4.0) vs. Red Blood Cell Hemoglobin Concentration (y axis, 3.7 to 4.4).]
[Figures: the GMM fit to the anemia data after EM iterations 1, 3, 5, 10, 15, and 25 (same axes).]
[Figure: log-likelihood as a function of EM iterations, increasing from roughly 400 to 490 over 25 iterations.]
[Figure: anemia data with true labels, separating the Control Group from the Anemia Group.]
Real-world stress, mood and health prediction

[Figure: app screenshot showing predicted wellbeing for tomorrow and a history view.]

Wellbeing / psychiatric condition prediction using multimodal data: location, physiology, behavioral surveys, smartphone logs, and weather.

SNAPSHOT study: ~200 people, 30 days each
Mobility

• Phone also logs GPS coordinates


• Downsample to 1 location per 5 minutes
• Interpolate up to 3 missing segments

• Features:
• Distance traveled
• Radius of minimum circle enclosing all locations in a day [3]
• Time spent indoors vs. outdoors (based on Wifi or Cellular)
• Time spent on campus (using coordinates of campus)
• Regularity Index
Mobility

• Used Gaussian Mixture Models (GMMs) to learn a probability distribution over each participant's typical locations, using up to K Gaussians
• Used the Bayesian Information Criterion (BIC) to select the best model

[Figure: the GMM for one participant; points are locations, contours are probability.]
Mobility

• GMMs were used to compute:
  o Log-likelihood of each day: how typical is this day?
  o Akaike Information Criterion (AIC) of a day
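As a sketch of how such per-day scores could be computed with scikit-learn (the helper names, the (lat, lon) layout, and the random coordinates are illustrative assumptions, not the study's actual pipeline):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_location_model(all_locations, max_k=10):
    """Fit GMMs with 1..max_k components and keep the one with the lowest BIC."""
    candidates = [GaussianMixture(n_components=k, random_state=0).fit(all_locations)
                  for k in range(1, max_k + 1)]
    return min(candidates, key=lambda m: m.bic(all_locations))

def day_typicality(gmm, day_locations):
    """Total log-likelihood of one day's (lat, lon) samples under the fitted GMM."""
    return gmm.score_samples(day_locations).sum()

# Illustrative usage with random coordinates (not real study data)
rng = np.random.default_rng(0)
all_locations = rng.normal([42.36, -71.09], [0.01, 0.01], size=(500, 2))
gmm = fit_location_model(all_locations)
print(day_typicality(gmm, all_locations[:50]))
```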
Quiz: Of the clustering algorithms covered in class, Gaussian Mixture Models used for clustering always outperform k-means clustering

A. True
B. False

Respond at www.pollev.com/akanesano925
GM/EM: Summary
• Maximize the likelihood of the data using the EM algorithm

• Similar to k-means
  • Sensitive to starting points; converges to a local optimum
  • Convergence: when the change in P(x1, x2, …) is sufficiently small
  • Cannot discover K (the likelihood increases as K increases)

• Different from k-means
  • Soft clustering: an instance can come from multiple clusters
Announcements
1) Final Project Progress meeting

• Sign up at https://dsci303-finalproject-update.youcanbook.me/
• Sign-up will close on Nov 5, this Thursday, at noon.
• All members of the team must attend the meeting.

2) Homework 4

• Individual assignment
• Not allowed to work with anyone else
• Allowed to ask questions on Piazza

• HW4_5: oral problem
  • Nov 20, Dec 3, or Dec 4
  • ~5 mins
  • Sign up at https://akanesano.youcanbook.me/
  • You may see Nov 13 listed, but it is not an option (there are some issues with youcanbook.me)
  • Sign-up will close on Nov 13.