
Introduction to Mixture Models

12/04/2023

Prerequisites
This document assumes basic familiarity with probability theory.

Overview
We often make simplifying modeling assumptions when analyzing a data set, such as assuming that each observation comes from one specific distribution (say, a Gaussian distribution). We then proceed to estimate parameters
of this distribution, like the mean and variance, using maximum likelihood estimation.
However, in many cases, assuming that each sample comes from the same unimodal distribution is too restrictive
and may not make intuitive sense. Often the data we are trying to model are more complex. For example,
they might be multimodal, containing multiple regions with high probability mass. In this note, we
describe mixture models, which provide a principled approach to modeling such complex data.

Example 1
Suppose we are interested in simulating the price of a randomly chosen book. Since paperback books are
typically cheaper than hardbacks, it might make sense to model the price of paperback books separately from
hardback books. In this example, we will model the price of a book as a mixture model. We will have two
mixture components in our model – one for paperback books, and one for hardbacks.
Let’s say that if we choose a book at random, there is a 50% chance of choosing a paperback and a 50% chance of
choosing a hardback. These proportions are called mixture proportions. Assume the price of a paperback
book is normally distributed with mean $9 and standard deviation $1, and the price of a hardback is normally
distributed with mean $20 and standard deviation $2. We could simulate book prices Pi in the following
way:
1. Sample Zi ∼ Bernoulli(0.5).

2. If Zi = 0, draw Pi from the paperback distribution N(9, 1). If Zi = 1, draw Pi from the hardback
distribution N(20, 2).
We implement this simulation in the code below:
NUM.SAMPLES <- 5000
prices <- numeric(NUM.SAMPLES)
for (i in seq_len(NUM.SAMPLES)) {
  # Step 1: sample the component label (0 = paperback, 1 = hardback)
  z.i <- rbinom(1, 1, 0.5)
  # Step 2: sample the price from the chosen component
  if (z.i == 0) {
    prices[i] <- rnorm(1, mean = 9, sd = 1)
  } else {
    prices[i] <- rnorm(1, mean = 20, sd = 2)
  }
}
hist(prices)

[Figure: Histogram of prices (x-axis: prices, y-axis: Frequency). The histogram shows two distinct modes.]

We see that our histogram is bimodal. Indeed, even though the mixture components are each normal
distributions, the distribution of a randomly chosen book is not. We illustrate the true densities below:
[Figure: True densities of the paperback, hardback, and all-books distributions (x-axis: Price ($), y-axis: Density).]

We see that the resulting probability density for all books is bimodal, and is therefore not normally distributed. In
this example, we modeled the price of a book as a mixture of two components where each component was
modeled as a Gaussian distribution. This is called a Gaussian mixture model (GMM).
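To connect the picture to the model, note that the density of the mixture is simply the proportion-weighted sum of the component densities. Below is a minimal sketch of how curves like those above could be drawn in R (the variable names here are our own):
x <- seq(5, 27, length.out = 500)
d.paperback <- dnorm(x, mean = 9, sd = 1)      # paperback component density
d.hardback <- dnorm(x, mean = 20, sd = 2)      # hardback component density
d.all <- 0.5 * d.paperback + 0.5 * d.hardback  # mixture density
plot(x, d.all, type = "l", lwd = 2, xlab = "Price ($)", ylab = "Density")
lines(x, d.paperback, lty = 2)
lines(x, d.hardback, lty = 3)
legend("topright", c("all books", "paperback", "hardback"), lty = c(1, 2, 3))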

Example 2
Now assume our data are life expectancies at birth (in years) for the Chinese population. Assume the life expectancy of a randomly
chosen male is normally distributed with mean 74.7 years and standard deviation 5 years, and the
life expectancy of a randomly chosen female is N(80.5, 5). However, instead of 50/50 mixture proportions, assume
that 49% of the population is female and 51% is male. We simulate life expectancies in a similar fashion as above,
with the corresponding changes to the parameters:
NUM.SAMPLES <- 5000
prop <- 0.51  # mixture proportion for males
expectancy <- numeric(NUM.SAMPLES)
for (i in seq_len(NUM.SAMPLES)) {
  # Sample the component label (1 = male, 0 = female)
  z.i <- rbinom(1, 1, prop)
  if (z.i == 0) {
    expectancy[i] <- rnorm(1, mean = 80.5, sd = 5)  # female component
  } else {
    expectancy[i] <- rnorm(1, mean = 74.7, sd = 5)  # male component
  }
}
mean(expectancy)

## [1] 77.47113
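This is close to the theoretical mean of the mixture, which is the proportion-weighted average of the component means: E[X] = 0.51 × 74.7 + 0.49 × 80.5 ≈ 77.54 years.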
hist(expectancy, main = "Life expectancy at birth by WHO 2019", breaks = 20)

[Figure: Histogram of expectancy, titled "Life expectancy at birth by WHO 2019" (x-axis: expectancy, y-axis: Frequency). The histogram has a single mode.]

Now we see that the histogram is unimodal. Are life expectancies normally distributed under this model? We plot the
corresponding densities below:

[Figure: Densities of the male, female, and overall population distributions (x-axis: Expectancy (years), y-axis: Density).]
Here we see that the Gaussian mixture model is unimodal because there is so much overlap between the two
densities.
The two illustrative examples above give rise to the general notion of a mixture model, which assumes
each observation is generated from one of K mixture components. We formalize this notion in the next
section.
Before moving on, we make one small pedagogical note that sometimes confuses students new to mixture
models. You might recall that if X and Y are independent normal random variables, then Z = X + Y is
also a normally distributed random variable. From this, you might wonder why the mixture models above
aren’t normal. The reason is that X + Y is not a mixture of normals; it is a linear combination
of normals. A random variable sampled from a simple Gaussian mixture model can be thought of as the result of a two-stage
process. First, we randomly sample a component (e.g. male or female), then we sample our observation
from the normal distribution corresponding to that component. This is clearly different from sampling X
and Y from different normal distributions and then adding them together.
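The difference is easy to see in simulation. The sketch below (our own illustration, reusing the parameters from Example 2) draws from the mixture by first picking a component and then sampling from it, and contrasts this with adding two independent normals:
N <- 5000
z <- rbinom(N, 1, 0.51)  # component labels: 1 = male, 0 = female
# Mixture: one draw per observation, from the chosen component
mixture <- ifelse(z == 1, rnorm(N, 74.7, 5), rnorm(N, 80.5, 5))
# Sum: two independent normal draws added together (not a mixture)
sum.xy <- rnorm(N, 74.7, 5) + rnorm(N, 80.5, 5)
mean(mixture)  # roughly 77.5, the weighted average of the component means
mean(sum.xy)   # roughly 155.2, the sum of the means; sum.xy is normal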

Definition
Assume we observe X1, . . . , Xn and that each Xi is sampled from one of K mixture components. In the
second example above, the mixture components were {male, female}. Associated with each random variable
Xi is a label Zi ∈ {1, . . . , K} which indicates which component Xi came from. In our life expectancy example, Zi
would be either 1 or 2 depending on whether Xi was a male or female life expectancy. Often we don’t observe
Zi (e.g. we might just obtain a list of life expectancies with no sex information), so the Zi ’s are sometimes called
latent variables.
From the law of total probability, we know that the marginal probability of Xi is:
$$P(X_i = x) = \sum_{k=1}^{K} P(X_i = x \mid Z_i = k)\,\underbrace{P(Z_i = k)}_{\pi_k} = \sum_{k=1}^{K} P(X_i = x \mid Z_i = k)\,\pi_k$$

Here, the πk are called mixture proportions or mixture weights and they represent the probability
that Xi belongs to the k-th mixture component. The mixture proportions are nonnegative and they sum
to one: $\sum_{k=1}^{K} \pi_k = 1$. We call P(Xi | Zi = k) the mixture component, and it represents the distribution
of Xi assuming it came from component k. The mixture components in our examples above were normal
distributions.
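For instance, in Example 1 the marginal density of a randomly chosen book’s price combines the two normal components with weights π1 = π2 = 0.5:

$$f(x) = 0.5\,\phi(x; 9, 1) + 0.5\,\phi(x; 20, 2),$$

where φ(x; µ, σ) denotes the normal density with mean µ and standard deviation σ.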
For discrete random variables these mixture components can be any probability mass function p(· | Zi = k), and
for continuous random variables they can be any probability density function f(· | Zi = k). The corresponding
pmf and pdf for the mixture model are therefore:

$$p(x) = \sum_{k=1}^{K} \pi_k \, p(x \mid Z_i = k)$$

$$f_X(x) = \sum_{k=1}^{K} \pi_k \, f_{X \mid Z_i}(x \mid Z_i = k)$$

If we observe independent samples X1, . . . , Xn from this mixture, with mixture proportion vector π = (π1, π2, . . . , πK), then the likelihood function is:

$$L(\pi) = \prod_{i=1}^{n} P(X_i \mid \pi) = \prod_{i=1}^{n} \sum_{k=1}^{K} P(X_i \mid Z_i = k)\,\pi_k$$
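As a concrete illustration, here is a minimal sketch (the function name and arguments are our own, not part of the original notes) of evaluating the log-likelihood of a Gaussian mixture in R:
# log L(pi) = sum_i log( sum_k pi_k * N(x_i; mu_k, sigma_k) )
gmm.loglik <- function(x, props, mu, sigma) {
  # n x K matrix of weighted component densities
  comp <- sapply(seq_along(props), function(k) props[k] * dnorm(x, mu[k], sigma[k]))
  sum(log(rowSums(comp)))
}
# Evaluate at the true parameters of the book-price example
gmm.loglik(prices, props = c(0.5, 0.5), mu = c(9, 20), sigma = c(1, 2))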

Now assume we are in the Gaussian mixture model setting where the k-th component is N(µk, σk) and the
mixture proportions are πk. A natural next question to ask is how to estimate the parameters {µk, σk, πk}
from our observations X1, . . . , Xn. We illustrate one approach in the introduction to EM vignette.
Acknowledgement: The “Examples” section above was taken from lecture notes written by Ramesh
Sridharan.
