
Computer Vision Group
Prof. Daniel Cremers

Machine Learning for Computer Vision
PD Dr. Rudolph Triebel

6. Clustering
Motivation
• Supervised learning is good for interaction with humans, but labels from a supervisor are sometimes hard to obtain
• Clustering is unsupervised learning, i.e. it tries to learn only from the data
• Main idea: find a similarity measure and group similar data objects together
• Clustering is a very old research field; many approaches have been suggested
• Main problem in most methods: how to find a good number of clusters
Categories of Learning
Learning is divided into three categories:

• Unsupervised Learning: clustering, density estimation
• Supervised Learning: learning from a training data set, inference on the test data
• Reinforcement Learning: no supervision, but a reward function

In unsupervised learning, there is no ground truth information given.
Most unsupervised learning methods are based on clustering.

K-means Clustering
• Given: data set {x1 , . . . , xN }, number of clusters K
• Goal: find cluster centers {µ1 , . . . , µK } so that


$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2$$

is minimal, where $r_{nk} = 1$ if $x_n$ is assigned to $\mu_k$


• Idea: compute rnk and µk iteratively
• Start with some values for the cluster centers
• Find optimal assignments rnk
• Update cluster centers using these assignments
• Repeat until assignments or centers don’t change
K-means Clustering
Initialize cluster means: {µ1 , . . . , µK }

K-means Clustering
Find optimal assignments:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \| \\ 0 & \text{otherwise} \end{cases}$$

K-means Clustering
Find new optimal means:

$$\frac{\partial J}{\partial \mu_k} = 2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) \overset{!}{=} 0
\quad \Rightarrow \quad \mu_k = \frac{\sum_{n=1}^{N} r_{nk} \, x_n}{\sum_{n=1}^{N} r_{nk}}$$

K-means Clustering
Find new optimal assignments:
$$r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \| \\ 0 & \text{otherwise} \end{cases}$$

K-means Clustering
Iterate these steps until means and
assignments do not change any more
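A minimal NumPy sketch of this loop, assuming the data is given as an (N, D) matrix; the function name `kmeans` and the initialization by sampling K data points are illustrative choices, not part of the lecture:

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Illustrative K-means: X is an (N, D) data matrix; returns means (K, D) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    # Initialize the cluster means with K randomly chosen data points (one common heuristic)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    r = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assignment step: r_nk = 1 for the mean with the smallest distance ||x_n - mu_k||
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (N, K)
        r_new = dists.argmin(axis=1)
        # Update step: each mean becomes the average of the points assigned to it
        mu_new = np.array([X[r_new == k].mean(axis=0) if np.any(r_new == k) else mu[k]
                           for k in range(K)])
        if np.array_equal(r_new, r) and np.allclose(mu_new, mu):
            break                                                        # nothing changed any more
        r, mu = r_new, mu_new
    return mu, r
```

A call such as `mu, r = kmeans(X, K=2)` then yields the hard assignments $r_{nk}$ as integer labels `r`.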

2D Example

• Real data set
• Random initialization
• Magenta line is the “decision boundary”

The Cost Function

• After every step, the cost function J decreases (each step minimizes J with respect to either the assignments or the means)


• Blue steps: update assignments
• Red steps: update means
• Convergence after 4 rounds
K-means for Segmentation

K-Means: Additional Remarks
• K-means always converges, but the minimum found is not guaranteed to be a global one
• There is an online version of K-means
• After each addition of a data point xn, the nearest center μk is updated (see the sketch at the end of this slide):
$$\mu_k^{\text{new}} = \mu_k^{\text{old}} + \eta_n \, (x_n - \mu_k^{\text{old}})$$

• The K-medoid variant:
• Replace the Euclidean distance by a general measure $\mathcal{V}$:

$$\tilde{J} = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \mathcal{V}(x_n, \mu_k)$$
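A tiny sketch of the online update rule above (the helper name `online_kmeans_update` and the step size argument `eta_n` are illustrative):

```python
import numpy as np

def online_kmeans_update(mu, x_n, eta_n):
    """Sequential K-means: move the center nearest to the new point x_n towards it."""
    k = np.argmin(np.linalg.norm(mu - x_n, axis=1))   # index of the nearest center
    mu[k] += eta_n * (x_n - mu[k])                    # mu_k_new = mu_k_old + eta_n * (x_n - mu_k_old)
    return mu
```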

Mixtures of Gaussians
• Assume that the data consists of K clusters
• The data within each cluster is Gaussian
• For any data point x we introduce a K-dimensional
binary random variable z so that: 


$$p(x) = \sum_{k=1}^{K} \underbrace{p(z_k = 1)}_{=: \pi_k} \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where

$$z_k \in \{0, 1\}, \qquad \sum_{k=1}^{K} z_k = 1$$
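As a small illustration of this mixture density, a sketch with made-up parameters (the variable names and values below are purely illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D mixture with K = 2 components (made-up parameters)
pi = np.array([0.4, 0.6])                                   # mixing coefficients pi_k, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]          # means mu_k
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]    # covariances Sigma_k

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)"""
    return sum(pi[k] * multivariate_normal(mean=mus[k], cov=Sigmas[k]).pdf(x)
               for k in range(len(pi)))

print(gmm_density(np.array([1.0, 1.0])))                    # density of the mixture at (1, 1)
```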

A Simple Example

• Mixture of three Gaussians with mixing coefficients


• Left: all three Gaussians as contour plot
• Right: samples from the mixture model, the red
component has the most samples
Parameter Estimation
• From a given set of training data $\{x_1, \dots, x_N\}$ we want to find parameters $(\pi_{1,\dots,K}, \mu_{1,\dots,K}, \Sigma_{1,\dots,K})$ so that the likelihood is maximized (MLE):

$$p(x_1, \dots, x_N \mid \pi_{1,\dots,K}, \mu_{1,\dots,K}, \Sigma_{1,\dots,K}) = \prod_{n=1}^{N} \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$

or, applying the logarithm:

$$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$$
• However: this is not as easy as maximum-
likelihood for single Gaussians!
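One practical subtlety (not on the slide): the inner sum over components is best evaluated with the log-sum-exp trick to avoid numerical underflow. A minimal sketch, assuming the parameters are given as arrays:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """log p(X) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k), evaluated via log-sum-exp."""
    # log_terms[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_terms = np.stack([np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
                          for k in range(len(pi))], axis=1)
    return logsumexp(log_terms, axis=1).sum()
```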

Problems with MLE for Gaussian Mixtures
• Assume that for one k the mean µk is exactly at a
data point xn
• For simplicity: assume that $\Sigma_k = \sigma_k^2 I$
• Then:

$$\mathcal{N}(x_n \mid x_n, \sigma_k^2 I) = \frac{1}{(\sqrt{2\pi}\,\sigma_k)^D}$$
• This means that the overall log-likelihood can be made arbitrarily large by letting $\sigma_k \to 0$ (overfitting)
• Another problem is the identifiability:
• The order of the Gaussians is not fixed, therefore:
• There are K! equivalent solutions to the MLE problem

Overfitting with MLE for Gaussian Mixtures

• One Gaussian fits exactly to one data point


• It has a very small variance, i.e. contributes
strongly to the overall likelihood
• In standard MLE, there is no way to avoid this!

Expectation-Maximization
• EM is an elegant and powerful method for MLE
problems with latent variables
• Main idea: model parameters and latent variables are estimated iteratively, where we average over the latent variables (expectation)
• A typical example application of EM is the
Gaussian Mixture model (GMM)
• However, EM has many other applications
• First, we consider EM for GMMs

Expectation-Maximization for GMM
• First, we define the responsibilities:
$$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n), \qquad z_{nk} \in \{0, 1\}, \quad \sum_k z_{nk} = 1$$

Expectation-Maximization for GMM
• First, we define the responsibilities:
$$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

Expectation-Maximization for GMM
• First, we define the responsibilities:
$$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

• Next, we derive the log-likelihood wrt. $\mu_k$:

$$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} \overset{!}{=} 0$$

Expectation-Maximization for GMM
• First, we define the responsibilities:
$$\gamma(z_{nk}) = p(z_{nk} = 1 \mid x_n) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

• Next, we derive the log-likelihood wrt. $\mu_k$:

$$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} \overset{!}{=} 0$$

and we obtain:

$$\mu_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})}$$
Expectation-Maximization for GMM
• We can do the same for the covariances:

$$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \Sigma_k} \overset{!}{=} 0$$

and we obtain:

$$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$$

• Finally, we derive wrt. the mixing coefficients $\pi_k$:

$$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \pi_k} \overset{!}{=} 0 \quad \text{where} \quad \sum_{k=1}^{K} \pi_k = 1$$

Expectation-Maximization for GMM
• We can do the same for the covariances:

$$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \Sigma_k} \overset{!}{=} 0$$

and we obtain:

$$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma(z_{nk})}$$

• Finally, we derive wrt. the mixing coefficients $\pi_k$:

$$\frac{\partial \log p(X \mid \pi, \mu, \Sigma)}{\partial \pi_k} \overset{!}{=} 0 \quad \text{where} \quad \sum_{k=1}^{K} \pi_k = 1$$

and the result is:

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})$$
Algorithm Summary
1. Initialize means $\mu_k$, covariance matrices $\Sigma_k$, and mixing coefficients $\pi_k$
2. Compute the initial log-likelihood $\log p(X \mid \pi, \mu, \Sigma)$
3. E-Step. Compute the responsibilities:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

4. M-Step. Update the parameters:

$$\mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})} \qquad
\Sigma_k^{\text{new}} = \frac{\sum_{n=1}^{N} \gamma(z_{nk}) \, (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^T}{\sum_{n=1}^{N} \gamma(z_{nk})} \qquad
\pi_k^{\text{new}} = \frac{1}{N} \sum_{n=1}^{N} \gamma(z_{nk})$$

5. Compute the log-likelihood; if not converged, go to 3.
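A compact sketch of this algorithm (the function name `em_gmm`, the simple random initialization, and the tolerance `tol` are illustrative assumptions, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, tol=1e-6, seed=0):
    """Illustrative EM for a Gaussian mixture model; X is an (N, D) data matrix."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialization: random data points as means, identity covariances, uniform weights
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    ll_old = -np.inf
    for _ in range(max_iter):
        # 3. E-step: responsibilities gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j ...
        dens = np.stack([pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                         for k in range(K)], axis=1)            # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 4. M-step: re-estimate means, covariances, and mixing coefficients
        Nk = gamma.sum(axis=0)                                   # effective number of points per component
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
        # 2./5. Log-likelihood of the current responsibilities; stop when it no longer improves
        # (a robust implementation would use the log-sum-exp trick here)
        ll = np.log(dens.sum(axis=1)).sum()
        if np.abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return pi, mu, Sigma, gamma
```

In practice, and as the later Observations slide recommends, `mu` would usually be initialized from a K-means result instead of random data points.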


The Same Example Again

Observations
• Compared to K-means, points can now belong to
both clusters (soft assignment)
• In addition to the cluster center, a covariance is
estimated by EM
• Initialization is the same as used for K-means
• Number of iterations needed for EM is much higher
• Also: each cycle requires much more computation
• Therefore: start with K-means and run EM on the
result of K-means (covariances can be initialized to
the sample covariances of K-means)
• EM only finds a local maximum of the likelihood!
Spectral Clustering
• Consider an undirected graph that connects all
data points
• The edge weights are the similarities (“closeness”)
• We define the weighted degree $d_i$ of a node as the sum of the weights of all outgoing edges:

$$d_i = \sum_{j=1}^{N} w_{ij}$$

• $W = (w_{ij})$ is the matrix of edge weights, and $D = \mathrm{diag}(d_1, \dots, d_N)$ is the diagonal degree matrix
Spectral Clustering
• The Graph Laplacian is defined as:
$$L = D - W$$
• This matrix has the following properties:
• the 1 vector is an eigenvector with eigenvalue 0

Spectral Clustering
• The Graph Laplacian is defined as:
$$L = D - W$$
• This matrix has the following properties:
• the 1 vector is an eigenvector with eigenvalue 0
• the matrix is symmetric and positive semi-definite

Spectral Clustering
• The Graph Laplacian is defined as:
$$L = D - W$$
• This matrix has the following properties:
• the 1 vector is an eigenvector with eigenvalue 0
• the matrix is symmetric and positive semi-definite
• With these properties we can show:

Theorem: The eigenspace of L for the eigenvalue 0 is spanned by the indicator vectors $\mathbf{1}_{A_1}, \dots, \mathbf{1}_{A_K}$, where $A_k$ are the K connected components of the graph.

The Algorithm
• Input: Similarity matrix W
• Compute L = D - W
• Compute the eigenvectors that correspond to the
K smallest eigenvalues
• Stack these vectors as columns in a matrix U
• Treat each row of U as a K-dim data point
• Cluster the N rows with K-means clustering
• The row indices correspond to the original data points, so each data point receives the cluster label of its row
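A minimal sketch of these steps, assuming a symmetric similarity matrix `W` is already given (the function name and the use of SciPy's `kmeans2` for the final clustering step are illustrative choices):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, K):
    """Unnormalized spectral clustering on a symmetric (N, N) similarity matrix W."""
    d = W.sum(axis=1)                        # weighted degrees d_i
    L = np.diag(d) - W                       # graph Laplacian L = D - W
    # Eigenvectors for the K smallest eigenvalues (eigh returns eigenvalues in ascending order)
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :K]                       # stack them as columns: U is (N, K)
    # Treat each row of U as a K-dimensional point and cluster the N rows with K-means
    _, labels = kmeans2(U, K, minit='++')
    return labels                            # labels[n] is the cluster index of data point x_n
```

Usage would then be, e.g., `labels = spectral_clustering(W, K=2)`.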

An Example
[Figure: the same 2-D data set (x vs. y) clustered by k-means (left panel, “k-means clustering”) and by spectral clustering (right panel, “spectral clustering”)]

• Spectral clustering can handle complex problems such as this one
• The complexity of the algorithm is $O(N^3)$, because it has to solve an eigenvector problem
• But there are efficient variants of the algorithm
Further Remarks
• To account for nodes that are highly connected, we can use a normalized version of the graph Laplacian
• Two different methods exist:

$$L_{\text{rw}} = D^{-1} L = I - D^{-1} W$$
$$L_{\text{sym}} = D^{-\frac{1}{2}} L \, D^{-\frac{1}{2}} = I - D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$$

• These have eigenspaces similar to those of the original Laplacian L
• Clustering results tend to be better than with the unnormalized Laplacian
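A small sketch of how these two normalized Laplacians could be computed (assuming every node has strictly positive degree; the helper name is illustrative):

```python
import numpy as np

def normalized_laplacians(W):
    """Return L_rw = I - D^{-1} W and L_sym = I - D^{-1/2} W D^{-1/2} for a similarity matrix W."""
    d = W.sum(axis=1)                        # weighted degrees (assumed strictly positive)
    N = W.shape[0]
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_rw = np.eye(N) - D_inv @ W
    L_sym = np.eye(N) - D_inv_sqrt @ W @ D_inv_sqrt
    return L_rw, L_sym
```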

Summary
• Several Clustering methods exist:
• K-means clustering and Expectation-Maximization, both of which can be related to Gaussian Mixture Models
• K-means uses hard assignments, whereas EM uses soft assignments and also estimates the covariances
• Spectral clustering uses the graph Laplacian and
performs an eigenvector analysis
• Major Problem:
• most clustering algorithms require the number of
clusters to be given

