
Overview of Clustering

Rong Jin
Outline
- K-means for clustering
- Expectation-Maximization (EM) algorithm for clustering
- Spectral clustering (if time permits)
Clustering
- Find the underlying structure of a given set of data points

(Figure: example data points plotted by $$$ vs. age)
Application (I): Search Result Clustering
Application (II): Navigation
Application (III): Google News
Application (IV): Visualization

Islands of music (Pampalk et al., KDD'03)
Application (V): Image Compression

http://www.ece.neu.edu/groups/rpl/kmeans/
How to Find Good Clustering?
- Minimize the sum of distances within clusters:

  $\arg\min_{C,m} \sum_{j=1}^{k} \sum_{i=1}^{n} m_{i,j} \, \| x_i - C_j \|^2$

  where

  $m_{i,j} = \begin{cases} 1 & x_i \text{ belongs to the } j\text{-th cluster} \\ 0 & \text{otherwise} \end{cases}$

  $\sum_{j=1}^{k} m_{i,j} = 1$  (each $x_i$ belongs to a single cluster)

(Figure: data points grouped into clusters C1-C6)
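As a concrete reading of this objective, here is a minimal NumPy sketch (function and variable names X, C, M are my own, not from the slides) that evaluates the k-means cost for a data matrix, a set of centers, and a 0/1 membership matrix:

import numpy as np

def kmeans_objective(X, C, M):
    # X: (n, d) data points; C: (k, d) centers;
    # M: (n, k) 0/1 membership matrix with exactly one 1 per row.
    sq_dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (n, k) squared distances
    return float((M * sq_dist).sum())                             # sum_j sum_i m_ij ||x_i - C_j||^2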
How to Efficiently Cluster Data?

  $\arg\min_{C,m} \sum_{j=1}^{k} \sum_{i=1}^{n} m_{i,j} \, \| x_i - C_j \|^2$

- Memberships $m_{i,j}$ and centers $C_j$ are correlated.
- Given centers $\{C_j\}$:

  $m_{i,j} = \begin{cases} 1 & j = \arg\min_{j'} \| x_i - C_{j'} \|^2 \\ 0 & \text{otherwise} \end{cases}$

- Given memberships $m_{i,j}$:

  $C_j = \frac{\sum_{i=1}^{n} m_{i,j} \, x_i}{\sum_{i=1}^{n} m_{i,j}}$
K-means for Clustering
- K-means
  - Start with a random guess of cluster centers
  - Determine the membership of each data point
  - Adjust the cluster centers
K-means
1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints)
4. Each Center finds the centroid of the points it owns

Computational Complexity: O(N) where N is the number of points?

Any Computational Problem?
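Putting the four steps above together, a minimal self-contained k-means loop might look as follows (an illustrative NumPy sketch; the initialization and stopping rule are simple choices of mine, not the slides'). Each iteration touches every point once, i.e. O(N k d) work:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]        # step 2: random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = sq.argmin(axis=1)                           # step 3: closest center
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])                # step 4: centroid of owned points
        if np.allclose(new_C, C):
            break
        C = new_C
    return C, labels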


Improve K-means
- Group points by region
  - KD tree
  - SR tree
- Key difference
  - Find the closest center for each rectangle
  - Assign all the points within a rectangle to one cluster
Improved K-means
- Find the closest center for each rectangle
- Assign all the points within a rectangle to one cluster
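A simplified sketch of the idea, assuming SciPy's cKDTree is available (names are mine): here a KD-tree is built over the centers and queried point by point, which already speeds up the assignment step; the rectangle-based variant on the slides goes further and assigns whole KD-tree cells of points to a single center at once.

from scipy.spatial import cKDTree

def assign_with_kdtree(X, C):
    # Build a KD-tree over the k centers and query each point's nearest center.
    # (The slide's method instead walks a KD-tree built over the points and assigns
    # an entire rectangle of points to one center when that center provably wins.)
    tree = cKDTree(C)
    _, labels = tree.query(X)     # index of the closest center for every point
    return labels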
A Gaussian Mixture Model for Clustering
- Assume that data are generated from a mixture of Gaussian distributions
- For each Gaussian distribution
  - Center: $\mu_i$
  - Variance: $\sigma_i^2$ (ignore)
- For each data point
  - Determine membership $z_{ij}$: $z_{ij} = 1$ if $x_i$ belongs to the j-th cluster

Learning a Gaussian Mixture
(with known covariance)
- Probability $p(x = x_i)$:

  $p(x = x_i) = \sum_j p(x = x_i, \mu = \mu_j) = \sum_j p(\mu = \mu_j) \, p(x = x_i \mid \mu = \mu_j)$

  $= \sum_j p(\mu = \mu_j) \, \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left( -\frac{\| x_i - \mu_j \|^2}{2\sigma^2} \right)$
Learning a Gaussian Mixture
(with known covariance)
- Log-likelihood of data:

  $\log \prod_i p(x = x_i) = \sum_i \log \left( \sum_j p(\mu = \mu_j) \, \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\!\left( -\frac{\| x_i - \mu_j \|^2}{2\sigma^2} \right) \right)$

- Apply MLE to find the optimal parameters $p(\mu = \mu_j)$ and $\mu_j$
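A hedged NumPy sketch of this log-likelihood for a spherical mixture with a shared, known variance sigma2 (array names are mine); the log-sum-exp trick is used so the sum over components does not underflow:

import numpy as np

def gmm_log_likelihood(X, mus, priors, sigma2):
    # X: (n, d) data; mus: (k, d) centers; priors: (k,) p(mu_j); sigma2: shared variance.
    d = X.shape[1]
    sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)          # (n, k)
    log_comp = (np.log(priors)[None, :]
                - 0.5 * d * np.log(2 * np.pi * sigma2)
                - sq / (2 * sigma2))                                   # log p(mu_j) p(x_i | mu_j)
    m = log_comp.max(axis=1, keepdims=True)                            # log-sum-exp over components
    log_px = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(log_px.sum())                                         # log prod_i p(x_i)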
Learning a Gaussian Mixture
(with known covariance)
- E-Step:

  $E[z_{ij}] = p(\mu = \mu_j \mid x = x_i) = \frac{p(x = x_i \mid \mu = \mu_j) \, p(\mu = \mu_j)}{\sum_{n=1}^{k} p(x = x_i \mid \mu = \mu_n) \, p(\mu = \mu_n)}$

  $= \frac{\exp\!\left( -\frac{1}{2\sigma^2} \| x_i - \mu_j \|^2 \right) p(\mu = \mu_j)}{\sum_{n=1}^{k} \exp\!\left( -\frac{1}{2\sigma^2} \| x_i - \mu_n \|^2 \right) p(\mu = \mu_n)}$
Learning a Gaussian Mixture
(with known covariance)
- M-Step:

  $\mu_j = \frac{\sum_{i=1}^{m} E[z_{ij}] \, x_i}{\sum_{i=1}^{m} E[z_{ij}]}$

  $p(\mu = \mu_j) = \frac{1}{m} \sum_{i=1}^{m} E[z_{ij}]$
Gaussian Mixture Example: Start
After First Iteration
After 2nd Iteration
After 3rd Iteration
After 4th Iteration
After 5th Iteration
After 6th Iteration
After 20th Iteration
Mixture Model for Doc Clustering
- A set of language models $\Theta = \{ \theta_1, \theta_2, \ldots, \theta_K \}$
  $\theta_i = \{ p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_V \mid \theta_i) \}$
Mixture Model for Doc Clustering
- A set of language models $\Theta = \{ \theta_1, \theta_2, \ldots, \theta_K \}$
  $\theta_i = \{ p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_V \mid \theta_i) \}$
- Probability $p(d = d_i)$:

  $p(d = d_i) = \sum_j p(d = d_i, \theta = \theta_j) = \sum_j p(\theta = \theta_j) \, p(d = d_i \mid \theta = \theta_j)$

  $= \sum_j p(\theta = \theta_j) \prod_{k=1}^{V} p(w_k \mid \theta_j)^{tf(w_k, d_i)}$
Mixture Model for Doc Clustering
- A set of language models $\Theta = \{ \theta_1, \theta_2, \ldots, \theta_K \}$
  $\theta_i = \{ p(w_1 \mid \theta_i), p(w_2 \mid \theta_i), \ldots, p(w_V \mid \theta_i) \}$
- Introduce hidden variable $z_{ij}$
  $z_{ij}$: document $d_i$ is generated by the j-th language model $\theta_j$
- Probability $p(d = d_i)$:

  $p(d = d_i) = \sum_j p(\theta = \theta_j) \prod_{k=1}^{V} p(w_k \mid \theta_j)^{tf(w_k, d_i)}$
Learning a Mixture Model
- E-Step:

  $E[z_{ij}] = p(\theta_j \mid d = d_i) = \frac{p(d_i \mid \theta_j) \, p(\theta_j)}{\sum_{n=1}^{K} p(d_i \mid \theta_n) \, p(\theta_n)}$

  $= \frac{\left( \prod_{m=1}^{V} p(w_m \mid \theta_j)^{tf(w_m, d_i)} \right) p(\theta = \theta_j)}{\sum_{n=1}^{K} \left( \prod_{m=1}^{V} p(w_m \mid \theta_n)^{tf(w_m, d_i)} \right) p(\theta = \theta_n)}$

- K: number of language models
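The same log-space trick applies to this E-step; an illustrative NumPy sketch that computes responsibilities for a whole term-frequency matrix at once (names are mine):

import numpy as np

def doc_e_step(TF, log_word_probs, log_priors):
    # TF: (N, V) term-frequency matrix; returns (N, K) responsibilities E[z_ij].
    scores = TF @ log_word_probs.T + log_priors[None, :]   # log of the numerator above
    scores -= scores.max(axis=1, keepdims=True)            # stabilize before exponentiating
    resp = np.exp(scores)
    return resp / resp.sum(axis=1, keepdims=True)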


Learning a Mixture Model
- M-Step:

  $p(w_i \mid \theta_j) = \frac{\sum_{k=1}^{N} E[z_{kj}] \, tf(w_i, d_k)}{\sum_{k=1}^{N} E[z_{kj}] \, |d_k|}$  ($|d_k|$: length of document $d_k$)

  $p(\theta = \theta_j) = \frac{1}{N} \sum_{i=1}^{N} E[z_{ij}]$

- N: number of documents
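A matching M-step sketch; the small smoothing constant is my own addition to avoid zero word probabilities and is not part of the slides' formula:

import numpy as np

def doc_m_step(TF, Ez, smoothing=1e-10):
    # TF: (N, V) term frequencies; Ez: (N, K) responsibilities from the E-step.
    expected_counts = Ez.T @ TF + smoothing                  # sum_k E[z_kj] tf(w_i, d_k), smoothed
    word_probs = expected_counts / expected_counts.sum(axis=1, keepdims=True)
    priors = Ez.mean(axis=0)                                 # p(theta_j) = (1/N) sum_i E[z_ij]
    return word_probs, priors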
Examples of Mixture Models
Other Mixture Models
- Probabilistic latent semantic indexing (PLSI)
- Latent Dirichlet Allocation (LDA)
Problems (I)
- Both k-means and mixture models need to compute the centers of clusters and an explicit distance measure
- Given a strange distance measure, the centers of clusters can be hard to compute
  - E.g., $\| x - x' \|_\infty = \max\left( |x_1 - x_1'|, |x_2 - x_2'|, \ldots, |x_n - x_n'| \right)$

(Figure: three points x, y, z with pairwise distances $\|x - y\|$ and $\|x - z\|$)
Problems (II)
- Both k-means and mixture models look for compact clustering structures
  - In some cases, connected clustering structures are more desirable
Graph Partition
- MinCut: bipartition the graph with a minimal number of cut edges

(Figure: example graph partitioned with CutSize = 2)
2-way Spectral Graph Partitioning
- Weight matrix W
  - $w_{i,j}$: the weight between two vertices i and j
- Membership vector q

  $q_i = \begin{cases} 1 & i \in \text{Cluster } A \\ -1 & i \in \text{Cluster } B \end{cases}$

  $q = \arg\min_{q \in \{-1, 1\}^n} \text{CutSize}$

  $\text{CutSize} = J = \frac{1}{4} \sum_{i,j} \left( q_i - q_j \right)^2 w_{i,j}$
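A direct NumPy transcription of the CutSize formula for a +1/-1 membership vector (illustrative only; names are mine):

import numpy as np

def cut_size(W, q):
    # W: symmetric (n, n) weight matrix; q: vector of +1/-1 cluster memberships.
    diff = q[:, None] - q[None, :]
    return 0.25 * float(((diff ** 2) * W).sum())   # (1/4) sum_{i,j} (q_i - q_j)^2 w_ij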
Solving the Optimization Problem

  $q = \arg\min_{q \in \{-1, 1\}^n} \frac{1}{4} \sum_{i,j} \left( q_i - q_j \right)^2 w_{i,j}$

- Directly solving the above problem requires combinatorial search, i.e. exponential complexity
- How to reduce the computational complexity?
Relaxation Approach
- Key difficulty: $q_i$ has to be either -1 or 1
- Relax $q_i$ to be any real number
- Impose the constraint $\sum_{i=1}^{n} q_i^2 = n$

  $J = \frac{1}{4} \sum_{i,j} \left( q_i - q_j \right)^2 w_{i,j} = \frac{1}{4} \sum_{i,j} \left( q_i^2 + q_j^2 - 2 q_i q_j \right) w_{i,j}$

  $= \frac{1}{4} \sum_i 2 q_i^2 \left( \sum_j w_{i,j} \right) - \frac{1}{4} \sum_{i,j} 2 q_i q_j w_{i,j}$

  $= \frac{1}{2} \sum_i q_i^2 d_i - \frac{1}{2} \sum_{i,j} q_i q_j w_{i,j} = \frac{1}{2} \sum_{i,j} q_i \left( d_i \delta_{i,j} - w_{i,j} \right) q_j$

  where $d_i = \sum_j w_{i,j}$ and $D = \mathrm{diag}(d_1, \ldots, d_n)$

  $J = \frac{1}{2} \, q^T (D - W) q$
Relaxation Approach

  $q^* = \arg\min_q J = \arg\min_q q^T (D - W) q$
  subject to $\sum_k q_k^2 = n$

- Solution: the second-minimum eigenvector of $D - W$:

  $(D - W) q = \lambda_2 q$
Graph Laplacian
- $L = D - W$, where $W = [w_{i,j}]$ and $D = \mathrm{diag}\!\left( \sum_j w_{i,j} \right)$
- L is a positive semi-definite matrix
  - For any x, we have $x^T L x \geq 0$ (why?)
- Minimum eigenvalue $\lambda_1 = 0$ (what is the eigenvector?)
  $0 = \lambda_1 \leq \lambda_2 \leq \lambda_3 \leq \ldots \leq \lambda_k$
- The second-minimum eigenvalue $\lambda_2$ gives the best bipartition of the graph
Recovering Partitions
- Due to the relaxation, $q_i$ can be any real number (not just -1 and 1)
- How to construct a partition based on the eigenvector?
- Simple strategy: $A = \{ i \mid q_i \geq 0 \}$, $B = \{ i \mid q_i < 0 \}$
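The whole relaxed pipeline fits in a few lines of NumPy: form L = D - W, take the eigenvector with the second-smallest eigenvalue, and threshold it at zero as in the simple strategy above. An illustrative sketch, assuming a symmetric weight matrix W (names are mine):

import numpy as np

def spectral_bipartition(W):
    # W: symmetric (n, n) weight matrix.
    d = W.sum(axis=1)
    L = np.diag(d) - W                       # graph Laplacian L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    q = eigvecs[:, 1]                        # second-minimum eigenvector
    A = np.where(q >= 0)[0]
    B = np.where(q < 0)[0]
    return A, B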
Spectral Clustering
- Minimum cut does not balance the sizes of the two parts of the bipartition
Normalized Cut (Shi & Malik, 1997)

  $s(A, B) = \sum_{i \in A} \sum_{j \in B} w_{i,j}, \quad d_A = \sum_{i \in A} d_i, \quad d_B = \sum_{i \in B} d_i$

  $J = \frac{s(A, B)}{d_A} + \frac{s(A, B)}{d_B}$

- Minimize the similarity between clusters and meanwhile maximize the similarity within clusters

  $J = \frac{s(A, B)}{d_A} + \frac{s(A, B)}{d_B} = \sum_{i \in A} \sum_{j \in B} w_{i,j} \, \frac{d_B + d_A}{d_A d_B} = \sum_{i \in A} \sum_{j \in B} w_{i,j} \, \frac{(d_B + d_A)^2}{d_A d_B \, d}$

  with $d = d_A + d_B$ and

  $q_i = \begin{cases} \sqrt{d_B / (d_A d)} & \text{if } i \in A \\ -\sqrt{d_A / (d_B d)} & \text{if } i \in B \end{cases}$

  $J = \sum_{i > j} w_{i,j} \left( q_i - q_j \right)^2$
Normalized Cut

  $J = \sum_{i > j} w_{i,j} \left( q_i - q_j \right)^2 = q^T (D - W) q$

  $q_i = \begin{cases} \sqrt{d_B / (d_A d)} & \text{if } i \in A \\ -\sqrt{d_A / (d_B d)} & \text{if } i \in B \end{cases}$
Normalized Cut

  $J = \sum_{i > j} w_{i,j} \left( q_i - q_j \right)^2 = q^T (D - W) q$

- Relax q to real values under the constraints

  $q^T D q = 1, \quad q^T D e = 0$

- Solution: the generalized eigenvalue problem $(D - W) q = \lambda D q$
Image Segmentation
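Image segmentation is the classic application of the normalized cut: treat pixels as vertices and pairwise similarities as edge weights. A minimal sketch of the relaxed 2-way normalized cut, assuming SciPy is available and every vertex has positive degree so that D is positive definite (function and variable names are mine):

import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(W):
    # W: symmetric (n, n) weight matrix with every row sum (degree) > 0.
    d = W.sum(axis=1)
    D = np.diag(d)
    eigvals, eigvecs = eigh(D - W, D)        # solves (D - W) q = lambda * D q, ascending order
    q = eigvecs[:, 1]                        # skip the trivial eigenvector (lambda = 0)
    A = np.where(q >= 0)[0]
    B = np.where(q < 0)[0]
    return A, B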
Non-negative Matrix Factorization
