SpectralClustering Lectures
Ulrike von Luxburg
July 2007
Overview
Rough plan of the lectures, might change as we go along:
1. Intuition and basic algorithms
2. Advanced algorithms (in particular spectral clustering)
3. What is clustering, after all? How can we define it? Some theoretical approaches
4. The number of clusters: different approaches in theory and
practice
Clustering: intuition
Reasons to do this:
• exploratory data analysis
• reducing the complexity of the data
• many more
Clustering Gene Expression Data
Results on real data sets can be found on our web-page:
caltech.edu/lihi/Demos/SelfTuningClustering.html
• Recompute each cluster center as the mean of the points assigned to it.
• Repeat this until convergence.
[Figure: K-means iterations — previous step; updated clustering for fixed centers; updated centers based on new clustering]
Recompute the cluster centers as the means of the new clusters:

m_k^{(i+1)} = (1 / |C_k^{(i+1)}|) Σ_{s ∈ C_k^{(i+1)}} X_s
Output: Clusters C_1, ..., C_K
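For concreteness, a minimal NumPy sketch of the algorithm just described (random initialization, assignment to the closest center, recomputation of the centers as cluster means). All function and variable names are my own, not from the lecture:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means on an (n, d) data array X with K clusters."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialize the centers with K distinct data points.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(n_iter):
        # 2.1 Assign each data point to the closest cluster center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: the assignment did not change
        labels = new_labels
        # 2.2 Recompute each center as the mean of its new cluster.
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

# Example: two well-separated blobs in R^2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
labels, centers = kmeans(X, K=2)
```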
Justifying K-means
• Clustering as “data quantization”
• Tries to fit a mixture of Gaussians using a simplified
EM-algorithm (this is what Joaquin showed you)
• Want to show now: K-means solves a more general problem
min_{C_1,...,C_K}  Σ_{k=1}^K (1/|C_k|) Σ_{i,j ∈ C_k} ‖X_i − X_j‖²        (OPT 1)
where m_k := (1/|C_k|) Σ_{j∈C_k} X_j is the mean of cluster C_k.
Σ_{i∈C_k} ‖X_i − m_k‖²
  = Σ_{i∈C_k} (1/|C_k|²) Σ_{j∈C_k} Σ_{s∈C_k} ⟨X_i − X_j, X_i − X_s⟩
  = (1/|C_k|²) Σ_{i,j,s∈C_k} ( ⟨X_i, X_i⟩ − ⟨X_i, X_s⟩ − ⟨X_j, X_i⟩ + ⟨X_j, X_s⟩ )
  = Σ_{i∈C_k} ⟨X_i, X_i⟩ − (1/|C_k|) Σ_{i,j∈C_k} ⟨X_i, X_j⟩
  = (1/(2|C_k|)) Σ_{i,j∈C_k} ‖X_i − X_j‖²

Hence the objectives of (OPT 1) and (OPT 2) differ only by a factor of 2, so they have the same minimizers.
min_{C_1,...,C_K}  Σ_{k=1}^K Σ_{i∈C_k} ‖X_i − m_k‖²        (OPT 2)
where m_k := (1/|C_k|) Σ_{j∈C_k} X_j is the mean of cluster C_k.
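A quick numerical sanity check (not from the lecture) of the identity used to relate (OPT 1) and (OPT 2): the within-cluster sum of squared distances to the mean equals 1/(2|C_k|) times the sum of pairwise squared distances.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.normal(size=(7, 3))           # points of one cluster C_k
m = C.mean(axis=0)                    # cluster mean m_k

lhs = ((C - m) ** 2).sum()            # sum_i ||X_i - m_k||^2
pairwise = ((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
rhs = pairwise.sum() / (2 * len(C))   # 1/(2|C_k|) * sum_{i,j} ||X_i - X_j||^2

assert np.isclose(lhs, rhs)
```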
Proof:
• Assume the contrary, i.e. that some cluster C_k is not a Voronoi cell.
Proof:
[Figure: data points X_1, X_2, X_3, X_4 and two different configurations of the centers m_1, m_2, m_3]
• Note: K-means is a simplified version of an EM-algorithm fitting a Gaussian mixture model.
Hierarchical clustering
Goal: obtain a complete hierarchy of clusters and sub-clusters in the form of a dendrogram
[Figure: dendrogram of mammal species, with groups for marsupials and monotremes (Platypus, Wallaroo, Opossum), rodents (Rat, HouseMouse), ferungulates (Cat, HarborSeal, GreySeal, WhiteRhino, Horse, FinbackWhale, BlueWhale, Cow), and primates (Gibbon, Gorilla, Human, PygmyChimpanzee, Chimpanzee, Orangutan, SumatranOrangutan)]
Simple idea
Agglomerative (bottom-up) strategy (see the sketch after the references below):
• Start: each point is its own cluster
• Then check which points are closest and “merge” them to form a new cluster
• Continue, always merging the two “closest” clusters, until we are left with one cluster only
A complete book on the topic: N. Jardine and R. Sibson. Mathematical taxonomy. Wiley,
London, 1971.
Nice overview with application in biology: J. Kim and T. Warnow. Tutorial on phylogenetic
tree estimation. ISMB 1999.
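A minimal sketch of this agglomerative strategy (for full dendrograms and an efficient implementation one would typically use a library routine such as scipy.cluster.hierarchy.linkage); the naive O(n³)-ish merging loop below and all names are my own, with single and average linkage as defined on the next slide:

```python
import numpy as np

def agglomerative(X, K, linkage="single"):
    """Naive bottom-up clustering: start with singletons, repeatedly merge
    the two 'closest' clusters until only K clusters remain."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > K:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sub = d[np.ix_(clusters[a], clusters[b])]
                # distance between two clusters, depending on the linkage
                dist = sub.min() if linkage == "single" else sub.mean()
                if dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters
```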
Average linkage: dist(C, C′) = (1 / (|C|·|C′|)) Σ_{x∈C, y∈C′} d(x, y)
Examples
... show matlab demos ...
demo_linkage_clustering_by_foot()
demo_linkage_clustering_comparison()
Summary so far
We have seen the two most widely used clustering algorithms:
• K-means and variants: try to maximize within-cluster similarity
• Single linkage: builds a dendrogram
Spectral Clustering
Graph notation
Always assume that the similarities s_ij are symmetric and non-negative.
Then the graph is undirected and can be weighted.
For the unnormalized graph Laplacian L = D − S (with degree matrix D, d_i = Σ_j s_ij) and any vector f:

f′Lf = f′Df − f′Sf
     = Σ_i d_i f_i² − Σ_{i,j} f_i f_j s_ij
     = (1/2) ( Σ_i (Σ_j s_ij) f_i² − 2 Σ_{i,j} f_i f_j s_ij + Σ_j (Σ_i s_ij) f_j² )
     = (1/2) Σ_{i,j} s_ij (f_i − f_j)²
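A small numerical check of this identity, building L = D − S from a made-up symmetric similarity matrix (a sketch, not lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 6))
S = (A + A.T) / 2                 # symmetric, non-negative similarities
np.fill_diagonal(S, 0)

D = np.diag(S.sum(axis=1))        # degree matrix, d_i = sum_j s_ij
L = D - S                         # unnormalized graph Laplacian

f = rng.normal(size=6)
lhs = f @ L @ f
rhs = 0.5 * sum(S[i, j] * (f[i] - f[j]) ** 2
                for i in range(6) for j in range(6))
assert np.isclose(lhs, rhs)
```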
⟨f, Δf⟩ = ∫ |∇f|² dx
Proof: Exercise
Random walk normalization: L_rw = D⁻¹ L = I − D⁻¹ S
Symmetric normalization: L_sym = D^{−1/2} L D^{−1/2} = I − D^{−1/2} S D^{−1/2}
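Putting the pieces together, a minimal sketch of normalized spectral clustering: take the first K eigenvectors of L_rw (via the generalized eigenproblem Lv = λDv), and run K-means on the rows of the resulting embedding. This is only one of several common variants; it assumes all degrees are positive and reuses the kmeans sketch from above.

```python
import numpy as np
from scipy.linalg import eigh

def spectral_clustering(S, K):
    """Normalized spectral clustering on a symmetric similarity matrix S."""
    d = S.sum(axis=1)                          # degrees (assumed strictly positive)
    L = np.diag(d) - S                         # unnormalized Laplacian
    # Eigenvectors of L_rw = D^{-1} L via the generalized problem L v = lam D v.
    lam, V = eigh(L, np.diag(d))               # eigenvalues in ascending order
    U = V[:, :K]                               # first K eigenvectors as columns
    labels, _ = kmeans(U, K)                   # cluster the rows of U (embedding in R^K)
    return labels
```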
• Data set in R²
[Figure: eigenvectors 1–5 of the graph Laplacian ("eig 1" ... "eig 5") plotted over the data set]
Balanced cuts:
RatioCut(A, B) := cut(A, B) · (1/|A| + 1/|B|)
Ncut(A, B) := cut(A, B) · (1/vol(A) + 1/vol(B))
Spectral clustering: relaxation of RatioCut or Ncut, respectively.
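For concreteness, a small sketch (names are my own) that computes cut, RatioCut and Ncut for a given 2-way partition of a weighted graph:

```python
import numpy as np

def balanced_cuts(S, A):
    """S: symmetric weight matrix; A: boolean mask of one side of the partition."""
    B = ~A
    cut = S[np.ix_(A, B)].sum()                      # total weight between A and B
    vol = S.sum(axis=1)                              # vertex degrees d_i
    ratio_cut = cut * (1 / A.sum() + 1 / B.sum())
    ncut = cut * (1 / vol[A].sum() + 1 / vol[B].sum())
    return cut, ratio_cut, ncut
```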
• |A| = |B|  ⟹  Σ_i f_i = 0  ⟹  f′1 = 0  ⟹  f ⊥ 1
• ‖f‖ = √n ∼ const.

min_f f′Lf   subject to   f ⊥ 1,  f_i = ±1,  ‖f‖ = √n

Relaxation: allow f_i ∈ R
By Rayleigh: the solution f is the second eigenvector of L
Reconstructing the solution: X_i ∈ A ⟺ f_i ≥ 0, X_i ∈ B otherwise
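A minimal sketch of this relaxed RatioCut bipartition — second eigenvector of the unnormalized Laplacian, split by sign as in the reconstruction step above (function name is my own):

```python
import numpy as np

def ratiocut_bipartition(S):
    """2-way spectral partition from the relaxed RatioCut problem."""
    L = np.diag(S.sum(axis=1)) - S
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    f = eigvecs[:, 1]                      # second eigenvector of L
    return f >= 0                          # X_i in A  <=>  f_i >= 0
```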
again Rayleigh-Ritz.
• For L or L_rw: all points of the same cluster are mapped to the identical point in R^k
• Then spectral clustering finds the ideal solution.
• This might go wrong for L_sym if we have vertices with very small degrees
ε-neighborhood graph:
• only use it if the data similarities are “on the same scale”
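A sketch of the two standard graph constructions (ε-neighborhood and symmetric k-nearest-neighbor) from Euclidean data; the choices of ε and k are left to the user, and the helper names are my own:

```python
import numpy as np

def epsilon_graph(X, eps):
    """Unweighted eps-neighborhood graph: connect points closer than eps."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    W = (d < eps).astype(float)
    np.fill_diagonal(W, 0)
    return W

def knn_graph(X, k):
    """Symmetric k-nearest-neighbor graph: connect i and j if either one is
    among the k nearest neighbors of the other."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    W = np.zeros_like(d)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nn.ravel()] = 1.0
    W = np.maximum(W, W.T)     # symmetrize ("either ... or")
    np.fill_diagonal(W, 0)
    return W
```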
Σ_{i,j∈A} (s_ij − E(s_ij)) = Σ_{i,j=1}^n (s_ij − E(s_ij)) (f_i f_j + 1)
                           = Σ_{i,j=1}^n (s_ij − E(s_ij)) f_i f_j + const.
                           = −f′Bf + const.
• The random graph model is chosen such that its graphs are similar to the given graph. Thus it takes into account at least some features of the geometry of the current graph.
Nice historical overview on spectral clustering, and how relaxation can go wrong:
• Spielman, D. and Teng, S. (1996). Spectral partitioning works: planar graphs and finite element meshes. In FOCS, 1996.
Even more clustering algorithms
K-means
• Assign each point to the closest center and recompute each center as the mean of its cluster.
• Repeat this until convergence.
K-means (2)
[Figure: K-means iterations — previous step; updated clustering for fixed centers; updated centers based on new clustering]
K-means (3)
The K-means algorithm:
Input: data points X_1, ..., X_n ∈ R^d, number K of clusters to construct.
1. Randomly initialize the centers m_1^{(0)}, ..., m_K^{(0)}.
2. Iterate until convergence:
   2.1 Assign each data point to the closest cluster center, that is, define the clusters C_1^{(i+1)}, ..., C_K^{(i+1)} by assigning each X_i to its closest center m_k^{(i)}.
   2.2 Recompute the cluster centers as the means of the new clusters:
       m_k^{(i+1)} = (1 / |C_k^{(i+1)}|) Σ_{s ∈ C_k^{(i+1)}} X_s
Output: Clusters C_1, ..., C_K
K-means (4)
[Figure: dendrogram of mammal species, as shown earlier]
A complete book on the topic: N. Jardine and R. Sibson. Mathematical taxonomy. Wiley,
London, 1971.
Nice overview with application in biology: J. Kim and T. Warnow. Tutorial on phylogenetic
tree estimation. ISMB 1999.
Average linkage: dist(C, C′) = (1 / (|C|·|C′|)) Σ_{x∈C, y∈C′} d(x, y)
Model-based clustering
Generative clustering (as opposed to discriminative):
• Assume that the data has been generated according to some probabilistic model
• Want to estimate the parameters of the model
• Methods to do this: maximum likelihood, Bayesian approaches
• In practice: EM-algorithm
• Examples: Gaussian mixtures, hidden Markov models
• Advantage: leads to a better interpretation of the clusters
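As a concrete illustration (not from the lecture), fitting a two-component Gaussian mixture with EM via scikit-learn and reading off the cluster assignments and estimated parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM fit
labels = gmm.predict(X)            # cluster = most likely mixture component
print(gmm.means_)                  # estimated parameters of the model
```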
Information-theoretic approaches
• Want to construct clusters such that as little “information” as possible is lost about the data
• Minimize some distortion function
• Purely based on probability counts, no similarity or dissimilarity needed
• Most popular approach: Information Bottleneck
N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
METIS
A coarsen-refinement algorithm for partitioning graphs:
• Given a large graph
• “Coarsen” it by merging nodes of the graph to “super-nodes”
• Cluster the coarse graph
• Uncoarsen the graph again, thereby also refining the clustering
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359-392, 1999.
Ensemble methods
(also called aggregation methods, or bagging)
• Repeatedly cluster the data using different algorithms, parameter settings, perturbations, ...
• Obtain an “ensemble” of different clusterings
• The ensemble implicitly contains the information which points might “really” belong in the same cluster, and about which points we are rather unsure
• Try to generate a final clustering from the ensemble (see the sketch after the reference below).
Nice overview paper: A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse
framework for combining multiple partitions. JMLR, 3:583-617, 2002.
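A small sketch of the co-association idea behind many ensemble methods: run a base clusterer several times (here the kmeans sketch from above with different seeds), count how often each pair of points ends up in the same cluster, and treat that matrix as evidence for the final clustering. This is only an illustration, not the procedure of the cited paper:

```python
import numpy as np

def coassociation_matrix(X, n_runs=20, K=3):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    n = len(X)
    C = np.zeros((n, n))
    for r in range(n_runs):
        labels, _ = kmeans(X, K, seed=r)           # perturbed runs via different seeds
        C += (labels[:, None] == labels[None, :])
    return C / n_runs

# Pairs that "really" belong together have entries close to 1; the matrix can
# itself be fed into e.g. a linkage algorithm as a similarity.
```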
Subspace clustering
• If the data is very high-dimensional, different reasonable clusterings might exist in different “subspaces”
• Example: customer data base. Can cluster customers along shopping preferences, or along age, or along the money they spend ...
• Goal is to come up with a list of subspaces and the corresponding clusters.
Review article: L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explor. Newsl., 6(1):90-105, 2004.
A completely different perspective on this problem: paper by Tishby et al. at NIPS 2005.
Co-clustering
(also called bi-clustering)
• Want to find clusters in different domains simultaneously
• Example: term/document co-occurrence matrix. Want to cluster words and texts simultaneously, so that we identify groups of words which are typical for certain groups of texts
• Most approaches work iteratively: first cluster along one dimension; then use those clusters as input to cluster along the other dimension; repeat until convergence
Examples:
• K-means; spectral clustering
Disadvantages:
• The choice of the quality function is rather heuristic
• The optimization problems are usually NP-hard
• Clusters are density level sets (here we have to choose the level)
• Clusters correspond to modes of a density (here again: the problem of scale)
Advantages:
• Intuitively makes sense
• Well-defined mathematical object
• Can use standard statistical techniques for estimating densities, level sets, the support of a density, modes of a density, ...
Disadvantages:
• Suggests that to perform clustering we first have to estimate the density :-(
• Avoid estimating densities at any price ... it won’t work even for a moderate number of dimensions
• Hope: often we don’t estimate the full density, but just some aspects (for example for level set estimation; single linkage)
Disadvantages:
• Which set of axioms??? Even though it sounds fundamental, it
is pretty ad hoc!
• In most cases, not really helpful for practice.
Advantages:
• Have a clear model of the data, and a clear interpretation of what a cluster is
• Can use standard techniques to estimate the parameters (maximum likelihood, EM, Bayesian approaches, ...)
Disadvantages:
• Often not so clear what “original information” we are referring
to (the coordinates of the points? Particular features? )
• Often not so easy to implement (minimize some distortion which
has many local minima, often done by some simulated annealing
scheme)
Consistency of clustering algorithms
Consistency, intuitively:
• The more data points I get, the more accurately I can estimate the “true” partition.
• The data X_1, ..., X_n are drawn from an underlying distribution P and a clustering algorithm A is applied to them. What happens to A(X_1, ..., X_n) as n → ∞?
[Figure: samples of increasing size and the clusterings constructed on them]
Not so easy to define ... and even more difficult to prove.
• Example: K-means
• Then we need a second definition: given P, what is the “correct” number of clusters?
• Given a finite sample, we now want to estimate K by some estimator K_n.
• Have to prove that this estimator converges to the correct one.
[Figure: within-cluster dispersion and the gap curve as a function of K]
minimum ⇝ K
This is called the gap statistic.
R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a dataset via the gap statistic. J. Royal. Statist. Soc. B, 63(2):411-423, 2001.
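A rough sketch of the gap statistic (reusing the kmeans sketch from above): compare log W_K on the data with its average under a uniform reference distribution on the bounding box, estimated by Monte Carlo. The standard-error correction from the paper is omitted, and all names are my own:

```python
import numpy as np

def within_dispersion(X, labels, K):
    """W_K: sum over clusters of the within-cluster squared distances to the mean."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in range(K) if np.any(labels == k))

def gap_statistic(X, K_max=10, n_ref=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = []
    for K in range(1, K_max + 1):
        labels, _ = kmeans(X, K, seed=0)
        log_w = np.log(within_dispersion(X, labels, K))
        # Reference: uniform data on the bounding box of X.
        ref = []
        for b in range(n_ref):
            Xr = rng.uniform(lo, hi, size=X.shape)
            lr, _ = kmeans(Xr, K, seed=b)
            ref.append(np.log(within_dispersion(Xr, lr, K)))
        gaps.append(np.mean(ref) - log_w)
    return np.array(gaps)   # pick a K where the gap is large
```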
[Figure: log W for the true data and for scrambled / uniform / randomized reference data, and the corresponding gap curves, as a function of K]
[Figure: three example data sets; log W for the true, scrambled and uniform data, and the corresponding gap curves, as a function of K]
[Figure: clusterings of the same data set for k = 2 and for k = 5]
[Figure from the paper below: scatter plot of a mixture of 4 Gaussians; similarity between clusterings; overlay of the cumulative distributions for increasing K, with average-link hierarchical clustering as the base algorithm]
A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. In Pacific Symposium on Biocomputing, pages 6-17, 2002.
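A rough sketch of the stability idea (not the exact procedure of the cited paper): cluster pairs of random subsamples and measure how similarly the two clusterings behave on the points the subsamples share, using a simple pair-counting agreement. The kmeans sketch from above is reused as the base algorithm, and all names are my own:

```python
import numpy as np

def pair_similarity(l1, l2):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    same1 = l1[:, None] == l1[None, :]
    same2 = l2[:, None] == l2[None, :]
    mask = ~np.eye(len(l1), dtype=bool)
    return (same1 == same2)[mask].mean()

def stability(X, K, n_pairs=10, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_pairs):
        idx1 = rng.choice(len(X), int(frac * len(X)), replace=False)
        idx2 = rng.choice(len(X), int(frac * len(X)), replace=False)
        shared = np.intersect1d(idx1, idx2)
        l1, _ = kmeans(X[idx1], K, seed=rng.integers(10**6))
        l2, _ = kmeans(X[idx2], K, seed=rng.integers(10**6))
        # Compare the two clusterings on the points both subsamples contain.
        pos1 = {v: i for i, v in enumerate(idx1)}
        pos2 = {v: i for i, v in enumerate(idx2)}
        a = np.array([l1[pos1[s]] for s in shared])
        b = np.array([l2[pos2[s]] for s in shared])
        scores.append(pair_similarity(a, b))
    return np.mean(scores)   # compare across K (but see the critique below)
```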
Theoretical statements?
Under certain assumptions on the distribution P, and for certain base algorithms A,
• ... what can we prove about the parameter K chosen with stability?
• Is it reasonable?
• In simple examples, does it select “the right” K?
• How large does the sample need to be?
• Does it “converge”? Confidence statements?
[Figure: quality of the objective function plotted over the space of clusterings]
ε-minimizers of the objective function (Rakhlin ...): the set of ε-minimizers of a function is the set of all points whose value lies within ε of the minimum.
Critical view on the stability approach (4)
Large n: stability for every k!
• Several global minima are often induced by symmetry
• Natural distributions are usually not perfectly symmetric
• In this case, for large n: every k is stable!
[Figure: two plots of quality over the clusterings]
Wrapping up
Wrapping up
What I didn’t talk about:
• Choosing the similarity function between the data points: crucial, but of course not simple ...
• Issues in high-dimensional data: distances can become meaningless; density estimation is impossible; can find many clusters when looking at different dimensions; ...
• Efficient implementations: exist for most algorithms, see the literature ...
• Online algorithms, algorithms for treating data streams, ...
Wrapping up (2)
There are lots of clustering algorithms out there!
There are not many theoretical guidelines out there!
Wrapping up (3)
If you use clustering as a pre-processing step:
Wrapping up (4)
Finally: