
Clustering:

Unsupervised Learning Methods


15-381
Jaime Carbonell
8 April 2003
OUTLINE:
What is unsupervised learning?
Similarity computations
Clustering Algorithms
Other kinds of unsupervised learning
Unsupervised Learning

 Definition of Unsupervised Learning:


Learning useful structure without labeled
classes, optimization criterion, feedback
signal, or any other information beyond the
raw data and grouping principle(s).
Unsupervised Learning
 Examples:
 Find natural groupings of X’s (X = human languages, stocks,
gene sequences, animal species, …) → Prelude to discovery of
underlying properties
 Summarize the news for the past month → Cluster first, then
report centroids
 Sequence extrapolation: E.g. Predict cancer incidence next
decade; predict rise in antibiotic-resistant bacteria
 Methods
 Clustering (n-link, k-means, GAC,…)
 Taxonomy creation (hierarchical clustering)
 Novelty detection (“meaningful” outliers)
 Trend detection (extrapolation from multivariate partial
derivatives)
Similarity Measures in Data Analysis

 General Assumptions
  Each data item is a tuple (vector)
  Values of tuples are nominal, ordinal, or numerical
  Similarity = (Distance)^-1

 Pure Numerical Tuples
  Sim(d_i, d_j) = Σ_k d_{i,k} · d_{j,k}
  sim(d_i, d_j) = cos(d_i, d_j)
  …and many more (slide after next)
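A minimal Python sketch of the two numeric similarity measures above (the function names dot_sim and cosine_sim are illustrative, not from the slides):

import math

def dot_sim(d_i, d_j):
    # Sim(d_i, d_j) = sum_k d_{i,k} * d_{j,k}
    return sum(x * y for x, y in zip(d_i, d_j))

def cosine_sim(d_i, d_j):
    # cos(d_i, d_j): dot product normalized by the vector lengths
    norm = math.sqrt(sum(x * x for x in d_i)) * math.sqrt(sum(y * y for y in d_j))
    return dot_sim(d_i, d_j) / norm if norm else 0.0

print(cosine_sim([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))  # 1.0: same direction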
Similarity Measures in Data Analysis
 For Ordinal Values
 E.g. "small," "medium," "large," "X-large"
 Convert to numerical values, assuming constant spacing, on a
normalized [0,1] scale, where max(v)=1, min(v)=0, and the
others interpolate
 E.g. "small"=0, "medium"=0.33, etc.
 Then, use numerical similarity measures
 Or, use similarity matrix (see next slide)
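A small illustrative sketch of the ordinal-to-numeric conversion just described, assuming equal spacing on [0,1] (the helper name ordinal_to_numeric is hypothetical):

def ordinal_to_numeric(scale):
    # Map an ordered list of values onto [0, 1] with equal spacing:
    # min(v) = 0, max(v) = 1, the rest interpolate.
    step = 1.0 / (len(scale) - 1)
    return {value: i * step for i, value in enumerate(scale)}

sizes = ordinal_to_numeric(["small", "medium", "large", "X-large"])
# {'small': 0.0, 'medium': 0.333..., 'large': 0.666..., 'X-large': 1.0}
sim = 1.0 - abs(sizes["small"] - sizes["medium"])  # one simple similarity choice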
Similarity Measures (cont.)

 For Nominal Values


 E.g. "Boston", "LA", "Pittsburgh", or "male",
"female", or "diffuse", "globular", "spiral",
"pinwheel"
 Binary rule: If d_{i,k} = d_{j,k}, then sim = 1, else 0
 Use an underlying semantic property: E.g.
Sim(Boston, LA) = dist(Boston, LA)^-1, or
Sim(Boston, LA) = |size(Boston) - size(LA)| / Max(size(cities))
 Or, use a similarity matrix
Similarity Matrix
        tiny   little  small  medium  large  huge
tiny    1.0    0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0

 Diagonal must be 1.0


 Monotonicity property must hold
 No linearity (value interpolation) assumed
 Qualitative Transitive property must hold
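One possible way to encode the similarity matrix above for lookup, as a sketch; the symmetric-lookup helper is an assumption, not part of the slides:

# Upper-triangle entries from the slide; symmetry gives the rest.
SIM = {
    ("tiny", "tiny"): 1.0, ("tiny", "little"): 0.8, ("tiny", "small"): 0.7,
    ("tiny", "medium"): 0.5, ("tiny", "large"): 0.2, ("tiny", "huge"): 0.0,
    ("little", "little"): 1.0, ("little", "small"): 0.9, ("little", "medium"): 0.7,
    ("little", "large"): 0.3, ("little", "huge"): 0.1,
    ("small", "small"): 1.0, ("small", "medium"): 0.7, ("small", "large"): 0.3,
    ("small", "huge"): 0.2,
    ("medium", "medium"): 1.0, ("medium", "large"): 0.5, ("medium", "huge"): 0.3,
    ("large", "large"): 1.0, ("large", "huge"): 0.8,
    ("huge", "huge"): 1.0,
}

def sim(a, b):
    # Look up either orientation, since the matrix is symmetric.
    return SIM.get((a, b), SIM.get((b, a)))

assert sim("huge", "tiny") == 0.0 and sim("small", "little") == 0.9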
Document Clustering Techniques
 Similarity or Distance Measure: Alternative Choices
  Cosine similarity: sim(x, y) = (x · y) / (|x| |y|)
  Euclidean distance: d(x, y) = sqrt( Σ_k (x_k - y_k)² )
  Kernel functions
  Language modeling: P(y | model_x), where x and y are documents
Document Clustering Techniques
 Kullback-Leibler distance ("relative entropy"): D(p || q) = Σ_k p_k log(p_k / q_k)
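A rough sketch of the Kullback-Leibler computation, assuming documents are represented as word-probability distributions (the function name and toy distributions are illustrative):

import math

def kl_divergence(p, q):
    # D(p || q) = sum_k p_k * log(p_k / q_k); assumes q_k > 0 wherever p_k > 0.
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Unigram word distributions of two documents (toy example).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # small positive value; 0 only when p == q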
Incremental Clustering Methods
Given n data items: D = D_1, D_2, …, D_i, …, D_n
And given a minimal similarity threshold: S_min

Cluster data incrementally as follows:

Procedure SingleLink(D)   ;; a.k.a. "closest-link"
  Let CLUSTERS = {D_1}
  For i = 2 to n
    Let D_c = Argmax_{j < i} [Sim(D_i, D_j)]
    If Sim(D_i, D_c) > S_min, add D_i to D_c's cluster
    Else Append(CLUSTERS, {D_i})   ;; new cluster
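A minimal Python rendering of the SingleLink procedure above; the list-of-lists cluster representation and function name are illustrative choices:

def single_link_cluster(data, sim, s_min):
    # data  : list of items D_1..D_n
    # sim   : pairwise similarity function
    # s_min : minimal similarity threshold
    clusters = [[data[0]]]                      # CLUSTERS = {D_1}
    for d_i in data[1:]:
        # Find the already-seen item closest to d_i and the cluster holding it.
        best_sim, best_cluster = max(
            ((sim(d_i, d_j), c) for c in clusters for d_j in c),
            key=lambda pair: pair[0],
        )
        if best_sim > s_min:
            best_cluster.append(d_i)            # attach to closest item's cluster
        else:
            clusters.append([d_i])              # start a new cluster
    return clusters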
Incremental Clustering
via Closest-link Method

Attach to cluster containing closest point.


Danger: “Snake”-like clusters
Incremental Clustering (cont.)
Procedure AverageLink(D)
  Let CLUSTERS = {D_1}
  For i = 2 to n
    Let C* = Argmax_{C ∈ CLUSTERS} [Sim(D_i, centroid(C))]
    If Sim(D_i, centroid(C*)) > S_min, add D_i to cluster C*
    Else Append(CLUSTERS, {D_i})   ;; new cluster
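A corresponding sketch of AverageLink, assuming data items are equal-length numeric vectors so centroids can be averaged (names are illustrative):

def average_link_cluster(data, sim, s_min):
    # Same single pass as SingleLink, but compare d_i to each cluster's centroid.
    def centroid(c):
        return [sum(xs) / len(c) for xs in zip(*c)]

    clusters = [[data[0]]]
    for d_i in data[1:]:
        best = max(clusters, key=lambda c: sim(d_i, centroid(c)))
        if sim(d_i, centroid(best)) > s_min:
            best.append(d_i)
        else:
            clusters.append([d_i])
    return clusters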

 Observations
  Single pass over the data → easy to cluster new data incrementally
  Requires an arbitrary S_min threshold
  O(n|C|) time, O(n) space
K-Means Clustering

1. Select k seeds s.t. d(k_i, k_j) > d_min

2. Assign points to clusters by min dist.:
   Cluster(p_i) = Argmin_{s_j ∈ {s_1,…,s_k}} d(p_i, s_j)

3. Compute new cluster centroids:
   c_j = (1/n_j) Σ_{p_i ∈ j-th cluster} p_i
   (n_j = number of points in the j-th cluster)

4. Reassign points to clusters (as in 2 above)

5. Iterate until no points change clusters
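A compact sketch of steps 1-5, assuming numeric point vectors and Euclidean distance; the seed-spacing retry loop and helper names are illustrative, not the lecture's exact formulation:

import math, random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, d_min=0.0, max_iter=100):
    # 1. Select k random seeds; the d(k_i, k_j) > d_min spacing check from the
    #    slides is approximated here by a simple retry loop.
    seeds = random.sample(points, k)
    while any(euclidean(a, b) <= d_min for i, a in enumerate(seeds)
              for b in seeds[i + 1:]):
        seeds = random.sample(points, k)

    centroids = seeds
    assignment = None
    for _ in range(max_iter):
        # 2/4. Assign each point to the nearest centroid.
        new_assignment = [min(range(k), key=lambda j: euclidean(p, centroids[j]))
                          for p in points]
        # 5. Stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # 3. Recompute each centroid as the mean of its cluster's points.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(xs) / len(members) for xs in zip(*members)]
    return centroids, assignment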
K-Means Clustering: Initial Data Points

Step 1: Select k random seeds s.t. d(k_i, k_j) > d_min
[Figure: initial data points and initial seeds (k = 3)]
K-Means Clustering: First-Pass Clusters

Step 2: Assign points to clusters by min dist.:
Cluster(p_i) = Argmin_{s_j ∈ {s_1,…,s_k}} d(p_i, s_j)
[Figure: first-pass clusters formed around the initial seeds]
K-Means Clustering: Seeds → Centroids

Step 3: Compute new cluster centroids:
c_j = (1/n_j) Σ_{p_i ∈ j-th cluster} p_i
[Figure: seeds replaced by the new cluster centroids]
K-Means Clustering: Second Pass Clusters

Step 4: Reassign points to clusters (as in 2 above), now using the new centroids:
Cluster(p_i) = Argmin_{c_j ∈ {c_1,…,c_k}} d(p_i, c_j)
[Figure: second-pass clusters formed around the new centroids]
K-Means Clustering: Iterate Until Stability

Steps 5 to N: Iterate steps 3 & 4 until no point changes cluster
[Figure: final clusters and centroids after convergence]
Document Clustering Techniques
 Example. Group documents based on similarity
Similarity matrix:

Thresholding at a similarity value of .9 yields:
  a complete graph C1 = {1,4,5}, namely complete linkage
  a connected graph C2 = {1,4,5,6}, namely single linkage
For clustering we need three things:
 A similarity measure for pairwise comparison between documents
 A clustering criterion (complete link, single link, …)
 A clustering algorithm
Document Clustering Techniques
 Clustering Criterion: Alternative Linkages
 Single-link ("nearest neighbor"):
   sim(C_1, C_2) = max_{x ∈ C_1, y ∈ C_2} sim(x, y)
 Complete-link:
   sim(C_1, C_2) = min_{x ∈ C_1, y ∈ C_2} sim(x, y)
 Average-link ("group average clustering", or GAC):
   sim(C_1, C_2) = (1 / |C_1||C_2|) Σ_{x ∈ C_1, y ∈ C_2} sim(x, y)
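Illustrative Python versions of the three linkage criteria, assuming a pairwise similarity function sim over documents (the function names are not from the slides):

def single_link_sim(c1, c2, sim):
    # Similarity of the closest pair across the two clusters.
    return max(sim(x, y) for x in c1 for y in c2)

def complete_link_sim(c1, c2, sim):
    # Similarity of the farthest pair across the two clusters.
    return min(sim(x, y) for x in c1 for y in c2)

def average_link_sim(c1, c2, sim):
    # Mean similarity over all cross-cluster pairs (the GAC criterion).
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(sim(x, y) for x, y in pairs) / len(pairs)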


Hierarchical Agglomerative
Clustering Methods
 Generic Agglomerative Procedure (Salton '89):
- results in nested clusters via iteration
1. Compute all pairwise document-document similarity
coefficients
2. Place each of n documents into a class of its own
3. Merge the two most similar clusters into one;
- replace the two clusters by the new cluster
- recompute intercluster similarity scores w.r.t. the new cluster
4. Repeat the above step until there are only k clusters left
(note k could = 1).
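A sketch of the generic agglomerative loop; for brevity it recomputes linkage scores each round instead of caching the pairwise coefficients from step 1 (the names and the linkage parameter are illustrative):

def hac(docs, sim, linkage, k=1):
    # 2. Start with each document in its own cluster.
    clusters = [[d] for d in docs]
    # 4. Keep merging until only k clusters remain.
    while len(clusters) > k:
        # 1/3. Find the most similar pair of clusters under the chosen linkage...
        best_i, best_j = max(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], sim),
        )
        # ...and replace the two clusters by their merge.
        merged = clusters[best_i] + clusters[best_j]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (best_i, best_j)] + [merged]
    return clusters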
Group Agglomerative Clustering

[Figure: numbered data points being merged step by step into nested clusters]
Hierarchical Agglomerative
Clustering Methods (cont.)

 Heuristic Approaches to Speedy


Clustering:
 Reallocation methods with k selected-seeds
(O(kn) time)
- k is the desired number of clusters; n is the number
of documents
 Buckshot: random sampling (of (k)n
documents) puls global HAC
 Fractionation: Divide and Conquer
Creating Taxonomies
 Hierarchical Clustering
 GAC trace creates a binary hierarchy
 Incremental-link → hierarchical version
  1. Cluster data with high S_min → 1st hierarchical level
  2. Decrease S_min (stop at S_min = 0)
  3. Treat cluster centroids as data tuples and recluster, creating the
     next level of hierarchy, then repeat steps 2 and 3

 K-means → hierarchical k-means
  1. Cluster data with large k
  2. Decrease k (stop at k = 1)
  3. Treat cluster centroids as data tuples and recluster, creating the
     next level of hierarchy, then repeat steps 2 and 3
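A sketch of the hierarchical k-means idea, reusing the k_means sketch from the K-Means Clustering slide above; the decreasing list of k values is an assumed input:

def hierarchical_k_means(points, ks):
    # ks: decreasing sequence of cluster counts, e.g. [8, 4, 2, 1].
    # Each level clusters the previous level's centroids.
    levels = []
    data = points
    for k in ks:
        centroids, _ = k_means(data, k)
        levels.append(centroids)
        data = centroids          # treat centroids as the next level's data
    return levels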
Taxonomies (cont.)

 Postprocess Taxonomies
 Eliminate "no-op" levels
 Agglomerate "skinny" levels
 Label meaningful levels manually or with
centroid summary
