Introduction to Information Retrieval
Hierarchical Clustering
Overview
❶ Recap
❷ Introduction
❸ Single-link/Complete-link
❹ Centroid/GAAC
❺ Variants
❻ Labeling clusters
❶ Recap
Applications of clustering in IR

Application: Search result clustering
  What is clustered: search results
  Benefit: more effective information presentation to the user

Application: Scatter-Gather
  What is clustered: (subsets of) the collection
  Benefit: alternative user interface: “search without typing”

Application: Collection clustering
  What is clustered: the collection
  Benefit: effective information presentation for exploratory browsing
  Example: McKeown et al. 2002, news.google.com
K-means algorithm
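As a reminder of what the algorithm does: pick K seed documents as initial centroids, then iterate two steps until the assignment stops changing: (1) reassign each document to its closest centroid, (2) recompute each centroid as the mean of the documents assigned to it. A minimal NumPy sketch; the function name, the Euclidean distance, and the convergence test are illustrative choices, not the lecture's exact pseudocode:

    import numpy as np

    def kmeans(X, K, max_iter=100, seed=0):
        """Minimal K-means on an (N, d) array of document vectors."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=K, replace=False)]  # random seed selection
        assignment = None
        for _ in range(max_iter):
            # reassignment step: closest centroid for every document
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_assignment = dists.argmin(axis=1)
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break  # converged: no document changed its cluster
            assignment = new_assignment
            # recomputation step: each centroid becomes the mean of its documents
            for k in range(K):
                if (assignment == k).any():
                    centroids[k] = X[assignment == k].mean(axis=0)
        rss = ((X - centroids[assignment]) ** 2).sum()  # residual sum of squares
        return assignment, centroids, rss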
Initialization of K-means
▪ Random seed selection is just one of many ways K-means can be initialized.
▪ Random seed selection is not very robust: it's easy to get a suboptimal clustering.
▪ Better heuristics:
  ▪ Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
  ▪ Use hierarchical clustering to find good seeds (next class)
  ▪ Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS (see the sketch below)
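The last heuristic is easy to make concrete. A sketch of i = 10 restarts, reusing the kmeans function sketched above and keeping the run with the lowest RSS:

    def kmeans_restarts(X, K, i=10):
        """Run K-means i times from different random seed sets; keep the lowest-RSS run."""
        best = None
        for seed in range(i):
            assignment, centroids, rss = kmeans(X, K, seed=seed)  # sketch from the previous slide
            if best is None or rss < best[2]:
                best = (assignment, centroids, rss)
        return best

(scikit-learn's KMeans implements the same idea through its n_init parameter.)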
❷ Introduction
Hierarchical clustering
Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier for Reuters.
A dendrogram
▪ The history of mergers can be read off from bottom to top.
▪ The horizontal line of each merger tells us what the similarity of the merger was.
▪ We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering, as in the sketch below.
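For example, with SciPy the merge history can be computed bottom-up and then cut at a chosen level. The toy vectors below are placeholders, and since SciPy works with distances rather than similarities, cutting at cosine similarity 0.4 means cutting at cosine distance 0.6:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # toy document vectors (rows); in IR these would typically be tf-idf vectors
    X = np.array([[1.0, 0.1, 0.0],
                  [0.9, 0.2, 0.1],
                  [0.0, 1.0, 0.9],
                  [0.1, 0.8, 1.0]])

    # bottom-up (agglomerative) clustering; cosine distance = 1 - cosine similarity
    Z = linkage(X, method="single", metric="cosine")

    # cut the dendrogram where cosine similarity drops below 0.4 (distance 0.6)
    flat = fcluster(Z, t=0.6, criterion="distance")
    print(flat)        # one cluster label per document
    # dendrogram(Z)    # would draw the merge history (requires matplotlib)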
Divisive clustering
Top-down alternative to the bottom-up (agglomerative) algorithms: start with all documents in one cluster and recursively split clusters until singleton clusters (or the desired number of clusters) are reached.
❸ Single-link/Complete-link
Complete-link dendrogram
▪ Notice that this dendrogram is much more balanced than the single-link one.
▪ We can create a 2-cluster clustering with two clusters of about the same size.
Single-link clustering
Single-link: Chaining
Coordinates: 1 + 2ϵ, 4, 5 + 2ϵ, 6, 7 − ϵ.
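These are five points on a line (ϵ a small positive constant). Recall that single-link defines cluster similarity by the two most similar members, complete-link by the two most dissimilar members; the two criteria therefore split these points differently. A quick check with SciPy (ϵ = 0.01 is an arbitrary choice):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    eps = 0.01
    pts = np.array([1 + 2 * eps, 4.0, 5 + 2 * eps, 6.0, 7 - eps]).reshape(-1, 1)

    for method in ("single", "complete"):
        Z = linkage(pts, method=method)                   # agglomerative clustering
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 flat clusters
        print(method, labels)

Single-link separates the leftmost point 1 + 2ϵ from the chain formed by the other four points; complete-link instead produces {1 + 2ϵ, 4} and {5 + 2ϵ, 6, 7 − ϵ}.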
❹ Centroid/GAAC
Centroid HAC
▪ The similarity of two clusters is the average inter-similarity: the average similarity of documents from the first cluster with documents from the second cluster.
▪ A naive implementation of this definition is inefficient (O(N²)), but the definition is equivalent to computing the similarity of the centroids:
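Written out, with μ(ω) the centroid (the average of the document vectors) of cluster ω and N_i = |ω_i| (this is the standard formulation; the SIM-CENT name follows the IIR chapter):

    \mathrm{SIM\text{-}CENT}(\omega_i,\omega_j)
      \;=\; \vec{\mu}(\omega_i)\cdot\vec{\mu}(\omega_j)
      \;=\; \frac{1}{N_i N_j}\sum_{d_m \in \omega_i}\sum_{d_n \in \omega_j} \vec{d}_m \cdot \vec{d}_n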
Centroid clustering
Inversions
Centroid clustering is not monotonic: the similarity of successive merges can increase, so a later merge can be better than an earlier one. In a dendrogram this shows up as crossing lines and is called an inversion. Single-link, complete-link and group-average clustering are monotonic and cannot produce inversions. (See the sketch below.)
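A small contrived example that provokes an inversion under centroid clustering; the three 2D points form a tall, nearly isosceles triangle, and the coordinates are chosen only for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    eps = 0.1
    pts = np.array([[1 + eps, 1.0],
                    [5.0,     1.0],
                    [3.0,     1.0 + 2 * np.sqrt(3)]])

    Z = linkage(pts, method="centroid")
    print(Z[:, 2])   # merge distances: the second merge is closer than the first -> inversion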
❺ Variants
Bisecting K-means
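Bisecting K-means (Steinbach et al. 2000, cited under Resources) is a divisive variant: start with all documents in one cluster and repeatedly split one cluster with 2-means until K clusters remain. A self-contained sketch; splitting the largest cluster each time is one common policy, and all names here are illustrative:

    import numpy as np

    def two_means(X, seed=0, iters=20):
        """Split the rows of X into two groups with a few steps of K-means (K = 2)."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=2, replace=False)]
        labels = np.zeros(len(X), dtype=int)
        for _ in range(iters):
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            for k in (0, 1):
                if (labels == k).any():
                    centroids[k] = X[labels == k].mean(axis=0)
        return labels

    def bisecting_kmeans(X, K):
        """Divisive clustering: repeatedly bisect the largest remaining cluster."""
        clusters = [np.arange(len(X))]            # start with everything in one cluster
        while len(clusters) < K:
            clusters.sort(key=len)
            biggest = clusters.pop()              # pick the largest cluster to split
            labels = two_means(X[biggest])
            clusters += [biggest[labels == 0], biggest[labels == 1]]
        return clusters                           # K arrays of document indices

Steinbach et al. additionally repeat each bisection several times and keep the best split.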
❻ Labeling clusters
Exercise
Discriminative labeling
To label a cluster, compare it with the other clusters and select terms or phrases that distinguish it from them (e.g., terms scoring high on mutual information or χ² with respect to the cluster).
Non-discriminative labeling
Select terms or phrases based solely on the cluster itself, e.g., its most frequent terms or the title of the document closest to the centroid. This is simple, but frequent terms do not necessarily distinguish the cluster from other, similar clusters. (Both strategies are sketched below.)
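A toy sketch of both strategies on made-up clusters; the “distinctive” score here is a simple frequency ratio standing in for mutual information or χ²:

    from collections import Counter

    clusters = {
        "c1": ["oil price rises", "opec cuts oil output", "oil and gas markets"],
        "c2": ["election results announced", "party wins election vote"],
    }

    # term counts per cluster and over the whole collection
    per_cluster = {c: Counter(w for doc in docs for w in doc.split())
                   for c, docs in clusters.items()}
    total = sum(per_cluster.values(), Counter())

    for c, counts in per_cluster.items():
        # non-discriminative: most frequent terms inside the cluster
        frequent = [t for t, _ in counts.most_common(3)]
        # discriminative: terms whose occurrences are concentrated in this cluster
        distinctive = sorted(counts, key=lambda t: counts[t] / total[t], reverse=True)[:3]
        print(c, "frequent:", frequent, "distinctive:", distinctive)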
Resources
▪ Chapter 17 of IIR
▪ Resources at http://ifnlp.org/ir
▪ Columbia Newsblaster (a precursor of Google News): McKeown et al. (2002)
▪ Bisecting K-means clustering: Steinbach et al. (2000)
▪ PDDP (similar to bisecting K-means; deterministic, but also less efficient): Savaresi and Boley (2004)