Unsupervised Algorithms: Clustering

• Input data: a set of documents to classify; not even class labels are provided
• Task of the classifier: separate the documents into subsets (clusters) automatically; this separating procedure is called clustering
Applications of Clustering
• IR: presentation of results (clustering of documents)
• Summarisation:
1. clustering of similar documents for multi-document summarisation
2. clustering of similar sentences for re-generation of sentences
• Topic Segmentation: clustering of similar paragraphs (adjacent or non-
adjacent) for detection of topic structure/importance
• Lexical semantics: clustering of words by cooccurrence patterns
Example
• Class labels can be generated automatically, but they usually differ from labels specified by humans.
• Thus, solving the whole classification problem with no human intervention is hard; if class labels are provided, clustering is more effective.
The Cluster Hypothesis
• “Similar documents tend to be relevant to the same requests”
• Issues:
1. Variants: “Documents that are relevant to the same topics are
similar”
2. Simple vs. complex topics
3. Evaluation, prediction
• The cluster hypothesis is the main motivation behind document
clustering
Similarity Coefficients
For two document representatives X and Y (sets of terms):
1. Simple matching: |X ∩ Y|
2. Dice's Coefficient: 2·|X ∩ Y| / (|X| + |Y|)
3. Cosine Coefficient: |X ∩ Y| / (|X|^½ · |Y|^½)
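A minimal sketch of these coefficients in Python, assuming documents are represented as sets of terms (the function names are illustrative, not from the slides):

def simple_matching(x, y):
    # Number of terms the two document representatives share
    return len(x & y)

def dice(x, y):
    # Dice's coefficient: 2|X ∩ Y| / (|X| + |Y|)
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(x, y):
    # Cosine coefficient for binary (set-based) representatives
    return len(x & y) / ((len(x) * len(y)) ** 0.5)

doc1 = {"information", "retrieval", "clustering"}
doc2 = {"document", "clustering", "retrieval"}
print(simple_matching(doc1, doc2))  # 2
print(dice(doc1, doc2))             # 0.666...
print(cosine(doc1, doc2))           # 0.666...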
Document-document similarity
• Document representative:
> Select features to characterize the document: terms, phrases, citations
> Select a weighting scheme for these features (see the sketch after this list):
a) Binary, raw/relative frequency, divergence measure
b) Title / body / abstract, controlled vocabulary, selected topics, taxonomy
• Choose a similarity/association coefficient or a dissimilarity/distance metric
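A minimal sketch of building a weighted document representative, assuming a simple bag-of-words model with binary or raw-frequency weighting (the helper name is illustrative):

from collections import Counter

def representative(text, binary=False):
    # Characterize a document by its terms, weighted by raw frequency,
    # or by simple presence/absence when binary=True
    counts = Counter(text.lower().split())
    if binary:
        return {term: 1 for term in counts}
    return dict(counts)

doc = "clustering groups similar documents and similar documents form clusters"
print(representative(doc))               # raw term frequencies
print(representative(doc, binary=True))  # binary weights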
Clustering methods
• Non-hierarchic methods => partitions
> High efficiency, low effectiveness
• Hierarchic methods => hierarchic structures: small clusters of highly similar documents nested within larger clusters of less similar documents
• Divisive => monothetic classifications
• Agglomerative => polythetic classifications
Partitioning method
• Generic procedure:
• The first object becomes the first cluster
• Each subsequent object is matched against existing clusters
1. It is assigned to the most similar cluster if the similarity measure is
above a set threshold
2. Otherwise it forms a new cluster
• Re-shuffling of documents into clusters can be done iteratively to increase cluster similarity (a single-pass sketch follows below)
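A minimal sketch of this single-pass procedure, assuming documents are term-frequency dictionaries compared with cosine similarity and a hypothetical threshold (all names are illustrative):

import math

def cosine_sim(a, b):
    # Cosine similarity between two sparse term-frequency dictionaries
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass(docs, threshold=0.5):
    # Assign each document to the most similar existing cluster if the
    # similarity exceeds the threshold; otherwise it starts a new cluster
    clusters = []
    for doc in docs:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = max(cosine_sim(doc, member) for member in cluster)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is not None:
            best.append(doc)
        else:
            clusters.append([doc])
    return clusters

docs = [{"apple": 2, "fruit": 1}, {"apple": 1, "pie": 1}, {"car": 2, "engine": 1}]
print(single_pass(docs, threshold=0.3))  # two clusters: the apple docs and the car doc

Here the similarity to a cluster is taken as the best match against its members; comparing against a cluster centroid is an equally valid design choice.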
Representation of clustered hierarchies
Kohonen feature maps on text
• Clustering is used in information retrieval systems to enhance the
efficiency and effectiveness of the retrieval process.
• Clustering is achieved by partitioning the documents in a collection
into classes such that documents that are associated with each other
are assigned to the same cluster.
Types of Clustering
Desiderata for clustering
Non-hierarchical (partitioning) clustering
• Partitional clustering algorithms produce a set of k non-nested
partitions corresponding to k clusters of n objects.
• Advantage: it is not necessary to compare each object to every other object; only comparisons between objects and cluster centroids are needed
• Partitioning clustering algorithms are O(kn)
• Main algorithm: K-means
K-means Clustering

• Input: number K of clusters to be generated


• Each cluster is represented by the centroid of its documents
• K-means algorithm:
• partition the documents among the K clusters
• each document is assigned to the cluster with the closest centroid
• recompute the centroids
• repeat the process until the centroids no longer change
• Vector space model:
• As in vector space classification, we measure relatedness between vectors by Euclidean distance, which (for length-normalized vectors) is almost equivalent to cosine similarity.
• Each cluster in K-means is defined by a centroid.
• Objective/partitioning criterion: minimize the average squared
difference from the centroid
K-means: Basic idea
K-means algorithm
• We try to find the minimum average squared difference by iterating
two steps: 
• reassignment: assign each vector to its closest centroid
• recomputation: recompute each centroid as the average of the
vectors that were assigned to it in reassignment
• K-means can start by selecting K randomly chosen objects, the seeds, as the initial cluster centers.
• It then moves the cluster centers around in space in order to minimize RSS. The Residual Sum of Squares (RSS) is a measure of how well the centroids represent the members of their clusters: the squared distance of each vector from its centroid, summed over all vectors:
RSS = Σ_k Σ_{x in cluster k} |x − μ_k|², where μ_k is the centroid of cluster k
• This is done iteratively by repeating the two steps (reassignment, recomputation) until a stopping criterion is met.
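A minimal sketch of computing RSS for a given partition, assuming NumPy arrays for the vectors (the function name is illustrative):

import numpy as np

def rss(clusters):
    # Squared distance of each vector from its cluster centroid,
    # summed over all vectors in all clusters
    total = 0.0
    for vectors in clusters:
        vectors = np.asarray(vectors, dtype=float)
        centroid = vectors.mean(axis=0)
        total += ((vectors - centroid) ** 2).sum()
    return total

print(rss([[(1, 1), (2, 1)], [(4, 3), (5, 4)]]))  # 0.5 + 1.0 = 1.5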
• Algorithm input:
• K: number of clusters
• D: data set containing n objects
• Output: a set of K clusters
• Steps (a code sketch follows the list):
• 1. Arbitrarily choose k objects from D as the initial cluster centers
• 2. Repeat
• 3. Reassign each object to the cluster to which it is most similar, based on the distance measure
• 4. Recompute the centroid of each newly formed cluster
• 5. Until no change
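A minimal sketch of these steps, assuming points are rows of a NumPy array and Euclidean distance is used (function and parameter names are illustrative):

import numpy as np

def kmeans(points, k, seeds=None, max_iter=100):
    # Basic K-means: choose seeds, then alternate reassignment and
    # centroid recomputation until the assignments stop changing
    points = np.asarray(points, dtype=float)
    # Step 1: arbitrarily choose k objects as the initial cluster centers
    centroids = points[:k].copy() if seeds is None else np.asarray(seeds, dtype=float)
    assignment = None
    for _ in range(max_iter):
        # Step 3: reassign each object to the cluster with the closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        # Step 5: stop when no assignment changes
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Step 4: recompute the centroid of each newly formed cluster
        for j in range(k):
            members = points[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return assignment, centroids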

Worked example: k-means on a small data set
• Given data: Medicine A: (1, 1), Medicine B: (2, 1), Medicine C: (4, 3), Medicine D: (5, 4)
• We want to cluster these medicines into k = 2 clusters based on their attributes (weight index and pH).
• Step 1: Initialization: arbitrarily choose k = 2 of the objects (e.g., A and B) as the initial centroids.
• Step 2: Assignment step:
• Calculate the distance of each medicine to each centroid using Euclidean distance.
• Assign each medicine to the nearest centroid.
• Form clusters based on the assignments.
• Step 3: Update step: recompute each centroid as the mean of the medicines assigned to it.
• Step 4: Repeat:
• Repeat steps 2 and 3 until convergence.
• Since only two iterations are needed in this example, we can consider this the final result.
• So, the final clusters are:
• Cluster 1: Medicine A, Medicine B (centroid: (1.5, 1))
• Cluster 2: Medicine C, Medicine D (centroid: (4.5, 3.5))
• This is how the k-means clustering algorithm works mathematically for the given data.
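A minimal sketch reproducing this result with scikit-learn, assuming A and B are used as the initial seeds (the slides do not state the seeds, so this choice is an assumption):

import numpy as np
from sklearn.cluster import KMeans

# Medicines A, B, C, D with attributes (weight index, pH)
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Assumed initial centroids: A and B
km = KMeans(n_clusters=2, init=np.array([[1.0, 1.0], [2.0, 1.0]]), n_init=1)
labels = km.fit_predict(X)

print(labels)               # e.g. [0 0 1 1] -> {A, B} and {C, D}
print(km.cluster_centers_)  # [[1.5 1. ] [4.5 3.5]]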
Hierarchical Clustering
• Goal: to create a hierarchy of clusters by either decomposing a large
cluster into smaller ones, or agglomerating previously defined clusters
into larger ones
• The tree-based hierarchical taxonomy built from a set of documents is called a dendrogram
• There are two types of hierarchical clustering: divisive and agglomerative
• The method used for computing cluster distances defines three variants of the algorithm:
• 1. single-linkage
• 2. complete-linkage
• 3. average-linkage
Methods to find closest pair of clusters:
Single Linkage
• In single linkage hierarchical clustering, the distance between two
clusters is defined as the shortest distance between two points in
each cluster.
• For example, the distance between clusters “r” and “s” is the distance between their two closest points
• Complete Linkage:
• In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between two points in each cluster. For example, the distance between clusters “r” and “s” is the distance between their two furthest points.
• Average Linkage:
• In average linkage hierarchical clustering, the distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster. For example, the distance between clusters “r” and “s” is the average of the distances over all pairs of points, one from each cluster.
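A minimal sketch comparing the three linkage variants with SciPy, reusing the four medicine points from the earlier example (the choice of data and the cut into two clusters are only for illustration):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Points for medicines A, B, C, D (weight index, pH)
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)                            # e.g. single [1 1 2 2] -> {A, B} and {C, D}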
