
Cluster analysis

G SREENIVAS
Cluster Analysis

● What is Cluster Analysis?
● Types of Data in Cluster Analysis
● A Categorization of Major Clustering Methods
● Partitioning Methods
● Hierarchical Methods
What is Cluster Analysis?

● Clustering:
Clustering is the process of grouping the objects in a data
set so that the similarity between objects within a cluster is
maximized while the similarity between objects of different
clusters is minimized.

● Clusters:
A cluster is a collection of data objects that are similar to
one another within the same cluster and dissimilar to the
objects in other clusters.
What Is Good Clustering?

● A good clustering method will produce high-quality clusters
with
○ high intra-class similarity
○ low inter-class similarity
● The quality of a clustering result depends on both the
similarity measure used by the method and its implementation.
● The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
Data Structures
● Most main-memory-based clustering algorithms operate on one
of the two following data structures (both are sketched in
code below).
● Data matrix (object-by-variable structure):
an n × p matrix in which the n rows are the objects and the
p columns are the variables, so entry xif holds the value of
variable f for object i.
● Dissimilarity matrix (object-by-object structure):
an n × n table whose entry d(i, j) is the measured
dissimilarity between objects i and j; it is symmetric with
d(i, i) = 0, so only the lower triangle needs to be stored.
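As a minimal sketch of both structures, assuming NumPy and SciPy
are available (the data values here are made up for illustration):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects (rows) described by p = 2 variables (columns).
X = np.array([[2.0, 4.0],
              [5.0, 2.0],
              [8.0, 9.0],
              [1.0, 1.0]])

# pdist computes the condensed pairwise Euclidean distances; squareform
# expands them into the full symmetric n x n dissimilarity matrix,
# whose diagonal entries d(i, i) are zero.
D = squareform(pdist(X, metric="euclidean"))
print(D)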
Measure the Quality of Clustering

● Dissimilarity/Similarity metric:
Similarity is expressed in terms of a distance function,
which is typically a metric: d(i, j).
● There is a separate “quality” function that measures the
“goodness” of a cluster.
● Weights should be associated with different variables based
on the application and the semantics of the data.
● It is hard to define “similar enough” or “good enough”
○ the answer is typically highly subjective.
Similarity and Dissimilarity Between Objects

● Distances are normally used to measure the similarity or
dissimilarity between two data objects.
● A popular choice is the Minkowski distance (a code sketch
follows this list):

d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and q is a positive integer.
● If q = 1, d is the Manhattan distance:

d(i, j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|

● If q = 2, d is the Euclidean distance:

d(i, j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + … + |xip - xjp|^2)

● Properties
○ d(i, j) ≥ 0
○ d(i, i) = 0
○ d(i, j) = d(j, i)
○ d(i, j) ≤ d(i, k) + d(k, j)   (triangle inequality)
● One can also use a weighted distance, the parametric Pearson
product-moment correlation, or other dissimilarity measures.
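A small sketch of these distances, assuming NumPy; the helper
name minkowski is ours, not a library function:

import numpy as np

def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional points i and j;
    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance."""
    i, j = np.asarray(i, dtype=float), np.asarray(j, dtype=float)
    return np.sum(np.abs(i - j) ** q) ** (1.0 / q)

p1, p2 = (2, 4), (5, 2)
print(minkowski(p1, p2, 1))  # Manhattan: |2-5| + |4-2| = 5.0
print(minkowski(p1, p2, 2))  # Euclidean: sqrt(9 + 4) ≈ 3.606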
Finding a Centroid
Using the following equation, we can find the centroid c of k
n-dimensional points p1, p2, …, pk:

c = (p1 + p2 + … + pk) / k   (the component-wise mean)

Let’s find the centroid of 3 2-D points, say (2,4), (5,2), and
(8,9): c = ((2+5+8)/3, (4+2+9)/3) = (5, 5).
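The same computation in NumPy (a one-line sketch):

import numpy as np

# The centroid of k n-dimensional points is their component-wise mean.
points = np.array([[2, 4], [5, 2], [8, 9]])
centroid = points.mean(axis=0)
print(centroid)  # [5. 5.], since (2+5+8)/3 = 5 and (4+2+9)/3 = 5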
Major Clustering Approaches

● Partitioning algorithms:
Construct various partitions and then evaluate them by some
criterion
○ K-means, K-medoids
● Hierarchical algorithms:
Create a hierarchical decomposition of the set of data (or
objects) using some criterion
○ CURE, Chameleon, BIRCH
The K-Means Clustering Method
● K-means Algorithm:
● Input: the number of clusters k and a database containing
n objects.
● Output: a set of k clusters.
● 1. Arbitrarily choose k objects as the initial cluster
means.
● 2. Repeat
■ (Re)assign each object to the cluster to which the
object is most similar, based on the mean value of the
objects in the cluster;
■ Update the cluster means, i.e., calculate the mean
value of the objects for each cluster;
● 3. Until no change (a runnable sketch follows below).
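A minimal NumPy sketch of this loop; the function name, the
seeded random initialization, and the handling of empty clusters
are our assumptions, not part of the textbook algorithm:

import numpy as np

def k_means(X, k, rng=np.random.default_rng(0)):
    """Cluster the rows of X into k clusters (Lloyd's iteration)."""
    # 1. Arbitrarily choose k objects as the initial cluster means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # 2a. (Re)assign each object to the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Update the cluster means (keep the old mean if a cluster
        # happens to become empty).
        means_new = np.array([X[labels == c].mean(axis=0)
                              if np.any(labels == c) else means[c]
                              for c in range(k)])
        # 3. Until no change.
        if np.allclose(means_new, means):
            return labels, means
        means = means_new

For instance, k_means(np.array([[2., 4.], [5., 2.], [8., 9.],
[1., 1.]]), 2) walks through exactly the assign/update cycle
illustrated on the next slides.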
The K-Means Clustering Method
● The process iterates until the criterion function
converges:

E = Σ (i = 1 to k) Σ (p ∈ Ci) |p - mi|^2

● E is the sum of squared error over all objects in the
database, p is a point in the space, and mi is the mean of
cluster Ci (a short code version follows below).
● The algorithm tries to determine the k partitions that
minimize this squared-error function.
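The criterion is short to compute once an assignment exists; a
sketch reusing the conventions above (X holds the points, labels
the cluster indices, means the cluster means):

import numpy as np

def squared_error(X, labels, means):
    """E = sum over clusters Ci and points p in Ci of |p - mi|^2."""
    return sum(np.sum((X[labels == c] - means[c]) ** 2)
               for c in range(len(means)))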
The K-Means Clustering Method

● Example 1
1. We pick k = 2 centers at random.
2. We cluster our data around these center points.
The K-Means Clustering Method
3. We recalculate centers based on our current clusters.
4. We re-cluster our data around our new center points.
The K-Means Clustering Method

5. We repeat the last two steps until no more data points are
moved into a different cluster.
Hierarchical Clustering
● Uses the distance matrix as the clustering criterion. This
method does not require the number of clusters k as an input,
but it needs a termination condition.
[Figure: AGNES (AGglomerative NESting) proceeds bottom-up over
objects a, b, c, d, e, merging the closest clusters at each step
(Step 0 → Step 4) until all objects form one cluster; DIANA
(DIvisive ANAlysis) runs the same steps in the opposite direction
(Step 4 → Step 0), splitting one cluster back into singletons.]
[Figures: snapshots of agglomerative clustering at levels 2
through 8, with k going from 7 clusters down to 1 cluster.]
AGNES (Agglomerative Nesting)

● Introduced in Kaufmann and Rousseeuw (1990)
● Uses the single-link method and the dissimilarity matrix
● Merges the nodes that have the least dissimilarity
● Continues in a non-descending fashion
● Eventually all nodes belong to the same cluster (see the
sketch after this list)
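SciPy's agglomerative clustering can reproduce this bottom-up,
single-link behaviour; a sketch with made-up data:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[2, 4], [5, 2], [8, 9], [1, 1], [9, 8]], dtype=float)

# method="single" merges, at every step, the two clusters whose closest
# members have the least dissimilarity, until one cluster remains.
Z = linkage(X, method="single", metric="euclidean")
print(Z)  # each row: cluster a, cluster b, merge distance, new cluster size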
DIANA (Divisive Analysis)

● Introduced in Kaufmann and Rousseeuw (1990)
● Inverse order of AGNES
● The cluster is split according to some principle, for
example, the maximum Euclidean distance between the closest
neighboring objects (see the sketch after this list).
● Eventually each node forms a cluster on its own
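SciPy offers no divisive counterpart, but one split step under
the distance principle above can be sketched as follows (the
function name and the seed-based assignment are our assumptions):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def split_once(X):
    """Split a cluster in two, seeded by its most distant pair of objects."""
    D = squareform(pdist(X))                      # pairwise Euclidean distances
    i, j = np.unravel_index(D.argmax(), D.shape)  # maximally distant pair
    nearer_i = D[:, i] <= D[:, j]                 # objects closer to seed i
    return X[nearer_i], X[~nearer_i]

Applying split_once recursively until every cluster is a
singleton mimics DIANA's top-down order.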
A Dendrogram Shows How the Clusters
are Merged Hierarchically

Decompose the data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the
dendrogram at the desired level: each connected component then
forms a cluster (see the sketch below).
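A sketch of the cut, assuming SciPy (fcluster labels the
connected components that remain below the cut; dendrogram(Z)
would draw the tree itself, given matplotlib):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2, 4], [5, 2], [8, 9], [1, 1], [9, 8]], dtype=float)
Z = linkage(X, method="single")

# Cut the dendrogram so that at most 2 clusters remain; every connected
# component below that level becomes one cluster.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # one cluster label per object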
Examples of Clustering Applications
● Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
● Land use: Identification of areas of similar land use in an
earth observation database
● City-planning: Identifying groups of houses according to
their house type, value, and geographical location
● Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
● Biology
○ Plant and animal taxonomies
○ Categorize genes with similar functionality
Cluster Analysis

● References:
1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, Chapter 8, Morgan Kaufmann.
2. http://en.wikipedia.org/wiki/Cluster_analysis
3. http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
THANK YOU
