

Data Mining Algorithms
Cluster Analysis

Graham Williams
Principal Data Miner, ATO
Adjunct Associate Professor, ANU

Overview

  Cluster Analysis: Introduction, Requirements
  Measuring Similarity: Distances, Data Types
  Algorithms: Cluster Methods, KMeans

Copyright © 2006, Graham J. Williams

http://togaware.com


What is Cluster Analysis?

How do we understand the behaviour of an individual?
  Paint everyone with the same brush;
  Treat everyone as an individual.

How do we understand the world, through understanding every individual in the world? We categorise, for good or bad, entities into groups:
  Socio-economic groups: the poor, the rich;
  Political: a lefty, a new right;
  Racial, religious, geographical.

We find that to get through in life we generally talk about groups, not individuals, but computers don't need to: they have the power to build an understanding of the individual, for better or worse.

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Cluster analysis
  Grouping a set of data objects into clusters

Clustering is unsupervised classification: there are no predefined classes (descriptive data mining).

Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms


General Applications of Clustering

Pattern Recognition
Spatial Data Analysis
  create thematic maps in GIS by clustering feature spaces
  detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
  Document Classification
  Question Categorisation
  Weblog Access Patterns

Specific Examples

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continental faults


What Is Good Clustering?

High Quality:
  high intra-class similarity
  low inter-class similarity

Depends on:
  similarity measure
  algorithm for searching

Ability to discover hidden patterns

Clustering Caveats

Clustering may not be the best way to discover interesting groups in a data set. Often visualisation methods work well, allowing the human expert to identify useful groups. However, as data set sizes increase to millions of entities, this becomes impractical, and clusters help to partition the data so that we can deal with smaller groups.

Different algorithms deliver different clusterings.
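As a rough illustration of the "high intra-class similarity" criterion, and of the remark that different algorithms deliver different clusterings, the following R sketch scores two clusterings of the same data by their total within-cluster sum of squares. The built-in iris data, the choice of three clusters, and the helper name within.ss are assumptions made purely for illustration; they are not part of the original slides.

```r
# Total within-cluster sum of squares: smaller means tighter clusters.
# (within.ss is a hypothetical helper written for this sketch.)
within.ss <- function(x, cluster) {
  x <- as.matrix(x)
  sum(sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    members <- x[idx, , drop = FALSE]
    centre <- colMeans(members)
    sum(sweep(members, 2, centre)^2)   # squared deviations from the centre
  }))
}

x <- scale(iris[, 1:4])                # illustrative data, standardised

set.seed(42)
km <- kmeans(x, centers = 3, nstart = 10)   # a partitioning algorithm
hc <- cutree(hclust(dist(x)), k = 3)        # a hierarchical algorithm

# Quality of each clustering under the same criterion.
c(kmeans = within.ss(x, km$cluster), hclust = within.ss(x, hc))

# The cross-tabulation shows where the two clusterings disagree.
table(kmeans = km$cluster, hclust = hc)
```

The within-cluster sum of squares is only one possible quality criterion; it favours compact, roughly spherical clusters, so it should not be read as a general ranking of the two algorithms.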


Requirements of Clustering in Data Mining

Scalability
Different attribute types
Clusters with arbitrary shape
Minimal domain knowledge required
Can cope with noise and outliers
Insensitive to order of input records
High dimensionality


Similarity and Dissimilarity Between Objects

Distance measures the similarity or dissimilarity between two data objects $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_p)$.

Properties
  $d(a, b) \geq 0$
  $d(a, a) = 0$
  $d(a, b) = d(b, a)$
  $d(a, b) \leq d(a, c) + d(c, b)$

Minkowski distance

$$d(a, b) = \left(|a_1 - b_1|^q + |a_2 - b_2|^q + \ldots + |a_p - b_p|^q\right)^{1/q}$$

If $q = 1$, $d$ is the Manhattan distance:

$$d(a, b) = |a_1 - b_1| + |a_2 - b_2| + \ldots + |a_p - b_p|$$

If $q = 2$, $d$ is the Euclidean distance:

$$d(a, b) = \sqrt{|a_1 - b_1|^2 + |a_2 - b_2|^2 + \ldots + |a_p - b_p|^2}$$
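The Minkowski family can be sketched in a few lines of R. The function name minkowski and the example vectors are assumptions chosen for illustration; dist() is the base R function, whose "minkowski" method takes the order as the argument p.

```r
# Minkowski distance of order q between two numeric vectors.
# (minkowski is a hypothetical helper written for this sketch.)
minkowski <- function(a, b, q) {
  stopifnot(length(a) == length(b), q >= 1)
  sum(abs(a - b)^q)^(1 / q)
}

a  <- c(1, 2, 3)
b  <- c(4, 6, 3)
c0 <- c(0, 0, 0)

minkowski(a, b, q = 1)   # Manhattan: 3 + 4 + 0 = 7
minkowski(a, b, q = 2)   # Euclidean: sqrt(9 + 16 + 0) = 5

# Base R's dist() computes the same quantities between rows of a matrix.
m <- rbind(a, b)
dist(m, method = "manhattan")
dist(m, method = "euclidean")
dist(m, method = "minkowski", p = 3)

# Spot-check the triangle inequality d(a, b) <= d(a, c) + d(c, b).
minkowski(a, b, 2) <= minkowski(a, c0, 2) + minkowski(c0, b, 2)
```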


Type of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types


Major Clustering Approaches

Partitioning algorithms (kmeans, pam, clara, fanny): Construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) cluster.
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model
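To contrast with the partitioning approach, here is a minimal sketch of a hierarchical decomposition using base R's hclust() and cutree(). The built-in USArrests data, complete linkage, and the cut levels are assumptions chosen for illustration only.

```r
x <- scale(USArrests)            # illustrative data, standardised

# Agglomerative hierarchical clustering on Euclidean distances.
hc <- hclust(dist(x), method = "complete")

# The dendrogram records the whole hierarchy of merges.
plot(hc, cex = 0.6)

# Cutting the tree at different levels gives nested partitions,
# so k does not have to be fixed before the algorithm runs.
groups4 <- cutree(hc, k = 4)
groups2 <- cutree(hc, k = 2)
table(groups4)
table(groups2, groups4)          # each of the 4 groups lies inside one of the 2
```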

Basic Partitioning Algorithm

Partition database D of n objects into k clusters

Given k, find the k clusters that optimise the partitioning criterion
Global optimal: exhaustively enumerate all partitions
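Exhaustive enumeration is only feasible for tiny data sets. The number of ways to partition n objects into k nonempty clusters is the Stirling number of the second kind, which satisfies $S(n, k) = k\,S(n-1, k) + S(n-1, k-1)$. The R sketch below evaluates that recurrence; the function name stirling2 is an assumption introduced for illustration.

```r
# Number of ways to partition n objects into k nonempty clusters
# (Stirling number of the second kind), via the recurrence
#   S(n, k) = k * S(n - 1, k) + S(n - 1, k - 1).
# Assumes n >= 1 and k >= 1.
stirling2 <- function(n, k) {
  S <- matrix(0, nrow = n + 1, ncol = k + 1)   # S[i + 1, j + 1] holds S(i, j)
  S[1, 1] <- 1                                 # S(0, 0) = 1
  for (i in seq_len(n)) {
    for (j in seq_len(min(i, k))) {
      S[i + 1, j + 1] <- j * S[i, j + 1] + S[i, j]
    }
  }
  S[n + 1, k + 1]
}

stirling2(4, 2)     # 7 ways to split 4 objects into 2 clusters
stirling2(10, 3)    # 9330
stirling2(100, 5)   # far too many to enumerate (double precision approximation)
```

Even at n = 100 the count is beyond any practical enumeration, which is why heuristic methods are used instead.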

Heuristic methods: k-means and k-medoids algorithms

k-means: Each cluster represented by the center of the cluster
k-medoids or PAM (partition around medoids): Each cluster represented by one of the objects in the cluster
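The difference between the two representatives can be seen on a single small cluster. In this R sketch the five points are made-up illustrative data, and the medoid is taken to be the object with the smallest total distance to the other objects, which is the usual definition but is an assumption as far as these slides go.

```r
# Five 2-D points forming one cluster; the last is an outlier.
pts <- rbind(c(1, 1), c(2, 1), c(1, 2), c(2, 2), c(10, 10))

# k-means representative: the centroid (mean point),
# which is usually not one of the data objects.
centroid <- colMeans(pts)

# k-medoids representative: the object with the smallest total
# distance to all other objects in the cluster.
d <- as.matrix(dist(pts))
medoid.index <- which.min(rowSums(d))
medoid <- pts[medoid.index, ]

centroid   # (3.2, 3.2): pulled towards the outlier
medoid     # (2, 2): an actual data object
```

Because the medoid must be one of the objects, a single extreme value moves it far less than it moves the mean.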


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:

  1. Partition objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when no objects change clusters.

[Figure: a sequence of scatter plots, each on axes running from 0 to 10; the plots themselves were not recovered.]
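The four steps can be sketched directly in R. This is a teaching sketch under stated assumptions, not the implementation behind R's kmeans(): the function name simple.kmeans, the max.iter cap, the balanced random initial partition, and the re-seeding of an empty cluster from a random object are all choices made here for illustration.

```r
# A minimal k-means following the four steps listed above.
simple.kmeans <- function(x, k, max.iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Step 1: partition the objects into k nonempty subsets
  # (a shuffled, balanced assignment uses every cluster when n >= k).
  cluster <- sample(rep(seq_len(k), length.out = n))
  for (iter in seq_len(max.iter)) {
    # Step 2: compute the centroid (mean point) of each cluster.
    # Assumption: an empty cluster is re-seeded with a random object.
    centres <- do.call(rbind, lapply(seq_len(k), function(j) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) == 0) x[sample(n, 1), ] else colMeans(members)
    }))
    # Step 3: assign each object to the cluster with the nearest centroid
    # (squared Euclidean distance to each centroid, one column per cluster).
    d <- sapply(seq_len(k), function(j)
      rowSums(sweep(x, 2, centres[j, ])^2))
    new.cluster <- apply(d, 1, which.min)
    # Step 4: stop when no object changes cluster, else go back to step 2.
    if (all(new.cluster == cluster)) break
    cluster <- new.cluster
  }
  list(cluster = cluster, centers = centres, iterations = iter)
}

# Usage on two made-up, well-separated groups of 20 points each.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple.kmeans(x, k = 2)
fit$centers
table(cluster = fit$cluster, group = rep(1:2, each = 20))
```

In practice the built-in kmeans() is preferable; the sketch only makes the loop of "recompute centroids, reassign objects" explicit.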


Comments on K-Means

Strengths

Relatively efficient: $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations. Normally, $k, t \ll n$.
Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses

Applicable only when the mean is defined: what about categorical data?
Need to specify k, the number of clusters, in advance.
Unable to handle noisy data and outliers.
Not suitable for non-convex clusters.
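The tendency to stop at a local optimum can be observed by running k-means from several random starts and comparing the criterion value each run reaches. In this R sketch the simulated data and the number of restarts are assumptions for illustration; nstart is a documented argument of base R's kmeans() that keeps the best of several random starts. It is a simpler remedy than the annealing or genetic approaches mentioned above and does not guarantee the global optimum.

```r
# Three made-up groups of points in two dimensions.
set.seed(2006)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2),
           cbind(rnorm(50, mean = 0), rnorm(50, mean = 6)))

# Run k-means 20 times, each from a single random start, and record
# the total within-cluster sum of squares reached by each run.
single <- replicate(20, kmeans(x, centers = 3, nstart = 1)$tot.withinss)
round(sort(single), 1)     # differing values indicate different local optima

# nstart = 20 keeps the best of 20 random starts in one call.
best <- kmeans(x, centers = 3, nstart = 20)
best$tot.withinss
```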


KMeans in R
    # Number of clusters to find
    clusters <- 5
    # Load the wine data frame from a local file
    load("wine.Rdata")
    # Cluster on columns 2 and 3 of the wine data
    wine.cl <- kmeans(wine[, 2:3], clusters)
    # Plot the two variables, coloured by cluster membership
    plot(wine[, 2:3], col = wine.cl$cluster)
    # Mark the cluster centres
    points(wine.cl$centers, pch = 19, cex = 1.5, col = 1:clusters)
    # Save a copy of the plot as a PDF file
    dev.copy(device = pdf, file = "wine-clusters.pdf")
    dev.off()
[Figure: scatter plot of the wine data, Malic (y axis) against Alcohol (x axis, roughly 11 to 14).]

Rattle: Hierarchical Variable Cluster


Rattle: Hierarchical Data Cluster

Summary

Cluster analysis is unsupervised learning.
Useful for partitioning a very large population, perhaps for data mining each sub-population separately.
Often more effective under expert guidance.
