Clusters 2x3

Cluster Analysis
Measuring Similarity
Algorithms
Overview Data Mining Algorithms

Cluster Analysis
Graham Williams
Principal Data Miner, ATO
Adjunct Associate Professor, ANU

Cluster Analysis: Introduction, Requirements
Measuring Similarity: Distances, Data Types
Algorithms: Cluster Methods, KMeans

Copyright © 2006, Graham J. Williams
http://togaware.com

What is Cluster Analysis?

How do we understand the behaviour of an individual?
  Paint everyone with the same brush;
  Treat everyone as an individual.
How do we understand the world: through understanding every individual in the world?
We categorise, for good or bad, entities into groups:
  Socio-economic groups: the poor, the rich;
  Political: a lefty, a new right;
  Racial, religious, geographical.
We find that to get through in life we generally talk about groups, not individuals, but computers don't need to: they have the power to build an understanding of the individual, for better or worse.

Cluster: a collection of data objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters
Cluster analysis
  Grouping a set of data objects into clusters
Clustering is unsupervised classification: there are no predefined classes (descriptive data mining).
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

General Applications of Clustering

Pattern Recognition
Spatial Data Analysis
  create thematic maps in GIS by clustering feature spaces
  detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
  Document Classification
  Question Categorisation
  Weblog Access Patterns

Specific Examples

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicentres should be clustered along continental faults

What Is Good Clustering?

High Quality:
  high intra-class similarity
  low inter-class similarity
Depends on:
  similarity measure
  algorithm for searching
Ability to discover hidden patterns

Clustering Caveats

Clustering may not be the best way to discover interesting groups in a data set.
Often visualisation methods work well, allowing the human expert to identify useful groups.
However, as data set sizes increase to millions of entities, this becomes impractical, and clusters help to partition the data so that we can deal with smaller groups.
Different algorithms deliver different clusterings.
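
The quality criteria for a good clustering (high intra-class similarity, low inter-class similarity) can be made concrete with sums of squares. The following is a minimal sketch, assuming R's built-in kmeans() and the built-in iris measurements as stand-in data; neither the data set nor the choice of three clusters comes from the slides.

```r
# Stand-in data: the four numeric columns of the built-in iris data (an assumption).
x <- iris[, 1:4]

set.seed(42)                      # k-means starts from randomly chosen centres
fit <- kmeans(x, centers = 3)     # three clusters is an illustrative choice

# Within-cluster sum of squares: small values mean that objects in the
# same cluster lie close together (high intra-class similarity).
within <- fit$tot.withinss

# Between-cluster sum of squares: large values mean that the cluster
# centres lie far apart (low inter-class similarity).
between <- fit$betweenss

# The two parts add up to the total sum of squares of the data.
total <- fit$totss
cat("within:", within, " between:", between,
    " between/total:", between / total, "\n")
```

A ratio of between to total close to 1 suggests compact, well-separated clusters under this particular measure; a different similarity measure or algorithm may rank clusterings differently.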

Requirements of Clustering in Data Mining

Scalability
Different attribute types
Clusters with arbitrary shape
Minimal domain knowledge required
Can cope with noise and outliers
Insensitive to order of input records
High dimensionality

Similarity and Dissimilarity Between Objects

Distance measures the similarity or dissimilarity between two data objects $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_p)$.
Properties
  $d(a, b) \ge 0$
  $d(a, a) = 0$
  $d(a, b) = d(b, a)$
  $d(a, b) \le d(a, c) + d(c, b)$

Minkowski distance

  $d(a, b) = \left(|a_1 - b_1|^q + |a_2 - b_2|^q + \ldots + |a_p - b_p|^q\right)^{1/q}$
If $q = 1$, $d$ is the Manhattan distance:
  $d(a, b) = |a_1 - b_1| + |a_2 - b_2| + \ldots + |a_p - b_p|$
If $q = 2$, $d$ is the Euclidean distance:
  $d(a, b) = \sqrt{|a_1 - b_1|^2 + |a_2 - b_2|^2 + \ldots + |a_p - b_p|^2}$
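
As a quick check of the definitions above, the sketch below computes the Minkowski distance directly and compares it with R's built-in dist(). The helper name minkowski and the two example vectors are illustrative assumptions, not part of the slides.

```r
# Minkowski distance of order q between two numeric vectors
# (hypothetical helper written for this example).
minkowski <- function(a, b, q) sum(abs(a - b)^q)^(1 / q)

a <- c(1, 2, 3)
b <- c(4, 6, 3)

manhattan <- minkowski(a, b, q = 1)   # |1-4| + |2-6| + |3-3| = 7
euclidean <- minkowski(a, b, q = 2)   # sqrt(9 + 16 + 0) = 5

# dist() computes distances between the rows of a matrix and should agree.
m <- rbind(a, b)
stopifnot(
  all.equal(manhattan, as.numeric(dist(m, method = "manhattan"))),
  all.equal(euclidean, as.numeric(dist(m, method = "euclidean"))),
  all.equal(minkowski(a, b, 3),
            as.numeric(dist(m, method = "minkowski", p = 3)))
)
```

Note that dist() calls the order p, whereas the slides call it q.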

Type of data in clustering analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types

Major Clustering Approaches

Partitioning algorithms (kmeans, pam, clara, fanny): Construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) cluster.
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesised for each of the clusters and the idea is to find the best fit of that model

Basic Partitioning Algorithm

Partition database D of n objects into k clusters
  Given k, find k clusters that optimise the partitioning criterion
  Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
  k-means: Each cluster represented by the centre of the cluster
  k-medoids or PAM (partitioning around medoids): Each cluster represented by one of the objects in the cluster
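
The difference between the two cluster representations can be seen in a small sketch. It assumes that the cluster package (which provides pam, clara and fanny) is installed, and it uses the built-in iris measurements as stand-in data; neither assumption comes from the slides.

```r
library(cluster)                  # provides pam(); assumed to be installed

x <- as.matrix(iris[, 1:4])       # stand-in data (an assumption)
k <- 3                            # illustrative number of clusters

set.seed(42)
km <- kmeans(x, centers = k)      # k-means: each centre is a mean point
pm <- pam(x, k = k)               # k-medoids: each centre is an actual object

# A k-means centre is an average, so it need not coincide with any object.
print(km$centers)

# Each medoid is one of the rows of x; id.med holds the row indices.
print(pm$id.med)
print(pm$medoids)
```

Because a medoid must be an observed object, k-medoids only needs pairwise dissimilarities, whereas k-means needs a mean to be defined.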

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:
  1. Partition objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the centre (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when no objects change clusters.
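
The four steps can be sketched directly in R. This is an illustrative, unoptimised version written for this handout's description, not the implementation behind R's kmeans(); the function name simple_kmeans, its arguments, and the simulated data are assumptions.

```r
# A minimal k-means following the four steps (hypothetical helper).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  # Step 1: partition the objects into k nonempty subsets at random.
  cluster <- sample(rep_len(seq_len(k), n))
  centers <- matrix(NA_real_, k, ncol(x))
  d <- matrix(NA_real_, n, k)
  for (iter in seq_len(max_iter)) {
    if (any(tabulate(cluster, nbins = k) == 0))
      stop("a cluster became empty; try a different random start")
    # Step 2: the centroid of each cluster is the mean of its objects.
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
    # Step 3: assign each object to the nearest centroid
    # (squared Euclidean distance).
    for (j in seq_len(k))
      d[, j] <- rowSums(sweep(x, 2, centers[j, ])^2)
    new_cluster <- apply(d, 1, which.min)
    # Step 4: stop when no object changes cluster, otherwise repeat.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on simulated data: two well-separated groups of 20 points each.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(x, k = 2)
print(fit$centers)
```

In practice the built-in kmeans() is the better choice; the sketch only makes the loop structure of the four steps visible.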
[Figure: The K-Means Clustering Method. Scatter-plot panels with both axes running from 0 to 10.]

Comments on K-Means

Strengths
  Relatively efficient: $O(tkn)$, where $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations. Normally, $k, t \ll n$.
Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
Weaknesses
  Applicable only when the mean is defined: what about categorical data?
  Need to specify k, the number of clusters, in advance.
  Unable to handle noisy data and outliers.
  Not suitable for non-convex clusters.
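
Because k-means often stops at a local optimum, the result can depend on the random starting point. The slides name deterministic annealing and genetic algorithms as remedies; the sketch below instead uses a simpler, different mitigation, repeated random starts, which kmeans() also offers through its nstart argument. The simulated data and the choice of four clusters are assumptions for illustration.

```r
# Simulated stand-in data: four Gaussian blobs in two dimensions (an assumption).
set.seed(1)
centres <- rbind(c(0, 0), c(0, 4), c(4, 0), c(4, 4))
x <- do.call(rbind, lapply(1:4, function(i)
  cbind(rnorm(50, centres[i, 1], 0.8), rnorm(50, centres[i, 2], 0.8))))

# Run k-means from 20 different random starts and record the total
# within-cluster sum of squares of each solution.
fits <- lapply(1:20, function(seed) {
  set.seed(seed)
  kmeans(x, centers = 4, nstart = 1)
})
wss <- sapply(fits, function(f) f$tot.withinss)

# Different starts may end in different local optima (different values).
print(round(sort(wss), 1))

# Keep the best solution found by hand ...
best <- fits[[which.min(wss)]]

# ... or ask kmeans() to try 20 starts itself and return the best one.
multi <- kmeans(x, centers = 4, nstart = 20)
```

Repeated starts do not guarantee the global optimum; they only make a poor local optimum less likely.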

KMeans in R

    # Cluster the wine data on columns 2 and 3, then plot the result.
    clusters <- 5
    load("wine.Rdata")                        # provides the data frame wine
    wine.cl <- kmeans(wine[, 2:3], clusters)
    plot(wine[, 2:3], col = wine.cl$cluster)  # colour each point by its cluster
    points(wine.cl$centers, pch = 19, cex = 1.5, col = 1:clusters)
    # Copy the plot to a PDF file, then close that device.
    dev.copy(device = pdf, file = "wine-clusters.pdf")
    dev.off()

[Figure: scatter plot of the wine data, with Alcohol on the x-axis (about 11 to 14) and Malic on the y-axis.]

Rattle: Hierarchical Variable Cluster

Rattle: Hierarchical Data Cluster

Summary

Cluster analysis is unsupervised learning.
Useful for partitioning a very large population, perhaps for data mining each sub-population separately.
Often more effective under expert guidance.