Clusters 2x3
Cluster Analysis: Introduction, Requirements
Measuring Similarity: Distances, Data Types
Algorithms: Cluster Methods, KMeans
http://togaware.com
How do we understand the world: by understanding every individual in it? We categorise entities into groups, for good or bad:
  Socio-economic: the poor, the rich
  Political: a lefty, a new right
  Racial, religious, geographical, ...
Cluster analysis: grouping a set of data objects into clusters.
To get through life we generally talk about groups, not individuals. Computers don't need to: they have the power to build an understanding of the individual, for better or worse.
Specific Examples
  Marketing: help marketers discover distinct groups in their customer bases, then use this knowledge to develop targeted marketing programs.
  Land use: identify areas of similar land use in an earth observation database.
  Insurance: identify groups of motor insurance policy holders with a high average claim cost.
  City planning: identify groups of houses according to their house type, value, and geographical location.
  Earthquake studies: observed earthquake epicentres should be clustered along continental faults.
High Quality:
  high intra-class similarity
  low inter-class similarity
Depends on:
  the similarity measure
  the algorithm used for searching

Clustering Caveats
Clustering may not be the best way to discover interesting groups in a data set. Visualisation methods often work well, allowing the human expert to identify useful groups. However, as data sets grow to millions of entities, visualisation becomes impractical, and clusters help to partition the data so that we can deal with smaller groups.
Different algorithms deliver different clusterings.
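One rough way to put numbers on "high intra-class similarity, low inter-class similarity" is to compare within-cluster and between-cluster sums of squares. The sketch below is only an illustration under stated assumptions: the data are synthetic (not from these slides), the variables are numeric, and similarity is taken to mean small squared Euclidean distance. The components totss, tot.withinss and betweenss are those returned by R's kmeans().

```r
# Synthetic data (an assumption for illustration): two well-separated groups.
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

fit <- kmeans(x, centers = 2, nstart = 10)

fit$tot.withinss  # spread inside clusters: small means high intra-class similarity
fit$betweenss     # spread between cluster centres: large means low inter-class similarity

# Share of the total sum of squares explained by the clustering (between 0 and 1).
ratio <- fit$betweenss / fit$totss
ratio
```

A ratio close to 1 suggests tight, well-separated clusters under this particular measure; a different similarity measure or algorithm may rank clusterings differently.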
Overview
  Scalability
  Different attribute types
  Clusters with arbitrary shape
  Minimal domain knowledge required
  Can cope with noise and outliers
  Insensitive to order of input records
  High dimensionality
Minkowski distance
Distance measures the similarity or dissimilarity between two data objects $a = (a_1, a_2, \ldots, a_p)$ and $b = (b_1, b_2, \ldots, b_p)$.
Properties:
  $d(a, b) \geq 0$
  $d(a, a) = 0$
  $d(a, b) = d(b, a)$
  $d(a, b) \leq d(a, c) + d(c, b)$
The Minkowski distance of order $q$ is
\[ d(a, b) = \left( |a_1 - b_1|^q + |a_2 - b_2|^q + \cdots + |a_p - b_p|^q \right)^{1/q} \]
If $q = 1$, $d$ is the Manhattan distance:
\[ d(a, b) = |a_1 - b_1| + |a_2 - b_2| + \cdots + |a_p - b_p| \]
If $q = 2$, $d$ is the Euclidean distance:
\[ d(a, b) = \sqrt{ |a_1 - b_1|^2 + |a_2 - b_2|^2 + \cdots + |a_p - b_p|^2 } \]
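The Manhattan (q = 1) and Euclidean (q = 2) cases can be checked numerically. The following is a minimal sketch: the helper name minkowski and the two example vectors are assumptions made for illustration, while dist() with method = "minkowski" is the base R function.

```r
# Minkowski distance of order q between two numeric vectors of equal length.
# `minkowski` is a hypothetical helper name, not part of base R.
minkowski <- function(a, b, q = 2) {
  stopifnot(length(a) == length(b), q >= 1)
  sum(abs(a - b)^q)^(1 / q)
}

# Example vectors (made up for illustration).
a <- c(1, 2, 3)
b <- c(4, 6, 3)

minkowski(a, b, q = 1)  # Manhattan: 3 + 4 + 0 = 7
minkowski(a, b, q = 2)  # Euclidean: sqrt(9 + 16 + 0) = 5

# Base R's dist() computes the same quantity between the rows of a matrix;
# its argument p plays the role of q here.
dist(rbind(a, b), method = "minkowski", p = 1)
```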
Overview
  Interval-scaled variables
  Binary variables
  Nominal, ordinal, and ratio variables
  Variables of mixed types
  1. Partition objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when no objects change clusters.
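The steps above can be sketched directly in R. This is an illustrative from-scratch version, not the algorithm used inside R's kmeans(): the function name simple_kmeans, the random initial partition, the max_iter cap, the use of Euclidean distance and the synthetic data are all assumptions.

```r
# `simple_kmeans` is a hypothetical name; x is a numeric matrix or data frame.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  n <- nrow(x)
  stopifnot(k >= 1, k <= n)
  # Step 1: partition the objects into k nonempty subsets (here: at random).
  cluster <- sample(rep_len(seq_len(k), n))
  for (iter in seq_len(max_iter)) {
    # This sketch simply stops if a cluster has lost all of its objects.
    if (any(tabulate(cluster, nbins = k) == 0)) stop("a cluster became empty")
    # Step 2: centroids (mean points) of the clusters of the current partition.
    centers <- do.call(rbind, lapply(seq_len(k), function(j) {
      colMeans(x[cluster == j, , drop = FALSE])
    }))
    # Step 3: squared Euclidean distance from every object to every centroid,
    # then assign each object to the nearest one.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    nearest <- max.col(-d, ties.method = "first")
    # Step 4: stop when no object changes cluster, otherwise repeat from Step 2.
    if (all(nearest == cluster)) break
    cluster <- nearest
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

# Usage on synthetic data: two groups of 20 points each.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))
fit <- simple_kmeans(x, k = 2)
table(fit$cluster)
fit$centers
```

Because the initial partition is random, different seeds can lead to different final clusterings.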
Comments on K-Means

Strengths
  Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t ≪ n.
  Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.

Weaknesses
  Applicable only when the mean is defined: what about categorical data?
  Need to specify k, the number of clusters, in advance.
  Unable to handle noisy data and outliers.
  Not suitable for non-convex clusters.
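Because k-means often stops at a local optimum, a simple and common mitigation is to run it from several random starts and keep the best result. Note that this is a different and much simpler technique than the deterministic annealing or genetic algorithms mentioned above, and it does not guarantee the global optimum. The data below are synthetic and the choice of 25 restarts is an arbitrary assumption.

```r
# Synthetic data (an assumption for illustration): three groups of 30 points.
set.seed(7)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

# Run k-means 25 times, each from its own random starting centres.
fits <- lapply(1:25, function(i) kmeans(x, centers = 3, nstart = 1))

# Total within-cluster sum of squares of each run; lower is better.
wss <- sapply(fits, function(f) f$tot.withinss)
range(wss)

# Keep the run with the smallest total within-cluster sum of squares.
best <- fits[[which.min(wss)]]
best$size
```

kmeans() can do the same bookkeeping itself through its nstart argument, for example kmeans(x, centers = 3, nstart = 25).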
KMeans in R
    clusters <- 5                 # number of clusters, k
    load("wine.Rdata")            # expected to provide the data frame `wine`
    # Cluster on columns 2 and 3 (Alcohol and Malic in the plot).
    wine.cl <- kmeans(wine[, 2:3], clusters)
    # Plot the points coloured by cluster, then mark the cluster centres.
    plot(wine[, 2:3], col = wine.cl$cluster)
    points(wine.cl$centers, pch = 19, cex = 1.5, col = 1:clusters)
    # Copy the current plot to a PDF file and close that device.
    dev.copy(device = pdf, file = "wine-clusters.pdf")
    dev.off()
Copyright © 2006, Graham J. Williams http://togaware.com
[Figure: scatter plot of the wine data, Malic against Alcohol, with points coloured by cluster.]
Summary
  Cluster analysis is unsupervised learning.
  Useful for partitioning a very large population, perhaps for data mining each sub-population separately.
  Often more effective under expert guidance.