
BIS 541

Data Mining
Chapter 4
Cluster Analysis

2020/2021 Spring

1
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

2
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster

 Dissimilar to the objects in other clusters

 Cluster analysis
 Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
 Clustering is unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into the data distribution
 As a preprocessing step for other algorithms

3
What Is Good Clustering?

 A good clustering method will produce high-quality clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation.
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns.
4
Basic Metrics

 Centroid, radius, diameter


 Typical alternatives to calculate the distance
between clusters
 Single link, complete link, average, centroid,
medoid
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
   C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}
 Radius: square root of the average distance from any point of the cluster to its centroid
   R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - C_m)^2}{N}}
 Diameter: square root of the average mean squared distance between all pairs of points in the cluster
   D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}

6
Typical Alternatives to Calculate the
Distance between Clusters
 Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min over all pairs of dis(t_ip, t_jq)
 Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max over all pairs of dis(t_ip, t_jq)
 Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg over all pairs of dis(t_ip, t_jq)
 Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
 Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j)
 Medoid: one chosen, centrally located object in the cluster
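A small Python sketch of these alternatives (assuming numpy; K1 and K2 are arrays with one object per row; the function names are illustrative):

    import numpy as np

    def pairwise(K1, K2):
        # Euclidean distance between every object of K1 and every object of K2
        return np.linalg.norm(K1[:, None, :] - K2[None, :, :], axis=2)

    def single_link(K1, K2):
        return pairwise(K1, K2).min()

    def complete_link(K1, K2):
        return pairwise(K1, K2).max()

    def average_link(K1, K2):
        return pairwise(K1, K2).mean()

    def centroid_distance(K1, K2):
        return np.linalg.norm(K1.mean(axis=0) - K2.mean(axis=0))

    def medoid(K):
        # the object with the smallest total distance to the other objects of its cluster
        return K[pairwise(K, K).sum(axis=1).argmin()]

    def medoid_distance(K1, K2):
        return np.linalg.norm(medoid(K1) - medoid(K2))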
7
Single link

[Figure: two clusters of points; the single-link distance is the smallest distance between a point in one cluster and a point in the other]
Complete link

[Figure: two clusters of points; the complete-link distance is the largest distance between a point in one cluster and a point in the other]
Average link

[Figure: two clusters of points; the average-link distance is the average of all pairwise distances between the clusters]
A medoid of a cluster

[Figure: a cluster of points with one centrally located object highlighted; this object, the medoid, is the cluster representative]
Requirements of Clustering in Data Mining

 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
12
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

13
Major Clustering Approaches (I)

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

16
General Applications of Clustering

 Any data object (entity) with a set of features (attributes) can be clustered
 i.e., partitioned into groups
 Students
 according to grades
 Employees of a company
 according to sales performance
 Branches of a retail company
 e.g., bank branches

17
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
 Land use: Identification of areas of similar land use in an
earth observation database
 Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
 City-planning: Identifying groups of houses according to
their house type, value, and geographical location
 Earthquake studies: observed earthquake epicenters should be clustered along continental faults
18
 WWW
 document classification

 cluster Weblog data to discover groups of similar access patterns


 Cities
 Countries
Clustering Cities

 Clustering Turkish cities


 Based on
 Political
 Demographic
 Economic
 characteristics
 Political: general elections 1999, 2002
 Demographic: population, urbanization rates
 Economic: GNP per capita, growth rate of GNP

20
Constraint-Based Clustering Analysis
 Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

21
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

22
Basic Measures for Clustering

 Clustering: Given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f: D → {1, …, k} where each ti is assigned to one cluster Kf, 1 ≤ f ≤ k, such that for all tfp, tfq ∈ Kf and ts ∉ Kf, dis(tfp, tfq) ≤ dis(tfp, ts)

23
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters, s.t., min sum of squared distance

 km1tmiKm (Cm  tmi ) 2


 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster

24
How to find the sum of squared distances
 SSE: sum of squared error (distance)
 Given a data set D = {t1, t2, …, tn} partitioned into K clusters
 Initially set SSE = 0
 For each cluster f from 1 to K
 Calculate and store the cluster center c_f
 For each data object t_fi in cluster f
 calculate the squared distance between the cluster center c_f and the data object t_fi
 add it to the SSE
 Report the SSE
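A direct Python translation of this procedure (a sketch assuming numpy; data is an n x d array, labels holds each object's cluster index 0…K-1, and centers holds the K cluster centers):

    import numpy as np

    def sse(data, labels, centers):
        # total sum of squared distances between objects and their cluster centers
        total = 0.0
        for f, c in enumerate(centers):
            members = data[labels == f]
            total += ((members - c) ** 2).sum()
        return total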
The K-Means Clustering Algorithm

 choose k, number of clusters to be determined


 Choose k objects randomly as the initial cluster centers
 repeat
 Assign each object to its closest cluster center
 Using Euclidean distance
 Compute new cluster centers
 Calculate mean points
 until
 No change in cluster centers or
 No object changes its cluster
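The loop above, as a minimal Python sketch (assuming numpy; data is an n x d array and k the chosen number of clusters; it also assumes no cluster ever becomes empty):

    import numpy as np

    def k_means(data, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # choose k objects randomly as the initial cluster centers
        centers = data[rng.choice(len(data), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each object to its closest center (Euclidean distance)
            dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # compute the new centers as the mean point of each cluster
            new_centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
            # stop when the cluster centers no longer change
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels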

26
The K-Means Clustering Method

 Example
[Figure: K = 2. Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; then reassign objects and update the means again, repeating until the assignments no longer change]
27
Example TBP Sec 3.3 page 84 Table 3.6
 Instance   X     Y
 1          1.0   1.5
 2          1.0   4.5
 3          2.0   1.5
 4          2.0   3.5
 5          3.0   2.5
 6          5.0   6.0
 k is chosen as 2 (k = 2)
 Choose two points at random as the initial cluster centers
 Objects 1 and 3 are chosen as the cluster centers

28
[Figure: scatter plot of the six instances; objects 1 and 3 (marked *) are the initial cluster centers]
29
Example cont.
 Euclidean distance between point i and j
 D(i, j) = ((X_i - X_j)^2 + (Y_i - Y_j)^2)^{1/2}
 Initial cluster centers
 C1:(1.0,1.5) C2:(2.0,1.5) assigned to
 D(C1 – 1) = 0.00 D(C2 –1) = 1.00 C1
 D(C1 – 2) = 3.00 D(C2 –2) = 3.16 C1
 D(C1 – 3) = 1.00 D(C2 –3) = 0.00 C2
 D(C1 – 4) = 2.24 D(C2 –4) = 2.00 C2
 D(C1 – 5) = 2.24 D(C2 –5) = 1.41 C2
 D(C1 – 6) = 6.02 D(C2 –6) = 5.41 C2
 C1: {1, 2}   C2: {3, 4, 5, 6}

30
[Figure: the six instances with the initial centers (objects 1 and 3) and the first assignment: C1 = {1, 2}, C2 = {3, 4, 5, 6}]
31
Example cont.
 Recomputing cluster centers
 For C1:
 X_C1 = (1.0 + 1.0)/2 = 1.0
 Y_C1 = (1.5 + 4.5)/2 = 3.0
 For C2:
 X_C2 = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0
 Y_C2 = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375
 Thus the new cluster centers are
 C1(1.0, 3.0) and C2(3.0, 3.375)
 As the cluster centers have changed
 The algorithm performs another iteration

32
[Figure: the six instances with the updated cluster centers C1(1.0, 3.0) and C2(3.0, 3.375) marked *]
33
Example cont.
 New cluster centers
 C1(1.0,3.0) and C2(3.0,3.375) assigned to
 D(C1 – 1) = 1.50 D(C2 –1) = 2.74 C1
 D(C1 – 2) = 1.50 D(C2 –2) = 2.29 C1
 D(C1 – 3) = 1.80 D(C2 –3) = 2.13 C1
 D(C1 – 4) = 1.12 D(C2 –4) = 1.01 C2
 D(C1 – 5) = 2.06 D(C2 –5) = 0.88 C2
 D(C1 – 6) = 5.00 D(C2 –6) = 3.30 C2
 C1: {1, 2, 3}   C2: {4, 5, 6}

34
Example cont.
 computing new cluster centers
 For C1:
 X_C1 = (1.0 + 1.0 + 2.0)/3 = 1.33
 Y_C1 = (1.5 + 4.5 + 1.5)/3 = 2.50
 For C2:
 X_C2 = (2.0 + 3.0 + 5.0)/3 = 3.33
 Y_C2 = (3.5 + 2.5 + 6.0)/3 = 4.00
 Thus the new cluster centers are
 C1(1.33, 2.50) and C2(3.33, 4.00)
 As the cluster centers have changed
 The algorithm performs another iteration

35
Exercise

 Perform the third iteration
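One way to check your hand calculation is to script the iterations directly (a sketch assuming numpy; it starts from objects 1 and 3 as the slides do and prints the assignments and centers after each of the first three iterations):

    import numpy as np

    # the six instances of Table 3.6
    data = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                     [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
    centers = data[[0, 2]]   # objects 1 and 3 as the initial centers

    for it in range(1, 4):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # 0 -> C1, 1 -> C2
        centers = np.array([data[labels == j].mean(axis=0) for j in range(2)])
        print("iteration", it, "clusters", labels + 1, "centers", centers.round(3))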

36
Comments
 Different initial cluster centers may end up with different final cluster configurations
 Finds a local optimum, but not necessarily the global optimum
 Based on the sum of squared error (SSE): the differences between objects and their cluster centers
 Choose a terminating criterion such as
 Maximum acceptable SSE

 Execute K-Means algorithm until satisfying the

condition

37
Comments on the K-Means Method

 Strength: Relatively efficient: O(tkn), where n is # objects, k is #


clusters, and t is # iterations. Normally, k, t << n.
 Doubling the data size doubles the computation time of each iteration
 Doubling k (# of clusters) doubles the computation time of each iteration
 Comparing:
 PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms

38
Weaknesses of K-Means Algorithm

 Applicable only when mean is defined, then what


about categorical data?
 Need to specify k, the number of clusters, in advance
 run the algorithm with different k values
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex
shapes
 Works best when clusters are of approximately equal size

39
Presence of Outliers
[Figure: a scatter of points forming two natural groups together with a few outlying points]

When k=2
When k=3
When k=2 the two natural clusters are not captured

40
Quality of clustering depends on
unit of measure
[Figure: two scatter plots of the same objects in the age–income plane; the apparent clusters differ depending on whether income is measured in YTL or in TL, while age is measured in years in both]
So what to do?
41
Variations of the K-Means Method

 A few variants of the k-means which differ in


 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method

42
Exercise

 Show by designing simple examples that:
 a) the K-means algorithm may converge to different local optima when started from different initial assignments of objects to clusters
 b) in the case of clusters of unequal size, the K-means algorithm may fail to find the obvious (natural) solution

43
A: boundary object
[Figure: two adjacent clusters of points with a boundary object A lying between them; since Dist(A, C1) < Dist(A, C2), A is assigned to Cluster 1]

44
How to choose K

 For reasonable values of k


 e.g. from 2 to 15

 Plot k versus SSE (sum of square error)


 Visually inspect the plot and
 as k increases SSE falls

 Choose the breaking point
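A sketch of this elbow inspection with scikit-learn and matplotlib (assuming both are available and data is an n x d numpy array; inertia_ is scikit-learn's name for the SSE):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(2, 16)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_ for k in ks]

    plt.plot(list(ks), sse, marker="o")   # look for the point where the curve flattens
    plt.xlabel("k")
    plt.ylabel("SSE")
    plt.show()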

45
Plot of SSE vs. K

[Figure: SSE plotted against k for k = 2, 4, 6, 8, 10, 12; choose k where the SSE vs. k curve becomes flat]

46
Validation of clustering

 partition the data into two equal groups
 apply clustering to one of these partitions
 compare its cluster centers with those of the overall data
 or
 apply clustering to each of the groups
 compare the cluster centers
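A rough sketch of the second variant (assuming numpy, scipy, and scikit-learn; data is an n x d array and k the number of clusters; the comparison simply matches each center of one half to its nearest center in the other half):

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    idx = rng.permutation(len(data))
    half1, half2 = data[idx[: len(data) // 2]], data[idx[len(data) // 2:]]

    c1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(half1).cluster_centers_
    c2 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(half2).cluster_centers_

    # small values suggest the two halves produce similar cluster centers
    print(cdist(c1, c2).min(axis=1))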

47
The K-Medoids Clustering Method

 Find representative objects, called medoids, in clusters


 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale
well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
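PAM's full build-and-swap search is not reproduced here; the following is a simplified k-medoids sketch that alternates assignment and medoid update (assuming numpy, a precomputed n x n distance matrix D, and that no cluster becomes empty):

    import numpy as np

    def k_medoids(D, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        medoids = rng.choice(len(D), size=k, replace=False)
        for _ in range(max_iter):
            # assign every object to its nearest medoid
            labels = D[:, medoids].argmin(axis=1)
            new_medoids = medoids.copy()
            for j in range(k):
                members = np.where(labels == j)[0]
                # the new medoid is the member with the smallest total distance to its cluster
                new_medoids[j] = members[D[np.ix_(members, members)].sum(axis=1).argmin()]
            if set(new_medoids) == set(medoids):
                break
            medoids = new_medoids
        return medoids, labels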

48
[Figure: the six example instances with two representative objects marked *]
49
CLARA (Clustering Large Applications) (1990)
 CLARA (Kaufmann and Rousseeuw in 1990)
 Built in statistical analysis packages, such as S+
 It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
 Strength: deals with larger data sets than PAM
 Weakness:
 Efficiency depends on the sample size
 A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the
sample is biased
50
CLARANS (“Randomized” CLARA) (1994)
 CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
 CLARANS draws sample of neighbors dynamically
 The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
 If the local optimum is found, CLARANS starts with new
randomly selected node in search for a new local optimum
 It is more efficient and scalable than both PAM and CLARA
 Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
51
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

52
Hierarchical Clustering

 Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) works bottom-up: step 1 merges a, b into ab and d, e into de; step 2 forms cde; the final step forms abcde. Divisive clustering (DIANA) performs the same decomposition top-down, from abcde back to the single objects]
53
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
[Figure: three scatter plots of the same small 2-D data set, showing clusters being progressively merged by AGNES]
54
A Dendrogram Shows How the
Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

55
Example

    A  B  C  D  E
 A  0
 B  1  0
 C  2  2  0
 D  2  4  1  0
 E  3  3  5  3  0

56
Single Link Distance Measure
[Figure: single-link merging on the example. The clustering structure evolves from {A}{B}{C}{D}{E} to {A,B}{C,D}{E}, then to {A,B,C,D}{E}, and finally to {A,B,C,D,E}; the resulting dendrogram has leaves ordered A B C D E]
57
Merging Calculations with single link
 At a distance 1
 {A} and {B} merge – D({A},{B}) = 1
 {C} and {D} merge – D({C},{D}) = 1
 Clustering structure: {A,B}, {C,D} {E}

 What are distances between these 3 clusters?


 D({A,B},{C,D}):
 D(A,C)=2, D(A,D)=2, D(B,C)=2, D(B,D)=4
 min(2,2,2,4) = 2 (single link)
 D({A,B},{E}):
 D(A,E)=3, D(B,E)=3
 min(3,3) = 3 (single link)
 D({C,D},{E}):
 D(C,E)=5, D(D,E)=3
 min(5,3) = 3 (single link)
 At a distance 2, {A,B} and {C,D} merge
Merging Calculations with single link

 2 clusters
 Clustering structure: {A,B,C,D} {E}

 What is the distance between these 2 clusters?


 D({A,B,C,D},{E}):
 D(A,E)=3, D(B,E)=3, D(C,E)=5, D(D,E)=3
 min(3,3,5,3) = 3 (single link)
 At a distance 3, {A,B,C,D} and {E} merge
 For a single cluster – top of dendrogram
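These single-link merges can be reproduced with SciPy's hierarchical clustering (a sketch assuming scipy and numpy are installed; the matrix below is the example dissimilarity matrix over A, B, C, D, E):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    D = np.array([[0, 1, 2, 2, 3],
                  [1, 0, 2, 4, 3],
                  [2, 2, 0, 1, 5],
                  [2, 4, 1, 0, 3],
                  [3, 3, 5, 3, 0]], dtype=float)

    Z = linkage(squareform(D), method="single")
    # each row of Z is one merge: [cluster i, cluster j, merge distance, new cluster size]
    print(Z[:, 2])   # single link should give the merge heights 1, 1, 2, 3 computed above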
Complete Link Distance Measure
[Figure: complete-link merging on the example. The clustering structure evolves from {A}{B}{C}{D}{E} to {A,B}{C,D}{E}, then to {A,B,E}{C,D}, and finally to {A,B,C,D,E}; the resulting dendrogram has leaves ordered C D A B E]

60
Merging Calculations with complete link
 At a distance 1
 {A} and {B} merge – D({A},{B}) = 1
 {C} and {D} merge – D({C},{D}) = 1
 Clustering structure: {A,B}, {C,D} {E}

 What are distances between these 3 clusters?


 D({A,B},{C,D}):
 D(A,C)=2, D(A,D)=2, D(B,C)=2, D(B,D)=4
 max(2,2,2,4) = 4 (complete link)
 D({A,B},{E}):
 D(A,E)=3, D(B,E)=3
 max(3,3) = 3 (complete link)
 D({C,D},{E}):
 D(C,E)=5, D(D,E)=3
 max(5,3) = 5 (complete link)
 At a distance 3, {A,B} and {E} merge
Merging Calculations with complete link

 2 clusters
 Clustering structure: {A,B,E} {C,D}

 What is the distance between these 2 clusters?


 D({A,B,E},{C,D}):
 D(A,C)=2, D(A,D)=2, D(B,C)=2, D(B,D)=4,
D(E,C)=5, D(E,D)=3
 max(2,2,2,4,5,3) = 5 (complete link)
 At a distance 5, {A,B,E} and {C,D} merge
 For a single cluster – top of dendrogram
Average Link distance measure
[Figure: average-link merging on the example. The clustering structure evolves from {A}{B}{C}{D}{E} to {A,B}{C,D}{E} (the average-link distance between {A,B} and {C,D} is 2.5), then to {A,B,C,D}{E} (the average-link distance between {A,B,C,D} and {E} is 3.5), and finally to {A,B,C,D,E}; the dendrogram leaves are ordered A B C D E]
63
Merging Calculations by average link
 At a distance 1
 {A} and {B} merge – D({A},{B}) = 1
 {C} and {D} merge – D({C},{D}) = 1
 Clustering structure: {A,B}, {C,D} {E}

 What are distances between these 3 clusters?


 D({A,B},{C,D}):
 D(A,C)=2, D(A,D)=2, D(B,C)=2, D(B,D)=4
 average(2,2,2,4) = 10/4 = 2.5 (average link)
 D({A,B},{E}):
 D(A,E)=3, D(B,E) = 3
 average(3,3) = 3 (average link)
 D({C,D},{E}):
 D(C,E)=5, D(D,E) = 3
 average(5,3) = 8/2 = 4 (average link)
 At a distance 2.5, {A,B} and {C,D} merge
Merging Calculations by average link

 2 clusters
 Clustering structure: {A,B,C,D} {E}

 What is the distance between these 2 clusters?


 D({A,B,C,D},{E}):
 D(A,E)=3, D(B,E)=3, D(C,E)=5, D(D,E)=3
 average(3,3,5,3) = 14/4 = 3.5 (average link)
 At a distance 3.5, {A,B,C,D} and {E} merge
 For a single cluster – top of dendrogram
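Changing the linkage method in the earlier SciPy sketch reproduces the complete-link and average-link results as well (same assumptions and same example matrix):

    import numpy as np
    from scipy.cluster.hierarchy import linkage
    from scipy.spatial.distance import squareform

    D = np.array([[0, 1, 2, 2, 3],
                  [1, 0, 2, 4, 3],
                  [2, 2, 0, 1, 5],
                  [2, 4, 1, 0, 3],
                  [3, 3, 5, 3, 0]], dtype=float)

    for method in ("single", "complete", "average"):
        Z = linkage(squareform(D), method=method)
        # expected merge heights: single 1, 1, 2, 3; complete 1, 1, 3, 5; average 1, 1, 2.5, 3.5
        print(method, Z[:, 2])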
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three scatter plots of the same small 2-D data set, showing one cluster being progressively split by DIANA]
66
More on Hierarchical Clustering Methods
 Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n^2), where n is the number of total objects
 can never undo what was done previously
 Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
 CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
 CHAMELEON (1999): hierarchical clustering using dynamic modeling
67
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

68
Probability Based Clustering

 A mixture is a set of k probability distributions, where each distribution represents a cluster
 The mixture model assigns each object a probability of being a member of cluster k
 rather than assigning the object to cluster k or not, as in k-means, k-medoids, …
 e.g., P(C1 | X)
69
Simplest Case
 K = 2: two clusters
 There is one real-valued attribute X
 Each cluster is normally distributed
 Five parameters to determine:
 mean and standard deviation for cluster A: μ_A, σ_A
 mean and standard deviation for cluster B: μ_B, σ_B
 sampling probability for cluster A: p_A
 since P(A) + P(B) = 1
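For this two-component, one-attribute case, the five parameters and the membership probabilities can be estimated with scikit-learn's Gaussian mixture model (a sketch assuming scikit-learn is available and values is a 1-D array of observations of X):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    x = np.asarray(values).reshape(-1, 1)          # one real-valued attribute
    gm = GaussianMixture(n_components=2, random_state=0).fit(x)

    print(gm.means_.ravel())                       # estimated means (mu_A, mu_B)
    print(np.sqrt(gm.covariances_).ravel())        # estimated standard deviations (sigma_A, sigma_B)
    print(gm.weights_)                             # estimated sampling probabilities p_A, p_B (sum to 1)
    print(gm.predict_proba(x)[:5])                 # P(cluster | x) for the first few objects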

70
 There are two populations each normally
distributed with mean and standard deviation
 An object comes from one of them but which one
is unknown
 Assign a probability that an object comes from A or from B
 This is equivalent to assigning a (soft) cluster membership to each object

71
In parametric statistics
Assume that objects come from a distribution, usually a Normal distribution with unknown parameters μ and σ
•Estimate the parameters from the data

•Test hypotheses based on the normality assumption

•Test the normality assumption

72
Two normal distributions
Bimodal
Arbitrary shapes can be captured by a mixture of normals

73
 Variable: age
 K = 3
 Clusters: C1, C2, C3
 X1: age = 25
 P(C1 | age = 25) = 0.4
 P(C2 | age = 25) = 0.35
 P(C3 | age = 25) = 0.25

 X2: age = 65
 P(C1 | age = 65) = 0.1
 P(C2 | age = 65) = 0.25
 P(C3 | age = 65) = 0.65

74
Chapter 4. Cluster Analysis

 What is Cluster Analysis


 A Categorization of Major Clustering Methods
 Applications of Clustering
 Partitioning Methods
 Hierarchical Methods
 Model-Based Clustering Methods
 Density-Based Methods

75
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such
as density-connected points
 Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98)

76
Density Concepts
 Core object (CO): an object with at least MinPts objects within its Eps-neighborhood
 Directly density-reachable (DDR): x is a CO and y is in x's Eps-neighborhood
 Density-reachable: there exists a chain of DDR objects from x to y
 Density-based cluster: a set of density-connected objects, maximal w.r.t. density-reachability

77
Density-Based Clustering: Background
 Two parameters:
 Eps: Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) <= Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
 1) p belongs to NEps(q)
 2) core point condition: |NEps(q)| >= MinPts
[Figure: a core point q with p in its Eps-neighborhood; MinPts = 5, Eps = 1 cm]
78
Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that p_{i+1} is directly density-reachable from p_i
[Figure: a chain q -> p1 -> p of directly density-reachable points]
 Density-connected:
 A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts
[Figure: p and q both density-reachable from a common point o]
79
Let MinPts = 3
[Figure: a set of points in which M and P each have at least 3 points within their neighborhoods, and Q lies in M's neighborhood]
 M and P are core objects, since each has a neighborhood containing at least 3 points
 Q is directly density-reachable from M
 M is directly density-reachable from P, and vice versa
 Q is (indirectly) density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P; but P is not density-reachable from Q, since Q is not a core object
80
DBSCAN: Density Based Spatial Clustering of
Applications with Noise

 Relies on a density-based notion of cluster: A cluster is


defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: a density-based cluster with core points in its interior, border points on its edge, and an outlier outside the cluster; Eps = 1 cm, MinPts = 5]

81
DBSCAN: The Algorithm

 Arbitrarily select a point p


 Retrieve all points density-reachable from p wrt Eps
and MinPts.
 If p is a core point, a cluster is formed.
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the
database.
 Continue the process until all of the points have been
processed.
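In practice this algorithm is available off the shelf; a minimal sketch with scikit-learn (assuming it is installed and data is an n x d numpy array; eps and min_samples play the roles of Eps and MinPts):

    import numpy as np
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=1.0, min_samples=5).fit(data)

    labels = db.labels_                     # cluster id per point; -1 marks noise (outliers)
    core = np.zeros(len(labels), dtype=bool)
    core[db.core_sample_indices_] = True    # True for the core points
    print("clusters:", set(labels) - {-1}, "noise points:", int((labels == -1).sum()))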
82
