BIS 541 Ch04 20-21 S
Data Mining
Chapter 4
Cluster Analysis
2020/2021 Spring
Chapter 4. Cluster Analysis
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters
As a preprocessing step for other algorithms
What Is Good Clustering?
Dm = sqrt( Σi=1..N Σj=1..N (tip − tjq)² / (N·(N−1)) )
(Dm: the diameter of a cluster — the average pairwise distance among its N members)
Typical Alternatives to Calculate the
Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min over tip ∈ Ki, tjq ∈ Kj of dist(tip, tjq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dist(tip, tjq)
Average link: average distance over all pairs of elements, one from each cluster, i.e., dis(Ki, Kj) = avg dist(tip, tjq)
Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
Medoid: one chosen, centrally located object in the cluster
Single link
[figure: two point clusters; the link is the closest pair of points between them]
Complete link
[figure: two point clusters; the link is the farthest pair of points between them]
Average link
[figure: two point clusters; the link is the average of all pairwise distances between them]
A medoid of a cluster
[figure: a cluster of points with its centrally located medoid marked]
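As a concrete illustration (not from the slides), a minimal Python sketch of these inter-cluster distance measures, applied to two small made-up clusters K1 and K2:

```python
import math

def dist(p, q):
    # Euclidean distance between two points
    return math.dist(p, q)

def single_link(K1, K2):
    # smallest distance between an element of K1 and an element of K2
    return min(dist(p, q) for p in K1 for q in K2)

def complete_link(K1, K2):
    # largest distance between an element of K1 and an element of K2
    return max(dist(p, q) for p in K1 for q in K2)

def average_link(K1, K2):
    # average of all pairwise distances between the two clusters
    return sum(dist(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

def medoid(K):
    # the object of K whose total distance to the other objects is smallest
    return min(K, key=lambda m: sum(dist(m, p) for p in K))

K1 = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
K2 = [(4.0, 0.0), (5.0, 0.0), (5.0, 1.0)]
print(single_link(K1, K2))    # 3.0: closest pair is (1, 0) and (4, 0)
print(complete_link(K1, K2))
print(medoid(K1))
```

By construction, the average-link distance always lies between the single-link and complete-link distances.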
Requirements of Clustering in Data Mining
Scalability
Ability to deal with different types of attributes
Ability to handle dynamic data
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to
determine input parameters
Ability to deal with noise and outliers
Insensitive to order of input records
High dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
Grid-based approach:
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based:
A model is hypothesized for each of the clusters, and the idea is to find the best
fit of the data to the given model
Typical methods: EM, SOM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or application-specific constraints
Typical methods: COD (obstacles), constrained clustering
General Applications of Clustering
Students, clustered according to grades
Employees of a company, clustered according to sales performance
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to
their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters
should be clustered along continent faults
WWW: document classification
Clustering regions by their characteristics:
Political: general elections (1999, 2002)
Demographic: population, urbanization rates
Economic: GNP per head, growth rate of GNP
Constraint-Based Clustering Analysis
Clustering analysis: fewer parameters but more user-desired
constraints, e.g., an ATM allocation problem
Basic Measures for Clustering
Partitioning Algorithms: Basic Concept
Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters such that the sum of squared distances is minimized
How to Find the Sum of Squared Error (SSE)
SSE: sum of squared error (squared distance)
Given a data set D = {t1, t2, …, tn} partitioned into K clusters:
set SSE = 0 (initially the total sum of squared errors is 0)
for each cluster f from 1 to K:
  calculate and store the cluster center cf
  for each data object ti in cluster f:
    add dist(ti, cf)² to SSE
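The SSE loop above can be written directly in Python. This minimal sketch (not from the slides) uses the six points of the later worked example, grouped as in its first iteration:

```python
def sse(clusters):
    """Total sum of squared error: for each cluster, the sum of squared
    distances from its members to the cluster mean (center)."""
    total = 0.0
    for members in clusters:
        n, dim = len(members), len(members[0])
        # cluster center = coordinate-wise mean of the members
        center = tuple(sum(p[d] for p in members) / n for d in range(dim))
        total += sum(sum((p[d] - center[d]) ** 2 for d in range(dim))
                     for p in members)
    return total

clusters = [
    [(1.0, 1.5), (1.0, 4.5)],                          # cluster 1
    [(2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)],  # cluster 2
]
print(sse(clusters))   # 21.6875
```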
The K-Means Clustering Method
Example
[figure: three scatter plots illustrating one round of K-means with K = 2 — arbitrarily choose K objects as the initial cluster means; assign each object to the cluster with the most similar center; update the cluster means; then reassign and repeat until no object changes cluster]
Example TBP Sec 3.3 page 84 Table 3.6
Instance X Y
1 1.0 1.5
2 1.0 4.5
3 2.0 1.5
4 2.0 3.5
5 3.0 2.5
6 5.0 6.0
k is chosen as 2 (k = 2)
Choose two points at random as the initial
cluster centers
Objects 1 and 3 are chosen as cluster centers
[figure: the six instances plotted; objects 1 and 3 (starred) are the initial cluster centers]
Example cont.
Euclidean distance between points i and j
D(i, j) = ((Xi − Xj)² + (Yi − Yj)²)^(1/2)
Initial cluster centers
C1:(1.0,1.5) C2:(2.0,1.5) assigned to
D(C1 – 1) = 0.00 D(C2 –1) = 1.00 C1
D(C1 – 2) = 3.00 D(C2 –2) = 3.16 C1
D(C1 – 3) = 1.00 D(C2 –3) = 0.00 C2
D(C1 – 4) = 2.24 D(C2 –4) = 2.00 C2
D(C1 – 5) = 2.24 D(C2 –5) = 1.41 C2
D(C1 – 6) = 6.02 D(C2 –6) = 5.41 C2
C1: {1, 2}   C2: {3, 4, 5, 6}
[figure: the six instances with the first-round clusters C1 = {1, 2} and C2 = {3, 4, 5, 6}]
Example cont.
Recomputing cluster centers
For C1:
x-coordinate of C1 = (1.0 + 1.0)/2 = 1.0
y-coordinate of C1 = (1.5 + 4.5)/2 = 3.0
[figure: the six instances with the recomputed cluster centers marked]
Example cont.
New cluster centers
C1(1.0,3.0) and C2(3.0,3.375) assigned to
D(C1 – 1) = 1.50 D(C2 –1) = 2.74 C1
D(C1 – 2) = 1.50 D(C2 –2) = 2.29 C1
D(C1 – 3) = 1.80 D(C2 –3) = 2.13 C1
D(C1 – 4) = 1.12 D(C2 –4) = 1.01 C2
D(C1 – 5) = 2.06 D(C2 –5) = 0.88 C2
D(C1 – 6) = 5.00 D(C2 –6) = 3.30 C2
C1: {1, 2, 3}   C2: {4, 5, 6}
Example cont.
Computing new cluster centers
For C1:
x-coordinate of C1 = (1.0 + 1.0 + 2.0)/3 = 1.33
y-coordinate of C1 = (1.5 + 4.5 + 1.5)/3 = 2.50
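The two assign/update passes of the worked example can be reproduced in a short Python sketch (not from the textbook; it uses (5.0, 6.0) for instance 6, the value consistent with the computed distances):

```python
import math

points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
          4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}

def assign(centers):
    # assign each point to the nearest center (Euclidean distance)
    clusters = {c: [] for c in centers}
    for i, p in points.items():
        best = min(centers, key=lambda c: math.dist(p, centers[c]))
        clusters[best].append(i)
    return clusters

def update(clusters):
    # recompute each center as the mean of its members
    return {c: (sum(points[i][0] for i in ids) / len(ids),
                sum(points[i][1] for i in ids) / len(ids))
            for c, ids in clusters.items()}

centers = {"C1": points[1], "C2": points[3]}   # objects 1 and 3
step1 = assign(centers)     # {'C1': [1, 2], 'C2': [3, 4, 5, 6]}
centers = update(step1)     # C1 = (1.0, 3.0), C2 = (3.0, 3.375)
step2 = assign(centers)     # {'C1': [1, 2, 3], 'C2': [4, 5, 6]}
```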
Exercise
Comments
Different initial cluster centers may end up with
different final cluster configurations
Finds a local optimum but not necessarily the
global optimum
The stopping condition is based on the sum of squared error (SSE)
Comments on the K-Means Method
Weaknesses of K-Means Algorithm
Presence of Outliers
[figure: two dense natural clusters plus a small group of outlier points, clustered with k = 2 and with k = 3]
When k = 2, the two natural clusters are not captured
Quality of clustering depends on the
unit of measure
[figure: income vs. age scatter plots — the clusters found change when income is measured in YTL rather than TL, while age stays in years]
So what to do?
Variations of the K-Means Method
Exercise
A: boundary object
Dist(A, C1) < Dist(A, C2), so A is assigned to cluster 1
[figure: two adjacent clusters of points with the boundary object A between them]
How to choose K
Plot of SSE vs. K
[figure: SSE plotted against k = 2, 4, 6, 8, 10, 12]
Choose k at the point where the SSE vs. k curve becomes flat
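To make this "elbow" heuristic concrete, here is a small self-contained Python sketch (not from the slides) that runs a basic 1-D K-means on made-up data and records SSE for several k. The evenly spaced initialization is an illustrative choice, not part of the standard algorithm. SSE drops sharply up to the natural k = 2 and flattens afterwards:

```python
def kmeans_sse(xs, k, iters=10):
    """Basic 1-D K-means with evenly spaced initial centers; returns the SSE."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            # assign x to the nearest center
            clusters[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        # update centers; an empty cluster keeps its previous center
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

# two well-separated 1-D groups: the natural number of clusters is 2
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 10.0, 10.5, 11.0, 11.5, 12.0]
sse_by_k = {k: kmeans_sse(xs, k) for k in (1, 2, 3, 4)}
# the SSE vs. k curve drops sharply from k=1 to k=2, then flattens
```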
Validation of clustering
or
apply clustering to each of these groups
The K-Medoids Clustering Method
[figure: the six instances from the K-means example, with objects 1 and 3 starred as the initial medoids]
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw in 1990)
Built into statistical analysis packages, such as S+
It draws multiple samples of the data set, applies PAM on
each sample, and gives the best clustering as the output
Strength: deals with larger data sets than PAM
Weakness:
Efficiency depends on the sample size
A good clustering based on samples will not necessarily
represent a good clustering of the whole data set if the
sample is biased
CLARANS (“Randomized” CLARA) (1994)
CLARANS (A Clustering Algorithm based on Randomized
Search) (Ng and Han’94)
CLARANS draws sample of neighbors dynamically
The clustering process can be presented as searching a
graph where every node is a potential solution, that is, a
set of k medoids
If a local optimum is found, CLARANS starts with a new
randomly selected node in search of a new local optimum
It is more efficient and scalable than both PAM and CLARA
Focusing techniques and spatial access structures may
further improve its performance (Ester et al.’95)
Hierarchical Clustering
[figure: three scatter plots showing successive merge steps of hierarchical clustering on the same points]
A Dendrogram Shows How the
Clusters are Merged Hierarchically
Example
A B C D E
A 0
B 1 0
C 2 2 0
D 2 4 1 0
E 3 3 5 3 0
Single Link Distance Measure
[figure: the five objects A–E drawn as weighted graphs at each merge step, with the resulting dendrogram over leaves A B C D E]
Merge sequence: {A}{B}{C}{D}{E} → {A,B}{C,D}{E} → {A,B,C,D}{E} → {A,B,C,D,E}
Merging Calculations with single link
At a distance of 1:
{A} and {B} merge: D({A},{B}) = 1
{C} and {D} merge: D({C},{D}) = 1
Clustering structure: {A,B}, {C,D}, {E}
At a distance of 2 (2 clusters):
{A,B} and {C,D} merge: D({A,B},{C,D}) = min(2, 2, 2, 4) = 2
Clustering structure: {A,B,C,D}, {E}
Dendrogram leaf order: C D A B E
Merging Calculations with complete link
At a distance of 1:
{A} and {B} merge: D({A},{B}) = 1
{C} and {D} merge: D({C},{D}) = 1
Clustering structure: {A,B}, {C,D}, {E}
At a distance of 3 (2 clusters):
{A,B} and {E} merge: D({A,B},{E}) = max(3, 3) = 3
Clustering structure: {A,B,E}, {C,D}
(compare: with single link the 2-cluster structure was {A,B,C,D}, {E})
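Both merging calculations can be checked with a small agglomerative-clustering sketch in Python (not from the slides), driven by the example's distance matrix; passing min as the linkage gives single link, max gives complete link:

```python
import itertools

# pairwise distances between the five objects, from the example matrix
d = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
     ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
     ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

def dist(x, y):
    # the matrix is symmetric; look the pair up in either order
    return d[(x, y)] if (x, y) in d else d[(y, x)]

def agglomerate(link, stop_at):
    """Merge clusters bottom-up until stop_at clusters remain.
    link = min gives single link, link = max gives complete link."""
    clusters = [frozenset(x) for x in "ABCDE"]
    while len(clusters) > stop_at:
        # find the pair of clusters with the smallest linkage distance
        p, q = min(itertools.combinations(clusters, 2),
                   key=lambda pq: link(dist(x, y) for x in pq[0] for y in pq[1]))
        clusters = [c for c in clusters if c not in (p, q)] + [p | q]
    return {tuple(sorted(c)) for c in clusters}

print(agglomerate(min, 2))   # single link:   {A,B,C,D} and {E}
print(agglomerate(max, 2))   # complete link: {A,B,E} and {C,D}
```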
More on Hierarchical Clustering Methods
Major weakness of agglomerative clustering methods:
do not scale well: time complexity of at least O(n²), where n is the number of objects
CHAMELEON: hierarchical clustering using dynamic modeling
Probability Based Clustering
Simplest Case
K = 2: two clusters
There is one real-valued attribute X
Each cluster is Normally distributed
Five parameters to determine:
mean μA and standard deviation σA for cluster A
mean μB and standard deviation σB for cluster B
sampling probability pA for cluster A
(pB is not needed, as P(A) + P(B) = 1)
There are two populations, each normally
distributed with its own mean and standard deviation
An object comes from one of them, but which one
is unknown
Assign a probability that an object comes from
A or B
That is equivalent to assigning a cluster to each object
In parametric statistics
Assume that objects come from a distribution, usually a
Normal distribution with unknown parameters μ and σ
Estimate the parameters from the data
Two normal distributions: bimodal
Arbitrary shapes can be captured by a mixture of normals
Variable: age, K = 3
Clusters: C1, C2, C3
x1: age = 25
P(C1 | age = 25) = 0.40
P(C2 | age = 25) = 0.35
P(C3 | age = 25) = 0.25
x2: age = 65
P(C1 | age = 65) = 0.10
P(C2 | age = 65) = 0.25
P(C3 | age = 65) = 0.65
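Posteriors like these come from Bayes' rule once each cluster's parameters are known. The slide does not give the underlying parameters, so this Python sketch uses hypothetical Gaussian means, standard deviations, and priors purely for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    # Normal density N(mu, sigma) evaluated at x
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# hypothetical cluster parameters (mean, std dev, prior) -- NOT the ones
# behind the slide's numbers, which are not given
params = {"C1": (30.0, 10.0, 1/3), "C2": (45.0, 12.0, 1/3), "C3": (65.0, 10.0, 1/3)}

def posteriors(age):
    # Bayes' rule: P(Ck | age) is proportional to P(age | Ck) * P(Ck)
    w = {c: gauss_pdf(age, mu, s) * p for c, (mu, s, p) in params.items()}
    z = sum(w.values())
    return {c: v / z for c, v in w.items()}

p25 = posteriors(25)   # the youngest cluster C1 gets the highest probability
p65 = posteriors(65)   # the oldest cluster C3 gets the highest probability
```

Note that for each object the posteriors sum to 1, so assigning probabilities is a soft version of assigning a cluster.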
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
76
Density Concepts
Core object (CO): an object with at least ‘M’ objects within its radius-‘E’ neighborhood
Directly density reachable (DDR): x is a CO and y is in x’s E-neighborhood
Density reachable: there exists a chain of DDR objects from x to y
Density-based cluster: a set of density-connected objects that is maximal w.r.t. density-reachability
Density-Based Clustering: Background
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q wrt. Eps, MinPts if
1) p belongs to NEps(q)
2) core point condition: |NEps(q)| >= MinPts
[figure: p inside the Eps-neighborhood of q, with MinPts = 5 and Eps = 1 cm]
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from a point q wrt. Eps, MinPts if
there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is
directly density-reachable from pi
[figure: a chain of points from q through p1 to p]
Density-connected:
A point p is density-connected to a point q wrt. Eps, MinPts if there is
a point o such that both p and q are density-reachable from o wrt.
Eps and MinPts
[figure: p and q both density-reachable from o]
Let MinPts = 3
[figure: a set of points in which P and M are core objects and Q is density-reachable from P]
[figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
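As a rough illustration of the algorithm (a minimal sketch, not the original DBSCAN pseudocode), the following Python function grows clusters from core points and labels everything unreachable as noise:

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Return {point index: cluster id}; -1 marks noise (outliers)."""
    def neighbors(i):
        # all points within radius eps of point i (including i itself)
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]
    labels, cid = {}, 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point; may later become a border point
            labels[i] = -1
            continue
        labels[i] = cid                # start a new cluster from core point i
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels.get(j) == -1:    # previously noise: becomes a border point
                labels[j] = cid
            if j in labels:
                continue
            labels[j] = cid
            nj = neighbors(j)
            if len(nj) >= min_pts:     # j is itself a core point: expand through it
                queue.extend(nj)
        cid += 1
    return labels

# two dense groups and one isolated point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
       (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
       (5, 5)]
labels = dbscan(pts, eps=1.5, min_pts=3)   # (5, 5) is labeled -1 (noise)
```

The brute-force neighborhood query makes this O(n²); spatial index structures reduce the cost in practice.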