Professional Documents
Culture Documents
What Is Cluster Analysis?
What Is Cluster Analysis?
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Standardize data
Calculate the mean absolute deviation:
sf 1
n (| x1 f m f | | x2 f m f | ... | xnf m f |)
where mf 1
n (x1 f x2 f ... xnf )
.
If q = 2, d is Euclidean distance:
d (i, j) (| x x |2 | x x | 2 ... | x x |2 )
i1 j1 i2 j2 ip jp
Properties
d(i,j) 0
d(i,i) = 0
d(i,j) = d(j,i)
Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
0 1 1
d (Mary, jack ) 0.33
2 1 0 3
Jack
1 0 sum
1 2 1 3
Mary
0 0 3 3
sum 2 4 6
01
d ( jack , mary ) 0.33
2 01
11
d ( jack , jim ) 0.67
111
1 2
d ( jim , mary ) 0.75
11 2
Partitioning approach:
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects) using
some criterion
Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON
Density-based approach:
Based on connectivity and density functions
Typical methods: DBSACN, OPTICS, DenClue
N N (t t ) 2
Dm i 1 i 1 ip iq
N ( N 1)
based)
7/10/22 Data Mining: Concepts and Techniques 26
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(p): {q belongs to D | dist(p,q) <= Eps}
Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if
p belongs to NEps(q)
core point condition: p MinPts = 5
q Eps = 1 cm
|NEps (q)| >= MinPts
Outlier
Border
Eps = 1cm
Core MinPts = 5
Partition the data space and find the number of points that
lie inside each cell of the partition.
Identify the subspaces that contain clusters using the
Apriori principle
Identify clusters
Determine dense units in all subspaces of interests
Determine connected dense units in all subspaces of
interests.
Generate minimal description for the clusters
Determine maximal regions that cover a cluster of
(week)
Salary
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6 7
age age
20 30 40 50 60 20 30 40 50 60
=3
Vacation
r y 30 50
l a
Sa age
Strength
automatically finds subspaces of the highest
Customer segmentation
Medical analysis
Drawbacks
most tests are for single attribute
data distribution
Distance-based outlier: A DB(p, D)-outlier is an object O
in a dataset T such that at least a fraction p of the
objects in T lies at a distance greater than D from O
Algorithms for mining distance-based outliers
Index-based algorithm
Nested-loop algorithm
Cell-based algorithm