Data Mining Techniques & Applications: Cluster Detection Methods
attributes

    low     tall    normal  1   nonsmoker   P
    normal  medium  high    3   smoker      P
    low     short   high    2   smoker      P
    heavy   tall    high    2   nonsmoker   P

Dissimilarity between two objects i and j described by p nominal
attributes: let m represent the number of attributes where the
values of the two objects match, then

    d(i, j) = (p − m) / p

e.g. d(row1, row2) = (6 − 4) / 6 = 1/3
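The nominal-attribute measure can be sketched in plain Python; `nominal_dissimilarity` is a hypothetical helper name, and the two rows are taken from the example table:

```python
def nominal_dissimilarity(a, b):
    """d(i, j) = (p - m) / p, where p is the number of nominal
    attributes and m is the number of matching values."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)
    return (p - m) / p

row1 = ("low", "tall", "normal", "1", "nonsmoker", "P")
row4 = ("heavy", "tall", "high", "2", "nonsmoker", "P")
print(nominal_dissimilarity(row1, row4))  # 3 of 6 values match -> 0.5
```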
    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
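The three figures above follow the asymmetric binary pattern d = (r + s) / (q + r + s), where q counts the attributes that are 1 for both objects and r, s count those that are 1 for only one of them (0–0 matches are ignored). A minimal Python sketch; the 0/1 vectors for jack and mary are hypothetical, chosen only to reproduce the first count:

```python
def asym_binary_dissimilarity(a, b):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s):
    q = attributes that are 1 in both objects,
    r = 1 in a only, s = 1 in b only; 0-0 matches are ignored."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Hypothetical vectors giving q = 2, r = 0, s = 1, i.e. (0 + 1) / (2 + 0 + 1)
jack = (1, 1, 0, 0)
mary = (1, 1, 1, 0)
print(round(asym_binary_dissimilarity(jack, mary), 2))  # 0.33
```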
Euclidean distance (q = 2):

    d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)
Supremum distance (q = ∞):

    d(i, j) = max_t |x_it − x_jt|
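Both special cases of the Minkowski distance can be sketched in plain Python; the two example vectors are hypothetical:

```python
import math

def euclidean(x, y):
    """Minkowski distance with q = 2."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def supremum(x, y):
    """Minkowski distance as q -> infinity: the largest attribute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (3, 2, 0, 5), (1, 0, 0, 0)
print(euclidean(x, y))  # sqrt(4 + 4 + 0 + 25) = sqrt(33) ~ 5.74
print(supremum(x, y))   # 5
```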
Cosine similarity:

    cos(i, j) = (i • j) / (||i|| ||j||)

where i • j = Σ(k=1..n) i_k j_k and ||i|| = sqrt(Σ(k=1..n) i_k²).
◼ e.g. Given two vectors: x = (3, 2, 0, 5) and y = (1, 0, 0, 0),
    x • y = 3*1 + 2*0 + 0*0 + 5*0 = 3
    ||x|| = sqrt(3² + 2² + 0² + 5²) ≈ 6.16
    ||y|| = sqrt(1² + 0² + 0² + 0²) = 1
◼ Similarity: cos(x, y) = 3 / (6.16 * 1) ≈ 0.49
  Dissimilarity: 1 − cos(x, y) ≈ 0.51
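The worked example can be checked in plain Python:

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

x, y = (3, 2, 0, 5), (1, 0, 0, 0)
sim = cosine_similarity(x, y)
print(round(sim, 2))      # 0.49
print(round(1 - sim, 2))  # 0.51
```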
Compute the overall dissimilarity across the n attributes of mixed types:

    d(i, j) = Σ(k=1..n) δ_ij^(k) d_ij^(k) / Σ(k=1..n) δ_ij^(k)

where d_ij^(k) is the dissimilarity on attribute k, and δ_ij^(k) is 1 when
attribute k is usable for the pair (i, j) and 0 otherwise.
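A sketch of the weighted combination in Python, assuming each attribute k contributes a per-attribute dissimilarity d_k and an indicator delta_k that is 0 when the attribute is missing or not applicable for the pair; the numbers are hypothetical:

```python
def overall_dissimilarity(d, delta):
    """d(i, j) = sum(delta_k * d_k) / sum(delta_k) over the n attributes."""
    num = sum(dl * dk for dl, dk in zip(delta, d))
    den = sum(delta)
    return num / den

# Hypothetical per-attribute dissimilarities for one pair of objects,
# with the third attribute not applicable (delta = 0):
d     = [0.5, 1.0, 0.0, 0.25]
delta = [1,   1,   0,   1]
print(overall_dissimilarity(d, delta))  # (0.5 + 1.0 + 0.25) / 3 ~ 0.583
```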
Sarajevo School of Science and Technology 26
Basic Clustering Methods
The Agglomeration Method
Dendrogram
Strengths/Weaknesses
◼ Strengths: no need to specify k in advance; produces multiple levels of clustering in one run.
◼ Weaknesses: not very efficient; does not scale well to large data sets.
◼ Cannot undo a merge made in a previous iteration (cf. the K-means method, which can reassign points at every iteration).
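The bottom-up agglomeration can be sketched in plain Python. This sketch assumes single linkage (closest pair of points across clusters) on 1-D points, which is only one of several possible linkage rules; the recorded merge sequence is what a dendrogram draws:

```python
def single_link(c1, c2):
    """Single-linkage distance: the closest pair of points across two clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points):
    """Start with one cluster per point; repeatedly merge the closest pair,
    recording each merge (the levels of the dendrogram)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append((clusters[i], clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# The two tight pairs merge first, then join each other, then 25 is absorbed:
for left, right in agglomerate([1, 2, 9, 10, 25]):
    print(left, "+", right)
```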
The Hopkins statistic:

    H(P, S) = Σ(p∈S) dist(p, t_p) / [ Σ(m∈P) dist(m, t_m) + Σ(p∈S) dist(p, t_p) ]
where:
P: a set of n randomly generated data points
S: a sample of n data points from the data set
t_p: the nearest neighbor of point p in S
t_m: the nearest neighbor of point m in P
e.g. After a number of trials, average(H_A) ≈ 0.56 and
average(H_B) ≈ 0.89: no clusters are emerging in figure A,
but there are in figure B.
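Taking the formula and its definitions literally, a one-dimensional sketch in plain Python; the uniform sampling window over the data's range and the single run (rather than an average over trials) are simplifying assumptions:

```python
import random

def nn_dist_within(points, i):
    """Distance from points[i] to its nearest neighbor within points."""
    return min(abs(points[i] - points[j]) for j in range(len(points)) if j != i)

def hopkins(data, n, seed=0):
    """H(P, S) with P a set of n uniformly random points and S a sample of
    n data points; t_m is m's nearest neighbor in P, t_p is p's in S."""
    rng = random.Random(seed)
    P = [rng.uniform(min(data), max(data)) for _ in range(n)]  # random points
    S = rng.sample(data, n)                                    # data sample
    sum_S = sum(nn_dist_within(S, i) for i in range(n))
    sum_P = sum(nn_dist_within(P, i) for i in range(n))
    return sum_S / (sum_P + sum_S)

data = [1, 2, 3, 4, 5, 90, 91, 92, 93, 94]
print(hopkins(data, 5))  # a value in (0, 1); averaged over trials in practice
```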
Clustering Results
◼ DBScan
◼ Chameleon
◼ CURE