Supriyahw3 12040810
For 1-dimensional points a_1, ..., a_n, the best single centre x minimizes the sum of squared distances:

s = Σ_{i=1}^{n} (x − a_i)^2

∂s/∂x = Σ_{i=1}^{n} 2(x − a_i) = 0

nx = Σ_{i=1}^{n} a_i

x = (Σ_{i=1}^{n} a_i) / n

so the optimal centre is the mean of the points.
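The derivation can also be checked numerically; this is a small sketch using arbitrary example points (not data from the assignment):

```python
# Check that the mean minimizes s(x) = sum_i (x - a_i)^2.
# The points below are arbitrary example values.
def s(x, points):
    return sum((x - a) ** 2 for a in points)

points = [2.0, 4.0, 10.0, 12.0, 3.0]
mean = sum(points) / len(points)  # x = (sum of a_i) / n

# s at the mean is no larger than s at nearby candidate values.
for candidate in [mean - 1, mean - 0.1, mean + 0.1, mean + 1]:
    assert s(mean, points) <= s(candidate, points)
```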
The k-means cost function decreases monotonically under the k-means clustering algorithm:
let z_1^(t), z_2^(t), ..., z_k^(t) be the centres and c_1^(t), c_2^(t), ..., c_k^(t) the clusters at the t-th iteration.
step 1: each data point is assigned to its closest centre, so
cost(c_1^(t+1), ..., c_k^(t+1); z_1^(t), ..., z_k^(t)) ≤ cost(c_1^(t), ..., c_k^(t); z_1^(t), ..., z_k^(t))
step 2: each centre is moved to the mean of its cluster, which is the optimal centre for that cluster, so
cost(c_1^(t+1), ..., c_k^(t+1); z_1^(t+1), ..., z_k^(t+1)) ≤ cost(c_1^(t+1), ..., c_k^(t+1); z_1^(t), ..., z_k^(t))
so we can say that the k-means cost function decreases monotonically under the k-means clustering algorithm.
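The two inequalities can be exercised directly; a minimal sketch of the Lloyd iteration in one dimension, with arbitrary example data, asserting the cost never increases at either step:

```python
# Lloyd's algorithm sketch in 1-D: the cost after each assignment step
# (step 1) and each mean-update step (step 2) never exceeds the previous cost.
def cost(centers, clusters):
    return sum((x - c) ** 2 for c, cl in zip(centers, clusters) for x in cl)

def assign(points, centers):
    # step 1: each data point goes to its closest centre
    clusters = [[] for _ in centers]
    for x in points:
        i = min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
        clusters[i].append(x)
    return clusters

points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centers = [2.0, 4.0]
prev = float("inf")
for _ in range(10):
    clusters = assign(points, centers)
    assert cost(centers, clusters) <= prev  # step 1 did not raise the cost
    # step 2: move each centre to its cluster mean (keep old centre if empty)
    new_centers = [sum(cl) / len(cl) if cl else c
                   for cl, c in zip(clusters, centers)]
    assert cost(new_centers, clusters) <= cost(centers, clusters)
    centers = new_centers
    prev = cost(centers, clusters)
```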
2)
1-dimensional clustering
Minimizing the sum of distances (not squared) to the cluster centre is known as the k-median criterion.
We have to show that the mean-based update is not optimal for this criterion in general.
Let's take a 1-dimensional clustering example:
[2, 4, 10, 12, 3, 20, 30, 11, 25]
step 1:
m1 = 2
m2 = 4
c1 = 2, 3
c2 = 4, 10, 12, 20, 30, 11, 25
step 2:
m1 = 2.5, m2 = 16
dataset   d(D, m1)   d(D, m2)
2         0.5        14
4         1.5        12
10        7.5        6
12        9.5        4
3         0.5        13
20        17.5       4
30        27.5       14
11        8.5        5
25        22.5       9

c1 = 2, 4, 3
c2 = 10, 12, 20, 30, 11, 25
step 3:
m1 = 3, m2 = 18

dataset   d(D, m1)   d(D, m2)
2         1          16
4         1          14
10        7          8
12        9          6
3         0          15
20        17         2
30        27         12
11        8          7
25        22         7

c1 = 2, 4, 10, 3
c2 = 12, 20, 30, 11, 25
step 4:
m1 = 4.75
m2 = 19.6

dataset   d(D, m1)   d(D, m2)
2         2.75       17.6
4         0.75       15.6
10        5.25       9.6
12        7.25       7.6
3         1.75       16.6
20        15.25      0.4
30        25.25      10.4
11        6.25       8.6
25        20.25      5.4

here the clusters are
c1 = 2, 4, 10, 12, 3, 11
c2 = 20, 30, 25
step 5:
m1 = 7
m2 = 25

dataset   d(D, m1)   d(D, m2)
2         5          23
4         3          21
10        3          15
12        5          13
3         4          22
20        13         5
30        23         5
11        4          14
25        18         0

the clusters do not change, so the algorithm has converged with centres m1 = 7 and m2 = 25.
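The steps above can be sketched as a short loop: points are assigned by absolute distance and the centres are updated to the cluster means, exactly as in the tables:

```python
# Reproduce the iteration above: assign each point to the nearer centre by
# absolute distance, then move each centre to the mean of its cluster.
points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 2.0, 4.0  # initial centres from step 1

for _ in range(5):
    c1 = [x for x in points if abs(x - m1) <= abs(x - m2)]
    c2 = [x for x in points if abs(x - m1) > abs(x - m2)]
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)

print(sorted(c1), m1)  # [2, 3, 4, 10, 11, 12] 7.0
print(sorted(c2), m2)  # [20, 25, 30] 25.0
```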
k = 2
suppose the medoids are (3,4) and (7,4)
1) (7,6)-(7,4) = 2, (7,6)-(3,4) = 6
2) (2,6)-(7,4) = 7, (2,6)-(3,4) = 3
3) (3,8)-(7,4) = 8, (3,8)-(3,4) = 4
so we get the clusters
k1 = (3,4), (2,6), (3,8), (4,7)
k2 = (7,4), (6,2), (6,4), (7,3), (8,5), (7,6)
minimum cost = cost of (3,4) to all the other points of k1, plus cost of (7,4) to all the other points of k2
minimum cost = 3+4+4+3+1+1+2+2 = 20
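The cost computation above can be sketched as follows (Manhattan distance from each point to its cluster's medoid, summed over both clusters):

```python
# Total cost of the clustering: Manhattan distance from every point to the
# medoid of its cluster.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

k1 = [(3, 4), (2, 6), (3, 8), (4, 7)]                   # medoid (3, 4)
k2 = [(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)]   # medoid (7, 4)

total = (sum(manhattan((3, 4), p) for p in k1)
         + sum(manhattan((7, 4), p) for p in k2))
print(total)  # 20
```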
time complexity:
the time complexity of an algorithm is the amount of time it takes to run as a function of the input size, i.e. how much time its iterations take.
notations for expressing an algorithm's running-time complexity:
1) O (big-O) notation
2) Ω (omega) notation
3) Θ (theta) notation
7)
µ(S) is the centroid:
µ(S) = (Σ_{i=1}^{|S|} s_i) / |S|
µ(T) = (Σ_{i=1}^{|T|} t_i) / |T|
µ(S ∪ T) = (Σ_{i=1}^{|S|} s_i + Σ_{i=1}^{|T|} t_i) / (|S| + |T|)
µ(S ∪ T) − µ(S) = (Σ_{i=1}^{|S|} s_i + Σ_{i=1}^{|T|} t_i) / (|S| + |T|) − (Σ_{i=1}^{|S|} s_i) / |S|
= (|S| µ(S) + |T| µ(T)) / (|S| + |T|) − µ(S)
= (|S| µ(S) + |T| µ(T) − |S| µ(S) − |T| µ(S)) / (|S| + |T|)
= (|T| / (|S| + |T|)) (µ(T) − µ(S))
so |µ(S ∪ T) − µ(S)| = (|T| / (|S| + |T|)) |µ(T) − µ(S)|
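The identity can be checked numerically on arbitrary example sets (the values below are illustrative, not from the assignment):

```python
# Verify: mu(S u T) - mu(S) = |T| / (|S| + |T|) * (mu(T) - mu(S)) in 1-D.
S = [1.0, 2.0, 3.0]
T = [10.0, 14.0]

mu_S = sum(S) / len(S)
mu_T = sum(T) / len(T)
mu_union = (sum(S) + sum(T)) / (len(S) + len(T))

lhs = mu_union - mu_S
rhs = len(T) / (len(S) + len(T)) * (mu_T - mu_S)
assert abs(lhs - rhs) < 1e-12
print(lhs)  # 4.0
```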
8)
D = {(1,4), (1,5), (2,6), (3,5), (5,2), (8,2), (8,3), (9,1), (9,2), (9,3)}
i)
the given centres are (9,1) and (8,3). We use the distance function that, for two points a = (x1, y1) and b = (x2, y2), gives
d(a, b) = |x2 − x1| + |y2 − y1|
iteration 1:
calculating the distance between the data points and the given centres (8,3) and (9,1):
          (8,3)   (9,1)
(1,4)     8       11
(1,5)     9       12
(2,6)     9       12
(3,5)     7       10
(5,2)     4       5
(8,2)     1       2
(8,3)     0       3
(9,1)     3       0
(9,2)     2       1
(9,3)     1       2

by finding the average of the points in each cluster we can find the new centres:
new cluster centre of cluster 1: (4.625, 3.75)
new cluster centre of cluster 2: (9, 1.5)
iteration 2:
calculating the distance between the data points and the new centres (4.625,3.75), (9,1.5):

          (4.625,3.75)   (9,1.5)
(1,4)     3.875          10.5
(1,5)     4.875          11.5
(2,6)     4.875          11.5
(3,5)     2.875          9.5
(5,2)     2.125          4.5
(8,2)     5.125          1.5
(8,3)     4.125          2.5
(9,1)     7.125          0.5
(9,2)     6.125          0.5
(9,3)     5.125          1.5
iteration 3:
calculating the distance between the data points and the new centres (2.4,4.4), (8.6,2.2):

          (2.4,4.4)   (8.6,2.2)
(1,4)     1.8         9.4
(1,5)     2           10.4
(2,6)     2           10.4
(3,5)     1.2         8.4
(5,2)     5           3.8
(8,2)     8           0.8
(8,3)     7           1.4
(9,1)     10          1.6
(9,2)     9           0.6
(9,3)     8           1.2
iteration 4:
calculating the distance between the data points and the new centres (1.75,5), (8,2.16):

          (1.75,5)   (8,2.16)
(1,4)     1.75       8.84
(1,5)     0.75       9.84
(2,6)     1.25       9.84
(3,5)     1.25       7.84
(5,2)     6.25       3.16
(8,2)     9.25       0.16
(8,3)     8.25       0.84
(9,1)     11.25      2.16
(9,2)     10.25      1.16
(9,3)     9.25       1.84

the cluster assignments do not change, so the algorithm has converged.
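The iterations above can be reproduced with a short loop (Manhattan-distance assignment, mean-based centre updates, starting from the given centres):

```python
# k-means with Manhattan distance on the dataset from question 8,
# starting from the given centres (8,3) and (9,1).
D = [(1, 4), (1, 5), (2, 6), (3, 5), (5, 2),
     (8, 2), (8, 3), (9, 1), (9, 2), (9, 3)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

centers = [(8.0, 3.0), (9.0, 1.0)]
for _ in range(4):
    clusters = [[], []]
    for p in D:
        i = 0 if manhattan(p, centers[0]) <= manhattan(p, centers[1]) else 1
        clusters[i].append(p)
    centers = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               for c in clusters]

print(centers)  # converges to (1.75, 5.0) and (8.0, ~2.17)
```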
ii)
(pairwise distance matrix of the ten data points and the successive agglomerative merge steps; the extracted values are too garbled to reproduce reliably)
9)
iii)
estimate the missing rating using the collaborative filtering method:
step 1:
find the average of u1, u2, u3, u4 for A, B, C:
avg_A = 4.3
avg_B = 2.2
avg_C = 3.5
step 2:
find the similarities:
the range of the similarity is always between +1 and −1
sim(ci, cj) = Σ_p (r_ip − r_i,avg)(r_jp − r_j,avg) ÷ ( √(Σ_p (r_ip − r_i,avg)²) · √(Σ_p (r_jp − r_j,avg)²) )
sim(c1 , c2 ) = −0.74
sim(c1 , c3 ) = 1.29
here the highest is sim(c1 , c3 ) = 1.29
the rating is 3
and
sim(c2 , c1 ) = 0.69
sim(c2 , c3 ) = 0.69
here both are the same
then the rating is 5 or 4
and
sim(c3 , c1 ) = 1.29
sim(c3 , c2 ) = 0.39
the highest is sim(c3, c1) = 1.29, so
the rating is 4
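The similarity computation in step 2 can be sketched as follows; the rating matrix here is a made-up example (the assignment's actual rating table is not reproduced above), so the numbers are illustrative only:

```python
# Pearson-style similarity between two items' rating vectors over co-rated
# users; None marks a missing rating. The example ratings are invented.
import math

def pearson_sim(ri, rj):
    # only positions where both items were rated contribute
    pairs = [(a, b) for a, b in zip(ri, rj) if a is not None and b is not None]
    mi = sum(a for a, _ in pairs) / len(pairs)
    mj = sum(b for _, b in pairs) / len(pairs)
    num = sum((a - mi) * (b - mj) for a, b in pairs)
    den = (math.sqrt(sum((a - mi) ** 2 for a, _ in pairs))
           * math.sqrt(sum((b - mj) ** 2 for _, b in pairs)))
    return num / den

A = [5, 4, 4, None]   # hypothetical ratings of item A by u1..u4
B = [1, 2, 3, 5]      # hypothetical ratings of item B
sim = pearson_sim(A, B)
assert -1.0 <= sim <= 1.0  # a correctly computed similarity stays in [-1, 1]
```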