Rough Clustering
[Original Article]
Department of Mathematics
Abstract: Grouping web transactions into clusters is important for a better understanding of
user behavior. Rough approximation-based clustering has been used to group web
transactions into clusters. However, processing time remains an issue because of the high complexity of finding
the similarity of upper approximations of a transaction, which is used to merge two or more clusters.
In addition, the problem of more than one transaction falling under a given threshold is not addressed. In this
paper, we propose the RoCeT model for grouping web transactions using rough set theory. It is based on
pairs of similarity classes that have a nonvoid intersection.
Keywords: Clustering, Web transactions, Rough set theory.
Introduction
The measurement of similarity between two transactions $t$ and $s$ is given by
$\mathrm{sim}(t,s) = |t \cap s| / |t \cup s|$. Obviously, $\mathrm{sim}(t,s) \in [0, 1]$, where
$\mathrm{sim}(t,s) = 1$ when the two transactions $t$ and $s$ are identical
and $\mathrm{sim}(t,s) = 0$ when $t$ and $s$ have no items in common. De and Krishna [11]
used a binary relation $R$ on $T$ defined as follows. For any threshold value $th \in [0, 1]$ and for any
two user transactions $t, s \in T$, the relation holds, written $tRs$, iff $\mathrm{sim}(t,s) \ge th$. This relation
$R$ is a tolerance relation: it is reflexive and symmetric, but transitivity does not always hold.
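The similarity measure and the relation R described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the item names p1 to p4 and the threshold 0.3 are assumptions chosen only to show that R is reflexive and symmetric without being transitive.

```python
from typing import Dict, FrozenSet

Transaction = FrozenSet[str]


def sim(t: Transaction, s: Transaction) -> float:
    """Return |t ∩ s| / |t ∪ s|; two empty transactions count as identical."""
    union = t | s
    if not union:
        return 1.0
    return len(t & s) / len(union)


def similarity_classes(
    transactions: Dict[str, Transaction], th: float
) -> Dict[str, FrozenSet[str]]:
    """Build R(t) = {s : sim(t, s) >= th} for every transaction t."""
    return {
        name: frozenset(
            other for other, s in transactions.items() if sim(t, s) >= th
        )
        for name, t in transactions.items()
    }


# Hypothetical transactions (not taken from the paper).
transactions = {
    "a": frozenset({"p1", "p2"}),
    "b": frozenset({"p2", "p3"}),
    "c": frozenset({"p3", "p4"}),
}
classes = similarity_classes(transactions, th=0.3)

# a is related to b, and b to c, but a is not related to c.
for name, cls in classes.items():
    print(name, sorted(cls))
```

Because each class contains its own transaction and membership is mutual, the classes overlap rather than partition the data, which is why a further merging step is needed to obtain clusters.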
4.1 Complexity

Suppose that in an information system $S = (U, A, V, f)$ there are $|U|$ objects, which means there are at most $|U|$ similarity classes. The computation of similarity classes $R(t_i)$ on $R(t_j)$, where $i \neq j$, takes $|U| \times (|U| - 1)$ steps. Thus, the overall computational complexity of the RoCeT model is polynomial, $O(|U| \times (|U| - 1))$.

4.2 Example

U/A   hl1  hl2  hl3  hl4  hl5
t1     1    1    0    0    0
t2     0    1    1    1    0
t3     1    0    1    0    1
t4     0    1    1    0    1

$\mathrm{sim}(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_4\}|} = 0.25$,
$\mathrm{sim}(t_1, t_3) = \frac{|t_1 \cap t_3|}{|t_1 \cup t_3|} = \frac{|\{hl_1\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25$,
$\mathrm{sim}(t_1, t_4) = \frac{|t_1 \cap t_4|}{|t_1 \cup t_4|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25$,
$\mathrm{sim}(t_2, t_3) = \frac{|t_2 \cap t_3|}{|t_2 \cup t_3|} = \frac{|\{hl_3\}|}{|\{hl_1, hl_2, hl_3, hl_4, hl_5\}|} = 0.2$,
$\mathrm{sim}(t_2, t_4) = \frac{|t_2 \cap t_4|}{|t_2 \cup t_4|} = \frac{|\{hl_2, hl_3\}|}{|\{hl_2, hl_3, hl_4, hl_5\}|} = 0.5$,
$\mathrm{sim}(t_3, t_4) = \frac{|t_3 \cap t_4|}{|t_3 \cup t_4|} = \frac{|\{hl_3, hl_5\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.5$.

$R(t_1) = \{t_1\}$,
$R(t_2) = \{t_2, t_4\}$,
$R(t_3) = \{t_3, t_4\}$,
$R(t_4) = \{t_2, t_3, t_4\}$,

$RR(t_1) = \{t_1\}$,
$RR(t_2) = \{t_2, t_3, t_4\}$,
$RR(t_3) = \{t_2, t_3, t_4\}$,
$RR(t_4) = \{t_2, t_3, t_4\}$,

$RRR(t_2) = \{t_2, t_3, t_4\}$,
$RRR(t_3) = \{t_2, t_3, t_4\}$.
Here, we can see that two consecutive upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ are the same. Thus,
we get the similarity upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ as follows.
$S_1 = \{t_1\}$,
$S_2 = \{t_2, t_3, t_4\}$,
$S_3 = \{t_2, t_3, t_4\}$,
$S_4 = \{t_2, t_3, t_4\}$.
$S_1 = R(t_1) = \{t_1\}$,
$S_2 = S_3 = S_4 = \bigcup_{i=2,3,4} R(t_i) = \{t_2, t_4\} \cup \{t_3, t_4\} \cup \{t_2, t_3, t_4\} = \{t_2, t_3, t_4\}$.
Hence, the two clusters are $\{t_1\}$ and $\{t_2, t_3, t_4\}$.
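The repeated upper approximation used in this first example can be sketched as a fixed-point loop. The sketch assumes that the upper approximation of a set X collects every transaction whose similarity class meets X; that definition and the function names are assumptions made for illustration, not the paper's own code.

```python
from typing import Dict, FrozenSet

Classes = Dict[str, FrozenSet[str]]


def upper_approximation(target: FrozenSet[str], base: Classes) -> FrozenSet[str]:
    """Collect every transaction whose base similarity class meets `target`."""
    return frozenset(t for t, cls in base.items() if cls & target)


def similarity_upper_approximation(base: Classes) -> Classes:
    """Re-apply the upper approximation until two consecutive results agree."""
    current = dict(base)
    while True:
        nxt = {t: upper_approximation(cls, base) for t, cls in current.items()}
        if nxt == current:
            return current
        current = nxt


# Similarity classes R(t1) to R(t4) of the first example.
R = {
    "t1": frozenset({"t1"}),
    "t2": frozenset({"t2", "t4"}),
    "t3": frozenset({"t3", "t4"}),
    "t4": frozenset({"t2", "t3", "t4"}),
}
S = similarity_upper_approximation(R)
clusters = set(S.values())  # distinct stable sets are the clusters

for name in sorted(S):
    print(name, sorted(S[name]))
```

The loop stops because each pass can only enlarge the sets and the number of transactions is finite.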
b. The second set of transactions is given in Table 2. It contains
eleven objects, $|U| = 11$, and six hyperlinks,
$|A| = 6$.
Table 2. Data transactions

U/A   hl1  hl2  hl3  hl4  hl5  hl6
t1     1    0    1    1    0    0
t2     0    0    1    1    1    1
t3     0    1    0    0    1    1
t4     1    0    0    0    0    0
t5     0    0    0    0    1    0
t6     0    1    1    0    0    0
t7     0    0    1    1    0    0
t8     0    0    1    1    0    1
t9     0    0    1    0    0    0
t10    0    1    0    1    0    0
t11    0    0    1    1    1    0
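$\mathrm{sim}(t_4, t_6) = 0$, $\mathrm{sim}(t_4, t_7) = 0$, $\mathrm{sim}(t_4, t_8) = 0$, $\mathrm{sim}(t_4, t_9) = 0$, $\mathrm{sim}(t_4, t_{10}) = 0$, $\mathrm{sim}(t_4, t_{11}) = 0$,
$\mathrm{sim}(t_5, t_6) = 0$, $\mathrm{sim}(t_5, t_7) = 0$, $\mathrm{sim}(t_5, t_8) = 0$, $\mathrm{sim}(t_5, t_9) = 0$, $\mathrm{sim}(t_5, t_{10}) = 0$, $\mathrm{sim}(t_5, t_{11}) = 0.33$,
$\mathrm{sim}(t_6, t_7) = 0.33$, $\mathrm{sim}(t_6, t_8) = 0.25$, $\mathrm{sim}(t_6, t_9) = 0.50$, $\mathrm{sim}(t_6, t_{10}) = 0.33$, $\mathrm{sim}(t_6, t_{11}) = 0.25$,
$\mathrm{sim}(t_7, t_8) = 0.67$, $\mathrm{sim}(t_7, t_9) = 0.50$, $\mathrm{sim}(t_7, t_{10}) = 0.33$, $\mathrm{sim}(t_7, t_{11}) = 0.67$,
$\mathrm{sim}(t_8, t_9) = 0.33$, $\mathrm{sim}(t_8, t_{10}) = 0.25$, $\mathrm{sim}(t_8, t_{11}) = 0.50$,
$\mathrm{sim}(t_9, t_{10}) = 0$, $\mathrm{sim}(t_9, t_{11}) = 0.33$,
$\mathrm{sim}(t_{10}, t_{11}) = 0.25$.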
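Pairwise similarities of this kind can be recomputed mechanically from the 0/1 entries of Table 2. The sketch below is one way to do that under stated assumptions: the rows are transcribed from Table 2, while the names `TABLE_2`, `to_item_set` and `similarity` are introduced here for illustration only.

```python
from itertools import combinations

HYPERLINKS = ["hl1", "hl2", "hl3", "hl4", "hl5", "hl6"]

# One row per transaction: a 0/1 flag per hyperlink, in HYPERLINKS order.
TABLE_2 = {
    "t1": [1, 0, 1, 1, 0, 0],
    "t2": [0, 0, 1, 1, 1, 1],
    "t3": [0, 1, 0, 0, 1, 1],
    "t4": [1, 0, 0, 0, 0, 0],
    "t5": [0, 0, 0, 0, 1, 0],
    "t6": [0, 1, 1, 0, 0, 0],
    "t7": [0, 0, 1, 1, 0, 0],
    "t8": [0, 0, 1, 1, 0, 1],
    "t9": [0, 0, 1, 0, 0, 0],
    "t10": [0, 1, 0, 1, 0, 0],
    "t11": [0, 0, 1, 1, 1, 0],
}


def to_item_set(flags):
    """Turn a 0/1 row into the set of hyperlinks the transaction visited."""
    return frozenset(h for h, flag in zip(HYPERLINKS, flags) if flag)


def sim(t, s):
    """Return |t ∩ s| / |t ∪ s|; two empty transactions count as identical."""
    union = t | s
    return len(t & s) / len(union) if union else 1.0


items = {name: to_item_set(flags) for name, flags in TABLE_2.items()}

# Similarity of every unordered pair, rounded to two decimals as in the text.
similarity = {
    (a, b): round(sim(items[a], items[b]), 2)
    for a, b in combinations(TABLE_2, 2)
}

print(similarity[("t7", "t8")], similarity[("t8", "t11")])
```

With 11 transactions there are 55 unordered pairs, so the full list is small enough to check by eye against the table.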
$RRRR(t_5) = \{t_5\}$,
$RRRR(t_6) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$RRRR(t_7) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$RRRR(t_8) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$RRRR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$RRRR(t_{10}) = \{t_{10}\}$,
$RRRR(t_{11}) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.
$\mathrm{sim}(t_3, t_4) = 0$,
$\mathrm{sim}(t_3, t_5) = 0.33$,
$\mathrm{sim}(t_3, t_{10}) = 0.25$,
$\mathrm{sim}(t_4, t_5) = 0$,
$\mathrm{sim}(t_4, t_{10}) = 0$,
$\mathrm{sim}(t_5, t_{10}) = 0$.
Given a second threshold value of 0.3, the
similarity classes are as follows.
$R(t_3) = \{t_3, t_5\}$,
$R(t_4) = \{t_4\}$,
$R(t_5) = \{t_3, t_5\}$,
$R(t_{10}) = \{t_{10}\}$.
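The intersections of the similarity classes are summarized
in Figure 2.
$R(t_3) \cap R(t_4) = \{t_3, t_5\} \cap \{t_4\} = \emptyset$,
$R(t_3) \cap R(t_5) = \{t_3, t_5\} \cap \{t_3, t_5\} = \{t_3, t_5\}$,
$R(t_3) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$,
$R(t_4) \cap R(t_5) = \{t_4\} \cap \{t_3, t_5\} = \emptyset$,
$R(t_4) \cap R(t_{10}) = \{t_4\} \cap \{t_{10}\} = \emptyset$,
$R(t_5) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$.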
Fig 2. The intersection of similarity classes
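The step of intersecting similarity classes and merging those with a nonvoid intersection can be sketched as follows. It uses the four similarity classes obtained for the second threshold; the function name `merge_overlapping` and the merging strategy are assumptions made for illustration and may differ from the authors' procedure.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def merge_overlapping(classes: Dict[str, FrozenSet[str]]) -> List[Set[str]]:
    """Merge similarity classes that share at least one transaction."""
    clusters: List[Set[str]] = []
    for cls in classes.values():
        merged = set(cls)
        untouched = []
        for cluster in clusters:
            if cluster & merged:
                merged |= cluster  # nonvoid intersection: absorb the cluster
            else:
                untouched.append(cluster)
        clusters = untouched + [merged]
    return clusters


# Similarity classes for the second threshold value 0.3.
classes = {
    "t3": frozenset({"t3", "t5"}),
    "t4": frozenset({"t4"}),
    "t5": frozenset({"t3", "t5"}),
    "t10": frozenset({"t10"}),
}

# Pairwise intersections, mirroring the listing in the text.
intersections = {
    (a, b): classes[a] & classes[b] for a, b in combinations(classes, 2)
}
clusters = merge_overlapping(classes)

print(sorted(sorted(c) for c in clusters))
```

Only R(t3) and R(t5) overlap, so the remaining transactions t4 and t10 each end up in a cluster of their own.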
[Figure: pairwise intersections of similarity classes over the transactions t2, ..., t11]
[Figure: axes labeled transactions, clusters and computation]
Based on Figure 8, the RoCeT model provides better solutions than the algorithm of [11].
RoCeT

Cluster          Member Transactions            Purity
1                t1, t2, t6, t7, t8, t9, t11    1
2                t3, t5                         1
3                t4                             1
4                t10                            1
Overall Purity                                  100 %

[Figure: time in seconds, RoCeT]

Overall Purity
The technique of [11]    81.82 %
RoCeT                    100 %
Fig.5. The Overall Purity

Purity     Computation    Time
18.18 %    46.30 %        80.77 %
Purity

Experiment test    The RoCeT model    The technique of [11]    Improvement
                   100%               93.0%                    7.0%
                   100%               96.0%                    4.0%
                   100%               95.5%                    0.5%
                   100%               95.5%                    0.5%
                   100%               99.9%                    0.1%
Average                                                        2.5%

Computation

Experiment test    The RoCeT model    The technique of [11]    Improvement
                   8806               28213                    68.79%
                   39349              116576                   66.25%
                   257003             497595                   48.35%
                   1034964            2965579                  65.10%
                   4161122            11879645                 64.97%
Average                                                        62.69%

Response time (seconds)

Experiment test    The RoCeT model    The technique of [11]    Improvement
                   1.6969             6.250                    68.50%
                   9.093              6.250                    68.39%
                   77.266             163.760                  52.82%
                   554.426            2205.100                 74.86%
                   3043.500           9780.900                 68.88%
Average                                                        69.69%
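The improvement figures appear to be relative reductions with respect to the technique of [11]. The sketch below rests on that assumption; the formula is inferred from the reported numbers rather than stated in the text, and the function name `improvement` is hypothetical.

```python
def improvement(rocet: float, baseline: float) -> float:
    """Relative reduction of `rocet` against `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - rocet) / baseline * 100.0


# Response time: 77.266 s for RoCeT against 163.760 s for the technique of [11].
time_gain = improvement(77.266, 163.760)

# Computation: 257003 for RoCeT against 497595 for the technique of [11].
computation_gain = improvement(257003, 497595)

print(f"{time_gain:.2f}% {computation_gain:.2f}%")
```

Under this reading a value of 50% means RoCeT needs half the time or half the computation of the baseline.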
[Figure: Response time in seconds versus number of transactions (100, 200, 500, 1000, 2000), RoCeT and the technique of [11]]
[Figure: Computation versus number of transactions (100, 200, 500, 1000, 2000), RoCeT]
[Figures: number of transactions per cluster after the 1st threshold and after the 2nd threshold (0.6)]
Conclusion
1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society.
References
[1]
[2]
[3]
[4]
Tutut Herawan
He is a Ph.D. candidate in Data
Mining at Universiti Tun Hussein
Onn Malaysia (UTHM). His research areas include Data Mining,
KDD and Real Analysis.