Rough Clustering

IJBSCHS(2010-16-2-16)
[Original Article]
Biomedical Soft Computing and Human Sciences, Vol.16, No.2, pp.135-145

Copyright © 1995 Biomedical Fuzzy Systems Association
(Accepted on February 20,2010)
RoCeT: Rough Clustering for web Transactions

Iwan Tri Riyadi Yanto^{1,3}, Tutut Herawan^{2,3}, Mustafa Mat Deris^{3}

1 Department of Mathematics, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
2 Department of Mathematics Education, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
3 Faculty of Information Technology and Multimedia, Universiti Tun Hussein Onn Malaysia, Johor, Malaysia


(Received 31 December 2009; revised and accepted 20 February 2010)
Abstract: Grouping web transactions into clusters is important for obtaining a better understanding of users' behavior. Currently, a rough approximation-based clustering technique is used to group web transactions into clusters. However, its processing time is still an issue because of the high complexity of finding the similarity of the upper approximations of a transaction, which is used to merge two or more clusters. Moreover, the problem of more than one transaction falling under the given threshold is not addressed. In this paper, we propose the RoCeT model for grouping web transactions using rough set theory. It is based on merging two similarity classes that have a non-void intersection.
Keywords: Clustering, Web transactions, Rough set theory.
1 Introduction

Web usage data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse-clicks and scrolls, and any other data generated by the interaction of users and the web [1]. Generally, web mining techniques can be defined as those methods that extract so-called nuggets (or knowledge) from a web data repository, such as content, linkage and usage information, by utilizing data mining tools. Among such web data, the user click stream, i.e. usage data, can mainly be utilized to capture users' navigation patterns and identify users' intended tasks. Once the user navigational behaviors are effectively characterized, they benefit further web applications and, in turn, facilitate and improve web service quality for both web-based organizations and end users [2-9]. In web data mining research, many data mining techniques, such as clustering [8,10], are widely adopted to improve the usability and scalability of web mining.

Access transactions over the web can be expressed in two finite sets, user transactions and hyperlinks/URLs [11]. A user transaction is a sequence of items. The set $U$ is formed by $m$ users and the set $A$ is the set of $n$ distinct clicks (hyperlinks/URLs) clicked by users, that is, $U = \{t_1, t_2, \ldots, t_m\}$ and $A = \{hl_1, hl_2, \ldots, hl_n\}$, where every $t_i \in T \subseteq U$ is a non-empty subset of $A$. The temporal order of users' clicks within transactions has been taken into account. A user transaction $t \in T$ is represented as a vector $t = \langle u_t^1, u_t^2, \ldots, u_t^n \rangle$, where $u_t^i = 1$ if $hl_i \in t$, and $u_t^i = 0$ otherwise.

A well-known approach for clustering web transactions uses rough set theory [12-14]. De and Krishna [11] proposed an algorithm for clustering web transactions using rough approximation. It is based on the similarity of the upper approximations of transactions for a given threshold. However, several iterations are needed to merge two or more clusters that have the same similarity upper approximations, and the authors did not present how to handle the problem when more than one transaction falls under the given threshold. To overcome those problems, in this paper we propose an alternative technique for clustering web transactions. We use the concept of similarity class proposed by [11]. However, the RoCeT model differs in how it allocates transactions to the same cluster and in how it handles the problem when more than one transaction falls under the given threshold.

The rest of the paper is organized as follows. Section 2 describes the concept of rough set theory. Section 3 describes the work of [11]. Section 4 describes the RoCeT model. Section 5 describes the experimental test. Finally, we conclude our work in Section 6.

1 Faculty of Information Technology and Multimedia,
UTHM, Parit Raja, Batu Pahat 86400, Johor.
Phone : +60177061496
Email : iwan015@gmail.com
I.T.R Yanto, et al.: RoCeT: Rough Clustering for web Transactions
2 Rough Set Theory

An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \ldots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is an information (knowledge) function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$. The starting point of rough set approximations is the indiscernibility relation, which is generated by information about objects of interest. Two objects in an information system are called indiscernible (indistinguishable or similar) if they have the same feature.

Definition 1 Two elements $x, y \in U$ are said to be $B$-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.

Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.

Definition 2 (See [14].) The $B$-lower approximation of $X$, denoted by $\underline{B}(X)$, and the $B$-upper approximation of $X$, denoted by $\overline{B}(X)$, are respectively defined by
$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\} \quad \text{and} \quad \overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$

The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted by $\alpha_B(X)$, is measured by
$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$
where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \leq \alpha_B(X) \leq 1$. If $X$ is a union of some equivalence classes of $U$, then $\alpha_B(X) = 1$. Thus, the set $X$ is crisp (precise) with respect to $B$. If $X$ is not a union of some equivalence classes of $U$, then $\alpha_B(X) < 1$. Thus, the set $X$ is rough (imprecise) with respect to $B$ [13]. This means that the higher the accuracy of approximation of a subset $X \subseteq U$, the more precise (the less imprecise) it is.

3 Related Work

In this section, we discuss the technique proposed by [11]. Given two transactions $t$ and $s$, the measure of similarity between $t$ and $s$ is given by
$$sim(t, s) = \frac{|t \cap s|}{|t \cup s|}.$$
Obviously, $sim(t, s) \in [0, 1]$, where $sim(t, s) = 1$ when the two transactions $t$ and $s$ are exactly identical and $sim(t, s) = 0$ when the two transactions $t$ and $s$ have no items in common. De and Krishna [11] used a binary relation $R$ on $T$ defined as follows. For any threshold value $th \in [0, 1]$ and for any two user transactions $t, s \in T$, the binary relation $R$ on $T$ is given by $tRs$ iff $sim(t, s) \geq th$. This relation $R$ is a tolerance relation, as $R$ is both reflexive and symmetric, but transitivity may not always hold.

Definition 3 The similarity class of $t$, denoted by $R(t)$, is the set of transactions which are similar to $t$, given by $R(t) = \{s \in T : sRt\}$.

For different threshold values, one can get different similarity classes. A domain expert can choose the threshold based on experience to get a proper similarity class. It is clear that for a fixed threshold in $[0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class.

Definition 4 Let $P \subseteq T$ and, for a fixed threshold in $[0, 1]$, let $R$ be the binary tolerance relation defined on $T$. The lower approximation of $P$, denoted by $\underline{R}(P)$, and the upper approximation of $P$, denoted by $\overline{R}(P)$, are respectively defined as
$$\underline{R}(P) = \{t \in P : R(t) \subseteq P\} \quad \text{and} \quad \overline{R}(P) = \bigcup_{t \in P} R(t).$$

De and Krishna proposed a technique for clustering the clicks of user navigations called the similarity upper approximation, denoted by $S_i$. The set of transactions that are possibly similar to $R(t_i)$ is denoted by $RR(t_i)$. This process continues until two consecutive upper approximations for $t_i$, $i = 1, 2, 3, \ldots, |U|$, are the same, and two or more clusters that have the same similarity upper approximations are merged at each iteration. With this technique, a high computational complexity is needed to cluster the transactions, because the similarity upper approximations have to be computed until two consecutive upper approximations are the same. To overcome this problem, we propose an alternative technique to cluster the transactions.

4 The RoCeT model

The RoCeT model for clustering the transactions is based on all the transactions that are possibly similar to the similarity class $R(t)$ of $t$. The union of two similarity classes with non-void intersection forms the same cluster. The justification that a cluster is a union of two similarity classes with non-void intersection is presented in Proposition 6.
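Since the RoCeT model builds directly on the similarity classes of Section 3, a minimal Python sketch of the similarity measure, the similarity class (Definition 3) and the lower and upper approximations (Definition 4) may help. The function names and the toy transactions `toy` are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, FrozenSet, Iterable, Set

Transaction = FrozenSet[str]


def sim(t: Transaction, s: Transaction) -> float:
    """Similarity of Section 3: |t ∩ s| / |t ∪ s|."""
    union = t | s
    # Transactions are assumed non-empty; two empty sets are treated as identical.
    return len(t & s) / len(union) if union else 1.0


def similarity_class(name: str, data: Dict[str, Transaction], th: float) -> Set[str]:
    """R(t) = {s in T : sim(t, s) >= th} (Definition 3)."""
    return {other for other, items in data.items() if sim(data[name], items) >= th}


def lower_approximation(p: Iterable[str], data: Dict[str, Transaction], th: float) -> Set[str]:
    """{t in P : R(t) is a subset of P} (Definition 4)."""
    p_set = set(p)
    return {t for t in p_set if similarity_class(t, data, th) <= p_set}


def upper_approximation(p: Iterable[str], data: Dict[str, Transaction], th: float) -> Set[str]:
    """Union of R(t) over all t in P (Definition 4)."""
    result: Set[str] = set()
    for t in p:
        result |= similarity_class(t, data, th)
    return result


# Hypothetical toy transactions, used only for illustration.
toy = {
    "a": frozenset({"hl1", "hl2"}),
    "b": frozenset({"hl1", "hl2", "hl3"}),
    "c": frozenset({"hl2", "hl3"}),
    "d": frozenset({"hl4"}),
}

classes = {name: similarity_class(name, toy, 0.5) for name in toy}
lower = lower_approximation({"a", "b"}, toy, 0.5)
upper = upper_approximation({"a", "b"}, toy, 0.5)
print(classes, lower, upper)
```

In this toy data, `a` is similar to `b` and `b` is similar to `c`, but `a` is not similar to `c`, which illustrates why $R$ is only a tolerance relation and why similarity classes can overlap.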

Definition 5 Two clusters $S_i$ and $S_j$, $i \neq j$, are said to be the same if
$$S_i = \bigcup R(t_i), \quad i = 1, 2, 3, \ldots, |U|.$$

Proposition 6 Let $S_i$ be a cluster. If $\bigcap R(t_i) \neq \emptyset$, then $\bigcup R(t_i) = S_i$.

Proof. We suppose that $S_i$ and $S_j$, where $i \neq j$, are the same clusters. From Definition 5, if $\bigcup R(t_i) \neq S_i$, then we have
$$S_i \neq S_j \neq \bigcup R(t_j),$$
$$\bigcup R(t_i) \neq \bigcup R(t_j),$$
$$\Big(\bigcap R(t_i)\Big) \cap \Big(\bigcap R(t_j)\Big) = \emptyset,$$
$$\bigcap R(t_i) = \emptyset.$$
This is a contradiction from the hypothesis.

4.1 Complexity

Suppose that in an information system $S = (U, A, V, f)$ there are $|U|$ objects, which means that there are at most $|U|$ similarity classes. The number of computations of similarity classes $R(t_i)$ on $R(t_j)$, where $i \neq j$, is $|U| \times |U - 1|$. Thus, the overall computational complexity of the RoCeT model is polynomial, $O(|U| \times |U - 1|)$.

4.2 Example

In this study, the comparison between the RoCeT model and the technique proposed by [11] is presented through two examples, where two small data sets of transactions are considered.

a. The first transaction data set is adopted from [11] and given in Table 1. It contains four objects ($|U| = 4$) with five hyperlinks ($|A| = 5$).

Table 1. Data transactions

U/A   hl1  hl2  hl3  hl4  hl5
t1     1    1    0    0    0
t2     0    1    1    1    0
t3     1    0    1    0    1
t4     0    1    1    0    1

The technique of [11] needs three main steps. The first step is obtaining the measure of similarity, which gives information about the users' access patterns related to their common areas of interest through the similarity relation between two transactions. The calculations of the measure of similarity of the web transactions from Table 1 are given below.

$$sim(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_4\}|} = 0.25,$$
$$sim(t_1, t_3) = \frac{|t_1 \cap t_3|}{|t_1 \cup t_3|} = \frac{|\{hl_1\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$
$$sim(t_1, t_4) = \frac{|t_1 \cap t_4|}{|t_1 \cup t_4|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$
$$sim(t_2, t_3) = \frac{|t_2 \cap t_3|}{|t_2 \cup t_3|} = \frac{|\{hl_3\}|}{|\{hl_1, hl_2, hl_3, hl_4, hl_5\}|} = 0.2,$$
$$sim(t_2, t_4) = \frac{|t_2 \cap t_4|}{|t_2 \cup t_4|} = \frac{|\{hl_2, hl_3\}|}{|\{hl_2, hl_3, hl_4, hl_5\}|} = 0.5,$$
$$sim(t_3, t_4) = \frac{|t_3 \cap t_4|}{|t_3 \cup t_4|} = \frac{|\{hl_3, hl_5\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.5.$$

Second, the similarity classes are obtained for the given threshold value using Definition 3. Given the threshold value 0.5, we get the similarity classes as follows.

$R(t_1) = \{t_1\}$,
$R(t_2) = \{t_2, t_4\}$,
$R(t_3) = \{t_3, t_4\}$,
$R(t_4) = \{t_2, t_3, t_4\}$.

The last step is to cluster the transactions. To get the clusters, [11] used the similarity upper approximations, and the process is shown below.

$R(t_1) = \{t_1\}$,
$R(t_2) = \{t_2, t_4\}$,
$R(t_3) = \{t_3, t_4\}$,
$R(t_4) = \{t_2, t_3, t_4\}$,
$RR(t_1) = \{t_1\}$,
$RR(t_2) = \{t_2, t_3, t_4\}$,
$RR(t_3) = \{t_2, t_3, t_4\}$,
$RR(t_4) = \{t_2, t_3, t_4\}$,
$RRR(t_2) = \{t_2, t_3, t_4\}$,
$RRR(t_3) = \{t_2, t_3, t_4\}$.

Here, we can see that two consecutive upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ are the same. Thus, we get the similarity upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ as follows.

$S_1 = \{t_1\}$,
$S_2 = \{t_2, t_3, t_4\}$,
$S_3 = \{t_2, t_3, t_4\}$,
$S_4 = \{t_2, t_3, t_4\}$,

where $S_2 = S_3 = S_4$ and $S_1 \neq S_i$ for $i = 2, 3, 4$. Finally, we get the two clusters $\{t_1\}$ and $\{t_2, t_3, t_4\}$.

The RoCeT model, however, is based on non-void intersection. According to Definition 5 and the similarity classes as used in [11], only a few computations are needed to get the clusters. Therefore, the RoCeT model clusters the transactions with less effort than the technique of [11]. The calculation of the similarity relation is shown in Figure 1.

$R(t_1) \cap R(t_2) = \{t_1\} \cap \{t_2, t_4\} = \emptyset$,
$R(t_1) \cap R(t_3) = \{t_1\} \cap \{t_3, t_4\} = \emptyset$,
$R(t_1) \cap R(t_4) = \{t_1\} \cap \{t_2, t_3, t_4\} = \emptyset$,
$R(t_2) \cap R(t_3) = \{t_2, t_4\} \cap \{t_3, t_4\} = \{t_4\}$,
$R(t_2) \cap R(t_4) = \{t_2, t_4\} \cap \{t_2, t_3, t_4\} = \{t_2, t_4\}$,
$R(t_3) \cap R(t_4) = \{t_3, t_4\} \cap \{t_2, t_3, t_4\} = \{t_3, t_4\}$.

Fig.1. The similarity relation

Here, we can see that $R(t_i) \cap R(t_j) \neq \emptyset$, $i \neq j$, for $i, j = 2, 3, 4$. We get the clusters as follows.

$S_1 = \bigcup R(t_1) = \{t_1\}$,
$S_2 = S_3 = S_4 = \bigcup R(t_i)$, $i = 2, 3, 4$,
$\bigcup R(t_i) = \{t_2, t_4\} \cup \{t_3, t_4\} \cup \{t_2, t_3, t_4\} = \{t_2, t_3, t_4\}$.

Hence, the two clusters are $\{t_1\}$ and $\{t_2, t_3, t_4\}$.
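
Example a can be retraced with a short Python sketch. It is only one possible reading of the RoCeT idea, namely taking the union of similarity classes whenever their intersection is non-void; the helper names `jaccard`, `similarity_classes` and `merge_intersecting` are assumptions introduced here and do not come from the authors' MATLAB code.

```python
from typing import Dict, Iterable, List, Set


def jaccard(t: Set[str], s: Set[str]) -> float:
    """sim(t, s) = |t ∩ s| / |t ∪ s| for non-empty transactions."""
    return len(t & s) / len(t | s)


def similarity_classes(data: Dict[str, Set[str]], th: float) -> Dict[str, Set[str]]:
    """R(t) for every transaction t at threshold th."""
    return {a: {b for b in data if jaccard(data[a], data[b]) >= th} for a in data}


def merge_intersecting(classes: Iterable[Set[str]]) -> List[Set[str]]:
    """Union similarity classes that share at least one transaction."""
    clusters: List[Set[str]] = []
    for cls in classes:
        merged = set(cls)
        untouched = []
        for cluster in clusters:
            if cluster & merged:      # non-void intersection: same cluster
                merged |= cluster
            else:
                untouched.append(cluster)
        clusters = untouched + [merged]
    return clusters


# Table 1, written as sets of hyperlinks.
table1 = {
    "t1": {"hl1", "hl2"},
    "t2": {"hl2", "hl3", "hl4"},
    "t3": {"hl1", "hl3", "hl5"},
    "t4": {"hl2", "hl3", "hl5"},
}

classes_a = similarity_classes(table1, th=0.5)
clusters_a = merge_intersecting(classes_a.values())
print(sorted(sorted(c) for c in clusters_a))
```

Each similarity class is visited once and compared with the clusters formed so far, so no repeated upper approximation passes are needed in this sketch.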
b. The second transaction data set is given in Table 2. It contains eleven objects ($|U| = 11$) with six hyperlinks ($|A| = 6$).

Table 2. Data transactions

U/A   hl1  hl2  hl3  hl4  hl5  hl6
t1     1    0    1    1    0    0
t2     0    0    1    1    1    1
t3     0    1    0    0    1    1
t4     1    0    0    0    0    0
t5     0    0    0    0    1    0
t6     0    1    1    0    0    0
t7     0    0    1    1    0    0
t8     0    0    1    1    0    1
t9     0    0    1    0    0    0
t10    0    1    0    1    0    0
t11    0    0    1    1    1    0

The similarities between the transactions are given below.

sim(t1 , t2 ) = 0.40,
sim(t1 , t3 ) = 0,
sim(t1 , t4 ) = 0.33,
sim(t1 , t5 ) = 0,
sim(t1 , t6 ) = 0.25,
sim(t1 , t7 ) = 0.67,
sim(t1 , t8 ) = 0.50,
sim(t1 , t9 ) = 0.33,
sim(t1 , t10 ) = 0.25,
sim(t1 , t11 ) = 0.5,
sim(t2 , t3 ) = 0.40,
sim(t2 , t4 ) = 0,
sim(t2 , t5 ) = 0.25,
sim(t2 , t6 ) = 0.20,
sim(t2 , t7 ) = 0.50,
sim(t2 , t8 ) = 0.75,
sim(t2 , t9 ) = 0.25,
sim(t2 , t10 ) = 0.20,
sim(t2 , t11 ) = 0.75,
sim(t3 , t4 ) = 0,
sim(t3 , t5 ) = 0.33,
sim(t3 , t6 ) = 0.25,
sim(t3 , t7 ) = 0,
sim(t3 , t8 ) = 0.20,
sim(t3 , t9 ) = 0,
sim(t3 , t10 ) = 0.25,
sim(t3 , t11 ) = 0.20,
sim(t4 , t5 ) = 0,
sim(t4 , t6 ) = 0,
sim(t4 , t7 ) = 0,
sim(t4 , t8 ) = 0,
sim(t4 , t9 ) = 0,
sim(t4 , t10 ) = 0,
sim(t4 , t11 ) = 0,
sim(t5 , t6 ) = 0,
sim(t5 , t7 ) = 0,
sim(t5 , t8 ) = 0,
sim(t5 , t9 ) = 0,
sim(t5 , t10 ) = 0,
sim(t5 , t11 ) = 0.33,
sim(t6 , t7 ) = 0.33,
sim(t6 , t8 ) = 0.25,
sim(t6 , t9 ) = 0.50,
sim(t6 , t10 ) = 0.33,
sim(t6 , t11 ) = 0.25,
sim(t7 , t8 ) = 0.67,
sim(t7 , t9 ) = 0.50,
sim(t7 , t10 ) = 0.33,
sim(t7 , t11 ) = 0.67,
sim(t8 , t9 ) = 0.33,
sim(t8 , t10 ) = 0.25,
sim(t8 , t11 ) = 0.50,
sim(t9 , t10 ) = 0,
sim(t9 , t11 ) = 0.33,
sim(t10 , t11 ) = 0.25.
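
The similarities above follow from the binary vector representation of the transactions. As a hedged illustration, the sketch below encodes Table 2 as 0/1 rows and computes the measure directly from the vectors; the names `TABLE2` and `sim_vec` are introduced here for illustration only.

```python
from typing import List, Sequence

# Rows t1..t11 of Table 2; columns hl1..hl6.
TABLE2: List[List[int]] = [
    [1, 0, 1, 1, 0, 0],  # t1
    [0, 0, 1, 1, 1, 1],  # t2
    [0, 1, 0, 0, 1, 1],  # t3
    [1, 0, 0, 0, 0, 0],  # t4
    [0, 0, 0, 0, 1, 0],  # t5
    [0, 1, 1, 0, 0, 0],  # t6
    [0, 0, 1, 1, 0, 0],  # t7
    [0, 0, 1, 1, 0, 1],  # t8
    [0, 0, 1, 0, 0, 0],  # t9
    [0, 1, 0, 1, 0, 0],  # t10
    [0, 0, 1, 1, 1, 0],  # t11
]


def sim_vec(u: Sequence[int], v: Sequence[int]) -> float:
    """|t ∩ s| / |t ∪ s| computed on 0/1 vectors of equal length."""
    common = sum(a & b for a, b in zip(u, v))   # hyperlinks clicked in both
    either = sum(a | b for a, b in zip(u, v))   # hyperlinks clicked in at least one
    return common / either if either else 1.0


def t(i: int) -> List[int]:
    """Row of transaction t_i (1-based, as in the paper)."""
    return TABLE2[i - 1]


# Full symmetric similarity matrix, indexed from 0.
matrix = [[sim_vec(u, v) for v in TABLE2] for u in TABLE2]
print(round(sim_vec(t(1), t(7)), 2), sim_vec(t(2), t(8)))
```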
Given the threshold value 0.5, the similarity classes are as follows.
R(t1 ) = {t1 , t7 , t8 , t11 },
R(t2 ) = {t2 , t7 , t8 , t11 },
R(t3 ) = {t3 },
R(t4 ) = {t4 },
R(t5 ) = {t5 },
R(t6 ) = {t6 , t9 },
R(t7 ) = {t1 , t2 , t7 , t8 , t9 , t11 },
R(t8 ) = {t1 , t2 , t7 , t8 , t11 },
R(t9 ) = {t6 , t7 , t9 },
R(t10 ) = {t10 },
R(t11 ) = {t1 , t2 , t7 , t8 , t11 }.
The process of finding the similarity upper approximations of each transaction is illustrated as follows.
R(t1 ) = {t1 , t7 , t8 , t11 },
R(t2 ) = {t2 , t7 , t8 , t11 },
R(t3 ) = {t3 },
R(t4 ) = {t4 },
R(t5 ) = {t5 },
R(t6 ) = {t6 , t9 },
R(t7 ) = {t1 , t2 , t7 , t8 , t9 , t11 },
R(t8 ) = {t1 , t2 , t7 , t8 , t11 },
R(t9 ) = {t6 , t7 , t9 },
R(t10 ) = {t10 },
R(t11 ) = {t1 , t2 , t7 , t8 , t11 }
RR(t1 ) = {t1 , t2 , t7 , t8 , t9 , t11 },
RR(t2 ) = {t1 , t2 , t7 , t8 , t9 , t11 },
RR(t3 ) = {t3 }
RR(t4 ) = {t4 },
RR(t5 ) = {t5 },
RR(t6 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
RR(t7 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RR(t8 ) = {t1 , t2 , t7 , t8 , t9 , t11 }
RR(t9 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RR(t10 ) = {t10 },
RR(t11 ) = {t1 , t2 , t7 , t8 , t9 , t11 }
RRR(t1 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRR(t2 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRR(t3 ) = {t3 },
RRR(t4 ) = {t4 },
RRR(t5 ) = {t5 },
RRR(t6 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
RRR(t7 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRR(t8 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
RRR(t9 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRR(t10 ) = {t10 },
RRR(t11 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
RRRR(t1 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRRR(t2 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRRR(t3 ) = {t3 },
RRRR(t4 ) = {t4 },
RRRR(t5 ) = {t5 },
RRRR(t6 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
RRRR(t7 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRRR(t8 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
RRRR(t9 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 },
RRRR(t10 ) = {t10 },
RRRR(t11 ) = {t1 , t2 , t6 , t7 , t8 , t9 , t11 }
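
The repeated upper approximation described in Section 3 can be sketched as a fixed-point iteration. The code below is an illustrative reading of that description (each pass replaces the current set by the union of $R(t)$ over its members), not the implementation of [11]; it starts from the similarity classes listed above, with transactions written as integers, and the function names are assumptions.

```python
from typing import Dict, Set, Tuple

# Similarity classes of example b at threshold 0.5 (t_i written as i).
R: Dict[int, Set[int]] = {
    1: {1, 7, 8, 11}, 2: {2, 7, 8, 11}, 3: {3}, 4: {4}, 5: {5},
    6: {6, 9}, 7: {1, 2, 7, 8, 9, 11}, 8: {1, 2, 7, 8, 11},
    9: {6, 7, 9}, 10: {10}, 11: {1, 2, 7, 8, 11},
}


def upper(p: Set[int], classes: Dict[int, Set[int]]) -> Set[int]:
    """Upper approximation of P: union of R(t) over t in P (Definition 4)."""
    out: Set[int] = set()
    for member in p:
        out |= classes[member]
    return out


def similarity_upper_approximations(
    classes: Dict[int, Set[int]],
) -> Tuple[Dict[int, Set[int]], int]:
    """Iterate until two consecutive upper approximations coincide."""
    current = {k: set(v) for k, v in classes.items()}
    passes = 0
    while True:
        following = {k: upper(v, classes) for k, v in current.items()}
        passes += 1
        if following == current:
            return current, passes
        current = following


S, passes = similarity_upper_approximations(R)
distinct = {frozenset(v) for v in S.values()}
print(len(distinct), passes)
```

The number of passes depends on how long the chains of overlapping similarity classes are, which is the cost that the RoCeT model aims to avoid.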
Hence, two consecutive upper approximations for $\{t_i\}$, $i = 1, 2, \ldots, 11$, are the same. Therefore, we get the similarity upper approximations as follows.

$S_1 = S_2 = S_6 = S_7 = S_8 = S_9 = S_{11} = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$S_3 = \{t_3\}$,
$S_4 = \{t_4\}$,
$S_5 = \{t_5\}$,
$S_{10} = \{t_{10}\}$.

Since $S_i = S_j \neq S_k$, where $i, j = 1, 2, 6, 7, 8, 9, 11$ and $k = 3, 4, 5, 10$, then according to [11] there are five clusters: $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$ and $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

For the proposed method, the intersections of the similarity classes are summarized in Table 3. From Table 3, notice that $R(t_i) \cap R(t_j) \neq \emptyset$ for $i \neq j$; $i, j = 1, 2, 6, 7, 8, 9, 11$, and $R(t_k) \cap R(t_l) = \emptyset$ for $k \neq l$; $k = 1, 2, \ldots, 11$; $l = 3, 4, 5, 10$. We get the clusters as follows.

$S_1 = S_2 = S_6 = S_7 = S_8 = S_9 = S_{11} = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$S_3 = \{t_3\}$,
$S_4 = \{t_4\}$,
$S_5 = \{t_5\}$,
$S_{10} = \{t_{10}\}$.

The five clusters are $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$. These are the same clusters as those in [11]. However, the number of iterations is lower than that of the technique proposed by [11]. With the given threshold value, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$ are segregated clusters, but the transaction data may still contain related transactions among these clusters. We therefore propose an alternative technique that handles this problem by giving a second threshold value. We take $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ as the first cluster under the first threshold value and, for the remainder $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$, we give a second threshold value and group the remaining transactions by similarity. The similarities of the remaining transactions are shown below.

$sim(t_3, t_4) = 0$,
$sim(t_3, t_5) = 0.33$,
$sim(t_3, t_{10}) = 0.25$,
$sim(t_4, t_5) = 0$,
$sim(t_4, t_{10}) = 0$,
$sim(t_5, t_{10}) = 0$.

Given a second threshold value of 0.3, the similarity classes are as follows.

$R(t_3) = \{t_3, t_5\}$,
$R(t_4) = \{t_4\}$,
$R(t_5) = \{t_3, t_5\}$,
$R(t_{10}) = \{t_{10}\}$.

The intersections of the similarity classes are summarized in Figure 2.

$R(t_3) \cap R(t_4) = \{t_3, t_5\} \cap \{t_4\} = \emptyset$,
$R(t_3) \cap R(t_5) = \{t_3, t_5\} \cap \{t_3, t_5\} = \{t_3, t_5\}$,
$R(t_3) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$,
$R(t_4) \cap R(t_5) = \{t_4\} \cap \{t_3, t_5\} = \emptyset$,
$R(t_4) \cap R(t_{10}) = \{t_4\} \cap \{t_{10}\} = \emptyset$,
$R(t_5) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$.

Fig.2. The intersection of similarity classes

Based on Figure 2, we see that $R(t_3) \cap R(t_5) \neq \emptyset$ and $R(t_i) \cap R(t_j) = \emptyset$ for $i \neq j$, $i = 3, 4, 5$; $j = 4, 10$. We get the clusters $S_3 = \{t_3, t_5\}$, $S_4 = \{t_4\}$, $S_5 = \{t_3, t_5\}$, $S_{10} = \{t_{10}\}$. Hence, the three clusters are $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$. Overall, for both of the given threshold values, we have four clusters: $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$.

The purity of clusters is used as a measure to test the quality of the clusters [11]. The purity of a cluster and the overall purity are defined as
$$Purity(i) = \frac{\sum t_i^{th}}{\sum t_n},$$
where
$\sum t_i^{th}$: the number of data occurring in the $i$th cluster under the given threshold,
$\sum t_n$: the number of data in the data set,
and
$$Overall\ Purity = \frac{\sum_{i=1}^{\#\ of\ cluster} Purity(i)}{\#\ of\ cluster}.$$

According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 100%. The RoCeT model and the algorithm of [11] for clustering web transactions are implemented in MATLAB version 7.6.0.324 (R2008a).
They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is 1 gigabyte and the operating system is Windows XP Professional SP3. The purity of the clusters is described in Figure 4.

Table 3. The intersection of similarity classes (rows and columns: transactions t1 to t11)

Fig.3. Visualization of example 2 (panels: by given threshold 0.5; after given second threshold 0.3; transactions against clusters)

Cluster          Member transactions              Purity
1                t1, t2, t6, t7, t8, t9, t11      1
2                t3, t5                           1
3                t4                               1
4                t10                              1
Overall Purity                                    100 %

Fig.4. The Purity of clusters

Data transaction         Overall Purity
The technique of [11]    81.82 %
RoCeT                    100 %

Fig.5. The Overall Purity

The comparisons of computation and response time of RoCeT and [11] on the transaction data set from Table 2 are given in Figures 6 and 7, respectively.

Fig.6. The Computation (the technique of [11] against RoCeT)

Fig.7. The Response Time (in seconds; the technique of [11] against RoCeT)

Based on Figure 8, the RoCeT model provides better solutions than the algorithm of [11].

Measure        Improvement
Purity         18.18 %
Computation    46.30 %
Time           80.77 %

Fig.8. The overall improvement over [11] by RoCeT
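
The two-threshold procedure of example b can be summarized in a small Python sketch. It assumes that the "remainder" consists of the transactions left in singleton clusters after the first threshold; this reading, and the names `cluster` and `two_stage`, are assumptions made for illustration rather than the authors' MATLAB implementation.

```python
from typing import Dict, List, Set

# Table 2 as sets of hyperlink indices (t_i written as i, hl_k as k).
T2: Dict[int, Set[int]] = {
    1: {1, 3, 4}, 2: {3, 4, 5, 6}, 3: {2, 5, 6}, 4: {1}, 5: {5}, 6: {2, 3},
    7: {3, 4}, 8: {3, 4, 6}, 9: {3}, 10: {2, 4}, 11: {3, 4, 5},
}


def jaccard(t: Set[int], s: Set[int]) -> float:
    return len(t & s) / len(t | s)


def cluster(data: Dict[int, Set[int]], th: float) -> List[Set[int]]:
    """Union of similarity classes with non-void intersection."""
    clusters: List[Set[int]] = []
    for a in data:
        merged = {b for b in data if jaccard(data[a], data[b]) >= th}  # R(t_a)
        untouched = []
        for c in clusters:
            if c & merged:
                merged |= c
            else:
                untouched.append(c)
        clusters = untouched + [merged]
    return clusters


def two_stage(data: Dict[int, Set[int]], th1: float, th2: float) -> List[Set[int]]:
    """Cluster at th1, then re-cluster the singleton remainder at th2."""
    first = cluster(data, th1)
    kept = [c for c in first if len(c) > 1]
    remainder = {t: data[t] for c in first if len(c) == 1 for t in c}
    return kept + cluster(remainder, th2)


first_stage = cluster(T2, 0.5)
final = two_stage(T2, 0.5, 0.3)
print(sorted(sorted(c) for c in final))
```

The second stage only compares the remaining transactions with each other, so it adds little work on top of the first stage.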
5 Experiment test

In order to test the RoCeT model and compare it with the algorithm of [11], we use a web log data set from:
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html.
The data describe the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category and in chronological order. The data come from the Internet Information Server (IIS) logs for msnbc.com. Each row in the data set corresponds to the page visits of a user within a twenty-four hour period. Each item in a row corresponds to a request of a user for a page. Client-side cached data are not recorded; thus, the data contain only the server-side log. From almost one million transactions, we take 2000 transactions and split them into five categories: 100, 200, 500, 1000 and 2000. The comparison of response times is captured in Figure 9 and that of computation is given in Figure 10.
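
To make the preparation of the data concrete, the following sketch turns rows of page-category identifiers into transactions (sets of categories) and draws subsets of different sizes. The row layout assumed here (whitespace-separated integers, with any non-numeric header lines skipped) and the function name `parse_sessions` are assumptions; the sample rows are invented and are not taken from the data set.

```python
from typing import Iterable, List, Set


def parse_sessions(lines: Iterable[str], limit: int) -> List[Set[int]]:
    """Read up to `limit` rows (limit > 0) as sets of category identifiers.

    Repeated visits to the same category collapse into one item, which
    matches the binary vector representation of a transaction.
    """
    sessions: List[Set[int]] = []
    for line in lines:
        tokens = line.split()
        # Skip blank lines and anything that is not a pure row of integers.
        if not tokens or not all(tok.isdigit() for tok in tokens):
            continue
        sessions.append({int(tok) for tok in tokens})
        if len(sessions) == limit:
            break
    return sessions


# Hypothetical rows in the assumed layout.
sample = [
    "% hypothetical header line",
    "",
    "1 1",
    "2",
    "3 2 2 4 2 2 2 3 3",
    "6 7 7 7 6 6 8 8 8 8",
]

subsets = {n: parse_sessions(sample, n) for n in (2, 4)}
print(subsets)
```

With the real file, the same function could be called with limits of 100, 200, 500, 1000 and 2000 to obtain the five categories used in the experiment.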
Table 4. The Purity of clusters

Number of       The RoCeT    The technique    Improvement
transactions    model        of [11]
100             100%         93.0%            7.0%
200             100%         96.0%            4.0%
500             100%         95.5%            0.5%
1000            100%         95.5%            0.5%
2000            100%         99.9%            0.1%
Average                                       2.5%

Table 5. The executing time (seconds)

Number of       The RoCeT    The technique    Improvement
transactions    model        of [11]
100             1.6969       6.250            68.79%
200             9.093        6.250            66.25%
500             77.266       163.760          48.35%
1000            554.426      2205.100         65.10%
2000            3043.500     9780.900         64.97%
Average                                       62.69%

Table 6. The Computation

Number of       The RoCeT    The technique    Improvement
transactions    model        of [11]
100             8806         28213            68.50%
200             39349        116576           68.39%
500             257003       497595           52.82%
1000            1034964      2965579          74.86%
2000            4161122      11879645         68.88%
Average                                       69.69%

Fig.9. The executing time (response time in seconds against the number of transactions)

Fig.10. The Computation (computation against the number of transactions, for the technique of [11] and RoCeT)

Fig.11. Visualization of 100 transactions (panels: 1st threshold 0.6; after given 2nd threshold 0.3; transactions against clusters)

Fig.12. Visualization of 200 transactions (transactions against clusters)

Fig.13. Visualization of 500 transactions (transactions against clusters)

Fig.14. Visualization of 1000 transactions (transactions against clusters)

Fig.15. Visualization of 2000 transactions (transactions against clusters)

6 Conclusion

A web clustering technique can be applied to find interesting user access patterns in web logs. In this paper, we have proposed the RoCeT model for clustering web transactions using rough set theory, based on the similarity between two transactions. The analysis of the RoCeT model was presented in terms of computation, processing time and cluster purity. We elaborated the proposed technique on UCI benchmark data, i.e., the msnbc.com web log data. It is shown that the RoCeT model requires a significantly lower response time, up to 62.69 %, as compared to the technique of [11]. Meanwhile, for cluster purity it performs better by up to 2.5 %.

7 Acknowledgement

This work was supported by the FRGS under the Grant No. Vote 0402, Ministry of Higher Education, Malaysia.

References

[1] Pal, S.K., Talwar, V. and Mitra, P., (2002) Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions, IEEE Transactions on Neural Networks, 13 (5), 1163-1177.
[2] Bucher, A.G. and Mulvenna, M.D., (1998) Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, SIGMOD Record, 27 (4), 54-61.
[3] Cohen, E., Krishnamurthy, B. and Rexford, J., (1998) Improving end-to-end performance of the web using server volumes and proxy filters, Proceeding of the ACM SIGCOMM. Vancouver, British Columbia, Canada: ACM Press.
[4] Joachims, T., Freitag, D. and Mitchell, T., (1997) Webwatcher: A tour guide for the world wide web, In the 15th International Joint Conference on Artificial Intelligence (IJCAI97), Nagoya, Japan.
[5] Lieberman, H., (1995) Letizia: An agent that assists web browsing, Proceeding of the 1995 International Joint Conference on Artificial Intelligence. Montreal, Canada: Morgan Kaufmann.
[6] Mobasher, B., Cooley, R., and Srivastava, J., (1999) Creating adaptive web sites through usage based clustering of URLs, Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society.
[7] Ngu, D.S.W. and Wu, X., (1997) Sitehelper: A localized agent that helps incremental exploration of the world wide web, Proceeding of 6th International World Wide Web Conference. Santa Clara, CA: ACM Press.
[8] Perkowitz, M. and Etzioni, O., (1998) Adaptive Web Sites: Automatically Synthesizing Web Pages, Proceedings of the 15th National Conference on Artificial Intelligence. Madison, WI: AAAI.
[9] Yanchun, Z., Guandong, X. and Xiaofang, Z., (2005) A Latent Usage Approach for Clustering Web Transaction and Building User Profile, Springer-Verlag Berlin Heidelberg, 31-42.
[10] Han, E. et al., (1998) Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results, IEEE Data Engineering Bulletin, 21 (11), 15-22.
[11] De, S.K. and Krishna, P.R., (2004) Clustering web transactions using rough approximation, Fuzzy Sets and Systems, 148, 131-138.
[12] Pawlak, Z., (1982) Rough sets, International Journal of Computer and Information Science, 11, 341-356.
[13] Pawlak, Z., (1991) Rough sets: A theoretical aspect of reasoning about data, Kluwer Academic Publisher.
[14] Pawlak, Z. and Skowron, A., (2007) Rudiments of rough sets, Information Sciences. An International Journal, 177 (1), 3-27.

Iwan Tri Riyadi Yanto
He is a Master candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD, and Real Analysis.

Tutut Herawan
He is a Ph.D. candidate in Data
Mining at Universiti Tun Hussein
Onn Malaysia (UTHM). His research area includes Data Mining,
KDD and Real Analysis.

Mustafa Mat Deris
He received the B.Sc. from University Putra Malaysia, the M.Sc. from the University of Bradford, England, and the Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues and data mining. He has published more than 80 papers in journals and conference proceedings. He was appointed as one of the editorial board members of the International Journal of Information Technology, World Enformatika Society, and as a reviewer for a special issue of the International Journal of Parallel and Distributed Databases, Elsevier, 2004, a special issue of the International Journal of Cluster Computing, Kluwer, 2004, the IEEE conference on Cluster and Grid Computing, held in Chicago, April 2004, and the Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences/workshops including Grid and Peer-to-Peer Computing (GP2P 2005, 2006), Autonomic Distributed Data and Storage Systems Management (ADSM 2005, 2006), WSEAS, International Association of Science and Technology, IASTED on Database, etc.
