
IJBSCHS(2010-16-2-16)

[Original Article]

Biomedical Soft Computing and Human Sciences, Vol.16, No.2, pp.135-145


© Copyright 1995 Biomedical Fuzzy Systems Association
(Accepted on February 20,2010)

RoCeT: Rough Clustering for web Transactions


Iwan Tri Riyadi Yanto (1,3), Tutut Herawan (2,3), Mustafa Mat Deris (3)

(1) Department of Mathematics
(2) Department of Mathematics Education
Universitas Ahmad Dahlan, Yogyakarta, Indonesia

(3) Faculty of Information Technology and Multimedia
Universiti Tun Hussein Onn Malaysia, Johor, Malaysia

(received 31 December 2009, revised and accepted 20 February 2010)

Abstract: Grouping web transactions into clusters is important for a better understanding of users' behavior. Currently, a rough approximation-based clustering technique is used to group web transactions into clusters. However, processing time remains an issue because of the high complexity of finding the similarity of upper approximations of a transaction, which is used to merge two or more clusters. Moreover, the problem of more than one transaction falling under a given threshold is not addressed. In this paper, we propose the RoCeT model for grouping web transactions using rough set theory. It is based on merging two similarity classes that have a non-void intersection.
Keywords: Clustering, Web transactions, Rough set theory.

1 Introduction

Web usage data includes data from web server access logs, proxy server logs, browser logs, user profiles, registration files, user sessions or transactions, user queries, bookmark folders, mouse-clicks and scrolls, and any other data generated by the interaction of users and the web [1]. Generally, web mining techniques can be defined as methods that extract so-called nuggets (or knowledge) from a web data repository, such as content, linkage and usage information, by utilizing data mining tools. Among such web data, the user click stream, i.e. usage data, can mainly be utilized to capture users' navigation patterns and identify users' intended tasks. Once the users' navigational behaviors are effectively characterized, they benefit further web applications and, in turn, facilitate and improve web service quality for both web-based organizations and end users [2-9]. In web data mining research, many data mining techniques, such as clustering [8,10], are widely adopted to improve the usability and scalability of web mining.

Access transactions over the web can be expressed in two finite sets, user transactions and hyperlinks/URLs [11]. A user transaction is a sequence of items. The set $U$ is formed by $m$ users and the set $A$ is the set of $n$ distinct clicks (hyperlinks/URLs) clicked by users, that is, $U = \{t_1, t_2, \ldots, t_m\}$ and $A = \{hl_1, hl_2, \ldots, hl_n\}$, where every $t_i \in T \subseteq U$ is a non-empty subset of $U$. The temporal order of users' clicks within transactions has been taken into account. A user transaction $t \in T$ is represented as a vector $t = \langle u^t_1, u^t_2, \ldots, u^t_n \rangle$, where $u^t_i = 1$ if $hl_i \in t$, and $u^t_i = 0$ otherwise.

A well-known approach for clustering web transactions uses rough set theory [12-14]. De and Krishna [11] proposed an algorithm for clustering web transactions using rough approximation. It is based on the similarity of upper approximations of transactions under a given threshold. However, several iterations are needed to merge two or more clusters that have the same similarity upper approximations, and the authors did not show how to handle the problem of more than one transaction falling under the given threshold. To overcome those problems, in this paper we propose an alternative technique for clustering web transactions. We use the concept of similarity class proposed by [11]. However, the RoCeT model differs in how it allocates transactions to the same cluster and in how it handles the problem of more than one transaction falling under the given threshold.

The rest of the paper is organized as follows. Section 2 describes the concept of rough set theory. Section 3 describes the work of [11]. Section 4 describes the RoCeT model. Section 5 describes the experimental test. Finally, we conclude our work in Section 6.

1 Faculty of Information Technology and Multimedia,
UTHM, Parit Raja, Batu Pahat 86400, Johor.
Phone : +60177061496
Email : iwan015@gmail.com

I.T.R Yanto, et al.: RoCeT: Rough Clustering for web Transactions

2 Rough Set Theory

An information system is a 4-tuple (quadruple) $S = (U, A, V, f)$, where $U = \{u_1, u_2, u_3, \ldots, u_{|U|}\}$ is a non-empty finite set of objects, $A = \{a_1, a_2, a_3, \ldots, a_{|A|}\}$ is a non-empty finite set of attributes, $V = \bigcup_{a \in A} V_a$, where $V_a$ is the domain (value set) of attribute $a$, and $f : U \times A \to V$ is a function such that $f(u, a) \in V_a$ for every $(u, a) \in U \times A$, called the information (knowledge) function. The starting point of rough set approximations is the indiscernibility relation, which is generated by information about objects of interest. Two objects in an information system are called indiscernible (indistinguishable or similar) if they have the same features.

Definition 1 Two elements $x, y \in U$ are said to be $B$-indiscernible (indiscernible by the set of attributes $B \subseteq A$ in $S$) if and only if $f(x, a) = f(y, a)$ for every $a \in B$.

Obviously, every subset of $A$ induces a unique indiscernibility relation. Notice that the indiscernibility relation induced by the set of attributes $B$, denoted by $IND(B)$, is an equivalence relation. The partition of $U$ induced by $IND(B)$ is denoted by $U/B$, and the equivalence class in the partition $U/B$ containing $x \in U$ is denoted by $[x]_B$. The notions of lower and upper approximations of a set are defined as follows.

Definition 2 (See [14].) The $B$-lower approximation of $X$, denoted by $\underline{B}(X)$, and the $B$-upper approximation of $X$, denoted by $\overline{B}(X)$, are respectively defined by

$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\} \quad \text{and} \quad \overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}.$$

The accuracy of approximation (accuracy of roughness) of any subset $X \subseteq U$ with respect to $B \subseteq A$, denoted $\alpha_B(X)$, is measured by

$$\alpha_B(X) = \frac{|\underline{B}(X)|}{|\overline{B}(X)|},$$

where $|X|$ denotes the cardinality of $X$. For the empty set $\emptyset$, we define $\alpha_B(\emptyset) = 1$. Obviously, $0 \leq \alpha_B(X) \leq 1$. If $X$ is a union of some equivalence classes of $U$, then $\alpha_B(X) = 1$. Thus, the set $X$ is crisp (precise) with respect to $B$. If $X$ is not a union of some equivalence classes of $U$, then $\alpha_B(X) < 1$. Thus, the set $X$ is rough (imprecise) with respect to $B$ [13]. This means that the higher the accuracy of approximation of a subset $X \subseteq U$, the more precise (the less imprecise) it is.

3 Related Work

In this section, we discuss the technique proposed by [11]. Given two transactions $t$ and $s$, the measure of similarity between $t$ and $s$ is given by $\mathrm{sim}(t, s) = |t \cap s| / |t \cup s|$. Obviously, $\mathrm{sim}(t, s) \in [0, 1]$, where $\mathrm{sim}(t, s) = 1$ when the two transactions $t$ and $s$ are exactly identical and $\mathrm{sim}(t, s) = 0$ when the two transactions $t$ and $s$ have no items in common. De and Krishna [11] used a binary relation $R$ on $T$ defined as follows. For any threshold value $th \in [0, 1]$ and for any two user transactions $t, s \in T$, the binary relation $R$ on $T$ is given by $tRs$ iff $\mathrm{sim}(t, s) \geq th$. This relation $R$ is a tolerance relation, as $R$ is both reflexive and symmetric, but transitivity may not always hold.

Definition 3 The similarity class of $t$, denoted by $R(t)$, is the set of transactions which are similar to $t$, given by $R(t) = \{s \in T : sRt\}$.

For different threshold values, one can get different similarity classes. A domain expert can choose the threshold based on his or her experience to get a proper similarity class. It is clear that, for a fixed threshold in $[0, 1]$, a transaction from a given similarity class may be similar to an object of another similarity class.

Definition 4 Let $P \subseteq T$ and, for a fixed threshold in $[0, 1]$, let $R$ be a binary tolerance relation defined on $T$. The lower approximation of $P$, denoted by $\underline{R}(P)$, and the upper approximation of $P$, denoted by $\overline{R}(P)$, are respectively defined as

$$\underline{R}(P) = \{t \in P : R(t) \subseteq P\} \quad \text{and} \quad \overline{R}(P) = \bigcup_{t \in P} R(t).$$

They proposed a technique for clustering the clicks of user navigations called the similarity upper approximation, denoted by $S_i$. The set of transactions that are possibly similar to $R(t_i)$ is denoted by $RR(t_i)$. This process continues until two consecutive upper approximations for $t_i$, $i = 1, 2, 3, \ldots, |U|$, are the same, and two or more clusters that have the same similarity upper approximations are merged at each iteration. This technique requires high computational complexity to cluster the transactions, because the similarity upper approximation must be recomputed until two consecutive upper approximations are the same. To overcome this problem, we propose an alternative technique to cluster the transactions.

4 The RoCeT model

The RoCeT model for clustering the transactions is based on all the transactions possibly similar to the similarity class of $t$, $R(t)$. Two similarity classes with a non-void intersection are united into the same cluster. The justification that a cluster is a union of similarity classes with non-void intersection is presented in Proposition 6.
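The similarity measure, the similarity class of Definition 3 and the approximations of Definition 4 can be illustrated with a minimal Python sketch. The toy transactions and all function names below are hypothetical and chosen only for illustration; they are not taken from [11] or from the authors' MATLAB implementation.

```python
def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two transactions given as sets of hyperlinks."""
    return len(t & s) / len(t | s)  # transactions are assumed to be non-empty


def similarity_class(name, transactions, th):
    """R(t): all transactions s with sim(t, s) >= th (t itself always qualifies)."""
    t = transactions[name]
    return {other for other, s in transactions.items() if sim(t, s) >= th}


def lower_approximation(P, classes):
    """Transactions of P whose whole similarity class lies inside P."""
    return {t for t in P if classes[t] <= P}


def upper_approximation(P, classes):
    """Union of the similarity classes of the transactions of P."""
    result = set()
    for t in P:
        result |= classes[t]
    return result


# Hypothetical toy data, not taken from the paper.
transactions = {
    "a": {"x", "y"},
    "b": {"x", "y", "z"},
    "c": {"y", "z", "w"},
    "d": {"q"},
}
classes = {name: similarity_class(name, transactions, th=0.5) for name in transactions}
P = {"a", "b"}
print(classes)
print(lower_approximation(P, classes))
print(upper_approximation(P, classes))
```

In this toy data, a is similar to b and b is similar to c, but a is not similar to c, which shows why the tolerance relation need not be transitive.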

Definition 5 Two clusters $S_i$ and $S_j$, $i \neq j$, are said to be the same if

$$S_i = \bigcup R(t_i), \quad i = 1, 2, 3, \ldots, |U|.$$

Proposition 6 Let $S_i$ be a cluster. If $\bigcap R(t_i) \neq \emptyset$, then $\bigcup R(t_i) = S_i$.

Proof. Suppose that $S_i$ and $S_j$, where $i \neq j$, are the same cluster. From Definition 5, if $\bigcup R(t_i) \neq S_i$, then we have

$$S_i \neq S_j \neq \bigcup R(t_j),$$
$$\bigcup R(t_i) \neq \bigcup R(t_j),$$
$$\left(\bigcup R(t_i)\right) \cap \left(\bigcup R(t_j)\right) = \emptyset,$$
$$\bigcap R(t_i) = \emptyset.$$

This contradicts the hypothesis. $\blacksquare$

4.1 Complexity

Suppose that in an information system $S = (U, A, V, f)$ there are $|U|$ objects, which means there are at most $|U|$ similarity classes. The number of computations of the similarity classes $R(t_i)$ on $R(t_j)$, where $i \neq j$, is $|U| \times (|U| - 1)$. Thus, the overall computational complexity of the RoCeT model is polynomial, $O(|U| \times (|U| - 1))$.

4.2 Example

In this study, the RoCeT model and the technique proposed by [11] are compared on two examples, using two small data sets of transactions.

a. The first transaction data set is adopted from [11] and given in Table 1. It contains four objects ($|U| = 4$) with five hyperlinks ($|A| = 5$).

Table 1. Data transactions
U/A   hl1  hl2  hl3  hl4  hl5
t1    1    1    0    0    0
t2    0    1    1    1    0
t3    1    0    1    0    1
t4    0    1    1    0    1

The technique of [11] needs three main steps. The first step is obtaining the measure of similarity, which gives information about the users' access patterns related to their common areas of interest through the similarity relation between two transactions. The calculations of the similarity measure for the web transactions of Table 1 are given below.

$$\mathrm{sim}(t_1, t_2) = \frac{|t_1 \cap t_2|}{|t_1 \cup t_2|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_4\}|} = 0.25,$$
$$\mathrm{sim}(t_1, t_3) = \frac{|t_1 \cap t_3|}{|t_1 \cup t_3|} = \frac{|\{hl_1\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$
$$\mathrm{sim}(t_1, t_4) = \frac{|t_1 \cap t_4|}{|t_1 \cup t_4|} = \frac{|\{hl_2\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.25,$$
$$\mathrm{sim}(t_2, t_3) = \frac{|t_2 \cap t_3|}{|t_2 \cup t_3|} = \frac{|\{hl_3\}|}{|\{hl_1, hl_2, hl_3, hl_4, hl_5\}|} = 0.2,$$
$$\mathrm{sim}(t_2, t_4) = \frac{|t_2 \cap t_4|}{|t_2 \cup t_4|} = \frac{|\{hl_2, hl_3\}|}{|\{hl_2, hl_3, hl_4, hl_5\}|} = 0.5,$$
$$\mathrm{sim}(t_3, t_4) = \frac{|t_3 \cap t_4|}{|t_3 \cup t_4|} = \frac{|\{hl_3, hl_5\}|}{|\{hl_1, hl_2, hl_3, hl_5\}|} = 0.5.$$

Second, the similarity classes are obtained for the given threshold value using Definition 3. With the threshold value 0.5, we get the following similarity classes.

$R(t_1) = \{t_1\}$,
$R(t_2) = \{t_2, t_4\}$,
$R(t_3) = \{t_3, t_4\}$,
$R(t_4) = \{t_2, t_3, t_4\}$.

The last step is to cluster the transactions. To get the clusters, [11] used the similarity upper approximations, and the process is shown below.

$R(t_1) = \{t_1\}$,
$R(t_2) = \{t_2, t_4\}$,
$R(t_3) = \{t_3, t_4\}$,
$R(t_4) = \{t_2, t_3, t_4\}$,
$RR(t_1) = \{t_1\}$,
$RR(t_2) = \{t_2, t_3, t_4\}$,
$RR(t_3) = \{t_2, t_3, t_4\}$,
$RR(t_4) = \{t_2, t_3, t_4\}$,
$RRR(t_2) = \{t_2, t_3, t_4\}$,
$RRR(t_3) = \{t_2, t_3, t_4\}$.

Here, we can see that two consecutive upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ are the same. Thus, we get the similarity upper approximations for $\{t_1\}$, $\{t_2\}$, $\{t_3\}$ and $\{t_4\}$ as follows.

$S_1 = \{t_1\}$,
$S_2 = \{t_2, t_3, t_4\}$,
$S_3 = \{t_2, t_3, t_4\}$,
$S_4 = \{t_2, t_3, t_4\}$,

where $S_2 = S_3 = S_4$ and $S_1 \neq S_i$ for $i = 2, 3, 4$. Finally, we get the two clusters $\{t_1\}$ and $\{t_2, t_3, t_4\}$.

The RoCeT model, however, is based on non-void intersection. According to Definition 5 and the similarity classes as used in [11], only a few computations are needed to get the clusters. Therefore, the RoCeT model clusters the transactions more efficiently than the technique of [11]. The calculation of the similarity relation is shown in Figure 1.

$R(t_1) \cap R(t_2) = \{t_1\} \cap \{t_2, t_4\} = \emptyset$,
$R(t_1) \cap R(t_3) = \{t_1\} \cap \{t_3, t_4\} = \emptyset$,
$R(t_1) \cap R(t_4) = \{t_1\} \cap \{t_2, t_3, t_4\} = \emptyset$,
$R(t_2) \cap R(t_3) = \{t_2, t_4\} \cap \{t_3, t_4\} = \{t_4\}$,
$R(t_2) \cap R(t_4) = \{t_2, t_4\} \cap \{t_2, t_3, t_4\} = \{t_2, t_4\}$,
$R(t_3) \cap R(t_4) = \{t_3, t_4\} \cap \{t_2, t_3, t_4\} = \{t_3, t_4\}$.
Fig.1. The similarity relation

Here, we can see that $R(t_i) \cap R(t_j) \neq \emptyset$, $i \neq j$, for $i, j = 2, 3, 4$. We get the clusters as follows.

$S_1 = \bigcup R(t_1) = \{t_1\}$,
$S_2 = S_3 = S_4 = \bigcup R(t_i)$, $i = 2, 3, 4$,
$\bigcup R(t_i) = \{t_2, t_4\} \cup \{t_3, t_4\} \cup \{t_2, t_3, t_4\} = \{t_2, t_3, t_4\}$.

Hence, the two clusters are $\{t_1\}$ and $\{t_2, t_3, t_4\}$.
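The first example can be retraced with a short Python sketch. It assumes that "the union of similarity classes with non-void intersection" means repeatedly merging every pair of overlapping classes until the remaining clusters are pairwise disjoint; this reading, and the names `TABLE_1`, `similarity_classes` and `merge_overlapping`, are illustrative assumptions rather than the authors' implementation.

```python
# Table 1 written as sets of visited hyperlinks.
TABLE_1 = {
    "t1": {"hl1", "hl2"},
    "t2": {"hl2", "hl3", "hl4"},
    "t3": {"hl1", "hl3", "hl5"},
    "t4": {"hl2", "hl3", "hl5"},
}


def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two non-empty transactions."""
    return len(t & s) / len(t | s)


def similarity_classes(transactions, th):
    """Map each transaction name to its similarity class R(t) for threshold th."""
    return {
        a: {b for b in transactions if sim(transactions[a], transactions[b]) >= th}
        for a in transactions
    }


def merge_overlapping(classes):
    """Unite similarity classes that share at least one transaction.

    The clusters kept in the list are always pairwise disjoint, so one pass
    over them per class is enough to absorb everything the class touches.
    """
    clusters = []
    for cls in classes.values():
        merged, rest = set(cls), []
        for cluster in clusters:
            if cluster & cls:
                merged |= cluster   # non-void intersection: same cluster
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


classes = similarity_classes(TABLE_1, th=0.5)
clusters = merge_overlapping(classes)
print(sorted(sorted(c) for c in clusters))
```

Under these assumptions the sketch yields the same two clusters as the worked example, without iterating the upper approximations.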
b. The second transaction data set is given in Table 2. It contains eleven objects ($|U| = 11$) with six hyperlinks ($|A| = 6$).

Table 2. Data transactions
U/A   hl1  hl2  hl3  hl4  hl5  hl6
t1    1    0    1    1    0    0
t2    0    0    1    1    1    1
t3    0    1    0    0    1    1
t4    1    0    0    0    0    0
t5    0    0    0    0    1    0
t6    0    1    1    0    0    0
t7    0    0    1    1    0    0
t8    0    0    1    1    0    1
t9    0    0    1    0    0    0
t10   0    1    0    1    0    0
t11   0    0    1    1    1    0

The similarities $\mathrm{sim}(t_i, t_j)$ of the transactions are given below (row $t_i$, column $t_j$).

       t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
t1     0.40  0     0.33  0     0.25  0.67  0.50  0.33  0.25  0.50
t2           0.40  0     0.25  0.20  0.50  0.75  0.25  0.20  0.75
t3                 0     0.33  0.25  0     0.20  0     0.25  0.20
t4                       0     0     0     0     0     0     0
t5                             0     0     0     0     0     0.33
t6                                   0.33  0.25  0.50  0.33  0.25
t7                                         0.67  0.50  0.33  0.67
t8                                               0.33  0.25  0.50
t9                                                     0     0.33
t10                                                          0.25

Given the threshold value 0.5, the similarity classes are as follows.

$R(t_1) = \{t_1, t_7, t_8, t_{11}\}$,
$R(t_2) = \{t_2, t_7, t_8, t_{11}\}$,
$R(t_3) = \{t_3\}$,
$R(t_4) = \{t_4\}$,
$R(t_5) = \{t_5\}$,
$R(t_6) = \{t_6, t_9\}$,
$R(t_7) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$,
$R(t_8) = \{t_1, t_2, t_7, t_8, t_{11}\}$,
$R(t_9) = \{t_6, t_7, t_9\}$,
$R(t_{10}) = \{t_{10}\}$,
$R(t_{11}) = \{t_1, t_2, t_7, t_8, t_{11}\}$.

The process of finding the similarity upper approximations of each transaction, starting from the similarity classes $R(t_i)$ above, can be illustrated as follows.

$RR(t_1) = RR(t_2) = RR(t_8) = RR(t_{11}) = \{t_1, t_2, t_7, t_8, t_9, t_{11}\}$,
$RR(t_6) = RR(t_7) = RR(t_9) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$,
$RR(t_3) = \{t_3\}$, $RR(t_4) = \{t_4\}$, $RR(t_5) = \{t_5\}$, $RR(t_{10}) = \{t_{10}\}$,

$RRR(t_i) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ for $i = 1, 2, 6, 7, 8, 9, 11$,
$RRR(t_3) = \{t_3\}$, $RRR(t_4) = \{t_4\}$, $RRR(t_5) = \{t_5\}$, $RRR(t_{10}) = \{t_{10}\}$,

$RRRR(t_i) = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ for $i = 1, 2, 6, 7, 8, 9, 11$,
$RRRR(t_3) = \{t_3\}$, $RRRR(t_4) = \{t_4\}$, $RRRR(t_5) = \{t_5\}$, $RRRR(t_{10}) = \{t_{10}\}$.
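The iteration listed above can be sketched in Python as a fixed-point computation. The sketch assumes that each step replaces the current set by the union of the similarity classes of its members; this is one reading of the procedure of [11], not a transcription of it, so intermediate sets may differ from the listing while the final sets agree. The dictionary `CLASSES` copies the similarity classes for threshold 0.5 given in the text, with transactions written as their indices.

```python
# Similarity classes R(t_i) for threshold 0.5, copied from the text (indices only).
CLASSES = {
    1: {1, 7, 8, 11},
    2: {2, 7, 8, 11},
    3: {3},
    4: {4},
    5: {5},
    6: {6, 9},
    7: {1, 2, 7, 8, 9, 11},
    8: {1, 2, 7, 8, 11},
    9: {6, 7, 9},
    10: {10},
    11: {1, 2, 7, 8, 11},
}


def expand(current, classes):
    """One step: union of the similarity classes of all members of `current`."""
    result = set()
    for member in current:
        result |= classes[member]
    return result


def similarity_upper_approximation(t, classes):
    """Iterate from R(t) until two consecutive sets are the same."""
    current = set(classes[t])
    while True:
        following = expand(current, classes)
        if following == current:
            return current
        current = following


S = {t: similarity_upper_approximation(t, CLASSES) for t in CLASSES}
distinct = {frozenset(s) for s in S.values()}
print(sorted(sorted(s) for s in distinct))
```

Every transaction needs its own sequence of expansions here, which is the repeated work that motivates the alternative proposed in this paper.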

Hence, two consecutive upper approximations for $\{t_i\}$, $i = 1, 2, \ldots, 11$, are the same. Therefore, we get the similarity upper approximations as follows.

$S_i = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ for $i = 1, 2, 6, 7, 8, 9, 11$,
$S_3 = \{t_3\}$, $S_4 = \{t_4\}$, $S_5 = \{t_5\}$, $S_{10} = \{t_{10}\}$.

Since $S_i = S_j \neq S_k$, where $i, j = 1, 2, 6, 7, 8, 9, 11$ and $k = 3, 4, 5, 10$, according to [11] there are five clusters: $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$ and $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$.

For the proposed method, the intersections of similarity classes are summarized in Table 3. From Table 3, notice that $R(t_i) \cap R(t_j) \neq \emptyset$ for $i \neq j$; $i, j = 1, 2, 6, 7, 8, 9, 11$, and $R(t_k) \cap R(t_l) = \emptyset$ for $k \neq l$; $k = 1, 2, \ldots, 11$; $l = 3, 4, 5, 10$. We get the clusters as follows.

$S_i = \{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ for $i = 1, 2, 6, 7, 8, 9, 11$,
$S_3 = \{t_3\}$, $S_4 = \{t_4\}$, $S_5 = \{t_5\}$, $S_{10} = \{t_{10}\}$.

The five clusters are $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$. These are the same clusters as those in [11]. However, the number of iterations is lower than that of the technique proposed by [11].

Under the given threshold value, $\{t_3\}$, $\{t_4\}$, $\{t_5\}$ and $\{t_{10}\}$ are segregated clusters, but the transaction data suggest that there may be related transactions among these clusters. To handle this problem, we propose an alternative technique that applies a second threshold value. Therefore, we take $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$ as the first cluster under the first threshold value and, for the remaining clusters $\{t_3\}$, $\{t_4\}$, $\{t_5\}$, $\{t_{10}\}$, we apply the second threshold value and group the remaining transactions by similarity. The similarities of the remaining transactions are shown below.

$\mathrm{sim}(t_3, t_4) = 0$,
$\mathrm{sim}(t_3, t_5) = 0.33$,
$\mathrm{sim}(t_3, t_{10}) = 0.25$,
$\mathrm{sim}(t_4, t_5) = 0$,
$\mathrm{sim}(t_4, t_{10}) = 0$,
$\mathrm{sim}(t_5, t_{10}) = 0$.

Given a second threshold value of 0.3, the similarity classes are as follows.

$R(t_3) = \{t_3, t_5\}$,
$R(t_4) = \{t_4\}$,
$R(t_5) = \{t_3, t_5\}$,
$R(t_{10}) = \{t_{10}\}$.

The intersections of the similarity classes are summarized in Figure 2.

$R(t_3) \cap R(t_4) = \{t_3, t_5\} \cap \{t_4\} = \emptyset$,
$R(t_3) \cap R(t_5) = \{t_3, t_5\} \cap \{t_3, t_5\} = \{t_3, t_5\}$,
$R(t_3) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$,
$R(t_4) \cap R(t_5) = \{t_4\} \cap \{t_3, t_5\} = \emptyset$,
$R(t_4) \cap R(t_{10}) = \{t_4\} \cap \{t_{10}\} = \emptyset$,
$R(t_5) \cap R(t_{10}) = \{t_3, t_5\} \cap \{t_{10}\} = \emptyset$.
Fig.2. The intersection of similarity classes

Based on Figure 2, we see that $R(t_3) \cap R(t_5) \neq \emptyset$ and $R(t_i) \cap R(t_j) = \emptyset$ for $i \neq j$, $i = 3, 4, 5$; $j = 4, 10$. We get the clusters $S_3 = \{t_3, t_5\}$, $S_4 = \{t_4\}$, $S_5 = \{t_3, t_5\}$, $S_{10} = \{t_{10}\}$. Hence, the three clusters are $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$. Overall, for both threshold values we have four clusters: $\{t_1, t_2, t_6, t_7, t_8, t_9, t_{11}\}$, $\{t_3, t_5\}$, $\{t_4\}$ and $\{t_{10}\}$.

The purity of clusters was used as a measure to test the quality of the clusters [11]. The purity of a cluster and the overall purity are defined as

$$Purity(i) = \frac{\sum t_i^{th}}{\sum t_n},$$

where
$\sum t_i^{th}$: the number of data occurring in the $i$th cluster under the given threshold,
$\sum t_n$: the number of data in the data set,

$$Overall\ Purity = \frac{\sum_{i=1}^{\#\ \text{of cluster}} Purity(i)}{\#\ \text{of cluster}}.$$

According to this measure, a higher value of overall purity indicates a better clustering result, with perfect clustering yielding a value of 100%. The RoCeT model and the algorithm of [11] for clustering web transactions are implemented in MATLAB version 7.6.0.324 (R2008a).

Table 3. The intersection of similarity classes
(rows $t_2, \ldots, t_{11}$, columns $t_1, \ldots, t_{10}$; each cell lists the indices of the transactions in $R(t_i) \cap R(t_j)$, e.g. 7,8,11 for $R(t_1) \cap R(t_2)$, 6,9 for $R(t_6) \cap R(t_9)$ and 1,2,7,8,11 for $R(t_7) \cap R(t_8)$)

Fig.3. Visualization of example 2 (panels: by given threshold 0.5; after given second threshold 0.3; axes: clusters versus transactions)

They are executed sequentially on an Intel Core 2 Duo processor. The total main memory is 1 Gigabyte and the operating system is Windows XP Professional SP3. The purity of clusters is described in Figure 4.

Cluster   Member Transactions            Purity
1         t1, t2, t6, t7, t8, t9, t11    1
2         t3, t5                         1
3         t4                             1
4         t10                            1
Overall Purity                           100 %
Fig.4. The Purity of clusters

Data transaction          Overall Purity
The technique of [11]     81.82 %
RoCeT                     100 %
Fig.5. The Overall Purity

The comparisons of computation and response time of RoCeT and [11] on the transaction data set from Table 2 are given in Figures 6 and 7, respectively.

Fig.6. The Computation (bar chart: the technique of [11] versus RoCeT)

Fig.7. The Response Time (bar chart, in seconds: the technique of [11] versus RoCeT)

Based on Figure 8, the RoCeT model provides better solutions than the algorithm of [11].

Purity     Computation   Time
18.18 %    46.30 %       80.77 %
Fig.8. The overall improvement of RoCeT over [11]
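The two-threshold procedure of the second example can be sketched in Python as follows. The sketch assumes that the transactions passed to the second stage are exactly those left in single-member clusters after the first threshold, and that overlapping similarity classes are merged until the clusters are pairwise disjoint; both points, and the names `rocet_clusters` and `two_threshold_clusters`, are assumptions made for illustration and are not the authors' MATLAB code.

```python
# Table 2 written as sets of visited hyperlinks.
TABLE_2 = {
    "t1": {"hl1", "hl3", "hl4"},
    "t2": {"hl3", "hl4", "hl5", "hl6"},
    "t3": {"hl2", "hl5", "hl6"},
    "t4": {"hl1"},
    "t5": {"hl5"},
    "t6": {"hl2", "hl3"},
    "t7": {"hl3", "hl4"},
    "t8": {"hl3", "hl4", "hl6"},
    "t9": {"hl3"},
    "t10": {"hl2", "hl4"},
    "t11": {"hl3", "hl4", "hl5"},
}


def sim(t, s):
    """Similarity |t ∩ s| / |t ∪ s| of two non-empty transactions."""
    return len(t & s) / len(t | s)


def rocet_clusters(transactions, th):
    """Build the similarity classes for th and unite those that overlap."""
    names = list(transactions)
    classes = [
        {b for b in names if sim(transactions[a], transactions[b]) >= th}
        for a in names
    ]
    clusters = []  # kept pairwise disjoint
    for cls in classes:
        merged, rest = set(cls), []
        for cluster in clusters:
            if cluster & cls:
                merged |= cluster
            else:
                rest.append(cluster)
        clusters = rest + [merged]
    return clusters


def two_threshold_clusters(transactions, th1, th2):
    """Cluster with th1, then re-cluster the single-member leftovers with th2."""
    first = rocet_clusters(transactions, th1)
    kept = [c for c in first if len(c) > 1]
    leftover = {t: transactions[t] for c in first if len(c) == 1 for t in c}
    return kept + rocet_clusters(leftover, th2)


result = two_threshold_clusters(TABLE_2, th1=0.5, th2=0.3)
print(sorted(sorted(c) for c in result))
```

With thresholds 0.5 and 0.3 this sketch reproduces the four clusters reported for Table 2.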

5 Experiment test

In order to test the RoCeT model and compare it with the algorithm of [11], we use a web log data set from:
http://kdd.ics.uci.edu/databases/msnbc/msnbc.html.
The data describes the page visits by users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category and in chronological order. The data comes from Internet Information Server (IIS) logs for msnbc.com. Each row in the data set corresponds to the page visits of a user within a twenty-four hour period. Each item in a row corresponds to a request of a user for a page. The client-side cached data is not recorded, thus this data contains only the server-side log. From almost one million transactions, we take 2000 transactions and split them into five categories: 100, 200, 500, 1000 and 2000. The comparison of response times is captured in Figure 9 and the computation is given in Figure 10.

Table 4. The Purity of clusters
Number of      The RoCeT   The technique   Improvement
Transaction    model       of [11]
100            100%        93.0%           7.0%
200            100%        96.0%           4.0%
500            100%        95.5%           0.5%
1000           100%        95.5%           0.5%
2000           100%        99.9%           0.1%
Average                                    2.5%

Table 5. The executing time
Number of      The RoCeT   The technique   Improvement
Transaction    model       of [11]
100            1.6969      6.250           68.79%
200            9.093       6.250           66.25%
500            77.266      163.760         48.35%
1000           554.426     2205.100        65.10%
2000           3043.500    9780.900        64.97%
Average                                    62.69%

Table 6. The Computation
Number of      The RoCeT   The technique   Improvement
Transaction    model       of [11]
100            8806        28213           68.50%
200            39349       116576          68.39%
500            257003      497595          52.82%
1000           1034964     2965579         74.86%
2000           4161122     11879645        68.88%
Average                                    69.69%

Fig.9. The executing time (response time in seconds versus number of transactions, for the technique of [11] and RoCeT)

Fig.10. The Computation (computation versus number of transactions, for the technique of [11] and RoCeT)

Fig.11. Visualization of 100 transactions (panels: 1st threshold 0.6; after given 2nd threshold 0.3; axes: cluster versus transaction)

Fig.12. Visualization of 200 transactions (panels: 1st threshold 0.6; after given 2nd threshold 0.3; axes: cluster versus transaction)

Fig.13. Visualization of 500 transactions (panels: 1st threshold 0.6; after given 2nd threshold; axes: cluster versus transaction)

Fig.14. Visualization of 1000 transactions (panels: 1st threshold 0.6; after given 2nd threshold 0.3; axes: cluster versus transaction)

Fig.15. Visualization of 2000 transactions (panels: 1st threshold 0.6; after given 2nd threshold 0.3; axes: cluster versus transaction)

6 Conclusion

A web clustering technique can be applied to find interesting user access patterns in web logs. In this paper, we have proposed the RoCeT model for clustering web transactions using rough set theory, based on the similarity between two transactions. The analysis of the RoCeT model was presented in terms of computation, processing time and cluster purity. We evaluated the proposed technique on UCI benchmark data, i.e., the msnbc.com web log data. It is shown that the RoCeT model requires significantly lower response time, up to 62.69 %, as compared to the technique of [11]. Meanwhile, for cluster purity it performs better by up to 2.5 %.

7 Acknowledgement

This work was supported by the FRGS under the Grant No. Vote 0402, Ministry of Higher Education, Malaysia.

References

[1] Pal, S.K., Talwar, V. and Mitra, P., (2002) Web Mining in Soft Computing Framework: Relevance, State of the Art and Future Directions, IEEE Transactions on Neural Networks, 13 (5), 1163-1177.
[2] Büchner, A.G. and Mulvenna, M.D., (1998) Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, SIGMOD Record, 27 (4), 54-61.
[3] Cohen, E., Krishnamurthy, B. and Rexford, J., (1998) Improving end-to-end performance of the web using server volumes and proxy filters, Proceedings of the ACM SIGCOMM. Vancouver, British Columbia, Canada: ACM Press.
[4] Joachims, T., Freitag, D. and Mitchell, T., (1997) Webwatcher: A tour guide for the world wide web, In the 15th International Joint Conference on Artificial Intelligence (IJCAI97), Nagoya, Japan.
[5] Lieberman, H., (1995) Letizia: An agent that assists web browsing, Proceedings of the 1995 International Joint Conference on Artificial Intelligence. Montreal, Canada: Morgan Kaufmann.
[6] Mobasher, B., Cooley, R., and Srivastava, J., (1999) Creating adaptive web sites through usage based clustering of URLs, Proceedings of the 1999 Workshop on Knowledge and Data Engineering Exchange. IEEE Computer Society.
[7] Ngu, D.S.W. and Wu, X., (1997) Sitehelper: A localized agent that helps incremental exploration of the world wide web, Proceedings of the 6th International World Wide Web Conference. Santa Clara, CA: ACM Press.
[8] Perkowitz, M. and Etzioni, O., (1998) Adaptive Web Sites: Automatically Synthesizing Web Pages, Proceedings of the 15th National Conference on Artificial Intelligence. Madison, WI: AAAI.
[9] Z. Yanchun, X. Guandong and Z. Xiaofang, (2005) A Latent Usage Approach for Clustering Web Transaction and Building User Profile, Springer-Verlag Berlin Heidelberg, 31-42.
[10] Han, E. et al., (1998) Hypergraph Based Clustering in High-Dimensional Data Sets: A Summary of Results, IEEE Data Engineering Bulletin, 21 (11), 15-22.
[11] De, S.K. and Krishna, P.R., (2004) Clustering web transactions using rough approximation, Fuzzy Sets and Systems, 148, 131-138.
[12] Pawlak, Z., (1982) Rough sets, International Journal of Computer and Information Science, 11, 341-356.
[13] Pawlak, Z., (1991) Rough sets: A theoretical aspect of reasoning about data, Kluwer Academic Publisher.
[14] Pawlak, Z. and Skowron, A., (2007) Rudiments of rough sets, Information Sciences. An International Journal, 177 (1), 3-27.

Iwan Tri Riyadi Yanto
He is a Master candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD, and Real Analysis.

Tutut Herawan
He is a Ph.D. candidate in Data Mining at Universiti Tun Hussein Onn Malaysia (UTHM). His research area includes Data Mining, KDD and Real Analysis.

Mustafa Mat Deris
He received the B.Sc. from University Putra Malaysia, M.Sc. from University of Bradford, England and Ph.D. from University Putra Malaysia. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grid, database performance issues and data mining. He has published more than 80 papers in journals and conference proceedings. He was appointed as one of the editorial board members for International Journal of Information Technology, World Enformatika Society, and as a reviewer of a special issue on International Journal of Parallel and Distributed Databases, Elsevier, 2004, a special issue on International Journal of Cluster Computing, Kluwer, 2004, IEEE conference on Cluster and Grid Computing, held in Chicago, April, 2004, and Malaysian Journal of Computer Science. He has served as a program committee member for numerous international conferences/workshops including Grid and Peer-to-Peer Computing (GP2P 2005, 2006), Autonomic Distributed Data and Storage Systems Management (ADSM 2005, 2006), WSEAS, International Association of Science and Technology, IASTED on Database, etc.