
Fuzzy association rule mining for web usage visualization

Suash Deb¹, Simon Fong²*, Cecilia Ho²
¹Department of Computer Science and Engineering, Cambridge Institute of Technology, Ranchi, India
²Department of Computer Science, University of Macau, Taipa, Macau SAR
Abstract
An important task in Web business management is monitoring the growth of a website via visual inspection and alerting on any anomaly. Web mining is a popular research area for knowledge discovery on websites and Web operations. In particular, association rule mining (ARM) has been studied and applied for finding Web pages or Web links that are frequently accessed together in a session. However, most previous works in the literature used ARM for studying the browsing patterns of Web visitors/customers on a website, so that the website could be fine-tuned or personalized according to their Web surfing preferences. In this paper, we take a slightly different perspective, from the views and requirements of a website monitor that aims at visualizing the dynamic activities (also known as Web usages) on the website, so that the relations between Web pages, in terms of being clicked in a sequence of visits, can be visualized. Fuzzy ARM is applied here because the contextual relations between Web pages are not strictly defined but fuzzy in nature. An experiment is conducted to verify the efficacy of our proposed model, with superior results compared to using an ARM algorithm alone.
2013 Elsevier Science. All rights reserved.
Keywords: Web usage mining; Fuzzy association rules; Web usage visualization
1. Introduction
Website monitoring is crucial to an online business or e-government, as it offers insight into the progress of the business that runs upon the website. Many Web diagnosis software programs are readily available in commercial markets; they usually output the insights and knowledge of the Web operation in the form of tabulated statistics or, at best, bar charts, such as the most-hit pages, visitor counts, and the busiest hours of the day/week/year. Earlier, a website performance monitoring system called WebVS (Web visualization system) was proposed by the authors [1]. WebVS checks and visualizes both the static website structure and the dynamic usage data. When combined, the static view and the dynamic view represent the health of the growth of a website with respect to the addition of Web contents/pages and the actual popularity of the contents on the website. With the aid of visualization, the Web structure is rendered as a radial tree, and the analytic results are overlaid on it with visual cues. Web administrators and analysts can select different data attributes and thresholds to visualize. The static and dynamic views of Web graphs rendered by WebVS give an idea of how a website is doing. Such a presentation of the portal status makes it easier to understand and to locate anomalies or interesting phenomena than lists of statistics and numbers in a report. While radial tree Web graphs are useful for illustrating the full view of a website and the information pertaining to different parts of the website, they fall short in visualizing the association between parts of the website being visited. WebVS also generates association rules by applying the Fuzzy Apriori-T association rules algorithm and visualizes the rules in a relation graph. Visualization of such associations is implemented here along with radial tree Web graphs because the dynamic usages of the website, represented by Web visits, complement the growth of website structures and contents in a holistic approach.
For discovering the relations between Web pages or Web links, association rule mining (ARM) has been studied widely in the Web mining research community. However, most of the previous works in the literature used ARM for studying the browsing patterns of Web visitors/customers on a website, hence
Proceedings of International Conference on Computing Sciences
WILKES100 ICCS 2013
ISBN: 978-93-5107-172-3
253 Elsevier Publications, 2013
* Corresponding author: Suash Deb
Suash Deb, Simon Fong and Cecilia Ho
the Website could be fine-tuned or personalized according to their Web surfing preferences. In this paper, we focus on a slightly different perspective, from the views and requirements of WebVS, which aims at visualizing the dynamic activities (also known as usages) on the website, so that the relations between Web pages, in terms of being clicked in a sequence of visits, can be visualized. Fuzzy ARM is applied here instead of the original ARM because the contextual relations between Web pages are known to be defined differently by different people; for example, a session of Web browsing may be considered long in one culture but not in another. Likewise, the term evening is loosely defined as the period of time after sunset, which of course differs geographically from city to city. The measures used in such rule association are hence fuzzy in nature. The main contribution of this paper is a fuzzified ARM model which could be used as an important element in WebVS or a similar Web visualization package.
The paper is structured as follows. A brief review of related technology, such as website performance visualization and ARM algorithms, is given in Section 2. Section 3 depicts the theoretical model, Fuzzy ARM or simply FARM. An experiment is conducted in Section 4, where the efficacy of our proposed FARM model is discussed as well. Section 5 shows the visualized results. Section 6 concludes this paper.
2. Reviews of Web Mining and Visualization Systems
The authors in [3] introduced different 2D and 3D visualization diagrams of particular interest, classifying Web pages into two classes, hot (with many hits) and cold (with few hits), and illustrating the behavior of users. The framework enables flexible selection of mappings between data attributes and visualization dimensions for different diagrams. Selected existing academic and commercial Web analysis and visualization systems are briefly reviewed in this section.

Table I. Selected academic Web analysis and visualization systems proposed in the past

The systems are compared along the following dimensions. Web mining: Content Mining; Structure Mining; Usage Mining; Clustering. Visualization: Structure; Static Information; Dynamic Usage; Usage Relation; Personalization; Growth Monitoring; Inter-sites Comparison.

Systems compared: Smith and Ng [4]; Song and Shepperd [5]; Chen et al. [6]; Munzner [7, 8]; Chi et al. [9]; Chi et al. [10]; Eick [11]; Liu et al. [12]; Liu et al. [13]; Niu et al. [14]; Reiss and Eddon [15]; Chen [16]; Pascual-Cid et al. [17, 18].

To help users search for and organize information, Smith and Ng [19] suggested using a self-organizing map (SOM) to mine Web data and provided a visual tool, LOGSOM, to assist user navigation. LOGSOM organizes Web pages into a two-dimensional map based solely on the users' navigation behavior, rather than the content of the Web pages.
Song and Shepperd [20] view the topology of a Web site as a directed graph and mine Web browsing patterns for e-commerce. They use vector analysis and fuzzy set theory to cluster users and URLs. However, their frequent access path identification algorithm is not based on sequence mining, which plays a very important role in knowledge discovery from Web log data due to the ordered nature of click-streams.
Chen et al. [21] describe a novel representation technique that makes use of the Web structure together with summarization techniques to better represent knowledge in actual Web documents. They named the proposed technique Semantic Virtual Document (SVD). The SVD can be used together with a suitable clustering algorithm to achieve an automatic content-based categorization of similar Web documents. This technique allows an automatic content-based classification of Web documents, and a tree-like graphical user interface for post-retrieval document browsing enhances the relevance judgment process for Internet users. They also introduce a cluster-biased automatic query expansion technique to interpret short queries accurately, and present a prototype, Intelligent Search and Review of Cluster Hierarchy (iSEARCH), for Web content mining.
The H3 hyperbolic site viewer was developed by Tamara Munzner while at Stanford University [7, 8]. Using a sophisticated two-pass algorithm to organize pages in hyperbolic space, it lays pages out on a hemisphere using a non-Euclidean distance metric, ensuring there is exponential room to place nodes and enabling it to cope with large Web sites. The H3 viewer is also interactive; while the sphere is rotated, it maintains a fixed target frame rate to preserve interactive performance.
Chi et al. introduced a system based on the visualization of website structure using the radial tree visual metaphor [9]. Edge thickness was used to map the amount of traffic that occurred on a link, while colour was used to map the type of content of the target node. The same authors also presented Time Tube [10], which consists of a set of snapshots that represent the evolution of the website over time.
Eick [11] proposed a visualisation method to depict users' behaviour based on three columns. The left column contains nodes that represent the most frequent referrer pages used to reach a desired page, located in the middle column. The destination pages after the focus node are placed in the right column. Hence, it is quite intuitive to identify the users' flow around a single node.
WebCompare [12] is a Web comparison system that uses information retrieval and data mining techniques to compare keywords in U pages and C pages to identify potentially interesting pages. Liu et al. [13] proposed VSComp, which combines clustering and visualization to highlight potentially interesting pages from two Web sites. The key idea of the approach is that Web pages from the two sites are combined first, and then clustered and displayed together. This naturally reveals the interesting pages, i.e., similar and different pages in the two sites. In terms of techniques, VSComp differs from WebCompare in that VSComp uses clustering and visualization, which are not used in WebCompare.
In the WebKIV system [14], a radial tree algorithm is used to construct the Web site structure in a 2D plane. It implemented the disktree representation to compare Web navigational patterns and defined a three-dimensional scale to describe the Web visualization task.
Reiss and Eddon [15] proposed Webviz, which gathers usage data from large numbers of users by monitoring the URLs of the Web pages they are currently browsing. It summarizes this information by categories and then displays the results so that users can understand browsing patterns over time, spot trends, and identify any unusual patterns. The display consists of concentric circles, each representing a different time interval, with the outermost interval representing the most recent period. Within each interval, Webviz displays the different categories of information. The saturation and brightness of a region, and the frequency, width, and amplitude of the interior line, encode the additional information.
The Web Knowledge Visualization and Discovery System (WEBKVDS) [16] is mainly composed of two parts: a) FootPath, for visualizing the Web structure with the different data and pattern layers; and b) Web Graph Algebra, for manipulating and operating on the Web graph objects for visual data mining. The authors presented the idea of layering data and patterns in distinct layers on top of a disktree representation of the Web structure, allowing the display of information in context, which is better suited to the interpretation of discovered patterns. With the help of the Web graph algebra, the system provides a means for interactive visual Web mining.
The WebViz system, a tool to visualize both the structure and usage of Web sites, is proposed in [22]. The structure of a Web segment is rendered as a radial tree, and usage data is extracted and layered as polygonal graphs. By interactively creating and adjusting these layers, a user can develop real-time insight into the data. The system demonstrates the idea of interactive visual operators and of a polygon graph as a visual cue. This technique extends the concept of the radial tree by generating polygons that appear from the connection of parent nodes in the hierarchy with representative points on the edges, calculated according to the usage data of their children nodes. The polygonal graphs, however, are not straightforward and take time to interpret before useful information can be discovered.
The Website Exploration Tool (WET) [17], the closest system we found in academia for Web graph visualization, uses the Graphs Logic System (GLS) to calculate representative subgraphs from the whole collected Web graph, reducing the quantity of data to be visualized and avoiding overlapped visualizations. GLS generates a GraphML file to be visualized as a radial tree and treemap. The main goal of WET is to assist in the conversion of Web data into information by providing an already known context in which Web analysts may interpret the data. In their most recent paper, the authors describe the assessment process of two Virtual Learning Environments (VLE). An improved version of WET [18] provides a set of combined visual abstractions that can be visually customised as well as recomputed by changing the focus of interest. However, WET only focuses on visualizing the website data for usage evaluation.
3. Fuzzy Association Rules Mining
Fuzzy association rules mining [31] is an improved version of ARM which has been applied in many areas, such as rainfall prediction [32], refining search queries in Web retrieval [33], and website personalization together with a case-based reasoning engine [34]. In this section, the basic concepts and the crisp boundary problem of traditional association rules are introduced, followed by a new fuzzy approach that alleviates the crisp problem.
3.1. Definitions of Association Rules
To measure the reliability/accuracy of a rule, two values, support and confidence, have been extensively used since they were initially introduced. Let I = {i_1, i_2, ..., i_m} be a set of items (objects) and T = {t_1, t_2, ..., t_n} a set of transactions with items in I, both assumed to be finite. I contains all the possible items of a database; different combinations of those items are called itemsets.
Definition 1. An association rule is an expression of the form X ⇒ Y, where X, Y ⊆ I, X, Y ≠ ∅, and X ∩ Y = ∅. The rule X ⇒ Y means that every transaction of T that contains X contains Y too. The usual measures to assess association rules are support and confidence, both based on the concept of support of an itemset. The support measures the reliability by the relative frequency of co-occurrence of the rule's items. The confidence measures the rule accuracy as the quotient between the support of the rule and the relative frequency of the items belonging to the left part of the rule.
Definition 2. The support of an itemset I_j ⊆ I with respect to a set of transactions T is

supp(I_j, T) = |{t ∈ T | I_j ⊆ t}| / |T|    (1)

indicating the probability that a transaction of T contains I_j.
Definition 3. The support of the association rule X ⇒ Y in T is

Supp(X ⇒ Y, T) = supp(X ∪ Y, T)    (2)

and its confidence is

Conf(X ⇒ Y, T) = Supp(X ⇒ Y, T) / supp(X, T) = supp(X ∪ Y, T) / supp(X, T)    (3)

It is usual to assume that T is fixed for each problem and thus it is customary to avoid any reference to it. Then, the above values are simply noted supp(I_j), Supp(X ⇒ Y) and Conf(X ⇒ Y)
respectively. Support is the percentage of transactions where the rule holds. Confidence is the conditional probability of Y with respect to X or, in other words, the relative cardinality of X ∪ Y with respect to X. Association rule mining attempts to discover rules whose support and confidence are greater than two user-defined thresholds called minsupp and minconf respectively. Such rules are called strong rules.
Most of the existing algorithms work in the following two steps:
Step 1. Find the frequent itemsets. Considering transactions one by one, the algorithm updates the support of the itemsets each time a transaction is considered. This is the most expensive step from the computational point of view.
Step 2. Obtain rules with support and confidence greater than the user-defined thresholds from the frequent itemsets found in the previous step. Specifically, if the itemsets X and X ∪ Y are frequent, we can obtain the rule X ⇒ Y, since its support equals the support of the itemset X ∪ Y according to Definition 3.
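As an illustration, the two-step procedure can be sketched in a few lines of Python; the transactions, item names and thresholds below are made-up toy values, not data from this paper, and the frequent-itemset step is a naive enumeration rather than the pruned Apriori search.

```python
from itertools import combinations

# Toy transactions: each is a set of items (hypothetical page IDs).
transactions = [
    {"home", "news", "contact"},
    {"home", "news"},
    {"home", "about"},
    {"news", "contact"},
]

def support(itemset, transactions):
    """Relative frequency of transactions containing the itemset (Definition 2)."""
    return sum(itemset <= t for t in transactions) / len(transactions)

minsupp, minconf = 0.5, 0.6

# Step 1: find the frequent itemsets (naive enumeration; Apriori would prune).
items = set().union(*transactions)
frequent = {
    frozenset(c)
    for size in range(1, len(items) + 1)
    for c in combinations(items, size)
    if support(set(c), transactions) >= minsupp
}

# Step 2: generate strong rules X => Y from each frequent itemset.
rules = []
for itemset in frequent:
    for size in range(1, len(itemset)):
        for x in combinations(itemset, size):
            X, Y = frozenset(x), itemset - frozenset(x)
            conf = support(itemset, transactions) / support(X, transactions)
            if conf >= minconf:
                rules.append((X, Y, conf))
```

With these toy transactions, the itemsets {home}, {news}, {contact}, {home, news} and {news, contact} pass minsupp, and four strong rules survive the confidence filter.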
3.2 Crisp Boundary Problem: Motivation to Fuzzy Approach
Conventional association rule mining (ARM) algorithms usually deal with datasets of categorical values and expect any numerical values to be converted into categorical ones using ranges. In real life, data is neither only categorical nor only numerical but a combination of both, and the general method adopted is to convert numerical attributes into categorical attributes using ranges. The problem with dividing ranged values into sub-ranges this way is that the boundaries between the sub-ranges are crisp. Fuzzy association rule mining (FARM) is intended to address this crisp boundary problem of traditional ARM. The principal idea is that ranged values can belong to more than one sub-range; we say that a value has a membership degree, μ, that associates it with each available sub-range.
The best-known ARM algorithm, Apriori, is based on a simple but key observation about frequent itemsets: every subset of a frequent itemset must be a frequent itemset too. From this, the algorithm proceeds iteratively, starting from frequent itemsets containing a single item. The Fuzzy Apriori-T algorithm we use in this research is a fuzzy version of the Apriori-T algorithm.
Using a fuzzy membership function, it is possible to assign a membership degree to each of the elements in X. Elements of the set need not be numbers, as long as a degree of membership can be deduced from them. For the purpose of mining fuzzy association rules, numeric elements are used for quantitative data, but other categories might also exist where no numerical elements are found. For example, we can define three age categories, Young, Middle-aged and Old, and then ascertain the fuzzy membership (in the range [0, 1]) of each crisp numerical value in these categories. Thus, Age = 35 may have μ = 0.6 for the fuzzy partition Middle-aged, μ = 0.3 for Young and μ = 0.1 for Old [35]. By using fuzzy partitions, we preserve the information encapsulated in the numerical attribute while also converting it to a categorical attribute, albeit a fuzzy one. Therefore, many fuzzy sets can be defined on the domain of each quantitative attribute, and the original dataset is transformed into an extended one whose attribute values have fuzzy memberships in the interval [0, 1]. Applying this to Web mining, since we aim to evaluate the reputation or popularity of the Web pages or e-services and the different types of information in a given website, we can predefine the content categories of the Web pages, and by the accessed URL in the Web log, we can identify what content the accessed page is about.
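A minimal sketch of such a fuzzy partition, using triangular membership functions; the breakpoints below are illustrative choices of our own, not the (unspecified) shapes that produce the Age = 35 example above.

```python
def triangular(x, a, b, c):
    """Triangular membership: rises from a to a peak at b, falls back to zero at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical fuzzy partition of the Age domain (breakpoints are assumptions).
age_labels = {
    "Young":       lambda x: triangular(x, 0, 20, 40),
    "Middle-aged": lambda x: triangular(x, 25, 45, 65),
    "Old":         lambda x: triangular(x, 50, 75, 120),
}

def fuzzify(value, labels):
    """Map a crisp value to its membership degrees over all fuzzy labels."""
    return {name: mu(value) for name, mu in labels.items()}

# A crisp Age value now belongs to overlapping sub-ranges to different degrees.
memberships = fuzzify(35, age_labels)
```

Note how the crisp value 35 falls under both Young and Middle-aged at once, which is exactly what the crisp sub-range boundaries of conventional ARM cannot express.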
3.3 Fuzzy Association Rules
As in classical association rules, I = {i_1, i_2, ..., i_m} represents all the attributes appearing in the transaction database T = {t_1, t_2, ..., t_n}. I contains all the possible items of a database; different combinations of those items are called itemsets. Each item i_k will associate (to some degree) with several fuzzy sets. The degree of association is given by a membership degree in the range [0, 1].
Definition 5. A fuzzy transaction τ̃ is a nonempty fuzzy subset of I.
For every i ∈ I, we note τ̃(i) the membership degree of i in a fuzzy transaction τ̃. We note τ̃(I_j) the degree of inclusion of an itemset I_j ⊆ I in a fuzzy transaction τ̃, defined as

τ̃(I_j) = min_{i ∈ I_j} τ̃(i).

Definition 6. Let I be a set of items, T an FT-set, and X, Y ⊆ I two crisp subsets, with X, Y ≠ ∅ and X ∩ Y = ∅. A fuzzy association rule X ⇒ Y holds in T iff τ̃(X) ≤ τ̃(Y) for every τ̃ ∈ T, i.e., the membership degree of Y is not less than that of X for every fuzzy transaction in T. This definition preserves the meaning of association rules: if X is in a transaction to some degree, then Y is in it to at least the same degree, given that τ̃(X) ≤ τ̃(Y).
The support of the fuzzy association rule X ⇒ Y in the FT-set T is supp(X ∪ Y). The confidence computation uses a scalar cardinality of fuzzy sets based on the weighted summation of the cardinalities of its α-cuts. The confidence measure then looks as follows:

conf(X ⇒ Y) = Σ_{i=1}^{t} (α_i − α_{i+1}) |(X ∪ Y)_{α_i}| / Σ_{i=1}^{t} (α_i − α_{i+1}) |X_{α_i}|    (4)

where α_1 > α_2 > ... > α_t are the membership degrees occurring in T and α_{t+1} = 0. This method puts greater emphasis on elements with higher membership degrees, because an element with membership α_k occurs in each summand up to k, k ∈ {1, ..., t}.
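To make the fuzzy measures concrete, the sketch below computes the degree of inclusion of an itemset as the minimum membership of its items (Definition 5) and support/confidence via the simple sigma-count (mean of inclusion degrees) rather than the α-cut cardinality above; the fuzzy transactions are invented toy values.

```python
# Each fuzzy transaction maps items to membership degrees in [0, 1].
fuzzy_transactions = [
    {"<Hour, Afternoon>": 0.75, "<Hour, Night>": 0.25, "<Duration, Medium>": 1.0},
    {"<Hour, Night>": 1.0, "<Duration, Medium>": 1.0},
    {"<Hour, Afternoon>": 0.5, "<Duration, Medium>": 0.75},
]

def inclusion(itemset, tau):
    """Degree of inclusion of an itemset in a fuzzy transaction: min of item degrees."""
    return min(tau.get(item, 0.0) for item in itemset)

def supp(itemset, transactions):
    """Sigma-count support: average degree of inclusion across all transactions."""
    return sum(inclusion(itemset, tau) for tau in transactions) / len(transactions)

def conf(X, Y, transactions):
    """Confidence of the fuzzy rule X => Y as supp(X ∪ Y) / supp(X)."""
    return supp(X | Y, transactions) / supp(X, transactions)

s = supp({"<Hour, Afternoon>", "<Duration, Medium>"}, fuzzy_transactions)
c = conf({"<Hour, Afternoon>"}, {"<Duration, Medium>"}, fuzzy_transactions)
```

Here the rule <Hour, Afternoon> ⇒ <Duration, Medium> reaches confidence 1.0, because every transaction includes Medium to at least the degree it includes Afternoon, matching Definition 6.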
3.4 Our Approach
An item and a transaction are abstract concepts that may be seen as representing some kind of object and a subset of objects, respectively. Fuzzy data have uncertain values associated with fuzzy (linguistic) labels (such as "high" and "low") and a membership function, which normalizes the design parameter to the range between 0 and 1. In this case, a fuzzy transaction can contain more than one item corresponding to different labels
of the same attribute, because it is possible for a single value in the table to fit more than one label to a certain degree.
Similar to [37], let Lab(X_j) = {L_1^{X_j}, ..., L_{c_j}^{X_j}} be a set of linguistic labels for attribute X_j. We shall use the labels to name the corresponding fuzzy sets, i.e. L_k^{X_j}: Dom(X_j) → [0, 1].
Let L = ∪_{j ∈ {1,...,m}} Lab(X_j). Then, the set of items with labels in L associated to RE is

I_RE = { <X_j, L_k^{X_j}> | X_j ∈ RE, k ∈ {1, ..., c_j}, j ∈ {1, ..., m} }.

Every instance r of RE is associated to an FT-set, denoted T_r, with items in I_RE. Each tuple t ∈ r is associated to a single fuzzy transaction τ̃_t such that

τ̃_t(<X_j, L_k^{X_j}>) = L_k^{X_j}(t[X_j]).
In this project, we aim to discover interesting and meaningful patterns of the browsing behaviors and preferences of users from different origins. Therefore, we have four attributes, A_r = {Origin, Hour, Duration, Content}, as described in Table II below. Although the authors in [38] suggest using both the duration and the visiting frequency as weighting parameters, we use only the duration in this research, since the visiting frequency may be unreliable.
Table II. Categorical attributes for FARM

Origin: Visitor location, i.e. the country resolved from the remote-host field of a log entry
Hour: Hour of the day a visitor made the page access request
Duration: The length of the period that a visitor spent on a page, i.e. the activity time
Content: Different content categories of the requested page, predefined by experts
For origin we use the set of labels Lab(Origin) = {US, UK, China, Japan, Canada, ...}. The geographical location of a visitor can be resolved unambiguously from the IP address or domain name in the Web access log; therefore the membership degree of the origin attribute is always 1. This information can help relate the user behaviors and preferences to their origin.
Hour is the time of day a visitor made the access request. This can be obtained from the timestamp of the access log entry, in the format HH:MM:SS. Visitors may behave differently in different hours of the day; for example, they may be interested in different kinds of content. According to [37], the set of labels for hour, Lab(Hour) = {Early morning, Morning, Noon, Afternoon, Night}, can be defined as in Figure 1.
Fig. 1. Representation of some fuzzy labels for Hour.
Duration refers to the length of the period that a user spent on a page, indicating the user's interest in the page content. It is measured by the length of time between two successive activities or requests of the same user within a session. Figure 2 shows a possible definition of the set of labels for duration, Lab(Duration) = {Short, Quite Short, Medium, Quite Long, Long}. Note that we use a fixed 15 minutes as the maximum difference between two requests in the same session, as suggested by [39]. The duration reflects the relative importance of each page, because a user generally spends more time on a more useful page that contains updated, useful and attractive content. If a user is not interested in a page, he/she will usually jump to another page quickly. However, a quick jump may also occur when there is too little content on the page to browse through. Hence, it is more appropriate to normalize the duration by the total bytes of the page. The equation is defined as shown in (5), where each p_i ∈ P is the page viewed:

Duration(p_i) = (TotalDuration(p_i) / Size(p_i)) / max_{p ∈ P} (TotalDuration(p) / Size(p))    (5)
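A sketch of this byte-normalized duration, with made-up page paths, sizes and dwell times; the helper name mirrors the symbols in (5) but is our own.

```python
# Hypothetical per-page totals: seconds spent on the page and page size in bytes.
pages = {
    "/missions/apollo.html":  {"total_duration": 120.0, "size": 40_000},
    "/missions/shuttle.html": {"total_duration": 30.0,  "size": 5_000},
    "/missions/index.html":   {"total_duration": 10.0,  "size": 20_000},
}

def normalized_duration(page, pages):
    """Dwell time per byte, scaled by the site-wide maximum rate (Eq. 5)."""
    rate = pages[page]["total_duration"] / pages[page]["size"]
    max_rate = max(p["total_duration"] / p["size"] for p in pages.values())
    return rate / max_rate

# Every page ends up with a score in (0, 1]; the densest-read page scores 1.
scores = {p: normalized_duration(p, pages) for p in pages}
```

The small, quickly read page can still outrank a large page with a longer raw dwell time, which is the point of normalizing by size.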
Fig. 2. Representation of some fuzzy labels for Duration.
We categorize the Web pages by page content fuzzily, in the sense that a Web page may contain more than one kind of information and, therefore, can be classified into more than one content category to a certain degree. Table 7 shows an example of the fuzzy membership associated with the set of labels Lab(Content) = {A, B, C, D} for the content attribute, where P = {p_1, ..., p_k} is a set of Web pages. For instance, 60% of the content of the page p_1 is classified as content category B and 40% as content category C.
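A toy illustration of such fuzzy content categorization; the category names follow the A-D labels above, but the keyword lists and the keyword-fraction weighting are our own invention, not the experts' predefined scheme.

```python
# Hypothetical keyword evidence for each content category.
category_keywords = {
    "A": ["history"],
    "B": ["mission", "launch"],
    "C": ["image", "gallery"],
    "D": ["contact"],
}

def content_memberships(url_keywords):
    """Degree per category = fraction of the page's keywords matching that category."""
    hits = {cat: sum(k in kws for k in url_keywords)
            for cat, kws in category_keywords.items()}
    total = sum(hits.values())
    if total == 0:
        return {cat: 0.0 for cat in category_keywords}
    return {cat: h / total for cat, h in hits.items()}

# A page whose URL keywords suggest mostly category B and partly category C.
mu = content_memberships(["mission", "launch", "gallery"])
```

The membership vector sums to 1, so one page contributes fractionally to several content categories, mirroring the 60%/40% split described for p_1.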
We use the sets of labels for hour and duration to discover the relations between the length and the hour of the activity time. Then we have L = Lab(Hour) ∪ Lab(Duration) and I_RE = {<Hour, Early morning>, <Hour, Morning>, <Hour, Noon>, <Hour, Afternoon>, <Hour, Night>, <Duration, Short>, <Duration, Quite short>, <Duration, Medium>, <Duration, Quite long>, <Duration, Long>}. The FT-set T_r on I_RE is shown in Table III, whose columns define the fuzzy transactions of T_r as fuzzy subsets of I_RE.
Table III. Fuzzy transactions for the temporal relation of results

Item                      τ̃1    τ̃2    τ̃3    τ̃4    τ̃5    τ̃6
<Hour, Early morning>     0     0     0     1     0     0.25
<Hour, Morning>           0     0     0     0     0     0.75
<Hour, Noon>              0     0     0.5   0     0     0
<Hour, Afternoon>         0.75  0     0.5   0     1     0
<Hour, Night>             0.25  1     0     0     0     0
<Duration, Short>         0     0     0     0.33  0     0
<Duration, Quite short>   0     0     0.25  0.67  0     1
<Duration, Medium>        1     1     0.75  0     0     0
<Duration, Quite long>    0     0     0     0     1     0
<Duration, Long>          0     0     0     0     0     0
We have a collection of Web pages P = {p_1, ..., p_n} and a set of access log transactions associated to the collection of pages P as T_P = {t_1, ..., t_m}. We can obtain a set of items I = {i_1, ..., i_m} which represents all the attribute labels appearing in the transaction collection T_P. The degrees of association to these items in an access transaction t_i are given by membership values normalized in the range [0, 1] and represented by μ = {μ_i1, ..., μ_im}. Therefore, we can define a set of fuzzy transactions F = {f_1, ..., f_m}, where each transaction t_i corresponds to a fuzzy transaction f_i ∈ F, and where the membership values μ = {μ_i1, ..., μ_im} of the item set I = {i_1, ..., i_m} are fuzzy values from the fuzzy weighting scheme mentioned earlier.
The fuzzy representation building process of T_P is shown in Algorithm 1. Given a set of transactions, all possible items are extracted and their associated membership values are obtained by the item weighting scheme. On this set of fuzzy transactions we apply Algorithm 2 to extract the association rules.
Algorithm 1. Basic algorithm to obtain the fuzzy representations of all Web page access transactions
Input: a set of transactions T_P = {t_1, ..., t_m}.
Output: a fuzzy representation for all transactions in T_P.
1. Let T_P = {t_1, ..., t_m} be a collection of page access transactions.
2. Extract an initial set of items I from each transaction t_i ∈ T_P.
3. Apply the fuzzy membership weighting scheme in Section 5.3.2.4.
4. The representation of t_i obtained is a set of items I = {i_1, ..., i_m} with their associated membership values μ = {μ_i1, ..., μ_im}.
Algorithm 2. Basic algorithm to obtain the association rules from the Web access log
Input: a set of fuzzy transactions F = {t_1, ..., t_m}, where t_i contains a set of items I = {i_1, ..., i_m} with their associated membership values μ = {μ_i1, ..., μ_im}.
Output: a set of association rules.
1. Construct the itemsets from the set of transactions F.
2. Establish the threshold values of minimum support minsupp and minimum confidence minconf.
3. Find all the itemsets that have a support above the threshold minsupp, that is, the frequent itemsets.
4. Generate the rules, discarding those below the threshold minconf.
4. Experiment
A real life dataset of a public organization Web site, the NASA Web site was used to evaluate the efficacy of
the proposed method for Web visualization. NASA was chosen because of its large amount of data that are
publicly available.We extracted the Web structure and the page information of the present NASA Web site with
some Web crawler utility software [41] and Web Content Extractor [20]. By specifying the crawling rules,
preferred data and output format, we can obtain specific data from a particular website automatically from the
Internet. Since the Web site is very large, we only extracted the Missions part with crawling depth of 3 for this
experiment use. The extracted data contains 103 unique HTML pages. Theoretically our prototype can work with
any size of data provided that a view zooming function is available.
The logs were collected from 00:00:00 August 1, 1995 through 23:59:59 August 31, 1995. In this period there
were 1,569,898 requests. Timestamps have 1-second resolution. There are a total of 18,688 unique IPs requesting
pages, having a total of 171,529 sessions. A total of 15,429 unique pages are requested. The logs are in an ASCII
file format with one line per request, with the following attribute columns:
1. Remote-host making the request. A hostname when possible, otherwise the Internet address if the name
could not be looked up.
2. Timestamp in the format "DAY/MON/YEAR HH:MM:SS ZONE", where DAY is the day of the month,
MON is the name of the month, YEAR is the year, HH:MM:SS is the time of day using a 24-hour clock,
and ZONE is the time zone which is -0400 in this dataset.
3. Request given in quotes.
4. Status code.
5. Bytes in the reply.
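Parsing a log line of this shape can be sketched with a regular expression; the sample line below is fabricated in the NASA log's Common Log Format, and the two "- -" ident/authuser fields between the host and the timestamp are an assumption of that format, not columns listed above.

```python
import re

# Common Log Format, matching the columns described above:
# host, timestamp, quoted request, status code, bytes in the reply.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

# A fabricated sample line in the format of the August 1995 NASA logs.
line = ('ppp-mia-30.shadow.net - - [01/Aug/1995:00:00:27 -0400] '
        '"GET /ksc.html HTTP/1.0" 200 7280')

m = LOG_PATTERN.match(line)
entry = m.groupdict()  # host, timestamp, request, status, bytes as strings
```

The `bytes` group also accepts "-", which such logs use when no body was returned.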
A transaction contains a set of Web page access requests made by a user in the Web logs within a predefined period of time. After all pre-processing, which includes the filtering of unwanted data, user identification, content type mapping, etc., we calculate the fuzzy membership values of the attributes {Hour, Duration, Origin} according to the labels defined earlier and fuzzily classify the pages into content categories by the keywords of their URLs. We thus obtain the data in pairs. In every entry, the first pair indicates the geographical location of the user, followed by the itemsets of the content accessed by the user. The first numeric value in a pair is the code representation of the attribute label, while the second numeric value is the fuzzy membership value.
For the experiments, two datasets containing the access requests of more than 60 hosts from 5 countries were prepared. We aim to assess the performance of our proposed FARM algorithm with respect to the standard Apriori-T algorithm. NASA1 contains 3 attributes, {Hour, Duration, Origin}, while NASA2 contains 4 attributes, {Hour, Duration, Origin, Content}. The experiments are run on a Windows Vista machine with a 2 GHz Intel Core Duo CPU and 3 GB RAM.
Association Rules (ARs) are generated from so-called frequent itemsets: itemsets with a support count above
some user-specified support threshold. The support threshold is typically given a low value so that no potentially
interesting rules are missed. Once the frequent itemsets in a data set have been identified, the ARs can be
generated. Each frequent itemset of size greater than one can produce two or more ARs. To reduce this number,
only those rules above a given confidence threshold are selected; the confidence threshold value chosen is
therefore usually quite high.
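The rule-generation step described above can be sketched as follows. This is a generic sketch, not the paper's implementation; the toy support values are invented for illustration (Apriori-style mining guarantees that every subset of a frequent itemset is also in the support table, which the code relies on):

```python
from itertools import combinations

def generate_rules(freq_supports, min_conf):
    """Generate rules A -> B from frequent itemsets.

    freq_supports maps frozenset itemsets to their support. A rule is kept
    only when confidence = supp(A ∪ B) / supp(A) >= min_conf.
    """
    rules = []
    for itemset, supp in freq_supports.items():
        if len(itemset) < 2:
            continue  # size-1 itemsets cannot form a rule
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                a = frozenset(antecedent)
                conf = supp / freq_supports[a]
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules

# Toy supports (our own numbers, for illustration only)
supports = {
    frozenset({"US"}): 0.6,
    frozenset({"Procurement"}): 0.1,
    frozenset({"US", "Procurement"}): 0.09,
}
rules = generate_rules(supports, min_conf=0.65)
```

Here only {Procurement} -> {US} survives (confidence 0.9); the reverse direction {US} -> {Procurement} has confidence 0.15 and is pruned, illustrating why a high confidence threshold cuts the rule count sharply.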
More importantly, for any dataset there is a particular support value at which an optimal number of itemsets is
generated; for supports below this value, we get a flood of itemsets which are of no practical use. From our
experiments, we have observed that our algorithm performs most efficiently at this optimal support value, which
lies in the range 0.015-0.03 for the dataset NASA1 and 0.025-0.05 for NASA2. Applying the Fuzzy Apriori-T
algorithm, we obtain the following results.
Fig. 3. Number of frequent itemsets (NASA1).
Fig. 4. Number of frequent itemsets (NASA2).
Fig. 5. Number of rules with minsupp = 0.01 (NASA1+2). Fig. 6. Number of rules with minsupp = 0.02 (NASA1+2).
Fig. 7. Execution time with confidence = 0.03 (NASA1). Fig. 8. Execution time with confidence = 0.03 (NASA2).
Figures 3 and 4 show the results and demonstrate the difference between the numbers of frequent itemsets
generated using the Fuzzy Apriori-T and Apriori-T algorithms on the two datasets, NASA1 and NASA2.
As expected, the number of frequent itemsets increases as the minimum support decreases. From the results, it is
clear that FARM produces more frequent itemsets (and consequently rules) than Apriori-T. Figures 5 and 6 show
the number of rules produced using support thresholds 0.01 and 0.02, respectively. Fuzzy Apriori-T generates
many more rules than Apriori-T in the case of the dataset with 4 attributes.
Figures 7 and 8 show the execution-time performance of the two algorithms as the support threshold is varied for
the different datasets. It can be seen that the execution time increases as the threshold decreases in all cases,
irrespective of dataset type. Although the two algorithms have similar performance in execution time and
number of frequent itemsets, Fuzzy Apriori-T benefits over standard Apriori-T by extracting more interesting
rules, especially when there are more attributes. Table IV lists groups of rules with quite high confidence that are
discarded by Apriori-T but retained by Fuzzy Apriori-T under different settings on the NASA2 dataset.
Table IV. Rules discarded by Apriori-T

minsupp = 0.03, minconf = 0.65
Rule                                   Confidence (%)   Lift Ratio
{Procurement} -> {US}                  99.57            1.75
{History Japan} -> {Short}             83.77            1.35
{Quite Long} -> {US}                   81.92            1.44
{History Night} -> {Short}             80.77            1.30

minsupp = 0.035, minconf = 0.65
Rule                                   Confidence (%)   Lift Ratio
{Long Home} -> {US}                    91.06            1.60
{Noon Home} -> {US}                    88.52            1.56
{Afternoon History} -> {Short}         80.88            1.30
{Noon History} -> {Short}              78.51            1.27
{UK Missions} -> {Short}               74.67            1.20
{US Countdown} -> {Short}              67.18            1.08
{Early Morning Missions} -> {Short}    66.67            1.08

Rule                                   Confidence (%)   Lift Ratio
{Long Home} -> {US}                    91.06            1.60
{Noon Home} -> {US}                    88.52            1.56
{US Countdown} -> {Short}              67.18            1.08

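The confidence and lift columns in Table IV follow the standard definitions. A minimal sketch of both measures follows; the support values used are back-of-the-envelope assumptions chosen to roughly reproduce the first row of the table ({Procurement} -> {US}), not figures taken from the paper:

```python
def confidence(supp_ab, supp_a):
    """Fraction of transactions containing A that also contain B."""
    return supp_ab / supp_a

def lift(supp_ab, supp_a, supp_b):
    """Lift > 1 means A and B co-occur more often than if independent."""
    return supp_ab / (supp_a * supp_b)

# Assumed supports: supp(A) = 0.04, supp(B) derived so that the resulting
# confidence and lift come close to the table's 99.57% and 1.75.
supp_a, supp_b = 0.04, 0.569
supp_ab = 0.9957 * supp_a

conf = confidence(supp_ab, supp_a)   # ~0.9957
lft = lift(supp_ab, supp_a, supp_b)  # ~1.75
```

A rule can thus have near-100% confidence yet modest lift when the consequent (here, US visitors) is already very common on its own, which is why both columns are reported.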
5. Visualization by Relation Graph

The association rules generated by applying the Fuzzy Apriori-T algorithm are visualized in Figure 9, showing the
relationship between the four attributes of the dataset NASA2. The horizontal bars in the visualization show the
absolute frequency of how often each category occurred. The category N/A indicates that no item of the rule fits in
any of the categories in that dimension. The purple line from the category Early morning on the left indicates the rule
{Early morning, About} -> {US}, while the one on the right indicates {Early morning, History} -> {Short}. By
choosing the dimensions of origin and content, we obtain the relation graph visualizing the rules of these two
attributes in Figure 10. We can see clearly that US users are interested in the content of Procurement and About,
while Japanese users prefer History content.
Fig. 9. Association rules with support 0.02 and confidence 80%. Fig. 10. Association rules showing users from different origins.
6. Conclusion
In this paper we described Web usage analysis by applying an association rule mining algorithm, the Fuzzy
Apriori-T algorithm. It is set to find out the relations between visitors' locations and their navigation preferences.
Visualization of the generated rules in a relation graph helped to make the discovered patterns easy to understand.
The motive of this approach is to enable visualization of the balanced growth of a Website, which can be observed
quantitatively from the website structure as well as, via the association rules, from the distribution of popularity
received from the Web visitors. The Web graph visualizes only part of the NASA Website as a trial test, and the
fuzzy association rule mining covers a period of the Web logs as the experiment. The experiment validated the
capabilities of our proposed visualization and data mining models.