
Applied Intelligence (2018) 48:3902–3914

https://doi.org/10.1007/s10489-018-1182-6

Mining web access patterns with super-pattern constraint


Trang Van1,2 · Atsuo Yoshitaka3 · Bac Le4

Published online: 2 May 2018


© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
We consider the problem of mining web access patterns with super-pattern constraint. This constraint requires that the
sequential patterns in the sequence database must contain a particular set of patterns as sub-patterns. One common
application of this constraint is web usage mining which mines the user access behavior on the web. In this paper, we
introduce an efficient strategy for mining web access patterns with super-pattern constraint that requires only one database
scan. Firstly, we present the MWAPC (Mining Web Access Patterns based on super-pattern Constraint) algorithm, in which
each frequent pattern has to be checked if it contains at least one pattern from a user-defined set of patterns. Then we
develop an effective algorithm, called EMWAPC that prunes the search space at the beginning of mining process and avoids
checking the constraints one by one based on three proposed propositions. We have conducted the experiments on real web
log databases. The experimental results show that the proposed algorithms outperform the previous methods.

Keywords Web access pattern mining · Super-pattern constraint · Dynamic bit vector · Prefix-web access pattern tree

Trang Van
vtt.trang@hutech.edu.vn; trangvtt@grad.uit.edu.vn

Atsuo Yoshitaka
ayoshi@jaist.ac.jp

Bac Le
lhbac@fit.hcmus.edu.vn

1 Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
2 Faculty of Computer Sciences, University of Information Technology, VNU-HCMC, Ho Chi Minh, Vietnam
3 School of Information Science, Japan Advanced Institute of Science and Technology, Nomi, Japan
4 Faculty of Information Technology, University of Science, VNU-HCMC, Ho Chi Minh, Vietnam

1 Introduction

Web access pattern mining discovers frequent user access patterns from web log files. It is also known as web usage mining or web log mining. With the explosive growth of the World Wide Web, web access pattern mining is valuable not only for website management and the creation of adaptive websites [12] but also for businesses using online marketing or e-business [23]. Additionally, it is useful for support services [8] and personalization [19].

Web access pattern mining is a particular form of sequential pattern mining, which was first introduced by Agrawal and Srikant (1995). Sequential pattern mining finds the frequent subsequences, as patterns, in a set of sequences where each sequence consists of a list of elements and each element contains a set of items. In mining web access patterns, the sequence is a collection of customers' transaction logs and thus each element has exactly one item. In other words, a web access sequence is an ordered list of single items.

Mining sequential patterns in web logs helps to gather information about the access behavior of the customers visiting a website. Although many algorithms have been proposed for mining web access patterns [11, 18, 20, 22, 26], most of them focus on a single measure, the frequency. Mining with the frequency alone still faces challenges in both effectiveness and efficiency. The number of discovered patterns is huge while only a small part of them actually responds to user requirements. If we can focus on only those patterns interesting to users, we may be able to save a lot of mining time. Therefore, we try to find the patterns that satisfy a constraint defined by the user.

Recently, a number of different kinds of constraints have been proposed for different applications. In the context of web access pattern mining, we propose the super-pattern constraint, which requires that the discovered patterns contain, as sub-patterns, web access patterns from a particular set specified by the user. For instance, a user may want to find
only patterns about web click streams that start from a tourism website, reach a hotel page and then book flights. Here, the user's interest is represented as a super-pattern constraint, that is, a Boolean function C(p) on the set of all patterns [17]. It can be expressed as $C_{\text{Super-pattern}}(p) \equiv (\exists u \in U \text{ such that } p \supseteq u)$, where U is a given set of web log access patterns.

Two strategies can solve this problem: applying the constraints after the pattern mining process, or integrating the constraints into the pattern mining process. Applying the constraints after mining may require more time. If we can incorporate the constraints into the mining process, we can restrict the search to the patterns of interest to the user and achieve better performance. Although there are many studies on mining web access patterns, mining with the super-pattern constraint remains unexplored. This paper thus proposes an efficient method for integrating the super-pattern constraint into web access pattern mining. The contributions of this paper are summarized as follows:

(1) Presenting the problem of mining web access patterns with super-pattern constraint.
(2) Introducing a tree structure named prefix-web access pattern tree (PreWAP tree) which stores the information of candidates represented by dynamic bit vectors. Based on the property of the PreWAP tree, we develop a proposition for constraint checking. Instead of checking the constraint for every candidate pattern, we can skip the check for a huge number of candidate patterns. Moreover, the PreWAP tree supports early pruning of many candidates in the search space through the use of prefixes.
(3) Based on the characteristics of the constraint-satisfied patterns, we propose two propositions from which we derive two transformation techniques on the dynamic bit vectors for early pruning of the search space. One helps to eliminate the unpromising candidates and prune the sub-trees via prefixes. The other helps to reduce the number of join operations when extending patterns. Thus this early-pruning strategy can significantly reduce the search space and the runtime.
(4) Presenting an efficient algorithm for mining web access patterns with super-pattern constraint.

The structure of this paper is as follows. Section 2 presents the main concepts of web access pattern mining, some definitions used throughout the paper and the problem statement. Section 3 gives a brief summary of related work. Section 4 presents the PreWAP tree. The primary contribution of the current study is presented in Section 5, which consists of three propositions and two algorithms for mining web access patterns with super-pattern constraint, namely MWAPC and EMWAPC. Notably, the latter is the main contribution of this article. Section 6 describes the experiments and their performance results, and finally conclusions and future research are described in Section 7.

2 Basic concepts and problem statement

Let E be a set of distinct items. A web log sequence $s = \langle e_1 e_2 \ldots e_n \rangle$ ($e_i \in E$ for $1 \le i \le n$) is an ordered list of items, where items can be repeated, and n is called the length of the sequence. A sequence with length n is called an n-sequence, denoted as |s| = n. For example, the sequence ⟨ABCBAC⟩ is a 6-sequence. In this paper, a web log sequence is abbreviated to a sequence.

A sequence $\beta = \langle b_1 b_2 \ldots b_m \rangle$ is called a subsequence of another sequence $\alpha = \langle a_1 a_2 \ldots a_n \rangle$, denoted as β ⊆ α, and α is called a super-sequence of β, if there exist integers $1 \le i_1 < i_2 < \cdots < i_m \le n$ such that $b_1 = a_{i_1}, b_2 = a_{i_2}, \ldots, b_m = a_{i_m}$. We call $i_m$ the position where β occurs in sequence α (here we keep the position of β's last item), denoted as $pos_\beta$. A subsequence is also called a pattern. For example, the sequence ⟨ABC⟩ is a subsequence of ⟨ABACAC⟩ and it is located at the positions {4, 6} (assuming positions starting at 1), but ⟨CAB⟩ is not a subsequence of ⟨ABACAC⟩.

The web access sequence database WD is a set of input web log sequences, each having a unique sequence identifier SID. WD is generated by applying preprocessing to the original log file. An input sequence s is said to contain pattern p if p is a subsequence of s. In other words, pattern p is said to be present in s.

The absolute support (support) of a pattern α, denoted sup(α), is defined as the number of input sequences in WD that contain α. Given a minimum support threshold, minSup, we say that a pattern is frequent if its support is no less than minSup.

Definition 1 (Prefix). A pattern $\beta = \langle b_1 b_2 \ldots b_m \rangle$ is called a prefix of pattern $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ if and only if $b_i = a_i$ for all $1 \le i \le m$, $m < n$. A prefix is therefore also a subsequence. For example, the prefixes of pattern ⟨ABBCA⟩ are ⟨A⟩, ⟨AB⟩, ⟨ABB⟩ and ⟨ABBC⟩.

Definition 2 (Extending a pattern). We create a new pattern by extending a frequent k-pattern (k > 0) with a frequent item. The item is added to the end of the pattern. Let $\alpha = \langle a_1 a_2 \ldots a_n \rangle$ be a frequent pattern and e be a frequent item. Let $SID_\alpha$, $SID_e$, $pos_\alpha$, $pos_e$ be the sequence IDs and positions of pattern α and item e. Extending pattern α with item e, we have the new pattern $\alpha' = \langle a_1 a_2 \ldots a_n e \rangle$, where $SID_{\alpha'} = SID_e$ and $pos_{\alpha'} = pos_e$ if $(SID_\alpha = SID_e) \wedge (pos_\alpha < pos_e)$. According to Definition 1, α is a prefix of the extended pattern α'.
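The notions of subsequence, position and support defined in this section can be made concrete with a short sketch. It is only an illustration under simplifying assumptions: items are single characters, sequences are Python strings, and the names `end_positions`, `support` and `WDE` are hypothetical helpers, not part of the paper. `WDE` holds the six sequences of the running example (Table 1).

```python
def end_positions(pattern, seq):
    """Return the 1-based positions in seq at which an occurrence of pattern can end."""
    # reach[j] is True once pattern[:j] is a subsequence of the items read so far.
    reach = [True] + [False] * len(pattern)
    ends = []
    for pos, item in enumerate(seq, start=1):
        # Walk j downwards so that one item is never used twice in the same step.
        for j in range(len(pattern), 0, -1):
            if reach[j - 1] and pattern[j - 1] == item:
                reach[j] = True
                if j == len(pattern):
                    ends.append(pos)
    return ends


def support(pattern, database):
    """Absolute support: number of input sequences that contain pattern."""
    return sum(1 for seq in database.values() if end_positions(pattern, seq))


# Running example of the paper (Table 1), keyed by SID.
WDE = {1: "BABDEAD", 2: "BCEF", 3: "CABDE", 4: "BABCEF", 5: "ABCDE", 6: "BCDF"}

print(end_positions("ABC", "ABACAC"))  # [4, 6]
print(end_positions("CAB", "ABACAC"))  # []
print(support("AB", WDE))              # 4
```

The first two calls reproduce the ⟨ABC⟩ and ⟨CAB⟩ example given above; a pattern is frequent when `support` is at least minSup.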
Problem statement Given a web access sequence database WD, a set of constraint patterns $U = \{u_1, u_2, \ldots, u_n\}$ and a minimum support threshold minSup specified by the user, the problem of mining web access patterns with super-pattern constraint is to find all frequent patterns in the database that contain at least one pattern in U as a subsequence:

$FCP = \{p \mid sup(p) \ge minSup \wedge \exists k : 1 \le k \le n,\ p \supseteq u_k\}$.

Definition 3 (Constraint satisfied pattern) Given a constraint pattern u, pattern p is called a u-satisfied pattern if p ⊇ u.

For example, consider WDe shown in Table 1 (used as a running example throughout this paper), and let minSup = 3 and U = {⟨AB⟩, ⟨AD⟩, ⟨EA⟩}. Since ⟨EA⟩ is not frequent, we have 6 frequent satisfied patterns, FCP = {⟨AB⟩, ⟨ABD⟩, ⟨ABDE⟩, ⟨ABE⟩, ⟨AD⟩, ⟨ADE⟩}.

Table 1 An example web access sequence database (WDe)

SID  Input sequence
1    BABDEAD
2    BCEF
3    CABDE
4    BABCEF
5    ABCDE
6    BCDF

3 Related work

Sequential pattern mining is an important data mining tool used for web log mining. All sequential pattern mining algorithms can be used for mining web access patterns.

3.1 Sequential pattern mining

The existing algorithms are classified into two categories: horizontal database format algorithms and vertical database format algorithms [14].

AprioriAll [1] is the typical algorithm of the first category. It requires multiple database scans and generates a huge set of candidates. Some improved versions were derived by incorporating constraints into the mining process, such as GSP [21], SPIRIT [6] and PMPC [28]. GSP incorporates time constraints, sliding time windows and taxonomies in patterns. SPIRIT uses a regular language to constrain the pattern mining process and PMPC uses a wildcard constraint. To avoid generating candidate sequences as Apriori-type algorithms do, another approach is the frequent pattern growth paradigm. The representative algorithm is PrefixSpan [16]. It is also a horizontal database format algorithm, but it projects the original database into smaller projected databases based on the frequent item sets and then grows the patterns in these projected databases. However, a drawback of this method is that it can be costly to repeatedly scan the database and create database projections. Some variants with constraints are PTAC [4], GTC [13] and CloSPEC [3]. PTAC applies an aggregation constraint, while GTC and CloSPEC apply time constraints.

The typical algorithms in the second category are SPADE [30], SPAM [2] and PRISM [7]. The variants of these algorithms are cSPADE [29], which uses length and time constraints, CCSM [15] (a variant of SPADE with time constraint), and Pex-SPAM [9] (a variant of SPAM with regular expression constraint). The SPADE algorithm uses a vertical id-list database format, which consists of a list of pairs (input sequence and event identifier) for each pattern. It is possible to obtain the pattern support directly from the sequence id-list without scanning the database. Therefore, this approach needs only one scan if a pre-processing step is included. Instead of using the id-list, SPAM uses a bitmap representation. Each bitmap has a bit corresponding to each transaction of the sequences in the database. SPAM is much faster than SPADE but it is less space efficient, since the bitmap keeps the transactions even if they never participate in the support count of the pattern. More recently, CM-SPAM and CM-SPADE [5] improved on these algorithms. They add a data structure named CMAP (Co-occurrence MAP) for storing co-occurrence information. Based on CMAP, the improved algorithms perform early pruning of the candidates to reduce the search space. They outperform state-of-the-art algorithms for mining sequential patterns (GSP, PrefixSpan, SPADE and SPAM).

PRISM makes use of a prime block encoding approach to compress the bitmap of SPAM. Every candidate pattern is represented by two pieces of information: sequence blocks (which indicate the input-sequence ids that contain the candidate) and position blocks (which indicate the positions at which the candidate appears within an input sequence). PRISM only removes empty position blocks, and cannot remove empty sequence blocks. A newer approach that may overcome this uses the dynamic bit vector architecture [27]. This approach has been applied in mining inter-sequence patterns [10] and closed sequential patterns [24]. We thus study and apply this method to solve the problem of mining web access patterns with super-pattern constraint.

3.2 Web access pattern mining

Because the structure of web access patterns is simpler than that of sequential patterns in general, there are some particular approaches for mining web access patterns.
Pei et al. proposed a web access pattern tree (WAP-tree) for representing web access databases and the WAP-Mine algorithm for mining all frequent patterns from the WAP-tree [18]. Each node in the tree is labeled with an item and its support count, and each branch represents a complete access sequence. The access sequences which share a common prefix have the same path in the tree. All nodes with the same label are linked by shared label linkages into an event-queue. The head of each event-queue is registered in a header table. In order to construct the WAP-tree, the WAP-Mine algorithm scans the input database twice, once for finding all the frequent items and once for inserting the input sequences, with infrequent items removed, into the WAP-tree. After that, it mines all frequent web access patterns from the WAP-tree. The basic idea of this mining algorithm is conditional search. Firstly, it finds the conditional suffix patterns, and then it constructs the intermediate conditional WAP-tree using the pattern found in the previous step. WAP-Mine does not generate explosive candidate sets as Apriori-like algorithms do, but it recursively constructs a large number of intermediate WAP-trees during the mining process. This means that it still consumes a lot of time and memory.

There are some modifications based on the WAP-tree: PLWAP-tree [11], FLWAP-tree [22] and AWAPT [26]. PLWAP avoids the recursive re-construction of intermediate WAP-trees by assigning a binary position code to tree nodes to quickly determine the suffix of any frequent pattern prefix. Both FLWAP-tree and AWAPT are improved versions of PLWAP-tree.

All algorithms using the WAP-tree or WAP-like trees differ from the Apriori methods and outperform them. They avoid generating a huge set of candidates and performing multiple database scans, and they make support counting easier. They use a linked-tree representation of the input horizontal database. However, they still have drawbacks such as storing intermediate patterns while constructing large numbers of intermediate WAP-trees, increasing the size of the tree nodes in PLWAP, or using high memory for FLWAP as the latter creates intermediate.

4 Prefix-web access pattern tree

We introduce a prefix-web access pattern (PreWAP) tree for storing candidate patterns (similar to the prefix-tree [25]).

Structure of PreWAP tree Each node in a PreWAP tree registers two pieces of information: label and DBVP, in which label is a web access pattern and DBVP (Dynamic Bit Vector for Pattern) is a representation that stores the pattern information. Thus, each pattern p in the tree is associated with its DBVP, denoted as DBVP(p). A DBVP consists of two components: a DBV and a list of the positions at which the pattern appears in the web access sequences.

A DBV is a dynamic bit vector. It comprises a bit vector, stored as the list of bytes that remain after removing the zero bytes from the head and tail, and an index that indicates the location of the first non-zero byte in the original bit vector. A DBV is represented in the form DBV = ⟨index, {list of bytes}⟩. For a web access pattern, we use the DBV to indicate the input sequences where the pattern is present. Each byte (8 bits) represents a block of eight input sequences. If the k-th sequence in the block contains the pattern then the k-th bit is set to 1, otherwise it is set to 0. Because a pattern is usually present in only some input sequences, we use the dynamic bit vector to remove all zero bytes from the head and tail of the bit vector. Figure 1 shows a bit vector with 30 bytes. When it is converted to a DBV, we only need 9 bytes (7 bytes for the bit vector and 2 bytes for the index) and the index is 10 (assuming indexes starting at 0).

The list of positions indicates the positions where the pattern appears in each input sequence. It is represented in the form Start-Position: {List of positions}, where Start-Position is the first appearance of the pattern in the input sequence.

Support counting The support of a pattern p is directly determined from the bit vector in DBVP(p). It is the total number of 1 bits in the bytes of the DBV of DBVP(p). For example, consider the database shown in Table 1 and the construction of the DBVP for the 1-pattern ⟨A⟩. Since ⟨A⟩ occurs in the input sequences {1, 3, 4, 5}, the bit vector is 101110, which is padded with two trailing zero bits to fill one byte: {10111000} = {184}. So DBV(A) = ⟨0, {184}⟩. Next, we find the list of positions for the pattern ⟨A⟩. Since ⟨A⟩ appears at positions {2, 6} (assuming positions starting at 1) in the first sequence, the start position is 2 and thus we store 2: {2, 6}. Similarly, we obtain the positions of ⟨A⟩ in the remaining sequences 3, 4 and 5, namely 2: {2}, 2: {2} and 1: {1}. The completed DBVP(A) is shown in Table 2. The support of ⟨A⟩ is 4.

Index = 10

0 0 0 0 0 0 0 0 0 0 7 15 0 252 0 0 21 0 0 0 0 0 0 0 0 0 0 0 0 0

DBV = ⟨10, {7, 15, 0, 252, 0, 0, 21}⟩

Fig. 1 A representation of DBV for the bit vector
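The DBV representation and the support counting described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a DBV is modelled as a tuple `(index, bytes)`, the most significant bit of a byte stands for the first sequence of its block (as in the 10111000 = 184 example), and the function names `pack_sids`, `to_dbv`, `dbv_support` and `dbv_and` are hypothetical.

```python
def pack_sids(sids, n_sequences):
    """Pack 1-based sequence ids into bytes, eight sequences per byte, first sequence in the top bit."""
    out = [0] * ((n_sequences + 7) // 8)
    for sid in sids:
        block, offset = divmod(sid - 1, 8)
        out[block] |= 1 << (7 - offset)
    return out


def to_dbv(byte_list):
    """Drop zero bytes from head and tail; return (index of first non-zero byte, remaining bytes)."""
    first = 0
    while first < len(byte_list) and byte_list[first] == 0:
        first += 1
    if first == len(byte_list):
        return (0, [])  # the pattern occurs nowhere
    last = len(byte_list) - 1
    while byte_list[last] == 0:
        last -= 1
    return (first, byte_list[first:last + 1])


def dbv_support(dbv):
    """Support = total number of 1 bits in the stored bytes."""
    return sum(bin(b).count("1") for b in dbv[1])


def dbv_and(a, b):
    """Bitwise AND of two DBVs, aligning the bytes by their absolute index."""
    (ia, da), (ib, db) = a, b
    start = max(ia, ib)
    end = min(ia + len(da), ib + len(db))
    joined = [da[i - ia] & db[i - ib] for i in range(start, end)]
    idx, data = to_dbv(joined)
    return (start + idx, data) if data else (0, [])


# The 30-byte bit vector of Fig. 1.
fig1 = [0] * 10 + [7, 15, 0, 252, 0, 0, 21] + [0] * 13
print(to_dbv(fig1))                     # (10, [7, 15, 0, 252, 0, 0, 21])

# DBV of the 1-pattern A in the running example (sequences 1, 3, 4, 5 out of 6).
dbv_a = to_dbv(pack_sids([1, 3, 4, 5], 6))
print(dbv_a, dbv_support(dbv_a))        # (0, [184]) 4
```

Only the overlapping byte range of the two operands has to be visited in `dbv_and`, which is the practical benefit of trimming the zero bytes.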


Table 2 DBVP(A) in the WDe in Table 1

A  DBV         ⟨0, {184}⟩
   Bit vector  1          0  1       1       1       0  0  0
   Positions   2: {2, 6}  ∅  2: {2}  2: {2}  1: {1}  ∅  ∅  ∅

Pattern extension in PreWAP tree The root of the tree at level 0 is a special virtual node with an empty label. At level k, a node is labeled with a k-sequence. Recursively, we obtain the nodes at the next level (k + 1) by extending k-sequences.

How to get the support of the extended pattern? A new pattern p' is obtained by extending an available pattern p with a frequent item e (Definition 2). In order to determine sup(p'), we find DBVP(p') by joining DBVP(p) and DBVP(e). We use bitwise AND for joining two DBVs (Vo, Hong, & Le, 2012) and use Definition 2 for joining two lists of positions. If the list of positions within an input sequence is ∅ then we set the corresponding bit in the vector to 0. Table 3 shows an example of pattern extension. We obtain the new pattern ⟨AB⟩, with sup(⟨AB⟩) = 4, by extending pattern ⟨A⟩ with item B.

Table 3 Example of pattern extension for pattern ⟨A⟩ with item B in WDe

A   Bit vector  1          0       1       1          1       0       0  0
    Positions   2: {2, 6}  ∅       2: {2}  2: {2}     1: {1}  ∅       ∅  ∅
B   Bit vector  1          1       1       1          1       1       0  0
    Positions   1: {1, 3}  1: {1}  3: {3}  1: {1, 3}  2: {2}  1: {1}  ∅  ∅
AB  Bit vector  1          0       1       1          1       0       0  0
    Positions   2: {3}     ∅       2: {3}  2: {3}     1: {2}  ∅       ∅  ∅

Property of PreWAP tree Considering a sub-tree rooted at node n of the PreWAP tree, n is the prefix of all of its descendants.

5 Proposed algorithm

In this section, we first present an algorithm, named MWAPC, for mining web access patterns which are super-patterns of those in the constraint set U. Then, we propose the EMWAPC algorithm, which utilizes the properties of the DBV and the PreWAP tree for mining effectively.

5.1 MWAPC algorithm

MWAPC uses the PreWAP tree structure and depth-first search for enumerating the web access patterns that satisfy the constraint while reading the database only once. When a pattern is created, MWAPC checks whether it satisfies the constraint. If so, it is added to the results.

In the first step, MWAPC scans the database to find the frequent 1-patterns with their DBVPs (line 2) and removes the infrequent sequences from the constraint set U (line 3). Next, each node r at the first level is considered as the root of a sub-tree, which may be processed independently. Here we check the constraint for r.label ∈ F1, since $|u_i| \ge 1$ for $1 \le i \le n$. If it satisfies the constraint, it is added to the result set FCP (line 5). Then we perform pattern extension with r to generate larger patterns by calling the procedure EXTENSION-CHECK (line 6).

The extension process starts by finding the set of items I ⊆ F1 such that extending the pattern with any item e ∈ I yields a frequent pattern (line 7). The size of set I decreases steadily through the levels. We check the constraint and recursively call this procedure with each of the extended patterns (lines 8-10). This process is repeated until none of the generated children is frequent and the node is a leaf. The algorithm then backtracks to generate other patterns using other nodes.
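The MWAPC procedure described above can be sketched as follows. This is a simplified illustration rather than the authors' pseudocode: instead of byte-level DBVs, each pattern keeps a dictionary that maps a sequence id to the list of its end positions (an assumption made for brevity), items are single characters, and the names `extend`, `occurrences` and `mwapc` are hypothetical.

```python
def is_subsequence(u, p):
    """True if u is a subsequence of p."""
    it = iter(p)
    return all(item in it for item in u)


def extend(occ_p, occ_e):
    """Join per Definition 2: keep the positions of item e that lie after an occurrence of p."""
    out = {}
    for sid, pos_p in occ_p.items():
        later = [q for q in occ_e.get(sid, ()) if q > pos_p[0]]
        if later:  # an empty list corresponds to clearing the bit of this sequence
            out[sid] = later
    return out


def occurrences(pattern, items):
    """Occurrences of a whole pattern, obtained by successive extensions."""
    occ = items.get(pattern[0], {})
    for e in pattern[1:]:
        occ = extend(occ, items.get(e, {}))
    return occ


def mwapc(database, constraints, min_sup):
    # Single database scan: positions of every item in every sequence.
    items = {}
    for sid, seq in database.items():
        for pos, item in enumerate(seq, start=1):
            items.setdefault(item, {}).setdefault(sid, []).append(pos)
    f1 = {e: occ for e, occ in items.items() if len(occ) >= min_sup}
    # Remove infrequent constraint patterns from U.
    kept = [u for u in constraints if len(occurrences(u, items)) >= min_sup]
    result = {}

    def satisfied(p):
        return any(is_subsequence(u, p) for u in kept)

    def extension_check(p, occ_p, candidates):
        # Keep only the items whose extension of p is still frequent (the set I).
        children = {}
        for e in candidates:
            occ = extend(occ_p, f1[e])
            if len(occ) >= min_sup:
                children[e] = occ
        for e, occ in children.items():
            q = p + e
            if satisfied(q):
                result[q] = len(occ)
            extension_check(q, occ, list(children))

    for e in sorted(f1):
        if satisfied(e):
            result[e] = len(f1[e])
        extension_check(e, f1[e], sorted(f1))
    return result


WDE = {1: "BABDEAD", 2: "BCEF", 3: "CABDE", 4: "BABCEF", 5: "ABCDE", 6: "BCDF"}
print(mwapc(WDE, ["AB", "AD", "EA"], 3))
```

On the running example with minSup = 3 and U = {⟨AB⟩, ⟨AD⟩, ⟨EA⟩}, the sketch returns the six patterns of FCP listed in Section 2 together with their supports.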
5.2 Illustration of the MWAPC process

We apply the MWAPC algorithm to WDe with minSup = 3 and U = {⟨AB⟩, ⟨AD⟩, ⟨EA⟩}. The execution proceeds in the following steps.

1. FCP = ∅. MWAPC scans WDe to find F1 = {A, B, C, D, E, F} with their DBVPs and U' = {⟨AB⟩, ⟨AD⟩}.
2. For the first sub-tree, rooted at node A, the algorithm checks the constraint for pattern ⟨A⟩. Since ⟨A⟩ does not satisfy U', FCP = ∅. It then performs pattern extension from ⟨A⟩ by using the procedure EXTENSION-CHECK.
The algorithm must extend ⟨A⟩ with all items in F1 one by one to create ⟨AA⟩, ⟨AB⟩, ⟨AC⟩, ⟨AD⟩, ⟨AE⟩ and ⟨AF⟩. Since ⟨AA⟩, ⟨AC⟩ and ⟨AF⟩ are not frequent, I1 = {B, D, E}. For the first extended pattern ⟨AB⟩, it checks the constraint; ⟨AB⟩ satisfies it, so FCP = {⟨AB⟩}. MWAPC then calls Procedure 1 with ⟨AB⟩ and I1 as the parameters for extension to create larger patterns at the next level. This step is repeated recursively until no more frequent patterns are created. The algorithm then stops processing this sub-tree.
3. Step 2 is repeated for each sub-tree rooted at the remaining nodes in F1, namely B, C, D, E and F. When no more patterns are created, the mining process is finished. The PreWAP tree of the MWAPC algorithm and the result set FCP for this example are shown in Fig. 2.

5.3 EMWAPC algorithm

In this section, we propose three propositions for fast mining of web access patterns with super-pattern constraint. Let $u = \langle E_1 E_2 \ldots E_n \rangle \in U$ be a constraint pattern, s be an input sequence in the database, and F1 be the set of atoms, where an atom is a frequent 1-pattern. The three propositions are as follows.

Proposition 1 Let p be a u-satisfied pattern. If (u ⊈ s) then (p ⊈ s).

Proof Assume, for contradiction, that p ⊆ s (1). Because p is a u-satisfied pattern, the definition of constraint satisfied pattern gives p ⊇ u (2). From (1) and (2), we have u ⊆ s. This contradicts the assumption u ⊈ s. Therefore, Proposition 1 is proven.

Proposition 2 Let X ∈ F1 be an atom, ST(X) be the sub-tree rooted at X of the PreWAP tree and p ∈ ST(X) be a u-satisfied pattern. If (p ⊆ s) then (X ⊆ s) ∧ (u ⊆ s) ∧ ($pos_X \le pos_{E_1}$).

Proof
• Since p ∈ ST(X) ⇒ X is the prefix of p (property of PreWAP tree) ⇒ X ⊆ p (3). By assumption, p ⊆ s (4). From (3) and (4), we have X ⊆ s.
• Since p is a u-satisfied pattern ⇒ p ⊇ u, i.e. u ⊆ p (based on Definition 3), and similarly p ⊆ s ⇒ u ⊆ s.
• Since p ∈ ST(X), the form of p is $p = \langle X A_1 A_2 \ldots A_m \rangle$. By assumption, p is a u-satisfied pattern ⇒ $p \supseteq u = \langle E_1 E_2 \ldots E_n \rangle$. Therefore, there are two cases: $E_1 = X \Rightarrow pos_X = pos_{E_1}$, or $E_1 = A_i$ ($1 \le i < m$) $\Rightarrow pos_X < pos_{E_1}$.

Therefore, Proposition 2 is proven.

Proposition 3 Let ST(p) be the sub-tree rooted at pattern p of the PreWAP tree. If p is a u-satisfied pattern then q is a u-satisfied pattern for all q ∈ ST(p).

Fig. 2 The PreWAP tree of MWAPC for WDe with minSup = 3, U = {⟨AB⟩, ⟨AD⟩, ⟨EA⟩}. The result set shown in the figure is FCP = {⟨AB⟩, ⟨ABD⟩, ⟨ABDE⟩, ⟨ABE⟩, ⟨AD⟩, ⟨ADE⟩}.
Proof By assumption, p is a u-satisfied pattern ⇒ u ⊆ p. Besides, since q ∈ ST(p), p is the prefix of q (based on the property of the PreWAP tree) ⇒ p ⊆ q. So, we have u ⊆ q ⇒ q is a u-satisfied pattern. Therefore, Proposition 3 is proven.

Based on the three propositions above, the EMWAPC algorithm improves on MWAPC with the following basic ideas. The mining proceeds from each sub-tree rooted at an atom in F1, and EMWAPC prunes the search space of the PreWAP tree early, before performing pattern extension, by using propositions 1 and 2. Then, in the pattern extension process, instead of checking the constraint for each generated pattern as MWAPC does, EMWAPC skips the constraint check for numerous patterns based on proposition 3. The details of EMWAPC are described below.

First, EMWAPC scans the database to find F1 with their DBVPs (line 2). Then, it determines the frequency of the constraint patterns in U by using the DBVPs without accessing the database (line 3). Hence, EMWAPC does not waste time checking whether each of a large number of input sequences contains u ∈ U. We can get DBVP(u) by using pattern extension with each item in u. Here, the positions where u occurs in an input sequence are represented by the positions of u's first item (instead of the last item as usual) so that they can serve the pruning strategy. Note that if ∃e ∈ u such that e ∉ F1, or sup(u) < minSup, then we delete u from the set U.

Next, the algorithm calls the procedure EARLY-PRUNING to prune the search space by applying propositions 1 and 2 (line 4). The algorithm then performs the pattern extension in a way similar to MWAPC. However, there is no need to check the constraint for all the created patterns.
Based on proposition 3, if the root node of a sub-tree is a satisfied pattern, we simply perform pattern extension and add the extended patterns to FCP without checking the constraint, by calling the procedure PREFIX-EXTENSION (lines 6-8). This means that the algorithm can skip the check for all patterns in that sub-tree. Otherwise, we perform pattern extension and check the constraint by calling the procedure PREFIX-EXTENSION-CHECK (lines 9, 10). If a satisfied pattern is found, we also skip the check for its descendants.
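The effect of proposition 3 on the number of constraint checks can be illustrated with a small sketch. It is not the authors' PREFIX-EXTENSION pseudocode: support is counted by brute force over the raw sequences instead of DBVP joins (a simplification), items are single characters, and the function `mine` with its flag `skip_descendants` is a hypothetical name.

```python
def is_subsequence(u, p):
    """True if u is a subsequence of p."""
    it = iter(p)
    return all(item in it for item in u)


def mine(database, constraints, min_sup, skip_descendants):
    """Depth-first mining; returns (satisfied frequent patterns, number of constraint checks)."""
    def sup(p):
        return sum(is_subsequence(p, s) for s in database.values())

    items = sorted({e for s in database.values() for e in s})
    f1 = [e for e in items if sup(e) >= min_sup]
    kept = [u for u in constraints if sup(u) >= min_sup]
    result, checks = [], 0

    def visit(p, candidates, inherited):
        nonlocal checks
        ok = inherited
        if not ok:
            # The prefix is not known to satisfy the constraint: check p explicitly.
            checks += 1
            ok = any(is_subsequence(u, p) for u in kept)
        if ok:
            result.append(p)
        children = [e for e in candidates if sup(p + e) >= min_sup]
        for e in children:
            # Proposition 3: descendants of a satisfied pattern are satisfied too.
            visit(p + e, children, ok and skip_descendants)

    for e in f1:
        visit(e, f1, False)
    return sorted(result), checks


WDE = {1: "BABDEAD", 2: "BCEF", 3: "CABDE", 4: "BABCEF", 5: "ABCDE", 6: "BCDF"}
U = ["AB", "AD", "EA"]
print(mine(WDE, U, 3, skip_descendants=False))
print(mine(WDE, U, 3, skip_descendants=True))
```

Both calls return the same set of satisfied patterns; with the flag enabled, the descendants of ⟨AB⟩ and ⟨AD⟩ are added without an explicit check, so fewer checks are performed.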

EARLY-PRUNING procedure The basic mechanism for pruning the search space early is based on propositions 1 and 2. Proposition 1 states that if the constraint sequence u is not present in the input sequence s, neither is the u-satisfied pattern p (p ∈ FCP). Therefore, it is possible to eliminate early, based on $DBV_u$, the input sequences that do not contribute to the frequency of pattern p. Because p is created from the atom X ∈ F1, the elimination is performed on $DBV_X$. If the k-th bit of $DBV_u$ is zero then we set the k-th bit of $DBV_X$ to zero by using the bitwise AND of these two DBVs. Moreover, from proposition 2, it can be inferred that the input sequences contributing to the frequency of pattern p are those that contain both X and u, and in which X precedes u. Therefore, we can delete the positions in the list of positions of DBVP(X) where the result of the bitwise AND is 1 and X follows u. Because |U'| ≥ 1 and a pattern only needs to satisfy one of the constraints in the set U', let d be the representation of the set U'. DBVP(d) is defined as follows:

$DBV_d = \mathrm{OR}(DBV_{u_i})$, $pos_d = \mathrm{MAX}(pos_{u_i})$ with $u_i \in U'$, $1 \le i \le |U'|$.
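The elimination based on propositions 1 and 2 can be sketched as below. This is an illustration under stated assumptions, not the authors' EARLY-PRUNING pseudocode: DBVP(d) is modelled as a dictionary from sequence id to the largest first-item position of any frequent constraint pattern (the OR of the bit vectors is implied by the keys), occurrences are found by brute force, and the names `start_positions`, `build_d`, `eliminate` and `early_pruning` are hypothetical.

```python
def is_subsequence(u, s):
    """True if u is a subsequence of s."""
    it = iter(s)
    return all(item in it for item in u)


def start_positions(u, seq):
    """1-based positions of u's first item that start an occurrence of u in seq."""
    return [i + 1 for i, item in enumerate(seq)
            if item == u[0] and is_subsequence(u[1:], seq[i + 1:])]


def build_d(database, constraints, min_sup):
    """d[sid] = MAX of the first-item positions over all frequent constraint patterns."""
    starts = {u: {sid: start_positions(u, s) for sid, s in database.items()}
              for u in constraints}
    kept = [u for u in constraints
            if sum(1 for p in starts[u].values() if p) >= min_sup]
    d = {}
    for sid in database:
        pos = [p for u in kept for p in starts[u][sid]]
        if pos:  # the sequence contains at least one constraint pattern (OR of the DBVs)
            d[sid] = max(pos)
    return d


def eliminate(x, database, d):
    """Occurrences of atom x that survive the elimination against d."""
    occ = {}
    for sid, seq in database.items():
        if sid not in d:
            continue  # Proposition 1: bit cleared by the AND with DBV_d
        # Proposition 2: drop the positions where x follows the constraint pattern.
        keep = [i + 1 for i, item in enumerate(seq) if item == x and i + 1 <= d[sid]]
        if keep:
            occ[sid] = keep
    return occ


def early_pruning(database, constraints, min_sup):
    """Return d and the atoms (with their reduced occurrences) that stay frequent."""
    d = build_d(database, constraints, min_sup)
    atoms = sorted({item for seq in database.values() for item in seq})
    f1_star = {}
    for x in atoms:
        occ = eliminate(x, database, d)
        if len(occ) >= min_sup:  # elimination can only lower the support
            f1_star[x] = occ
    return d, f1_star


WDE = {1: "BABDEAD", 2: "BCEF", 3: "CABDE", 4: "BABCEF", 5: "ABCDE", 6: "BCDF"}
d, f1_star = early_pruning(WDE, ["AB", "AD", "EA"], 3)
print(d)              # {1: 6, 3: 2, 4: 2, 5: 1}
print(sorted(f1_star))  # ['A']
```

On the running example only the atom A survives, so only the sub-tree rooted at A has to be explored.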
The pseudo code of EARLY-PRUNING procedure is as
follows. Let F1∗ be the set of prefixes which are the nodes
at the first level of the PreWAP tree. Due to using DBVP(d)
as mentioned above, the elimination can be performed for
the DBVP of each atom. By applying the proposition 1, we
can delete the atom from F1 immediately if its support does
not satisfy the minSup after being eliminated (lines 4-5). Otherwise, we continue to apply proposition 2 (lines 6-11). If the atom is still frequent, it is added to F1∗; otherwise we have already pruned a sub-tree of the PreWAP tree. Besides, the elimination process may reduce the support of all the atoms as well as change some lists of positions to ∅. This helps to reduce the cost of the pattern extension. Thus this early-pruning strategy can significantly reduce the search space and the runtime.

5.4 Illustration of the EMWAPC process

The execution of the EMWAPC algorithm for WDe with minSup = 3, U = {⟨AB⟩, ⟨AD⟩, ⟨EA⟩} proceeds in the following steps:

1. FCP = ∅. Scan WDe to find F1 = {A, B, C, D, E, F} with their DBVPs. Table 4 shows the atoms in F1 with their DBVs.

Table 4 The atoms with their DBVs

F1  SID               Bit vector  DBV         Support
A   1, 3, 4, 5        10111000    ⟨0, {184}⟩  4
B   1, 2, 3, 4, 5, 6  11111100    ⟨0, {252}⟩  6
C   2, 3, 4, 5, 6     01111100    ⟨0, {124}⟩  5
D   1, 3, 5, 6        10101100    ⟨0, {172}⟩  4
E   1, 2, 3, 4, 5     11111000    ⟨0, {248}⟩  5
F   2, 4, 6           01010100    ⟨0, {84}⟩   3

2. Find U'. To determine sup(⟨AB⟩), we extend the pattern ⟨A⟩ with item B and join their DBVPs, which gives ⟨AB⟩ with sup(⟨AB⟩) = 4. Table 5 shows how to determine DBVP(AB). Similarly, we have sup(⟨AD⟩) = 3 and sup(⟨EA⟩) = 1, and thus U' = {⟨AB⟩, ⟨AD⟩}.
3. Prune the search space. First, F1∗ = ∅, and we define DBVP(d) based on DBVP(AB) and DBVP(AD) as in Table 6. Next, we eliminate the DBVPs of the atoms in F1. Since DBVP(A) does not change when it is eliminated, A is still frequent and thus F1∗ = {A}. The atoms B, C, D, E and F are no longer frequent after being eliminated, so they are not added to F1∗.
Table 7 shows an example of the elimination for DBVP(B); after the elimination, sup(B) = 2. In particular, the atom F becomes infrequent as soon as the bitwise AND between DBV(F) and DBV(d) is performed (01010100 & 10111000 = 00010000). Therefore, we delete it from F1.
4. EMWAPC starts processing each sub-tree in the PreWAP tree independently. Only one sub-tree remains, rooted at A, the single node in F1∗. The item set used for pattern extension is F1 = {A, B, C, D, E, F}. The pattern extension process is executed in the same way as in MWAPC, but EMWAPC does not need to check the constraint for all the patterns. It skips the check for the patterns in T1 and T2, as shown in Fig. 3.

The EMWAPC algorithm applies propositions 1, 2 and 3, and thereby prunes the search space at the beginning of the mining process and skips the constraint check for a large number of patterns. Therefore, it is faster than MWAPC.

6 Experimental results

In this section, we compare the performance of PRISMC, CM-SPAMC, MWAPC and EMWAPC on real-life databases. PRISMC and CM-SPAMC are, respectively, pushed-constraint versions of PRISM [7] and CM-SPAM [5] for

Table 5 Example of the pattern extension of ⟨A⟩ with item B

A   Bit vector  1          0       1       1          1       0       0  0
    Positions   2: {2, 6}  ∅       2: {2}  2: {2}     1: {1}  ∅       ∅  ∅
B   Bit vector  1          1       1       1          1       1       0  0
    Positions   1: {1, 3}  1: {1}  3: {3}  1: {1, 3}  2: {2}  1: {1}  ∅  ∅
AB  Bit vector  1          0       1       1          1       0       0  0
    Positions   2: {2}     ∅       2: {2}  2: {2}     1: {1}  ∅       ∅  ∅

Table 6 Define DBVP(d) based on DBVP(AB) and DBVP(AD)

AB  Bit vector  1          0  1       1       1       0  0  0
    Positions   2: {2}     ∅  2: {2}  2: {2}  1: {1}  ∅  ∅  ∅
AD  Bit vector  1          0  1       0       1       0  0  0
    Positions   2: {2, 6}  ∅  2: {2}  ∅       1: {1}  ∅  ∅  ∅
d   Bit vector  1          0  1       1       1       0  0  0
    Positions   2: {6}     ∅  2: {2}  2: {2}  1: {1}  ∅  ∅  ∅
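
The quick frequency test described for the atom F can be sketched as a bitwise AND followed by a count of the set bits. This is only a minimal illustration: the bit vectors are written as plain strings, the values of DBV(F) and of the bit vector of d are the ones quoted in the text and in Table 6, and the helper names bitwise_and and support are hypothetical, not part of the paper.

```python
MIN_SUP = 3  # minSup of the running example


def bitwise_and(a: str, b: str) -> str:
    """Element-wise AND of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return "".join("1" if x == "1" and y == "1" else "0" for x, y in zip(a, b))


def support(bits: str) -> int:
    """Number of sequences whose bit is set."""
    return bits.count("1")


dbv_f = "01010100"  # DBV(F) as quoted in the text
dbv_d = "10111000"  # bit vector of d from Table 6

joined = bitwise_and(dbv_f, dbv_d)
is_frequent = support(joined) >= MIN_SUP
print(joined, support(joined), is_frequent)
```

With these inputs the result has a single set bit, which is below minSup = 3, so F can be discarded without inspecting any position list.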

Table 7 The elimination for the DBVP(B)

B       Bit vector  1          1       1       1          1       1       0  0
        Positions   1: {1, 3}  1: {1}  3: {3}  1: {1, 3}  2: {2}  1: {1}  ∅  ∅
d       Bit vector  1          0       1       1          1       0       0  0
        Positions   2: {6}     ∅       2: {2}  2: {2}     1: {1}  ∅       ∅  ∅
elim−B  Bit vector  1          0       0       1          0       0       0  0
        Positions   1: {1,3}   ∅       ∅       1: {1}     ∅       ∅       ∅  ∅
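
The elim−B row of Table 7 can be reproduced with a short sketch. The rule used here is an assumption inferred from Tables 6 and 7 rather than the paper's stated definition: a position of the atom is kept only if it is smaller than the largest position of d in the same sequence, and a sequence with no entry in d is dropped. The leading "k:" prefixes of the position lists are omitted, and the function name eliminate is hypothetical.

```python
from typing import Dict, Set

Positions = Dict[int, Set[int]]  # sequence index -> positions in that sequence

# Position lists of B and d copied from Table 7 (sequences numbered 1..8).
dbvp_b: Positions = {1: {1, 3}, 2: {1}, 3: {3}, 4: {1, 3}, 5: {2}, 6: {1}}
dbvp_d: Positions = {1: {6}, 3: {2}, 4: {2}, 5: {1}}


def eliminate(atom: Positions, d: Positions) -> Positions:
    """Keep only atom positions lying before the last position of d.

    Assumed rule (inferred from the tables, not quoted from the paper).
    """
    kept: Positions = {}
    for seq, positions in atom.items():
        if seq not in d:
            continue  # d does not occur in this sequence
        limit = max(d[seq])
        survivors = {p for p in positions if p < limit}
        if survivors:
            kept[seq] = survivors
    return kept


elim_b = eliminate(dbvp_b, dbvp_d)
elim_bits = "".join("1" if seq in elim_b else "0" for seq in range(1, 9))
print(elim_bits, elim_b, len(elim_b))
```

Under this assumed rule, B survives only in the first and fourth sequences, which agrees with the bit vector of elim−B and with sup(B) = 2 after the elimination.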

[Figure: PreWAP tree rooted at {}, with the sub-trees T1 and T2 marked;
FCP = {AB, ABD, ABDE, ABE, AD, ADE}]

Fig. 3 The PreWAP tree of EMWAPC for WDe with minSup = 3, U = {AB, AD, EA}
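
The super-pattern constraint itself can be illustrated with a small filter. This is a sketch under two assumptions: "contains as a sub-pattern" is read as ordinary subsequence containment, and the candidate list below is a hypothetical set of frequent patterns with prefix A read off the node labels of Fig. 3. The function names are illustrative only.

```python
from typing import Iterable, List


def is_subsequence(sub: str, pattern: str) -> bool:
    """True if the items of sub occur in pattern in the same order."""
    remaining = iter(pattern)
    # 'item in remaining' consumes the iterator up to the first match,
    # so later items must be found after earlier ones.
    return all(item in remaining for item in sub)


def satisfies(pattern: str, constraints: Iterable[str]) -> bool:
    """Super-pattern constraint: pattern contains at least one element of U."""
    return any(is_subsequence(u, pattern) for u in constraints)


U: List[str] = ["AB", "AD", "EA"]
# Hypothetical candidate list (assumed from the labels visible in Fig. 3).
candidates = ["A", "AB", "AD", "AE", "ABD", "ABE", "ADE", "ABDE"]

fcp = sorted(p for p in candidates if satisfies(p, U))
print(fcp)
```

For this candidate list the filter returns the six patterns listed as FCP in Fig. 3; A and AE are rejected because neither contains AB, AD or EA. MWAPC performs such a check for every frequent pattern, whereas EMWAPC skips it for the patterns in T1 and T2.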

Table 8 Database characteristics

Database    #Sequences  #Distinct items  Average seq. length (items)
Gazelle     59,602      497              2.51 (std = 4.85)
FIFA        20,450      2,990            34.74 (std = 24.08)
Kosarak10k  10,000      10,094           8.14 (std = 22)

mining web access patterns with pushing constraints after the pattern
mining process. Both MWAPC and EMWAPC use the dynamic bit vector, but
EMWAPC applies three propositions, including the pruning strategy and
the constraint checking reduction. All the algorithms were implemented
in C# with Visual Studio 2008 and executed on a personal computer with
an Intel Core i7 1.9-GHz CPU and 8 GB of RAM running Windows 8.0.

6.1 Experiment databases

Experiments were carried out on three real-life databases, Gazelle,
Kosarak and FIFA, which can be downloaded from http://goo.gl/hDtdt.
Gazelle contains sequences of click stream data from an e-commerce
site. Kosarak is a very large database with 990,000 sequences of click
stream data from a news portal, and therefore we use a subset of it.
Both of them were part of the KDD Cup 2000 challenge databases. FIFA is
a database with click stream data from the website of FIFA World Cup
98. Table 8 summarizes the characteristics of these databases.

6.2 Initialize constraint sequences

In practical applications, the constraint sequence set U is given by
the user. For the experiments, U is generated randomly, with each
element made of items selected from F1. The cardinality of U depends
on the number of selected items and the length of the constraint
sequences (denoted as Length). Without loss of generality, we select
the top k (%) (denoted as TopK) most frequent items in F1 because the
number of combinations is very large.

6.3 Performance analysis

First, we perform the experiments for mining web access patterns with
and without super-pattern constraint. The results show that the number
of extracted patterns in mining with constraint is smaller than that in
mining without constraint, because mining with constraint only returns
the patterns that satisfy the constraint defined by the user.
Consequently, this subset is of high interest to the user. Table 9
shows a comparison of the extracted pattern quantity when mining
without and with super-pattern constraint on Gazelle with Length = 4,
TopK = 5%. We also obtain similar results for the other databases.

Table 9 Comparison of extracted pattern quantity on Gazelle

minSup (%)  |U|   |U |  #FP (without constraint)  #FCP (with constraint)
1           59    6     510                       109
0.9         211   7     640                       146
0.8         212   8     807                       205
0.7         549   12    1074                      306
0.6         1181  16    1485                      893

We then conduct experiments for mining web access patterns with
super-pattern constraint to evaluate the performance as two parameters
change: minSup and Length.

Performance vs. minSup The first set of experiments aims to test the
performance of the algorithms as the threshold minSup decreases. All
the algorithms return the same discovered patterns, in both quality and
quantity, but the runtimes differ on each database. The comparison of
execution times is shown in Figs. 4, 5 and 6.
The results show that CM-SPAMC runs faster than PRISMC on the Kosarak
database but slower on the Gazelle and FIFA databases. This is because,
in Gazelle and FIFA, items co-occur with each other in almost all
sequences, so fewer candidates can be pruned. On the other hand, we
find that both MWAPC and EMWAPC are faster than PRISMC and CM-SPAMC
for all experimental databases. In most cases, EMWAPC is the fastest of
all tested algorithms.

Fig. 4 Comparison of mining time with various minSup values for Gazelle database

For example, consider the performance of the algorithms as minSup
decreases from 1% to 0.6% for the Gazelle database with TopK = 5% and
Length = 4 in Fig. 4. As expected, the execution times of all
algorithms increase, since more candidates are potentially frequent for
lower values of minSup. Fig. 4 shows that EMWAPC performs better than
MWAPC because EMWAPC prunes the search space early. Moreover, it does
much better than

Fig. 5 Comparison of mining time with various minSup values for FIFA database

Fig. 7 Comparison of mining time with various Length values for Gazelle database

MWAPC for low support levels. This is because EMWAPC can skip
constraint checking for more candidate patterns, as the number of
candidate patterns satisfying the constraint is larger at low support
levels. We also obtain similar results for the other databases. In
particular, since the difference between the approaches is very large
for the Kosarak database, we show the vertical axis in logarithmic
scale. The mining time of PRISMC is over 5000 s when minSup is less
than 0.2% in Fig. 6. We note that the number of distinct items in the
Kosarak database is much larger than in Gazelle and FIFA, so EMWAPC can
prune more sub-trees early with respect to the frequent items in F1.

Performance vs. Length The second set of experiments was conducted to
test the performance of the algorithms as the length of the constraint
patterns in the set U increased. To do this, we vary Length within an
appropriate range, chosen according to the length of the frequent
patterns discovered in each database.
The results are shown in Figs. 7, 8 and 9. They also indicate that
EMWAPC takes the least time among the four algorithms, MWAPC takes the
second least, and CM-SPAMC runs much slower than the others. But
CM-SPAMC runs faster than PRISMC on the Kosarak database, similar to the

Fig. 6 Comparison of mining time with various minSup values for Kosarak database

Fig. 8 Comparison of mining time with various Length values for FIFA database

first set of experiments. In Fig. 9, the vertical axis is in
logarithmic scale because the gap between the algorithms is large. The
mining time of PRISMC is over 5000 s at minSup = 0.18% for all values
of Length. This means that PRISMC is 20 times slower than MWAPC and 25
times slower than EMWAPC, respectively.

Fig. 9 Comparison of mining time with various Length values for Kosarak database

In addition, the gaps among the four algorithms become greater as
Length increases. For the databases with long sequences, such as
Kosarak and FIFA, the difference between them becomes larger. This is
mainly because, when Length increases, |U | decreases but the
constraint checking requires more time. This shows that skipping
constraint checking for numerous candidates is an effective technique
to improve the performance of the EMWAPC algorithm.
From these results, we conclude that incorporating constraints into the
pattern mining process is better than pushing constraints after the
pattern mining process. Besides, in most cases EMWAPC outperforms the
remaining algorithms. These results confirm the effectiveness of using
a compressible representation (DBVP), pruning the search space early,
and avoiding constraint checking for a large number of patterns by
utilizing the properties of the DBVP and the PreWAP tree.

7 Conclusion and future work

This study presented the problem of mining web access patterns with
super-pattern constraint. We introduced the PreWAP tree by applying the
dynamic bit vector structure. Based on the PreWAP tree, we proposed two
algorithms, namely MWAPC and EMWAPC, for mining web access patterns
with super-pattern constraint. The main contribution of this study is
the EMWAPC algorithm, which is developed from three propositions for
quick pruning and reduced constraint checking. The experimental results
show that EMWAPC outperforms MWAPC, CM-SPAMC and PRISMC in terms of
mining time.
In the future, we will apply this constraint to mining other types of
sequential data, such as customer transactions. We will also apply our
technique to mining frequent closed sequential patterns with
constraints.

Acknowledgments This research is funded by Vietnam National Foundation
for Science and Technology Development (NAFOSTED) under grant number
102.05-2015.07.

References

1. Agrawal R, Srikant R (1995) Mining sequential patterns. Proceedings of the 11th International Conference on Data Engineering, pp 3–14
2. Ayres J, Gehrke JE, Yiu T, Flannick J (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 429–435
3. Béchet N, Cellier P, Charnois T, Crémilleux B (2015) Sequence mining under multiple constraints. Proceedings of the 30th Annual ACM Symposium on Applied Computing, pp 908–914
4. Chen E, Cao H, Li Q, Qian T (2008) Efficient strategies for tough aggregate constraint-based sequential pattern mining. Inf Sci 176(1):1498–1518
5. Fournier-Viger P, Gomariz A, Campos M, Thomas R (2014) Fast Vertical Mining of Sequential Patterns Using Co-occurrence Information. PAKDD’14, pp 40–52
6. Garofalakis MN, Rastogi R, Shim K (1999) SPIRIT: Sequential pattern mining with regular expression constraints. VLDB 99:7–10
7. Gouda K, Hassaan M, Zaki MJ (2010) Prism: An effective approach for frequent sequence mining via prime-block encoding. J Comput Syst Sci 76(1):88–102
8. Guerbas A, Addam O, Nagi M, Elhajj A, Ridley M, Alhajj R (2013) Effective web log mining and online navigational pattern prediction. Knowl-Based Syst 49:50–62
9. Ho J, Lukov L, Chawla S (2005) Sequential pattern mining with constraints on large protein databases. In: COMAD, pp 89–100
10. Le B, Tran MT, Vo B (2015) Mining frequent closed inter-sequence patterns efficiently using dynamic bit vectors. Appl Intell 43(1):74–84
11. Lu Y, Ezeife CI (2003) Position Coded Pre-order Linked WAP-Tree for Web Log Sequential Pattern Mining. In: PAKDD 2003, LNCS (LNAI), vol 2637, pp 337–349
12. Mary SP, Baburaj E (2016) A novel framework for an efficient online recommendation system using constraint based web usage mining techniques. Biomedical Research, pp 92–98
13. Masseglia F, Poncelet P, Teisseire M (2009) Efficient mining of sequential patterns with time constraints: Reducing the combinations. Expert Syst Appl 36(2):2677–2690
14. Mooney CH, Roddick JF (2013) Sequential pattern mining – approaches and algorithms. ACM Comput Surv 45(2):19

15. Orlando S, Perego R, Silvestri C (2004) A New Algorithm for gap constrained sequence mining. In: Proceedings of the ACM Symposium on Applied Computing, pp 540–547
16. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC (2004) Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440
17. Pei J, Han J, Wang W (2007) Constraint-based sequential pattern mining: the pattern-growth methods. J Intell Inf Syst 28(2):133–160
18. Pei J, Han J, Mortazavi-asl B, Zhu H (2000) Mining access patterns efficiently from web logs. In PAKDD 2000, LNCS, vol 1805, pp 396–407
19. Rathore KS, Sharma S (2016) Web personalization based on enhanced web access pattern using sequential pattern mining. Int Eng Comput Sci 5(6):17152–17159
20. Rajimol A, Raju G (2012) Web access pattern mining – a survey. Data Engineering, Management, Lecture Notes in Computer Science, vol 6411. Springer, Berlin, pp 24–31
21. Srikant R, Agrawal R (1996) Mining sequential patterns: Generalizations and performance improvements. Advances in Database Technology, EDBT’96, pp 1–17
22. Tang P, Turkia MP, Gallivan KA (2007) Mining web access patterns with first-occurrence linked WAP-trees. In SEDE’, vol 07, pp 247–252
23. Thushara Y, Ramesh V (2016) A study of web mining application on E-commerce using google analytics tool. Int J Comput Appl 149(11):21–26
24. Tran MT, Le B, Vo B (2015) Combination of dynamic bit vectors and transaction information for mining frequent closed sequences efficiently. Eng Appl Artif Intell 38:183–189
25. Van T, Vo B, Le B (2011) Mining sequential rules based on prefix-tree. In New Challenges for Intelligent Information and Database Systems, pp 147–156
26. Vijayalakshmi S, Mohan V, Suresh RS (2010) Mining of users access behavior for frequent sequential pattern from web logs. Int J Database Manag Syst 2(3):31–45
27. Vo B, Hong TP, Le B (2012) DBV-Miner: A Dynamic Bit vector approach for fast mining frequent closed itemsets. Expert Syst Appl 39(8):7196–7206
28. Wu X, Zhu X, He Y, Arslan AN (2013) PMBC: Pattern mining from biological sequences with wildcard constraints. Comput Biol Med 43(5):481–492
29. Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. Proceedings of the 9th International Conference on Information and Knowledge Management, pp 422–429
30. Zaki MJ (2001) SPADE: An Efficient Algorithm for Mining Frequent Sequences. Mach Learn 42(1):31–60
