Chapter 6

Data Mining

Mining Sequential Patterns

Arif Djunaidy
Data Mining Arif Djunaidy FTIF ITS
What is sequential rules mining?
Finding sequential patterns
AprioriAll Algorithm
Generalized Sequential Patterns (GSP)

What Is Sequential Rules Mining? - 1
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
Sequence mining: discover sequences of events that commonly occur.
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
The computational complexity is much higher than for association rule discovery: there are O(m^k 2^(k-1)) possible sequential patterns having k events, where m is the total number of possible events.

What Is Sequential Rules Mining? - 2
The input data is a set of sequences, called data-sequences.
Each data-sequence is a list of transactions, where each transaction is a set of literals, called items.
Typically, a transaction-time is associated with each transaction.
A sequential pattern also consists of a list of sets of items.
The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.

What Is Sequential Rules Mining? - 3
In the database of a book club, each data-sequence may correspond to all book selections of a customer, and each transaction to the books selected by the customer in one order.
A sequential pattern might be: 5% of customers bought "Foundation", then "Foundation and Empire", and then "Second Foundation".
Elements of a sequential pattern can be sets of items, for example, "Foundation" and "Ringworld", followed by "Foundation and Empire" and "Ringworld Engineers", followed by "Second Foundation".

Problem Statement - 1
We are given a database D of customer transactions.
Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction.
No customer has more than one transaction with the same transaction-time.
Quantities of items bought in a transaction are not considered.
Each item is a binary variable representing whether an item was bought or not.

Problem Statement - 2
An itemset is a non-empty set of items.
A sequence is an ordered list of itemsets.
The support for a sequence is defined as the fraction of total customers who support this sequence.
It is assumed that the set of items is mapped to a set of contiguous integers.
An itemset i is denoted as (i_1 i_2 ... i_m), where i_j is an item.
A sequence s is denoted as { s_1 s_2 ... s_n }, where s_j is an itemset.
A sequence { a_1 a_2 ... a_n } is contained in another sequence { b_1 b_2 ... b_m } if there exist integers i_1 < i_2 < ... < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, ..., a_n ⊆ b_{i_n}.
For example:
The sequence { (3) (4 5) (8) } is contained in { (7) (3 8) (9) (4 5 6) (8) }, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8).
However, the sequence { (3) (5) } is not contained in { (3 5) } (and vice versa).
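
The containment test defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name is_contained and the representation of a sequence as a list of Python sets are assumptions made here, not part of the slides. It scans the longer sequence once and greedily matches each itemset of the shorter one to the earliest possible transaction.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both arguments are lists of sets (itemsets) in time order.
    """
    i = 0  # index of the next itemset of a that still needs a match
    for itemset in b:
        if i < len(a) and a[i] <= itemset:  # a[i] is a subset of this itemset
            i += 1
    return i == len(a)


# The two examples from the slide.
print(is_contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained([{3}, {5}], [{3, 5}]))  # False
print(is_contained([{3, 5}], [{3}, {5}]))  # False
```

Matching each itemset as early as possible is safe here because an early match never rules out a later one.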

Problem Statement - 3
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support.
Each such maximal sequence represents a sequential pattern.
A sequence satisfying the minimum support constraint is called a large sequence.

Problem Statement: Example
[Table: Database Sorted by Customer Id and Transaction Time]
[Table: Customer-Sequence Version of the Database]
With the minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, {(30) (90)} and {(30) (40 70)}, are maximal among those satisfying the support constraint and are the desired sequential patterns.
An example of a sequence that does not have minimum support is {(10 20) (30)}, which is supported only by customer 2.
The sequences {(30)}, {(40)}, {(70)}, {(90)}, {(30) (40)}, {(30) (70)} and {(40 70)}, though having minimum support, are not in the answer because they are not maximal.
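
The support figures in this example can be checked with a short Python sketch. The tables are not reproduced in this text, so the five customer sequences below are an assumption: they are illustrative data chosen to agree with every statement on this slide (25% corresponding to 2 customers, and {(10 20) (30)} supported only by customer 2). The names CUSTOMER_SEQUENCES, is_contained and support are likewise hypothetical.

```python
# Assumed customer-sequence database (customer-id -> list of transactions).
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def is_contained(a, b):
    """True if every itemset of a is a subset of a distinct, later-or-equal transaction of b, in order."""
    i = 0
    for itemset in b:
        if i < len(a) and a[i] <= itemset:
            i += 1
    return i == len(a)


def support(pattern, customer_sequences):
    """Fraction of customers whose sequence contains the pattern."""
    hits = sum(1 for seq in customer_sequences.values() if is_contained(pattern, seq))
    return hits / len(customer_sequences)


print(support([{30}, {90}], CUSTOMER_SEQUENCES))      # 0.4
print(support([{30}, {40, 70}], CUSTOMER_SEQUENCES))  # 0.4
print(support([{10, 20}, {30}], CUSTOMER_SEQUENCES))  # 0.2
```

Under this assumed data, the first two patterns clear the 25% threshold and the third does not, which matches the text.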

Finding Sequential Patterns
The length of a sequence is the number of itemsets in the sequence.
A sequence of length k is called a k-sequence.
The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
An itemset with minimum support is called a large itemset or litemset.
Note that each itemset in a large sequence must have minimum support. Hence, any large sequence must be a list of litemsets.

Finding Sequential Patterns: The Algorithm - 1
1. Sort Phase.
The database (D) is sorted, with customer-id as the major key and transaction-time as the minor key.
This step implicitly converts the original transaction database into a database of customer sequences.
2. Litemset Phase.
In this phase, we find the set of all litemsets L.
We are also simultaneously finding the set of all large 1-sequences, since this set is just { (l) | l ∈ L }.
The litemsets are mapped to a set of contiguous integers.
In the example database, the large itemsets are (30), (40), (70), (40 70) and (90), which are mapped to 1, 2, 3, 4 and 5, respectively (see next slide).
The reason for this mapping is that, by treating litemsets as single entities, we can compare two litemsets for equality in constant time and reduce the time required to check whether a sequence is contained in a customer sequence.
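
A brute-force sketch of the litemset phase follows. It is not the method of the slides: it simply enumerates every subset of every transaction, which is only workable for tiny transactions (an Apriori-style itemset search would normally be used). The database is the same assumed illustrative data as before, and find_litemsets is a hypothetical name. Each customer is counted at most once per itemset, in line with the definition of itemset support.

```python
from itertools import combinations

# Assumed customer-sequence database, consistent with the statements in the slides.
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def find_litemsets(customer_sequences, minsup):
    """Return all itemsets (as sorted tuples) bought in a single transaction
    by at least a fraction minsup of the customers."""
    counts = {}
    for seq in customer_sequences.values():
        seen = set()  # itemsets supported by this customer, counted once
        for transaction in seq:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    n = len(customer_sequences)
    return {itemset for itemset, c in counts.items() if c / n >= minsup}


print(sorted(find_litemsets(CUSTOMER_SEQUENCES, 0.25)))
# [(30,), (40,), (40, 70), (70,), (90,)]
```

The five litemsets agree with those listed on the slide. The integer labels assigned to them are arbitrary, so the sorted order printed here need not match the slide's numbering.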

Finding Sequential Patterns: The Algorithm - 2
2. Litemset Phase (example)
[Tables: Customer-Sequence Version of the Database; Large itemsets (minsup = 25%)]

Finding Sequential Patterns: The Algorithm - 3
3. Transformation Phase.
As we will see later (phase 4), we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
To make this test fast, we transform each customer sequence into an alternative representation.
In a transformed customer sequence, each transaction is replaced by the set of all litemsets contained in that transaction.
If a transaction does not contain any litemset, it is not retained in the transformed sequence.
If a customer sequence does not contain any litemset, this sequence is dropped from the transformed database. However, it still contributes to the count of the total number of customers.
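
The transformation can be sketched as follows. The litemset-to-integer mapping is the one stated in the litemset phase; the customer sequences are again assumed illustrative data, and transform is a hypothetical name.

```python
# Mapping stated in the slides: (30)->1, (40)->2, (70)->3, (40 70)->4, (90)->5.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

# Assumed customer-sequence database, consistent with the statements in the slides.
CUSTOMER_SEQUENCES = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def transform(customer_sequences, litemset_ids):
    """Replace each transaction by the set of ids of the litemsets it contains."""
    transformed = {}
    for customer, seq in customer_sequences.items():
        new_seq = []
        for transaction in seq:
            ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
            if ids:  # a transaction without any litemset is not retained
                new_seq.append(ids)
        if new_seq:  # a customer without any litemset is dropped
            transformed[customer] = new_seq
    return transformed


for customer, seq in transform(CUSTOMER_SEQUENCES, LITEMSET_IDS).items():
    print(customer, seq)
```

Support must still be computed against the original number of customers, len(CUSTOMER_SEQUENCES), rather than the size of the transformed database, since dropped customers continue to count toward the total.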

Finding Sequential Patterns: The Algorithm - 4
3. Transformation Phase (example)
[Table: transformation phase example, minsup = 25%]

Finding Sequential Patterns: The Algorithm - 5
4. Sequence Phase.
Use the set of litemsets, working on the transformed database obtained in phase 3, to find the desired sequences.
We will illustrate the use of the AprioriAll algorithm (see later).
5. Maximal Phase.
Find the maximal sequences among the set of large sequences.
In some algorithms (such as the AprioriSome algorithm), this phase is combined with the sequence phase to reduce the time wasted in counting non-maximal sequences.

The Algorithm: AprioriAll
In the first pass, the output of the litemset phase is used to initialize the set of large 1-sequences. The candidates are stored in a hash tree to quickly find all candidates contained in a customer sequence.
In each pass, we use the large sequences obtained from the previous pass to generate the candidate sequences and then measure their support by making a pass over the database.
At the end of the pass, the support of the candidates is used to determine the large sequences.
L_k denotes the set of all large k-sequences, and C_k the set of candidate k-sequences.

AprioriAll: Candidate Generation
The apriori-generate function takes as argument L_{k-1}, the set of all large (k-1)-sequences. The function works as follows.
First, join L_{k-1} with L_{k-1} to obtain the candidate set C_k.
Next, delete all sequences c ∈ C_k such that some (k-1)-subsequence of c is not in L_{k-1}.
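
A sketch of the two steps, join and prune, is given below. The join condition is not spelled out in this text, so the one used here is an assumption based on the usual description of AprioriAll: two (k-1)-sequences are joined when they agree on their first k-2 elements, and because order matters each pair is joined in both directions. Sequences are tuples of litemset ids, and the sample input is illustrative data, not taken from the slides.

```python
def apriori_generate(large_prev):
    """Generate candidate k-sequences from the set of large (k-1)-sequences.

    large_prev: iterable of equal-length tuples of litemset ids.
    """
    large_prev = set(large_prev)
    # Join step (assumed condition): p and q share their first k-2 elements;
    # the candidate is p extended by the last element of q.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }
    # Prune step: every (k-1)-subsequence, obtained by dropping one element,
    # must itself be large.
    return {
        c for c in joined
        if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))
    }


# Illustrative set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_generate(L3))  # {(1, 2, 3, 4)}
```

For this input the join produces candidates such as (1 2 4 3) and (1 3 4 5), but they are pruned because (2 4 3) and (3 4 5) are not in L3, leaving only (1 2 3 4).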


AprioriAll: Finding Maximal Sequences
Having found the set S of all large sequences in the sequence phase, the following algorithm can be used for finding maximal sequences. Let the length of the longest sequence be n. Then, for k = n down to 2, for each k-sequence s_k in S, delete from S all subsequences of s_k.
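
The maximal phase can be sketched as a loop from the longest sequences downward, deleting every subsequence of each surviving sequence. This is a minimal sketch under stated assumptions: sequences are tuples of litemset ids, containment between the litemsets themselves is ignored, and the input set is illustrative data chosen to be consistent with the AprioriAll example in these slides. The names is_subsequence and maximal_sequences are hypothetical.

```python
def is_subsequence(s, t):
    """True if s can be obtained from t by deleting elements (order preserved)."""
    it = iter(t)
    return all(x in it for x in s)  # each lookup consumes t up to the match


def maximal_sequences(large):
    """Keep only the sequences that are not contained in a longer large sequence."""
    remaining = set(large)
    if not remaining:
        return remaining
    n = max(len(s) for s in remaining)
    for k in range(n, 1, -1):
        for s_k in [s for s in remaining if len(s) == k]:
            remaining = {
                s for s in remaining
                if s == s_k or not is_subsequence(s, s_k)
            }
    return remaining


# Illustrative set of all large sequences (lengths 1 to 4).
S = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
print(sorted(maximal_sequences(S), key=len))  # [(4, 5), (1, 3, 5), (1, 2, 3, 4)]
```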

AprioriAll: Example
Assume we have a database with the customer sequences as shown below (in the transformed form). The minimum support is assumed to be 40% (i.e., 2 customer sequences).
[Table: customer sequences in transformed form]
[Table: Candidate 3-sequences]
No candidate is generated for the 5-sequences.
The resulting maximal large sequences: { 1 2 3 4 }, { 1 3 5 } and { 4 5 }.

Algorithm: AprioriSome

AprioriSome: Forward Phase
In the forward phase, only sequences of certain lengths are counted.
For example, sequences of length 1, 2, 4 and 6 might be counted in the forward phase, and sequences of length 3 and 5 in the backward phase.
The function next takes as parameter the length of the sequences counted in the last pass and returns the length of the sequences to be counted in the next pass.
The apriori-generate function is used to generate new candidate sequences.
However, in the kth pass, we may not have the large sequence set L_{k-1} available, as we did not count the (k-1)-candidate sequences. In that case, we use the candidate set C_{k-1} to generate C_k.
Correctness is maintained because C_{k-1} ⊇ L_{k-1}.
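
The role of the next function can be illustrated with a small sketch that only works out which lengths are counted in the forward phase and which are left for the backward phase. It does not count any support. The name forward_schedule and the max_length parameter are assumptions made for illustration; in the algorithm itself the forward phase stops when no more candidates can be generated.

```python
def forward_schedule(next_length, max_length):
    """Split lengths 1..max_length into (counted, skipped) for the forward phase.

    next_length(k) returns the length to count after length k has been counted.
    """
    counted, skipped = [1], []  # length 1 comes from the litemset phase
    target = next_length(1)
    for k in range(2, max_length + 1):
        if k == target:
            counted.append(k)
            target = next_length(k)
        else:
            skipped.append(k)  # candidates are generated but not counted
    return counted, skipped


# Doubling rule, next(k) = 2k, up to length 4.
print(forward_schedule(lambda k: 2 * k, 4))  # ([1, 2, 4], [3])

# A rule that yields the lengths 1, 2, 4, 6 mentioned in the text.
print(forward_schedule(lambda k: 2 if k == 1 else k + 2, 6))  # ([1, 2, 4, 6], [3, 5])
```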


AprioriSome: Forward Phase - Example
Using the database from the AprioriAll example, we find the large 1-sequences (L_1) in the litemset phase (during the first pass over the database).
For simplicity of illustration, take f(k) = 2k.
In the second pass, we count C_2 to get L_2.
After the second pass, apriori-generate is called with L_2 as argument to get C_3. We do not count C_3 and hence do not generate L_3.
Next, apriori-generate is called with C_3 to get C_4, which, after pruning, turns out to be the same C_4 as in the AprioriAll example, namely (1 2 3 4).
After counting C_4 to get L_4, we try generating C_5, which turns out to be empty.

AprioriSome: Backward Phase
In the backward phase, we count sequences for the lengths we skipped over during the forward phase, after first deleting all sequences contained in some large sequence.
These smaller sequences cannot be in the answer because we are only interested in maximal sequences.
We also delete the large sequences found in the forward phase that are non-maximal.

AprioriSome: Backward Phase - Example
When the backward phase is started, nothing gets deleted from L_4, since there are no longer sequences.
We had skipped counting the support for sequences in C_3 in the forward phase.
After deleting those sequences in C_3 that are subsequences of sequences in L_4, i.e., subsequences of (1 2 3 4), we are left with the sequences (1 3 5) and (3 4 5).
Those would be counted to get (1 3 5) as a maximal large 3-sequence.
Next, all the sequences in L_2 except (4 5) are deleted, since they are contained in some longer sequence.
For the same reason, all sequences in L_1 are also deleted.
The resulting maximal sequences are (1 2 3 4), (1 3 5) and (4 5).
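
The deletion step of the backward phase in this example can be sketched as a simple filter. The candidate set C3 below is an assumption: the table of candidate 3-sequences is not reproduced in this text, so the list is illustrative data chosen to agree with the outcome described above. The function names are hypothetical.

```python
def is_subsequence(s, t):
    """True if s can be obtained from t by deleting elements (order preserved)."""
    it = iter(t)
    return all(x in it for x in s)


def drop_contained(candidates, longer_large):
    """Remove candidates that are subsequences of an already known large sequence."""
    return [
        c for c in candidates
        if not any(is_subsequence(c, big) for big in longer_large)
    ]


# Assumed candidate 3-sequences skipped in the forward phase.
C3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4), (3, 4, 5)]
L4 = [(1, 2, 3, 4)]

print(drop_contained(C3, L4))  # [(1, 3, 5), (3, 4, 5)]
```

Only the two remaining candidates need a counting pass over the database, which is where the saving over counting every length comes from.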

