Slides 06FPBasic
— Chapter 6 —
Chapter 6: Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Pattern Evaluation Methods
■ Summary
What Is Frequent Pattern Analysis?
■ Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
■ First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
■ Motivation: Finding inherent regularities in data
■ What products were often purchased together?— Beer and diapers?!
■ What are the subsequent purchases after buying a PC?
■ What kinds of DNA are sensitive to this new drug?
■ Can we automatically classify web documents?
■ Applications
■ Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
■ Broad applications
Basic Concepts: Frequent Patterns
Computational Complexity of Frequent Itemset Mining
Chapter 6: Mining Frequent Patterns, Associations and Correlations: Basic Concepts and Methods
■ Basic Concepts
■ Frequent Itemset Mining Methods
■ Pattern Evaluation Methods
■ Summary
Scalable Frequent Itemset Mining Methods
■ Apriori: A Candidate Generation-and-Test Approach
■ FPGrowth: A Frequent Pattern-Growth Approach
■ Mining by Exploring Vertical Data Format
The Downward Closure Property and Scalable Mining Methods
■ The downward closure property of frequent patterns
  ■ Any subset of a frequent itemset must be frequent
  ■ If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  ■ i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
■ Scalable mining methods: three major approaches
  ■ Apriori (Agrawal & Srikant @VLDB'94)
  ■ Frequent pattern growth (FPgrowth: Han, Pei & Yin @SIGMOD'00)
  ■ Vertical data format approach (Charm: Zaki & Hsiao @SDM'02)
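The pruning test that the downward closure property licenses can be sketched as follows (an illustrative Python sketch; the function name and the toy L2 are my own, not from the slides):

```python
from itertools import combinations

def all_subsets_frequent(candidate, frequent_smaller):
    """Downward closure: a k-itemset can be frequent only if every one
    of its (k-1)-subsets is already frequent."""
    k = len(candidate)
    return all(frozenset(s) in frequent_smaller
               for s in combinations(candidate, k - 1))

# Toy L2 for the slide's beer/diaper example:
L2 = {frozenset(p) for p in [("beer", "diaper"), ("beer", "nuts"),
                             ("diaper", "nuts")]}
print(all_subsets_frequent({"beer", "diaper", "nuts"}, L2))  # True
print(all_subsets_frequent({"beer", "diaper", "milk"}, L2))  # False
```

Apriori uses exactly this test to discard candidates before counting them.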
Apriori: A Candidate Generation & Test Approach
The Apriori Algorithm—An Example
Database TDB (Supmin = 2):

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
→ L1 = {A}:2, {B}:3, {C}:3, {E}:3

C2 = {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
→ L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 = {B,C,E}
3rd scan: → L3 = {B,C,E}:2
The Apriori Algorithm (Pseudo-Code)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end;
return ∪k Lk;
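The pseudo-code above can be turned into a compact runnable sketch (a minimal Python illustration using the slide's example database; function and variable names are my own):

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search, as in the pseudo-code: generate Ck+1 from Lk,
    count the candidates in one pass, keep those with min_support."""
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {c: n for c, n in counts.items() if n >= min_support}  # L1
    all_frequent, k = dict(Lk), 1
    while Lk:
        prev = list(Lk)
        # Self-join Lk, keeping only (k+1)-itemsets whose k-subsets are
        # all frequent (downward closure pruning).
        Ck1 = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    Ck1.add(union)
        counts = defaultdict(int)
        for t in transactions:                 # one scan per level
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(Lk)
        k += 1
    return all_frequent

# The example database TDB with Supmin = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, 2)
print(result[frozenset("BCE")])  # 2
```

On this TDB the sketch reproduces the worked example, ending with {B, C, E} at support 2.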
Implementation of Apriori
■ How to generate candidates?
  ■ Step 1: self-joining Lk
  ■ Step 2: pruning
■ Example of candidate generation
  ■ L3 = {abc, abd, acd, ace, bcd}
  ■ Self-joining: L3*L3
    ■ abcd from abc and abd
    ■ acde from acd and ace
  ■ Pruning:
    ■ acde is removed because ade is not in L3
  ■ C4 = {abcd}
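The two steps can be sketched directly (an illustrative Python sketch of self-join plus prune; itemsets are modeled as sorted tuples of items):

```python
from itertools import combinations

def gen_candidates(L_k):
    """Step 1 (self-join): merge two k-itemsets agreeing on their first
    k-1 items; Step 2 (prune): drop joins with an infrequent k-subset."""
    L_k = sorted(tuple(sorted(s)) for s in L_k)
    k = len(L_k[0])
    joined = [p + (q[-1],)
              for i, p in enumerate(L_k)
              for q in L_k[i + 1:]
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
    frequent = set(L_k)
    return [c for c in joined
            if all(s in frequent for s in combinations(c, k))]

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(gen_candidates(L3))  # [('a', 'b', 'c', 'd')]
```

The join produces abcd and acde; pruning removes acde because its subset ade is not in L3, leaving C4 = {abcd} as on the slide.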
How to Count Supports of Candidates?
Counting Supports of Candidates Using Hash Tree
[Figure: a hash tree whose leaves hold candidate 3-itemsets; the subset (hash) function sends items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the three branches. The transaction {1, 2, 3, 5, 6} is recursively decomposed (1+2356, 12+356, 13+56, ...) so that only the leaves it can possibly match are checked.]
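Setting the hash tree aside, the core subset-counting step can be sketched without it (a naive Python sketch; the hash tree exists precisely to avoid probing every candidate against every transaction, but the resulting counts are the same):

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Enumerate each transaction's k-subsets once and probe a candidate
    set; the hash tree on the slide accelerates exactly this lookup."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for s in combinations(sorted(t), k):
            if s in counts:
                counts[s] += 1
    return counts

candidates = {("B", "C", "E")}                    # sorted tuples
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(count_supports(tdb, candidates, 3))  # {('B', 'C', 'E'): 2}
```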
Candidate Generation: An SQL Implementation
■ SQL Implementation of candidate generation
■ Suppose the items in Lk-1 are listed in an order
■ Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
■ Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1 ) then delete c from Ck
■ Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [See: S. Sarawagi, S. Thomas, and R. Agrawal.
Integrating association rule mining with relational database systems:
Alternatives and implications. SIGMOD’98]
Scalable Frequent Itemset Mining Methods
■ Apriori: A Candidate Generation-and-Test Approach
Further Improvement of the Apriori Method
Partition: Scan Database Only Twice
■ Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
■ Scan 1: partition database and find local frequent
patterns
■ Scan 2: consolidate global frequent patterns
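A minimal sketch of the two-scan idea (illustrative Python with my own names; real implementations mine each partition with a full in-memory Apriori pass rather than the brute-force local miner used here):

```python
from collections import defaultdict
from itertools import combinations

def local_frequent(part, min_sup):
    """Brute-force local mining (fine for tiny partitions): count every
    subset of every transaction in the partition."""
    counts = defaultdict(int)
    for t in part:
        for k in range(1, len(t) + 1):
            for s in combinations(sorted(t), k):
                counts[s] += 1
    return {s for s, n in counts.items() if n >= min_sup}

def partition_mine(transactions, min_sup, n_parts=2):
    # Scan 1: a globally frequent itemset must be locally frequent in at
    # least one partition (threshold scaled to the partition's size).
    size = -(-len(transactions) // n_parts)        # ceil division
    candidates = set()
    for i in range(0, len(transactions), size):
        part = transactions[i:i + size]
        local_min = max(1, min_sup * len(part) // len(transactions))
        candidates |= local_frequent(part, local_min)
    # Scan 2: count the union of local candidates over the full database.
    counts = {c: sum(1 for t in transactions if set(c) <= t)
              for c in candidates}
    return {c: n for c, n in counts.items() if n >= min_sup}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = partition_mine(tdb, 2)
print(result[("B", "C", "E")])  # 2
```

Scan 1 can only over-generate, never miss: an itemset infrequent in every partition cannot be frequent overall, which is why two scans suffice.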
DIC: Reduce Number of Scans
■ Once both A and D are determined frequent, the counting of AD begins
■ Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: the itemset lattice from {} up to ABCD; Apriori begins counting each level of k-itemsets only after a full scan, while DIC begins counting an itemset as soon as all of its subsets are determined frequent.]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
Scalable Frequent Itemset Mining Methods
■ Apriori: A Candidate Generation-and-Test Approach
Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
■ Bottlenecks of the Apriori approach
■ Breadth-first (i.e., level-wise) search
■ Candidate generation and test
■ Often generates a huge number of candidates
■ The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
■ Depth-first search
■ Avoid explicit candidate generation
■ Major philosophy: Grow long patterns from short ones using local
frequent items only
■ Suppose “abc” is a frequent pattern
■ Get all transactions having “abc”, i.e., project DB on abc: DB|abc
■ If “d” is a local frequent item in DB|abc, then abcd is a frequent pattern
Construct FP-tree from a Transaction Database
min_support = 3

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o, w}         {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order, giving the f-list
3. Scan DB again, construct the FP-tree

Header Table: f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p

[FP-tree:]
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
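The three construction steps can be sketched in Python (an illustrative sketch, not the original implementation; note that ties between equally frequent items, such as f and c, may be ordered either way in the f-list, and this sketch breaks ties alphabetically, so its f-list starts c-f rather than the slide's f-c):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support):
    """Steps 1-2: one scan to count items and build the f-list;
    step 3: a second scan inserting each transaction's ordered
    frequent items into the tree."""
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    # Ties between equally frequent items are broken alphabetically here;
    # the slide happens to put f before c instead.
    flist = [i for i, n in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
             if n >= min_support]
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None, None)
    header = defaultdict(list)           # item -> list of its tree nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
    return root, header, flist

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
       list("bcksp"), list("afcelpmn")]
root, header, flist = build_fp_tree(tdb, 3)
print(flist)  # ['c', 'f', 'a', 'b', 'm', 'p']
```

Apart from the c/f tie-break, the resulting tree mirrors the slide's: four transactions share the top prefix, one transaction ({f, b}) branches off at the root.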
Partition Patterns and Databases
■ Frequent patterns can be partitioned according to the f-list
  ■ Patterns containing p
  ■ …
  ■ Patterns containing only f
Find Patterns Having p From p-conditional Database
Header Table: f:4, c:4, a:3, b:3, m:3, p:3 (same FP-tree as the previous slide)

Conditional pattern bases:
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
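Given parent links and a header table, a conditional pattern base is read off by walking each occurrence of an item up to the root (illustrative Python; the tree below is hand-built to mirror the slide's FP-tree, and the variable names are my own):

```python
class Node:
    def __init__(self, item, count, parent=None):
        self.item, self.count, self.parent = item, count, parent

# Hand-built copy of the slide's FP-tree (f-list order f-c-a-b-m-p).
root = Node(None, 0)
f4 = Node("f", 4, root); c3 = Node("c", 3, f4); a3 = Node("a", 3, c3)
m2 = Node("m", 2, a3);   p2 = Node("p", 2, m2)
b1a = Node("b", 1, a3);  m1 = Node("m", 1, b1a)
b1f = Node("b", 1, f4)
c1 = Node("c", 1, root); b1c = Node("b", 1, c1); p1 = Node("p", 1, b1c)
header = {"m": [m2, m1], "p": [p2, p1]}

def conditional_pattern_base(nodes):
    """Each occurrence of the item contributes its prefix path,
    weighted by that occurrence's count."""
    base = []
    for node in nodes:
        path, p = [], node.parent
        while p.item is not None:      # stop at the root
            path.append(p.item)
            p = p.parent
        base.append(("".join(reversed(path)), node.count))
    return base

print(conditional_pattern_base(header["p"]))  # [('fcam', 2), ('cb', 1)]
print(conditional_pattern_base(header["m"]))  # [('fca', 2), ('fcab', 1)]
```

These match the p and m rows of the table above; FPGrowth then mines each conditional base recursively.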