Chapter 6 (Frequent Patterns)

Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
 Motivation: finding inherent regularities in data
 What products are often purchased together?—Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Freq. Pattern Mining Important?
 Freq. pattern: an intrinsic and important property of datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
Basic Concepts: Frequent Patterns

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 itemset: a set of one or more items
 k-itemset X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
 (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a minsup threshold

[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both]
Basic Concepts: Association Rules

Tid | Items bought
10  | Butter, Nuts, Diaper
20  | Butter, Coffee, Diaper
30  | Butter, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
 Freq. Pat.: Butter:3, Nuts:3, Diaper:4, Eggs:3, {Butter, Diaper}:3
 Association rules (many more exist):
 Butter ⇒ Diaper (support 60%, confidence 100%)
 Diaper ⇒ Butter (support 60%, confidence 75%)
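As a quick numerical check of these definitions, here is a minimal Python sketch (the transactions mirror the toy table above; the support/confidence helper names are illustrative, not from the text):

# Toy transaction database from the slide above.
transactions = [
    {"Butter", "Nuts", "Diaper"},
    {"Butter", "Coffee", "Diaper"},
    {"Butter", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Relative support: fraction of transactions containing every item."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of lhs => rhs: support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Butter", "Diaper"}))        # 0.6  -> 60% support
print(confidence({"Butter"}, {"Diaper"}))   # 1.0  -> 100% confidence
print(confidence({"Diaper"}, {"Butter"}))   # 0.75 -> 75% confidence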
Interesting association rules

 confidence(A ⇒ B) = P(B|A) = P(A ∪ B) / P(A), i.e., the support of A ∪ B divided by the support of A


Association rule mining
 Given a transaction database and minsup and minconf thresholds, compute all association rules that satisfy the minsup and minconf requirements
 Steps
 Find all frequent itemsets
 Generate association rules from the frequent itemsets and keep those that satisfy minimum confidence


Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: three major approaches
 Apriori
 Freq. pattern growth
 Vertical data format approach
Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
 Apriori method:
 Initially, scan the DB once to get the frequent 1-itemsets
 Generate length-(k+1) candidate itemsets from length-k frequent itemsets
 Test the candidates against the DB
 Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example: Generating All Frequent Itemsets (min_sup = 2)

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan gives C1:             L1 (prune sup < 2):
Itemset | sup                  Itemset | sup
{A}     | 2                    {A}     | 2
{B}     | 3                    {B}     | 3
{C}     | 3                    {C}     | 3
{D}     | 1                    {E}     | 3
{E}     | 3

C2 from L1, 2nd scan:          L2:
Itemset | sup                  Itemset | sup
{A, B}  | 1                    {A, C}  | 2
{A, C}  | 2                    {B, C}  | 2
{A, E}  | 1                    {B, E}  | 3
{B, C}  | 2                    {C, E}  | 2
{B, E}  | 3
{C, E}  | 2

C3 from L2, 3rd scan:          L3:
Itemset                        Itemset   | sup
{B, C, E}                      {B, C, E} | 2

C4 = { }, so the algorithm terminates.
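As a sanity check on this trace, a brute-force sketch in Python (no Apriori pruning; names illustrative) can enumerate every candidate itemset over TDB and keep those with support count of at least 2:

from itertools import combinations

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2
items = sorted(set().union(*tdb))

# Enumerate every k-itemset and keep those meeting min_sup. This is the
# exponential search that Apriori's pruning avoids; it is fine for 5 items.
for k in range(1, len(items) + 1):
    level = {cand: count
             for cand in combinations(items, k)
             if (count := sum(set(cand) <= t for t in tdb)) >= min_sup}
    if not level:
        break
    print(f"L{k}: {level}")
# L1: A:2, B:3, C:3, E:3 | L2: AC:2, BC:2, BE:3, CE:2 | L3: BCE:2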
What are the association rules for the previous frequent itemset?
 Steps
 Find all non-empty proper subsets
 Generate the rules and find the confidence of each rule
 Select all rules that satisfy minconf
 These rules are the strong association rules

 Example: for the itemset {B, C, E}
 Subsets: {B,C}, {B,E}, {C,E}, {B}, {C}, {E}
 There will be six rules, e.g., {B,C} ⇒ {E}


Contd.

Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

 {B,C} ⇒ {E}: conf = sup{B,C,E} / sup{B,C} = 2/2 = 100%
 Similarly, find the confidence for the other rules
 Select those rules that satisfy minconf
 Suppose minconf = 60%. What are the strong rules that you can select?
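As a sketch of this rule-generation step (plain Python; helper names are illustrative, not from the text), the following enumerates all six rules from {B, C, E} over TDB and flags the ones meeting minconf = 60%:

from itertools import combinations

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def sup_count(itemset):
    """Absolute support count of an itemset over tdb."""
    return sum(itemset <= t for t in tdb)

freq = frozenset({"B", "C", "E"})
minconf = 0.60

# Every non-empty proper subset X yields a candidate rule X => freq - X.
for r in range(1, len(freq)):
    for lhs in combinations(sorted(freq), r):
        lhs = frozenset(lhs)
        conf = sup_count(freq) / sup_count(lhs)
        verdict = "strong" if conf >= minconf else "weak"
        print(f"{set(lhs)} => {set(freq - lhs)}: conf = {conf:.0%} ({verdict})")
# All six rules have confidence 67% or 100%, so all are strong at minconf = 60%.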


Implementation of Apriori
 How to generate candidate set Ck from Lk-1? (See the sketch below.)
 Example: L3 = {abc, abd, acd, ace, bcd}
 Self-joining: L3 * L3 (the first k−2 items must be common; here, the first two)
 abcd from abc and abd
 acde from acd and ace
 Pruning: if any (k−1)-subset of a candidate is infrequent, the candidate itself cannot be frequent and is removed
 acde is removed because ade is not in L3
 C4 = {abcd}
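A minimal sketch of this join-and-prune step (plain Python; itemsets kept as sorted tuples, function name illustrative):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate Ck from L(k-1): self-join on the first k-2 items, then prune."""
    L_prev = set(L_prev)
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # Join: first k-2 items equal, last item of a precedes last of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune: every (k-1)-subset must already be frequent.
                if all(sub in L_prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3, 4))   # {('a','b','c','d')}; acde is pruned (ade not in L3)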
Example 6.3 (MinSupport = 2)

[Worked Apriori example figure from the text; not reproduced in this extract]


C4 = {} and the algorithm terminates; the frequent itemsets are the union L1 ∪ L2 ∪ L3.
Calculation of candidate 3-itemsets



Rules for Table 6.1



The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
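Putting the pseudo-code together with the join-and-prune step sketched earlier, a compact runnable version might look like this (plain Python; all function and variable names are illustrative, not from the text):

from collections import Counter
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets from a single scan.
    counts = Counter(item for t in transactions for item in t)
    Lk = {(i,): c for i, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        k += 1
        # Join L(k-1) with itself, then prune by downward closure.
        prev = set(Lk)
        Ck = set()
        for a in prev:
            for b in prev:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cand = a + (b[-1],)
                    if all(s in prev for s in combinations(cand, k - 1)):
                        Ck.add(cand)
        # One scan: count each surviving candidate in every transaction.
        Lk = {}
        for cand in Ck:
            c = sum(frozenset(cand) <= t for t in transactions)
            if c >= min_sup:
                Lk[cand] = c
        frequent.update(Lk)
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))
# Matches the worked trace: L1; L2 = {AC, BC, BE, CE}; L3 = {BCE}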
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
 Mining Closed Frequent Patterns and Max-Patterns
Exercise
 Find all frequent itemsets using the Apriori algorithm and generate all association rules (assume minsup = 20%, minconf = 50%)


Further Improvement of the Apriori Method

 Major computational challenges
 Multiple scans of the transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink the number of candidates
 Facilitate support counting of candidates
Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
 Scan 1: partition the database and find local frequent patterns
 Scan 2: consolidate global frequent patterns

DB1 + DB2 + … + DBk = DB
If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, and supk(i) < σ|DBk|, then sup(i) < σ|DB|
(an itemset that is infrequent in every partition cannot be frequent globally)
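A hedged sketch of the two-scan idea (plain Python; a brute-force local miner stands in for running Apriori per partition, and all names are illustrative):

from itertools import combinations

def local_frequent(part, min_frac):
    """Brute-force local frequent itemsets in one partition (stand-in for Apriori)."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        level = {cand for cand in combinations(items, k)
                 if sum(set(cand) <= t for t in part) >= min_frac * len(part)}
        if not level:
            break
        found |= level
    return found

def partition_mine(transactions, min_frac, n_parts=2):
    # Scan 1: mine each partition at the same relative threshold; any globally
    # frequent itemset must be locally frequent in at least one partition.
    size = -(-len(transactions) // n_parts)   # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Scan 2: count only those candidates once over the full database.
    return {c: n for c in candidates
            if (n := sum(set(c) <= t for t in transactions))
               >= min_frac * len(transactions)}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(tdb, min_frac=0.5))   # same L1, L2, L3 as the Apriori trace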
Sampling for Frequent Patterns

 Select a sample of the original database, and mine frequent patterns within the sample using Apriori
 Scan the database again to find missed frequent patterns
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates lots of candidates
 To find the frequent itemset i1i2…i100:
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 !
 Bottleneck: candidate generation and test
 Can we avoid candidate generation?


Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Huge candidate generation and test
 The FPGrowth approach
 Avoid explicit candidate generation
 Major philosophy: grow long patterns from short ones using local frequent items only
Construct FP-tree from a Transaction Database

1. Scan DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order, giving the f-list
3. Scan DB again, construct the FP-tree

F-list = f-c-a-b-m-p
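A minimal FP-tree construction sketch (plain Python; the transactions approximate the book's running example, and class/variable names are illustrative):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                      # item -> FPNode

def build_fptree(transactions, min_sup):
    """Two scans: derive the f-list, then insert each reordered transaction."""
    # Scan 1: frequent single items, frequency-descending f-list (ties broken
    # alphabetically here; the slide's f-list f-c-a-b-m-p breaks the f/c tie
    # the other way, which is equally valid).
    counts = Counter(i for t in transactions for i in t)
    flist = sorted((i for i in counts if counts[i] >= min_sup),
                   key=lambda i: (-counts[i], i))
    rank = {i: r for r, i in enumerate(flist)}
    root = FPNode(None, None)
    header = {i: [] for i in flist}             # header table: item -> node links
    # Scan 2: drop infrequent items, sort by f-list order, insert into the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                     # shared prefixes compress the DB
    return root, header, flist

tdb = [{"f","a","c","d","g","i","m","p"}, {"a","b","c","f","l","m","o"},
       {"b","f","h","j","o"}, {"b","c","k","s","p"},
       {"a","f","c","e","l","p","m","n"}]
root, header, flist = build_fptree(tdb, min_sup=3)
print(flist)   # ['c', 'f', 'a', 'b', 'm', 'p'] with this tie-break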
Example 6.3 (contd.)

[Step-by-step FP-tree construction figures from the text; not reproduced in this extract]
Find Patterns Having P From P-conditional Database

 Start at the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the pattern base
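As a sketch of what a conditional pattern base contains (plain Python; the prefix paths are computed here directly from the f-list-ordered transactions rather than by following node links, which yields the same paths; names illustrative):

from collections import Counter

tdb = [{"f","a","c","d","g","i","m","p"}, {"a","b","c","f","l","m","o"},
       {"b","f","h","j","o"}, {"b","c","k","s","p"},
       {"a","f","c","e","l","p","m","n"}]
flist = ["f", "c", "a", "b", "m", "p"]          # f-list from the slides (min_sup = 3)
rank = {i: r for r, i in enumerate(flist)}

def conditional_pattern_base(item):
    """Prefix paths that precede `item` in f-list order, with their counts."""
    base = Counter()
    for t in tdb:
        ordered = sorted(t & set(flist), key=rank.get)
        if item in ordered:
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(conditional_pattern_base("m"))
# Counter({('f','c','a'): 2, ('f','c','a','b'): 1}), m's conditional pattern base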
Benefits of the FP-tree Structure

 Completeness
 Preserves complete information for frequent pattern mining
 Compactness
 Reduces irrelevant info—infrequent items are gone
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of the entire database
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
 Mining Closed Frequent Patterns and Max-Patterns
CHARM: Mining by Exploring the Vertical Data Format
 Horizontal data format
 Transaction-id: itemset
 Vertical data format
 Item: set of transaction-ids
 Explained in the next slide (see also the sketch below)
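To illustrate the format switch and why it helps, a minimal Python sketch over the earlier toy TDB (variable names illustrative):

from collections import defaultdict

# Horizontal format: tid -> itemset
horizontal = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
              30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

# Vertical format: item -> set of tids containing it
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)
print(dict(vertical))   # e.g. 'A': {10, 30}, 'C': {10, 20, 30}, ...

# Support of an itemset = size of the intersection of its tid-sets,
# so {B, E}'s support is obtained without rescanning the database:
print(len(vertical["B"] & vertical["E"]))   # 3 -> tids {20, 30, 40}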


CHARM: Mining by Exploring the Vertical Data Format (contd.)

[Vertical-format example figure from the text; not reproduced in this extract]


Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Interestingness Measure: Correlations (Lift)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) × P(B))

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

(B = basketball, C = cereal)
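A quick check of these two numbers in Python (here P(A ∪ B) denotes, as in the book's notation, the probability that a transaction contains both A and B; the helper name is illustrative):

# Lift from the 2x2 contingency table above (n = 5000 students).
n = 5000

def lift(n_ab, n_a, n_b):
    """lift = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(2000, 3000, 3750), 2))   # 0.89 -> lift < 1: negatively correlated
print(round(lift(1000, 3000, 1250), 2))   # 1.33 -> lift > 1: positively correlated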
Are Lift and χ2 Good Measures of Correlation?

 "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
 Support and confidence are not good indicators of correlation
 Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
 Which are good ones?
Summary

 Basic concepts: association rules, support-confidence framework
 Scalable frequent pattern mining methods
 Apriori (candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
 Which patterns are interesting?
 Pattern evaluation methods
