Chapter 6 (Frequent Patterns)

Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
 Motivation: finding inherent regularities in data
 What products are often purchased together?—Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Freq. Pattern Mining Important?
 Freq. pattern: an intrinsic and important property of datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
Basic Concepts: Frequent Patterns

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 itemset: a set of one or more items
 k-itemset X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
 (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a minsup threshold

[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap of customers who buy both]
Basic Concepts: Association Rules

Tid | Items bought
10  | Butter, Nuts, Diaper
20  | Butter, Coffee, Diaper
30  | Butter, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
 Freq. Pat.: Butter:3, Nuts:3, Diaper:4, Eggs:3, {Butter, Diaper}:3
 Association rules (many more exist):
 Butter ⇒ Diaper (support 60%, confidence 100%)
 Diaper ⇒ Butter (support 60%, confidence 75%)
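As a quick numerical check of these definitions, here is a minimal Python sketch (the transactions mirror the toy table above; the support/confidence helper names are illustrative, not from the text):

# Toy transaction database from the slide above.
transactions = [
    {"Butter", "Nuts", "Diaper"},
    {"Butter", "Coffee", "Diaper"},
    {"Butter", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Relative support: fraction of transactions containing every item."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of lhs => rhs: support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Butter", "Diaper"}))        # 0.6  -> 60% support
print(confidence({"Butter"}, {"Diaper"}))   # 1.0  -> 100% confidence
print(confidence({"Diaper"}, {"Butter"}))   # 0.75 -> 75% confidence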
Interesting association rules

 confidence(A ⇒ B) = P(B|A) = P(A ∪ B) / P(A), i.e., the support of A ∪ B divided by the support of A


Association rule mining
 Given a transaction database and minsup and minconf thresholds, compute all association rules that satisfy the minsup and minconf requirements
 Steps
 Find all frequent itemsets
 Generate association rules from the frequent itemsets and keep those that satisfy minimum confidence


Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and Scalable Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: three major approaches
 Apriori
 Freq. pattern growth
 Vertical data format approach
Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
 Apriori method:
 Initially, scan the DB once to get the frequent 1-itemsets
 Generate length-(k+1) candidate itemsets from length-k frequent itemsets
 Test the candidates against the DB
 Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example: Generating All Frequent Itemsets (min_sup = 2)

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan gives C1:             L1 (prune sup < 2):
Itemset | sup                  Itemset | sup
{A}     | 2                    {A}     | 2
{B}     | 3                    {B}     | 3
{C}     | 3                    {C}     | 3
{D}     | 1                    {E}     | 3
{E}     | 3

C2 from L1, 2nd scan:          L2:
Itemset | sup                  Itemset | sup
{A, B}  | 1                    {A, C}  | 2
{A, C}  | 2                    {B, C}  | 2
{A, E}  | 1                    {B, E}  | 3
{B, C}  | 2                    {C, E}  | 2
{B, E}  | 3
{C, E}  | 2

C3 from L2, 3rd scan:          L3:
Itemset                        Itemset   | sup
{B, C, E}                      {B, C, E} | 2

C4 = { }, so the algorithm terminates.
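As a sanity check on this trace, a brute-force sketch in Python (no Apriori pruning; names illustrative) can enumerate every candidate itemset over TDB and keep those with support count of at least 2:

from itertools import combinations

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2
items = sorted(set().union(*tdb))

# Enumerate every k-itemset and keep those meeting min_sup. This is the
# exponential search that Apriori's pruning avoids; it is fine for 5 items.
for k in range(1, len(items) + 1):
    level = {cand: count
             for cand in combinations(items, k)
             if (count := sum(set(cand) <= t for t in tdb)) >= min_sup}
    if not level:
        break
    print(f"L{k}: {level}")
# L1: A:2, B:3, C:3, E:3 | L2: AC:2, BC:2, BE:3, CE:2 | L3: BCE:2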
What are the association rules for the previous frequent itemset?
 Steps
 Find all non-empty proper subsets
 Generate the rules and find the confidence of each rule
 Select all rules that satisfy minconf
 These rules are the strong association rules

 Example: for the itemset {B, C, E}
 Subsets: {B,C}, {B,E}, {C,E}, {B}, {C}, {E}
 There will be six rules, e.g., {B,C} ⇒ {E}


Contd.

Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

 {B,C} ⇒ {E}: conf = sup{B,C,E} / sup{B,C} = 2/2 = 100%
 Similarly, find the confidence for the other rules
 Select those rules that satisfy minconf
 Suppose minconf = 60%. What are the strong rules that you can select?
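As a sketch of this rule-generation step (plain Python; helper names are illustrative, not from the text), the following enumerates all six rules from {B, C, E} over TDB and flags the ones meeting minconf = 60%:

from itertools import combinations

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def sup_count(itemset):
    """Absolute support count of an itemset over tdb."""
    return sum(itemset <= t for t in tdb)

freq = frozenset({"B", "C", "E"})
minconf = 0.60

# Every non-empty proper subset X yields a candidate rule X => freq - X.
for r in range(1, len(freq)):
    for lhs in combinations(sorted(freq), r):
        lhs = frozenset(lhs)
        conf = sup_count(freq) / sup_count(lhs)
        verdict = "strong" if conf >= minconf else "weak"
        print(f"{set(lhs)} => {set(freq - lhs)}: conf = {conf:.0%} ({verdict})")
# All six rules have confidence 67% or 100%, so all are strong at minconf = 60%.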


Implementation of Apriori
 How to generate candidate set Ck from Lk-1? (See the sketch below.)
 Example: L3 = {abc, abd, acd, ace, bcd}
 Self-joining: L3 * L3 (the first k−2 items must be common; here, the first two)
 abcd from abc and abd
 acde from acd and ace
 Pruning: if any (k−1)-subset of a candidate is infrequent, the candidate itself cannot be frequent and is removed
 acde is removed because ade is not in L3
 C4 = {abcd}
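A minimal sketch of this join-and-prune step (plain Python; itemsets kept as sorted tuples, function name illustrative):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate Ck from L(k-1): self-join on the first k-2 items, then prune."""
    L_prev = set(L_prev)
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # Join: first k-2 items equal, last item of a precedes last of b.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                # Prune: every (k-1)-subset must already be frequent.
                if all(sub in L_prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(apriori_gen(L3, 4))   # {('a','b','c','d')}; acde is pruned (ade not in L3)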
Example 6.3 (MinSupport = 2)

[Worked Apriori example figure from the text; not reproduced in this extract]


C4 = {} and the algorithm terminates; the frequent itemsets are the union L1 ∪ L2 ∪ L3.
Calculation of candidate 3-itemsets



Rules for Table 6.1



The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
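Putting the pseudo-code together with the join-and-prune step sketched earlier, a compact runnable version might look like this (plain Python; all function and variable names are illustrative, not from the text):

from collections import Counter
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets from a single scan.
    counts = Counter(item for t in transactions for item in t)
    Lk = {(i,): c for i, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        k += 1
        # Join L(k-1) with itself, then prune by downward closure.
        prev = set(Lk)
        Ck = set()
        for a in prev:
            for b in prev:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cand = a + (b[-1],)
                    if all(s in prev for s in combinations(cand, k - 1)):
                        Ck.add(cand)
        # One scan: count each surviving candidate in every transaction.
        Lk = {}
        for cand in Ck:
            c = sum(frozenset(cand) <= t for t in transactions)
            if c >= min_sup:
                Lk[cand] = c
        frequent.update(Lk)
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))
# Matches the worked trace: L1; L2 = {AC, BC, BE, CE}; L3 = {BCE}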
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
 Mining Closed Frequent Patterns and Max-Patterns
Exercise
 Find all frequent itemsets using the Apriori algorithm and generate all association rules (assume minsup = 20%, minconf = 50%)


Further Improvement of the Apriori Method

 Major computational challenges
 Multiple scans of the transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink the number of candidates
 Facilitate support counting of candidates
Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
 Scan 1: partition the database and find local frequent patterns
 Scan 2: consolidate global frequent patterns

DB1 + DB2 + … + DBk = DB
If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, and supk(i) < σ|DBk|, then sup(i) < σ|DB|
(an itemset that is infrequent in every partition cannot be frequent globally)
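A hedged sketch of the two-scan idea (plain Python; a brute-force local miner stands in for running Apriori per partition, and all names are illustrative):

from itertools import combinations

def local_frequent(part, min_frac):
    """Brute-force local frequent itemsets in one partition (stand-in for Apriori)."""
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        level = {cand for cand in combinations(items, k)
                 if sum(set(cand) <= t for t in part) >= min_frac * len(part)}
        if not level:
            break
        found |= level
    return found

def partition_mine(transactions, min_frac, n_parts=2):
    # Scan 1: mine each partition at the same relative threshold; any globally
    # frequent itemset must be locally frequent in at least one partition.
    size = -(-len(transactions) // n_parts)   # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    candidates = set().union(*(local_frequent(p, min_frac) for p in parts))
    # Scan 2: count only those candidates once over the full database.
    return {c: n for c in candidates
            if (n := sum(set(c) <= t for t in transactions))
               >= min_frac * len(transactions)}

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partition_mine(tdb, min_frac=0.5))   # same L1, L2, L3 as the Apriori trace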
Sampling for Frequent Patterns

 Select a sample of the original database, and mine frequent patterns within the sample using Apriori
 Scan the database again to find missed frequent patterns
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates lots of candidates
 To find the frequent itemset i1i2…i100:
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 !
 Bottleneck: candidate generation and test
 Can we avoid candidate generation?


Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Huge candidate generation and test
 The FPGrowth approach
 Avoid explicit candidate generation
 Major philosophy: grow long patterns from short ones using local frequent items only
Construct FP-tree from a Transaction Database

1. Scan DB once, find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order, giving the f-list
3. Scan DB again, construct the FP-tree

F-list = f-c-a-b-m-p
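A minimal FP-tree construction sketch (plain Python; the transactions approximate the book's running example, and class/variable names are illustrative):

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}                      # item -> FPNode

def build_fptree(transactions, min_sup):
    """Two scans: derive the f-list, then insert each reordered transaction."""
    # Scan 1: frequent single items, frequency-descending f-list (ties broken
    # alphabetically here; the slide's f-list f-c-a-b-m-p breaks the f/c tie
    # the other way, which is equally valid).
    counts = Counter(i for t in transactions for i in t)
    flist = sorted((i for i in counts if counts[i] >= min_sup),
                   key=lambda i: (-counts[i], i))
    rank = {i: r for r, i in enumerate(flist)}
    root = FPNode(None, None)
    header = {i: [] for i in flist}             # header table: item -> node links
    # Scan 2: drop infrequent items, sort by f-list order, insert into the tree.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                     # shared prefixes compress the DB
    return root, header, flist

tdb = [{"f","a","c","d","g","i","m","p"}, {"a","b","c","f","l","m","o"},
       {"b","f","h","j","o"}, {"b","c","k","s","p"},
       {"a","f","c","e","l","p","m","n"}]
root, header, flist = build_fptree(tdb, min_sup=3)
print(flist)   # ['c', 'f', 'a', 'b', 'm', 'p'] with this tie-break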
Example 6.3 (contd.)

[Step-by-step FP-tree construction figures from the text; not reproduced in this extract]
Find Patterns Having P From P-conditional Database

 Start at the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the pattern base
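As a sketch of what a conditional pattern base contains (plain Python; the prefix paths are computed here directly from the f-list-ordered transactions rather than by following node links, which yields the same paths; names illustrative):

from collections import Counter

tdb = [{"f","a","c","d","g","i","m","p"}, {"a","b","c","f","l","m","o"},
       {"b","f","h","j","o"}, {"b","c","k","s","p"},
       {"a","f","c","e","l","p","m","n"}]
flist = ["f", "c", "a", "b", "m", "p"]          # f-list from the slides (min_sup = 3)
rank = {i: r for r, i in enumerate(flist)}

def conditional_pattern_base(item):
    """Prefix paths that precede `item` in f-list order, with their counts."""
    base = Counter()
    for t in tdb:
        ordered = sorted(t & set(flist), key=rank.get)
        if item in ordered:
            prefix = tuple(ordered[:ordered.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(conditional_pattern_base("m"))
# Counter({('f','c','a'): 2, ('f','c','a','b'): 1}), m's conditional pattern base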
Benefits of the FP-tree Structure

 Completeness
 Preserves complete information for frequent pattern mining
 Compactness
 Reduces irrelevant info—infrequent items are gone
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of the entire database
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
 Mining Closed Frequent Patterns and Max-Patterns
CHARM: Mining by Exploring the Vertical Data Format
 Horizontal data format
 Transaction-id: itemset
 Vertical data format
 Item: set of transaction-ids
 Explained in the next slide (see also the sketch below)
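To illustrate the format switch and why it helps, a minimal Python sketch over the earlier toy TDB (variable names illustrative):

from collections import defaultdict

# Horizontal format: tid -> itemset
horizontal = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
              30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

# Vertical format: item -> set of tids containing it
vertical = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].add(tid)
print(dict(vertical))   # e.g. 'A': {10, 30}, 'C': {10, 20, 30}, ...

# Support of an itemset = size of the intersection of its tid-sets,
# so {B, E}'s support is obtained without rescanning the database:
print(len(vertical["B"] & vertical["E"]))   # 3 -> tids {20, 30, 40}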


CHARM: Mining by Exploring the Vertical Data Format (contd.)

[Vertical-format example figure from the text; not reproduced in this extract]


Chapter 6: Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Interestingness Measure: Correlations (Lift)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) × P(B))

           | Basketball | Not basketball | Sum (row)
Cereal     | 2000       | 1750           | 3750
Not cereal | 1000       | 250            | 1250
Sum (col.) | 3000       | 2000           | 5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

(B = basketball, C = cereal)
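A quick check of these two numbers in Python (here P(A ∪ B) denotes, as in the book's notation, the probability that a transaction contains both A and B; the helper name is illustrative):

# Lift from the 2x2 contingency table above (n = 5000 students).
n = 5000

def lift(n_ab, n_a, n_b):
    """lift = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(2000, 3000, 3750), 2))   # 0.89 -> lift < 1: negatively correlated
print(round(lift(1000, 3000, 1250), 2))   # 1.33 -> lift > 1: positively correlated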
Are Lift and χ2 Good Measures of Correlation?

 "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
 Support and confidence are not good indicators of correlation
 Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
 Which are good ones?
Summary

 Basic concepts: association rules, support-confidence framework
 Scalable frequent pattern mining methods
 Apriori (candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
 Which patterns are interesting?
 Pattern evaluation methods
