
Mining Frequent Patterns, Associations and Correlations
(Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, Jian Pei)
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.)
that occurs frequently in a data set
 Three types
 Frequent Itemset
 Frequent Subsequence
 Frequent Substructure
 Motivation: Finding inherent recurrences in data
 What products were often purchased together?— Milk and bread?
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, DNA sequence analysis etc.
Why Is Freq. Pattern Mining Important?

 Discloses an intrinsic and important property of data sets


 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
Example: Market Basket Analysis
 This process analyses customer buying habits by finding
associations between the different items that customers place
in their shopping baskets.
 The discovery of these associations can help retailers develop
marketing strategies by gaining insight into frequently
purchased items.
 For example, one can place frequently bought items in close
proximity, or sale offers can be declared to encourage the
combined purchase of those items.
Association Rule Mining
 Association rule mining is a procedure which aims to observe
frequently occurring patterns, correlations, or associations
from datasets found in various kinds of databases such as
relational databases, transactional databases, and other forms
of repositories.
 Association rules are simple If/Then statements that
help discover relationships between seemingly independent
variables/facts/items etc.
 For example, A ⇒ B means if A happens then B happens, so A and
B are related in some way.
 A is called the antecedent and B is called the consequent.
Support and Confidence
 These are objective measures of pattern interestingness.
 Support: Support indicates how frequently the if/then
relationship appears in the database or represents the
percentage of transactions from a transaction database that a
given rule satisfies
 Confidence: Confidence tells about the number of times these
relationships have been found to be true. It assesses degree of
certainty of the detected association.
 For example, for an association rule like A ⇒ B in a market basket
analysis, a support of 2% means that 2% of all the transactions
under analysis show that A and B were purchased together. A
confidence of 60% indicates that 60% of the customers who
purchased A also purchased B.
Minimum Support and Minimum Confidence
 Association rules can be treated as interesting if they satisfy a
minimum support and a minimum confidence.
 These are the threshold values of Support and Confidence and
are set by the user or a domain expert.
 The rules which satisfy both minimum support and minimum
confidence are called strong rules. Others are called weak rules.
Frequent Itemset
 A set of items X is referred to as an itemset.
 An itemset that contains k items is called k-itemset. For
example, the set {bread,milk} is a 2-itemset.
 The occurrence frequency of an itemset is the number of
transactions that contain the itemset. This is called as
frequency/support count/count or absolute support.
 If the absolute support of X satisfies the corresponding
minimum support, then X is called a frequent itemset.
 The set of frequent k-itemsets is commonly denoted as Lk.
Relation between Relative and Absolute Support and Confidence
 relative support(X) = absolute support (count) of X / |D|, where |D| is the
total number of transactions in the database
 support(A ⇒ B) = P(A ∪ B) = support_count(A ∪ B) / |D|
 confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id   Items bought
10               A, B, D
20               A, C, D
30               A, D, E
40               B, E, F
50               B, C, D, E, F

 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y
 Let supmin = 50%, confmin = 50%
 Freq. Pat.: {A:3, B:3, D:4, E:3, AD:3}
 Association rules:
 A ⇒ D (60%, 100%)
 D ⇒ A (60%, 75%)
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns,
e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100)
= 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-
pattern Y ⊃ X with the same support as X
 An itemset X is a max-pattern if X is frequent and there exists
no frequent super-pattern Y ⊃ X (proposed by Bayardo @
SIGMOD’98)
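
A small illustrative sketch, reusing the frequent patterns of the earlier example (A:3, B:3, D:4, E:3, AD:3); the helper names are assumptions, not from the slides. Given a support table, it keeps only the closed and the maximal patterns as defined above.

frequent = {  # frequent itemset -> support count, from the earlier example
    frozenset("A"): 3, frozenset("B"): 3, frozenset("D"): 4,
    frozenset("E"): 3, frozenset("AD"): 3,
}

def is_closed(x, freq):
    # X is closed if no frequent proper superset has the same support as X.
    return not any(x < y and freq[y] == freq[x] for y in freq)

def is_max(x, freq):
    # X is a max-pattern if no frequent proper superset exists at all.
    return not any(x < y for y in freq)

print([set(x) for x in frequent if is_closed(x, frequent)])
# {'B'}, {'D'}, {'E'}, {'A','D'} -- {A} is not closed: {A,D} has the same support
print([set(x) for x in frequent if is_max(x, frequent)])
# {'B'}, {'E'}, {'A','D'} -- {D} is closed but not maximal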
Scalable Methods for Mining Frequent Patterns
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant @ VLDB’94)
 Frequent pattern growth (FP-growth, by Han, Pei & Yin @ SIGMOD’00)
 Vertical data format approach (CHARM, by Zaki & Hsiao @ SDM’02)
Apriori Algorithm
 Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994
for mining frequent itemsets for Boolean association rules.
 The name of the algorithm is based on the fact that the algorithm
uses prior knowledge of frequent itemset properties.
 Apriori employs an iterative approach known as a level-wise/breadth
first search, where k-itemsets are used to explore (k+1)-itemsets.
 This is called generate-and-test strategy.
 First, the set of frequent 1-itemsets is found by scanning the
database to accumulate the count for each item, and collecting those
items that satisfy minimum support. The resulting set is denoted by
L1.
 Next, L1 is used to find L2, the set of frequent 2-itemsets, which is
used to find L3, and so on, until no more frequent k-itemsets can be
found.
 The finding of each Lk requires one full scan of the database.
Apriori Property

 Apriori Property: All nonempty subsets of a frequent itemset
must also be frequent.
 This property is based on antimonotonicity: if a set cannot pass a
test, all of its supersets will fail the same test as well.
 If there is any itemset which is infrequent, its superset should not
be generated/tested!
 Example: If {bread, milk, nuts} is frequent, so is {bread, milk},
i.e., every transaction having {bread, milk, nuts} also contains
{bread, milk}.
Apriori Algorithm: Method
 Initially, scan DB once to get frequent 1-itemset
 Join Step: Generate length-(k+1) candidate itemsets from
length-k frequent itemsets by joining Lk with itself.
 The resulting candidate set is denoted Ck+1.
 Ck+1 is a superset of Lk+1.
 Prune Step: To reduce the size of Ck+1, the Apriori property is
used: any k-itemset that is not frequent cannot be a subset of a
frequent (k+1)-itemset, so such candidates are removed.
 Then check the support of the remaining candidates against the
database and delete all itemsets which are below min_support.
This gives Lk+1.
 Terminate when no frequent or candidate set can be
generated; otherwise repeat from the Join step.
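
A minimal sketch of the Join and Prune steps just described; the helper name apriori_gen and the use of frozensets are assumptions, not the textbook's code.

from itertools import combinations

def apriori_gen(L_k, k):
    """Join L_k with itself, then prune candidates having an infrequent k-subset."""
    # Join step: unions of two frequent k-itemsets that form a (k+1)-itemset
    candidates = {a | b for a in L_k for b in L_k if len(a | b) == k + 1}
    # Prune step (Apriori property): every k-subset of a candidate must be in L_k
    return {c for c in candidates
            if all(frozenset(s) in L_k for s in combinations(c, k))}

L2 = {frozenset(p) for p in [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]}
print(apriori_gen(L2, 2))  # {frozenset({'B', 'C', 'E'})}, i.e. C3 in the example that follows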
Flow Chart
The Apriori Algorithm—An Example

Database TDB (min_sup = 2):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan — C1:  {A}:2  {B}:3  {C}:3  {D}:1  {E}:3
L1:             {A}:2  {B}:3  {C}:3  {E}:3

C2 (generated from L1): {A,B} {A,C} {A,E} {B,C} {B,E} {C,E}
2nd scan — C2:  {A,B}:1  {A,C}:2  {A,E}:1  {B,C}:2  {B,E}:3  {C,E}:2
L2:             {A,C}:2  {B,C}:2  {B,E}:3  {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan — L3:  {B,C,E}:2
Problem
 Find all the frequent itemsets using the Apriori algorithm.
Generating Association Rules
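
The body of this slide is not in the extract. One standard way to generate the rules: for each frequent itemset l and each nonempty proper subset s of l, output s ⇒ (l − s) whenever its confidence sup(l)/sup(s) meets min_conf. A minimal sketch under that formulation (function and variable names are assumptions):

from itertools import combinations

def generate_rules(frequent_supports, min_conf):
    """frequent_supports: dict mapping frozenset -> support count."""
    rules = []
    for l, sup_l in frequent_supports.items():
        for r in range(1, len(l)):                      # nonempty proper subsets
            for s in map(frozenset, combinations(l, r)):
                conf = sup_l / frequent_supports[s]     # conf(s => l - s) = sup(l) / sup(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

supports = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
print(generate_rules(supports, min_conf=0.7))
# [({'A'}, {'D'}, 1.0), ({'D'}, {'A'}, 0.75)]  (order may vary)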
The Apriori Algorithm
 Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Challenges of Frequent Pattern Mining

 Challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Improving the Efficiency of Apriori Algorithm

 1. Hash-Based Technique: it is used to reduce the size of the
candidate k-itemsets, Ck, for k > 1 (see the sketch after this list).
 2. Transaction Reduction:
 A transaction that does not contain any frequent k-itemsets

cannot contain any frequent (k+1)-itemsets.


 Such a transaction can be marked or removed from future

considerations
 3. Partitioning:
 It consists of two phases:

 Phase I: divide the transaction of D into n non-overlapping

partitions. The minimum support of a partition of D is


calculated as min_sup * no of transactions in the partition.
 For each partition, all local frequent itemsets are found.

 A local frequent itemset may or may not be frequent with respect
to D. However, any itemset that is frequent in D must occur in at
least one of the partitions. Therefore, all local frequent itemsets
can be treated as candidate itemsets, and the collection of all
local frequent itemsets is called the global candidate itemsets.
 Phase II: a second scan of D is conducted in which actual
support of each candidate is assessed
 4. Sampling:
 Pick a random sample S of given dataset D, then find

frequent itemsets in S rather than D.


 This technique trades accuracy for efficiency: because we
are searching for frequent itemsets in S rather than D, it is
possible that we will miss some global frequent itemsets.
 5. Dynamic Itemset Counting:
 The dataset D is partitioned into blocks marked by start

points.
 Unlike in Apriori, which determines new candidate itemsets

only immediately before each complete dataset scan,


dynamic itemset counting adds new itemsets at start points.
 This technique uses a count-so-far value as the lower bound
of min_sup. If the count-so-far passes min_sup, the itemset is
immediately added to the frequent itemset collection.
 This leads to fewer dataset scans.
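
A sketch of the hash-based technique (item 1 in this list): while scanning the transactions, every 2-itemset is hashed into a bucket table; a bucket whose total count stays below min_sup cannot contain any frequent 2-itemset, so candidate pairs hashing there are dropped from C2 (survivors may still be infrequent, because of collisions, and are verified by the normal count). The bucket count of 7 and the hash function are illustrative assumptions.

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup, n_buckets = 2, 7

bucket_counts = [0] * n_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):   # canonical (sorted) 2-itemsets
        bucket_counts[hash(pair) % n_buckets] += 1

def survives_hash_filter(pair):
    """A pair can only be frequent if its bucket count reached min_sup."""
    return bucket_counts[hash(tuple(sorted(pair))) % n_buckets] >= min_sup

l1_items = ["A", "B", "C", "E"]               # frequent 1-itemsets of this toy database
c2 = [p for p in combinations(l1_items, 2) if survives_hash_filter(p)]
print(c2)                                     # pairs kept for the counting pass; the rest never enter C2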
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly


 Mining long patterns needs many passes of scanning and
generates lots of candidates
 To find the frequent itemset i1 i2 … i100
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1
≈ 1.27 × 10^30 !
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?
FP Growth-Frequent Pattern Growth
 The Apriori approach generates a huge number of candidate itemsets.
 Can we find the frequent itemsets without generating such a huge
number of candidates?
 FP-Growth
 It adopts divide and conquer strategy

 First it compresses the database representing frequent

items into a FP-Tree which retains the itemset association


information.
 It then divides the database into conditional databases (A

special kind of projected database) and mines each


database separately
Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                 (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200   {a, b, c, f, l, m, o}        {f, c, a, b, m}
300   {b, f, h, j, o, w}           {f, b}
400   {b, c, k, s, p}              {c, b, p}
500   {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns).
2. Sort frequent items in frequency-descending order, giving the f-list.
3. Scan DB again, construct the FP-tree.

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p

Resulting FP-tree:
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
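
A compact sketch of steps 1–3 above; the class and function names are assumptions, not the book's code. It counts the items, orders each transaction by the f-list, and inserts it into a prefix tree whose header table keeps node-links to every node of the same item.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fptree(transactions, min_support):
    counts = Counter(i for t in transactions for i in t)               # step 1: first DB scan
    flist = [i for i, c in counts.most_common() if c >= min_support]   # step 2: f-list
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:                                             # step 3: second DB scan
        node = root
        for item in [i for i in flist if i in t]:                      # frequency-descending order
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])               # node-link in header table
            node = node.children[item]
            node.count += 1
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, flist = build_fptree(db, min_support=3)
print(flist)  # same items as the slide's f-list; equally frequent items may be ordered differently
print([(n.item, n.count) for n in header["p"]])  # [('p', 2), ('p', 1)] -- the two p-branches of the tree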
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent pattern mining

 Never break a long pattern of any transaction

 Compactness
 Reduce irrelevant info—infrequent items are gone

 Items in frequency descending order: the more frequently

occurring, the more likely to be shared


 Never larger than the original database (not counting
node-links and the count fields)
links and the count field)


 For Connect-4 DB, compression ratio could be over 100
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets according to


f-list
 F-list=f-c-a-b-m-p

 Patterns containing p

 Patterns having m but no p

 …

 Patterns having c but no a nor b, m, p

 Pattern f

 Completeness and non-redundancy


Find Patterns Having P From P-conditional Database

 Starting at the frequent item header table in the FP-tree


 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base

Conditional pattern bases (read off the FP-tree above via the header table):

item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
Problem
 Find all the frequent
itemsets using FP-
Growth algorithm
Mining Frequent Itemsets Using the Vertical
Data Format
 Both the Apriori and FP-growth methods mine frequent
patterns from a set of transactions in TID–itemset format (i.e.,
{TID : itemset}), which is known as the horizontal data format.
 Alternatively, data can be presented in item-TID set format (i.e.,
{item : TID_set}), where item is an item name, and TID_set is
the set of transaction identifiers containing the item. This is
known as the vertical data format.
Steps in mining frequent patterns in Vertical
Data Format
 1. Transform the horizontally formatted data into vertical data
format. (One database scan needed)
 2. Calculate the support count of 1-itemset. It is nothing but
the length of the TID_set.
 3. Starting with k = 1, the frequent (k+1)-itemsets are constructed
from the frequent k-itemsets based on the Apriori property. This is
done by intersecting the TID_sets of frequent k-itemsets to compute
the TID_sets of the (k+1)-itemsets.
 4. The process repeats for k=k+1, until no frequent itemsets or
candidate itemsets can be found.
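
A minimal sketch of steps 1–4 in the Eclat style described here (names are assumptions): build the vertical {item : TID_set} layout in one scan, then intersect TID_sets level by level; the support of an itemset is just the length of its TID_set, so no further database scans are needed.

from collections import defaultdict

transactions = {10: {"A", "C", "D"}, 20: {"B", "C", "E"},
                30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
min_sup = 2

tidsets = defaultdict(set)                    # step 1: horizontal -> vertical format
for tid, items in transactions.items():
    for item in items:
        tidsets[frozenset([item])].add(tid)

L = {i: t for i, t in tidsets.items() if len(t) >= min_sup}   # step 2: support = |TID_set|

frequent = dict(L)
while L:                                      # step 3: intersect TID_sets of frequent k-itemsets
    next_L = {}
    items = list(L.items())
    for i, (x, tx) in enumerate(items):
        for y, ty in items[i + 1:]:
            union, tids = x | y, tx & ty
            if len(union) == len(x) + 1 and len(tids) >= min_sup:
                next_L[union] = tids
    frequent.update(next_L)
    L = next_L                                # step 4: repeat until nothing new is found

for itemset, tids in frequent.items():
    print(sorted(itemset), sorted(tids))      # e.g. ['B', 'C', 'E'] [20, 30]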
Analysis
 1. No need to scan the database to compute the support of (k+1)-
itemsets; it is simply the length of the TID_set.
 2. However, if the TID_sets become quite long, they take up
substantial memory space, and computing their intersections
takes substantial time.
Strong Rules Are Not Necessarily Interesting
 The confidence of a rule A ⇒ B can be deceiving: it does not
measure the real strength (or lack of strength) of the correlation
and implication between A and B. Hence, alternatives to the
support–confidence framework can be useful in mining
interesting data relationships.
 To tackle this weakness, a correlation measure can be used to
augment the support–confidence framework for association
rules. This leads to correlation rules of the form
A ⇒ B [support, confidence, correlation].
From Association Analysis to Correlation
Analysis
 A correlation rule is measured not only by its support and
confidence but also by the correlation between itemsets A and
B.
 There are many different correlation measures from which to
choose.
 Lift is a simple correlation measure, given as follows. The
occurrence of itemset A is independent of the occurrence of
itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are
dependent and correlated as events.
 This definition can easily be extended to more than two
itemsets. The lift between the occurrence of A and B can be
measured by computing
lift(A, B) = P(A ∪ B) / (P(A) P(B))

 If the resulting lift value is less than 1, then the occurrence of A is
negatively correlated with the occurrence of B, meaning that the
occurrence of one likely leads to the absence of the other one. If the
resulting value is greater than 1, then A and B are positively correlated,
meaning that the occurrence of one implies the occurrence of the other. If
the resulting value is equal to 1, then A and B are independent and there is
no correlation between them.
 The lift measure above is equivalent to P(B|A)/P(B), or conf(A ⇒ B)/sup(B),
which is also referred to as the lift of the association (or correlation) rule
A ⇒ B.
Correlation analysis using lift
MaxMiner: Mining Max-patterns

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F

 1st scan: find frequent items: A, B, C, D, E
 2nd scan: find support for
 AB, AC, AD, AE, ABCDE
 BC, BD, BE, BCDE
 CD, CE, CDE
 DE
(ABCDE, BCDE, CDE and DE are the potential max-patterns)
 Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a
later scan
 R. Bayardo. Efficiently mining long patterns from databases. In
SIGMOD’98
Mining Frequent Closed Patterns: CLOSET

Min_sup = 2

TID   Items
10    a, c, d, e, f
20    a, b, e
30    c, e, f
40    a, c, d, f
50    c, e, f

 F-list: list of all frequent items in support-ascending order
 F-list = d-a-f-e-c
 Divide the search space:
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed patterns recursively
 Every transaction having d also has c, f, a ⇒ cfad is a frequent
closed pattern
 J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining
Frequent Closed Itemsets. DMKD’00
