
Chapter 3

Mining Frequent Patterns, Associations and Correlations
Contents
• What kind of patterns can be mined
  – Class/Concept Description: Characterization and Discrimination; Mining
    Frequent Patterns, Associations, and Correlations; Classification and
    Regression for Predictive Analysis; Cluster Analysis; Outlier Analysis
• Mining frequent patterns - Market Basket Analysis
• Frequent Itemsets, Closed Itemsets, and Association Rules
• Frequent Itemset Mining Methods
• Apriori Algorithm
• Generating Association Rules from Frequent Itemsets
• Improving the efficiency of the Apriori algorithm
• Frequent pattern growth (FP-growth) algorithm

What kind of patterns can be mined?
• What is Data mining?
  – Knowledge discovery from data
  – Extraction of interesting patterns (non-trivial, previously unknown,
    potentially useful) or knowledge from huge data sets
  – Also known as KDD (Knowledge Discovery from Data)
• Knowledge discovery process steps: data cleaning, data integration, data
  selection, data transformation, data mining, pattern evaluation, and
  knowledge presentation
• Types of data that can be mined:
  – Data mining can be applied to any kind of data as long as the data is
    appropriate for a target application
  – Basic forms of data for mining:
    • Database data
    • Data warehouse data
    • Transactional data
    • Advanced data sets
What kinds of patterns can be mined?
• The purpose of data mining is to uncover previously unknown, useful
  patterns from huge data sets
• Data mining functionalities can be classified into two categories:
  – Descriptive tasks characterize properties of the data in the target
    data set. E.g., counts, averages, etc.
  – Predictive tasks perform induction on the current data in order to
    make predictions. E.g., predict the number of students joining
    B.Sc(CS) after 5 years
• Mining frequent patterns, Associations &
Correlations:
– Consider the following situation
• Imagine that you are a sales manager at AllElectronics,
and you are talking to a customer who recently
bought a PC and a digital camera from the store. What
should you recommend to her next? Information about
which products are frequently purchased by your
customers following their purchases of a PC and a
digital camera in sequence would be very helpful in
making your recommendation.
• Frequent patterns and association rules are the
knowledge that you want to mine in such a scenario.
• Eg : Amazon shopping site Recommendations for
other products, based on the product you select
• Frequent patterns are patterns (e.g., itemsets,
subsequences, or substructures) that appear
frequently in a data set.
– For example, a set of items, such as milk and bread, that
appear frequently together in a transaction data set is a
frequent itemset.
• A frequent subsequence is a sequence that appears frequently in a
  database
  – E.g., buying first a PC, then a digital camera, and then a memory
    card, if it occurs frequently in a shopping history database, is a
    (frequent) sequential pattern.
• A frequent substructure refers to different structural forms, such as
  subgraphs, subtrees, or sublattices, which may be combined with itemsets
  or subsequences. If a substructure occurs frequently, it is called a
  (frequent) structured pattern.
• Finding frequent patterns plays an essential
role in mining associations, correlations, and
many other interesting relationships among
data.
• Moreover, it helps in data classification,
clustering, and other data mining tasks.
• Thus, frequent pattern mining has become an
important data mining task and a focused
theme in data mining research.
Basic Concepts
• Frequent pattern mining searches for recurring relationships in a given
  data set. Motivation: market basket analysis.
• Frequent itemset mining leads to the discovery of associations and
correlations among items in large transactional or relational data sets.
• With massive amounts of data continuously being collected and stored,
many industries are becoming interested in mining such patterns from their
databases.
• The discovery of interesting correlation relationships among huge amounts
of business transaction records can help in many business decision-making
processes such as catalog design, cross-marketing, and customer shopping
behavior analysis.
• A typical example of frequent itemset mining is market basket analysis.
• This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets”.
• A supermarket has 200,000 customer transactions. About 4,000
  transactions, or about 2% of the total, include the purchase of diapers.
  About 5,500 transactions (2.75%) include the purchase of beer. About
  3,500 transactions (1.75%) include both diapers and beer. If the two
  purchases were independent, only about 2% x 2.75% = 0.055% of
  transactions (roughly 110) would be expected to contain both. Moreover,
  about 87.5% of diaper purchases include the purchase of beer, indicating
  a strong link between diapers and beer.
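As a quick sanity check, the arithmetic behind this example can be reproduced in a few lines of Python (a sketch using only the counts quoted above):

```python
# Reproducing the diaper/beer numbers above (illustrative counts, not real data).
total = 200_000
diapers = 4_000        # ~2% of transactions
beer = 5_500           # ~2.75% of transactions
both = 3_500           # ~1.75% of transactions

# If the two purchases were independent, roughly
# P(diapers) * P(beer) * total transactions would contain both.
expected_both = (diapers / total) * (beer / total) * total
print(f"expected if independent: {expected_both:.0f}")   # ~110

confidence = both / diapers
print(f"confidence(diapers => beer): {confidence:.1%}")  # 87.5%
```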
• History
• While the concepts behind association rules can be
traced back earlier, association rule mining was
defined in the 1990s, when computer scientists
Rakesh Agrawal, Tomasz Imieliński and Arun Swami
developed an algorithm-based way to find
relationships between items using point-of-sale (POS)
systems.
• Applying the algorithms to supermarkets, the
scientists were able to discover links between
different items purchased, called association rules,
and ultimately use that information to predict the
likelihood of different products being purchased
together.
• For retailers, association rule mining offered a way to
better understand customer purchase behaviors.
Because of its retail origins, association rule mining is
often referred to as market basket analysis.
Market Basket Analysis
• It analyzes customer buying habits by finding associations between the
  different items that customers place in their shopping baskets
• It helps retailers in
  – developing marketing strategies
  – advertising strategies
  – planning their shelf space
  – preparing store layouts (proximity)
  – planning sales of non-moving items
  – planning discounts, offers, etc.

• The discovery of these associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased
together by customers.
  – For instance, if customers are buying milk, how likely are they to
    also buy bread (and what kind of bread) on the same trip?
• Market-Basket Analysis:
– Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers. Specifically, you
wonder, “Which groups or sets of items are customers likely to
purchase on a given trip to the store?”
– To answer your question, market basket analysis may be performed on
the retail data of customer transactions at your store.

– You can then use the results to plan marketing or advertising strategies,
or in the design of a new catalog. For instance, market basket analysis
may help you design different store layouts.

– In one strategy, items that are frequently purchased together can be
  placed in proximity to further encourage the combined sale of such
  items.
– If customers who purchase computers also tend to buy
antivirus software at the same time, then placing the
hardware display close to the software display may help
increase the sales of both items.
– In an alternative strategy, placing hardware and software at
opposite ends of the store may entice customers who
purchase such items to pick up other items along the way.
– For instance, after deciding on an expensive computer, a
customer may observe security systems for sale while
heading toward the software display to purchase antivirus
software, and may decide to purchase a home security
system as well.
– Market basket analysis can also help retailers plan which
items to put on sale at reduced prices.
• Association Rules:
  – Frequently purchased together items can be represented as a pattern,
    and thus defined as an association rule:
    • computer => antivirus software [support = 2%, confidence = 60%]
  – Rule support and confidence are two measures of rule interestingness.
    • They reflect the usefulness and certainty of discovered rules.
  – A support of 2% means that 2% of all transactions under analysis show
    that computer and antivirus software are purchased together.
  – A confidence of 60% means that 60% of the customers who purchased a
    computer also purchased antivirus software.
• Association rules are considered interesting if
they satisfy both a minimum support
threshold and a minimum confidence
threshold.
• These thresholds can be set by users or
domain experts.
• Additional analysis can be performed to
discover interesting statistical correlations
between associated items.
Frequent Itemsets, Closed Itemsets,
and Association Rules
• Concepts:
  – Itemset I = {i1, i2, ..., im}, the set of all items
  – D, the task-relevant data, is a set of database transactions, where
    each transaction T is a nonempty itemset such that T ⊆ I. Each
    transaction is given a transaction id, TID. Let A be a set of items.
    A transaction T is said to contain A if A ⊆ T.
• Association Rule: an association rule is an implication of the form
  A => B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
• The rule A => B holds in the transaction set D with support s, where s
  is the percentage of transactions in D that contain A ∪ B (i.e., the
  union of sets A and B, or, both A and B).
• This is taken to be the probability P(A ∪ B)
  – support(A => B) = (no. of transactions that contain both A and B) /
    (total number of transactions)
• The rule A => B has confidence c in the transaction set D, where c is
  the percentage of transactions in D containing A that also contain B.
• This is taken to be the conditional probability P(B|A)
  – confidence(A => B) = (no. of transactions containing both A and B) /
    (no. of transactions containing A)
• Rules that satisfy both a minimum support threshold (min_sup) and a
  minimum confidence threshold (min_conf) are called strong.
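A minimal Python sketch of these two formulas, computed over a toy list of transactions (the data and helper names are illustrative only):

```python
# Support and confidence as defined above, over a toy transaction list.
def support(D, itemset):
    """Relative support: fraction of transactions that contain itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in D) / len(D)

def confidence(D, A, B):
    """conf(A => B) = support(A u B) / support(A)."""
    return support(D, set(A) | set(B)) / support(D, A)

D = [{'computer', 'antivirus'}, {'computer'}, {'printer'},
     {'computer', 'antivirus', 'printer'}]
print(support(D, {'computer', 'antivirus'}))       # 0.5
print(confidence(D, {'computer'}, {'antivirus'}))  # 0.666...
```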
• A set of items is referred to as an itemset.
• An itemset that contains k items is a k-itemset.
– The set {computer, antivirus software } is a 2-itemset.
• The occurrence frequency of an itemset is the number of
transactions that contain the itemset. This is also known,
simply, as the frequency, support count, or count of the
itemset.
• The itemset support defined above is sometimes referred to as relative
  support.
• The occurrence frequency is called the absolute support. If the relative
  support of an itemset satisfies a prespecified minimum support
  threshold, then it is a frequent itemset. The set of frequent k-itemsets
  is denoted by Lk.
• Once the support counts of A, B, and A ∪ B are found, we can derive the
  strength of the association rule A => B.
• Thus mining association rules actually reduces to mining frequent
  itemsets.
• Association rule mining is a two-step process, as follows:
  – Find all frequent itemsets: each of these itemsets will occur at least
    as frequently as a predetermined minimum support count (min_sup).
  – Generate strong association rules from the frequent itemsets: these
    rules must satisfy minimum support and minimum confidence.
Example:

  Transaction-id | Items bought
  10             | A, B, D
  20             | A, C, D
  30             | A, D, E
  40             | B, E, F
  50             | B, C, D, E, F

• Itemset X = {x1, ..., xk}
• Association rule X => Y
• Support s: probability that a transaction contains X ∪ Y
  support(X => Y) = P(X ∪ Y)
• Confidence c: conditional probability that a transaction having X also
  contains Y
  confidence(X => Y) = P(Y|X) = P(X ∪ Y) / P(X)
• Association rules derived from this data:
  A => D (support 60%, confidence 100%)
  D => A (support 60%, confidence 75%)

For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the association rule s => l - s
if support_count(l) / support_count(s) ≥ min_conf
(min_conf = minimum confidence threshold)

  Tid | Items
  10  | A, B, E
  20  | B, E
  30  | B, C
  40  | A, B, D
  50  | A, C
  60  | B, C
  70  | A, C
  80  | A, B, C, E
  90  | A, B, C

Consider l = {A, B, E} with min_sup = 2.
Its nonempty proper subsets are {A,B}, {A,E}, {B,E}, {A}, {B}, {E}:
  {A,B} => {E}   confidence = 2/4 = 50%
  {A,E} => {B}   confidence = 2/2 = 100%
  {B,E} => {A}   confidence = 2/3 = 66%
  {A}   => {B,E} confidence = 2/6 = 33%
  {B}   => {A,E} confidence = 2/7 = 28%
  {E}   => {A,B} confidence = 2/3 = 66%
If the minimum confidence threshold is 70%, then {A,E} => {B} is the only
association rule generated.
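The computation above can be reproduced with a short Python sketch (the dataset and threshold are the ones from the example; the helper name `count` is mine):

```python
from itertools import combinations

# Rule generation for l = {A, B, E} over the nine transactions above.
D = [{'A','B','E'}, {'B','E'}, {'B','C'}, {'A','B','D'}, {'A','C'},
     {'B','C'}, {'A','C'}, {'A','B','C','E'}, {'A','B','C'}]

def count(itemset):
    """Support count: number of transactions containing itemset."""
    return sum(set(itemset) <= t for t in D)

l = frozenset({'A', 'B', 'E'})
min_conf = 0.70
for r in range(1, len(l)):
    for s in map(frozenset, combinations(sorted(l), r)):
        conf = count(l) / count(s)
        verdict = "accept" if conf >= min_conf else "reject"
        print(f"{set(s)} => {set(l - s)}  conf = {conf:.0%}  ({verdict})")
```

Only {A,E} => {B} reaches the 70% threshold, matching the table above.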
• Association mining involves two steps:
  – Find all frequent itemsets
  – Generate strong association rules from the frequent itemsets
• A major challenge in mining frequent itemsets is the generation of a
  huge number of itemsets satisfying min_sup:
  – If an itemset is frequent, each of its subsets is frequent as well
  – Hence a long frequent itemset implies a large number of shorter
    frequent itemsets
  – To reduce this, the following concepts are used:
    • Closed frequent itemset (a frequent itemset with no proper superset
      having the same support count)
    • Maximal frequent itemset (a frequent itemset with no proper superset
      that is frequent)
• Another example: suppose 5,000 transactions have been made through a
  popular eCommerce website, and we want to calculate the support,
  confidence, and lift for two products, say pen and notebook. Out of the
  5,000 transactions, 500 contain a pen, 700 contain a notebook, and 140
  contain both.
• SUPPORT: the number of transactions containing the item divided by the
  total number of transactions:
  support(pen) = transactions containing pen / total transactions
  – support(pen) = 500/5000 = 10 percent
  – support(notebook) = 700/5000 = 14 percent
• CONFIDENCE: measures how often the combined sale follows the individual
  sale; it is calculated as combined transactions / individual
  transactions:
  confidence(pen => notebook) = transactions containing both /
  transactions containing pen
  – confidence = 140/500 = 28 percent
• LIFT: the ratio of the confidence of the rule to the support of its
  consequent:
  lift(pen => notebook) = confidence(pen => notebook) / support(notebook)
  – lift = 28/14 = 2
• When the lift value is below 1, the combination is not frequently bought
  together by consumers.
• In this case, lift = 2 shows that the probability of buying both items
  together is high compared with the sales of the individual items.
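The same numbers in a short Python sketch (all counts are the illustrative figures from the example above):

```python
# Support, confidence, and lift for the pen/notebook example.
total, pen, notebook, both = 5000, 500, 700, 140

support_pen = pen / total              # 0.10
support_notebook = notebook / total    # 0.14
confidence = both / pen                # 0.28
lift = confidence / support_notebook   # 2.0 (> 1: positively correlated)
print(support_pen, support_notebook, confidence, lift)
```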
• Closed and maximal frequent itemsets: suppose that a transaction
  database has only two transactions: {a1, a2, ..., a100} and
  {a1, a2, ..., a50}.
• Let the minimum support count threshold be min_sup = 1.
• We find two closed frequent itemsets and their support counts, that is,
  C = {{a1, a2, ..., a100} : 1; {a1, a2, ..., a50} : 2}.
• There is only one maximal frequent itemset:
  M = {{a1, a2, ..., a100} : 1}.
• The set of closed frequent itemsets contains complete information
  regarding the frequent itemsets.
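A small sketch checking the support counts in this example (the set representation of the two transactions is my own assumption):

```python
# The two-transaction database from the closed/maximal example.
D = [set(range(1, 101)), set(range(1, 51))]  # {a1..a100}, {a1..a50}

def sup(X):
    """Support count of itemset X in D."""
    return sum(X <= t for t in D)

X50 = set(range(1, 51))
X100 = set(range(1, 101))
print(sup(X50), sup(X100))  # 2 1
# {a1..a50} is closed: every proper superset has support at most 1 < 2.
# {a1..a100} is closed with support 1, and it is the only maximal frequent
# itemset, since it has no frequent proper superset at all.
```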
Frequent itemset mining methods
• There are many kinds of frequent pattern mining algorithms, for
  association rules, correlation analysis, etc.
• They are classified in many ways based on the following criteria:
• Based on the completeness of patterns to be mined
  – Complete set of frequent itemsets
  – Closed frequent itemsets
  – Maximal frequent itemsets
  – Constrained frequent itemsets
  – Approximate frequent itemsets
  – Top-k frequent itemsets
• Based on levels of abstraction
• Based on the number of data dimensions involved
in the rule
• Based on the types of values handled by the rules
  – Boolean association rules (presence/absence of items)
  – Quantitative association rules
• Based on the kinds of rules to be mined.
– Association rules
– Correlation rules
• Based on the kinds of patterns to be mined
– Frequent itemsets / sequential patterns / structured
patterns
Basic Algorithms
Apriori Algorithm – It is based on the large-itemset (Apriori) property.
Apriori property: all nonempty subsets of a frequent itemset must also be
frequent. Large itemsets are downward closed (if an itemset satisfies the
minimum support, so do all of its subsets).
If we know that an itemset is infrequent, we need not consider any
superset of it as a candidate, because it will also be infrequent.
Apriori employs an iterative approach known as level-wise search, where
k-itemsets are used to explore (k+1)-itemsets:
• Initially, scan the DB once to get the frequent 1-itemsets
• Generate length-(k+1) candidate itemsets from length-k frequent itemsets
• Test the candidates against the DB
• Terminate when no frequent or candidate set can be generated
Apriori pruning principle: if any itemset is infrequent, its supersets
should not be generated/tested.
Method:
Lk denotes the set of frequent k-itemsets (large itemsets)
Ck is a superset of Lk – the candidates for large itemsets
1st scan (min_sup = 2):

  Tid | Items          C1: Itemset | sup      L1: Itemset | sup
  10  | A, B, E            {A}     | 6            {A}     | 6
  20  | B, E               {B}     | 7            {B}     | 7
  30  | B, C               {C}     | 6            {C}     | 6
  40  | A, B, D            {D}     | 1            {E}     | 3
  50  | A, C               {E}     | 3
  60  | B, C
  70  | A, C
  80  | A, B, C, E
  90  | A, B, C

{D} is dropped from L1 because its support count 1 < min_sup.
A two-step process consisting of join and prune actions is followed to
generate Lk from Lk-1.
Join step: Apriori assumes that items within a transaction or itemset are
sorted in lexicographic order. The candidate set Ck is generated by taking
the join Lk-1 ⋈ Lk-1, where members of Lk-1 are joinable if their first
k-2 items are in common. This ensures that no duplicates are generated.
Prune step: to reduce the size of Ck, the Apriori property is used as
follows: any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is
not in Lk-1, the candidate cannot be frequent and can be removed from Ck.
The support count of each candidate in Ck is then compared with the
minimum support count to determine Lk.
2nd scan (min_sup = 2):

  C2 (from L1 ⋈ L1): Itemset | sup      L2: Itemset | sup
                     {A, B}  | 4            {A, B}  | 4
                     {A, C}  | 4            {A, C}  | 4
                     {A, E}  | 2            {A, E}  | 2
                     {B, C}  | 4            {B, C}  | 4
                     {B, E}  | 3            {B, E}  | 3
                     {C, E}  | 1
3rd scan (min_sup = 2):
From L2 ⋈ L2, C3 = {{A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}}.
Prune step:
  The 2-item subsets of {A,B,C} are {A,B}, {B,C}, {A,C}, which are all in
  L2 – keep {A,B,C}.
  The 2-item subsets of {A,B,E} are {A,B}, {B,E}, {A,E}, which are all in
  L2 – keep {A,B,E}.
  The 2-item subsets of {A,C,E} are {A,C}, {C,E}, {A,E}; {C,E} is not in
  L2 – remove {A,C,E}.
  The 2-item subsets of {B,C,E} are {B,C}, {C,E}, {B,E}; {C,E} is not in
  L2 – remove {B,C,E}.
Counting the remaining candidates gives:

  L3: Itemset   | sup
      {A, B, C} | 2
      {A, B, E} | 2

From L3 ⋈ L3, C4 = {{A,B,C,E}}. The 3-item subsets of {A,B,C,E} are
{A,B,C}, {A,B,E}, {A,C,E}, and {B,C,E}; {A,C,E} and {B,C,E} are not in L3,
so {A,B,C,E} is removed. Thus C4 is empty and the algorithm terminates,
having found all the frequent itemsets.
The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

Algorithm Apriori
  L1 = {frequent 1-itemsets};
  for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = apriori_generate(Lk);  // candidates generated from Lk
    for each transaction t in the database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;

Algorithm apriori_generate(Lk)
  for each itemset l1 in Lk
    for each itemset l2 in Lk
      if the first k-1 elements of l1 and l2 are equal
      // i.e., l1[1]=l2[1] and l1[2]=l2[2] and ... l1[k-1]=l2[k-1]
      // and l1[k] < l2[k]
        c = l1 ⋈ l2
        add c to Ck+1
        for each k-subset s of c
          if s does not belong to Lk then
            delete c; break

The Apriori algorithm assumes that the dataset is memory resident. The
maximum number of DB scans is one more than the cardinality of the largest
itemset. The large number of data scans is a weakness of Apriori.
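A runnable Python sketch of the pseudocode above, tried on the nine-transaction dataset used earlier; names such as `apriori` and `count` are mine, not part of the original algorithm listing:

```python
from itertools import combinations

def apriori(D, min_sup):
    """Level-wise search with join and prune, as in the pseudocode."""
    D = [frozenset(t) for t in D]
    items = {i for t in D for i in t}

    def count(c):
        return sum(c <= t for t in D)

    L = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    all_frequent = {}
    k = 1
    while L:
        all_frequent.update({l: count(l) for l in L})
        # Join: merge pairs of frequent k-itemsets sharing k-1 items.
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune: drop candidates having an infrequent k-subset.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c for c in C if count(c) >= min_sup}
        k += 1
    return all_frequent

D = [{'A','B','E'}, {'B','E'}, {'B','C'}, {'A','B','D'}, {'A','C'},
     {'B','C'}, {'A','C'}, {'A','B','C','E'}, {'A','B','C'}]
for itemset, sup in sorted(apriori(D, 2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```

Running this reproduces L1 through L3 from the worked example, including {A,B,C}:2 and {A,B,E}:2.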
• The Apriori procedure performs two kinds of actions, namely join and
  prune, as described before.
• In the join component, Lk-1 is joined with Lk-1 to generate potential
  candidates.
• The prune component employs the Apriori property to remove candidates
  that have a subset that is not frequent.
• Generating association rules from frequent
item sets:
– Once the frequent itemsets from transactions in a
database D have been found, it is straightforward
to generate strong association rules from them
(where strong association rules satisfy both
minimum support and minimum confidence)
– This can be done using confidence
    • conf(A => B) = support_count(A ∪ B) / support_count(A)
    • where support_count(A ∪ B) is the number of transactions containing
      the itemset A ∪ B, and support_count(A) is the number of
      transactions containing A.
• Thus association rules can be generated as follows:
  For each frequent itemset l, generate all nonempty subsets of l.
  For every nonempty subset s of l, output the association rule s => l - s
  if support_count(l)/support_count(s) ≥ min_conf
  (min_conf = minimum confidence threshold)
• Because the rules are generated from frequent itemsets, each one
  automatically satisfies the minimum support. Frequent itemsets can be
  stored ahead of time in hash tables along with their counts so that they
  can be accessed quickly.
• Improving the efficiency of Apriori: several standard variations improve
  the basic algorithm, including hash-based counting of itemsets,
  transaction reduction, partitioning of the data, sampling, and dynamic
  itemset counting.
• Worked example on the AllElectronics transaction data (min_sup = 2):

  TID  | List of item IDs     C1: Itemset | sup    C2: Itemset  | sup
  T100 | I1, I2, I5               I1      | 6          {I1, I2} | 4
  T200 | I2, I4                   I2      | 7          {I1, I3} | 4
  T300 | I2, I3                   I3      | 6          {I1, I4} | 1
  T400 | I1, I2, I4               I4      | 2          {I1, I5} | 2
  T500 | I1, I3                   I5      | 2          {I2, I3} | 4
  T600 | I2, I3                                        {I2, I4} | 2
  T700 | I1, I3                                        {I2, I5} | 2
  T800 | I1, I2, I3, I5                                {I3, I4} | 0
  T900 | I1, I2, I3                                    {I3, I5} | 1
                                                       {I4, I5} | 0

• L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}
• C3 after join and prune, with counts: {I1,I2,I3} : 2 and {I1,I2,I5} : 2
  (candidates {I1,I2,I4}, {I1,I3,I5}, {I2,I3,I5}, {I2,I3,I4} are pruned)
• C4 = {{I1,I2,I3,I5}} has support count 1 < min_sup, so the algorithm
  terminates.
Frequent Pattern-Tree
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of
  candidates
• Solution: find frequent itemsets without candidate generation
• Frequent pattern growth: the transaction database is represented as an
  FP-tree (frequent pattern tree), and then the FP-tree is mined to get
  the frequent itemsets
• Create the root of the tree as null
• The frequent items are sorted in descending order of support count,
  giving the list L
  L = {(B:7), (A:6), (C:6), (E:3), (D:1)}
• The items in each transaction are processed in L order and a branch is
  created for each transaction
• Increment the count of a node if it is shared with other transactions
Frequent Pattern-Tree (figure): the FP-tree built from the nine
transactions above. Item support counts: B:7, A:6, C:6, E:3, D:1. The root
is null with two children, B:7 and A:2. Under B:7 are A:4, E:1, and C:2;
under B-A:4 are E:1, C:2, and D:1; under B-A-C:2 is E:1. Under the root's
A:2 branch is C:2.
Frequent Pattern-Tree with header table (figure): to facilitate tree
traversal, an item header table is built (item, support count, node-link)
with entries B:7, A:6, C:6, E:3, D:1, so that each item points to its
occurrences in the tree via a chain of node-links.
Frequent Pattern-Tree Mining
• Scan the DB once; find the frequent 1-itemsets and their support counts
• Sort the frequent items in descending order of support count to get the
  list of frequent items L
• Create the root of the FP-tree as null
• Scan the DB again. For each transaction, sort its items in L order. Let
  the sorted list be [p|P], where p is the first element and P is the
  remaining list. If p is already present in the tree, increment its node
  count; else create a new node and link it into the node-link chain for
  that item. If P is nonempty, insert P recursively.
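A compact Python sketch of this construction (class and function names such as `FPNode` and `build_fp_tree` are my own, not from the text):

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, parent, and children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(D, min_sup):
    # First scan: frequent 1-itemsets and their support counts.
    freq = Counter(i for t in D for i in t)
    order = sorted((i for i in freq if freq[i] >= min_sup),
                   key=lambda i: -freq[i])       # the list L
    root = FPNode(None, None)
    header = {i: [] for i in order}              # item -> node-link chain
    # Second scan: insert each transaction, sorted in L order.
    for t in D:
        node = root
        for item in (i for i in order if i in t):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                      # shared prefix: increment
    return root, header
```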
Construction example (figures): create the root as null. The first
transaction in L order is B, A, E, giving the path B:1 - A:1 - E:1. The
second transaction in L order is B, E; B is shared, so its count becomes
B:2, and a new child E:1 is created under B. The remaining transactions
are inserted in the same way.
Frequent Pattern-Tree Mining
• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the node-links of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p's
  conditional pattern base

Conditional pattern bases:
  E: {B, A : 1}, {B, A, C : 1}, {B : 1}
  C: {B, A : 2}, {B : 2}, {A : 2}
  A: {B : 4}
From each conditional pattern base, a conditional FP-tree is built and
mined recursively (items below min_sup are removed from each conditional
tree):

  Item | Conditional FP-tree   | Frequent patterns generated
  E    | {B:3, A:2}            | {E, B : 3}, {E, A : 2}, {E, A, B : 2}
  C    | {B:4, A:2}, {A:2}     | {C, B : 4}, {C, A : 4}, {C, A, B : 2}
  A    | {B:4}                 | {A, B : 4}
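Continuing the sketch started after the construction algorithm (it reuses `FPNode` and `build_fp_tree` from there), the conditional pattern bases and the recursive mining step might look like this; the helper names are again my own assumptions:

```python
def prefix_paths(header, item):
    """Conditional pattern base: (prefix path, count) per node of item."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path, node.count))
    return base

def fp_growth(D, min_sup, suffix=frozenset()):
    _, header = build_fp_tree(D, min_sup)
    patterns = {}
    for item in header:
        support = sum(n.count for n in header[item])
        pattern = suffix | {item}
        patterns[pattern] = support
        # Each prefix path, replicated by its count, becomes a transaction
        # of the conditional database; recurse on that database.
        cond_db = [path for path, c in prefix_paths(header, item)
                   for _ in range(c)]
        patterns.update(fp_growth(cond_db, min_sup, pattern))
    return patterns

D = [{'A','B','E'}, {'B','E'}, {'B','C'}, {'A','B','D'}, {'A','C'},
     {'B','C'}, {'A','C'}, {'A','B','C','E'}, {'A','B','C'}]
for p, s in sorted(fp_growth(D, 2).items(),
                   key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(p), s)   # e.g. {'A', 'B', 'E'} 2, matching the table above
```

The output agrees with the Apriori result on the same data, as expected, since FP-growth finds the same complete set of frequent itemsets without candidate generation.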
