
MACHINE LEARNING

Mining Association Rules

Presented by: Dr. S. Nadeem Ahsan


(Slides adapted from the presentation http://staffwww.itn.liu.se/~aidvi/courses/06/dm/lectures/lec7.pdf, and
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations)
Welcome!!

Lecture 05
Learning Outcomes
• Concepts of basket data, items, itemsets, and frequent itemsets
• Frequent Pattern Mining & Apriori Algorithm
Association Rules
• Mining Association Rules between Sets of Items in Large Databases
  (R. Agrawal, T. Imielinski & A. Swami), 1993.

• Fast Algorithms for Mining Association Rules
  (R. Agrawal & R. Srikant), 1994.
Basket Data
Retail organizations, e.g., supermarkets, collect and store massive amounts of sales data, called basket data.
A record consists of
– transaction date
– items bought
Alternatively, basket data may consist of the items bought by a customer over a period of time.
Example Association Rule
90% of transactions that purchase bread and butter also purchase milk.

Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%
Example Queries
• Find all rules that have “Diet Coke” in the antecedent.
• Find all rules that have “bread” in the antecedent and “egg” in the consequent.
• Find all rules relating items located on shelves A and B in the store.
Formal Model
• I = {i1, i2, …, im}: a set of literals (items)
• D: a database of transactions
• T ∈ D: a transaction, with T ⊆ I
  – TID: a unique identifier associated with each T
• X: a subset of I
  – T contains X if X ⊆ T
Formal Model (Cont.)
• Association rule: X ⇒ Y,
  where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
• Rule X ⇒ Y has confidence c in D
  if c% of the transactions in D that contain X also contain Y.
• Rule X ⇒ Y has support s in D
  if s% of the transactions in D contain X ∪ Y.
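Both measures follow directly from these definitions. A minimal Python sketch, assuming transactions are given as an in-memory list of sets (the function names are illustrative, not from the slides):

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(X, Y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)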
Problem
• Given a set of transactions, generate all association rules whose support and confidence exceed the user-specified minimum support (minsup) and minimum confidence (minconf).
How Good is an Association Rule?
POS Transactions
Customer  Items Purchased
1         OJ, soda
2         Milk, OJ, window cleaner
3         OJ, detergent
4         OJ, detergent, soda
5         Window cleaner, soda

Co-occurrence of Products
                 OJ  Window cleaner  Milk  Soda  Detergent
OJ               4   1               1     2     2
Window cleaner   1   2               1     1     0
Milk             1   1               1     0     0
Soda             2   1               0     3     1
Detergent        2   0               0     1     2
How Good is an Association Rule?
                 OJ  Window cleaner  Milk  Soda  Detergent
OJ               4   1               1     2     2
Window cleaner   1   2               1     1     0
Milk             1   1               1     0     0
Soda             2   1               0     3     1
Detergent        2   0               0     1     2

Simple patterns:
1. OJ and soda are more likely to be purchased together than any other two items.
2. Detergent is never purchased with milk or window cleaner.
3. Milk is never purchased with soda or detergent.
How Good is an Association Rule?
POS Transactions
Customer  Items Purchased
1         OJ, soda
2         Milk, OJ, window cleaner
3         OJ, detergent
4         OJ, detergent, soda
5         Window cleaner, soda

• What is the confidence of this rule:
  – If a customer purchases soda, then the customer also purchases OJ?
  – 2 out of 3 soda purchases also include OJ, so 67%.
• What about the confidence of this rule reversed?
  – 2 out of 4 OJ purchases also include soda, so 50%.
• Confidence = ratio of the number of transactions with all the items to the number of transactions with just the “if” items.
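Using the support/confidence helpers sketched earlier, these numbers can be checked directly (data encoded from the table above):

pos_transactions = [
    {"OJ", "soda"},
    {"Milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]
print(confidence({"soda"}, {"OJ"}, pos_transactions))  # 2/3 ≈ 0.67
print(confidence({"OJ"}, {"soda"}, pos_transactions))  # 2/4 = 0.50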
Association Rule Mining
• Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Oil, Eggs
3    Milk, Diaper, Oil, Coke
4    Bread, Milk, Diaper, Oil
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} ⇒ {Oil}
{Milk, Bread} ⇒ {Eggs, Coke}
{Oil, Bread} ⇒ {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset
    • An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Oil, Eggs
3    Milk, Diaper, Oil, Coke
4    Bread, Milk, Diaper, Oil
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association rule
  – An implication expression of the form X ⇒ Y, where X and Y are itemsets
  – Example: {Milk, Diaper} ⇒ {Oil}
• Rule evaluation metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Oil, Eggs
3    Milk, Diaper, Oil, Coke
4    Bread, Milk, Diaper, Oil
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} ⇒ {Oil}
s = σ(Milk, Diaper, Oil) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Oil) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Oil, Eggs
3    Milk, Diaper, Oil, Coke
4    Bread, Milk, Diaper, Oil
5    Bread, Milk, Diaper, Coke

Example of rules:
{Milk, Diaper} ⇒ {Oil}   (s=0.4, c=0.67)
{Milk, Oil} ⇒ {Diaper}   (s=0.4, c=1.0)
{Diaper, Oil} ⇒ {Milk}   (s=0.4, c=0.67)
{Oil} ⇒ {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} ⇒ {Milk, Oil}   (s=0.4, c=0.5)
{Milk} ⇒ {Diaper, Oil}   (s=0.4, c=0.5)

Observations:
• All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Oil}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

• Frequent itemset generation is still computationally expensive


Frequent Itemsets Generation
Computational Complexity
Frequent Itemsets Generation Strategies
Apriori Algorithm
Power Set
• Given a set S, the power set P is the set of all subsets of S
• Known property of power sets:
  – If S has n elements, P has N = 2^n elements.
• Examples:
  – For S = {}, P = {{}}, N = 2^0 = 1
  – For S = {Milk}, P = {{}, {Milk}}, N = 2^1 = 2
  – For S = {Milk, Diaper},
    P = {{}, {Milk}, {Diaper}, {Milk, Diaper}}, N = 2^2 = 4
  – For S = {Milk, Diaper, Oil},
    P = {{}, {Milk}, {Diaper}, {Oil}, {Milk, Diaper}, {Diaper, Oil}, {Oil, Milk}, {Milk, Diaper, Oil}}, N = 2^3 = 8
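For small sets, the power set can be enumerated directly. A short Python sketch using only the standard library (the function name is illustrative):

from itertools import chain, combinations

def power_set(s):
    """All subsets of s, from the empty set up to s itself (2**len(s) in total)."""
    items = list(s)
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

print(list(power_set({"Milk", "Diaper"})))
# [(), ('Milk',), ('Diaper',), ('Milk', 'Diaper')]  (ordering may vary)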

Brute-Force Approach to Frequent Itemset Generation
• For an itemset with 3 elements, we have 8 subsets
  – Each subset is a candidate frequent itemset which needs to be matched against each transaction

TID  Items
1    Bread, Milk
2    Bread, Diaper, Oil, Eggs
3    Milk, Diaper, Oil, Coke
4    Bread, Milk, Diaper, Oil
5    Bread, Milk, Diaper, Coke

1-itemsets
Itemset   Count
{Milk}    4
{Diaper}  4
{Oil}     3

2-itemsets
Itemset         Count
{Milk, Diaper}  3
{Diaper, Oil}   3
{Oil, Milk}     2

3-itemsets
Itemset              Count
{Milk, Diaper, Oil}  2

Important observation: counts of subsets can't be smaller than the count of an itemset!
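A brute-force counter matches every candidate against every transaction, as in the Python sketch below (variable names are illustrative); the cost grows exponentially with the number of distinct items, which is exactly what Apriori attacks:

from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Oil", "Eggs"},
    {"Milk", "Diaper", "Oil", "Coke"},
    {"Bread", "Milk", "Diaper", "Oil"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

# Count every non-empty subset of the item universe against every transaction.
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= 2:  # e.g., a minimum support count of 2
            print(candidate, count)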

Reducing Number of Candidates
• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent

• The Apriori principle holds due to the following property of the support measure:
  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
  – The support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
Illustrating Apriori Principle
Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Oil     3
Diaper  4
Eggs    1

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs)
Itemset          Count
{Bread, Milk}    3
{Bread, Oil}     2
{Bread, Diaper}  3
{Milk, Oil}      2
{Milk, Diaper}   3
{Oil, Diaper}    3

Triplets (3-itemsets)
Itemset                Count
{Bread, Milk, Diaper}  3

Write all possible 3-itemsets and prune the list based on the infrequent 2-itemsets.
Apriori Algorithm
• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
    • Prune candidate itemsets containing subsets of length k that are infrequent
    • Count the support of each candidate by scanning the DB
    • Eliminate candidates that are infrequent, leaving only those that are frequent

Note: this algorithm makes several passes over the transaction list. A sketch follows below.
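A compact Python sketch of this loop, assuming a small in-memory list of transactions (the names apriori and min_count are illustrative):

from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count the 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result, k = dict(frequent), 1
    while frequent:
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: drop candidates with an infrequent k-subset (Apriori principle).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # One pass over the data to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

print(apriori([{"Bread", "Milk"}, {"Bread", "Diaper", "Oil", "Eggs"},
               {"Milk", "Diaper", "Oil", "Coke"}, {"Bread", "Milk", "Diaper", "Oil"},
               {"Bread", "Milk", "Diaper", "Coke"}], min_count=3))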
Apriori Algorithm (Mining Frequent Patterns)
Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L − f satisfies the minimum confidence requirement
  – If {A, B, C, D} is a frequent itemset, the candidate rules are:
    ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A,
    A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
    AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD,
    BD ⇒ AC, CD ⇒ AB

• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L)
• Because rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold
  – Rule generation therefore only needs to check the minimum confidence threshold
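A direct enumeration of these 2^k − 2 candidate splits in Python (the function name is illustrative):

from itertools import combinations

def candidate_rules(L):
    """Yield every (antecedent, consequent) split of frequent itemset L."""
    L = frozenset(L)
    for k in range(1, len(L)):            # non-empty, proper antecedents
        for f in combinations(L, k):
            yield frozenset(f), L - frozenset(f)

for X, Y in candidate_rules({"A", "B", "C", "D"}):
    print(set(X), "=>", set(Y))           # 2**4 - 2 = 14 rules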
Computing Confidence for Rules
• Unlike computing support, computing confidence does not require several passes over the transaction list
• Supports computed during frequent itemset generation can be reused
• The tables below show support values for all (non-null) subsets of the itemset {Bread, Milk, Diaper}
• Confidence values are shown for the (2^3 − 2) = 6 rules generated from this frequent itemset

Itemset   Support      Itemset          Support
{Milk}    4/5 = 0.8    {Milk, Diaper}   3/5 = 0.6
{Diaper}  4/5 = 0.8    {Diaper, Bread}  3/5 = 0.6
{Bread}   4/5 = 0.8    {Bread, Milk}    3/5 = 0.6

Itemset                Support
{Bread, Milk, Diaper}  3/5 = 0.6

Example of rules:
{Bread, Milk} ⇒ {Diaper}   (s=0.6, c=1.0)
{Milk, Diaper} ⇒ {Bread}   (s=0.6, c=1.0)
{Diaper, Bread} ⇒ {Milk}   (s=0.6, c=1.0)
{Bread} ⇒ {Milk, Diaper}   (s=0.6, c=0.75)
{Diaper} ⇒ {Milk, Bread}   (s=0.6, c=0.75)
{Milk} ⇒ {Diaper, Bread}   (s=0.6, c=0.75)
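With the supports cached, confidence is a single lookup and division; a minimal Python sketch (the supports dictionary mirrors the tables above and is illustrative):

supports = {
    frozenset({"Bread", "Milk", "Diaper"}): 0.6,
    frozenset({"Bread", "Milk"}): 0.6,
    frozenset({"Milk"}): 0.8,
}

def conf(X, Y):
    """Confidence of X => Y from cached supports; no pass over the data."""
    return supports[frozenset(X) | frozenset(Y)] / supports[frozenset(X)]

print(conf({"Bread", "Milk"}, {"Diaper"}))   # 0.6 / 0.6 = 1.0
print(conf({"Milk"}, {"Bread", "Diaper"}))   # 0.6 / 0.8 = 0.75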

Rule Generation in the Apriori Algorithm
• For each frequent k-itemset where k ≥ 2:
  – Generate high-confidence rules with one item in the consequent
  – Using these rules, iteratively generate high-confidence rules with more than one item in the consequent
  – If any rule has low confidence, then all other rules whose consequents contain its consequent can be pruned (not generated); a sketch follows below

Example for {Bread, Milk, Diaper}:
1-item consequents:
{Bread, Milk} ⇒ {Diaper}
{Milk, Diaper} ⇒ {Bread}
{Diaper, Bread} ⇒ {Milk}

2-item consequents:
{Bread} ⇒ {Milk, Diaper}
{Diaper} ⇒ {Milk, Bread}
{Milk} ⇒ {Diaper, Bread}
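A Python sketch of this level-wise rule generation with confidence-based pruning (names are illustrative; supports is the cached itemset-to-support map from frequent itemset generation):

def rules_for(L, supports, min_conf):
    """Generate rules from frequent itemset L, growing consequents level by level."""
    L = frozenset(L)
    rules = []
    if len(L) < 2:                              # rules need k >= 2
        return rules
    consequents = [frozenset([i]) for i in L]   # start with 1-item consequents
    while consequents:
        next_level = set()
        for Y in consequents:
            X = L - Y
            c = supports[L] / supports[X]
            if c >= min_conf:
                rules.append((X, Y, c))
                # Only consequents of high-confidence rules are extended.
                next_level.update(Y | frozenset([i]) for i in X)
        consequents = [Y for Y in next_level if len(Y) < len(L)]
    return rules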

Transactional Data For an Electronics Branch
(Example)
Generation of candidate itemsets and frequent
itemsets, where the minimum support count is 2.
Review Questions
1. What are support and confidence?
2. How many rules can be generated from an itemset containing 2 items?
3. How many itemsets can be generated from 4 distinct items?
4. Why is the Apriori algorithm slow?
Thank you
