
ASSOCIATION RULE MINING

AND APRIORI ALGORITHM


By:
AMBOOJ
ANAM IQBAL
M.Tech CSE, 3rd Semester

Guided by:
HINA FIRDAUS
Dr. Harleen Kaur
WHAT IS FREQUENT PATTERN ANALYSIS?

 First proposed by Agrawal, Imielinski, and Swami in the context of frequent
itemset and association rule mining.
 Motivation: finding inherent regularities in data.
 Pattern mining algorithms can be applied to various types of data, such as:
 transaction databases
 sequence databases
 stream, graph etc.
 Pattern mining algorithms can be designed to discover various types of patterns:
 subgraphs,
 associations,
 indirect associations,
 trends,
 periodic patterns,
 sequential rules, lattices, sequential patterns, high-utility patterns, etc.
WHY IS FREQUENT PATTERN MINING IMPORTANT?

 Frequent pattern:
An intrinsic and important property of datasets.
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
WHAT IS ASSOCIATION MINING?
 Proposed by Agrawal et al. in 1993.
 Association rule mining
It is a procedure for finding frequent patterns, correlations, associations, or causal
structures in data sets stored in various kinds of databases, such as relational databases,
transactional databases, and other forms of data repositories.
 It is an important data mining model studied extensively by the database and data
mining community.
APPLICATION OF ASSOCIATION

 Market Basket Analysis:


given a database of customer transactions, where each transaction is a set of items the goal is
to find groups of items which are frequently purchased together.
 Telecommunication
each customer is a transaction containing the set of phone calls
 Credit Cards/ Banking Services
each card/account is a transaction containing the set of customer’s payments
 Medical Treatments
each patient is represented as a transaction containing the ordered set of diseases
 Basketball-Game Analysis
each game is represented as a transaction containing the ordered set of ball passes
MARKET BASKET ANALYSIS
 INPUT: list of purchases by purchaser
 do not have names
 identify purchase patterns
 what items tend to be purchased together
 obvious: steak-potatoes; beer-pretzels
 what items are purchased sequentially
 obvious: house-furniture; car-tires
 what items tend to be purchased by season
CONTINUE…
 Categorize customer purchase behavior
 identify actionable information
 purchase profiles
 profitability of each purchase profile
 use for marketing
 layout or catalogs
 select products for promotion
 space allocation, product placement
 Market Basket Benefits
 selection of promotions, merchandising strategy
 sensitive to price: Italian entrees, pizza, pies, Oriental entrees, orange juice
 uncover consumer spending patterns
 correlations: orange juice & waffles
 joint promotional opportunities
POSSIBLE MARKET BASKETS
Customer 1: diapers, baby lotion, grapefruit juice, baby food, milk
Customer 2: soda, potato chips, milk
Customer 3: soup, beer, milk, ice cream
Customer 4: soda, coffee, milk, bread
Customer 5: beer, potato chips
CO-OCCURRENCE TABLE
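As an illustration, the co-occurrence counts for the five baskets listed above can be computed with a short script (a minimal sketch; the pair-counting approach and variable names are our own, not part of the original slides):

```python
from itertools import combinations
from collections import Counter

# The five market baskets listed above
baskets = [
    {"diapers", "baby lotion", "grapefruit juice", "baby food", "milk"},
    {"soda", "potato chips", "milk"},
    {"soup", "beer", "milk", "ice cream"},
    {"soda", "coffee", "milk", "bread"},
    {"beer", "potato chips"},
]

# Count how often each pair of items appears in the same basket
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Milk and soda co-occur twice; every other pair only once
print(pair_counts[("milk", "soda")])
```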
LIMITATIONS
 takes over 18 months to implement
 market basket analysis only identifies hypotheses, which need to be tested
 neural network, regression, decision tree analyses
 measurement of impact needed
 difficult to identify product groupings
 complexity grows exponentially
BENEFITS

 simple computations
 can be undirected (don’t have to have hypotheses before
analysis)
 different data forms can be analyzed
ASSOCIATION RULES
 Wal-Mart customers who purchase Barbie dolls have a 60%
likelihood of also purchasing one of three types of candy bars
[Forbes, Sept 8, 1997]
 Customers who purchase maintenance agreements are very likely
to purchase large appliances (author experience)
 When a new hardware store opens, one of the most commonly sold
items is toilet bowl cleaners (author experience)
 So what…
WHAT IS ASSOCIATION RULE MINING?
 Association Analysis is used for discovering interesting relationships
hidden in large data sets.

 Proposed by Agrawal et al. in 1993.
 It is an important data mining model studied extensively by the
database and data mining community.
 Assume all data are categorical.
 Initially used for Market Basket Analysis to find how items purchased
by customers are related.
CONTINUED…
 Finding frequent patterns, associations, correlations, or causal
structures among sets of items in transaction databases, relational
databases, and other information repositories.
 Association rules are if/then statements that help uncover
relationships between seemingly unrelated data in a relational
database or other information repository.
 Association rules are widely used in various areas such as
telecommunication networks, market and risk management,
inventory control, etc.
 Programmers use association rules to build programs capable of
machine learning.
 The following rule can be extracted
from the data set shown in table 1:
 {Diapers} → {Beer}
 The rule suggests that a strong
relationship exists between the sale
of diapers and beer, because many
customers who buy diapers also buy
beer.
 Retailers can use this type of rule to help them identify new
opportunities for cross-selling their products to customers.

 Applications: basket data analysis, cross-marketing, catalog design,
loss-leader analysis, web log analysis, fraud detection.
 An association rule has two parts, an antecedent (if) and a consequent (then).

 An antecedent is an item found in the data.

 A consequent is an item that is found in combination with the antecedent.

 Rule form: Antecedent → Consequence

 Given:
 (1) a database of transactions,
 (2) each transaction is a list of items purchased by a customer in a visit.
 Find: all rules that correlate the presence of one set of items (itemset) with that
of another set of items.
 E.g., 98% of people who purchase tires and auto accessories also get
automotive services done.
THE MODEL: RULES
 A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
 An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
 An itemset is a set of items.
 E.g., X = {milk, bread, cereal} is an itemset.
 A k-itemset is an itemset with k items.
 E.g., {milk, bread, cereal} is a 3-itemset.
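The containment test above maps directly onto set operations; a minimal sketch (our own illustration, using Python sets):

```python
# A transaction t "contains" an itemset X when X is a subset of t
t = {"milk", "bread", "cereal", "eggs"}   # a transaction
X = {"milk", "bread", "cereal"}           # an itemset

contains = X <= t   # subset test: X ⊆ t
k = len(X)          # X is a k-itemset; here k = 3
print(contains, k)
```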
ASSOCIATION RULES
 Association rule types:
 Actionable Rules – contain high-quality, actionable information
 Trivial Rules – information already well-known by those familiar with
the business
 Inexplicable Rules – no explanation and do not suggest action
 Trivial and Inexplicable Rules occur most often
HOW GOOD IS AN ASSOCIATION RULE?

POS Transactions:
Customer 1: OJ, soda
Customer 2: Milk, OJ, window cleaner
Customer 3: OJ, detergent
Customer 4: OJ, detergent, soda
Customer 5: Window cleaner, soda

Co-occurrence of Products:
                 OJ   Window cleaner   Milk   Soda   Detergent
OJ                4         1            1      2        2
Window cleaner    1         2            1      1        0
Milk              1         1            1      0        0
Soda              2         1            0      3        1
Detergent         2         0            0      1        2
HOW GOOD IS AN ASSOCIATION RULE?
                 OJ   Window cleaner   Milk   Soda   Detergent
OJ                4         1            1      2        2
Window cleaner    1         2            1      1        0
Milk              1         1            1      0        0
Soda              2         1            0      3        1
Detergent         2         0            0      1        2

Simple patterns:
1. OJ and soda are more likely to be purchased together than
any other two items
2. Detergent is never purchased with milk or window cleaner
3. Milk is never purchased with soda or detergent
HOW GOOD IS AN ASSOCIATION RULE?
Customer Items Purchased (POS Transactions)
1: OJ, soda
2: Milk, OJ, window cleaner
3: OJ, detergent
4: OJ, detergent, soda
5: Window cleaner, soda

 What is the confidence for this rule:
 If a customer purchases soda, then the customer also purchases OJ
 2 out of 3 soda purchases also include OJ, so 67%
 What about the confidence of this rule reversed?
 2 out of 4 OJ purchases also include soda, so 50%
 Confidence = ratio of the number of transactions with all the items to
the number of transactions with just the “if” items
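The two confidence figures above can be checked with a short script (a sketch using our own helper function `confidence`, not part of the slides):

```python
# The five POS transactions from the table above
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]

def confidence(antecedent, consequent, db):
    """Fraction of transactions containing the antecedent ("if" items)
    that also contain the consequent ("then" items)."""
    with_if = [t for t in db if antecedent <= t]
    with_both = [t for t in with_if if consequent <= t]
    return len(with_both) / len(with_if)

print(confidence({"soda"}, {"OJ"}, transactions))  # 2/3 ≈ 0.67
print(confidence({"OJ"}, {"soda"}, transactions))  # 2/4 = 0.5
```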
CREATING ASSOCIATION RULES
1. Choosing the right set of
items
2. Generating rules by
deciphering the counts in
the co-occurrence matrix
3. Overcoming the practical
limits imposed by
thousands or tens of
thousands of unique items
ASSOCIATION RULES
Support
 “The support is the percentage of transactions that demonstrate the rule.”
 Example: database with transactions (customer_# : item_a1, item_a2, …)
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
 support {8, 12} = 2 (or 50%: 2 of 4 customers)
 support {1, 5} = 1 (or 25%: 1 of 4 customers)
 support {1} = 3 (or 75%: 3 of 4 customers)
 An itemset is called frequent if its support is equal to or greater than an
agreed-upon minimal value – the support threshold.
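The support counts above can be reproduced with a few lines of code (a sketch; the helper `support_count` is our own name for the counting step):

```python
# The four transactions from the example above (customer: items)
db = [
    {1, 3, 5},
    {1, 8, 14, 17, 12},
    {4, 6, 8, 12, 9, 104},
    {2, 1, 8},
]

def support_count(itemset, db):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in db if itemset <= t)

print(support_count({8, 12}, db))  # 2 (50% of 4 customers)
print(support_count({1, 5}, db))   # 1 (25%)
print(support_count({1}, db))      # 3 (75%)
```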
ASSOCIATION RULES
Confidence
 The confidence is the conditional probability that, given X present in a
transaction, Y will also be present.
 An association rule is of the form: X => Y
 X => Y: if someone buys X, he also buys Y
 Confidence measure, by definition:
 Confidence(X => Y) = support(X, Y) / support(X)


EXAMPLE
Example: database with transactions (customer_# : item_a1, item_a2, …)

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.

Conf ( {5} => {8} ) ?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
then conf( {5} => {8} ) = 4/5 = 0.8, or 80%
EXAMPLE

Example: database with transactions (customer_# : item_a1, item_a2, …)

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.

Conf ( {5} => {8} ) ? 80%. Done. Conf ( {8} => {5} ) ?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
then conf( {8} => {5} ) = 4/7 ≈ 0.57, or 57%
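Both confidence values can be verified directly from the definition conf(X => Y) = supp(X ∪ Y) / supp(X) (a sketch; the `supp` helper is our own shorthand):

```python
# The ten transactions from the example above
db = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

# Support count: number of transactions containing the itemset
supp = lambda items: sum(1 for t in db if items <= t)

# conf(X => Y) = supp(X ∪ Y) / supp(X)
conf_5_8 = supp({5, 8}) / supp({5})   # 4/5 = 0.8
conf_8_5 = supp({5, 8}) / supp({8})   # 4/7 ≈ 0.57
print(conf_5_8, conf_8_5)
```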
EXAMPLE
Example: database with transactions (customer_# : item_a1, item_a2, …)

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.

Conf ( {9} => {3} ) ?
supp({9}) = 1, supp({3}) = 4, supp({3,9}) = 1,
then conf( {9} => {3} ) = 1/1 = 1.0, or 100%. OK?
EXAMPLE

Example: database with transactions (customer_# : item_a1, item_a2, …)

Conf( {9} => {3} ) = 100%. Done.

Notice: high confidence, low support.
-> Rule ( {9} => {3} ) is not meaningful.
WHAT IS AN ASSOCIATION RULE MINING ALGORITHM?
 There are a large number of them!!
 They use different strategies and data structures.
 Their resulting sets of rules are all the same.
 Given a transaction data set T, a minimum support, and a minimum
confidence, the set of association rules existing in T is uniquely determined.
 Some of the proposed algorithms are:
 AIS Algorithm
 SETM Algorithm
 Apriori Algorithm *
 AprioriHybrid Algorithm.
 AprioriTid Algorithm
 FP growth Algorithm
APRIORI ALGORITHM
 In computer science and data mining, Apriori is a classic algorithm for
learning association rules.
 Apriori is designed to operate on databases containing transactions (for
example, collections of items bought by customers, or details of a website
frequentation).
 The algorithm attempts to find subsets which are common to at least a
minimum number C (the cutoff, or support threshold) of the itemsets.
 Apriori uses a "bottom up" approach, where frequent subsets are extended
one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data.
 The algorithm terminates when no further successful extensions are found.
 Apriori uses breadth-first search and a hash tree structure to count
candidate item sets efficiently.
APRIORI ALGORITHM PSEUDOCODE
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++)
    Ck+1 = GenerateCandidates(Lk)
    for each transaction t in database do
        increment count of candidates in Ck+1 that are contained in t
    endfor
    Lk+1 = candidates in Ck+1 with support ≥ min_sup
endfor
return ∪k Lk;
FREQUENT ITEMSET PROPERTY
Every subset of a frequent itemset is itself frequent (the Apriori, or downward-closure,
property). Equivalently, if an itemset is infrequent, all of its supersets are infrequent.
GENERATE CANDIDATES

• Assume the items in Lk are listed in an order (e.g., alphabetical)
• Step 1: self-joining Lk (in SQL)
insert into Ck+1
select p.item1, p.item2, …, p.itemk, q.itemk
from Lk p, Lk q
where p.item1 = q.item1 and … and p.itemk-1 = q.itemk-1 and p.itemk < q.itemk
• Step 2: pruning
forall itemsets c in Ck+1 do
    forall k-subsets s of c do
        if (s is not in Lk) then delete c from Ck+1
STEPS TO PERFORM APRIORI ALGORITHM
FORMULAS TO NOTE
 Min_sup count = minimum support percentage × total number of transactions in the
database
 A rule is strong when its confidence percentage is at least the minimum confidence
percentage.
APRIORI ALGORITHM EXAMPLES
If the minimum support is 50% and the minimum confidence is 50% in database D,
illustrate the Apriori algorithm for finding frequent itemsets in D.

Database D
TID: Items
100: 1, 3, 4
200: 2, 3, 5
300: 1, 2, 3, 5
400: 2, 5
CONTINUE…
Scan D → C1 with counts:
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (support ≥ 2):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (support ≥ 2):
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3: {2 3 5} → Scan D → L3: {2 3 5}: 2
CONTINUE…
Association rules from {2, 3, 5}:

Rule        Support   Confidence     Confidence (%)
2 -> 3^5       2      2/3 = 0.66        66.6%
3 -> 2^5       2      2/3 = 0.66        66.6%
5 -> 2^3       2      2/3 = 0.66        66.6%
2^3 -> 5       2      2/2 = 1.0         100%
2^5 -> 3       2      2/3 = 0.66        66.6%
3^5 -> 2       2      2/2 = 1.0         100%
CONTINUE…
With the confidence threshold set to 50%, the Strong Association Rules are
(sorted by confidence):
1. 2^3->5 (1.0)
2. 3^5->2 (1.0)
3. 2->3^5 (0.66)
4. 3->2^5 (0.66)
5. 5->2^3 (0.66)
6. 2^5->3 (0.66)
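The rule confidences above can be regenerated mechanically from database D (a sketch; the `supp` helper and `rules` dictionary are our own illustration):

```python
from itertools import combinations

# Database D from the worked example
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
supp = lambda s: sum(1 for t in D if s <= t)   # support count

# Enumerate every rule X -> Y with X ∪ Y = {2, 3, 5}, X and Y non-empty
itemset = frozenset({2, 3, 5})
rules = {}
for r in range(1, len(itemset)):
    for X in map(frozenset, combinations(sorted(itemset), r)):
        Y = itemset - X
        # conf(X -> Y) = supp(X ∪ Y) / supp(X)
        rules[(tuple(sorted(X)), tuple(sorted(Y)))] = supp(itemset) / supp(X)

for (X, Y), conf in sorted(rules.items(), key=lambda kv: -kv[1]):
    print(f"{X} -> {Y}: conf = {conf:.2f}")
```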
PRACTICE PROBLEM
Trace the results of using the Apriori algorithm on the grocery store example with support threshold
s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each
database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that
are generated, and highlight the strong ones, sorted by confidence.
SOLUTION
 Support threshold =33.34% => threshold is at least 2 transactions.
 Applying Apriori
 Note that {HotDogs, Buns, Coke} and {HotDogs, Buns, Chips} are not candidates when k=3
because their subsets {Buns, Coke} and {Buns, Chips} are not frequent.
 Note also that normally, there is no need to go to k=4 since the longest transaction has only 3 items.
 All Frequent Itemsets: {HotDogs}, {Buns}, {Ketchup}, {Coke}, {Chips}, {HotDogs, Buns},
{HotDogs, Coke}, {HotDogs, Chips}, {Coke, Chips}, {HotDogs, Coke, Chips}.
SOLUTION
Association rules:

With the confidence threshold set to 60%, the Strong Association Rules are (sorted by confidence):
APRIORI ADVANTAGES/DISADVANTAGES
 Advantages
 Uses large itemset property
 Easily parallelized
 Easy to implement
 Disadvantages
 Assumes transaction database is memory resident.
 Requires many database scans.
