AssociationRule and Apriori
Frequent pattern:
An intrinsic and important property of datasets.
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
WHAT IS ASSOCIATION MINING?
Proposed by Agrawal et al. in 1993.
Association rule mining is a procedure for finding frequent patterns, correlations, associations, or causal
structures in data sets held in relational databases, transactional databases, and other data
repositories.
It is an important data mining model studied extensively by the database and data
mining communities.
APPLICATION OF ASSOCIATION
simple computations
can be undirected (no hypotheses are needed before analysis)
different data forms can be analyzed
ASSOCIATION RULES
Wal-Mart customers who purchase Barbie dolls have a 60%
likelihood of also purchasing one of three types of candy bars
[Forbes, Sept 8, 1997]
Customers who purchase maintenance agreements are very likely
to purchase large appliances (author experience)
When a new hardware store opens, one of the most commonly sold
items is toilet bowl cleaners (author experience)
So what…
WHAT IS ASSOCIATION RULE MINING?
Association Analysis is used for discovering interesting relationships
hidden in large data sets.
Initially used for Market Basket Analysis to find how items purchased
by customers are related.
CONTINUED…
Finding frequent patterns, associations, correlations, or causal
structures among sets of items in transaction databases, relational
databases, and other information repositories.
{Diapers} => {Beer}
Given:
(1) a database of transactions,
(2) each transaction is a list of items purchased by a customer in a visit.
Find: all rules that correlate the presence of one set of items (itemset) with that
of another set of items.
E.g., 98% of people who purchase tires and auto accessories also get
automotive services done.
THE MODEL: RULES
A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
Simple patterns:
1. OJ and soda are more likely purchased together than
any other two items
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support({8,12}) = 2 (or 50%: 2 of 4 customers)
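The support count above can be checked with a few lines of Python. This is only a minimal sketch: the name `support_count` and the dictionary layout are illustrative assumptions, not part of the original slides.

```python
# The four example transactions, keyed by customer number.
transactions = {
    1: {1, 3, 5},
    2: {1, 8, 14, 17, 12},
    3: {4, 6, 8, 12, 9, 104},
    4: {2, 1, 8},
}

def support_count(itemset, transactions):
    """Count the transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    # `itemset <= items` is the subset test X ⊆ t from the model.
    return sum(1 for items in transactions.values() if itemset <= items)

count = support_count({8, 12}, transactions)
relative = count / len(transactions)
print(count, relative)  # 2 0.5
```

Only customers 2 and 3 bought both items 8 and 12, which gives the count of 2 and the relative support of 50%.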
1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Conf( {5} => {8} ) ?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
then conf( {5} => {8} ) = supp({5,8}) / supp({5}) = 4/5 = 0.8 or 80%
EXAMPLE
Using the same ten transactions:
Conf( {5} => {8} ) ? 80%. Done.
Conf( {8} => {5} ) ?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
then conf( {8} => {5} ) = supp({5,8}) / supp({8}) = 4/7 = 0.57 or 57%
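The two confidence values differ because confidence divides by the support of the antecedent, so the direction of the rule matters. A small Python sketch of the calculation follows; the helper names `supp` and `conf` are chosen here for illustration.

```python
# The ten example transactions from the slides.
transactions = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def supp(itemset):
    """Support count: number of transactions containing all of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def conf(antecedent, consequent):
    """Confidence of antecedent => consequent: supp(A ∪ C) / supp(A)."""
    return supp(set(antecedent) | set(consequent)) / supp(antecedent)

print(conf({5}, {8}))            # 0.8
print(round(conf({8}, {5}), 2))  # 0.57
```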
EXAMPLE
Example: Database with transactions ( customer_# : item_a1,
item_a2, … )
L1 = {frequent items};
for (k = 1; Lk != ∅; k++)
    Ck+1 = GenerateCandidates(Lk)
    for each transaction t in database do
        increment count of candidates in Ck+1 that are contained in t
    endfor
    Lk+1 = candidates in Ck+1 with support ≥ min_sup
endfor
return ∪k Lk;
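The pseudocode above might be turned into runnable Python roughly as follows. This is a sketch under assumptions, not a reference implementation: `generate_candidates` uses a simple join-and-prune step, `min_sup` is taken as an absolute count, and the sample data is the four-transaction database D from this document with an assumed `min_sup` of 2.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-itemsets and prune every
    candidate that has a k-subset which is not itself frequent."""
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent_k for s in combinations(union, k)
            ):
                candidates.add(union)
    return candidates

def apriori(transactions, min_sup):
    """Return {itemset: support count} for all itemsets with count >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(current)
    k = 1
    while current:  # loop until Lk is empty
        candidates = generate_candidates(set(current), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:  # one database scan per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(current)  # accumulate the union of all Lk
        k += 1
    return result

# Database D from this document; min_sup = 2 is an assumption.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_sup=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
```

Each pass of the `while` loop rescans every transaction, which is the repeated database scanning the algorithm is known for.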
FREQUENT ITEMSET PROPERTY
GENERATE CANDIDATES
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
CONTINUE…
Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5
Scan D -> C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3
C3
itemset
{2 3 5}
Scan D -> L3
itemset   sup
{2 3 5}   2
CONTINUE…
Association rules from the frequent itemset {2,3,5}:
Rule        Support   Confidence    Confidence (%)
2 -> 3^5    2         2/3 = 0.67    66.7%
With the confidence threshold set to 60%, the Strong Association Rules are (sorted by confidence):
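One way to enumerate candidate rules from a frequent itemset and filter them by confidence is sketched below. It assumes the database D and the 60% threshold from this example; the function name `rules_from` and the output format are illustrative choices, not taken from the slides.

```python
from itertools import combinations

# Database D from this example, keyed by TID.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}

def supp(itemset):
    """Support count of `itemset` in D."""
    return sum(1 for items in D.values() if set(itemset) <= items)

def rules_from(itemset, min_conf):
    """List rules A -> (itemset - A) whose confidence is at least min_conf,
    sorted by confidence in descending order."""
    itemset = frozenset(itemset)
    rules = []
    # Every non-empty proper subset can serve as the antecedent.
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            confidence = supp(itemset) / supp(antecedent)
            if confidence >= min_conf:
                rules.append((sorted(antecedent), sorted(consequent), confidence))
    return sorted(rules, key=lambda rule: rule[2], reverse=True)

for antecedent, consequent, confidence in rules_from({2, 3, 5}, min_conf=0.6):
    print(antecedent, "->", consequent, f"{confidence:.1%}")
```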
APRIORI ADVANTAGES/DISADVANTAGES
Advantages
Uses large itemset property
Easily parallelized
Easy to implement
Disadvantages
Assumes transaction database is memory resident.
Requires many database scans.