Data Mining
Market Basket Analysis

Applications:
* Maintenance Agreement: what should the store do to boost sales of maintenance agreements?
* Home Electronics: what other products should the store stock up on?
Rule Measures: Support and Confidence

Find all rules X & Y => Z with minimum support and confidence:
* support, s: the probability that a transaction contains {X & Y & Z}
* confidence, c: the conditional probability that a transaction containing {X & Y} also contains Z

(Figure: Venn diagram of customers who buy diapers and the customers who buy both items of the rule.)
Definition: Frequent Itemset

Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

Example transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Definition: Association Rule

Association Rule
– An implication expression of the form X => Y, where X and Y are itemsets
– Example: {Milk, Diaper} => {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} => {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
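The support and confidence computation above can be checked with a short sketch in plain Python (the helper name `support_count` is my own):

```python
# Computing s and c for the rule {Milk, Diaper} => {Beer} on the
# 5-transaction example dataset from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
print(round(s, 2), round(c, 2))  # 0.4 0.67
```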
Association Rule Mining Task

Brute-force approach:
* List all possible association rules
* Compute the support and confidence for each rule
* Prune rules that fail the minsup and minconf thresholds
=> Computationally prohibitive!
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support >= minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

(Figure: the itemset lattice over items A–E, from the null set down to ABCDE; with d items there are 2^d possible candidate itemsets.)
Frequent Itemset Generation

Brute-force approach:
* Each itemset in the lattice is a candidate frequent itemset
* Count the support of each candidate by scanning the database: match each of the N transactions (of width up to w) against every one of the M candidates
* Complexity ~ O(NMw) => expensive, since M = 2^d !!!
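On the example data this brute force is still feasible, so it can be sketched directly (variable names are mine; minsup is a support count):

```python
# Brute-force frequent itemset generation: enumerate all 2^d - 1
# candidate itemsets over d items and count each one's support with
# a full scan of the N transactions (O(N*w) work per candidate).
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))   # d = 6 distinct items

support = {}
for k in range(1, len(items) + 1):           # M = 2^d - 1 candidates in total
    for cand in combinations(items, k):
        cand = frozenset(cand)
        support[cand] = sum(1 for t in transactions if cand <= t)

minsup_count = 3
frequent = {c for c, s in support.items() if s >= minsup_count}
print(len(support), len(frequent))  # 63 candidates, only 8 frequent
```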
Reducing the Number of Candidates (M = 2^d)

Apriori principle:
* If an itemset is frequent, then all of its subsets must also be frequent.
* Equivalently, if an itemset is infrequent, then all of its supersets must be infrequent too.

(Figure: itemset lattice over A–E; once an itemset, e.g. AB, is found to be infrequent, the entire sub-lattice of its supersets up to ABCDE is pruned.)
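The principle rests on support being anti-monotone: an itemset can never be more frequent than any of its subsets. A minimal sketch on the example data:

```python
# Anti-monotone property behind the Apriori principle: the support of
# an itemset never exceeds the support of any of its subsets.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(s):
    return sum(1 for t in transactions if set(s) <= t)

itemset = ("Bread", "Milk", "Diaper")
for k in range(1, len(itemset)):
    for sub in combinations(itemset, k):
        assert sup(sub) >= sup(itemset)   # every subset is at least as frequent
print(sup(itemset))  # 2
```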
Illustrating the Apriori Principle

Minimum support count = 3

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets): the only candidate whose 2-item subsets are all frequent is {Bread, Milk, Diaper}.
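The first two passes of this pruning can be sketched in plain Python (variable names are mine; minsup is a support count, as on the slide):

```python
# First two Apriori passes on the example data with minimum support
# count 3: count 1-itemsets, drop infrequent items (Coke, Eggs), and
# generate candidate pairs only from the surviving items.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup_count = 3

counts = {}
for t in transactions:
    for item in t:
        counts[item] = counts.get(item, 0) + 1
frequent_items = sorted(i for i, c in counts.items() if c >= minsup_count)

pair_counts = {p: 0 for p in combinations(frequent_items, 2)}
for t in transactions:
    for p in pair_counts:
        if set(p) <= t:
            pair_counts[p] += 1
frequent_pairs = [p for p, c in pair_counts.items() if c >= minsup_count]

print(frequent_items)   # ['Beer', 'Bread', 'Diaper', 'Milk']
print(frequent_pairs)   # 4 frequent pairs; Coke and Eggs never considered
```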
The Apriori Algorithm — Example

Minimum support count = 2

(Figure: full level-by-level Apriori trace on the original slides.)

Example of Generating Candidates

Candidates C_k are generated from the frequent (k-1)-itemsets L_{k-1}: join pairs of itemsets that share their first k-2 items, then prune any candidate that has an infrequent (k-1)-subset.
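The join and prune steps can be sketched as follows, using the common textbook example L3 = {abc, abd, acd, ace, bcd} (an assumption here, since the slide's worked figure did not survive extraction; the function name `apriori_gen` is mine):

```python
# Apriori candidate generation: join frequent (k-1)-itemsets sharing
# their first k-2 items, then prune candidates with an infrequent
# (k-1)-subset.
from itertools import combinations

def apriori_gen(freq_km1):
    freq = sorted(tuple(sorted(s)) for s in freq_km1)
    freq_set = set(freq)
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] == b[:-1]:                      # join step: same prefix
                cand = a + (b[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(tuple(sorted(sub)) in freq_set
                       for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))  # [('a', 'b', 'c', 'd')] -- acde is pruned (ade not in L3)
```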
Reducing the Number of Comparisons

Candidate counting:
* Scan the database of transactions to determine the support of each candidate itemset
* To reduce the number of comparisons, store the candidates in a hash structure
* Instead of matching each transaction against every candidate, match it only against candidates contained in the hashed buckets

(Figure: the N transactions are hashed into the buckets of a candidate hash structure; each transaction is compared only with the candidates in the buckets it reaches.)
Association Rule Discovery: Hash Tree

(Figures: building a candidate hash tree for 15 candidate 3-itemsets. At each node the hash function sends items 1, 4, 7 to the left child, items 2, 5, 8 to the middle child, and items 3, 6, 9 to the right child; the leaves hold the candidates {1,4,5}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}, {1,5,9}, {1,3,6}, {2,3,4}, {5,6,7}, {3,4,5}, {3,5,6}, {3,6,7}, {3,5,7}, {6,8,9}, {3,6,8}.)
Subset Operation

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

(Figure: level-by-level enumeration. Level 1 fixes the first item at 1, 2, or 3; Level 2 fixes the second item; the result is the C(5,3) = 10 subsets 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.)
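The enumeration above is exactly what a combinations generator produces:

```python
# Enumerating all size-3 subsets of the transaction t = {1, 2, 3, 5, 6},
# matching the slide's level-by-level expansion.
from itertools import combinations

t = [1, 2, 3, 5, 6]
subsets = list(combinations(t, 3))
print(len(subsets))  # C(5,3) = 10
```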
Subset Operation Using Hash Tree

(Figures: matching transaction t = {1, 2, 3, 5, 6} against the candidate hash tree. The transaction is expanded prefix by prefix: 1+ 2356, 2+ 356, 3+ 56 at the root, then 12+ 356, 13+ 56, 15+ 6 at the next level, and so on; only the leaves reachable from these prefixes are visited.)

With the hash tree, the transaction is matched against only 11 of the 15 candidates.
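A full multi-level hash tree is beyond a short sketch, but the idea can be shown with a one-level stand-in: bucket candidates by hashing their first item, then compare a transaction only against reachable buckets. This is my own simplification (bucket layout and the shorter transaction {3, 5, 6}, chosen so the pruning is actually visible, are assumptions, not the slide's exact trace):

```python
# One-level bucket matching: candidates are bucketed by hashing their
# first item with h(i) = (i - 1) % 3, i.e. {1,4,7} -> 0, {2,5,8} -> 1,
# {3,6,9} -> 2. A transaction is compared only against buckets that a
# possible first item of one of its 3-subsets hashes to.
candidates = [(2,3,4), (5,6,7), (1,4,5), (1,3,6), (3,4,5), (3,5,6), (3,6,7),
              (3,5,7), (3,6,8), (1,2,4), (1,5,9), (6,8,9), (1,2,5), (4,5,7), (4,5,8)]

def h(item):
    return (item - 1) % 3

buckets = {0: [], 1: [], 2: []}
for cand in candidates:
    buckets[h(cand[0])].append(cand)

t = (3, 5, 6)
compared, matched = 0, []
for b in {h(t[i]) for i in range(len(t) - 2)}:  # possible first items of a 3-subset
    for cand in buckets[b]:
        compared += 1                           # one candidate comparison
        if set(cand) <= set(t):
            matched.append(cand)

print(compared, matched)  # 6 comparisons instead of 15
```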
Methods to Improve Apriori’s Efficiency
Association Rules

An association rule has the form Itemset1 => Itemset2, where both itemsets are non-empty.
Meaning: if a transaction includes Itemset1, it also includes Itemset2 (with the rule's confidence).
Example: A => B, C
From Frequent Itemsets to Association Rules

For a rule R: I => J generated from a frequent itemset:
* sup(R) = sup(I ∪ J), the support of the itemset I ∪ J
* conf(R) = sup(I ∪ J) / sup(I) is the confidence of R: the fraction of transactions containing I that also contain J
Association Rules Example
Find Strong Association Rules

Problem: find all association rules with given minsup and minconf.
* First, find all frequent itemsets, e.g. with the Apriori algorithm.
* Then, for each frequent itemset I, check each non-empty proper subset A of I: output the rule A => I - A if its confidence sup(I) / sup(A) >= minconf.
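The second step above can be sketched as follows on the example transactions (function names and the minconf value are mine; {Milk, Diaper, Beer} is frequent at a minimum support count of 2):

```python
# Generating strong rules A => I \ A from a frequent itemset I, keeping
# only rules with confidence >= minconf.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from(I, minconf):
    I = frozenset(I)
    out = []
    for r in range(1, len(I)):                  # every non-empty proper subset A
        for A in map(frozenset, combinations(I, r)):
            conf = sup(I) / sup(A)              # conf(A => I\A) = sup(I) / sup(A)
            if conf >= minconf:
                out.append((sorted(A), sorted(I - A), round(conf, 2)))
    return out

print(rules_from({"Milk", "Diaper", "Beer"}, minconf=0.6))
```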
Weather Data: Play or not Play?

(Table: the weather dataset, shown on the original slides.)
Rules for the Weather Data

(Figure: association rules mined from the weather data, shown on the original slides.)
Rule Generation

If {A, B, C, D} is a frequent itemset, the candidate rules are:
  ABC => D,  ABD => C,  ACD => B,  BCD => A,
  A => BCD,  B => ACD,  C => ABD,  D => ABC,
  AB => CD,  AC => BD,  AD => BC,  BC => AD,
  BD => AC,  CD => AB
In general, a frequent k-itemset yields 2^k - 2 candidate rules (ignoring the rules with an empty antecedent or consequent).
Rule Generation for the Apriori Algorithm

(Figure: lattice of rules for the frequent itemset ABCD, from ABCD => {} at the top, through BCD => A, ACD => B, ABD => C, ABC => D and the two-item consequents CD => AB, BD => AC, BC => AD, AD => BC, AC => BD, AB => CD, down to D => ABC, C => ABD, B => ACD, A => BCD. If BCD => A is found to be a low-confidence rule, then every rule with A in the consequent, namely CD => AB, BD => AC, BC => AD, D => ABC, C => ABD, and B => ACD, is pruned.)

Candidate rules are generated by merging two rules that share the same prefix in the rule consequent:
  join(CD => AB, BD => AC) would produce the candidate rule D => ABC.
  Prune D => ABC if its subset rule AD => BC does not have high confidence.
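The pruning works because, for rules generated from the same itemset, moving items from the antecedent to the consequent can never increase confidence: conf(X => L - X) = sup(L) / sup(X), and sup(X) only grows as X grows. A minimal check on the example data (the chosen itemset and antecedent chain are my own illustration):

```python
# Confidence is anti-monotone in the consequent for rules from the same
# itemset L: conf(X => L-X) = sup(L) / sup(X), and shrinking the
# antecedent X can only shrink the confidence.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sup(s):
    return sum(1 for t in transactions if s <= t)

L = {"Milk", "Diaper", "Beer"}
chain = [{"Milk", "Diaper"}, {"Milk"}]          # shrinking antecedents
confs = [sup(L) / sup(X) for X in chain]        # {Milk,Diaper}=>{Beer}, {Milk}=>{Diaper,Beer}
print(confs)
assert confs == sorted(confs, reverse=True)     # never increases
```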