Mining Association Rules

Data Mining 1

What Is Association Mining?

 Association rule mining:


 Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.
 Applications:
 Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, etc.
 Examples.
 Rule form: “Body ⇒ Head [support, confidence]”
 buys(x, “diapers”) ⇒ buys(x, “7-up”) [0.5%, 60%]
 major(x, “CS”) ∧ takes(x, “DB”) ⇒ grade(x, “A”) [1%, 75%]

Data Mining 2

Market Basket Analysis

Typically, association rules are considered interesting if they satisfy
both a minimum support threshold and a minimum confidence threshold.

Data Mining 3

Association Rule: Basic Concepts

 Given: (1) database of transactions, (2) each transaction is


a list of items (purchased by a customer in a visit)

 Find: all rules that correlate the presence of one set of


items with that of another set of items
 E.g., 98% of people who purchase tires and auto
accessories also get automotive services done

 Applications
 *  Maintenance Agreement (What the store should
do to boost Maintenance Agreement sales)
 Home Electronics  * (What other products should
the store stocks up?)
Data Mining 4

Rule Measures: Support and Confidence

 Find all the rules X & Y ⇒ Z with minimum confidence and support
  support, s: probability that a transaction contains {X, Y, Z}
  confidence, c: conditional probability that a transaction
 having {X, Y} also contains Z

(Figure: Venn diagram of customers who buy beer, customers who buy
diapers, and customers who buy both)

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support = 50% and minimum confidence = 50%; then we have
 A ⇒ C (50%, 66.6%)
 C ⇒ A (50%, 100%)
Data Mining 5

Association Rule Mining

 Given a set of transactions, find rules that will predict the


occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
 {Diaper} ⇒ {Beer}
 {Milk, Bread} ⇒ {Eggs, Coke}
 {Beer, Bread} ⇒ {Milk}

Implication means co-occurrence, not causality!

Data Mining 6

Definition: Frequent Itemset

 Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
 k-itemset
  An itemset that contains k items
 Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
 Support (s)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
 Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Definition: Association Rule

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

 Association Rule
 – An implication expression of the form X ⇒ Y,
   where X and Y are itemsets
 – Example: {Milk, Diaper} ⇒ {Beer}

 Rule Evaluation Metrics
 – Support (s)
    Fraction of transactions that contain both X and Y
 – Confidence (c)
    Measures how often items in Y appear in transactions
   that contain X

Example: {Milk, Diaper} ⇒ {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Data Mining 8
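
The two metrics above can be computed directly from the transaction table. A minimal sketch in Python, using the running example from this slide (the dataset and names are only illustrative):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(itemset): number of transactions containing every item in itemset
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)               # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ~ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")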

Association Rule Mining Task

 Given a set of transactions T, the goal of


association rule mining is to find all rules having
 support ≥ minsup threshold

 confidence ≥ minconf threshold

 Brute-force approach:
 List all possible association rules

 Compute the support and confidence for each rule

 Prune rules that fail the minsup and minconf

thresholds
 Computationally prohibitive!
Data Mining 9

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
 {Milk, Diaper} ⇒ {Beer}   (s=0.4, c=0.67)
 {Milk, Beer} ⇒ {Diaper}   (s=0.4, c=1.0)
 {Diaper, Beer} ⇒ {Milk}   (s=0.4, c=0.67)
 {Beer} ⇒ {Milk, Diaper}   (s=0.4, c=0.67)
 {Diaper} ⇒ {Milk, Beer}   (s=0.4, c=0.5)
 {Milk} ⇒ {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
  {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support
  but can have different confidence
• Thus, we may decouple the support and confidence requirements
Data Mining 10

Mining Association Rules

 Two-step approach:
1. Frequent Itemset Generation

– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary
partitioning of a frequent itemset

 Frequent itemset generation is still


computationally expensive
Data Mining 11

Frequent Itemset Generation

(Figure: itemset lattice over items A, B, C, D, E — from the null set at the
top, through all 1-, 2-, 3- and 4-itemsets, down to ABCDE at the bottom)

 Given d items, there are 2^d possible candidate itemsets
Data Mining 12
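
To make the 2^d blow-up concrete, here is a tiny illustrative sketch that enumerates every candidate itemset over a handful of items (itertools is standard-library Python):

from itertools import combinations

items = ["A", "B", "C", "D", "E"]

# Every non-empty subset of the item universe is a candidate itemset.
candidates = [frozenset(c)
              for k in range(1, len(items) + 1)
              for c in combinations(items, k)]

print(len(candidates))  # 2^5 - 1 = 31 candidates (2^d counting the empty set)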

Frequent Itemset Generation

 Brute-force approach:
 Each itemset in the lattice is a candidate frequent itemset
 Count the support of each candidate by scanning the database
(Figure: each of the N transactions is matched against each of the M
candidate itemsets; w is the maximum transaction width)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

 Match each transaction against every candidate
 Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
Data Mining 13

Frequent Itemset Generation Strategies

 Reduce the number of candidates (M)


 Complete search: M = 2^d

 Use pruning techniques to reduce M

 Reduce the number of transactions (N)


 Reduce size of N as the size of itemset increases

 Reduce the number of comparisons (NM)


 Use efficient data structures to store the candidates or
transactions
 No need to match every candidate against every
transaction
Data Mining 14

Reducing Number of Candidates (M = 2^d)

 Apriori principle:
 If an itemset is frequent, then all of its subsets

must also be frequent

 Apriori principle holds due to the following


property of the support measure:
∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
 Support of an itemset never exceeds the
support of its subsets
 This is known as the anti-monotone property of
support
Data Mining 15

Illustrating Apriori Principle


(Figure: the same itemset lattice over A–E; the itemset AB is found to be
infrequent, so all of its supersets — ABC, ABD, ABE, ABCD, ABCE, ABDE,
ABCDE — are pruned from the search space without being counted)

Data Mining 16

Illustrating Apriori Principle

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplet candidates are generated only from the frequent pairs:
{Bread, Milk}    3
{Bread, Diaper}  3
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                 Count
{Bread, Milk, Diaper}   2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
Data Mining 17
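
A small sketch of the pruning step this slide illustrates: count the 1-itemsets, keep only those meeting the minimum support count, and generate pair candidates only from the survivors (dataset and threshold are the slide's running example; names are illustrative):

from itertools import combinations
from collections import Counter

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup_count = 3

# Level 1: count single items and prune the infrequent ones (Coke, Eggs).
item_counts = Counter(item for t in transactions for item in t)
frequent_items = sorted(i for i, c in item_counts.items() if c >= minsup_count)

# Level 2: generate pair candidates only from the frequent items.
pair_candidates = [frozenset(p) for p in combinations(frequent_items, 2)]
pair_counts = {p: sum(1 for t in transactions if p <= t) for p in pair_candidates}
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= minsup_count}

print(frequent_items)        # ['Beer', 'Bread', 'Diaper', 'Milk']
print(len(pair_candidates))  # 6 candidate pairs instead of C(6, 2) = 15
print(frequent_pairs)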

The Apriori Algorithm

 Join Step: Ck is generated by joining Lk-1 with itself
 Prune Step: Any (k-1)-itemset that is not frequent cannot be
a subset of a frequent k-itemset
 Pseudo-code:
   Ck: candidate itemsets of size k
   Lk: frequent itemsets of size k
   L1 = {frequent items};
   for (k = 1; Lk != ∅; k++) do begin
       Ck+1 = candidates generated from Lk;
       for each transaction t in database do
           increment the count of all candidates in Ck+1
           that are contained in t
       Lk+1 = candidates in Ck+1 with min_support
   end
   return ∪k Lk;
Data Mining 18
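
The pseudo-code above maps almost line for line onto a short Python sketch of the level-wise loop. This is a minimal, illustrative implementation (candidate generation here simply joins frequent k-itemsets that differ in one item and prunes on infrequent subsets; all names are assumptions, not library code):

from itertools import combinations

def apriori(transactions, minsup_count):
    # Return {frozenset: support_count} for all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    Lk = {s: c for s, c in counts.items() if c >= minsup_count}
    frequent = dict(Lk)

    k = 1
    while Lk:
        # Join step: merge frequent k-itemsets that differ in one item,
        # then prune candidates that have an infrequent k-subset.
        prev = list(Lk)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    candidates.add(union)

        # Count all surviving candidates with one pass over the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(Lk)
        k += 1

    return frequent

# Example: the market-basket data used throughout these slides, minsup count = 3.
data = [{"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, count in sorted(apriori(data, 3).items(), key=lambda x: (len(x[0]), -x[1])):
    print(set(itemset), count)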

The Apriori Algorithm — Example
Minimum Support = 2

Database D:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (generated from L1), scan D for counts:
itemset  sup.
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:
itemset  sup.
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (generated from L2), scan D:
itemset  sup.
{2 3 5}  2

L3:
itemset  sup.
{2 3 5}  2
Data Mining 19

How to Generate Candidates?

 Suppose the items in Lk-1 are listed in an order


 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck

Data Mining 20

Example of Generating Candidates

 L3={abc, abd, acd, ace, bcd}


 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}

Data Mining 21
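
The self-join and prune steps can be written directly against sorted item tuples. A minimal sketch (helper names are illustrative) that reproduces the L3 → C4 example above:

from itertools import combinations

def generate_candidates(Lk_minus_1):
    # Self-join frequent (k-1)-itemsets that share their first k-2 items, then prune.
    prev = sorted(Lk_minus_1)      # itemsets as sorted tuples, e.g. ('a', 'b', 'c')
    prev_set = set(prev)
    k_minus_1 = len(prev[0])
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:      # join step
                c = p + (q[-1],)
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(s in prev_set for s in combinations(c, k_minus_1)):
                    candidates.append(c)
    return candidates

L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
print(generate_candidates(L3))  # [('a', 'b', 'c', 'd')] — acde is pruned since ade is not in L3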

How to Count Supports of Candidates?

 Why is counting the supports of candidates a problem?
 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets
and counts
 Interior node contains a hash table
 Subset function: finds all the candidates contained in
a transaction

Data Mining 22

Reducing Number of Comparisons

 Candidate counting:
 Scan the database of transactions to determine the
support of each candidate itemset
 To reduce the number of comparisons, store the
candidates in a hash structure
 Instead of matching each transaction against every candidate,
match it against candidates contained in the hashed buckets
(Figure: the N transactions on the left are matched against a hash structure
of candidate itemsets on the right; each transaction is compared only
against the candidates in the buckets it hashes to)
Data Mining 23

Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:


{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7},
{3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)

(Figure: hash tree built from the 15 candidate 3-itemsets, using the hash
function 1,4,7 → left branch, 2,5,8 → middle branch, 3,6,9 → right branch
at each level, with the candidates distributed across the leaf nodes)

Data Mining 24
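
A simplified sketch of the idea, assuming the slide's hash function and fixed-depth hashing on the i-th item at depth i (no leaf splitting, so this is not the full structure with a max leaf size; all names are illustrative):

from itertools import combinations
from collections import defaultdict

def bucket(item):
    # The slide's hash function: 1,4,7 -> 0, 2,5,8 -> 1, 3,6,9 -> 2.
    return (item - 1) % 3

def build_tree(candidates):
    # Fixed-depth hash tree: hash on the i-th item at depth i; leaves hold candidate lists.
    tree = defaultdict(list)                  # key: tuple of bucket indices, value: leaf list
    for c in candidates:
        tree[tuple(bucket(item) for item in c)].append(tuple(c))
    return tree

def count_supports(tree, transactions, k=3):
    # Count each candidate by probing only the leaf its k-subsets hash to.
    counts = defaultdict(int)
    for t in transactions:
        for subset in combinations(sorted(t), k):
            path = tuple(bucket(item) for item in subset)
            if subset in tree.get(path, []):  # compare only within one leaf
                counts[subset] += 1
    return counts

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
tree = build_tree(candidates)
print(count_supports(tree, [{1, 2, 3, 5, 6}]))  # {1,2,5}, {1,3,6}, {3,5,6} are each counted once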

Association Rule Discovery: Hash tree

(Figure: the candidate hash tree; hashing a transaction item in {1, 4, 7}
at the root directs the search into the left subtree)
Data Mining 25

Association Rule Discovery: Hash tree

(Figure: the same candidate hash tree; hashing on an item in {2, 5, 8}
directs the search into the middle subtree)
Data Mining 26

Association Rule Discovery: Hash tree

(Figure: the same candidate hash tree; hashing on an item in {3, 6, 9}
directs the search into the right subtree)

Data Mining 27

Subset Operation

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets
of size 3?

(Figure: systematic enumeration — level 1 fixes the first item (1, 2, or 3),
level 2 fixes the second, and level 3 completes each 3-item subset:
1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6)
Data Mining 28
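
The enumeration the figure sketches is exactly what itertools.combinations produces for a sorted transaction (a tiny illustration):

from itertools import combinations

t = [1, 2, 3, 5, 6]                 # transaction, items kept in sorted order
subsets = list(combinations(t, 3))  # all 3-item subsets of the transaction
print(len(subsets))                 # C(5, 3) = 10 subsets
print(subsets[:4])                  # (1, 2, 3), (1, 2, 5), (1, 2, 6), (1, 3, 5)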

Subset Operation Using Hash Tree

(Figure: the transaction 1 2 3 5 6 is split at the root into the prefixes
1 + 2356, 2 + 356, and 3 + 56, and each prefix is hashed down the
candidate hash tree)
Data Mining 29

Subset Operation Using Hash Tree

(Figure: at the second level the prefixes expand to 12 + 356, 13 + 56,
and 15 + 6, each following the corresponding branch of the hash tree)
Data Mining 30

Subset Operation Using Hash Tree
(Figure: at the leaf level, the enumerated 3-subsets of the transaction —
1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6 — are
compared against the candidates stored in the leaves they reach)

The transaction is matched against 11 out of 15 candidates.
Data Mining 31

Factors Affecting Complexity

 Choice of minimum support threshold


 lowering support threshold results in more frequent itemsets
 this may increase number of candidates and max length of
frequent itemsets
 Dimensionality (number of items) of the data set
 more space is needed to store support count of each item
 if number of frequent items also increases, both computation and
I/O costs may also increase
 Size of database
 since Apriori makes multiple passes, run time of algorithm may
increase with number of transactions
 Average transaction width
 transaction width increases with denser data sets
 This may increase max length of frequent itemsets and traversals
of hash tree (number of subsets in a transaction increases with its
width)
Data Mining 32

Methods to Improve Apriori’s Efficiency

 Hash-based itemset counting: A k-itemset whose corresponding


hashing bucket count is below the threshold cannot be frequent
 Transaction reduction: A transaction that does not contain any
frequent k-itemset is useless in subsequent scans
 Partitioning: Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness
 Dynamic itemset counting: add new candidate itemsets only when
all of their subsets are estimated to be frequent

Data Mining 33

Association Rules

 Association rule R : Itemset1 => Itemset2


 Itemset1, 2 are disjoint and Itemset2 is

non-empty
 meaning: if a transaction includes Itemset1,
then it also includes Itemset2


 Examples
 A,B => E,C

 A => B,C

Data Mining 34

From Frequent Itemsets to Association Rules

 Q: Given frequent set {A,B,E}, what are


possible association rules?
 A => B, E
 A, B => E
 A, E => B
 B => A, E
 B, E => A
 E => A, B
 __ => A,B,E (empty rule), or true => A,B,E

Data Mining 35

Rule Support and Confidence

 Suppose R : I => J is an association rule
  sup(R) = sup(I ∪ J) is the support count of R
   i.e. the support count of the itemset I ∪ J
  conf(R) = sup(R) / sup(I) is the confidence of R
   i.e. the fraction of transactions containing I that also contain I ∪ J

 Association rules meeting minimum support and minimum
confidence are sometimes called “strong” rules

Data Mining 36

Association Rules Example

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

 Q: Given frequent set {A, B, E}, what association rules have
minsup = 2 and minconf = 50%?

A, B => E : conf = 2/4 = 50%

Data Mining 37

Association Rules Example

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

 Q: Given frequent set {A, B, E}, what association rules have
minsup = 2 and minconf = 50%?

Qualify:
A, B => E : conf = 2/4 = 50%
A, E => B : conf = 2/2 = 100%
B, E => A : conf = 2/2 = 100%
E => A, B : conf = 2/2 = 100%

Don’t qualify:
A => B, E : conf = 2/6 = 33% < 50%
B => A, E : conf = 2/7 ≈ 29% < 50%
__ => A, B, E : conf = 2/9 = 22% < 50%

Data Mining 38
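
A small sketch that reproduces this check: enumerate every non-empty consequent of the frequent set {A, B, E} and compute each rule's confidence against the 9-transaction table above (names are illustrative):

from itertools import combinations

transactions = [{"A","B","E"}, {"B","D"}, {"B","C"}, {"A","B","D"}, {"A","C"},
                {"B","C"}, {"A","C"}, {"A","B","C","E"}, {"A","B","C"}]

def sup(itemset):
    return sum(1 for t in transactions if itemset <= t)

freq = frozenset({"A", "B", "E"})
minconf = 0.5

for r in range(1, len(freq) + 1):          # size of the consequent J
    for J in map(frozenset, combinations(sorted(freq), r)):
        I = freq - J                       # antecedent; empty I gives the "true =>" rule
        conf = sup(freq) / sup(I)          # sup(empty set) = all 9 transactions
        status = "ok" if conf >= minconf else "fails minconf"
        print(f"{set(I) or '{}'} => {set(J)} : conf = {conf:.2f} ({status})")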

Find Strong Association Rules

 A rule has the parameters minsup and minconf:


 sup(R) >= minsup and conf (R) >= minconf

 Problem:
 Find all association rules with given minsup and

minconf
 First, find all frequent itemsets

Data Mining 39

Generating Association Rules

 Two stage process:


 Determine frequent itemsets e.g. with the

Apriori algorithm.
 For each frequent item set I

 for each subset J of I


 determine all association rules of the form: I-J => J
 Main idea used in both stages : subset property

Data Mining 40

Weather Data: Play or not Play?

Outlook Temperature Humidity Windy Play?


sunny hot high false No
sunny hot high true No
overcast hot high false Yes
rain mild high false Yes
rain cool normal false Yes
rain cool normal true No
overcast cool normal true Yes
sunny mild high false No
sunny cool normal false Yes
rain mild normal false Yes
sunny mild normal true Yes
overcast mild high true Yes
overcast hot normal false Yes
rain mild high true No

Data Mining 41

Example: Generating Rules from an Itemset

 Frequent itemset from the golf (weather) data:
Humidity = Normal, Windy = False, Play = Yes (support count 4)

 Seven potential rules:


If Humidity = Normal and Windy = False then Play = Yes 4/4
If Humidity = Normal and Play = Yes then Windy = False 4/6
If Windy = False and Play = Yes then Humidity = Normal 4/6
If Humidity = Normal then Windy = False and Play = Yes 4/7
If Windy = False then Humidity = Normal and Play = Yes 4/8
If Play = Yes then Humidity = Normal and Windy = False 4/9
If True then Humidity = Normal and Windy = False and Play = Yes 4/14

Data Mining 42
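
These seven confidences can be checked mechanically: for each proper subset of the itemset used as antecedent, the confidence is the itemset's count divided by the antecedent's count. A small sketch over the weather table, with attribute-value pairs encoded as tuples (encoding and names are assumptions for illustration):

from itertools import combinations

# The 14-row weather data as (Outlook, Temperature, Humidity, Windy, Play).
rows = [("sunny","hot","high",False,"No"), ("sunny","hot","high",True,"No"),
        ("overcast","hot","high",False,"Yes"), ("rain","mild","high",False,"Yes"),
        ("rain","cool","normal",False,"Yes"), ("rain","cool","normal",True,"No"),
        ("overcast","cool","normal",True,"Yes"), ("sunny","mild","high",False,"No"),
        ("sunny","cool","normal",False,"Yes"), ("rain","mild","normal",False,"Yes"),
        ("sunny","mild","normal",True,"Yes"), ("overcast","mild","high",True,"Yes"),
        ("overcast","hot","normal",False,"Yes"), ("rain","mild","high",True,"No")]
attrs = ["Outlook", "Temperature", "Humidity", "Windy", "Play"]
transactions = [frozenset(zip(attrs, r)) for r in rows]

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = frozenset([("Humidity", "normal"), ("Windy", False), ("Play", "Yes")])
total = count(itemset)                      # 4
for r in range(len(itemset)):               # antecedent sizes 0, 1, 2 -> seven rules
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        denom = count(antecedent)           # the empty antecedent gives the "If True" rule, 4/14
        print(f"If {dict(antecedent) or 'True'} then {dict(consequent)}: {total}/{denom}")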

Rules for the weather data

 Rules with support > 1 and confidence = 100%:


    Association rule                                   Sup.  Conf.
1   Humidity=Normal, Windy=False ⇒ Play=Yes             4    100%
2   Temperature=Cool ⇒ Humidity=Normal                  4    100%
3   Outlook=Overcast ⇒ Play=Yes                         4    100%
4   Temperature=Cool, Play=Yes ⇒ Humidity=Normal        3    100%
... ...                                                ...    ...
58  Outlook=Sunny, Temperature=Hot ⇒ Humidity=High      2    100%

 In total: 3 rules with support four, 5 with support three,


and 50 with support two

Data Mining 43

Filtering Association Rules

 Problem: any large dataset can lead to a very large
number of association rules, even with reasonable
minimum confidence and support thresholds
 Confidence by itself is not sufficient
 e.g. if all transactions include Z, then

 any rule I => Z will have confidence 100%.

 Other measures to filter rules

Data Mining 44

Rule Generation

 Given a frequent itemset L, find all non-empty subsets f ⊂ L
such that f ⇒ L – f satisfies the minimum confidence requirement

 If {A, B, C, D} is a frequent itemset, candidate rules:
ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A,
A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD,
BD ⇒ AC, CD ⇒ AB

 If |L| = k, then there are 2^k – 2 candidate association rules
(ignoring L ⇒ ∅ and ∅ ⇒ L)
Data Mining 45
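
The 2^k – 2 count is easy to verify by enumeration (a tiny illustration):

from itertools import combinations

L = {"A", "B", "C", "D"}
rules = [(frozenset(f), frozenset(L - set(f)))
         for r in range(1, len(L))        # proper, non-empty antecedents only
         for f in combinations(sorted(L), r)]
print(len(rules))                         # 2**4 - 2 = 14 candidate rules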

Rule Generation

 How can we efficiently generate rules from frequent itemsets?
 In general, confidence does not have an anti-monotone property:
c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)

 But the confidence of rules generated from the same itemset
does have an anti-monotone property
 e.g., for L = {A, B, C, D}:
c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
 Confidence is anti-monotone w.r.t. the number of items on the
RHS of the rule (the numerator is fixed while the denominator increases)
Data Mining 46

Rule Generation for Apriori Algorithm
Lattice of rules

(Figure: lattice of all rules generated from the frequent itemset
{A, B, C, D}, from ABCD ⇒ {} at the top, through BCD ⇒ A, ..., ABC ⇒ D,
then CD ⇒ AB, ..., AB ⇒ CD, down to D ⇒ ABC, ..., A ⇒ BCD at the bottom;
once a rule such as BCD ⇒ A is found to have low confidence, every rule
below it whose consequent contains A is pruned)
Data Mining 47

Rule Generation for Apriori Algorithm

 A candidate rule is generated by merging two rules that share
the same prefix in the rule consequent
  e.g. join(CD ⇒ AB, BD ⇒ AC) would produce the candidate
 rule D ⇒ ABC

 Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have
high confidence
Data Mining 48
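
A minimal sketch of this merge step, representing a rule as (antecedent, consequent) tuples of sorted items (function and variable names are illustrative):

def join_rules(rule1, rule2):
    # Merge two rules from the same itemset whose consequents share a prefix,
    # e.g. join(CD => AB, BD => AC) gives D => ABC.
    ant1, con1 = rule1
    ant2, con2 = rule2
    if con1[:-1] != con2[:-1]:      # consequents must share their prefix
        return None
    new_con = tuple(sorted(set(con1) | set(con2)))
    new_ant = tuple(sorted((set(ant1) | set(ant2)) - set(new_con)))
    return (new_ant, new_con) if new_ant else None

r1 = (("C", "D"), ("A", "B"))       # CD => AB
r2 = (("B", "D"), ("A", "C"))       # BD => AC
print(join_rules(r1, r2))           # (('D',), ('A', 'B', 'C'))  i.e. D => ABC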
