
Cosc 6211 Advanced Concepts in Data Mining

1.2 Association Rule Mining

Abdulfetah Abdulahi A.

Based on slides by Jiawei Han and Jian Pei


What Is Association Mining?
• Association rule mining:
– Finding frequent patterns, associations, correlations, or causal structures among sets of items
or objects in transaction databases, relational databases, and other information repositories.
• Frequent pattern mining: finding regularities in data
– What products were often purchased together? — Milk and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
• Applications:
– Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering,
classification, etc.
• Examples:
– Rule form: "Body → Head [support, confidence]"
– buys(x, "tshirt") → buys(x, "tie") [0.5%, 60%]
– major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]

2
Sales market-basket records:

  Transaction id   Customer id   Products bought
  T1               C33           p2, p5, p8
  T2               C45           p5, p8, p11
  T3               C12           p1, p9
  T4               C14           p5, p8, p11
  T5               C12           p2, p9
  T6               C12           p9

• Trend: products p5 and p8 are often bought together

• Trend: customer C12 likes product p9

3
Basic Concepts: Frequent Patterns and
Association Rules

• Itemset X = {x1, …, xk}

  Transaction-id   Items bought
  10               A, B, C
  20               A, C
  30               A, D
  40               B, E, F

• Find all the rules X → Y with minimum confidence and support
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers who buy milk, who buy diapers, and who buy both]

Let min_support = 50%, min_conf = 50%:
  A → C (50%, 66.7%)
  C → A (50%, 100%)
4
Association Rule: Basic Concepts

• Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * → Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  – Attached mailing in direct marketing
  – Detecting faulty "collisions"

5
Definition: Frequent Itemset
  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Soda, Eggs
  3     Milk, Diaper, Soda, Coke
  4     Bread, Milk, Diaper, Soda
  5     Bread, Milk, Diaper, Coke

• Itemset
  – A collection of one or more items
    • Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold

6
Definition: Association Rule
  TID   Items
  1     Bread, Milk
  2     Bread, Diaper, Soda, Eggs
  3     Milk, Diaper, Soda, Coke
  4     Bread, Milk, Diaper, Soda
  5     Bread, Milk, Diaper, Coke

• Association rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Soda}
• Rule evaluation metrics
  – Support (s): expressed either in count form or percentage form
    • For {Milk, Diaper} → {Soda}: σ({Milk, Diaper, Soda}) = 2; as a fraction, s = 2/5
  – Confidence (c): in general, for a rule LHS → RHS,
    • c = P(RHS | LHS) = σ(LHS ∪ RHS) / σ(LHS)
    • Here c = σ({Milk, Diaper, Soda}) / σ({Milk, Diaper}) = 2/3

7
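These two metrics are easy to compute by a single scan over the transactions. Below is a minimal Python sketch of that computation on the table above; the function name and layout are illustrative, not from the slides.

    # Support and confidence of {Milk, Diaper} -> {Soda} on the table above
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Soda", "Eggs"},
        {"Milk", "Diaper", "Soda", "Coke"},
        {"Bread", "Milk", "Diaper", "Soda"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(X): number of transactions containing every item of X."""
        return sum(1 for t in transactions if itemset <= t)

    lhs, rhs = {"Milk", "Diaper"}, {"Soda"}
    s = support_count(lhs | rhs, transactions) / len(transactions)               # 2/5
    c = support_count(lhs | rhs, transactions) / support_count(lhs, transactions)  # 2/3
    print(f"support = {s:.2f}, confidence = {c:.2f}")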
Exercise
  Outlook    Temperature   Humidity   Play
  sunny      hot           high       no
  sunny      hot           high       no
  overcast   hot           high       yes
  rainy      mild          high       yes
  rainy      cool          normal     yes
  rainy      cool          normal     no
  overcast   cool          normal     yes
  sunny      mild          high       no
  sunny      cool          normal     yes
  rainy      mild          normal     yes
  sunny      mild          normal     yes
  overcast   mild          high       yes
  overcast   hot           normal     yes
  rainy      mild          high       no

• Minimum support = 2
  – {sunny, hot, no}
  – {sunny, hot, high, no}
  – {rainy, normal}
• Minimum support = 3
  – ?
• How strong is {sunny, no}?
  – Count =
  – Percentage =
• What is the confidence of Outlook=sunny → Play=no?

8
Types of Association Rules
• Quantitative
  – age(X, "30…39") ∧ income(X, "42K…48K") → buys(X, "TV")
• Single vs. multiple dimensions
  – Single: buys(X, "computer") → buys(X, "financial software")
  – Multi: the quantitative example above
• Levels of abstraction
  – age(X, …) → buys(X, "laptop computer")
  – age(X, …) → buys(X, "computer")
• Extensions
  – Correlation, causality analysis
    • Association does not necessarily imply correlation or causality
  – Max-patterns (a frequent pattern such that no proper super-pattern is frequent) and closed itemsets (an itemset c with no proper superset c′ such that every transaction containing c also contains c′)

9
Closed and max Patterns/itemset
• Closed itemset: no proper superset has the same support as the itemset.
• Maximal itemset: the itemset is frequent, and all its proper supersets are infrequent.

Example (Min_sup = 2):
  Tid   Items
  10    A, B, C, D, E
  20    B, C, D, E
  30    A, C, D, F

• {C} is closed: the supports of its supersets, e.g. {A,C}:2, {B,C}:2, {C,D}:1, {C,E}:2, are all different from σ({C}) = 3. The same holds for {A,C}, {B,E} and {B,C,E}.
• {A,C} is maximal: all its supersets {A,B,C}, {A,C,D}, {A,C,E} are infrequent. The same holds for {B,C,E}.

10
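To make the two definitions concrete, here is a small illustrative Python sketch (names are mine, not from the slides): given a table of itemset supports, an itemset is closed if no proper superset has the same support, and maximal if it is frequent and no proper superset is frequent. The counts reuse the table from the "Example max and closed itemsets" slide a few pages ahead.

    def is_closed(x, supports):
        """No proper superset of x has the same support."""
        return not any(x < y and supports[y] == supports[x] for y in supports)

    def is_maximal(x, supports, min_sup):
        """x is frequent and no proper superset of x is frequent."""
        return supports[x] >= min_sup and not any(
            x < y and supports[y] >= min_sup for y in supports)

    supports = {frozenset("A"): 4, frozenset("B"): 5, frozenset("C"): 3,
                frozenset("AB"): 4, frozenset("AC"): 2, frozenset("BC"): 3,
                frozenset("ABC"): 2}
    print(is_closed(frozenset("C"), supports))        # False: BC has the same count
    print(is_maximal(frozenset("BC"), supports, 3))   # True: ABC is infrequent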
Find frequent Max Patterns
  Outlook    Temperature   Humidity   Play
  sunny      hot           high       no
  sunny      hot           high       no
  overcast   hot           high       yes
  rainy      mild          high       yes
  rainy      cool          normal     yes
  rainy      cool          normal     no
  overcast   cool          normal     yes
  sunny      mild          high       no
  sunny      cool          normal     yes
  rainy      mild          normal     yes
  sunny      mild          normal     yes
  overcast   mild          high       yes
  overcast   hot           normal     yes
  rainy      mild          high       no

• Minimum support = 2
  – {sunny, hot, no} ??

11
Find closed Patterns/itemsets

Minsup = 2

  TID   Items
  10    a, b, c
  20    a, b, c
  30    a, b, d
  40    a, b, d
  50    c, e, f

Frequent itemsets?
Closed itemsets?

12
Example max and closed itemsets
  Itemset   Support   Maximal (s = 3)   Closed
  A         4         No                No
  B         5         No                Yes
  C         3         No                No
  AB        4         Yes               Yes
  AC        2         No                No
  BC        3         Yes               Yes
  ABC       2         No                Yes

• C is frequent, but its superset BC is also frequent (so C is not maximal), and BC has the same count (so C is not closed).
• BC is frequent, and its only superset, ABC, is not frequent (so BC is maximal); ABC also has a smaller count (so BC is closed).

13
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

14
Two methods:
Apriori and FP-growth

15
Mining Association Rules—an Example

Transaction-id Items bought Min. support 50%

10 A, B, C Min. confidence 50%

20 A, C
Frequent pattern Support
30 A, D
{A} 75%
40 B, E, F
{B} 50%
{C} 50%
For rule A  C: {A, C} 50%

support = support({A}{C}) = 50%


confidence = support({A}{C})/support({A}) = 66.6%

16
Method 1:
Apriori: A Candidate Generation-and-test Approach

• Any subset of a frequent itemset must be frequent
  – If {milk, diaper, nuts} is frequent, so is {milk, diaper}
  – Every transaction containing {milk, diaper, nuts} also contains {milk, diaper}
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
• Method:
– generate length (k+1) candidate itemsets from length k frequent
itemsets, and
– test the candidates against DB
• Performance studies show its efficiency and scalability
• Agrawal & Srikant 1994, Mannila, et al. 1994

17
The Apriori Algorithm — An Example
minSup = 0.5

Database TDB:
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

1st scan → C1:
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {D}       1
  {E}       3

L1:
  Itemset   sup
  {A}       2
  {B}       3
  {C}       3
  {E}       3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
  Itemset   sup
  {A, B}    1
  {A, C}    2
  {A, E}    1
  {B, C}    2
  {B, E}    3
  {C, E}    2

L2:
  Itemset   sup
  {A, C}    2
  {B, C}    2
  {B, E}    3
  {C, E}    2

C3: {B, C, E}

3rd scan → L3:
  Itemset     sup
  {B, C, E}   2

18
The Apriori Algorithm
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
19
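A compact, runnable Python rendering of this pseudo-code (a sketch, not Han & Pei's original implementation; candidate generation follows the self-join and prune steps detailed on the next two slides):

    from itertools import combinations

    def generate_candidates(freq_k, k):
        """C(k+1): join pairs of frequent k-itemsets, then prune any
        candidate that has an infrequent k-subset (the Apriori property)."""
        freq_set = set(freq_k)
        cands = set()
        for a in freq_k:
            for b in freq_k:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in freq_set for s in combinations(u, k)):
                    cands.add(u)
        return cands

    def apriori(transactions, min_support):
        """Return {frozenset: support count} for all frequent itemsets."""
        items = {frozenset([i]) for t in transactions for i in t}
        freq = {x: c for x in items
                if (c := sum(1 for t in transactions if x <= t)) >= min_support}
        result, k = dict(freq), 1
        while freq:
            cands = generate_candidates(list(freq), k)
            freq = {x: c for x in cands
                    if (c := sum(1 for t in transactions if x <= t)) >= min_support}
            result.update(freq)
            k += 1
        return result

    # The TDB of the previous slide, min support 50% (count 2):
    tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for itemset, count in apriori(tdb, 2).items():
        print(sorted(itemset), count)

On this database the output reproduces L1, L2 and L3 from the example slide, including {B, C, E} with count 2.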
Important Details of Apriori

• How to generate candidates?

– Step 1: self-joining Lk

– Step 2: pruning

• How to count supports of candidates?


20
Example of Candidate-generation

• L3={abc, abd, acd, ace, bcd}

• Self-joining: L3*L3
– abcd from abc and abd
– acde from acd and ace

• Pruning:
– acde is removed because ade is not in L3

• C4={abcd}
21
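The same example, checked in a few lines of Python. This is a simplified sketch: it joins by taking every 4-item union of two L3 members rather than the usual shared-(k−1)-prefix join, then prunes; the resulting C4 is the same.

    from itertools import combinations

    L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}

    # Join: every union of two members of L3 that has exactly 4 items
    joined = {a | b for a in L3 for b in L3 if len(a | b) == 4}
    # Prune: keep a candidate only if all of its 3-subsets are in L3
    C4 = {c for c in joined if all(frozenset(s) in L3 for s in combinations(c, 3))}
    print(sorted("".join(sorted(c)) for c in C4))   # ['abcd']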
Derive rules from frequent itemsets

• Frequent itemsets != association rules


• One more step is required to find association rules
• For each frequent itemset X,
  for each proper nonempty subset A of X:
  – Let B = X − A
  – A → B is an association rule if confidence(A → B) ≥ minConf,
    where support(A → B) = support(A ∪ B), and
    confidence(A → B) = support(A ∪ B) / support(A)

22
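A Python sketch of this step (illustrative names; `supports` maps each frequent itemset to its support count):

    from itertools import combinations

    def rules_from_itemset(x, supports, min_conf):
        """All rules A -> B with A a proper nonempty subset of x, B = x - A,
        and confidence = support(x) / support(A) >= min_conf."""
        out = []
        for r in range(1, len(x)):
            for a in map(frozenset, combinations(x, r)):
                conf = supports[x] / supports[a]
                if conf >= min_conf:
                    out.append((set(a), set(x - a), conf))
        return out

    # Worked by hand on the next slide: X = {2,3,4} with the supports listed there
    supports = {frozenset({2, 3, 4}): 2, frozenset({2, 3}): 2, frozenset({2, 4}): 2,
                frozenset({3, 4}): 3, frozenset({2}): 3, frozenset({3}): 3,
                frozenset({4}): 3}
    for a, b, conf in rules_from_itemset(frozenset({2, 3, 4}), supports, 0.5):
        print(sorted(a), "->", sorted(b), f"{conf:.0%}")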
Example – deriving rules from frequent itemsets

• Suppose 234 is frequent, with supp=50%


– Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with
supp=50%, 50%, 75%, 75%, 75%, 75% respectively
– These generate the following association rules:
• 23 => 4, confidence=100%
• 24 => 3, confidence=100%
• 34 => 2, confidence=67%
• 2 => 34, confidence=67%
• 3 => 24, confidence=67%
• 4 => 23, confidence=67%
• All rules have support = 50%

23
How to Count Supports of Candidates?

• Why is counting supports of candidates a problem?
  – The total number of candidates can be huge
  – One transaction may contain many candidates
• Method:
– Candidate itemsets are stored in a hash-tree
– Leaf node of hash-tree contains a list of itemsets and
counts
– Interior node contains a hash table
– Subset function: finds all the candidates contained in a
transaction
24
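The hash-tree is the classic structure; a simpler (less memory-efficient) way to implement the subset function is to hash the candidates into a plain dictionary and enumerate each transaction's k-subsets. A sketch of that variant:

    from itertools import combinations

    def count_supports(candidates, transactions, k):
        """Subset function via a plain hash table instead of a hash-tree:
        enumerate each transaction's k-subsets and bump matching candidates."""
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for s in combinations(sorted(t), k):
                fs = frozenset(s)
                if fs in counts:
                    counts[fs] += 1
        return counts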
Improving Apriori’s Efficiency

• Hash-based itemset counting: A k-itemset whose corresponding hashing


bucket count is below the threshold cannot be frequent

• Transaction reduction: A transaction that does not contain any frequent k-


itemset is useless in subsequent scans

• Partitioning: Any itemset that is potentially frequent in DB must be frequent in


at least one of the partitions of DB

• Sampling: mining on a subset of given data, need a lower support threshold + a


method to determine the completeness

• Dynamic itemset counting: add new candidate itemsets immediately (unlike


Apriori) when all of their subsets are estimated to be frequent

25
Is Apriori Fast Enough? — Performance Bottlenecks

• The core of the Apriori algorithm:


– Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
– Use database scan and pattern matching to collect counts for the
candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
  – Multiple scans of the database:
    • Needs (n + 1) scans, where n is the length of the longest pattern

26
Mining Frequent Patterns Without Candidate Generation

• Heuristic: let P be a frequent itemset, S be the set of transactions containing P, and x be an item. If x is a frequent item in S, then {x} ∪ P must be a frequent itemset.
• No candidate generation!
• A compact data structure, FP-tree, to store
information for frequent pattern mining
• Recursive mining algorithm for mining complete set
of frequent patterns

27
Example
Items bought:
  f, a, c, d, g, i, m, p
  a, b, c, f, l, m, o
  b, f, h, j, o
  b, c, k, s, p
  a, f, c, e, l, p, m, n

Min support = 3
28
Scan the database

• List of frequent items, sorted: (item:support)


– <(f:4), (c:4), (a:3),(b:3),(m:3),(p:3)>
• The root of the tree is created and labeled
with “{}”
• Scan the database
– Scanning the first transaction leads to the first
branch of the tree: <(f:1),(c:1),(a:1),(m:1),(p:1)>
– Order according to frequency

29
Scanning TID=100

Transaction database:
  TID   Items
  100   f, a, c, d, g, i, m, p

Header table (item: number of nodes):
  f: 1, c: 1, a: 1, m: 1, p: 1

[FP-tree after TID=100: a single path {} → f:1 → c:1 → a:1 → m:1 → p:1]
30
Scanning TID=200
Items bought:
  f, a, c, d, g, i, m, p
  a, b, c, f, l, m, o    ← TID = 200
  b, f, h, j, o
  b, c, k, s, p
  a, f, c, e, l, p, m, n

Frequent single items: F1 = <f, c, a, b, m, p>

For TID = 200:
  Possible frequent items (intersect {a, b, c, f, l, m, o} with F1): f, c, a, b, m
  Along the first branch <f, c, a, m, p>, the shared prefix is <f, c, a>
  Generate two children: <b>, <m>
31
Scanning TID=200

Transaction database (filtered and ordered):
  TID   Items
  200   f, c, a, b, m

Header table (item: number of nodes):
  f: 1, c: 1, a: 1, b: 1, m: 2, p: 1

[FP-tree after TID=200: {} → f:2 → c:2 → a:2; under a:2, the existing path m:1 → p:1 and a new path b:1 → m:1]
32
The final FP-tree

Transaction database:
  TID   Items
  100   f, a, c, d, g, i, m, p
  200   a, b, c, f, l, m, o
  300   b, f, h, j, o
  400   b, c, k, s, p
  500   a, f, c, e, l, p, m, n

Min support = 3
Frequent 1-items in frequency-descending order: f, c, a, b, m, p

Final FP-tree:
  {}
  ├─ f:4
  │   ├─ c:3
  │   │   └─ a:3
  │   │       ├─ m:2
  │   │       │   └─ p:2
  │   │       └─ b:1
  │   │           └─ m:1
  │   └─ b:1
  └─ c:1
      └─ b:1
          └─ p:1

Header table (item: number of nodes):
  f: 1, c: 2, a: 1, b: 3, m: 2, p: 2

33
FP-Tree Construction

• Scans the database only twice


• Subsequent mining: based on the FP-tree

34
How to Mine an FP-tree?

• Step 1: form conditional pattern base

• Step 2: construct conditional FP-tree

• Step 3: recursively mine conditional FP-trees

• If the conditional FP-tree contains a single path, simply enumerate

all the patterns

35
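These three steps fit in a short recursive sketch (my own compact rendering, not Han & Pei's code; class and variable names are illustrative). It builds the FP-tree in two scans, then mines each suffix item's conditional pattern base recursively:

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fptree(transactions, min_sup):
        """Scan 1: count items. Scan 2: insert transactions in frequency order."""
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        freq = {i: c for i, c in counts.items() if c >= min_sup}
        order = sorted(freq, key=lambda i: -freq[i])
        root, header = Node(None, None), defaultdict(list)  # item -> its nodes
        for t in transactions:
            node = root
            for item in (i for i in order if i in t):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += 1
        return header, order

    def fpgrowth(transactions, min_sup, suffix=frozenset()):
        """Step 1: conditional pattern bases; step 2: conditional FP-trees;
        step 3: recurse. Returns {frequent itemset: support count}."""
        header, order = build_fptree(transactions, min_sup)
        patterns = {}
        for item in reversed(order):              # least frequent suffix first
            pat = suffix | {item}
            patterns[pat] = sum(n.count for n in header[item])
            cond_db = []                          # prefix paths, weighted by count
            for n in header[item]:
                path, p = [], n.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                cond_db.extend([path] * n.count)
            patterns.update(fpgrowth(cond_db, min_sup, pat))
        return patterns

    # The running example, min support = 3:
    tdb = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
           set("bcksp"), set("afcelpmn")]
    for pat, sup in sorted(fpgrowth(tdb, 3).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(pat), sup)

Note that the tie-breaking order among equally frequent items is arbitrary here, so the tree shape may differ from the slides, but the mined patterns and supports are the same.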
Conditional Pattern Base

• Let {I} be a frequent item
  – Its conditional pattern base is the sub-database that consists of the set of prefix paths in the FP-tree occurring with {I} as suffix
• Example:
  – {m} is a frequent item
  – {m}'s conditional pattern base:
    • <f, c, a>: support = 2
    • <f, c, a, b>: support = 1
  – Mine recursively on such databases

[Figure: the final FP-tree, with the two prefix paths of m highlighted]
36
Conditional Pattern Tree

• Let {I} be a suffix item and DB|I be its conditional pattern base
• The frequent pattern tree Tree_I built from DB|I is known as the conditional pattern tree
• Example:
  – {m} is a frequent item
  – {m}'s conditional pattern base:
    • <f, c, a>: support = 2
    • <f, c, a, b>: support = 1
  – {m}'s conditional pattern tree is the single path {} → f:3 → c:3 → a:3
37
Composition of patterns α and β

• Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.
• Example:
  – Start with α = {p}
  – {p}'s conditional pattern base (from the tree): B = { (f, c, a, m): 2, (c, b): 1 }
  – Let β be {c}. Then α ∪ β = {p, c}, with support = 3.
38
Why Is Frequent Pattern Growth Fast?
• Divide-and-conquer:
– decompose both the mining task and DB according to the
frequent patterns obtained so far
– leads to focused search of smaller databases
• Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic ops—counting and FP-tree building, not pattern search
and matching

39
Multi-Dimensional Association:
Concepts
• Single-dimensional rules:
  buys(X, "milk") → buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  – Inter-dimension association rules (no repeated predicates)
    age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke")
  – Hybrid-dimension association rules (repeated predicates)
    age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke")
• Categorical Attributes
– finite number of possible values, no ordering among values
• Quantitative Attributes
– numeric, implicit ordering among values

40
From association mining to correlation analysis

• From association to correlation and causal structure analysis
• Association does not necessarily imply correlation or causal relationships

41
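One common correlation measure used for this purpose (in Han's book, among others; the slide only alludes to it) is lift: lift(A, B) = P(A ∪ B) / (P(A) · P(B)). A value near 1 suggests independence, above 1 positive correlation, below 1 negative correlation. A tiny sketch:

    def lift(s_ab, s_a, s_b):
        """lift(A,B) = P(A and B) / (P(A) * P(B)).
        ~1: independent; >1: positively correlated; <1: negatively correlated."""
        return s_ab / (s_a * s_b)

    # E.g. s(A)=0.6, s(B)=0.75, s(A,B)=0.4 -> lift ~ 0.89 (negatively correlated)
    print(lift(0.4, 0.6, 0.75))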
• Association rule mining using weka.
• Next: Classification

42
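For the Weka exercise, the bundled Apriori implementation can be run from the command line roughly as follows (assuming weka.jar on the classpath and the standard weather.nominal.arff sample file; -t names the data file, -N caps the number of rules, -C sets the minimum confidence):

    java -cp weka.jar weka.associations.Apriori -t data/weather.nominal.arff -N 10 -C 0.9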
Reading from text book:

• Mining multilevel association rules from


transactional databases
• Mining multidimensional association rules
from transactional databases and data
warehouse
• Constraint-based association mining

43
