Data - Analytics - Chapter 3
Data Mining And Data Warehousing By Dr. Mrs. S. C. Shirwaikar Dated 27 June 2007
• The discovery of these associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased
together by customers.
– For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip?
• Market-Basket Analysis:
– Suppose, as manager of an AllElectronics branch, you would like to
learn more about the buying habits of your customers. Specifically, you
wonder, “Which groups or sets of items are customers likely to
purchase on a given trip to the store?”
– To answer your question, market basket analysis may be performed on
the retail data of customer transactions at your store.
– You can then use the results to plan marketing or advertising strategies,
or in the design of a new catalog. For instance, market basket analysis
may help you design different store layouts.
For each frequent itemset l,
generate all nonempty subsets of l
For every nonempty subset s of l,
output the association rule s=>l-s
if support_count(l)/support_count(s) ≥ min_conf
( min_conf=minimum confidence threshold)
Consider l = {A,B,E} with min_sup = 2, over the transaction database:

Tid   Items
10    A, B, E
20    B, E
30    B, C
40    A, B, D
50    A, C
60    B, C
70    A, C
80    A, B, C, E
90    A, B, C

The nonempty proper subsets of l are {A,B}, {A,E}, {B,E}, {A}, {B}, {E}, giving the candidate rules:

{A,B} => {E}   confidence = 2/4 = 50%
{A,E} => {B}   confidence = 2/2 = 100%
{B,E} => {A}   confidence = 2/3 = 66%
{A} => {B,E}   confidence = 2/6 = 33%
{B} => {A,E}   confidence = 2/7 = 28%
{E} => {A,B}   confidence = 2/3 = 66%

If the minimum confidence threshold is 70%, then {A,E} => {B} is the only strong association rule.
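The rule-generation loop above can be sketched in Python; the support counts are taken from the example database, and `rules_from_itemset` is an illustrative helper name, not a standard function:

```python
from itertools import combinations

# Support counts from the example database (9 transactions).
support = {
    frozenset("A"): 6, frozenset("B"): 7, frozenset("E"): 3,
    frozenset("AB"): 4, frozenset("AE"): 2, frozenset("BE"): 3,
    frozenset("ABE"): 2,
}

def rules_from_itemset(l, support, min_conf):
    """Emit s => l-s for every nonempty proper subset s of l
    whose confidence support_count(l)/support_count(s) meets min_conf."""
    rules = []
    for r in range(1, len(l)):
        for subset in combinations(sorted(l), r):
            s = frozenset(subset)
            conf = support[l] / support[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

rules = rules_from_itemset(frozenset("ABE"), support, min_conf=0.70)
for ante, cons, conf in rules:
    print(sorted(ante), "=>", sorted(cons), f"{conf:.0%}")
# With min_conf = 70%, only {A,E} => {B} (confidence 100%) survives.
```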
• Association mining involves two steps:
– Find all frequent itemsets
– Generate strong association rules from the frequent itemsets
• The major challenge in mining frequent itemsets is
– the generation of a huge number of frequent itemsets satisfying min_sup
– if an itemset is frequent, all of its subsets are also frequent
– hence a long frequent itemset implies a large number of shorter frequent
itemsets
– To reduce this, two concepts are used:
• Closed frequent itemset (a frequent itemset with no proper superset having the
same support count)
• Maximal frequent itemset (a frequent itemset with no proper superset that is frequent)
• Another example: suppose 5,000 transactions
have been made through a popular e-commerce
website, and we want to calculate the support,
confidence, and lift for two products, say a pen
and a notebook. Out of the 5,000 transactions,
500 contain only a pen, 700 contain only a
notebook, and 1,000 contain both.
• SUPPORT: the fraction of all transactions that
contain the item.
• support(pen) = transactions containing pen / total
transactions
– support(pen) = 1500/5000 = 30%
• CONFIDENCE: how often the rule pen => notebook
holds, i.e. the fraction of pen transactions that
also contain a notebook.
• confidence = combined transactions / transactions
containing pen
– confidence = 1000/1500 ≈ 67%
• LIFT: the ratio measuring how much the rule
improves over chance:
lift = confidence / support(notebook)
• lift ≈ 0.67 / 0.34 ≈ 2
• A lift value below 1 means the combination is
bought together less often than chance would
predict.
• But in this case lift ≈ 2 > 1, so the probability
of buying both items together is high
compared with the transactions for the
individual items sold.
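The computation can be sketched as follows, assuming the counts mean 500 pen-only, 700 notebook-only, and 1,000 joint transactions (the only internally consistent reading of the figures):

```python
# Worked support/confidence/lift computation for pen => notebook.
total = 5000
pen_only, notebook_only, both = 500, 700, 1000

support_pen = (pen_only + both) / total            # 1500/5000 = 0.30
support_notebook = (notebook_only + both) / total  # 1700/5000 = 0.34
support_both = both / total                        # 1000/5000 = 0.20

# confidence(pen => notebook) = support(pen and notebook) / support(pen)
confidence = support_both / support_pen            # 0.20/0.30 ≈ 0.667

# lift(pen => notebook) = confidence / support(notebook)
lift = confidence / support_notebook               # ≈ 1.96 > 1

print(f"support(pen)={support_pen:.0%}, "
      f"confidence={confidence:.1%}, lift={lift:.2f}")
```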
• Closed and maximal frequent itemsets: suppose that a
transaction database has only two transactions: {a1, a2, …,
a100} and {a1, a2, …, a50}.
• Let the minimum support count threshold be min_sup = 1.
• We find two closed frequent itemsets and their support
counts, that is, C = {{a1, a2, …, a100} : 1; {a1, a2, …, a50} : 2}.
• There is only one maximal frequent itemset: M = {{a1, a2, …,
a100} : 1}.
• The set of closed frequent itemsets contains complete
information regarding the frequent itemsets.
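The definitions can be checked with a small sketch. It restricts attention to the two distinct transaction itemsets, which are the only candidates for closed sets here (every closed itemset is an intersection of transactions):

```python
# Two-transaction example with min_sup = 1.
t1 = frozenset(range(1, 101))   # {a1..a100}
t2 = frozenset(range(1, 51))    # {a1..a50}
db = [t1, t2]

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t)

# The only possible closed itemsets are intersections of transactions: t1, t2.
frequent = {t1: support(t1), t2: support(t2)}   # t1:1, t2:2

# Closed: no proper superset with the same support count.
closed = {X for X in frequent
          if not any(X < Y and support(Y) == frequent[X] for Y in frequent)}
# Maximal: no frequent proper superset.
maximal = {X for X in frequent
           if not any(X < Y for Y in frequent)}

assert closed == {t1, t2}   # both itemsets are closed
assert maximal == {t1}      # only {a1..a100} is maximal
```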
Frequent itemset mining methods
• There are many kinds of frequent pattern mining algorithms,
supporting tasks such as association rule mining and correlation analysis
• Classified in many ways based on following
criteria:
• Based on completeness of patterns to be mined
» Complete set of freq itemsets
» Closed freq itemsets
» Maximal freq itemsets
» Constrained freq itemsets
» Approximate freq itemsets
» Top-k frequent itemsets
• Based on levels of abstraction
• Based on the number of data dimensions involved
in the rule
• Based on the types of values handled by the rules
– Boolean association rule(presence/absence of items)
– Quantitative association rule
• Based on the kinds of rules to be mined.
– Association rules
– Correlation rules
• Based on the kinds of patterns to be mined
– Frequent itemsets / sequential patterns / structured
patterns
Basic Algorithms
Apriori Algorithm – It is based on the large-itemset, or Apriori,
property
Apriori property: all nonempty subsets of a frequent itemset must also be
frequent. Large itemsets are downward closed (if an itemset satisfies the
minimum support, so do all of its subsets)
If we know that an itemset is infrequent (small), we need not consider its supersets as
candidates, because they will also be infrequent
Apriori employs an iterative approach known as level-wise search, where
k-itemsets are used to explore (k+1)-itemsets
•Initially, scan DB once to get frequent 1-itemset
•Generate length (k+1) candidate itemsets from length k frequent itemsets
•Test the candidates against DB
•Terminate when no frequent or candidate set can be generated
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested
Method:
Lk denotes the set of frequent k-itemsets- Large itemset
Ck is the superset of Lk – Candidate for Large itemset
Supmin = 2

Transaction database:
Tid   Items
10    A, B, E
20    B, E
30    B, C
40    A, B, D
50    A, C
60    B, C
70    A, C
80    A, B, C, E
90    A, B, C

The 1st scan of the database produces C1; pruning by Supmin gives L1:

C1:  {A} 6,  {B} 7,  {C} 6,  {D} 1,  {E} 3
L1:  {A} 6,  {B} 7,  {C} 6,  {E} 3
Two-step process is followed consisting of join and prune actions
to generate Lk from Lk-1
Join Step- Apriori assumes that items within a transaction or
itemset are sorted in lexicographic order.
The Candidate set Ck is generated by taking the join Lk-1xLk-1, where
members of Lk-1 are joinable if their first k-2 items are in common.
This ensures that no duplicates are generated
Prune step- To reduce the size of Ck, Apriori property is used as
follows
Any (k-1)-itemset that is not frequent cannot be a subset of a
frequent k-itemset. Hence if any (k-1)-subset of a candidate
k-itemset is not in Lk-1, the candidate cannot be frequent and can
be removed from Ck
The count of each candidate in Ck is used to determine Lk
(minimum support count)
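The join and prune steps above can be sketched as follows (a minimal version with itemsets kept as lexicographically sorted tuples; `apriori_gen` is an illustrative name):

```python
from itertools import combinations

def apriori_gen(Lk_1):
    """Generate candidate k-itemsets from frequent (k-1)-itemsets.
    Join: merge two (k-1)-itemsets that share their first k-2 items.
    Prune: drop candidates having an infrequent (k-1)-subset."""
    prev = set(Lk_1)
    k_1 = len(Lk_1[0])
    candidates = []
    for i, l1 in enumerate(Lk_1):
        for l2 in Lk_1[i + 1:]:
            if l1[:k_1 - 1] == l2[:k_1 - 1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # prune: every (k-1)-subset of c must be frequent
                if all(s in prev for s in combinations(c, k_1)):
                    candidates.append(c)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("B", "E")]
print(apriori_gen(L2))   # [('A', 'B', 'C'), ('A', 'B', 'E')]
```

{A,C,E} and {B,C,E} are produced by the join but pruned, since {C,E} is not in L2.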
Supmin = 2
Joining L1 x L1 gives the candidate set C2; the 2nd scan counts each
candidate and pruning by Supmin gives L2:

C2:  {A,B} 4,  {A,C} 4,  {A,E} 2,  {B,C} 4,  {B,E} 3,  {C,E} 1
L2:  {A,B} 4,  {A,C} 4,  {A,E} 2,  {B,C} 4,  {B,E} 3
Supmin = 2
L2 = { {A,B}, {A,C}, {A,E}, {B,C}, {B,E} }
Joining L2 x L2 gives C3 = { {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E} }

Prune step:
The 2-item subsets of {A,B,C} are {A,B}, {A,C}, {B,C}, which are all in L2
The 2-item subsets of {A,B,E} are {A,B}, {A,E}, {B,E}, which are all in L2
The 2-item subsets of {A,C,E} are {A,C}, {A,E}, {C,E}; {C,E} is not in L2,
so remove {A,C,E}
The 2-item subsets of {B,C,E} are {B,C}, {B,E}, {C,E}; {C,E} is not in L2,
so remove {B,C,E}

The 3rd scan counts the remaining candidates and pruning by Supmin gives L3:

C3:  {A,B,C} 2,  {A,B,E} 2
L3:  {A,B,C} 2,  {A,B,E} 2
Supmin = 2
L3 = { {A,B,C}, {A,B,E} }
Joining L3 x L3 gives C4 = { {A,B,C,E} }
The 3-item subsets of {A,B,C,E} are {A,B,C}, {A,B,E}, {A,C,E} and {B,C,E};
{A,C,E} and {B,C,E} are not in L3, so {A,B,C,E} is removed.
Thus C4 is empty and the algorithm terminates, having
found all the frequent itemsets.
The Apriori Algorithm
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
Algorithm Apriori
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = Apriori_generate(Lk)
// candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1
that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
Algorithm Apriori_generate(Lk)
for each itemset l1 in Lk
  for each itemset l2 in Lk
    if the first k-1 elements of l1 and l2 are equal
    // i.e. l1[1]=l2[1] and l1[2]=l2[2] and … and l1[k-1]=l2[k-1]
    // and l1[k]<l2[k]
      c = l1 join l2
      add c to Ck+1
      for each k-subset s of c
        if s does not belong to Lk then
          delete c from Ck+1; break
The Apriori algorithm assumes that the dataset is memory
resident. The maximum number of DB scans is one more than the
size of the largest frequent itemset.
The large number of data scans is a weakness of Apriori
• The apriori procedure performs two kinds of
actions, namely, join and prune, as described
before.
• In the join component, Lk-1 is joined with Lk-1
to generate potential candidates
• The prune component employs the Apriori
property to remove candidates that have a
subset that is not frequent
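The pseudocode can be turned into a small runnable Python sketch, tried here on the nine-transaction example. For brevity this version generates candidates from all frequent items and relies on the subset-prune test rather than implementing the Lk-1 x Lk-1 join literally; the output is the same:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # 1st scan: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # candidates of size k+1 whose k-subsets are all frequent (prune)
        items = sorted({i for s in Lk for i in s})
        candidates = [frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in Lk for s in combinations(c, k))]
        # scan the DB to count the surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

db = [["A","B","E"], ["B","E"], ["B","C"], ["A","B","D"], ["A","C"],
      ["B","C"], ["A","C"], ["A","B","C","E"], ["A","B","C"]]
freq = apriori(db, min_sup=2)
print(freq[frozenset("ABE")])   # 2
```

This finds 4 frequent 1-itemsets, 5 frequent 2-itemsets, and the two frequent 3-itemsets {A,B,C} and {A,B,E}, matching the worked example.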
• Generating association rules from frequent
item sets:
– Once the frequent itemsets from transactions in a
database D have been found, it is straightforward
to generate strong association rules from them
(where strong association rules satisfy both
minimum support and minimum confidence)
– This can be done using confidence
• Conf(A => B) = support_count(A U B) / support_count(A)
• Where support_count(A U B) is the number of
transactions containing the itemsets A U B, and
support_count(A) is the number of transactions
containing A.
• Thus, for each frequent itemset, association rules are
generated by enumerating its nonempty proper subsets and
keeping the rules that meet the minimum confidence threshold.
Frequent Pattern-Tree
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots
of candidates
Solution-finding frequent itemsets without candidate generation
Frequent pattern growth-The transaction database is represented as an
FP-tree(frequent pattern tree)and then FP-tree is mined to get frequent
itemsets
Create root of the tree as null
The set of frequent items is sorted in descending order of support
count, denoted by L
L = { (B:7), (A:6), (C:6), (E:3), (D:1) }
The items in each transaction are processed in L order and a branch is
created for each transaction
Increment the node count if it is shared with other transactions
Frequent Pattern-Tree
Tid   Items          Item   Support count
10    A, B, E        {B}    7
20    B, E           {A}    6
30    B, C           {C}    6
40    A, B, D        {E}    3
50    A, C           {D}    1
60    B, C
70    A, C
80    A, B, C, E
90    A, B, C

The resulting FP-tree:

null
├── B:7
│   ├── A:4
│   │   ├── E:1
│   │   ├── D:1
│   │   └── C:2
│   │       └── E:1
│   ├── E:1
│   └── C:2
└── A:2
    └── C:2
Frequent Pattern-Tree with item header table

To facilitate tree traversal, an item header table is built so that each
item points to its occurrences in the tree via a chain of node-links.

Item   Support-Count   Node-links (into the tree above)
B      7               B:7
A      6               A:4, A:2
C      6               C:2, C:2, C:2
E      3               E:1, E:1, E:1
D      1               D:1
Frequent Pattern-Tree Mining
• Scan DB once, find frequent 1-itemsets and their support counts
• Sort frequent items in descending order of support counts to get a
list of frequent items L
• Create a root of an FP-tree as null
• Scan DB again. For each transaction in DB, sort its items in the order
of L. Let the sorted list be [p|P], where p is the first element and P is
the remaining list. If p is already present as a child of the current
node, increment its node count; otherwise create a new node and link it
into the node-link chain for that item. If P is nonempty, insert P recursively
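The construction steps above can be sketched as follows (a minimal version; unlike the earlier drawing, items below min_sup, here D, are dropped before insertion, as FP-growth normally requires):

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree; returns (root, header) where header maps each
    frequent item to the list of its nodes (the node-link chain)."""
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    # keep frequent items only, ranked by descending support count
    order = {i: c for i, c in counts.items() if c >= min_sup}
    root, header = Node(None, None), defaultdict(list)
    for t in transactions:
        node = root
        # process the transaction's items in L order
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            if item in node.children:
                node.children[item].count += 1   # shared prefix: increment
            else:
                child = Node(item, node)         # new branch
                node.children[item] = child
                header[item].append(child)       # extend the node-link chain
            node = node.children[item]
    return root, header

db = [["A","B","E"], ["B","E"], ["B","C"], ["A","B","D"], ["A","C"],
      ["B","C"], ["A","C"], ["A","B","C","E"], ["A","B","C"]]
root, header = build_fp_tree(db, min_sup=2)
print(root.children["B"].count)   # 7
```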
Create root as null.
First transaction in L order: B, A, E

null
└── B:1
    └── A:1
        └── E:1
Second transaction in L order: B, E
The shared prefix node B is incremented and a new branch is created for E:

null
└── B:2
    ├── A:1
    │   └── E:1
    └── E:1
Frequent Pattern-Tree Mining
• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p’s
conditional pattern base
Frequent Pattern-Tree Mining

Item   Conditional pattern base        Conditional FP-tree    Frequent patterns generated
E      {B,A:1}, {B,A,C:1}, {B:1}       <B:3, A:2>             {E,B:3}, {E,A:2}, {E,A,B:2}
C      {B,A:2}, {B:2}, {A:2}           <B:4, A:2>, <A:2>      {C,B:4}, {C,A:4}, {C,A,B:2}
A      {B:4}                           <B:4>                  {A,B:4}
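As a cross-check, the conditional pattern bases can be computed directly from the transactions (for each occurrence of an item, the prefix that precedes it in L order) rather than by following node-links; `conditional_pattern_base` is an illustrative name:

```python
from collections import defaultdict

db = [["A","B","E"], ["B","E"], ["B","C"], ["A","B","D"], ["A","C"],
      ["B","C"], ["A","C"], ["A","B","C","E"], ["A","B","C"]]

counts = defaultdict(int)
for t in db:
    for i in t:
        counts[i] += 1
# frequent items (min_sup = 2) in descending support-count order: B, A, C, E
L_order = sorted((i for i in counts if counts[i] >= 2),
                 key=lambda i: (-counts[i], i))

def conditional_pattern_base(item):
    """Prefix paths preceding each occurrence of item, with their counts."""
    base = defaultdict(int)
    for t in db:
        sorted_t = [i for i in L_order if i in t]   # L order, D dropped
        if item in sorted_t:
            prefix = tuple(sorted_t[:sorted_t.index(item)])
            if prefix:
                base[prefix] += 1
    return dict(base)

print(conditional_pattern_base("E"))
# {('B', 'A'): 1, ('B',): 1, ('B', 'A', 'C'): 1}
```

The results agree with the table above: E has base {B,A:1}, {B:1}, {B,A,C:1}; C has {B,A:2}, {B:2}, {A:2}; A has {B:4}.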