Unit 4 - DA - Frequent Itemsets and Associations
Let I = {i1, ..., ik} be a set of items. Let D be a set of transactions, where each transaction T
is a set of items such that T ⊆ I. Each transaction is associated with an identifier, called
its TID. Let A be a set of items. A transaction T is said to contain A if and only if A ⊆ T.
Transaction  TID  Items bought
T1           10   Beer, Nuts, Diaper
T2           20   Beer, Coffee, Diaper
T3           30   Beer, Diaper, Eggs
T4           40   Nuts, Eggs, Milk
T5           50   Nuts, Coffee, Diaper, Eggs, Milk

What is a k-itemset? An itemset that contains k items. When k = 1 it is a 1-itemset
(e.g. {Beer}), when k = 2 a 2-itemset (e.g. {Beer, Diaper}), and so on.
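These definitions translate directly into code. Below is a minimal Python sketch over the table above (variable names are illustrative, not part of the original material):

```python
# Transactions from the table above, keyed by TID.
D = {
    10: {"Beer", "Nuts", "Diaper"},
    20: {"Beer", "Coffee", "Diaper"},
    30: {"Beer", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

A = {"Beer", "Diaper"}  # a 2-itemset (k = 2)

# A transaction T contains A if and only if A is a subset of T.
containing = [tid for tid, T in D.items() if A <= T]
print(containing)  # -> [10, 20, 30]
```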
For example, suppose you are in a supermarket to buy milk. Referring to the example below,
there are nine baskets containing varying combinations of milk, cheese, apples, and
bananas.
Question: given that you bought milk, are you more likely to buy apples or cheese in the
same transaction than somebody who did not buy milk?
The next step is to determine the relationships and the rules, and this is where association
rule mining is applied. It is a procedure that aims to discover frequently occurring
patterns, correlations, or associations in datasets stored in various kinds of databases,
such as relational databases, transactional databases, and other forms of repositories.
Market-Basket Model cont…
An association rule has three measures that express the degree of confidence in the
rule: Support, Confidence, and Lift. Since the market-basket model has its origin in retail
applications, a basket is sometimes called a transaction.
Support: The number of transactions that include the items in both the {A} and {B} parts
of the rule, as a percentage of the total number of transactions. It is a measure of how
frequently the collection of items occurs together, as a percentage of all transactions.
Example: Referring to the earlier dataset, Support(milk) = 6/9, Support(cheese) =
7/9, Support(milk & cheese) = 6/9. This is often written for the rule milk => cheese,
i.e. milk and cheese bought together.
Confidence: It is the ratio of the number of transactions that include all items in both
{A} and {B} to the number of transactions that include all items in {A}. Example:
Referring to the earlier dataset,
Confidence(milk => cheese) = (milk & cheese)/(milk) = 6/6 = 1.
Lift: The lift of the rule A => B is the confidence of the rule divided by the expected
confidence, assuming that the itemsets A and B are independent of each other.
Example: Referring to the earlier dataset, Lift(milk => cheese) = [(milk &
cheese)/(milk)] / [cheese/Total] = [6/6] / [7/9] = 9/7 ≈ 1.286.
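All three measures are simple ratios, so they are easy to compute. The nine-basket data itself is not reproduced in this excerpt, so the sketch below works from the counts quoted above; the function and variable names are illustrative:

```python
# Support, confidence, and lift from raw counts, using the figures
# quoted above for the nine-basket milk/cheese example.
def support(count_itemset, total):
    return count_itemset / total

def confidence(count_a_and_b, count_a):
    return count_a_and_b / count_a

def lift(count_a_and_b, count_a, count_b, total):
    return confidence(count_a_and_b, count_a) / (count_b / total)

total = 9            # nine baskets in all
milk, cheese = 6, 7  # baskets containing milk / cheese
both = 6             # baskets containing both

print(support(both, total))             # 6/9 ≈ 0.667
print(confidence(both, milk))           # 6/6 = 1.0
print(lift(both, milk, cheese, total))  # (6/6)/(7/9) ≈ 1.286
```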
Class exercise: using the earlier dataset, fill in the missing values.

#  Itemset / Rule              Frequency  Support  Confidence  Lift
   Milk                        ?          ?
   Cheese                      ?          ?
1  Milk => Cheese              ?          ?        ?           ?
   Apple, Milk                 ?          ?
2  (Apple, Milk) => Cheese     ?          ?        ?           ?
   Apple, Cheese               ?          ?

Support(Grapes) = ?
Confidence({Grapes, Apple} => {Mango}) = ?
Lift({Grapes, Apple} => {Mango}) = ?
Imagine you are at the supermarket with the items you want to buy in mind, but you
end up buying a lot more than you had planned. This is called impulse buying, and
brands use the Apriori algorithm to leverage this phenomenon.
The Apriori algorithm uses frequent itemsets to generate association rules,
and it is designed to work on databases that contain transactions.
With the help of these association rules, it determines how strongly or how
weakly two objects are connected.
It is an iterative process for finding the frequent itemsets in a large
dataset (a compact code sketch appears at the end of the worked example below).
The algorithm uses a breadth-first search and a hash tree to count the
itemset supports efficiently.
It is mainly used for market basket analysis and helps to find those products
that tend to be bought together. It can also be used in the healthcare field to
find drug reactions for patients.
A hash tree is a data structure used for data verification and synchronization.
It is a tree data structure where each non-leaf node is a hash of its child
nodes. All the leaf nodes are at the same depth and are as far left as possible.
It is also known as a Merkle tree.
[Figure: a Merkle tree over data blocks A, B, C, D. Each leaf holds the hash of its
block; the internal nodes hold h(A,B) = hash(h(A) + h(B)) and h(C,D) = hash(h(C) + h(D));
the top hash is hash(h(A,B) + h(C,D)).]
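To make the hashing rule concrete, here is a minimal sketch in Python, assuming SHA-256 and simple byte concatenation (both are arbitrary choices for illustration; real Merkle-tree implementations fix their own hash function and encoding):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Leaf hashes for the four data blocks A, B, C, D.
leaves = [h(block) for block in (b"A", b"B", b"C", b"D")]

# Each non-leaf node is the hash of the concatenation of its children.
hAB = h(leaves[0] + leaves[1])
hCD = h(leaves[2] + leaves[3])
top_hash = h(hAB + hCD)  # the Merkle root

print(top_hash.hex())
```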
C1
Itemset  Support Count
A        6
B        7
C        6
D        2
E        1

Now, we take out all the itemsets whose support count is greater than or equal to
the minimum support (2). This gives us the table for the frequent itemset L1. Since
every itemset except E has a support count greater than or equal to the minimum
support, the itemset E is removed.
Apriori Algorithm Example cont…
L1
Itemset  Support Count
A        6
B        7
C        6
D        2

Step-2: Candidate generation C2, and L2:
In this step, we generate C2 with the help of L1. In C2, we create pairs of the
itemsets of L1 in the form of subsets. After creating the subsets, we again find the
support count from the main transaction table of the dataset, i.e., how many times
these pairs have occurred together in the given dataset. This gives us the table for C2:

C2
Itemset  Support Count
{A, B}   4
{A, C}   4
{A, D}   1
{B, C}   4
{B, D}   2
{C, D}   0

Again, we compare each C2 support count with the minimum support count, and after
comparing, the itemsets with a smaller support count are eliminated from the table C2.
This gives us the table for L2. In this case, the {A, D} and {C, D} itemsets are removed.
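Creating the C2 pairs from L1 is just enumerating all 2-element subsets of the frequent 1-itemsets; a quick sketch:

```python
from itertools import combinations

L1 = ["A", "B", "C", "D"]  # frequent 1-itemsets after pruning E
C2 = list(combinations(L1, 2))
print(C2)  # [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
```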
Apriori Algorithm Example cont…
L2
Itemset  Support Count
{A, B}   4
{A, C}   4
{B, C}   4
{B, D}   2

Step-3: Candidate generation C3, and L3:
For C3, we repeat the same two processes, but now we form the C3 table with subsets
of three itemsets together, and calculate the support count from the dataset. This
gives us the table below:

C3
Itemset    Support Count
{A, B, C}  2
{B, C, D}  0
{A, C, D}  0
{A, B, D}  0

Now we create the L3 table. As we can see from the C3 table above, there is only one
combination of itemsets whose support count is equal to the minimum support count.
So L3 has only one combination, i.e., {A, B, C}.

L3
Itemset    Support Count
{A, B, C}  2
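Putting the three steps together, below is a compact, illustrative Apriori sketch. The original transaction table is not reproduced in this excerpt, so the transactions list here is a reconstruction chosen so that its support counts match the C1, C2, and C3 tables above; treat the names and structure as assumptions, not as the slides' reference code.

```python
from itertools import combinations

# Reconstructed transactions: chosen so the support counts match the
# C1/C2/C3 tables above (the original table is not in this excerpt).
transactions = [
    {"A", "B"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"}, {"A", "C"},
    {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"}, {"A", "B", "C"},
]
MIN_SUP = 2

def count(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for T in transactions if itemset <= T)

# L1: frequent 1-itemsets (E is pruned, since its count 1 < MIN_SUP).
items = sorted({i for T in transactions for i in T})
Lk = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUP]

k = 2
while Lk:
    print(f"L{k-1}:", [(sorted(s), count(s)) for s in Lk])
    # Candidate generation: unions of frequent (k-1)-itemsets of size k.
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune by minimum support to get the next frequent level.
    Lk = [c for c in sorted(candidates, key=sorted) if count(c) >= MIN_SUP]
    k += 1
```

Running this prints L1 = {A, B, C, D}, L2 = {AB, AC, BC, BD}, and L3 = {ABC}, matching the walkthrough. Note that this union-based candidate generation is slightly more aggressive than the enumeration on the slide: it never proposes {A, C, D}, because {A, D} and {C, D} are already infrequent, yet it still reaches the same L3.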