Unit 4 - DA - Frequent Itemsets and Clustering
Strictly for internal circulation (within KIIT) and reference only. Not for outside circulation without permission
The next step is to determine the relationships and the rules. So, association rule
mining is applied in this context. It is a procedure that aims to discover frequently
occurring patterns, correlations, or associations in datasets found in various kinds
of databases, such as relational databases, transactional databases, and other forms of
repositories.
Market-Basket Model cont…
An association rule has three measures that express the degree of confidence in the
rule: Support, Confidence, and Lift. Since the market-basket model has its origin in retail
applications, a basket is sometimes called a transaction.
Support: The number of transactions that include the items in both the {A} and {B} parts of
the rule, as a percentage of the total number of transactions. It is a measure of how
frequently the collection of items occurs together, as a percentage of all transactions.
Example: Referring to the earlier dataset, Support(milk) = 6/9, Support(cheese) =
7/9, Support(milk & cheese) = 6/9. This is often written as milk => cheese, i.e.,
milk and cheese bought together.
Confidence: It is the ratio of the number of transactions that include all items in both {A}
and {B} to the number of transactions that include all items in {A}. Example: Referring to
the earlier dataset, Confidence(milk => cheese) = (milk & cheese)/(milk) = 6/6 = 1.
Lift: The lift of the rule A => B is the confidence of the rule divided by the expected
confidence, assuming that the itemsets A and B are independent of each other.
Example: Referring to the earlier dataset, Lift(milk => cheese) = [(milk &
cheese)/(milk)] / [cheese/Total] = [6/6] / [7/9] = 1/0.777 ≈ 1.29.
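To make the three measures concrete, here is a minimal Python sketch. The transaction list below is a hypothetical stand-in for the earlier dataset, constructed only so that the counts match the worked example (6 transactions with milk, 7 with cheese, 6 with both, out of 9).

```python
# Hypothetical stand-in for the earlier dataset: 9 transactions,
# 6 containing milk, 7 containing cheese, 6 containing both.
transactions = [
    {"milk", "cheese"}, {"milk", "cheese"}, {"milk", "cheese"},
    {"milk", "cheese"}, {"milk", "cheese"}, {"milk", "cheese"},
    {"cheese"}, {"bread"}, {"butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of (A and B) divided by support of A."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence of A => B divided by the support of B."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk"}))                 # 6/9 ≈ 0.667
print(support({"milk", "cheese"}))       # 6/9 ≈ 0.667
print(confidence({"milk"}, {"cheese"}))  # 6/6 = 1.0
print(lift({"milk"}, {"cheese"}))        # 1 / (7/9) ≈ 1.29
```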
Exercise (on the exercise dataset):
{Apple, Cheese} ?
Support(Grapes) ?
Confidence({Grapes, Apple} => {Mango}) ?
Lift({Grapes, Apple} => {Mango}) ?
[Figure: hash tree — the top hash combines hash(A+B) and hash(C+D) over the items A, B, C, D]
C1
Itemset   Support Count
A         6
B         7
C         6
D         2
E         1

Now, we take out all the itemsets whose support count is greater than or equal to the
minimum support (2). This gives us the table for the frequent itemset L1. Since all the
itemsets have a support count greater than or equal to the minimum support except E,
the itemset E is removed.
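A small sketch of this pruning step is given below. The transaction data behind the example is not shown here, so the function is written generically against any list of transactions (each a set of items); with the example's data it would keep A, B, C, D and drop E.

```python
from collections import Counter

def build_L1(transactions, min_support=2):
    """C1: count each individual item; L1: keep items meeting the minimum support count."""
    c1 = Counter(item for t in transactions for item in t)
    return {item: count for item, count in c1.items() if count >= min_support}
```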
Apriori Algorithm Example cont…
L1
Itemset   Support Count
A         6
B         7
C         6
D         2

Step-2: Candidate Generation C2, and L2:
In this step, we generate C2 with the help of L1. In C2, we create pairs of the itemsets of
L1 in the form of subsets. After creating the subsets, we again find the support count
from the main transaction table of the dataset, i.e., how many times these pairs have
occurred together in the given dataset. This gives us the table for C2:

C2
Itemset   Support Count
{A, B}    4
{A, C}    4
{A, D}    1
{B, C}    4
{B, D}    2
{C, D}    0

Again, we compare the C2 support counts with the minimum support count, and after
comparing, the itemsets with lower support counts are eliminated from table C2. This
gives us the table for L2. In this case, the itemsets {A, D} and {C, D} are removed.
Apriori Algorithm Example cont…
L2
Itemset   Support Count
{A, B}    4
{A, C}    4
{B, C}    4
{B, D}    2

Step-3: Candidate Generation C3, and L3:
For C3, we repeat the same two processes, but now we form the C3 table with subsets of
three items together and calculate the support count from the dataset. This gives the
table:

C3
Itemset     Support Count
{A, B, C}   2
{B, C, D}   0
{A, C, D}   0
{A, B, D}   0

Now we create the L3 table. As we can see from the C3 table above, there is only one
combination of items whose support count is equal to the minimum support count. So L3
will have only one combination, i.e., {A, B, C}.

L3
Itemset     Support Count
{A, B, C}   2
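The whole level-wise procedure (generate Ck from Lk-1, count supports, prune to Lk) can be sketched as follows. This is a simplified illustration rather than the textbook Apriori implementation (the subset-based candidate pruning step is omitted), and the transactions are assumed to be given as a list of item sets.

```python
def apriori(transactions, min_support=2):
    """Simplified level-wise frequent-itemset mining (sketch)."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Support count = number of transactions containing the candidate itemset.
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # C1 -> L1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {c: n for c, n in count({frozenset([i]) for i in items}).items()
             if n >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        # Ck: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Lk: candidates meeting the minimum support count.
        level = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent
```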
Apriori Algorithm Example cont…
Step-4: Finding the association rules for the subsets:
To generate the association rules, first, we create a new table (AR) with the possible
rules from the frequent combination {A, B, C}. For all the rules, we calculate the Confidence
using the formula Confidence(X => Y) = sup(X ^ Y)/sup(X). After calculating the confidence
value for all rules, we exclude the rules whose confidence is less than the minimum
threshold (50%).

AR
Rules      Support   Confidence
A^B → C    2         sup{(A^B)^C}/sup(A^B) = 2/4 = 0.5  = 50%
B^C → A    2         sup{(B^C)^A}/sup(B^C) = 2/4 = 0.5  = 50%
A^C → B    2         sup{(A^C)^B}/sup(A^C) = 2/4 = 0.5  = 50%
C → A^B    2         sup{C^(A^B)}/sup(C)   = 2/5 = 0.4  = 40%
A → B^C    2         sup{A^(B^C)}/sup(A)   = 2/6 ≈ 0.33 = 33.33%
B → A^C    2         sup{B^(A^C)}/sup(B)   = 2/7 ≈ 0.29 = 28.57%

As the given threshold or minimum confidence is 50%, the first three rules A^B → C,
B^C → A, and A^C → B can be considered as the strong association rules for the given
problem.
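The rule-generation step can be sketched in the same way. Here `frequent` is assumed to be the dictionary of frequent itemsets with their support counts produced by the sketch above, and the confidence formula is the same sup(X ^ Y)/sup(X) used in the AR table.

```python
from itertools import combinations

def association_rules(frequent, min_confidence=0.5):
    """Generate rules X -> Y from every frequent itemset and keep the strong ones."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                consequent = itemset - antecedent
                # Every subset of a frequent itemset is frequent, so its count is available.
                conf = count / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(consequent), conf))
    return rules

# For {A, B, C} with min_confidence = 0.5, only A^B -> C, B^C -> A and
# A^C -> B survive, matching the three strong rules above.
```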
Apriori Algorithm Flow
Now, based on the similarity of these clusters, the most similar clusters are combined
together, and this process is repeated until only a single cluster is left.
Proximity Matrix
Roll No    1    2    3    4    5
1          0    3   18   10   25
2          3    0   21   13   28
3         18   21    0    8    7
4         10   13    8    0   15
5         25   28    7   15    0
The diagonal elements of this matrix are always 0, as the distance of a point from itself is
always 0. The Euclidean distance formula is used to calculate the rest of the distances. So,
to calculate the distance between
Point 1 and 2: √(10-7)^2 = √9 = 3
Point 1 and 3: √(10-28)^2 = √324 = 18, and so on…
Similarly, all the distances are calculated and the proximity matrix is filled.
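The same matrix can be reproduced with a few lines of Python; the marks 10, 7, 28, 20, 35 for roll numbers 1 to 5 are the single feature used in the example.

```python
import numpy as np

marks = np.array([10, 7, 28, 20, 35], dtype=float)  # marks of roll numbers 1..5

# Euclidean distance on one feature reduces to sqrt((x_i - x_j)^2) = |x_i - x_j|.
proximity = np.sqrt((marks[:, None] - marks[None, :]) ** 2)
print(proximity.astype(int))
# [[ 0  3 18 10 25]
#  [ 3  0 21 13 28]
#  [18 21  0  8  7]
#  [10 13  8  0 15]
#  [25 28  7 15  0]]
```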
Step 1: First, each point is assigned to an individual cluster. Different colors here
represent different clusters. Hence, there are 5 different clusters for the 5 points in the data.
Step 2: Next, look at the smallest distance in the proximity matrix and merge the points
with the smallest distance. Then the proximity matrix is updated.
Here, the smallest distance in the matrix is 3, between points 1 and 2, so these two points
are merged first.
Let’s look at the updated clusters and accordingly update the proximity matrix. Here, we
have taken the maximum of the two marks (7, 10) to replace the mark for this cluster.
Instead of the maximum, the minimum value or the average value can also be used.

Roll No   Mark
(1, 2)    10
3         28
4         20
5         35
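A minimal sketch of this update, assuming (as in the table above) that a merged cluster is represented by the maximum of its members' marks:

```python
# Merge points 1 and 2 (marks 10 and 7) into one cluster represented
# by the maximum of the two marks, as done in the example.
clusters = {"(1, 2)": max(10, 7), "3": 28, "4": 20, "5": 35}
print(clusters)  # {'(1, 2)': 10, '3': 28, '4': 20, '5': 35}
# Using min(...) or the average instead corresponds to the other linkage choices mentioned above.
```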
Step 3: Step 2 is repeated until only a single cluster is left. So, we look at the minimum
distance in the proximity matrix and then merge the closest pair of clusters. We get
the merged clusters after repeating these steps.
Here, we can see that we have merged samples 1 and 2. The vertical line represents the
distance between these samples.
Dendrogram cont…
Similarly, we plot all the steps where we merged the clusters and finally, we get a
dendrogram like this:
We can clearly visualize the steps of hierarchical clustering. The greater the length of the
vertical lines in the dendrogram, the greater the distance between those clusters.
The number of clusters will be the number of vertical lines that are intersected
by the line drawn using the threshold. In the above example, since the red line intersects
2 vertical lines, we will have 2 clusters. One cluster will have the samples (1, 2, 4) and the
other will have the samples (3, 5).
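The whole agglomerative procedure and its dendrogram can also be reproduced with SciPy. This is a sketch under the assumption that complete (maximum) linkage is used, matching the max-of-marks rule above; cutting the tree into two clusters plays the role of the red threshold line.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

marks = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])  # one feature per student

# Agglomerative clustering with complete (maximum) linkage.
Z = linkage(marks, method="complete", metric="euclidean")
dendrogram(Z, labels=[1, 2, 3, 4, 5])  # draws the merge history (needs matplotlib)

# Cutting the dendrogram so that two clusters remain reproduces the grouping
# described above: samples {1, 2, 4} in one cluster and {3, 5} in the other.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```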
Hierarchical Clustering: Closeness of Two Clusters
The decision of merging two clusters is taken on the basis of the closeness of these
clusters. There are multiple metrics for deciding the closeness of two clusters; the
primary ones are:
Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Mahalanobis distance
The steps below explain the working of the K-Means Clustering Algorithm; a short code sketch follows the steps:
1. Begin
2. Step-1: Select the number K to decide the number of clusters.
3. Step-2: Select K random points as the initial centroids. (They need not be points from
the input dataset.)
4. Step-3: Assign each data point to its closest centroid, which forms the predefined K
clusters.
5. Step-4: Calculate the variance and place a new centroid for each cluster.
6. Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid
of each cluster.
7. Step-6: If any reassignment occurred, go to Step-4; else go to Step-7.
8. Step-7: The model is ready.
9. End
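A compact Python sketch of these steps is given below. It is a simplified illustration, not the exact implementation behind these notes: the data is assumed to be a NumPy array of points, the initial centroids are drawn from the data itself, and the rare case of an empty cluster is not handled.

```python
import numpy as np

def k_means(points, k, seed=0):
    """Simplified K-means following Steps 2-7 above."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = np.full(len(points), -1)
    while True:
        # Step-3 / Step-5: assign each data point to its closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        # Step-6: if no reassignment occurred, the model is ready (Step-7).
        if np.array_equal(new_assignment, assignment):
            return centroids, assignment
        assignment = new_assignment
        # Step-4: recompute each centroid as the mean of its cluster.
        centroids = np.array([points[assignment == j].mean(axis=0) for j in range(k)])

# Example usage with hypothetical 2-D data:
# centroids, labels = k_means(np.random.rand(100, 2), k=3)
```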
From the above image, we can see that one yellow point is on the left side of the line, and
two blue points are to the right of the line. So, these three points will be assigned to new
centroids.
Working of K-Means Algorithm cont…
Self-Study