
Machine Learning

Unit 4: Unsupervised Learning and Association Mining

Hierarchical Clustering: Hierarchical clustering is an unsupervised clustering technique that builds clusters in a predefined order, arranged from top to bottom. In this type of clustering, similar clusters are grouped together and organized in a hierarchy. It can be further divided into two types, namely agglomerative hierarchical clustering and divisive hierarchical clustering. In this clustering, pairs of clusters are linked successively until every data object is included in the hierarchy.

Non-Hierarchical Clustering: Non-hierarchical clustering involves the formation of new clusters by merging or splitting existing clusters rather than following a tree-like structure. This technique groups the data so as to maximize or minimize some evaluation criterion. K-means clustering is an effective method of non-hierarchical clustering. In this method, the data are partitioned into non-overlapping groups that have no hierarchical relationships between themselves.

Difference between Hierarchical Clustering and Non-Hierarchical Clustering:

1. Hierarchical Clustering: Involves creating clusters in a predefined order from top to bottom.
   Non-Hierarchical Clustering: Involves forming new clusters by merging or splitting clusters instead of following a hierarchical order.

2. Hierarchical Clustering: It is considered less reliable than non-hierarchical clustering.
   Non-Hierarchical Clustering: It is comparatively more reliable than hierarchical clustering.

3. Hierarchical Clustering: It is considered slower than non-hierarchical clustering.
   Non-Hierarchical Clustering: It is comparatively faster than hierarchical clustering.

4. Hierarchical Clustering: It is very problematic to apply this technique when the data contain a high level of error.
   Non-Hierarchical Clustering: It can work better than hierarchical clustering even when error is present.

5. Hierarchical Clustering: It is comparatively easier to read and understand.
   Non-Hierarchical Clustering: The clusters are more difficult to read and understand than in hierarchical clustering.

6. Hierarchical Clustering: It is relatively unstable compared to non-hierarchical clustering.
   Non-Hierarchical Clustering: It is a relatively stable technique.

Hierarchical clustering is a popular unsupervised machine learning technique used to group similar data points into clusters based on their similarity or dissimilarity. It is called “hierarchical” because it creates a tree-like hierarchy of clusters, where each node represents a cluster that can be further divided into smaller sub-clusters.

There are two types of hierarchical clustering techniques:

 Agglomerative and
 Divisive clustering

Agglomerative Clustering

Agglomerative clustering is a type of hierarchical clustering algorithm that builds a hierarchy of clusters from the bottom up. It starts with each data point as its own cluster and then iteratively merges the most similar pairs of data points or clusters until all data points belong to a single cluster.

Divisive Clustering

Divisive clustering is the technique that starts with all data points in a single cluster and recursively splits the clusters into smaller sub-clusters based on their dissimilarity. It is also known as “top-down” clustering.

Unlike agglomerative clustering, which starts with each data point as its own cluster and
iteratively merges the most similar pairs of clusters, divisive clustering is a “divide and
conquer” approach that breaks a large cluster into smaller sub-clusters.
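As a minimal sketch of the agglomerative (bottom-up) variant, the example below uses scikit-learn's AgglomerativeClustering, assuming scikit-learn is installed; the sample data, the choice of two clusters, and Ward linkage are assumptions made for illustration.

# A minimal sketch of agglomerative (bottom-up) hierarchical clustering with scikit-learn.
# The data, n_clusters=2, and Ward linkage are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Ward linkage merges, at each step, the pair of clusters that least increases total variance.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # cluster label assigned to each point, e.g. [1 1 0 0 1 0]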

K-Means Clustering Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science.

K-Means Algorithm

K-Means Clustering is an unsupervised learning algorithm which groups an unlabelled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in the unlabelled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

 Determines the best value for K center points or centroids by an iterative process.
 Assigns each data point to its closest k-center. Those data points which are near to
the particular k-center, create a cluster.

Hence each cluster contains data points with some commonalities, and each cluster is kept apart from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
Working of K-Means Algorithm

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
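A minimal sketch of these steps using scikit-learn's KMeans is shown below, assuming scikit-learn is available; the sample data and K=2 are illustrative assumptions.

# A minimal sketch of the steps above using scikit-learn's KMeans (assumed installed).
# The data and K=2 are illustrative.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# n_init controls how many times the random centroid initialization is repeated
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # final centroids after the reassignment loop converges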

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:
Let's take the number of clusters K=2, i.e., we will try to group this dataset into two different clusters.

We need to choose K random points or centroids to form the clusters. These can be either points from the dataset or any other points. Here we select the below two points as centroids, which are not part of our dataset. Consider the below image:

Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute this by calculating the distance between each point and the two centroids, and then drawing a median line (perpendicular bisector) between the two centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are nearer to the K1 or blue centroid, and points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.

As we need to find the closest cluster, we repeat the process by choosing new centroids. The new centroid of each cluster is computed as the center of gravity (mean) of the points currently assigned to that cluster, giving new centroids as below:
Next, we will reassign each datapoint to the new centroid. For this, we will repeat the same
process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we again go to step-4, which is finding new centroids or K-points. We repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:

As we have got the new centroids, we again draw the median line and reassign the data points. So, the image will be:

We can see in the above image that no data points need to be reassigned on either side of the line, which means our model has converged. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:

K-Means Algorithm has a few limitations which are as follows:

 It only identifies spherical-shaped clusters, i.e., it cannot identify clusters that are non-spherical or of various sizes and densities.
 It suffers from local minima and has a problem when the data contains outliers.

Bisecting K-Means

Bisecting K-Means Algorithm is a modification of the K-Means algorithm. It is a hybrid approach between partitional and hierarchical clustering. It can recognize clusters of any shape and size. This algorithm is convenient because:

It beats K-Means in entropy measurement.

When K is big, bisecting k-means is more effective. Every data point in the data collection
and k centroids are used in the K-means method for computation. On the other hand,
only the data points from one cluster and two centroids are used in each Bisecting stage
of Bisecting k-means. As a result, computation time is shortened.

While k-means is known to yield clusters of varied sizes, bisecting k-means results in
clusters of comparable sizes.

Bisecting K-Means Algorithm:

Initialize the list of clusters to contain a single cluster consisting of all points.

repeat
    Remove a cluster from the list of clusters.
    { Perform several “trial” bisections of the selected cluster. }
    for i = 1 to number of trials do
        Bisect the selected cluster using basic K-Means (K = 2).
    end for
    Select the two clusters from the trial bisection with the lowest total SSE and add them to the list.
until the list of clusters contains K clusters


The working of this algorithm can be condensed into two steps.

Firstly, let us assume the number of clusters required at the final stage, ‘K’ = 3 (Any value
can be assumed, if not mentioned).

Step 01:

All points/objects/instances are put into 1 cluster.

Step 02:

Apply K-Means with K = 2 to bisect the cluster. The cluster ‘GFG’ is split into two clusters, ‘GFG1’ and ‘GFG2’. The required number of clusters hasn’t been obtained yet, so ‘GFG1’ is split further, since it has the higher SSE (the formula for SSE is given below).
The SSE (sum of squared errors) of a cluster C with centroid c is SSE(C) = Σ over all points x in C of (distance(x, c))². As we split the cluster ‘GFG’ into ‘GFG1’ and ‘GFG2’, we calculate the SSE of the two clusters separately using this formula. The cluster with the higher SSE will be split further; the cluster with the lower SSE contains comparatively less error and hence won’t be split further.

Here, if the calculation shows that the cluster ‘GFG1’ is the one with the higher SSE, we split it into (GFG1)′ and (GFG1)″. The number of clusters required at the final stage was given as 3, and we have now obtained 3 clusters.

If the required number of clusters is not obtained, we should continue splitting until they
are produced.
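A rough Python sketch of this procedure is given below, assuming scikit-learn is available for the basic K-Means bisections; the trial count, the rule of always splitting the cluster with the highest SSE, and the sample data are assumptions made for illustration. (Recent versions of scikit-learn also provide a built-in BisectingKMeans estimator.)

# A rough sketch of bisecting K-Means; scikit-learn is assumed to be installed.
import numpy as np
from sklearn.cluster import KMeans

def sse(points, centroid):
    """Sum of squared distances from points to their centroid."""
    return float(np.sum((points - centroid) ** 2))

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    rng = np.random.RandomState(seed)
    clusters = [X]                       # start with one cluster holding all points
    while len(clusters) < k:
        # pick the cluster with the highest SSE to split (one possible selection rule)
        sses = [sse(c, c.mean(axis=0)) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        best_split, best_sse = None, np.inf
        for _ in range(n_trials):        # several trial bisections, keep the best
            km = KMeans(n_clusters=2, n_init=1,
                        random_state=rng.randint(10**6)).fit(target)
            parts = [target[km.labels_ == j] for j in (0, 1)]
            total = sum(sse(p, p.mean(axis=0)) for p in parts)
            if total < best_sse:
                best_split, best_sse = parts, total
        clusters.extend(best_split)
    return clusters

X = np.vstack([np.random.randn(30, 2) + c for c in ([0, 0], [5, 5], [0, 6])])
print([len(c) for c in bisecting_kmeans(X, k=3)])  # sizes of the 3 resulting clusters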

K-Medoids clustering

K-Medoids and K-Means are two types of clustering mechanisms in partition clustering. Clustering is the process of breaking down an abstract group of data points/objects into classes of similar objects such that all the objects in one cluster have similar traits; a group of n objects is broken down into k clusters based on their similarities. Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up with this method.

K-Medoids is an unsupervised method with unlabelled data to be clustered. It is an improved version of the K-Means algorithm, mainly designed to deal with sensitivity to outlier data. Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to implement.

The partitioning will be carried on such that:

 Each cluster must have at least one object


 An object must belong to only one cluster

K-Medoids:

Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal.

(or)

A Medoid is a point in the cluster from which dissimilarities with all the other points in
the clusters are minimal.

Instead of centroids as reference points as in the K-Means algorithm, the K-Medoids algorithm takes a medoid as a reference point.
There are three types of algorithms for K-Medoids Clustering:

 PAM (Partitioning Around Medoids)
 CLARA (Clustering Large Applications)
 CLARANS (Clustering Large Applications based upon RANdomized Search)

PAM is the most powerful algorithm of the three algorithms but has the disadvantage of
time complexity. The following K-Medoids are performed using PAM. In the further parts,
we'll see what CLARA and CLARANS are.

Algorithm:

 Given the value of k and unlabelled data:


 Choose k number of random points from the data and assign these k points to k
number of clusters. These are the initial medoids.
 For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.
 Calculate the total cost (Sum of all the distances from all the data points to the
medoids)
 Select a random point as the new medoid and swap it with the previous medoid.
Repeat steps 2 and 3.
 If the total cost of the new medoid is less than that of the previous medoid, make
the new medoid permanent and repeat step 4.
 If the total cost of the new medoid is greater than the cost of the previous medoid,
undo the swap and repeat step 4.
 The Repetitions have to continue until no change is encountered with new medoids
to classify data points.
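A rough NumPy sketch of these steps is shown below; the random-swap loop, the fixed iteration cap, and the sample data are simplifying assumptions for illustration rather than the canonical PAM implementation.

# A rough sketch of PAM-style K-Medoids with a random swap step, using only NumPy.
# Variable names and the stopping rule (fixed iteration cap) are assumptions.
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    medoids = rng.choice(len(X), size=k, replace=False)   # step 1: random initial medoids
    cost = total_cost(X, medoids)
    for _ in range(max_iter):
        # try swapping one medoid with a random non-medoid point
        non_medoids = np.setdiff1d(np.arange(len(X)), medoids)
        i, j = rng.randint(k), rng.choice(non_medoids)
        candidate = medoids.copy()
        candidate[i] = j
        new_cost = total_cost(X, candidate)
        if new_cost < cost:                                # keep the swap only if cost drops
            medoids, cost = candidate, new_cost
    # final assignment of each point to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, d.argmin(axis=1)

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
medoids, labels = k_medoids(X, k=2)
print("medoid indices:", medoids)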

CLARA:

It is an extension of PAM that supports medoid clustering for large data sets. This algorithm selects data samples from the data set, applies PAM on each sample, and outputs the best clustering out of these samples. This is more effective than PAM for large data sets. We should ensure that the selected samples aren't biased, as they affect the clustering of the whole data.

CLARANS:

This algorithm selects a sample of neighbors to examine instead of selecting samples from the data set. In every step, it examines the neighbors of every node. The time complexity of this algorithm is O(n²), and it is the most efficient medoids algorithm of the three.
Advantages of using K-Medoids:

 Deals with noise and outlier data effectively


 Easily implementable and simple to understand
 Faster compared to other partitioning algorithms

Disadvantages:

 Not suitable for Clustering arbitrarily shaped groups of data points.


 As the initial medoids are chosen randomly, the results might vary based on the
choice in different runs.

Association Rule Learning

Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly, so that the relationship can be exploited, for example to make product placement more profitable. It tries to find interesting relations or associations among the variables of a dataset. It is based on different rules to discover the interesting relations between variables in the database.

Association rule learning is one of the very important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by various big retailers to discover associations between items. We can understand it by taking the example of a supermarket, where products that are frequently purchased together are placed together.

For example, if a customer buys bread, he most likely can also buy butter, eggs, or milk,
so these products are stored within a shelf or mostly nearby. Consider the below diagram:
Working of Association Rule Learning

Association rule learning works on the concept of an If-Then statement, such as: if A then B.

Here the If element is called the antecedent, and the Then element is called the consequent. A relationship in which we find an association between exactly two items is known as single cardinality. Association rule learning is all about creating rules, and as the number of items increases, cardinality also increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:

 Support
 Confidence
 Lift

Support

Support is the frequency of A, or how frequently an item appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a set of transactions T, it can be written as:

Support(X) = (Number of transactions in T containing X) / (Total number of transactions in T)

Confidence

Confidence indicates how often the rule has been found to be true, or how often the items X and Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift

Lift measures the strength of a rule and can be defined by the formula below:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It is the ratio of the observed support to the expected support if X and Y were independent of each other. It has three possible values:

Lift = 1: The occurrences of the antecedent and the consequent are independent of each other.

Lift > 1: It indicates the degree to which the two itemsets are dependent on each other.

Lift < 1: It tells us that one item is a substitute for the other, which means one item has a negative effect on the other.
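The three metrics can be computed directly from a list of transactions, as in the small sketch below; the transactions and the rule bread → butter are made-up examples.

# A small sketch computing support, confidence, and lift for a rule X -> Y
# from a list of transactions. The transactions below are made up for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"bread"}, {"butter"}
sup_xy = support(X | Y)
conf = sup_xy / support(X)                      # Confidence(X -> Y)
lift = sup_xy / (support(X) * support(Y))       # Lift(X -> Y)
print(f"support={sup_xy:.2f}, confidence={conf:.2f}, lift={lift:.2f}")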

Apriori Algorithm in Machine Learning

The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected. This algorithm uses a breadth-first search and a hash tree to calculate the itemset associations efficiently. It is an iterative process for finding the frequent itemsets in a large dataset.

This algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to find products that can be bought together. It can also be used in the healthcare field to find drug reactions for patients.

Frequent Itemset

Frequent itemsets are those itemsets whose support is greater than the threshold value or user-specified minimum support. This implies that if {A, B} is a frequent itemset, then A and B must individually be frequent itemsets as well.

Suppose there are two transactions: A = {1,2,3,4,5} and B = {2,3,7}; in these two transactions, 2 and 3 are the frequent items.

Note: To better understand the Apriori algorithm and related terms such as support and confidence, it is recommended to first understand association rule learning.
Steps for Apriori Algorithm

Below are the steps for the apriori algorithm:

Step-1: Determine the support of itemsets in the transactional database, and select the
minimum support and confidence.

Step-2: Take all the itemsets in the transactional database whose support value is higher than the minimum (selected) support value.

Step-3: Find all the rules of these subsets that have higher confidence value than the
threshold or minimum confidence.

Step-4: Sort the rules in decreasing order of lift.
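As a hedged sketch, these steps can be reproduced with the mlxtend library (assuming it is installed); the transactions below are illustrative and are not the dataset of the worked example that follows.

# A hedged sketch of the Apriori workflow using mlxtend (assumed installed).
# The transactions are illustrative only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "C", "D"]]

# one-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Steps 1-3: frequent itemsets above minimum support, then rules above minimum confidence
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)

# Step 4: sort the rules in decreasing order of lift
print(rules.sort_values("lift", ascending=False)[["antecedents", "consequents",
                                                  "support", "confidence", "lift"]])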

Apriori Algorithm Working

We will understand the apriori algorithm using an example and mathematical calculation:

Example: Suppose we have the following dataset that has various transactions, and from
this dataset, we need to find the frequent itemsets and generate the association rules using
the Apriori algorithm:

Solution:

Step-1: Calculating C1 and L1:

In the first step, we will create a table that contains support count (The frequency of each
itemset individually in the dataset) of each itemset in the given dataset. This table is called
the Candidate set or C1.

Now, we will take out all the itemsets that have a support count greater than the Minimum Support (2). This will give us the table for the frequent itemset L1. Since all the itemsets except {E} have a support count greater than or equal to the minimum support, the itemset {E} will be removed.
Step-2: Candidate Generation C2, and L2:

In this step, we will generate C2 with the help of L1. In C2, we will create the pair of the
itemsets of L1 in the form of subsets.

After creating the subsets, we will again find the support count from the main transaction
table of datasets, i.e., how many times these pairs have occurred together in the given
dataset. So, we will get the below table for C2:

Again, we need to compare the C2 Support count with the minimum support count, and
after comparing, the itemset with less support count will be eliminated from the table C2.
It will give us the below table for L2

Step-3: Candidate generation C3, and L3:

For C3, we will repeat the same two processes, but now we will form the C3 table with
subsets of three itemsets together, and will calculate the support count from the dataset.
It will give the below table:
Now we will create the L3 table. As we can see from the above C3 table, there is only one
combination of itemset that has support count equal to the minimum support count. So,
the L3 will have only one combination, i.e., {A, B, C}.

Step-4: Finding the association rules for the subsets:

To generate the association rules, first, we will create a new table with the possible rules from the obtained combination {A, B, C}. For each rule X → Y, we will calculate the Confidence using the formula sup(X ^ Y)/sup(X). After calculating the confidence value for all rules, we will exclude the rules that have a confidence lower than the minimum threshold (50%).

Consider the below table:

Rules        Support   Confidence
A ^ B → C    2         sup(A ^ B ^ C)/sup(A ^ B) = 2/4 = 0.50 = 50%
B ^ C → A    2         sup(B ^ C ^ A)/sup(B ^ C) = 2/4 = 0.50 = 50%
A ^ C → B    2         sup(A ^ C ^ B)/sup(A ^ C) = 2/4 = 0.50 = 50%
C → A ^ B    2         sup(C ^ A ^ B)/sup(C) = 2/5 = 0.40 = 40%
A → B ^ C    2         sup(A ^ B ^ C)/sup(A) = 2/6 = 0.33 = 33.33%
B → A ^ C    2         sup(B ^ A ^ C)/sup(B) = 2/7 = 0.29 = 28.57%

As the given threshold or minimum confidence is 50%, so the first three rules A ^B → C,
B^C → A, and A^C → B can be considered as the strong association rules for the given
problem.

Advantages of Apriori Algorithm


 This is an easy-to-understand algorithm.
 The join and prune steps of the algorithm can be easily implemented on large
datasets.

Disadvantages of Apriori Algorithm

 The Apriori algorithm is slow compared to other algorithms.
 The overall performance can be reduced because it scans the database multiple times.
 The time complexity and space complexity of the Apriori algorithm are O(2^D), which is very high. Here D represents the horizontal width (number of distinct items) present in the database.

FP -Growth

The FP-Growth Algorithm was proposed by Han. It is an efficient and scalable method for mining the complete set of frequent patterns by pattern-fragment growth, using an extended prefix-tree structure, named the frequent-pattern tree (FP-tree), for storing compressed and crucial information about frequent patterns. In his study, Han showed that his method outperforms other popular methods for mining frequent patterns, e.g. the Apriori Algorithm and Tree Projection. In later works, it was shown that FP-Growth performs better than other methods, including Eclat and Relim. The popularity and efficiency of the FP-Growth Algorithm have led to many studies that propose variations to improve its performance.

FP Growth Algorithm

The FP-Growth Algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named the frequent-pattern tree (FP-tree), which retains the itemset association information.

This algorithm works as follows:

 First, it compresses the input database creating an FP-tree instance to represent


frequent items.
 After this first step, it divides the compressed database into a set of conditional
databases, each associated with one frequent pattern.
 Finally, each such database is mined separately.
 Using this strategy, the FP-Growth reduces the search costs by recursively looking
for short patterns and then concatenating them into the long frequent patterns.
 In large databases, holding the FP tree in the main memory is impossible. A strategy
to cope with this problem is to partition the database into a set of smaller databases
(called projected databases) and then construct an FP-tree from each of these
smaller databases.
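A short sketch of FP-Growth-based frequent-itemset mining using the mlxtend library is given below (assuming the library is installed); the transactions and the minimum support are illustrative.

# A hedged sketch of frequent-pattern mining with FP-Growth via mlxtend (assumed installed).
# The transactions are illustrative only.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

transactions = [["milk", "bread", "butter"], ["bread", "butter"],
                ["milk", "bread"], ["milk", "butter"], ["bread", "eggs"]]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# FP-Growth builds an FP-tree internally; no candidate generation is needed.
print(fpgrowth(df, min_support=0.4, use_colnames=True))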

FP-Tree

The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then
mapped onto a path in the FP-tree. This is done until all transactions have been read.
Different transactions with common subsets allow the tree to remain compact because
their paths overlap.

A frequent Pattern Tree is made with the initial item sets of the database. The purpose of
the FP tree is to mine the most frequent pattern. Each node of the FP tree represents an
item of the item set.

The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other item
sets, are maintained while forming the tree.

Han defines the FP-tree as the tree structure given below:

One root is labelled as "null" with a set of item-prefix subtrees as children and a frequent-
item-header table.

Each node in the item-prefix subtree consists of three fields:

 Item-name: registers which item is represented by the node;


 Count: the number of transactions represented by the portion of the path reaching
the node;
 Node-link: links to the next node in the FP-tree carrying the same item name or null
if there is none.

Each entry in the frequent-item-header table consists of two fields:

 Item-name: the same as in the corresponding node;


 Head of node-link: a pointer to the first node in the FP-tree carrying the item name.
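The node and header-table fields described above can be captured in a small data structure, as in the sketch below; the class and function names are hypothetical, and the insertion logic is simplified for illustration.

# A minimal sketch of the FP-tree node and header-table fields described above.
# Field names mirror the description; insertion logic is simplified (names are hypothetical).
from dataclasses import dataclass, field
from typing import Optional, Dict

@dataclass
class FPNode:
    item_name: Optional[str]              # None for the root ("null") node
    count: int = 0                        # transactions represented by the path to this node
    node_link: Optional["FPNode"] = None  # next node carrying the same item name
    parent: Optional["FPNode"] = None
    children: Dict[str, "FPNode"] = field(default_factory=dict)

def insert_transaction(root: FPNode, items, header: Dict[str, FPNode]):
    """Map one (frequency-ordered) transaction onto a path of the FP-tree."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item_name=item, parent=node)
            node.children[item] = child
            # thread the new node onto the header table's node-link chain
            child.node_link, header[item] = header.get(item), child
        child.count += 1
        node = child

root, header = FPNode(item_name=None), {}
for t in [["A", "B", "C"], ["A", "B"], ["A", "C"]]:
    insert_transaction(root, t, header)
print(root.children["A"].count)   # 3 transactions share the prefix "A"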
Dimensionality Reduction

The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.

In many cases a dataset contains a huge number of input features, which makes the predictive modelling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

Dimensionality reduction technique can be defined as, "It is a way of converting the higher
dimensions dataset into lesser dimensions dataset ensuring that it provides similar
information." These techniques are widely used in machine learning for obtaining a better
fit predictive model while solving the classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modelling.

PCA works by considering the variance of each attribute, because an attribute with high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
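A brief sketch of PCA with scikit-learn is shown below, assuming scikit-learn is installed; the random data and the choice of two components are illustrative.

# A brief sketch of PCA with scikit-learn; the data and number of components are assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5) @ rng.randn(5, 5)       # 100 samples, 5 correlated features

pca = PCA(n_components=2)                     # keep the 2 directions of highest variance
Z = pca.fit_transform(X)                      # transformed, linearly uncorrelated components

print(Z.shape)                                # (100, 2)
print(pca.explained_variance_ratio_)          # fraction of variance captured by each component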

SVD, or Singular Value Decomposition, is one of several techniques that can be used to
reduce the dimensionality, i.e., the number of columns, of a data set. Why would we want
to reduce the number of dimensions? In predictive analytics, more columns normally
means more time required to build models and score data. If some columns have no
predictive value, this means wasted time, or worse, those columns contribute noise to the
model and reduce model quality or predictive accuracy.
Dimensionality reduction can be achieved by simply dropping columns, for example, those
that may show up as collinear with others or identified as not being particularly predictive
of the target as determined by an attribute importance ranking technique. But it can also
be achieved by deriving new columns based on linear combinations of the original
columns. In both cases, the resulting transformed data set can be provided to machine
learning algorithms to yield faster model build times, faster scoring times, and more
accurate models.

While SVD can be used for dimensionality reduction, it is often used in digital signal
processing for noise reduction, image compression, and other areas.
SVD is an algorithm that factors an m x n matrix, M, of real or complex values into three
component matrices, where the factorization has the form USV*. U is an m x p matrix. S is
a p x p diagonal matrix. V is an n x p matrix, with V* being the transpose of V, a p x
n matrix, or the conjugate transpose if M contains complex values. The value p is called
the rank. The diagonal entries of S are referred to as the singular values of M. The columns
of U are typically called the left-singular vectors of M, and the columns of V are called the
right-singular vectors of M.

Consider the following visual representation of these matrices:

One of the features of SVD is that given the decomposition of M into U, S, and V, one can
reconstruct the original matrix M, or an approximation of it. The singular values in the
diagonal matrix S can be used to understand the amount of variance explained by each of
the singular vectors.
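A short NumPy sketch of this idea is given below; the matrix and the chosen rank k are illustrative. It factors a matrix, reconstructs it exactly from U, S, and V*, and then forms a rank-k approximation using only the largest singular values.

# A short sketch of SVD with NumPy: exact reconstruction and a rank-k approximation.
# The matrix and k are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
M = rng.randn(6, 4)

U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U @ diag(s) @ Vt
M_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(M, M_rebuilt))                   # True: exact reconstruction

k = 2                                              # keep the 2 largest singular values
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
explained = (s[:k] ** 2).sum() / (s ** 2).sum()    # variance explained by the kept components
print(f"rank-{k} approximation explains {explained:.1%} of the total variance")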
