
Chapter 9
Unsupervised Learning

Introduction
• Unsupervised learning is a machine learning concept where the unlabeled and unclassified information is analysed to
discover hidden knowledge.

• The algorithms work on the data without any prior training, but they are constructed in such a way that they can identify
patterns, groupings, sorting orders, and other interesting knowledge hidden in the data.

• Unsupervised vs Supervised

• Unsupervised: the analysis may reveal interesting correlations between features or common behaviour within
subgroups of the data, which provides a better understanding of the data.

• Association
• Grouping

• Supervised

• Prediction
• Classification

• Example: pushing movie promotions to the correct group of people.


APPLICATIONS OF UNSUPERVISED LEARNING

• Segmentation of target consumer populations by an advertisement consulting agency on the basis of a few
dimensions such as demography, financial data, purchasing habits, etc., so that the advertisers can reach their
target consumers efficiently

• Anomaly or fraud detection in the banking sector by identifying the pattern of loan defaulters

• Image processing and image segmentation such as face recognition, expression identification, etc.

• Grouping of important characteristics in genes to identify important influencers in new areas of genetics

• Utilization by data scientists to reduce the dimensionalities in sample data to simplify modelling

• Document clustering and identifying potential labelling options

CLUSTERING

Clustering: finding subgroups, or clusters, in a data set such that the objects within a group are similar (or related)
to each other but different from (or unrelated to) the objects in the other groups.

Text data mining: this includes tasks such as text categorization, text
clustering, document summarization, concept extraction, sentiment
analysis, and entity relation modelling.

Customer segmentation: creating clusters of customers on the basis of parameters such as demographics, financial
conditions, buying habits, etc., which can be used by retailers and advertisers to promote their products in the
correct segment.

Anomaly checking: detection of anomalous behaviour such as fraudulent bank transactions, unauthorized computer
intrusion, suspicious movements on a radar scanner, etc.

Data mining: simplify the data mining task by grouping a large number of
features from an extremely large data set to make the analysis manageable

Clustering Techniques
• Partitioning methods
• Hierarchical methods
• Density-based methods

Partitioning Methods – K-Means

Advantages:

• K-means is relatively scalable and efficient in processing large data sets.

• The computational complexity of the algorithm is O(nkt)

• n: the total number of objects


• k: the number of clusters
• t: the number of iterations
• Normally: k<<n and t<<n

Disadvantages:

• Can be applied only when the mean of a cluster is defined

• Users need to specify k

• K-means is not suitable for discovering clusters with non-convex shapes or clusters of very different sizes

• It is sensitive to noise and outlier data points (can influence the mean value)
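To make the K-means procedure and its O(nkt) behaviour concrete, the following is a minimal sketch of Lloyd's K-means iteration in Python using only NumPy; the sample data, the value of k, and the convergence check are illustrative and are not taken from the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal Lloyd's k-means: roughly O(n*k) work per iteration, over t iterations."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):                                   # t iterations
        # Assign each of the n points to the nearest of the k centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster happens to become empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):                # converged
            break
        centroids = new_centroids
    return labels, centroids

# Illustrative data: two loose blobs in 2-D.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.4]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```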

Partitioning Methods – K-Medoids
• Minimize the sensitivity of k-means to outliers

• Pick actual objects to represent clusters instead of mean values

• Each remaining object is clustered with the representative object (medoid) to which it is the most similar

• The algorithm minimizes the sum of the dissimilarities between each object and its corresponding
representative object

E: the sum of absolute error for all objects in the data set
p: the data point in the space representing an object
o_i: the representative object of cluster C_i
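The error criterion referred to above appears as a figure in the original slide; reconstructed in the standard k-medoids form it is

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - o_i \rvert

i.e. the sum, over all k clusters, of the dissimilarities between each object p in cluster C_i and that cluster's medoid o_i.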

9.4.3 Partitioning methods

Partitioning methods

Elbow method

• This method tries to measure the homogeneity or heterogeneity within the clusters for various values of ‘K’ and
helps in arriving at the optimal ‘K’.

• From Figure 9.5, we can see that the homogeneity will increase (or heterogeneity will decrease) with increasing ‘K’,
as the number of data points inside each cluster reduces with this increase.
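A minimal way to produce the elbow curve described above, assuming scikit-learn and matplotlib are available; the data here is illustrative. The within-cluster sum of squares (exposed as KMeans.inertia_) is plotted against K, and the "elbow" of the curve suggests the optimal K.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Illustrative 2-D data: three loose groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in [(0, 0), (5, 5), (0, 5)]])

wcss = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # within-cluster sum of squares (homogeneity measure)

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```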

Partitioning Around Medoids (PAM)

Example: Partitioning Around Medoids (PAM)

• The clusters made with medoids (3, 4) and (7, 3) are as follows:

• Points in cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)}

• Points in cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}

• After assigning clusters, we will calculate the cost for each cluster and find their sum.

• The cost is nothing but the sum of distances of all the data points from the medoid of the cluster they belong to.

• Hence, the cost for the current clustering will be 3+4+4+2+2+0+1+3+3+0 = 22.
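The per-point costs listed above correspond to the Manhattan (city-block) distances to the medoid. For cluster 1 with medoid (3, 4):

d((2,6),(3,4)) = |2-3| + |6-4| = 3, \quad d((3,8),(3,4)) = 4, \quad d((4,7),(3,4)) = 4, \quad d((3,4),(3,4)) = 0

and for cluster 2 with medoid (7, 3) the distances are 1, 2, 2, 0, 3 and 3, giving a total cost of 11 + 11 = 22.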

Example: Partitioning Around Medoids (PAM)

Iteration 2
Now, we will select another non-medoid point (7, 4) and make it a temporary medoid for the second cluster. Hence,
•M1 = (3, 4)
•M2 = (7, 4)
Now, let us calculate the distance between all the data points and the current medoids.

• The data points haven’t changed in the clusters after changing the medoids. Hence, clusters are:
Points in cluster 1: {(2, 6), (3, 8), (4, 7), (3, 4)}
Points in cluster 2: {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}

• The total cost this time will be 3+4+4+3+1+1+0+2+2+0 = 20.

• Here, the current cost is less than the cost calculated in the previous iteration. Hence, we will make the swap
permanent and make (7, 4) the medoid for cluster 2.

• New medoids after this iteration are (3, 4) and (7, 4), with no change in the clusters.
Example: Partitioning Around Medoids (PAM)

Iteration 3
Now, let us again change the medoid of cluster 2 to (6, 4). Hence, the new medoids for the clusters are M1=(3, 4) and M2=
(6, 4 ).
Let us calculate the distance between the data points and the above medoids to find the new cluster. The results have been
tabulated as follows.
Again, the clusters haven’t changed. Hence, clusters are:

•Points in cluster1:{(2, 6), (3, 8), (4, 7), (3, 4)}

•Points in cluster 2:{(7,4), (6,2), (6, 4), (7,3), (8,5), (7,6)}

Now, let us again calculate the cost for each cluster and find their sum. The total cost this time will be
3+4+4+2+0+2+1+3+3+0 = 22.

The current cost is 22, which is greater than the cost in the previous iteration, i.e. 20.

Hence, we will revert the change and the point (7, 4) will again
be made the medoid for cluster 2.
Example: Partitioning Around Medoids (PAM)

• So, the clusters after this iteration will be cluster1 = {(2, 6), (3, 8), (4, 7), (3, 4)} and cluster 2= {(7,4), (6,2), (6, 4), (7,3),
(8,5), (7,6)}. The medoids are (3,4) and (7,4).

• We keep replacing the medoids with a non-medoid data point.

• The set of medoids (and the associated clusters) for which the cost is the least is made permanent. So, after all the
iterations, you will get the final clusters and their medoids.

• The K-Medoids clustering algorithm is a computation-intensive algorithm that requires many iterations.

• In each iteration, we need to calculate the distance between the medoids and the data points, assign clusters, and
compute the cost.

• Hence, K-Medoids clustering is not well suited for large data sets.
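The swap-and-compare procedure described above can be sketched as follows. This is a simplified greedy swap search in Python (not the exact PAM pseudocode from the textbook), using Manhattan distance and the ten points of the running example; note that, because it tries every possible swap, it may end with a total cost lower than the value 20 reached in the few iterations shown on the slides.

```python
import numpy as np
from itertools import product

# The ten points of the running example; indices 3 and 7 are (3, 4) and (7, 3).
X = np.array([[2, 6], [3, 8], [4, 7], [3, 4], [7, 4],
              [6, 2], [6, 4], [7, 3], [8, 5], [7, 6]])

def total_cost(X, medoid_idx):
    """Sum of Manhattan distances from every point to its nearest medoid."""
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

medoids = [3, 7]                       # start from medoids (3, 4) and (7, 3); cost = 22
best = total_cost(X, medoids)
improved = True
while improved:
    improved = False
    # Try swapping each medoid with each non-medoid point and keep the cheapest configuration.
    for pos, cand in product(range(len(medoids)), range(len(X))):
        if cand in medoids:
            continue
        trial = list(medoids)
        trial[pos] = cand
        cost = total_cost(X, trial)
        if cost < best:
            best, medoids, improved = cost, trial, True

print(best, [tuple(X[i]) for i in medoids])
```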

Different Implementations of K-Medoids Clustering

The K-Medoids clustering algorithm has different implementations such as PAM, CLARA, and CLARANS. (Read on your
own, as it is beyond the scope of this topic.)

Introduction to hierarchical clustering methods
• Situations arise when the data needs to be partitioned into groups at different levels such as in a hierarchy.

• The hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.

• For example, in a machine learning problem of organizing employees of a university in different departments,
first the employees are grouped under the different departments in the university, and then within each
department, the employees can be grouped according to their roles such as professors, assistant professors,
supervisors, lab assistants, etc. This creates a hierarchical structure of the employee data and eases
visualization and analysis.

• There are two main hierarchical clustering methods: agglomerative clustering and divisive clustering.

• Agglomerative clustering is a bottom-up technique which starts with individual objects as clusters and then
iteratively merges them to form larger clusters.

• On the other hand, the divisive method starts with one cluster with all given objects and then splits it iteratively
to form smaller clusters.

Introduction to hierarchical clustering methods
A dendrogram is a commonly used tree-structure representation of the step-by-step creation of hierarchical
clustering.

It shows how the clusters are merged iteratively (in the case of agglomerative clustering) or split iteratively (in the
case of divisive clustering) to arrive at the optimal clustering solution.

Dendrogram representation of hierarchical clustering
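A dendrogram like the one described above can be produced with SciPy; the following is a minimal sketch, and the data points are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative 2-D points; each starts as its own cluster (agglomerative, bottom-up).
X = np.array([[1, 2], [2, 2], [1, 1], [8, 8], [9, 8], [8, 9], [5, 5]])

Z = linkage(X, method="average")   # iteratively merge the two closest clusters
dendrogram(Z, labels=[f"p{i}" for i in range(len(X))])
plt.ylabel("Merge distance")
plt.title("Dendrogram (agglomerative clustering)")
plt.show()
```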
Distance Measures
One of the core measures of proximities between clusters is the distance between them.

There are four standard methods to measure the distance between clusters:
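The four measures appear as formulas in a figure on the original slide; as the next slide indicates, they are the minimum, maximum, mean, and average distances. For clusters C_i and C_j with means m_i, m_j and sizes n_i, n_j, they can be written as:

D_{min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} \lvert p - q \rvert

D_{max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} \lvert p - q \rvert

D_{mean}(C_i, C_j) = \lvert m_i - m_j \rvert

D_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} \lvert p - q \rvert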

Clustering Algorithms
• Distance measure is used to decide when to terminate the clustering algorithm

• For example, in agglomerative clustering, the merging can be stopped when the MIN distance between two
neighbouring clusters becomes less than the user-defined threshold.

• When an algorithm uses the minimum distance Dmin to measure the distance between the clusters, it is referred to
as the nearest neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined
limit on Dmin, then it is called the single linkage algorithm.

• When an algorithm uses the maximum distance Dmax to measure the distance between the clusters, it is referred to
as the furthest neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined
limit on Dmax, then it is called the complete linkage algorithm.

• As minimum and maximum measures provide two extreme options to measure distance between the
clusters, they are prone to the outliers and noisy data.

• Instead, the use of mean and average distance helps in avoiding such problem and provides more
consistent results.
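As a brief sketch of how the choice of linkage (single uses Dmin, complete uses Dmax, average uses the average distance) is expressed with scikit-learn's AgglomerativeClustering; the data below is illustrative.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Illustrative data: two elongated groups plus one far-away point.
X = np.array([[0, 0], [1, 0], [2, 0], [3, 0],
              [0, 5], [1, 5], [2, 5], [10, 10]])

for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    # single linkage merges on Dmin, complete on Dmax, average on Davg
    print(linkage, model.labels_)
```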

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Density-based methods – DBSCAN: in the case of other cluster shapes, such as S-shaped or unevenly shaped
clusters, the above two types of methods do not provide accurate results.

• The density-based clustering approach provides a solution to identify clusters of arbitrary shapes.

• The principle is based on identifying the dense areas and sparse areas within the data set and then running the
clustering algorithm.

• DBSCAN is one of the popular density-based algorithms, which creates clusters by using connected regions of
high density.

• Clusters are dense regions in the data space, separated by regions of lower point density.

• The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”.

• The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at
least a minimum number of points.
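A minimal scikit-learn sketch of this idea; the data, eps, and min_samples (the minimum number of points required in the eps-neighbourhood) are illustrative values, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two dense groups plus one isolated (noise) point.
X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3], [1.1, 0.8],
              [5.0, 5.0], [5.2, 5.1], [4.9, 5.3], [5.1, 4.8],
              [9.0, 0.0]])

db = DBSCAN(eps=0.6, min_samples=3).fit(X)
# Label -1 marks noise points; the other labels are cluster ids.
print(db.labels_)
```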

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape, such as those shown in the figure below.
2. Data may contain noise.
Parameters Required For DBSCAN Algorithm

• Eps: it defines the neighborhood around a data point, i.e. if the distance between two points is lower than or equal
to ‘eps’ then they are considered neighbors.

• If the eps value is chosen too small then a large part of the data will
be considered as an outlier.

• If it is chosen very large then the clusters will merge and the majority of the data points will be in the
same cluster.

• One way to find the eps value is based on the k-distance graph.

• MinPts: as a general rule, the minimum MinPts can be derived from the number of dimensions D in the dataset as
MinPts >= D + 1. The minimum value of MinPts must be at least 3.
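A common way to build the k-distance graph mentioned above is to compute each point's distance to its k-th nearest neighbour, sort those distances, and look for the "knee" of the curve. A minimal sketch assuming scikit-learn; the sample data is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    """Plot the sorted distances from each point to its k-th nearest neighbour."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because the nearest neighbour is the point itself
    dists, _ = nn.kneighbors(X)
    kth = np.sort(dists[:, -1])                       # distance to the k-th real neighbour, sorted
    plt.plot(kth)
    plt.xlabel("Points sorted by k-distance")
    plt.ylabel(f"Distance to {k}-th nearest neighbour")
    plt.show()                                        # the 'knee' suggests a value for eps

# Illustrative data.
X = np.random.default_rng(0).normal(size=(100, 2))
k_distance_plot(X, k=4)
```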
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• In this algorithm, we have 3 types of data points.
• Core Point: A point is a core point if it has more than MinPts points within eps.
• Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood of a core
point.
• Noise or outlier: A point which is not a core point or border point.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
There are three points in the data set, (2.8, 4.5), (1.2, 2.5) and (1, 2.5), that have 4 neighbourhood points around
them; hence they would be called core points and, as already mentioned, if the core point is not assigned to any
cluster, a new cluster is formed.

Hence, (2.8, 4.5) is assigned to a new cluster, Cluster 1, and so is the point (1.2, 2.5), Cluster 2. Also observe that the
core points (1.2, 2.5) and (1, 2.5) share at least one common neighbourhood point, (1, 2), so they are assigned to the
same cluster.
As evident from the above table, the point (1, 2) has only two other points in its neighborhood, (1, 2.5) and (1.2, 2.5),
for the assumed value of eps; as this is less than MinPts, we can’t declare it a core point. Let’s repeat the above
process for every point in the dataset and find out the neighborhood of each.

The table below shows the categorization of all the data points into core, border and outlier points.
FINDING PATTERN USING ASSOCIATION RULE

Association analysis: finding sets of frequent items.

Application: Market Basket Analysis, which retailers use for cross-selling of their products.

Another example: relating food habits to types of diseases.


1. Itemset: a collection of one or more items, e.g. {Bread, Milk, Egg}.

2. Support count: the number of transactions in which a particular itemset is present. This is a very important
property of an itemset as it denotes the frequency of occurrence for the itemset. {Bread, Milk, Egg} occurs together
three times and thus has a support count of 3.

3. Association rule: a typical rule might be expressed as {Bread, Milk} → {Egg}, which denotes that if Bread and Milk
are purchased, then Egg is also likely to be purchased. Thus, association rules are learned from subsets of itemsets.
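A small pure-Python sketch of itemsets and support counts; the five transactions below are hypothetical, chosen so that {Bread, Milk, Egg} occurs together three times as in the example above.

```python
from itertools import combinations

# Hypothetical transactions; {Bread, Milk, Egg} occurs together three times.
transactions = [
    {"Bread", "Milk", "Egg"},
    {"Bread", "Milk", "Egg", "Butter"},
    {"Bread", "Milk", "Egg"},
    {"Bread", "Butter"},
    {"Milk", "Butter"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)

print(support_count({"Bread", "Milk", "Egg"}, transactions))   # 3, as in the slide's example

# Enumerate itemsets of size 2 and 3 and print those with support count >= 3.
all_items = sorted(set().union(*transactions))
for size in (2, 3):
    for itemset in combinations(all_items, size):
        c = support_count(itemset, transactions)
        if c >= 3:
            print(set(itemset), "support count =", c)
```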
FINDING PATTERN USING ASSOCIATION RULE

• Support and confidence are the two concepts that are used for measuring the strength of an association rule.

• Support denotes how often a rule is applicable to a given data set.

• Confidence indicates how often the items in Y appear in transactions that contain X, out of a total of N transactions.

• Confidence denotes the predictive power or accuracy of the rule.
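The support and confidence formulas themselves appear as a figure in the original slide; in the standard form, for a rule X → Y over N transactions, with σ(·) denoting the support count:

\mathrm{support}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N}

\mathrm{confidence}(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}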

In our data, if we consider the association rule {Bread, Milk} → {Egg}, then the support and confidence of this rule
can be computed from the formulas above using the corresponding support counts.
