
Amity School of Engineering & Technology

Module-1
Clustering and Classification
Data Analytics CSE704

By: Dr. Ghanshyam Prasad Dubey



Syllabus
• Module II: Clustering and Classification (6 Hours)
– Analytical Theory and Methods: Overview of Clustering – K-means –
Use Cases – Overview of the Method – Determining the Number of
Clusters – Diagnostics – Reasons to Choose and Cautions.
– Classification: Decision Trees – Overview of a Decision Tree –
The General Algorithm – Decision Tree Algorithms – Evaluating a
Decision Tree

Introduction to Clustering
• Clustering methods are among the most useful
unsupervised ML methods.
• They find similarities and relationship patterns
among data samples and then group those samples
into clusters based on the similarity of their features.
• The aim of clustering is to separate the data into
groups with similar traits and assign each group to a cluster.

Cluster Formation Methods


• Density-based
• Hierarchical-based
• Partitioning
• Grid

Density-based
• In these methods, clusters are formed as
dense regions of data points separated by
sparser regions.
• The advantage of these methods is good
accuracy and a good ability to merge two
clusters.
• Ex. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN),
Ordering Points To Identify the Clustering
Structure (OPTICS), etc.

Hierarchical-based
• In these methods, the clusters form a
tree-like structure based on a hierarchy.
• There are two categories: Agglomerative
(bottom-up approach) and Divisive
(top-down approach).
• Ex. Clustering Using Representatives
(CURE), Balanced Iterative Reducing and
Clustering using Hierarchies (BIRCH), etc.
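
The agglomerative (bottom-up) approach can be sketched in a few lines. This is a minimal illustration, not CURE or BIRCH: it assumes one-dimensional points and single-linkage distance (the smallest gap between any two members), and the function name `single_linkage` is made up for this example.

```python
from itertools import combinations

def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain (1-D points)."""
    clusters = [[p] for p in points]          # start: every point is its own cluster
    while len(clusters) > num_clusters:
        # pick the pair of clusters whose closest members are nearest to each other
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                abs(a - b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])       # merge cluster j into cluster i (i < j)
        del clusters[j]
    return [sorted(c) for c in clusters]

result = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=3)
print(result)
```

Stopping the merges earlier or later corresponds to cutting the tree at a different height.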

Partitioning
• In these methods, the clusters are formed
by partitioning the objects into k clusters.
• The number of clusters equals the
number of partitions.
• Ex. K-means, Clustering Large
Applications based upon Randomized
Search (CLARANS).

Grid
• In these methods, the data space is divided
into a grid-like structure of cells, and the
clusters are formed on this grid.
• The advantage of these methods is that the
clustering operations performed on the grid
are fast and independent of the number of
data objects.
• Ex. Statistical Information Grid (STING),
Clustering in Quest (CLIQUE).

Types of ML Clustering Algorithms

• K-means Clustering
• Mean-Shift Algorithm
• Hierarchical Clustering

K-means Clustering
• This clustering algorithm computes
centroids and iterates until it finds the
optimal centroids.
• It assumes that the number of clusters is
already known.
• It is also called a flat clustering algorithm.
• The ‘K’ in K-means is the number of
clusters the algorithm identifies in the data.

Mean-Shift Algorithm
• It is another powerful clustering algorithm
used in unsupervised learning.
• Unlike K-means clustering, it does not
assume a fixed number of clusters in advance;
hence it is a non-parametric algorithm.
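
A rough sketch of the idea, assuming one-dimensional data and a flat (uniform) window: every point is repeatedly moved to the mean of the data inside its window, and points that end at the same location form a cluster. The `bandwidth` parameter and the function name are assumptions of this example; no cluster count is passed in, but the window width still has to be chosen.

```python
def mean_shift_1d(points, bandwidth, iterations=50):
    """Shift each point to the mean of the data within `bandwidth` of it."""
    modes = list(points)
    for _ in range(iterations):
        shifted = []
        for m in modes:
            window = [p for p in points if abs(p - m) <= bandwidth]
            shifted.append(sum(window) / len(window))   # move to the local mean
        modes = shifted
    # points that converged to the same location belong to the same cluster
    return sorted({round(m, 3) for m in modes})

modes = mean_shift_1d([1.0, 1.5, 2.0, 8.0, 8.5, 9.0], bandwidth=2.0)
print(modes)
```

Here the number of clusters is the number of distinct locations returned, so it comes out of the data rather than being supplied.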

Hierarchical Clustering
• It is another unsupervised learning
algorithm that is used to group together
the unlabeled data points having similar
characteristics.

Applications of Clustering
• Data summarization and compression −
– Clustering is widely used wherever data
summarization, compression, and reduction
are required. Examples are image processing
and vector quantization.
• Collaborative systems and customer
segmentation −
– Since clustering can find similar products or
similar kinds of users, it can be used in
collaborative systems and customer
segmentation.

• Serve as a key intermediate step for other data mining
tasks −
– Cluster analysis can generate a compact summary of data for
classification, testing, and hypothesis generation; hence, it
serves as a key intermediate step for other data mining tasks.
• Trend detection in dynamic data −
– Clustering can also be used for trend detection in dynamic
data by forming clusters of similar trends.
• Social network analysis −
– Clustering can be used in social network analysis. Examples
are generating sequences in images, videos, or audio.
• Biological data analysis −
– Clustering can also group images and videos; hence it can be
used successfully in biological data analysis.

K Means Clustering
• K-Means Clustering is an unsupervised learning
algorithm used to solve clustering problems in machine
learning and data science. It groups an unlabeled
dataset into different clusters.
• Here K is the pre-defined number of clusters to be
created in the process: if K=2, there will be two clusters;
if K=3, there will be three clusters; and so on.
• It is a centroid-based algorithm, where each cluster is
associated with a centroid. The main aim of the
algorithm is to minimize the sum of distances between
the data points and the centroids of their corresponding
clusters.
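
Written as a formula, the quantity being minimized is usually taken to be the within-cluster sum of squared Euclidean distances. The squared distance is an assumption of the standard formulation rather than something the slide states, and the symbols C_j (the set of points in cluster j) and \mu_j (its centroid) are notation introduced here:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```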

• The algorithm takes the unlabeled dataset as
input, divides the dataset into k clusters, and
repeats the process until the cluster assignments
stop changing, that is, until it finds the best clusters.
• The value of k must be chosen before running the
algorithm.
• The k-means clustering algorithm mainly
performs two tasks:
– Determines the best positions for the K center
points, or centroids, by an iterative process.
– Assigns each data point to its closest k-center.
The data points nearest to a particular k-center
form a cluster.

K-Means Algorithm
• Step-1: Select the number K to decide the number of
clusters.
• Step-2: Select K random points as the initial centroids.
(They need not be points from the input dataset.)
• Step-3: Assign each data point to its closest centroid,
which forms the K predefined clusters.
• Step-4: Calculate the variance and place the new centroid
of each cluster at the mean of its points, the position
that minimizes the within-cluster variance.
• Step-5: Repeat Step-3, that is, reassign each data point
to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4;
otherwise go to FINISH.
• Step-7: The model is ready.
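
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: points are (x, y) tuples, the initial centroids are sampled from the data, distance is squared Euclidean, and the names `kmeans`, `sq_dist`, `seed` and `max_iter` are made up for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two (x, y) points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of (x, y) tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # Step-2: K starting centroids
    labels = None
    for _ in range(max_iter):
        # Step-3 / Step-5: assign every point to its closest centroid
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:                 # Step-6: nothing moved, so finish
            break
        labels = new_labels
        # Step-4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                          # an empty cluster keeps its old centroid
                centroids[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Which cluster gets label 0 depends on the random start, so only the grouping is meaningful, not the label values.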

• Suppose we have two variables, M1 and M2. The x-y
scatter plot of these two variables is given below:

• Let's take the number of clusters K=2, so that we
group the data points into two different clusters.
Following Step-2, two random K-points are taken as
the initial centroids.

• Now we will assign each data point of the scatter plot to
its closest K-point or centroid, using the usual formula for
the distance between two points. Equivalently, we can draw
the perpendicular bisector of the segment joining the two
centroids (called the median line here): every point is
closest to the centroid on its own side of that line.
Consider the below image:
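
A small check, using made-up centroids and points, that assigning by distance and assigning by the side of the line give the same result for two centroids. The line used here is the perpendicular bisector of the segment joining the centroids; all names and coordinates are hypothetical.

```python
def closest_centroid(p, c1, c2):
    """Return 0 if p is at least as near to c1 as to c2, else 1."""
    d1 = (p[0] - c1[0]) ** 2 + (p[1] - c1[1]) ** 2
    d2 = (p[0] - c2[0]) ** 2 + (p[1] - c2[1]) ** 2
    return 0 if d1 <= d2 else 1

def side_of_bisector(p, c1, c2):
    """Return 0 if p is on c1's side of (or on) the perpendicular bisector, else 1."""
    mid = ((c1[0] + c2[0]) / 2, (c1[1] + c2[1]) / 2)
    # sign of the projection of (p - mid) onto the direction c1 -> c2
    proj = (p[0] - mid[0]) * (c2[0] - c1[0]) + (p[1] - mid[1]) * (c2[1] - c1[1])
    return 0 if proj <= 0 else 1

c1, c2 = (2.0, 2.0), (6.0, 4.0)                  # hypothetical centroids
points = [(1, 1), (3, 5), (5, 2), (7, 5), (4, 3)]
by_distance = [closest_centroid(p, c1, c2) for p in points]
by_line = [side_of_bisector(p, c1, c2) for p in points]
print(by_distance, by_line)
```

With more than two centroids a single line is no longer enough, which is why implementations compare distances directly.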

• From the above image, it is clear that the points to the
left of the line are nearer to K1, the blue centroid, and the
points to the right of the line are nearer to the yellow
centroid. Let's color them blue and yellow for clear
visualization.

• As we need to find the closest cluster, we will repeat
the process by choosing new centroids. To choose the
new centroids, we compute the center of gravity (mean)
of the points in each cluster and place the new centroids
there, as below:
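
As a worked example with made-up coordinates, the center of gravity of a cluster is simply the average of its points' coordinates:

```python
def centroid(cluster):
    """Mean (center of gravity) of a list of (x, y) points."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)

blue = [(1, 2), (2, 1), (3, 3)]                  # hypothetical cluster members
yellow = [(7, 8), (9, 8)]
new_blue, new_yellow = centroid(blue), centroid(yellow)
print(new_blue, new_yellow)
```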

• Next, we will reassign each data point to the new
closest centroid. For this, we repeat the same process of
finding a median line. The median line will look like the
below image:

• From the above image, we can see that one
yellow point is on the left side of the line
and two blue points are to the right of the line.
So, these three points will be assigned to
the new centroids.

• As reassignment has taken place, we will again
go to Step-4, which is finding new centroids or
K-points.
• We will repeat the process by finding the center of
gravity of the points in each cluster, so the new centroids
will be as shown in the below image:

• As we have the new centroids, we will again
draw the median line and reassign the
data points. So, the image will be:

• We can see in the above image that no data point
lies on the wrong side of the line, so no
reassignment occurs, which means our model is
formed. Consider the below image:

• As our model is ready, we can now
remove the assumed centroids, and the
two final clusters will be as shown in the
below image:

K-Means Use Cases

• Document Classification
– Cluster documents into multiple categories
based on tags, topics, and the content of the
document.
– Initial processing is needed to represent each
document as a vector, using term frequency to
identify commonly used terms that help
classify the document.
– The document vectors are then clustered to
help identify similarities among document
groups.
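
The vector representation mentioned above can be sketched as follows. The three documents are toy examples made up for illustration, and raw term counts over a shared vocabulary stand in for whatever weighting a real pipeline would use.

```python
from collections import Counter

docs = [                                         # hypothetical toy documents
    "cluster the data into groups",
    "groups of data form a cluster",
    "the tree splits on a feature",
]
vocab = sorted({word for d in docs for word in d.split()})

def tf_vector(doc):
    """Term-frequency vector of `doc` over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

vectors = [tf_vector(d) for d in docs]
print(sq_dist(vectors[0], vectors[1]), sq_dist(vectors[0], vectors[2]))
```

The two clustering-related documents end up closer to each other than to the third, which is the property k-means relies on when grouping the vectors.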

• Delivery Store Optimization
– Optimize the process of goods delivery using
truck drones by combining k-means, to find the
optimal number of launch locations, with a
genetic algorithm, to solve the truck route as a
traveling salesman problem.
• Identifying Crime Localities
– With data on crimes available for specific
localities in a city, the category of crime, the
area of the crime, and the association between
the two can give quality insight into
crime-prone areas within a city or a locality.

• Customer Segmentation
– Clustering helps marketers improve their customer
base, work on target areas, and segment customers
based on purchase history, interests, or activity
monitoring.
– The segmentation helps the company target
specific clusters of customers for specific campaigns.
• Fantasy League Stat Analysis
– Analyzing player stats has always been a critical
element of the sporting world, and with increasing
competition, machine learning has a critical role to
play here.
– If you would like to create a fantasy draft team and
identify similar players based on player stats,
k-means can be a useful option.
• Insurance Fraud Detection
– Machine learning has a critical role to play in fraud
detection and has numerous applications in automobile,
healthcare, and insurance fraud detection.
– Using historical data on fraudulent claims, it is
possible to isolate new claims based on their proximity to
clusters that indicate fraudulent patterns. Since insurance
fraud can potentially have a multi-million dollar impact on
a company, the ability to detect fraud is crucial.
• Rideshare Data Analysis
– The publicly available Uber ride information dataset
provides a large amount of valuable data on traffic,
transit time, peak pickup localities, and more. Analyzing
this data is useful not just in the context of Uber but also in
providing insight into urban traffic patterns and helping us
plan for the cities of the future.
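
For the insurance fraud use case, "proximity to clusters" can be illustrated by assigning a new claim to its nearest centroid. The centroids, the two features, and the cluster names below are entirely hypothetical; in practice the features would also need to be put on comparable scales before distances are meaningful.

```python
import math

# Hypothetical centroids from past claims:
# (claim amount in thousands of dollars, days since the policy started)
centroids = {"typical": (4.0, 400.0), "fraud-like": (45.0, 20.0)}

def nearest_cluster(claim):
    """Name of the centroid closest to `claim` (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(claim, centroids[name]))

flag = nearest_cluster((40.0, 35.0))             # a large claim soon after signup
print(flag)
```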

• Cyber-Profiling Criminals
– Cyber profiling is the process of collecting data from
individuals and groups to identify significant
correlations.
– The idea of cyber profiling is derived from criminal
profiles, which provide information to the
investigation division to classify the types of criminals
who were at the crime scene.
• Call Detail Record Analysis
– A call detail record (CDR) is the information captured
by telecom companies during the call, SMS, and
internet activity of a customer.
– This information provides greater insight into the
customer’s needs when used with customer
demographics.

Determining the Number of Clusters


• There are two main methods to find the
best value of K.
• Elbow Curve Method
• Silhouette Analysis

Elbow Curve Method


• The elbow method runs k-means clustering on the
dataset for a range of values of k (say 1 to 10).
• For each k, calculate the total within-cluster sum of
squares (WSS). The average distance of the data points
to their centroid can be used in the same way.
• Plot these values against k and look for the elbow
point, where the rate of decrease shifts: the value
falls suddenly up to this point and only slowly after it.
• This elbow point can be used to determine K.
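
A minimal sketch of the procedure, assuming one-dimensional data and a deterministic start (evenly spaced sorted values) so the numbers are reproducible. The helper names `kmeans_1d` and `wss` and the data values are made up for this example.

```python
def kmeans_1d(values, k, max_iter=100):
    """1-D k-means started from k evenly spaced sorted values; returns centroids."""
    data = sorted(values)
    step = len(data) / k
    centroids = [data[int(i * step + step / 2)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in data:                           # assign each value to its nearest centroid
            clusters[min(range(k), key=lambda j: abs(v - centroids[j]))].append(v)
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:                 # centroids stopped moving
            break
        centroids = updated
    return centroids

def wss(values, centroids):
    """Total within-cluster sum of squares for the given centroids."""
    return sum(min((v - c) ** 2 for c in centroids) for v in values)

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]   # three obvious groups
curve = {k: wss(values, kmeans_1d(values, k)) for k in range(1, 7)}
for k, value in curve.items():
    print(k, round(value, 2))
```

For this data the WSS drops steeply up to k=3 and barely changes afterwards, so the elbow sits at three clusters.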

• At first, adding clusters explains a lot of the variance,
but at some point the marginal gain drops, giving an
angle in the graph. The number of clusters is chosen
at this point, hence the “elbow criterion”.
• Inertia: the sum of squared distances of samples to
their closest cluster center.
• Real data is not always clearly clustered, so the
elbow may not be clear and sharp and cannot
always be identified unambiguously.
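
When the bend has to be picked automatically, one simple heuristic (an assumption of this sketch, not a method from the slides) is to choose the k at which the drop in inertia slows down the most, that is, the largest second difference. The inertia values below are hypothetical.

```python
def elbow_k(inertia_by_k):
    """Heuristic: the k where the curve bends most (largest second difference)."""
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        gain_before = inertia_by_k[prev] - inertia_by_k[k]
        gain_after = inertia_by_k[k] - inertia_by_k[nxt]
        bend = gain_before - gain_after          # large when the drop suddenly flattens
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

inertia = {1: 100.0, 2: 60.0, 3: 12.0, 4: 10.0, 5: 9.0}   # hypothetical values
chosen = elbow_k(inertia)
print(chosen)
```

On a smooth curve with no sharp bend this heuristic can give a misleading answer, which is the ambiguity noted above.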

Silhouette Analysis
• The silhouette coefficient, or silhouette score,
measures how similar a data point is to its own
cluster (cohesion) compared to other clusters
(separation).
• The equation for calculating the silhouette
coefficient for a particular data point i:

$$ S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

• S(i) is the silhouette coefficient of the data point i.
• a(i) is the average distance between i and all the
other data points in the cluster to which i belongs.
• b(i) is the average distance from i to the data points
of the nearest cluster to which i does not belong.
• S(i) ranges from -1 to 1; values near 1 indicate that
i is well matched to its own cluster.
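
The coefficient can be computed directly from these definitions. The sketch below assumes one-dimensional points, absolute difference as the distance, at least two clusters, and the common convention that a point alone in its cluster scores 0; the function name and data are made up for this example.

```python
def silhouette(i, points, labels):
    """Silhouette coefficient S(i) for 1-D `points` with cluster `labels`."""
    own = [
        abs(points[i] - p)
        for j, p in enumerate(points)
        if labels[j] == labels[i] and j != i
    ]
    if not own:                                  # convention: single-point cluster scores 0
        return 0.0
    a = sum(own) / len(own)                      # a(i): mean distance inside own cluster
    b = min(                                     # b(i): mean distance to nearest other cluster
        sum(abs(points[i] - p) for p, lab in zip(points, labels) if lab == other)
        / labels.count(other)
        for other in set(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
s0 = silhouette(0, points, labels)
print(s0)
```

For the first point, a = 1.5 and b = 10, giving S = 0.85. Averaging S(i) over all points for each candidate K and picking the K with the highest average is the usual way to use the score.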

Further Study
• https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
• https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/
