Data Analytics CSE704 Module-2

Amity School of Engineering & Technology
Module-2
Clustering and Classification
Data Analytics CSE704
By: Dr. Ghanshyam Prasad Dubey

Syllabus
• Module II: Clustering and Classification: (6 Hours)
– Analytical Theory and Methods: Overview of Clustering – K-
means – Use Cases – Overview of the Method – Determining
the Number of Clusters – Diagnostics – Reasons to Choose and
Cautions – Classification: Decision Trees – Overview of a
Decision Tree – The General Algorithm – Decision Tree
Algorithms – Evaluating a Decision Tree
Introduction to Clustering
• Clustering methods are among the most useful
unsupervised ML methods.
• These methods are used to find similarities and
relationship patterns among data samples and then
cluster those samples into groups whose members have
similar features.
• The aim of the clustering process is to separate samples
with similar traits and assign them to the same cluster.
Cluster Formation Methods

• Density-based
• Hierarchical-based
• Partitioning
• Grid
Density-based
• In these methods, the clusters are formed
as dense regions of data points.
• The advantage of these methods is that
they have good accuracy as well as a good
ability to merge two clusters.
• Ex. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN),
Ordering Points To Identify the Clustering
Structure (OPTICS), etc.
Hierarchical-based
• In these methods, the clusters are formed
as a tree-type structure based on the
hierarchy.
• They have two categories namely,
Agglomerative (Bottom up approach) and
Divisive (Top down approach).
• Ex. Clustering Using Representatives
(CURE), Balanced Iterative Reducing and
Clustering using Hierarchies (BIRCH), etc.
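The agglomerative (bottom-up) approach can be sketched in a few lines. This is a minimal illustration, not CURE or BIRCH: it assumes single linkage (the distance between two clusters is the distance between their closest members), and the function name `agglomerative` and the sample points are made up for the example.

```python
from itertools import combinations
from math import dist

def agglomerative(points, num_clusters):
    """Bottom-up clustering with single linkage (an assumption of this sketch)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters whose closest members are nearest.
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i (j > i, so popping j is safe).
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hypothetical sample data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 1.0), (8.0, 8.0), (8.5, 8.0), (1.0, 1.5)]
result = agglomerative(points, 2)
print(result)
```

A divisive (top-down) method would go the other way: start with one cluster holding all points and split it repeatedly.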
Partitioning
• In these methods, the clusters are formed
by partitioning the objects into k clusters.
• The number of clusters will be equal to the
number of partitions.
• Ex. K-means, Clustering Large
Applications based upon Randomized
Search (CLARANS).
Grid
• In these methods, the clusters are formed
as a grid-like structure.
• The advantage of these methods is that all
the clustering operations done on these
grids are fast and independent of the
number of data objects.
• Ex. Statistical Information Grid (STING),
Clustering in Quest (CLIQUE).
Types of ML Clustering Algorithms
• K-means Clustering
• Mean-Shift Algorithm
• Hierarchical Clustering
K-means Clustering
• This clustering algorithm computes the
centroids and iterates until the centroids
stop changing, that is, until it finds the
optimal centroids.
• It assumes that the number of clusters is
already known.
• It is also called a flat clustering algorithm.
• The number of clusters that the algorithm
identifies in the data is represented by ‘K’
in K-means.
Mean-Shift Algorithm
• It is another powerful clustering algorithm
used in unsupervised learning.
• Unlike K-means clustering, it does not
assume the number of clusters in advance;
hence it is a non-parametric algorithm.
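A rough sketch of the idea, under simplifying assumptions that are not in the slides: the data is 1-D, the kernel is flat (every neighbour inside the window counts equally), and `bandwidth` is a window size chosen by hand. The function names are hypothetical. Each point is shifted repeatedly to the mean of its neighbours until it stops moving; points that end at the same place form one cluster, so K is never supplied.

```python
def mean_shift_1d(values, bandwidth, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift on 1-D data; returns one mode per input value."""
    modes = []
    for x in values:
        for _ in range(max_iter):
            # Neighbours inside the window around the current position.
            window = [v for v in values if abs(v - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)
    return modes

def group_modes(modes, bandwidth):
    """Give the same label to modes within half a bandwidth of each other."""
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if abs(m - c) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            # No existing center is close enough: start a new cluster.
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

# Hypothetical sample data with two groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
modes = mean_shift_1d(values, bandwidth=1.0)
centers, labels = group_modes(modes, bandwidth=1.0)
print(centers, labels)
```

The number of clusters found depends on the bandwidth, so the method trades the choice of K for the choice of a window size.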
Hierarchical Clustering
• It is another unsupervised learning
algorithm that is used to group together
the unlabeled data points having similar
characteristics.
Applications of Clustering
• Data summarization and compression −
– Clustering is widely used in the areas where
we require data summarization, compression
and reduction as well. The examples are
image processing and vector quantization.
• Collaborative systems and customer
segmentation −
– Since clustering can be used to find similar
products or same kind of users, it can be used
in the area of collaborative systems and
customer segmentation.
• Serve as a key intermediate step for other data mining
tasks −
– Cluster analysis can generate a compact summary of data for
classification, testing, hypothesis generation; hence, it serves
as a key intermediate step for other data mining tasks also.
• Trend detection in dynamic data −
– Clustering can also be used for trend detection in dynamic
data by making various clusters of similar trends.
• Social network analysis −
– Clustering can be used in social network analysis. The
examples are generating sequences in images, videos or
audios.
• Biological data analysis −
– Clustering can also be used to make clusters of images,
videos hence it can successfully be used in biological data
analysis.
K Means Clustering
• K-Means Clustering is an unsupervised learning
algorithm that is used to solve clustering problems in
machine learning and data science.
• It groups the unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters that
need to be created in the process; for example, if K=2,
there will be two clusters, for K=3 there will be three
clusters, and so on.
• It is a centroid-based algorithm, where each cluster is
associated with a centroid. The main aim of this
algorithm is to minimize the sum of squared distances
between the data points and the centroids of their
corresponding clusters.
• The algorithm takes the unlabeled dataset as
input, divides the dataset into k clusters, and
repeats the process until the best clusters are
found, that is, until the assignments no longer
change.
• The value of k should be predetermined in this
algorithm.
• The k-means clustering algorithm mainly
performs two tasks:
– Determines the best positions for the K center
points or centroids by an iterative process.
– Assigns each data point to its closest k-center.
The data points that are nearest to a particular
k-center form a cluster.
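The quantity that k-means tries to minimize is usually written as the within-cluster sum of squared distances, which is the same quantity called WSS or inertia later in these slides. The symbols are assumptions introduced here: C_j is the set of points in cluster j, mu_j is its centroid, and x_i is a data point.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

The two tasks above correspond to the two halves of this expression: moving each centroid to the mean of its cluster lowers J for fixed assignments, and reassigning each point to its closest centroid lowers J for fixed centroids.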
K-Means Algorithm
• Step-1: Select the number K to decide the number of
clusters.
• Step-2: Select K random points as centroids. (They need
not be points from the input dataset.)
• Step-3: Assign each data point to its closest centroid,
which will form the predefined K clusters.
• Step-4: Calculate the variance and place a new centroid for
each cluster, at the mean of the points assigned to it.
• Step-5: Repeat the third step, which means reassigning each
data point to the new closest centroid.
• Step-6: If any reassignment occurs, go to step-4; otherwise
go to FINISH.
• Step-7: The model is ready.
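The steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: initial centroids are drawn from the data points themselves, Euclidean distance is used, an empty cluster simply keeps its old centroid, and the names `kmeans`, `seed` and `max_iter` are made up for the example.

```python
import random
from math import dist

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids (assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Step 6: finish when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical sample data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Which cluster gets label 0 depends on the random start, so only the grouping, not the label numbers, is meaningful.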
• Suppose we have two variables, M1 and
M2. The x-y scatter plot of these two
variables is given below:
• Let's take the number of clusters K=2, so we will
try to group the data points into two different
clusters. We choose two random points as the
initial centroids; they need not be points from
the dataset.
• Now we assign each data point of the scatter plot to
its closest K-point or centroid, using the usual formula
for the distance between two points. Equivalently, we
draw a median line (the perpendicular bisector) between
the two centroids. Consider the image below:
• From the image, it is clear that the points to the left
of the line are nearer to K1, the blue centroid, and the
points to the right of the line are closer to the yellow
centroid. Let's color them blue and yellow for clear
visualization.
• As we need to find the closest clusters, we repeat
the process by choosing new centroids. To choose them,
we compute the center of gravity of the points in each
cluster and place the new centroids there, as below:
• Next, we reassign each data point to the new
centroids. For this, we repeat the same process of
finding a median line. The median will look like the
image below:
• From the image, we can see that one
yellow point is on the left side of the line,
and two blue points are to the right of it.
So, these three points will be assigned to
the new centroids.
• As reassignment has taken place, we again
go to step-4, which is finding new centroids or
K-points.
• We repeat the process by finding the center of
gravity of each cluster, so the new centroids will be
as shown in the image below:
• With the new centroids, we again
draw the median line and reassign the
data points. So, the image will be:
• In the image, there are no
dissimilar data points on either side of the line,
which means our model is formed. Consider the
image below:
• As our model is ready, we can now
remove the assumed centroids, and the
two final clusters will be as shown in the
image below:
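The median line used in this walk-through is the perpendicular bisector of the segment joining the two centroids, and being on one side of it is the same as being closer to that centroid. The sketch below checks this numerically. The centroid coordinates, the sample points, and the function names are hypothetical, since the original figures are not available.

```python
from math import dist

def side_of_bisector(p, c1, c2):
    """> 0 if p is on c1's side of the bisector, < 0 on c2's side, 0 on the line."""
    # Midpoint of the segment c1-c2 and the direction pointing from c2 to c1.
    mid = [(a + b) / 2 for a, b in zip(c1, c2)]
    direction = [a - b for a, b in zip(c1, c2)]
    # Dot product of (p - mid) with that direction.
    return sum((x - m) * d for x, m, d in zip(p, mid, direction))

def nearest(p, c1, c2):
    """Name of the closer centroid by Euclidean distance."""
    return "c1" if dist(p, c1) < dist(p, c2) else "c2"

# Hypothetical centroids and data points.
c1, c2 = (2.0, 2.0), (8.0, 6.0)
points = [(3.0, 5.0), (7.0, 3.0), (1.0, 1.0), (9.0, 7.0)]
for p in points:
    print(p, side_of_bisector(p, c1, c2), nearest(p, c1, c2))
```

This is why the assignment step can be done with distances alone: no line has to be drawn, which also makes the step work for more than two centroids.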
K-Means Use Cases

• Document Classification
– Cluster documents in multiple categories
based on tags, topics, and the content of the
document.
– Initial processing of the documents is
needed to represent each document as a
vector, using term frequency to identify
commonly used terms that help classify the
document.
– The document vectors are then clustered to
help identify similarities in document groups.
• Delivery Store Optimization
– Optimize the process of goods delivery using
truck drones by combining k-means, to find
the optimal number of launch locations, with
a genetic algorithm, to solve the truck route
as a traveling salesman problem.
• Identifying Crime Localities
– With data related to crimes available in
specific localities in a city, the category of
crime, the area of the crime, and the
association between the two can give quality
insight into crime-prone areas within a city or
a locality.
• Customer Segmentation
– Clustering helps marketers improve their customer
base, work on target areas, and segment customers
based on purchase history, interests, or activity
monitoring.
– The segmentation would help the company target
specific clusters of customers for specific campaigns.
• Fantasy League Stat Analysis
– Analyzing player stats has always been a critical
element of the sporting world, and with increasing
competition, machine learning has a critical role to
play here.
– If you would like to create a fantasy draft team and
identify similar players based on player stats,
k-means can be a useful option.
• Insurance Fraud Detection
– Machine learning has a critical role to play in fraud
detection and has numerous applications in automobile,
healthcare, and insurance fraud detection.
– Using historical data on fraudulent claims, it is
possible to isolate new claims based on their proximity to
clusters that indicate fraudulent patterns. Since insurance
fraud can potentially have a multi-million dollar impact on
a company, the ability to detect fraud is crucial.
• Rideshare Data Analysis
– The publicly available Uber ride information dataset
provides a large amount of valuable data on traffic,
transit time, peak pickup localities, and more. Analyzing
this data is useful not just in the context of Uber but also in
providing insight into urban traffic patterns and helping us
plan for the cities of the future.
• Cyber-Profiling Criminals
– Cyber profiling is the process of collecting data from
individuals and groups to identify significant
correlations.
– The idea of cyber profiling is derived from criminal
profiles, which provide information to the
investigation division to classify the types of criminals
who were at the crime scene.
• Call Detail Record Analysis
– A call detail record (CDR) is the information captured
by telecom companies during the call, SMS, and
internet activity of a customer.
– This information provides greater insights about the
customer’s needs when used with customer
demographics.
Determining the Number of Clusters

• There are two main methods to find the
best value of K.
• Elbow Curve Method
• Silhouette Analysis
Elbow Curve Method

• The elbow method runs k-means clustering on the
dataset for a range of values of k (say 1 to 10).
• For each k, calculate the total within-cluster sum of
squares (WSS) or, alternatively, the average distance
to the centroid across all data points.
• Plot these values against k and look for the elbow
point, where the rate of decrease shifts and the curve
flattens.
• This elbow point can be used to determine K.
• At first, adding clusters explains a lot of information
(variance), but at some point the marginal gain
drops, giving an angle in the graph. The number
of clusters is chosen at this point, hence the
“elbow criterion”.
• Inertia: the sum of squared distances of samples to
their closest cluster center.
• We do not always have clearly clustered data, so
the elbow may not be clear and sharp, and it
cannot always be identified unambiguously.
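The elbow curve can be computed with a short script. This sketch makes several assumptions that are not in the slides: a basic k-means written inline, initial centroids sampled from the data, the best of a few random restarts kept for each k, and made-up sample points forming three groups. The names `kmeans_wss` and `elbow_curve` are hypothetical.

```python
import random
from math import dist

def kmeans_wss(points, k, seed):
    """Run a basic k-means once and return its within-cluster sum of squares."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(100):
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # WSS (inertia): squared distance of every point to its closest centroid.
    return sum(min(dist(p, c) ** 2 for c in centroids) for p in points)

def elbow_curve(points, k_values, restarts=5):
    # Keep the lowest WSS over a few random restarts for each k.
    return {
        k: min(kmeans_wss(points, k, seed) for seed in range(restarts))
        for k in k_values
    }

# Hypothetical sample data: three groups of three points.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
wss = elbow_curve(points, range(1, 7))
for k, value in wss.items():
    print(k, round(value, 3))
```

Plotting the printed values against k gives the elbow curve; on data like this one would expect the drop to level off around k=3, but the script does not pick K automatically.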
Silhouette Analysis
• The silhouette coefficient, or silhouette score,
is a measure of how similar a data point is to its own
cluster (cohesion) compared to other clusters
(separation).
• The equation for calculating the silhouette
coefficient for a particular data point i is:
S(i) = (b(i) - a(i)) / max(a(i), b(i))
• S(i) is the silhouette coefficient of the data point i.
• a(i) is the average distance between i and all the
other data points in the cluster to which i belongs.
• b(i) is the smallest average distance from i to the
points of any cluster to which i does not belong, that
is, the average distance to the nearest other cluster.
• S(i) ranges from -1 to 1; values close to 1 indicate
that the point is well matched to its own cluster.
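A small sketch of the silhouette calculation, assuming Euclidean distance, at least two clusters, and the common convention that a point alone in its cluster gets a score of 0. The function name `silhouette` and the sample points are made up for the example.

```python
from math import dist

def silhouette(i, points, labels):
    """Silhouette coefficient S(i) of the point with index i."""
    own = labels[i]
    # a(i): average distance to the other points of the same cluster.
    same = [
        dist(points[i], q)
        for j, q in enumerate(points)
        if labels[j] == own and j != i
    ]
    if not same:
        return 0.0  # convention for a single-point cluster (assumption)
    a = sum(same) / len(same)
    # b(i): smallest average distance to the points of any other cluster.
    b = min(
        sum(dist(points[i], q) for j, q in enumerate(points) if labels[j] == other)
        / labels.count(other)
        for other in set(labels)
        if other != own
    )
    return (b - a) / max(a, b)

# Hypothetical sample data: two tight pairs far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]   # labels that follow the two pairs
bad = [0, 1, 0, 1]    # labels that cut across the pairs
print([round(silhouette(i, points, good), 3) for i in range(4)])
print([round(silhouette(i, points, bad), 3) for i in range(4)])
```

Averaging S(i) over all points for each candidate K, and choosing the K with the highest average, is the usual way this score is used to pick the number of clusters.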
Further Study
• https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
• https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/
