Data Analytics CSE704 Module-2
Module-2
Clustering and Classification
Data Analytics CSE704
Syllabus
• Module II: Clustering and Classification (6 Hours)
– Analytical Theory and Methods: Overview of Clustering –
K-means – Use Cases – Overview of the Method – Determining
the Number of Clusters – Diagnostics – Reasons to Choose and
Cautions
– Classification: Decision Trees – Overview of a
Decision Tree – The General Algorithm – Decision Tree
Algorithms – Evaluating a Decision Tree
Amity School of Engineering & Technology
Introduction to Clustering
• Clustering methods are among the most useful
unsupervised ML methods.
• They find similarities and relationship patterns among
data samples, then group the samples into clusters
based on the similarity of their features.
• The aim of the clustering process is to identify groups
of samples with similar traits and assign each group to
a cluster.
Density-based
• In these methods, clusters are formed as
dense regions of data points, separated by
sparser regions.
• The advantage of these methods is that
they have good accuracy as well as a good
ability to merge two clusters.
• Ex. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN),
Ordering Points To Identify the Clustering
Structure (OPTICS), etc.
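The density idea can be illustrated with a minimal pure-Python sketch in the spirit of DBSCAN. It assumes Euclidean distance; the function name `dbscan`, the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense point), and the sample points are illustrative choices, not taken from the slides.

```python
import math

NOISE = -1  # label for points that belong to no dense region

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1           # point i is dense: start a new cluster
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is dense too: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this toy data the two tight groups become two clusters and the isolated point is labeled as noise. The number of clusters is not supplied by the caller.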
Hierarchical-based
• In these methods, the clusters are formed
as a tree-like structure based on the
hierarchy.
• They fall into two categories, namely
Agglomerative (bottom-up approach) and
Divisive (top-down approach).
• Ex. Clustering Using Representatives
(CURE), Balanced Iterative Reducing and
Clustering using Hierarchies (BIRCH), etc.
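As a rough illustration of the agglomerative (bottom-up) approach, the sketch below starts with every point in its own cluster and repeatedly merges the two closest clusters. It assumes single-linkage (closest-pair) distance and Euclidean geometry, which are choices made for this example; the function name `agglomerative` and the sample points are hypothetical. It is not an implementation of CURE or BIRCH.

```python
import math

def agglomerative(points, num_clusters):
    """Merge clusters bottom-up until num_clusters (>= 1) remain."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(p, q)
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i (j > i, so index i stays valid).
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(points, num_clusters=3)
```

Recording the order of merges instead of stopping at a fixed count would give the full tree (dendrogram); a divisive method would build the same kind of tree from the top down.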
Partitioning
• In these methods, the clusters are formed
by partitioning the objects into k clusters.
• The number of clusters will be equal to the
number of partitions.
• Ex. K-means, Clustering Large
Applications based upon Randomized
Search (CLARANS).
Grid
• In these methods, the data space is divided
into a grid-like structure of cells, and the
clusters are formed from those cells.
• The advantage of these methods is that the
clustering operations performed on the
grid are fast and independent of the
number of data objects.
• Ex. Statistical Information Grid (STING),
Clustering in Quest (CLIQUE).
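A simplified sketch of the grid idea, assuming 2-D points: the points are binned into square cells once, and all later work operates on cells rather than on individual points. This is not STING or CLIQUE; the function names `dense_cells` and `grid_clusters`, the parameters `cell_size` and `min_count`, and the sample data are illustrative assumptions.

```python
from collections import Counter

def dense_cells(points, cell_size, min_count):
    """Bin 2-D points into square cells; keep cells with enough points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}

def grid_clusters(cells):
    """Group dense cells that share an edge into clusters of cells."""
    remaining = set(cells)
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        group = set(stack)
        while stack:
            cx, cy = stack.pop()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in remaining:
                    remaining.remove(nb)
                    group.add(nb)
                    stack.append(nb)
        clusters.append(group)
    return clusters

points = [(0.1, 0.2), (0.4, 0.9), (1.2, 0.5), (1.7, 0.1),
          (8.5, 8.5), (8.7, 8.1), (4.0, 4.0)]
dense = dense_cells(points, cell_size=1.0, min_count=2)
clusters = grid_clusters(dense)
```

Only `dense_cells` touches the raw points; `grid_clusters` depends on the number of occupied cells, which is why the clustering step does not grow with the number of data objects.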
• K-means Clustering
• Mean-Shift Algorithm
• Hierarchical Clustering
K-means Clustering
• This clustering algorithm computes
centroids and iterates until the centroids
stop changing (converge).
• It assumes that the number of clusters is
already known.
• It is also called a flat clustering algorithm.
• The ‘K’ in K-means is the number of
clusters the algorithm identifies in the
data.
Mean-Shift Algorithm
• It is another powerful clustering algorithm
used in unsupervised learning.
• Unlike K-means clustering, it does not
assume the number of clusters in advance;
hence it is a non-parametric algorithm.
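A minimal 1-D sketch of the mean-shift idea, assuming a flat (uniform) kernel: each point is repeatedly moved to the mean of the data points within a window around it, and points that end up at the same place form a cluster. The function name `mean_shift_1d`, the `bandwidth` window radius, the merge threshold of half the bandwidth, and the sample values are all illustrative assumptions.

```python
def mean_shift_1d(values, bandwidth, tol=1e-6, max_iter=100):
    """Return (cluster centers, label per value) for 1-D data."""
    modes = []
    for x in values:
        # Shift x to the mean of the data inside its window until it settles.
        for _ in range(max_iter):
            window = [v for v in values if abs(v - x) <= bandwidth]
            new_x = sum(window) / len(window)
            if abs(new_x - x) < tol:
                break
            x = new_x
        modes.append(x)

    # Points whose shifted positions land close together share a cluster.
    centers, labels = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if abs(m - c) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels

values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
centers, labels = mean_shift_1d(values, bandwidth=1.0)
```

The caller never states how many clusters to find; the count falls out of the data. The window size still has to be chosen, and it influences how many clusters appear.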
Hierarchical Clustering
• It is another unsupervised learning
algorithm that is used to group together
the unlabeled data points having similar
characteristics.
Applications of Clustering
• Data summarization and compression −
– Clustering is widely used in areas that
require data summarization, compression,
and reduction. Examples are
image processing and vector quantization.
• Collaborative systems and customer
segmentation −
– Since clustering can find similar products
or similar kinds of users, it can be used
in collaborative systems and
customer segmentation.
K Means Clustering
• K-Means Clustering is an unsupervised learning
algorithm that solves clustering problems in machine
learning and data science by grouping an unlabeled
dataset into different clusters.
• Here K is the pre-defined number of clusters to be
created: if K=2 there will be two clusters, for K=3
there will be three clusters, and so on.
• It is a centroid-based algorithm, where each cluster is
associated with a centroid. The main aim of this
algorithm is to minimize the sum of squared distances
between the data points and the centroids of their
assigned clusters.
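The quantity being minimized can be computed directly. The sketch below assumes the usual K-means objective, the within-cluster sum of squared Euclidean distances; the function name `wcss` and the sample data are illustrative, not from the slides.

```python
def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

points = [(1.0, 1.0), (3.0, 1.0), (10.0, 5.0), (12.0, 5.0)]
centroids = [(2.0, 1.0), (11.0, 5.0)]

good = wcss(points, [0, 0, 1, 1], centroids)  # each point with its near centroid
bad = wcss(points, [0, 1, 0, 1], centroids)   # two points assigned to the far one
```

A lower value means tighter clusters, so the first assignment is the better one for these centroids.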
K-Means Algorithm
• Step-1: Select the number K to decide the number of
clusters.
• Step-2: Select K random points as the initial centroids.
(They need not be points from the input dataset.)
• Step-3: Assign each data point to its closest centroid,
which forms the K predefined clusters.
• Step-4: Compute the mean of the points in each cluster
and place that cluster's new centroid there.
• Step-5: Repeat step 3, that is, reassign each data point
to its new closest centroid.
• Step-6: If any reassignment occurred, go to step-4;
otherwise go to FINISH.
• Step-7: The model is ready.
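The steps above can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance and initial centroids drawn from the input points; the function names, the `seed` and `max_iter` parameters, and the sample data are hypothetical choices for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: pick K distinct input points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
```

Which cluster gets index 0 depends on the random start, so only the grouping is meaningful, not the label values. On harder data, different starting centroids can lead to different final clusters, which is why K-means is often run several times.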
• Customer Segmentation
– Clustering helps marketers improve their customer
base, work on target areas, and segment customers
based on purchase history, interests, or activity
monitoring.
– The segmentation helps the company target
specific clusters of customers with specific campaigns.
• Fantasy League Stat Analysis
– Analyzing player stats has always been a critical
element of the sporting world, and with increasing
competition, machine learning has a critical role to
play here.
– If you would like to create a fantasy draft team and
identify similar players based on player stats,
k-means can be a useful option.
• Insurance Fraud Detection
– Machine learning has a critical role to play in fraud
detection and has numerous applications in automobile,
healthcare, and insurance fraud detection.
– Using historical data on fraudulent claims, it is
possible to isolate new claims based on their proximity to
clusters that indicate fraudulent patterns. Since insurance
fraud can potentially have a multi-million dollar impact on
a company, the ability to detect fraud is crucial.
• Rideshare Data Analysis
– The publicly available Uber ride information dataset
provides a large amount of valuable data on traffic,
transit time, peak pickup localities, and more. Analyzing
this data is useful not just in the context of Uber but also in
providing insight into urban traffic patterns and helping us
plan for the cities of the future.
• Cyber-Profiling Criminals
– Cyber profiling is the process of collecting data from
individuals and groups to identify significant
correlations.
– The idea of cyber profiling is derived from criminal
profiles, which give the investigation division
information for classifying the types of criminals
who were at the crime scene.
• Call Detail Record Analysis
– A call detail record (CDR) is the information captured
by telecom companies during the call, SMS, and
internet activity of a customer.
– This information provides greater insight into the
customer’s needs when used with customer
demographics.
Silhouette Analysis
• The silhouette coefficient, or silhouette score,
measures how similar a data point is to its own
cluster (cohesion) compared to other clusters
(separation).
• The equation for calculating the silhouette
coefficient for a particular data point:
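In the standard definition, the coefficient for point i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the smallest mean distance from i to the points of any other cluster. A minimal Python sketch of that definition follows, assuming Euclidean distance and at least two clusters; the function name `silhouette` and the sample data are illustrative.

```python
import math

def silhouette(i, points, labels):
    """Silhouette coefficient of points[i]; needs at least two clusters."""
    own = labels[i]
    # Distances from point i to every other point, grouped by cluster label.
    by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster.setdefault(lab, []).append(math.dist(points[i], p))
    if own not in by_cluster:
        return 0.0  # common convention: a single-point cluster scores 0
    a = sum(by_cluster[own]) / len(by_cluster[own])       # cohesion
    b = min(sum(d) / len(d)                               # separation
            for lab, d in by_cluster.items() if lab != own)
    return (b - a) / max(a, b)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
s_good = silhouette(0, points, [0, 0, 1, 1])  # point sits with its neighbor
s_bad = silhouette(1, points, [0, 1, 1, 1])   # point grouped with far points
```

Values near +1 indicate a point that fits its cluster well, values near 0 a point on the boundary between clusters, and negative values a point that is probably in the wrong cluster.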
Further Study
• https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
• https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/