Unit 4: Unsupervised Algorithms


Unit - 4

Unsupervised Learning
• Unsupervised learning is a branch of machine learning that deals with unlabeled data.
• Unlike supervised learning, where the data is labeled with a specific category or outcome, unsupervised learning algorithms are tasked with finding patterns and relationships within the data without any prior knowledge of the data's meaning. This makes unsupervised learning a powerful tool for exploratory data analysis, where the goal is to understand the underlying structure of the data.

Unsupervised Learning
• In artificial intelligence, machine learning that takes place without human supervision is known as unsupervised machine learning.
• Unsupervised machine learning models, in contrast to supervised learning, are given unlabeled data and allowed to discover patterns and insights on their own, without explicit direction or instruction.
• Unsupervised machine learning analyzes and clusters unlabeled datasets using machine learning algorithms.
• These algorithms find hidden patterns in the data without any human intervention; i.e., we do not give target outputs to our model.
• The training model has only input parameter values and discovers the groups or patterns on its own.

[Figure: Unsupervised Learning]
How does unsupervised learning work?


Unsupervised learning works by analyzing unlabeled data to identify patterns and
relationships. The data is not labeled with any predefined categories or outcomes, so the
algorithm must find these patterns and relationships on its own. This can be a challenging
task, but it can also be very rewarding, as it can reveal insights into the data that would
not be apparent from a labeled dataset.

The dataset in Figure A is mall data containing information about the clients who subscribe to the mall. Once subscribed, a client is issued a membership card, so the mall has complete information about the customer and his/her every purchase. Using this data and unsupervised learning techniques, the mall can easily group clients based on the parameters we feed in.

The input to unsupervised learning models is as follows:

• Unstructured data: may contain noisy (meaningless) data, missing values, or unknown data.
• Unlabeled data: the data contains only values for the input parameters, with no target (output) value. It is easier to collect than the labeled data used in the supervised approach.

Unsupervised Learning Algorithms


There are mainly three types of algorithms used for unsupervised datasets:
• Clustering
• Association Rule Learning
• Dimensionality Reduction

Clustering
• Clustering in unsupervised machine learning is the process of grouping unlabeled data into clusters based on their similarities.
• Broadly, this technique is applied to group data based on the different patterns, such as similarities or differences, that our machine learning model finds.

Some common clustering algorithms

• K-means Clustering: Partitioning Data into K Clusters
• Hierarchical Clustering: Building a Hierarchical Structure of Clusters
• Density-Based Clustering (DBSCAN): Identifying Clusters Based on Density
• Mean-Shift Clustering: Finding Clusters Based on Mode Seeking
• Spectral Clustering: Utilizing Spectral Graph Theory for Clustering
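As an illustrative sketch (not from the original text), the snippet below runs three of these algorithms on a synthetic dataset. It assumes scikit-learn and NumPy are installed; all parameter values are made up for demonstration.

```python
# A minimal sketch, assuming scikit-learn is available.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Toy unlabeled data: 300 points scattered around 3 centers.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # -1 marks noise

print(set(kmeans_labels), set(hier_labels), set(dbscan_labels))
```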
K-Nearest Neighbor (KNN) Algorithm
The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method employed to tackle classification and regression problems. Evelyn Fix and Joseph Hodges developed this algorithm in 1951, and it was subsequently expanded by Thomas Cover. This section explores the fundamentals, workings, and implementation of the KNN algorithm.

[Figure: KNN algorithm working visualization]

Why do we need a KNN algorithm?


• The K-NN algorithm is a versatile and widely used machine learning algorithm, chosen primarily for its simplicity and ease of implementation.
• It does not require any assumptions about the underlying data distribution.
• It can handle both numerical and categorical data, making it a flexible choice for various types of datasets in classification and regression tasks. It is a non-parametric method that makes predictions based on the similarity of data points in a given dataset. K-NN is less sensitive to outliers compared to many other algorithms.
• The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric, such as Euclidean distance.
• The class or value of the data point is then determined by the majority vote or the average of the K neighbors. This approach allows the algorithm to adapt to different patterns and make predictions based on the local structure of the data.

Distance Metrics Used in KNN Algorithm


The KNN algorithm helps us identify the nearest points or groups for a query point, but to determine the closest groups or nearest points we need a distance metric. For this purpose, we use the distance metrics below:

1. Euclidean Distance
This is simply the Cartesian distance between two points in the plane/hyperplane. Euclidean distance can also be visualized as the length of the straight line joining the two points under consideration. This metric helps us calculate the net displacement between the two states of an object.

2. Manhattan Distance
The Manhattan distance metric is generally used when we are interested in the total distance traveled by an object instead of its displacement. This metric is calculated by summing the absolute differences between the coordinates of the points in n dimensions.
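To make the two metrics concrete, here is a minimal NumPy sketch; the point coordinates are hypothetical.

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])  # hypothetical point 1
q = np.array([4.0, 0.0, 3.0])  # hypothetical point 2

# Euclidean distance: straight-line length between p and q.
euclidean = np.sqrt(np.sum((p - q) ** 2))  # sqrt(9 + 4 + 0) ≈ 3.606

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(p - q))          # 3 + 2 + 0 = 5.0

print(euclidean, manhattan)
```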

How to choose the value of k for KNN Algorithm?


The value of k is crucial in the KNN algorithm, as it defines the number of neighbors considered. The value of k in the k-nearest neighbors (k-NN) algorithm should be chosen based on the input data: if the input data has more outliers or noise, a higher value of k is usually better. It is recommended to choose an odd value for k to avoid ties in classification. Cross-validation methods can help in selecting the best k value for the given dataset.
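A minimal sketch of that cross-validation approach, assuming scikit-learn; the bundled Iris dataset is used purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of k and keep the one with the best cross-validated accuracy.
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "accuracy:", round(scores[best_k], 3))
```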
Workings of the KNN algorithm

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.

A step-by-step explanation of how KNN works is given below (a condensed code sketch follows the steps):

Step 1: Selecting the optimal value of K
• K represents the number of nearest neighbors that need to be considered while making a prediction.
Step 2: Calculating distance
• To measure the similarity between the target and training data points, Euclidean distance is used. The distance is calculated between each data point in the dataset and the target point.
Step 3: Finding nearest neighbors
• The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for classification or taking the average for regression
• In a classification problem, the class label is determined by majority voting: the class with the most occurrences among the neighbors becomes the predicted class for the target data point.
• In a regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors. The calculated average becomes the predicted output for the target data point.
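The four steps condense into a short from-scratch sketch; the function name knn_predict and the toy data are illustrative, not from the original text.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query from its k nearest training points."""
    # Step 2: Euclidean distance from the query to every training point.
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Step 4: majority vote among the k neighbors' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 1.5])))  # prints 0
```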

Advantages of the KNN Algorithm


• Easy to implement – the complexity of the algorithm is not that high.
• Adapts easily – since KNN stores all the data in memory, whenever a new example or data point is added, the algorithm adjusts accordingly and the new example contributes to future predictions as well.
• Few hyperparameters – the only parameters required in training a KNN model are the value of k and the choice of distance metric.

Disadvantages of the KNN Algorithm


• Does not scale – KNN is often called a lazy algorithm: it defers all computation to prediction time, so it needs a lot of computing power and data storage. This makes the algorithm both time-consuming and resource-exhausting.
• Curse of dimensionality – KNN suffers from the peaking phenomenon: because of the curse of dimensionality, it has a hard time classifying data points properly when the dimensionality is too high.
• Prone to overfitting – since the algorithm is affected by the curse of dimensionality, it is also prone to overfitting. Feature selection and dimensionality reduction techniques are therefore generally applied to deal with this problem.

Image Segmentation By Clustering


Segmentation by clustering
Clustering can be used to perform pixel-wise image segmentation: we cluster the pixels that belong together. There are two approaches for performing segmentation by clustering:
• Clustering by merging (agglomerative)
• Clustering by division (divisive)

Clustering by merging or Agglomerative Clustering:


This is a bottom-up approach: each pixel starts as its own cluster, and the closest clusters are merged step by step (see the sketch after this list).
• Take each point as a separate cluster.
• For a given number of iterations, or until the clustering is satisfactory, merge the two clusters with the smallest inter-cluster distance (e.g., the merge that least increases the within-cluster sum of squares, WCSS).
• Repeat the above step.
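A brief sketch of agglomerative merging, assuming scikit-learn. Ward linkage is one common choice: it merges the pair of clusters whose union least increases the WCSS.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=4, random_state=0)

# Bottom-up: every point starts as its own cluster, and Ward linkage
# repeatedly merges the two clusters with the smallest WCSS increase.
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```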

K-Means Clustering
K-means clustering is a very popular clustering algorithm, applied when we have a dataset whose labels are unknown. The goal is to find groups in the data based on some kind of similarity, with the number of groups represented by K. The algorithm is generally used in areas like market segmentation and customer segmentation, but it can also be used to segment different objects in an image on the basis of pixel values.
The algorithm for image segmentation works as follows (a sketch follows the list):
1. First, select the value of K for K-means clustering.
2. Select a feature vector for every pixel (color values such as RGB, texture, etc.).
3. Define a similarity measure between feature vectors, such as Euclidean distance, to measure the similarity between any two points/pixels.
4. Apply the K-means algorithm to the feature vectors to obtain the cluster centers.
5. Apply a connected-components algorithm.
6. Merge any component smaller than a threshold into an adjacent similar component, until no more merges are possible.
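A sketch of steps 1–4 on a synthetic RGB image, assuming scikit-learn; a full pipeline would add texture features and the connected-components post-processing of steps 5–6 (e.g., scipy.ndimage.label).

```python
import numpy as np
from sklearn.cluster import KMeans

# A synthetic 64x64 RGB image stands in for a real one (values in [0, 1]).
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

K = 4                                  # Step 1: choose K
pixels = image.reshape(-1, 3)          # Step 2: one RGB feature vector per pixel
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0)
labels = kmeans.fit_predict(pixels)    # Steps 3-4: Euclidean-distance K-means

segmented = labels.reshape(image.shape[:2])  # per-pixel cluster map
print(segmented.shape, np.unique(segmented))
```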

Using clustering for preprocessing


Clustering is an unsupervised machine-learning technique that groups data points into clusters so that objects in the same group are similar to each other.

Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters. Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for a product.

Let's understand this with an example: suppose we are a market manager, and we have a new tempting product to sell. We are sure that the product will bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?

Clustering, falling under the category of unsupervised machine learning, is one of the problems that machine learning algorithms solve.


Clustering utilizes only the input data to determine patterns, anomalies, or similarities.

A good clustering algorithm aims to obtain clusters in which:

o The intra-cluster similarity is high: the data present inside the cluster is similar to one another.
o The inter-cluster similarity is low: each cluster holds data that is not similar to the data in other clusters.

What is a Cluster?
o A cluster is a subset of similar objects.
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.

What is clustering in Data Mining?


o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of meaningful subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set, and it is used either as a stand-alone tool to get better insight into the data distribution or as a pre-processing step for other algorithms.

Important points:

o The data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications of cluster analysis in data mining:

o Clustering analysis is widely used in many applications, such as data analysis, market research, pattern recognition, and image processing.
o It assists marketers in finding distinct groups in their customer base and characterizing customer groups based on purchasing patterns.
o It helps in classifying documents on the web for information discovery.
o Clustering is also used in outlier-detection applications, such as the detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to analyze the characteristics of each cluster.
o In biology, it can be used to determine plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
o It helps in the identification of areas of similar land use in an earth-observation database, and in the identification of groups of houses in a city according to house type, value, and geographic location.

Why is clustering used in data mining?


Clustering analysis has been an evolving problem in data mining due to its variety of applications. The advent of various data clustering tools in the last few years, and their comprehensive use in a broad range of applications including image processing, computational biology, mobile communication, medicine, and economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized: an algorithm may give the best results with one type of data set but fail or perform poorly with other kinds of data sets. Although many efforts have been made to design algorithms that perform well in all situations, no complete solution has been achieved so far. Many clustering tools have been proposed, but each algorithm has its advantages and disadvantages and cannot work in all real situations.

1. Scalability:

Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow approximately according to the complexity order of the algorithm. For example, if we perform K-means clustering, we know it is roughly O(n), where n is the number of objects in the data. If we raise the number of data objects 10-fold, the time taken to cluster them should also increase approximately 10 times; i.e., there should be a linear relationship. If that is not the case, there is some error in our implementation.
The algorithm should be scalable; if it is not, we cannot get appropriate results on large data.
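A rough sanity-check sketch of this linear-scaling expectation, assuming scikit-learn; absolute timings are machine-dependent.

```python
import time
import numpy as np
from sklearn.cluster import KMeans

for n in (10_000, 100_000):
    X = np.random.rand(n, 2)
    start = time.perf_counter()
    KMeans(n_clusters=5, n_init=1, random_state=0).fit(X)
    elapsed = time.perf_counter() - start
    print(n, "objects:", round(elapsed, 3), "s")
# A 10x increase in n should increase the runtime roughly 10x.
```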

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to find only small, spherical clusters.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor-quality clusters.


6. High dimensionality:

Clustering tools should be able to handle not only high-dimensional data spaces but also low-dimensional ones.

What is Semi-Supervised Cluster Analysis?

Semi-supervised clustering is a method that partitions unlabeled data by making use of domain knowledge. This knowledge is generally expressed as pairwise constraints between instances or as an additional set of labeled instances.

The quality of unsupervised clustering can be substantially improved using some weak form of supervision, for instance in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters). Such a clustering procedure that relies on user feedback or guidance constraints is known as semi-supervised clustering.

There are several methods for semi-supervised clustering, which can be divided into two classes as follows:

Constraint-based semi-supervised clustering − uses user-provided labels or constraints to guide the algorithm toward a more appropriate data partitioning. This includes modifying the objective function depending on the constraints, or initializing and constraining the clustering process using the labeled objects.

Distance-based semi-supervised clustering − employs an adaptive distance measure that is trained to satisfy the labels or constraints in the supervised data. Multiple adaptive distance measures have been used, including string-edit distance trained using Expectation-Maximization (EM) and Euclidean distance modified by a shortest-path algorithm.
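As one illustration of the constraint-based idea, the sketch below seeds K-means with centroids computed from a small labeled subset. This is a simplified sketch of initializing the clustering from labeled objects, not a full constrained-clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=3, random_state=1)

# Pretend only 15 points are labeled (5 per class): our "domain knowledge".
labeled_idx = np.concatenate([np.where(y == c)[0][:5] for c in range(3)])
seeds = np.array([X[labeled_idx][y[labeled_idx] == c].mean(axis=0)
                  for c in range(3)])

# Initialize K-means from the labeled-object centroids.
km = KMeans(n_clusters=3, init=seeds, n_init=1, random_state=1).fit(X)
print(km.labels_[:10])
```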

DBSCAN Clustering algorithm


Clustering analysis, or simply clustering, is an unsupervised learning method that divides the data points into a number of specific batches or groups, such that data points in the same group have similar properties and data points in different groups have different properties in some sense. It comprises many different methods based on different distance measures, e.g., K-Means (distance between points), affinity propagation (graph distance), mean-shift (distance between points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers), and spectral clustering (graph distance). Fundamentally, all clustering methods use the same approach: first we calculate similarities, and then we use them to group the data points. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

Clusters are dense regions in the data space, separated by regions of lower point density. The DBSCAN algorithm is based on this intuitive notion of "clusters" and "noise". The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Why DBSCAN?
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for
finding spherical-shaped clusters or convex clusters. In other words, they are suitable
only for compact and well-separated clusters. Moreover, they are also severely affected
by the presence of noise and outliers in the data.

Real-life data may contain irregularities, such as:

1. Clusters of arbitrary (non-convex) shape.
2. Noise and outliers.
Given such data, the k-means algorithm has difficulty identifying clusters of arbitrary shape.

Parameters Required For DBSCAN Algorithm

1. eps: defines the neighborhood around a data point; i.e., if the distance between two points is less than or equal to eps, they are considered neighbors. If the eps value is chosen too small, a large part of the data will be considered outliers. If it is chosen very large, the clusters will merge and the majority of the data points will fall in the same cluster. One way to find the eps value is based on the k-distance graph (see the sketch after this list).
2. MinPts: the minimum number of neighbors (data points) within the eps radius. The larger the dataset, the larger the value of MinPts that should be chosen. As a general rule, a minimum MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and MinPts should be at least 3.
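A sketch of the k-distance graph heuristic for choosing eps, assuming scikit-learn and matplotlib are available; min_pts = 4 is an illustrative choice for 2-D data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

min_pts = 4  # illustrative: >= D + 1 for 2-D data, and at least 3
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
dists, _ = nn.kneighbors(X)  # distances to each point's min_pts nearest points

# Sort every point's distance to its min_pts-th neighbor; the "elbow"
# of this curve is a reasonable eps.
k_dist = np.sort(dists[:, -1])
plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to neighbor #{min_pts}")
plt.show()
```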

Steps Used In DBSCAN Algorithm

1. Find all the neighbor points within eps and identify the core points as those with more than MinPts neighbors.
2. For each core point, if it is not already assigned to a cluster, create a new cluster.
3. Recursively find all density-connected points and assign them to the same cluster as the core point.
   Points a and b are said to be density-connected if there exists a point c that has a sufficient number of points in its neighborhood and both a and b are within eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Points that do not belong to any cluster are noise (a usage sketch follows these steps).
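An end-to-end sketch with scikit-learn's DBSCAN; eps and min_samples are illustrative values, and points labeled -1 are noise.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters that K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=4).fit(X)
labels = db.labels_  # cluster ids; -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", list(labels).count(-1))
```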
[Figure: clusters found in the dataset]

Evaluation Metrics For DBSCAN Algorithm In Machine Learning

We will use the silhouette score and the adjusted Rand score for evaluating clustering algorithms. The silhouette score is in the range −1 to 1. A score near 1 is best, meaning the data point is very compact within the cluster to which it belongs and far away from the other clusters; the worst value is −1, and values near 0 denote overlapping clusters.
The adjusted Rand score is typically interpreted on a 0-to-1 scale: more than 0.9 denotes excellent cluster recovery, above 0.8 a good recovery, and less than 0.5 poor recovery.
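A sketch computing both metrics with scikit-learn; because the adjusted Rand score needs ground-truth labels, the example uses synthetic data where they are known.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))            # -1 (worst) to 1 (best)
print("adjusted Rand:", adjusted_rand_score(y_true, labels))
```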

In DBSCAN plots, black points represent outliers (noise). By changing eps and MinPts, we can change the cluster configuration.
Now the question that should be raised is:

When Should We Use DBSCAN Over K-Means In Clustering Analysis?
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and K-Means are both clustering algorithms that group together data with similar characteristics. However, they work on different principles and are suitable for different types of data. We prefer DBSCAN when the data is not spherical in shape or when the number of classes is not known beforehand.
Difference Between DBSCAN and K-Means:

• Number of clusters: K-Means is very sensitive to the number of clusters, which must be specified. In DBSCAN, we need not specify the number of clusters.
• Cluster shape: clusters formed in K-Means are spherical or convex in shape. Clusters formed in DBSCAN can be of any arbitrary shape.
• Outliers: K-Means does not work well with outlier data; outliers can skew the clusters in K-Means to a very large extent. DBSCAN works well with datasets having noise and outliers.
• Parameters: in K-Means only one parameter is required for training the model (the number of clusters K). In DBSCAN two parameters are required (eps and MinPts).

[Figure: clusters formed in K-Means and DBSCAN]

[Figure: outlier influence on DBSCAN]
