Cluster Analysis


Linear regression is used to predict numerical values.

Logistic regression is used to predict non-numerical
(categorical) values.

Classification is the process of assigning data to categories with the help of
class labels.
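As an illustration of the difference (not from the slides), the sketch below uses NumPy to fit a linear regression that predicts a numeric score, then maps that numeric prediction to a class label; the hours/score data and the Pass/Fail threshold are invented for this example.

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score (score = 2*hours + 1).
hours = np.array([0.0, 1.0, 2.0, 3.0])
score = np.array([1.0, 3.0, 5.0, 7.0])

# Linear regression: predicts a numerical value.
slope, intercept = np.polyfit(hours, score, deg=1)
predicted = float(np.polyval([slope, intercept], 4.0))  # predict for 4 hours

# Classification: assigns a class label (here via a hypothetical threshold).
label = "Pass" if predicted >= 5.0 else "Fail"
print(predicted, label)
```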
Wednesday, December 09, 2020 3
The Wine Quality Dataset

The Wine Quality dataset involves predicting the quality (Good/Bad) of white
wines from chemical measures of each wine:
1. Fixed acidity.
2. Volatile acidity.
3. Citric acid.
4. Residual sugar.
5. Chlorides.
6. Free sulfur dioxide.
7. Total sulfur dioxide.
8. Density.
9. pH.
10. Sulphates.
11. Alcohol.
12. Quality (Good/Bad).
Example
The table below lists 20 wines sold in the market along with their alcohol and
alkalinity-of-ash content.

Wine   Alcohol   Alkalinity of Ash     Wine   Alcohol   Alkalinity of Ash
  1    14.8      28                     11    10.7      12.2
  2    11.05     12                     12    14.3      27
  3    12.2      21                     13    12.4      19.5
  4    12        20                     14    14.85     29.2
  5    14.5      29.5                   15    10.9      13.6
  6    11.2      13                     16    13.9      29.7
  7    11.5      12                     17    10.4      12.2
  8    12.8      19                     18    10.8      13.6
  9    14.75     28.8                   19    14        28.8
 10    10.5      14                     20    12.47     22.8
Classification and Clustering
Classification and clustering are two learning methods that characterize objects
into groups by one or more features. The two processes appear similar, but there
is a difference between them.

Regression and Clustering

Regression:
• It predicts continuous output values. Regression analysis is the statistical
model used to predict numeric data instead of labels.

• It can also identify the distribution trends based on the available data or
historic data.

• Predicting a person's income based on various attributes such as age and
experience is an example of creating a regression model.



Regression and Clustering
Clustering:
• Clustering is, quite literally, the grouping of data according to the
similarity of data points and data patterns.

• The aim of this is to separate similar categories of data and differentiate
them into localized regions.

• This way, when a new data point arrives, we can easily identify which
group or cluster it belongs to.

• This is done for unstructured datasets, where it is up to the machine to
figure out the categories.
INTRODUCTION TO CLUSTERING

• Clustering is usually one of the first tasks performed in most analytics
projects.

• It helps data scientists to analyze individual clusters further.

• A cluster refers to a collection of data points aggregated together because
of certain similarities.


• Clustering is the classification of objects into different groups,
or more precisely, the partitioning of a data set into subsets
(clusters), so that the data in each subset (ideally) share some
common trait - often according to some defined distance
measure.


When you were still young and naïve…
 Classification


You may classify them by
 Shape


You may classify them by
 Color


You may classify them by
 Sum of internal
angles


Classification
 Shape
 Color
 Sum of internal
angles

 Similarity of characteristics


In clustering, we do not have a target to predict. We look at
the data and then try to club similar observations and form
different groups.


What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.


• Clustering: the act of grouping “similar” objects into sets.

• A clustering problem can be viewed as unsupervised classification.

• Clustering is appropriate when there is no a priori knowledge about the data.

• Grouping is based on distance (proximity).

• You don’t know who or what belongs to which group. Not even the
number of groups.
Applications of Clustering
Example:
 A bank wants to give credit card offers to its customers. Currently, they look
at the details of each customer and based on this information, decide which
offer should be given to which customer.
 Now, the bank can potentially have millions of customers. Does it make
sense to look at the details of each customer separately and then make a
decision? Certainly not! It is a manual process and will take a huge amount
of time.
 So what can the bank do? One option is to segment its customers into
different groups. For instance, the bank can group the customers based on
their income:


Applications of Clustering

• Cluster analysis in market segmentation: helps marketers discover distinct
groups in their customer bases, and then use this knowledge to develop
targeted marketing programs.


• Land use: Identification of areas of similar land use in an earth
observation database.

• Insurance: Identifying groups of insurance policy holders with a high
average claim cost.


• City-planning: Identifying groups of houses according to their house type,
value, and geographical location.

• The field of psychiatry: The characterization of patients on the basis of
clusters of symptoms can be useful in the identification of an appropriate
form of therapy.
• Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults.


• Climate: Understanding the Earth's climate requires finding patterns in the
atmosphere and ocean. To that end, cluster analysis has been applied to find
patterns in the atmospheric pressure of polar regions and areas of the ocean
that have a significant impact on land climate.


Measuring similarity

 Similarity
 The degree of correspondence among objects across
all of the characteristics.

 Correlational measures
 Distance measures


Similarity measures
 Correlation measures
 Grouping cases based on response patterns
 Distance measures
 Grouping cases based on distance

Non-overlapping clusters
Clusters in which each observation belongs to only one cluster. Non-overlapping
clustering techniques are the most frequently used in practice.


Overlapping clusters
• An observation may belong to more than one cluster


Example
The table below lists 20 wines sold in the market along with their alcohol and
alkalinity-of-ash content.

Wine   Alcohol   Alkalinity of Ash     Wine   Alcohol   Alkalinity of Ash
  1    14.8      28                     11    10.7      12.2
  2    11.05     12                     12    14.3      27
  3    12.2      21                     13    12.4      19.5
  4    12        20                     14    14.85     29.2
  5    14.5      29.5                   15    10.9      13.6
  6    11.2      13                     16    13.9      29.7
  7    11.5      12                     17    10.4      12.2
  8    12.8      19                     18    10.8      13.6
  9    14.75     28.8                   19    14        28.8
 10    10.5      14                     20    12.47     22.8


Clusters of wine based on alcohol and ash content.
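One way to reproduce such clusters (a sketch, assuming SciPy is available; the slides do not specify a tool) is to run k-means with k = 2 on the (alcohol, alkalinity-of-ash) pairs from the table:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# (alcohol, alkalinity of ash) for the 20 wines in the table.
wines = np.array([
    [14.8, 28.0], [11.05, 12.0], [12.2, 21.0], [12.0, 20.0], [14.5, 29.5],
    [11.2, 13.0], [11.5, 12.0], [12.8, 19.0], [14.75, 28.8], [10.5, 14.0],
    [10.7, 12.2], [14.3, 27.0], [12.4, 19.5], [14.85, 29.2], [10.9, 13.6],
    [13.9, 29.7], [10.4, 12.2], [10.8, 13.6], [14.0, 28.8], [12.47, 22.8],
])

np.random.seed(42)  # fix the random initialization for repeatability
centroids, labels = kmeans2(wines, 2, minit='++')
print(labels)  # cluster index (0 or 1) for each of the 20 wines
```

Wines with high alcohol and high alkalinity of ash (such as 1 and 14) land in one cluster, and the low-alcohol, low-ash wines (such as 2) in the other.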


Types of Clustering

• A clustering is a set of clusters.

• An important distinction is between hierarchical and partitional sets of
clusters.

• Partitional clustering (k-means clustering)
  • A division of data objects into non-overlapping subsets (clusters) such
    that each data object is in exactly one subset.

• Hierarchical clustering
  • A set of nested clusters organized as a hierarchical tree.


K-Means Clustering

• K-means clustering is one of the simplest and most popular unsupervised
machine learning algorithms.

• It partitions the data set into K distinct, non-overlapping clusters.

• The algorithm tries to minimize the distance of the points in a cluster
from their centroid – this is the k-means clustering technique.

• The main idea is to define k centers, one for each cluster.
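The idea above can be sketched as a minimal Lloyd's-style k-means in pure NumPy (an illustration on invented toy points, not a production implementation):

```python
import numpy as np

def kmeans(points, centers, n_iter=10):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    center, then move each center to the mean of its points."""
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)              # assignment step
        for j in range(len(centers)):              # update step
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Toy data: two obvious groups; initial centers taken from the data.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
ctrs, lab = kmeans(pts, pts[[0, 3]].copy())
print(lab)   # first three points in one cluster, last three in the other
```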


How to choose the value of K?
• One of the most challenging tasks in this clustering algorithm is choosing
the right value of k.
• What should the right k-value be?
• How do you choose the k-value?
• If you choose the value of k randomly, it may or may not be right.
• If you choose the wrong value, it will directly affect your model's
performance.
• There are two methods by which you can select the right value of k:

• Elbow Method.
• Silhouette Method.
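A sketch of the Elbow Method (assuming SciPy; the blob data is invented): run k-means for several values of k, record the within-cluster sum of squares (WCSS), and look for the k where the curve's drop levels off.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)
# Invented data: two well-separated blobs of 30 points each.
data = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8.0])

wcss = {}
for k in range(1, 5):
    centroids, labels = kmeans2(data, k, minit='++')
    # within-cluster sum of squares: squared distance to the assigned centroid
    wcss[k] = float(((data - centroids[labels]) ** 2).sum())

print(wcss)  # the sharp drop from k=1 to k=2 is the "elbow" for this data
```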
• You’ll define a target number k, which refers to the number of centroids you
need in the dataset. A centroid is the imaginary or real location representing
the center of a cluster.

• Every data point is allocated to one of the clusters by reducing the
in-cluster sum of squares.

• In other words, the K-means algorithm identifies k centroids, and then
allocates every data point to the nearest cluster, while keeping the clusters
as compact as possible.


• Figure shows the results obtained from performing K-means clustering on a
simulated example consisting of 150 observations in two dimensions, using three
different values of K.

Note that there is no ordering of the clusters, so the cluster coloring is arbitrary.
Advantages of K-means
• It is very simple to implement.
• It scales to huge datasets and runs quickly on large data.
• It adapts easily to new examples.
• It generalizes to clusters of different shapes and sizes.

Disadvantages of K-means
• It is sensitive to outliers.
• Choosing the k value manually is a tough job.
• As the number of dimensions increases, its scalability decreases.


CLASS ROOM PROBLEMS

Cluster the following eight points (with (x, y) representing locations) into two
clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
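A sketch of how this problem could be solved programmatically (assuming SciPy; note that the resulting cluster assignment can depend on the random initialization):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# A1..A8 from the problem statement.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

np.random.seed(1)
centroids, labels = kmeans2(points, 2, minit='++')
for name, lab in zip(["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"], labels):
    print(name, "-> cluster", lab)
```

Nearby points such as A1(2, 10) and A8(4, 9) always end up in the same cluster, whichever way the two clusters settle.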



Hierarchical clustering

Hierarchical clustering will help to determine the optimal number of clusters.

1. It starts by putting every point in its own cluster, so each cluster is a
singleton.

2. It then merges the two clusters that are closest to each other, based on the
distances from the distance matrix. The consequence is that there is one less
cluster.

3. It then recalculates the distances between the new and old clusters and
saves them in a new distance matrix, which is used in the next step.

4. Finally, steps 2 and 3 are repeated until all clusters are merged into one
single cluster including all points.
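The steps above are what scipy.cluster.hierarchy implements; a sketch on invented toy points (assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two obvious groups.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])

# Agglomerative clustering: start with singletons, repeatedly merge the two
# closest clusters; Z records each merge and the distance at which it happened.
Z = linkage(pts, method='single')

# Cut the resulting tree into two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```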


There exist five main methods to measure the distance between clusters,
referred to as linkage methods:

1. Single linkage: computes the minimum distance between clusters before
merging them.

2. Complete linkage: computes the maximum distance between clusters before
merging them.

3. Average linkage: computes the average distance between clusters before
merging them.

4. Centroid linkage: calculates the centroids of both clusters, then computes
the distance between the two before merging them.

5. Ward’s (minimum variance) criterion: minimizes the total within-cluster
variance, finding the pair of clusters whose merge leads to the minimum
increase in total within-cluster variance.
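The first four linkage methods can be computed directly from the pairwise distance matrix; a NumPy/SciPy sketch on two invented toy clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters (invented points).
A = np.array([[0., 0.], [0., 1.]])
B = np.array([[3., 0.], [4., 0.]])

d = cdist(A, B)             # all pairwise distances between the clusters
single   = d.min()          # single linkage: minimum pairwise distance
complete = d.max()          # complete linkage: maximum pairwise distance
average  = d.mean()         # average linkage: mean pairwise distance

# Centroid linkage: distance between the two cluster centroids.
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
print(single, complete, average, centroid)
```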
Dendrogram

• A dendrogram is a diagram that shows the hierarchical relationship between
objects.

• It is most commonly created as an output from hierarchical clustering.

• The main use of a dendrogram is to work out the best way to allocate objects
to clusters.

• The dendrogram below shows the hierarchical clustering of six observations
shown on the scatterplot to the left.


• The key to interpreting a dendrogram is to focus on the height at which any two objects
are joined together.

• In the example above, we can see that E and F are most similar, as the height of the link
that joins them together is the smallest. The next two most similar objects are A and B.

• In the dendrogram above, the height of the dendrogram indicates the order in which the
clusters were joined.
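This can be seen in the linkage matrix itself (a SciPy sketch on invented points arranged so that E and F are closest, then A and B): the third column of Z holds the merge heights, which are non-decreasing from one merge to the next.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Points labelled A..F; E and F are the closest pair, then A and B.
pts = np.array([[0., 0.],     # A (index 0)
                [0., 1.2],    # B (index 1)
                [5., 0.],     # C (index 2)
                [5., 1.5],    # D (index 3)
                [10., 0.],    # E (index 4)
                [10., 0.5]])  # F (index 5)

Z = linkage(pts, method='single')

# Each row of Z is one merge: (cluster i, cluster j, height, new cluster size).
# The first row joins E and F, at the smallest height.
print(Z[:, 2])
```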

