Cluster Analysis


Linear regression is used to predict numerical values.

Logistic regression is used to predict non-numerical
(categorical) values.

Classification is the process of assigning data to categories with the help of
class labels.
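As an illustration of the difference (not from the slides), the sketch below uses NumPy to fit a linear regression that predicts a numeric score, then maps that numeric prediction to a class label; the hours/score data and the Pass/Fail threshold are invented for this example.

```python
import numpy as np

# Hypothetical data: hours studied vs. exam score (score = 2*hours + 1).
hours = np.array([0.0, 1.0, 2.0, 3.0])
score = np.array([1.0, 3.0, 5.0, 7.0])

# Linear regression: predicts a numerical value.
slope, intercept = np.polyfit(hours, score, deg=1)
predicted = float(np.polyval([slope, intercept], 4.0))  # predict for 4 hours

# Classification: assigns a class label (here via a hypothetical threshold).
label = "Pass" if predicted >= 5.0 else "Fail"
print(predicted, label)
```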
Wednesday, December 09, 2020 3
The Wine Quality Dataset

The Wine Quality dataset involves predicting the quality (Good/Bad) of white
wines from chemical measures of each wine:
1. Fixed acidity.
2. Volatile acidity.
3. Citric acid.
4. Residual sugar.
5. Chlorides.
6. Free sulfur dioxide.
7. Total sulfur dioxide.
8. Density.
9. pH.
10. Sulphates.
11. Alcohol.
12. Quality (Good/Bad).
Example
The table below lists 20 wines sold in the market along with their alcohol and
alkalinity-of-ash content.

Wine   Alcohol   Alkalinity of Ash     Wine   Alcohol   Alkalinity of Ash
  1    14.8      28                     11    10.7      12.2
  2    11.05     12                     12    14.3      27
  3    12.2      21                     13    12.4      19.5
  4    12        20                     14    14.85     29.2
  5    14.5      29.5                   15    10.9      13.6
  6    11.2      13                     16    13.9      29.7
  7    11.5      12                     17    10.4      12.2
  8    12.8      19                     18    10.8      13.6
  9    14.75     28.8                   19    14        28.8
 10    10.5      14                     20    12.47     22.8
Classification and Clustering
Classification and clustering are two learning methods that characterize objects
into groups by one or more features. The two processes appear similar, but there
is a difference between them.

Regression and Clustering

Regression:
• It predicts continuous output values. Regression analysis is the statistical
model used to predict numeric data instead of labels.

• It can also identify the distribution trends based on the available data or
historic data.

• Predicting a person's income based on various attributes such as age and
experience is an example of creating a regression model.



Regression and Clustering
Clustering:
• Clustering is, quite literally, the grouping of data according to the
similarity of data points and data patterns.

• The aim of this is to separate similar categories of data and differentiate
them into localized regions.

• This way, when a new data point arrives, we can easily identify which
group or cluster it belongs to.

• This is done for unstructured datasets, where it is up to the machine to
figure out the categories.
INTRODUCTION TO CLUSTERING

• Clustering is usually one of the first tasks performed in most analytics
projects.

• It helps data scientists to analyze individual clusters further.

• A cluster refers to a collection of data points aggregated together because
of certain similarities.


• Clustering is the classification of objects into different groups,
or more precisely, the partitioning of a data set into subsets
(clusters), so that the data in each subset (ideally) share some
common trait - often according to some defined distance
measure.


When you were still young and naïve…
 Classification


You may classify them by
 Shape


You may classify them by
 Color


You may classify them by
 Sum of internal
angles


Classification
 Shape
 Color
 Sum of internal
angles

 Similarity of characteristics


In clustering, we do not have a target to predict. We look at
the data and then try to club similar observations and form
different groups.


What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups

Intra-cluster distances are minimized; inter-cluster distances are maximized.


• Clustering: the act of grouping “similar” objects into sets.

• A clustering problem can be viewed as unsupervised classification.

• Clustering is appropriate when there is no a priori knowledge about the data.

• Grouping is based on distance (proximity).

• You don’t know who or what belongs to which group. Not even the
number of groups.
Applications of Clustering
Example:
 A bank wants to give credit card offers to its customers. Currently, they look
at the details of each customer and based on this information, decide which
offer should be given to which customer.
 Now, the bank can potentially have millions of customers. Does it make
sense to look at the details of each customer separately and then make a
decision? Certainly not! It is a manual process and will take a huge amount
of time.
 So what can the bank do? One option is to segment its customers into
different groups. For instance, the bank can group the customers based on
their income:


Applications of Clustering

• Cluster analysis in market segmentation: helps marketers discover distinct
groups in their customer bases, and then use this knowledge to develop
targeted marketing programs.


• Land use: Identification of areas of similar land use in an earth
observation database.

• Insurance: Identifying groups of insurance policy holders with a high
average claim cost.


• City-planning: Identifying groups of houses according to their house type,
value, and geographical location.

• The field of psychiatry: The characterization of patients on the basis of
clusters of symptoms can be useful in the identification of an appropriate
form of therapy.
• Earthquake studies: Observed earthquake epicenters should be clustered
along continental faults.


• Climate: Understanding the Earth's climate requires finding patterns in the
atmosphere and ocean. To that end, cluster analysis has been applied to find
patterns in the atmospheric pressure of polar regions and areas of the ocean
that have a significant impact on land climate.


Measuring similarity

 Similarity
 The degree of correspondence among objects across
all of the characteristics.

 Correlational measures
 Distance measures


Similarity measures
 Correlation measures
 Grouping cases based on response patterns
 Distance measures
 Grouping cases based on distance

Non-overlapping clusters
Clusters in which each observation belongs to only one cluster. Non-overlapping
clustering techniques are the most frequently used in practice.


Overlapping clusters
• An observation may belong to more than one cluster


Example
The table below lists 20 wines sold in the market along with their alcohol and
alkalinity-of-ash content.

Wine   Alcohol   Alkalinity of Ash     Wine   Alcohol   Alkalinity of Ash
  1    14.8      28                     11    10.7      12.2
  2    11.05     12                     12    14.3      27
  3    12.2      21                     13    12.4      19.5
  4    12        20                     14    14.85     29.2
  5    14.5      29.5                   15    10.9      13.6
  6    11.2      13                     16    13.9      29.7
  7    11.5      12                     17    10.4      12.2
  8    12.8      19                     18    10.8      13.6
  9    14.75     28.8                   19    14        28.8
 10    10.5      14                     20    12.47     22.8


Clusters of wine based on alcohol and ash content.
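One way to reproduce such clusters (a sketch, assuming SciPy is available; the slides do not specify a tool) is to run k-means with k = 2 on the (alcohol, alkalinity-of-ash) pairs from the table:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# (alcohol, alkalinity of ash) for the 20 wines in the table.
wines = np.array([
    [14.8, 28.0], [11.05, 12.0], [12.2, 21.0], [12.0, 20.0], [14.5, 29.5],
    [11.2, 13.0], [11.5, 12.0], [12.8, 19.0], [14.75, 28.8], [10.5, 14.0],
    [10.7, 12.2], [14.3, 27.0], [12.4, 19.5], [14.85, 29.2], [10.9, 13.6],
    [13.9, 29.7], [10.4, 12.2], [10.8, 13.6], [14.0, 28.8], [12.47, 22.8],
])

np.random.seed(42)  # fix the random initialization for repeatability
centroids, labels = kmeans2(wines, 2, minit='++')
print(labels)  # cluster index (0 or 1) for each of the 20 wines
```

Wines with high alcohol and high alkalinity of ash (such as 1 and 14) land in one cluster, and the low-alcohol, low-ash wines (such as 2) in the other.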


Types of Clustering

• A clustering is a set of clusters.

• An important distinction is between hierarchical and partitional sets of
clusters.

• Partitional clustering (k-means clustering)
  • A division of data objects into non-overlapping subsets (clusters) such
    that each data object is in exactly one subset.

• Hierarchical clustering
  • A set of nested clusters organized as a hierarchical tree.


K-Means Clustering

• K-means clustering is one of the simplest and most popular unsupervised
machine learning algorithms.

• It partitions the data set into K distinct, non-overlapping clusters.

• The algorithm tries to minimize the distance of the points in a cluster
from their centroid – this is the k-means clustering technique.

• The main idea is to define k centers, one for each cluster.
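The idea above can be sketched as a minimal Lloyd's-style k-means in pure NumPy (an illustration on invented toy points, not a production implementation):

```python
import numpy as np

def kmeans(points, centers, n_iter=10):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    center, then move each center to the mean of its points."""
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        # distance from every point to every center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)              # assignment step
        for j in range(len(centers)):              # update step
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# Toy data: two obvious groups; initial centers taken from the data.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
ctrs, lab = kmeans(pts, pts[[0, 3]].copy())
print(lab)   # first three points in one cluster, last three in the other
```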


How to choose the value of K?
• One of the most challenging tasks in this clustering algorithm is choosing
the right value of k.
• What should the right k-value be?
• How do you choose the k-value?
• If you choose the value of k randomly, it may or may not be right.
• If you choose the wrong value, it will directly affect your model's
performance.
• There are two methods by which you can select the right value of k:

• Elbow Method.
• Silhouette Method.
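A sketch of the Elbow Method (assuming SciPy; the blob data is invented): run k-means for several values of k, record the within-cluster sum of squares (WCSS), and look for the k where the curve's drop levels off.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

np.random.seed(0)
# Invented data: two well-separated blobs of 30 points each.
data = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8.0])

wcss = {}
for k in range(1, 5):
    centroids, labels = kmeans2(data, k, minit='++')
    # within-cluster sum of squares: squared distance to the assigned centroid
    wcss[k] = float(((data - centroids[labels]) ** 2).sum())

print(wcss)  # the sharp drop from k=1 to k=2 is the "elbow" for this data
```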
• You’ll define a target number k, which refers to the number of centroids you
need in the dataset. A centroid is the imaginary or real location representing
the center of a cluster.

• Every data point is allocated to one of the clusters by reducing the
in-cluster sum of squares.

• In other words, the K-means algorithm identifies k centroids, and then
allocates every data point to the nearest cluster, while keeping the clusters
as compact as possible.


• Figure shows the results obtained from performing K-means clustering on a
simulated example consisting of 150 observations in two dimensions, using three
different values of K.

Note that there is no ordering of the clusters, so the cluster coloring is arbitrary.
Advantages of K-means
• It is very simple to implement.
• It scales to huge datasets and runs quickly on large data.
• It adapts easily to new examples.
• It generalizes to clusters of different shapes and sizes.

Disadvantages of K-means
• It is sensitive to outliers.
• Choosing the k value manually is a tough job.
• As the number of dimensions increases, its scalability decreases.


CLASS ROOM PROBLEMS

Cluster the following eight points (with (x, y) representing locations) into two
clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
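A sketch of how this problem could be solved programmatically (assuming SciPy; note that the resulting cluster assignment can depend on the random initialization):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# A1..A8 from the problem statement.
points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

np.random.seed(1)
centroids, labels = kmeans2(points, 2, minit='++')
for name, lab in zip(["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8"], labels):
    print(name, "-> cluster", lab)
```

Nearby points such as A1(2, 10) and A8(4, 9) always end up in the same cluster, whichever way the two clusters settle.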



Hierarchical clustering

Hierarchical clustering will help to determine the optimal number of clusters.

1. It starts by putting every point in its own cluster, so each cluster is a
singleton.

2. It then merges the two clusters that are closest to each other, based on the
distances from the distance matrix. The consequence is that there is one less
cluster.

3. It then recalculates the distances between the new and old clusters and
saves them in a new distance matrix, which is used in the next step.

4. Finally, steps 2 and 3 are repeated until all clusters are merged into one
single cluster including all points.
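The steps above are what scipy.cluster.hierarchy implements; a sketch on invented toy points (assuming SciPy is available):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points forming two obvious groups.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])

# Agglomerative clustering: start with singletons, repeatedly merge the two
# closest clusters; Z records each merge and the distance at which it happened.
Z = linkage(pts, method='single')

# Cut the resulting tree into two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```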


There exist five main methods to measure the distance between clusters,
referred to as linkage methods:

1. Single linkage: computes the minimum distance between clusters before
merging them.

2. Complete linkage: computes the maximum distance between clusters before
merging them.

3. Average linkage: computes the average distance between clusters before
merging them.

4. Centroid linkage: calculates the centroids of both clusters, then computes
the distance between the two before merging them.

5. Ward’s (minimum variance) criterion: minimizes the total within-cluster
variance, finding the pair of clusters whose merge leads to the minimum
increase in total within-cluster variance.
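The first four linkage methods can be computed directly from the pairwise distance matrix; a NumPy/SciPy sketch on two invented toy clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters (invented points).
A = np.array([[0., 0.], [0., 1.]])
B = np.array([[3., 0.], [4., 0.]])

d = cdist(A, B)             # all pairwise distances between the clusters
single   = d.min()          # single linkage: minimum pairwise distance
complete = d.max()          # complete linkage: maximum pairwise distance
average  = d.mean()         # average linkage: mean pairwise distance

# Centroid linkage: distance between the two cluster centroids.
centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
print(single, complete, average, centroid)
```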
Dendrogram

• A dendrogram is a diagram that shows the hierarchical relationship between
objects.

• It is most commonly created as an output from hierarchical clustering.

• The main use of a dendrogram is to work out the best way to allocate objects
to clusters.

• The dendrogram below shows the hierarchical clustering of six observations
shown on the scatterplot to the left.


• The key to interpreting a dendrogram is to focus on the height at which any two objects
are joined together.

• In the example above, we can see that E and F are most similar, as the height of the link
that joins them together is the smallest. The next two most similar objects are A and B.

• In the dendrogram above, the height of the dendrogram indicates the order in which the
clusters were joined.
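This can be seen in the linkage matrix itself (a SciPy sketch on invented points arranged so that E and F are closest, then A and B): the third column of Z holds the merge heights, which are non-decreasing from one merge to the next.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Points labelled A..F; E and F are the closest pair, then A and B.
pts = np.array([[0., 0.],     # A (index 0)
                [0., 1.2],    # B (index 1)
                [5., 0.],     # C (index 2)
                [5., 1.5],    # D (index 3)
                [10., 0.],    # E (index 4)
                [10., 0.5]])  # F (index 5)

Z = linkage(pts, method='single')

# Each row of Z is one merge: (cluster i, cluster j, height, new cluster size).
# The first row joins E and F, at the smallest height.
print(Z[:, 2])
```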

