
CPE316 - Introduction to

Machine Learning

Week 11
Unsupervised Learning
and Clustering

Assoc. Prof. Dr. Caner ÖZCAN


Sun always comes out after the
storm. Being optimistic and
surrounding yourself with
positive loving people is for me,
living life on the sunny side of
the street.

~ Janice Dean
Machine Learning Glossary

• https://developers.google.com/machine-learning/glossary#model
• https://scikit-learn.org/stable/glossary.html

Unsupervised Learning

If intelligence was a cake,


unsupervised learning would be the cake,
supervised learning would be the icing on the cake,
and reinforcement learning would be the cherry on the cake.

Unsupervised Learning
• In the previous week, we looked at the most common unsupervised learning task: dimensionality reduction.
• This week, we will look at a few more unsupervised learning tasks and algorithms:
• Clustering: the goal is to group similar instances together into clusters. This is a great
tool for data analysis, customer segmentation, recommender systems, search engines,
image segmentation, semi-supervised learning, dimensionality reduction, and more.
• Anomaly detection: the objective is to learn what “normal” data looks like, and use
this to detect abnormal instances, such as defective items on a production line or a
new trend in a time series.
• Density estimation: this is the task of estimating the probability density function (PDF)
of the random process that generated the dataset. This is commonly used for anomaly
detection: instances located in very low-density regions are likely to be anomalies. It is
also useful for data analysis and visualization.

Clustering
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled dataset.
• It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
• It is an unsupervised learning method; hence no supervision is provided to the algorithm, and it deals with the unlabeled dataset.

Clustering
• Clustering is the task of identifying similar instances and assigning them to clusters, i.e.,
groups of similar instances.
• Consider Figure below: on the left is the iris dataset, where each instance’s species (i.e., its
class) is represented with a different marker. It is a labeled dataset, for which classification
algorithms such as Logistic Regression, SVMs or Random Forest classifiers are well suited.
• On the right is the same dataset, but without the labels, so you cannot use a classification
algorithm anymore.

Clustering
• This is where clustering algorithms step in: many of them can easily detect the top left
cluster.
• It is also quite easy to see with our own eyes, but it is not so obvious that the upper right
cluster is actually composed of two distinct sub-clusters.
• The dataset actually has two additional features (sepal length and width), not represented
here, and clustering algorithms can make good use of all features, so in fact they identify
the three clusters fairly well.

Clustering Applications
• Customer segmentation: You can cluster your customers based on their purchases, their
activity on your website, and so on.
• Data analysis: When analyzing a new dataset, it is often useful to first discover clusters of
similar instances, as it is often easier to analyze clusters separately.
• Dimensionality reduction technique: Once a dataset has been clustered, it is usually
possible to measure each instance’s affinity with each cluster.
• Anomaly detection (outlier detection): Any instance that has a low affinity to all the
clusters is likely to be an anomaly.
• Semi-supervised learning: If you only have a few labels, you could perform clustering and
propagate the labels to all the instances in the same cluster.

K-Means
• K-Means clustering is an unsupervised learning algorithm that groups the unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters that need to be created in the process; for example, if K=2 there will be two clusters, if K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.

K-Means
• It allows us to cluster the data into different groups without the need for any labeled training data.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
• The k-means clustering algorithm mainly performs two tasks:
• Determines the best positions for the K center points, or centroids, by an iterative process.
• Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.

K-Means Algorithm Steps
The working of the K-Means algorithm is explained in the steps below (a minimal code sketch of this loop follows the list):
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points as centroids. (They may be points other than those in the input dataset.)
• Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
• Step-4: Compute the new centroid of each cluster (the mean of its assigned points).
• Step-5: Repeat the third step, i.e., reassign each data point to the closest new centroid.
• Step-6: If any reassignment occurs, go back to Step-4; otherwise, finish.
• Step-7: The model is ready.
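
A minimal sketch of these steps using only NumPy (the function and variable names below are our own, not from any library):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    # Steps 1-2: pick K random points from the dataset as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (this simple sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop as soon as the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids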

K-Means Algorithm Steps

• Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given below.
• Let's take the number of clusters as K=2, i.e., we want to group the dataset into two different clusters.

• We need to choose some K random points or centroids to form the clusters.
• These points can be either points from the dataset or any other points.
• Here we are selecting the two points shown below as the K points; they are not part of our dataset.
K-Means Algorithm Steps

• Now we will assign each data point of the scatter plot to its closest K-point or centroid.
• We compute this using the familiar formula for the distance between two points.
• So, we will draw a median (the perpendicular bisector) between the two centroids.

• From the upper-right image, it is clear that the points to the left of the line are nearer to the K1 (blue) centroid, and the points to the right of the line are closer to the yellow centroid.
• Let's color them blue and yellow for clear visualization.
K-Means Algorithm Steps

• As we need to refine the clusters, we will repeat the process by choosing new centroids.
• To choose the new centroids, we compute the center of gravity of the points in each cluster and take these as the new centroids, as shown on the right.

• Next, we will reassign each data point to its new closest centroid.
• For this, we repeat the same process of finding a median line.
• The median will be as in the image on the right.

K-Means Algorithm Steps

• From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line.
• So, these three points will be reassigned to the other centroid.

• As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.

• We repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the image on the right.
K-Means Algorithm Steps

• With the new centroids, we again draw the median line and reassign the data points.
• The result is the image on the right.

• We can see in the upper-right image that no points switch sides of the line anymore, which means our model has converged.
• Consider the image on the right.

K-Means Algorithm Steps

• As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the image below:

The Numerical Example of K-Means
• Suppose we have several objects, each with two attributes or features, as shown in the table below. Our goal is to group these objects into K=2 groups based on the two features, width and height.

Object   Width (X)   Height (Y)
A        1           1
B        2           1
C        4           3
D        5           4

• Each object represents one point with two attributes (X, Y), which we can plot as a coordinate in the attribute space, as shown in the figure below.
The Numerical Example of K-Means
1. Initial value of centroids: Suppose we use objects A and B as the first centroids.
Object Width (X) Height (Y)
A 1 1
B 2 1
C 4 3
D 5 4

The Numerical Example of K-Means
2. Objects-Centroids distance: We calculate the distance from each cluster centroid to each object. Using the Euclidean distance (for example, the distance from C(4,3) to centroid B(2,1) is sqrt((4-2)² + (3-1)²) ≈ 2.83), the distance matrix at iteration 0 is:

Object   Distance from Cluster1   Distance from Cluster2
A(1,1)   0                        1
B(2,1)   1                        0
C(4,3)   3.61                     2.83
D(5,4)   5                        4.24

The Numerical Example of K-Means
3. Objects clustering: We assign each object to the cluster with the minimum distance. Thus, object A is assigned to Cluster1, object B to Cluster2, object C to Cluster2, and object D to Cluster2.

Object   Distance from Cluster1   Distance from Cluster2   Belongs to
A(1,1)   0                        1                        Cluster1
B(2,1)   1                        0                        Cluster2
C(4,3)   3.61                     2.83                     Cluster2
D(5,4)   5                        4.24                     Cluster2

The Numerical Example of K-Means
4. Iteration-1, Determine centroids: Knowing the members of each cluster, we now compute the new centroid of each cluster based on these new memberships.
Cluster1 has only one member (A), so its centroid stays at (1, 1).

Object   Distance from Cluster1   Distance from Cluster2   Belongs to
A(1,1)   0                        1                        Cluster1
B(2,1)   1                        0                        Cluster2
C(4,3)   3.61                     2.83                     Cluster2
D(5,4)   5                        4.24                     Cluster2

Cluster2 now has three members, so its centroid is the average coordinate of the three members:
Center of Cluster2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (3.67, 2.67)
The Numerical Example of K-Means
5. Iteration-1, Objects-Centroids distance: Again, we calculate the distance from each cluster centroid to each object.
Center of Cluster1: (1, 1)
Center of Cluster2: (3.67, 2.67)

Object   Distance from Cluster1   Distance from Cluster2
A(1,1)   0                        3.14
B(2,1)   1                        2.36
C(4,3)   3.61                     0.47
D(5,4)   5                        1.89

The Numerical Example of K-Means
6. Iteration-1, Objects clustering: As in step 3, we assign each object to the cluster with the minimum distance. Based on the new distance matrix, we move object B to Cluster1, while all the other objects stay where they are. The group assignment is shown below.

Object   Distance from Cluster1   Distance from Cluster2   Belongs to
A(1,1)   0                        3.14                     Cluster1
B(2,1)   1                        2.36                     Cluster1
C(4,3)   3.61                     0.47                     Cluster2
D(5,4)   5                        1.89                     Cluster2

The Numerical Example of K-Means
7. Iteration-2, Determine centroids: We now repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration. Cluster1 and Cluster2 both have two members, so the new centroids are:

Object   Distance from Cluster1   Distance from Cluster2   Belongs to
A(1,1)   0                        3.14                     Cluster1
B(2,1)   1                        2.36                     Cluster1
C(4,3)   3.61                     0.47                     Cluster2
D(5,4)   5                        1.89                     Cluster2

Center of Cluster1 = ((1 + 2)/2, (1 + 1)/2) = (1.5, 1)
Center of Cluster2 = ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5)

The Numerical Example of K-Means
8. Iteration-2, Objects-Centroids distance: Again, we calculate the distance from each cluster centroid to each object.
Center of Cluster1: (1.5, 1)
Center of Cluster2: (4.5, 3.5)

Object   Distance from Cluster1   Distance from Cluster2
A(1,1)   0.5                      4.30
B(2,1)   0.5                      3.54
C(4,3)   3.20                     0.71
D(5,4)   4.61                     0.71

The Numerical Example of K-Means
9. Iteration-2, Objects clustering: Again, we assign each object to the cluster with the minimum distance.

Object   Distance from Cluster1   Distance from Cluster2   Belongs to
A(1,1)   0.5                      4.30                     Cluster1
B(2,1)   0.5                      3.54                     Cluster1
C(4,3)   3.20                     0.71                     Cluster2
D(5,4)   4.61                     0.71                     Cluster2

Comparing the clustering of the last iteration with this iteration reveals that no object changes group anymore.
Thus, the k-means computation has reached stability and no more iterations are needed. We obtain the final grouping as the result.
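
As a quick check, the last distance matrix and assignment can be reproduced in a few lines of NumPy (coordinates taken from the example above):

import numpy as np

points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]])   # objects A, B, C, D
centroids = np.array([[1.5, 1.0], [4.5, 3.5]])        # iteration-2 centroids
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
print(dists.round(2))        # distances to Cluster1 and Cluster2, as in the table
print(dists.argmin(axis=1))  # [0 0 1 1] -> A, B in Cluster1; C, D in Cluster2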
K-Means
• The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset very
quickly and efficiently, often in just a few iterations.
• Consider the unlabeled dataset represented in Figure below: you can clearly see 5 blobs of
instances.

An unlabeled dataset composed of five blobs of instances
K-Means
• Let’s train a K-Means clusterer on this dataset.
• It will try to find each blob’s center and assign each instance to the closest blob:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)

• Note that you have to specify the number of clusters k that the algorithm must find.
• In this example, it is pretty obvious from looking at the data that k should be set to 5, but
in general it is not that easy.
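
The dataset X is not shown in the snippet above; a comparable five-blob dataset can be generated with Scikit-Learn's make_blobs (the blob centers and standard deviation below are illustrative assumptions, not the exact values behind the figure):

import numpy as np
from sklearn.datasets import make_blobs

# five assumed blob centers; cluster_std controls how spread out each blob is
blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8], [-2.8, 2.8], [-2.8, 1.3]])
X, y = make_blobs(n_samples=2000, centers=blob_centers, cluster_std=0.2, random_state=7)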

K-Means
• Each instance was assigned to one of the 5 clusters.
• In the context of clustering, an instance’s label is the index of the cluster that this instance gets
assigned to by the algorithm: this is not to be confused with the class labels in classification.
• The KMeans instance preserves a copy of the labels of the instances it was trained on, available via
the labels_ instance variable:
>>> y_pred
array([4, 0, 1, ..., 2, 1, 0], dtype=int32)
>>> y_pred is kmeans.labels_
True

• We can also take a look at the 5 centroids that the algorithm found:
>>> kmeans.cluster_centers_
array([[-2.80389616, 1.80117999],
[ 0.20876306, 2.25551336],
[-2.79290307, 2.79641063],
[-1.46679593, 2.28585348],
[-2.80037642, 1.30082566]])
K-Means
• Of course, you can easily assign new instances to the cluster whose centroid is closest:

>>> X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
>>> kmeans.predict(X_new)
array([1, 1, 2, 2], dtype=int32)
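
You can also measure each new instance's distance to every centroid with the transform() method; this is the per-cluster affinity mentioned earlier and can serve as a dimensionality reduction technique. A short sketch (using the kmeans model and X_new from above):

>>> kmeans.transform(X_new).shape   # one distance per centroid for each new instance
(4, 5)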

• If you plot the clusters' decision boundaries, you get a Voronoi diagram (see Figure, where each centroid is represented with an X):

In mathematics, a Voronoi
diagram is a partition of a
plane into regions close to
each of a given set of
objects.

The K-Means Algorithm
• Start by placing the centroids randomly (e.g., by
picking k instances at random and using their
locations as centroids).
• Then label the instances, update the centroids, label
the instances, update the centroids, and so on until
the centroids stop moving.
• The algorithm is guaranteed to converge in a finite number of steps (usually quite small); it will not oscillate forever.
• You can see the algorithm in action in Figure: the
centroids are initialized randomly (top left), then the
instances are labeled (top right), then the centroids
are updated (center left), the instances are relabeled
(center right), and so on.
• As you can see, in just 3 iterations the algorithm has
reached a clustering that seems close to optimal.
K-Means
• Unfortunately, although the algorithm is guaranteed to converge, it may not converge to the right
solution (i.e., it may converge to a local optimum): this depends on the centroid initialization.
• For example, Figure below shows two sub-optimal solutions that the algorithm can converge to if
you are not lucky with the random initialization step:

K-Means - Centroid Initialization Methods
• If you happen to know approximately where the centroids should be (e.g., if you ran another
clustering algorithm earlier), then you can set the init hyperparameter to a NumPy array containing
the list of centroids, and set n_init to 1:

good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)

• Another solution is to run the algorithm multiple times with different random initializations and
keep the best solution.
• This is controlled by the n_init hyperparameter: by default, it is equal to 10, which means that the
whole algorithm described earlier actually runs 10 times when you call fit(), and Scikit-Learn keeps
the best solution.
• But how exactly does it know which solution is the best?
• Well, of course, it uses a performance metric: the model's inertia.

K-Means - Centroid Initialization Methods
• Inertia is the sum of the squared distances between each instance and its closest centroid (this is what Scikit-Learn reports in inertia_).
• It is roughly equal to 223.3 for the model on the left of the figure and 237.5 for the model on the right.
• The KMeans class runs the algorithm n_init times and keeps the model with the
lowest inertia.
• If you are curious, a model’s inertia is accessible via the inertia_ instance variable:

>>> kmeans.inertia_
211.59853725816856
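
Since inertia is just the sum of squared distances from each instance to its assigned centroid, you can recompute it yourself; a small sketch (assuming X and the fitted kmeans model from above):

import numpy as np

# squared distance from each instance to the centroid of its own cluster
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centers) ** 2)
print(manual_inertia)   # should match kmeans.inertia_ up to floating-point error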

Finding the Optimal Number of Clusters
• So far, we have set the number of clusters k to 5 because it was obvious by looking at the data that
this is the correct number of clusters.
• But in general, it will not be so easy to know how to set k, and the result might be quite bad if you
set it to the wrong value.
• For example, as you can see in Figure, setting k to 3 or 8 results in fairly bad models:

Finding the Optimal Number of Clusters
• You might be thinking that we could just pick the model with the lowest inertia, right?
• Unfortunately, it is not that simple.
• The inertia for k=3 is 653.2, which is much higher than for k=5 (which was 211.6), but with k=8, the
inertia is just 119.1.
• The inertia is not a good performance metric when trying to choose k since it keeps getting lower as
we increase k.
• Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and
therefore the lower the inertia will be.
• Let’s plot the inertia as a function of k:
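
A minimal sketch of that plot (X is the blob dataset from before; the range of k values is our choice):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1, 10)]
plt.plot(range(1, 10), inertias, "o-")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()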

There is an “elbow” at k=4, so it would be a good choice: any lower value would be dramatic, while any higher value would not help much.
Limits of K-Means
• Despite its many merits, most notably being fast and scalable, K-Means is not perfect.
• As we saw, it is necessary to run the algorithm several times to avoid sub-optimal solutions, plus
you need to specify the number of clusters, which can be quite a hassle.
• Moreover, K-Means does not behave very well when the clusters have varying sizes, different
densities, or non-spherical shapes.
• For example, Figure shows how K-Means clusters a dataset containing three ellipsoidal clusters of
different dimensions, densities and orientations.
• As you can see, neither of these solutions is any good.
It is important to scale the input features before you run K-Means, or else the clusters may be very stretched, and K-Means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally improves things.
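One simple way to do this is to put a StandardScaler in front of KMeans in a Pipeline; a minimal sketch (the number of clusters here is illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# standardize the features first, then cluster the scaled data
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=3, random_state=42))
labels = pipeline.fit_predict(X)
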
K-Means Implementation
• We have the Mall_Customers dataset, which contains data on customers who visit a mall and spend money there.
• In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value indicating how much a customer spends in the mall; the higher the value, the more they have spent).
• From this dataset, we need to discover some patterns; since this is an unsupervised method, we don't know exactly what to look for.
• The steps to be followed for the implementation are given below:
• Data Pre-processing
• Finding the optimal number of clusters using the elbow method
• Training the K-means algorithm on the training dataset
• Visualizing the clusters

K-Means Implementation
Step-1: Data pre-processing
• We will import the libraries and dataset for our model, which is part of data pre-processing.

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

# extracting the Annual Income and Spending Score columns
x = dataset.iloc[:, [3, 4]].values

• As we can see, we are extracting only the Annual Income and Spending Score columns (indices 3 and 4).
• This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.
K-Means Implementation
Step-2: Finding the optimal number of clusters using the elbow method
• The Elbow method is one of the most popular ways to find the optimal number of clusters using the
concept of WCSS value.
• WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters.
• So, we are going to calculate the WCSS value for different k values ranging from 1 to 10.

#finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  # Initializing the list for the values of WCSS

# Using a for loop for k from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

• The resulting graph shows an elbow at k = 5, so the number of clusters will be 5.
K-Means Implementation
Step-3: Training the K-means algorithm on the training dataset
• To train the model, we will use the same two lines of code as we have used in the above section,
but here instead of using i, we will use 5, as we know there are 5 clusters that need to be formed.

#training the K-means model on a dataset


kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
y_predict= kmeans.fit_predict(x)

K-Means Implementation
Step-4: Visualizing the Clusters
• The last step is to visualize the clusters.
• As we have 5 clusters for our model, we will visualize each cluster one by one.
• To visualize the clusters, we will use a scatter plot via matplotlib's mtp.scatter() function.
#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict== 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

K-Means Implementation
Step-4: Visualizing the Clusters
• The output image clearly shows the five different clusters in different colors.
• The clusters are formed based on two parameters of the dataset: Annual Income and Spending Score.

Homework
(1)
• https://www.kaggle.com/kanncaa1/machine-learning-
tutorial-for-beginners
• Check K-Means clustering part, rewrite all codes and
create your own notebook.

(2)
• Find a new dataset from Kaggle and apply K-Means
clustering algorithm.

CPE316 - Introduction to
Machine Learning

Week 11
LAB.

Assoc. Prof. Dr. Caner ÖZCAN


K-Means Example

• Run the K-Means Python notebook and analyze the code.

References
• Ethem Alpaydın, Introduction to Machine Learning, 3rd ed., The MIT Press, 2014.
• Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2011.
• Tom Mitchell, Machine Learning, McGraw Hill, 1997.
• Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., Prentice Hall, 2009.
• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly Media, 2019.
• https://developers.google.com/machine-learning/crash-course/
• https://www.javatpoint.com/clustering-in-machine-learning
• https://people.revoledu.com/kardi/tutorial/kMean/NumericalExample.htm
• https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
