Week 11
Machine Learning
Week 11
Unsupervised Learning
and Clustering
~ Janice Dean
Machine Learning Glossary
• https://developers.google.com/machine-learning/glossary#model
• https://scikit-learn.org/stable/glossary.html
Unsupervised Learning
• Last week, we looked at the most common unsupervised learning task:
dimensionality reduction.
• This week, we will look at a few more unsupervised learning tasks and
algorithms:
• Clustering: the goal is to group similar instances together into clusters. This is a great
tool for data analysis, customer segmentation, recommender systems, search engines,
image segmentation, semi-supervised learning, dimensionality reduction, and more.
• Anomaly detection: the objective is to learn what “normal” data looks like, and use
this to detect abnormal instances, such as defective items on a production line or a
new trend in a time series.
• Density estimation: this is the task of estimating the probability density function (PDF)
of the random process that generated the dataset. This is commonly used for anomaly
detection: instances located in very low-density regions are likely to be anomalies. It is
also useful for data analysis and visualization.
Clustering
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabeled
dataset.
• It can be defined as a way of grouping data points into clusters of similar data points:
objects that are similar to each other stay in the same group, and have little or no
similarity to the objects in other groups.
• It is an unsupervised learning method: the algorithm receives no supervision
and works with an unlabeled dataset.
• Clustering is the task of identifying similar instances and assigning them to clusters, i.e.,
groups of similar instances.
• Consider Figure below: on the left is the iris dataset, where each instance’s species (i.e., its
class) is represented with a different marker. It is a labeled dataset, for which classification
algorithms such as Logistic Regression, SVMs or Random Forest classifiers are well suited.
• On the right is the same dataset, but without the labels, so you cannot use a classification
algorithm anymore.
• This is where clustering algorithms step in: many of them can easily detect the top left
cluster.
• It is also quite easy to see with our own eyes, but it is not so obvious that the upper right
cluster is actually composed of two distinct sub-clusters.
• The dataset actually has two additional features (sepal length and width), not represented
here, and clustering algorithms can make good use of all features, so in fact they identify
the three clusters fairly well.
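• A minimal sketch of the point above, assuming scikit-learn's bundled copy of the iris data and K-Means as the clustering algorithm (the slides do not say which algorithm produced the figure). It clusters on all four features without the labels, then uses the species only afterwards to inspect how the clusters line up with the classes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X_iris = iris.data          # 150 instances, 4 features (sepal/petal length and width)

# Cluster using all four features; the species labels are NOT given to the algorithm.
iris_kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
iris_clusters = iris_kmeans.fit_predict(X_iris)

# Use the known species afterwards, only to see how clusters line up with classes.
# Rows: species, columns: cluster index (the cluster numbering is arbitrary).
contingency = np.zeros((3, 3), dtype=int)
for species, cluster in zip(iris.target, iris_clusters):
    contingency[species, cluster] += 1
print(contingency)
```

• Each row of the printed table shows how one species is spread over the three clusters; a row dominated by a single column means that species was recovered as its own cluster.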
Clustering Applications
• Customer segmentation: You can cluster your customers based on their purchases, their
activity on your website, and so on.
• Data analysis: When analyzing a new dataset, it is often useful to first discover clusters of
similar instances, as it is often easier to analyze clusters separately.
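• Dimensionality reduction technique: Once a dataset has been clustered, it is usually
possible to measure each instance’s affinity with each cluster (how well the instance fits
into that cluster). Each instance’s feature vector can then be replaced with the vector of
its k cluster affinities.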
• Anomaly detection (outlier detection): Any instance that has a low affinity to all the
clusters is likely to be an anomaly.
• Semi-supervised learning: If you only have a few labels, you could perform clustering and
propagate the labels to all the instances in the same cluster.
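• A small sketch of the "affinity" idea used in the dimensionality reduction and anomaly detection items above. It assumes scikit-learn's KMeans and takes the distance to each centroid as the (inverse) affinity; the data, the variable names, and the 99th-percentile cut-off are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption, not the lecture's dataset): 3 blobs in 2-D.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# transform() returns the distance from every instance to every centroid,
# i.e. a k-dimensional representation of each instance.
distances = demo_kmeans.transform(X_demo)        # shape (300, 3)

# Distance to the nearest centroid; large values mean low affinity to all clusters.
nearest = distances.min(axis=1)
threshold = np.percentile(nearest, 99)           # hypothetical cut-off: top 1%
suspects = np.where(nearest > threshold)[0]      # indices of candidate anomalies
print(distances.shape, len(suspects))
```

• With only two input features, the three distances are not a reduction; the representation becomes a reduction when the number of clusters is smaller than the number of features.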
K-Means
• K-Means clustering is an unsupervised learning algorithm that groups an unlabeled
dataset into different clusters.
• Here K is the number of clusters to create, chosen in advance: with K=2 there will be
two clusters, with K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into K clusters in
such a way that each data point belongs to exactly one group of points with similar properties.
• It allows us to cluster the data into different groups without the need for any labeled training data.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of squared distances between the data points
and the centroids of their clusters.
• The k-means clustering algorithm mainly performs two tasks:
• Determines the best positions for the K center points (centroids) by an iterative process.
• Assigns each data point to its closest centroid. The data points nearest to a particular centroid
form a cluster.
K-Means Algorithm Steps
The K-Means algorithm works as follows:
• Step-1: Choose the number of clusters, K.
• Step-2: Select K random points as the initial centroids. (They do not have to be
points from the input dataset.)
• Step-3: Assign each data point to its closest centroid, which forms the
K clusters.
• Step-4: Compute the mean of the points in each cluster and move that cluster's centroid there.
• Step-5: Repeat step 3, that is, reassign each data point to the new
closest centroid.
• Step-6: If any reassignment occurred, go to step 4; otherwise go to FINISH.
• Step-7: The model is ready.
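• The steps above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not a reference one: the function name k_means, the seed, and the toy data are made up, the initial centroids are drawn from the data points, and an empty cluster simply keeps its old centroid.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain K-Means (Lloyd's algorithm). Returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Tiny made-up dataset: two well-separated groups of three points each.
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
toy_centroids, toy_labels = k_means(X_toy, k=2)
print(toy_labels)
print(toy_centroids)
```

• The cluster indices (0 or 1) are arbitrary; what matters is that the first three points share one index and the last three share the other.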
• Suppose we have two variables, M1 and M2. The x-y scatter plot of
these two variables is given below:
• Let's take K=2 clusters, i.e., we will try to split the dataset into
two clusters.
• Now we will assign each data point of the scatter plot to its
closest K-point or centroid.
• We compute this using the formula for the distance between
two points.
• To do so, we draw the perpendicular bisector of the segment joining the two centroids.
• From the upper right image, it is clear that the points to the left of the
line are nearer to K1, the blue centroid, and the points to the right of
the line are closer to the yellow centroid.
• Let's color them blue and yellow for clear visualization.
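• For K=2, "which side of the dividing line" and "which centroid is nearer" are the same test. A short sketch that checks this numerically; the centroid coordinates and the random points are hypothetical, chosen only for illustration.

```python
import numpy as np

c1 = np.array([2.0, 6.0])    # hypothetical "blue" centroid K1
c2 = np.array([7.0, 3.0])    # hypothetical "yellow" centroid K2

rng = np.random.default_rng(1)
points = rng.uniform(0, 10, size=(200, 2))   # made-up data in the M1-M2 plane

# Rule 1: compare the Euclidean distances to the two centroids.
d1 = np.linalg.norm(points - c1, axis=1)
d2 = np.linalg.norm(points - c2, axis=1)
closer_to_c2 = d2 < d1

# Rule 2: check on which side of the perpendicular bisector each point lies.
# The bisector passes through the midpoint and is perpendicular to (c2 - c1).
midpoint = (c1 + c2) / 2
side = (points - midpoint) @ (c2 - c1)       # > 0 means the c2 side of the line
on_c2_side = side > 0

rules_agree = np.array_equal(closer_to_c2, on_c2_side)
print(rules_agree)
```

• The two rules agree because the difference of the squared distances to c1 and c2 equals 2 (p - midpoint) · (c2 - c1) for any point p.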
• From the above image, we can see that one yellow point is on the
left side of the line, and two blue points are to the right of the line.
• So, these three points will be assigned to the other centroid.
• Since the assignments changed, we recompute the centroids, draw the new dividing line,
and reassign the data points.
• So, the image will be the one on the right:
• In the upper right image, there are no points on the wrong
side of the line, which means our model is
formed.
• Consider the right image:
• As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the image below:
The Numerical Example of K-Means
• Suppose we have several objects, each with two attributes (features), as shown
in the table below. Our goal is to group these objects into K=2
groups based on the two features, width and height.
• Each object represents one point with two attributes (X, Y), which we can
represent as coordinates in an attribute space, as shown in the figure below.
Object   Width (X)   Height (Y)
A        1           1
B        2           1
C        4           3
D        5           4
1. Initial value of centroids: Suppose we use objects A and B as the first centroids,
so the centroids start at (1, 1) and (2, 1).
2. Objects-centroids distance: We calculate the distance from each cluster
centroid to each object, using the Euclidean distance. For example, object C = (4, 3) is at
distance sqrt((4 - 1)^2 + (3 - 1)^2) = 3.61 from the first centroid (1, 1) and
sqrt((4 - 2)^2 + (3 - 1)^2) = 2.83 from the second centroid (2, 1). Doing this for every
object gives the distance matrix at iteration 0.
3. Objects clustering: We assign each object to the cluster with the nearest centroid.
Thus, object A is assigned to Cluster1, and objects B, C and D are assigned
to Cluster2.
Distances at iteration 0 (centroids (1, 1) and (2, 1)):
Object   Distance from Cluster1   Distance from Cluster2   Assigned to
A(1,1)   0                        1                        Cluster1
B(2,1)   1                        0                        Cluster2
C(4,3)   3.61                     2.83                     Cluster2
D(5,4)   5                        4.24                     Cluster2
4. Iteration 1, determine centroids: Knowing the members of each cluster, we now
compute the new centroid of each cluster from these memberships.
Cluster1 has only one member, A, so its centroid stays at (1, 1).
Cluster2 now has three members, so its centroid is the average of their
coordinates:
Center of Cluster2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (3.67, 2.67)
5. Iteration 1, objects-centroids distance: Again, we calculate the distance
from each cluster centroid to each object.
Center of Cluster1: (1, 1)
Center of Cluster2: (3.67, 2.67)
6. Iteration 1, objects clustering: As in step 3, we assign each object based
on the minimum distance. Based on the new distance matrix, we move object
B to Cluster1 while all the other objects stay where they are, so Cluster1 = {A, B}
and Cluster2 = {C, D}.
Distances at iteration 1 (centroids (1, 1) and (3.67, 2.67)):
Object   Distance from Cluster1   Distance from Cluster2   Assigned to
A(1,1)   0                        3.14                     Cluster1
B(2,1)   1                        2.36                     Cluster1
C(4,3)   3.61                     0.47                     Cluster2
D(5,4)   5                        1.89                     Cluster2
7. Iteration 2, determine centroids: We repeat step 4 to calculate the new
centroid coordinates based on the clustering of the previous iteration. Cluster1 and
Cluster2 both have two members, so the new centroids are the averages
((1 + 2)/2, (1 + 1)/2) = (1.5, 1) and ((4 + 5)/2, (3 + 4)/2) = (4.5, 3.5).
8. Iteration 2, objects-centroids distance: Again, we calculate the distance
from each cluster centroid to each object.
Center of Cluster1: (1.5, 1)
Center of Cluster2: (4.5, 3.5)
9. Iteration 2, objects clustering: Again, we assign each object based on the
minimum distance.
Distances at iteration 2 (centroids (1.5, 1) and (4.5, 3.5)):
Object   Distance from Cluster1   Distance from Cluster2   Assigned to
A(1,1)   0.5                      4.30                     Cluster1
B(2,1)   0.5                      3.54                     Cluster1
C(4,3)   3.20                     0.71                     Cluster2
D(5,4)   4.61                     0.71                     Cluster2
Comparing the clustering of the last iteration with this one shows that no object
changes group anymore.
Thus, the k-means computation has stabilized and no
more iterations are needed. This grouping is the final result.
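• The whole numerical example can be re-run with a short NumPy sketch. It uses the four objects and the initial centroids A and B from the example; the variable names and the printing format are illustrative choices.

```python
import numpy as np

# The four objects from the table: (width, height).
names = ["A", "B", "C", "D"]
X_obj = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)

# Initial centroids: objects A and B, as in step 1.
centroids = X_obj[[0, 1]].copy()
labels = None
history = []                     # one (centroids, distances, labels) entry per pass

while True:
    # Distance of every object to each centroid: shape (4 objects, 2 centroids).
    dist = np.linalg.norm(X_obj[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dist.argmin(axis=1)
    history.append((centroids.copy(), dist.round(2), new_labels.copy()))
    if labels is not None and np.array_equal(new_labels, labels):
        break                    # no object changed group: converged
    labels = new_labels
    # New centroid = mean of the members of each cluster.
    centroids = np.array([X_obj[labels == j].mean(axis=0) for j in range(2)])

for i, (cents, dists, labs) in enumerate(history):
    print(f"iteration {i}: centroids={cents.round(2).tolist()}")
    for name, row, lab in zip(names, dists, labs):
        print(f"  {name}: d1={row[0]:.2f} d2={row[1]:.2f} -> Cluster{lab + 1}")
```

• The loop makes three passes (iterations 0, 1 and 2) and stops with centroids (1.5, 1) and (4.5, 3.5), matching the hand computation.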
K-Means
• The K-Means algorithm is a simple algorithm capable of clustering this kind of dataset very
quickly and efficiently, often in just a few iterations.
• Consider the unlabeled dataset represented in Figure below: you can clearly see 5 blobs of
instances.
Figure: An unlabeled dataset composed of five blobs of instances
• Let’s train a K-Means clusterer on this dataset.
• It will try to find each blob’s center and assign each instance to the closest blob:
from sklearn.cluster import KMeans
k = 5
kmeans = KMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
• Note that you have to specify the number of clusters k that the algorithm must find.
• In this example, it is pretty obvious from looking at the data that k should be set to 5, but
in general it is not that easy.
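• The snippet above assumes that X already exists. A self-contained sketch, using scikit-learn's make_blobs to generate stand-in data with five blobs; the blob centers, spreads, sample count and random seeds are hypothetical values, not the lecture's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical blob centers and spreads, chosen only to give five visible groups.
blob_centers = np.array([[0.2, 2.3], [-1.5, 2.3], [-2.8, 1.8],
                         [-2.8, 2.8], [-2.8, 1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, _ = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

k = 5
# n_init and random_state are set explicitly so the run is reproducible.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)

print(X.shape)                        # one row per instance, two features
print(np.bincount(y_pred))            # number of instances in each cluster
print(kmeans.cluster_centers_.shape)  # one centroid per cluster
```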
• Each instance was assigned to one of the 5 clusters.
• In the context of clustering, an instance’s label is the index of the cluster that this instance gets
assigned to by the algorithm: this is not to be confused with the class labels in classification.
• The KMeans instance preserves a copy of the labels of the instances it was trained on, available via
the labels_ instance variable:
>>> y_pred
array([4, 0, 1, ..., 2, 1, 0], dtype=int32)
>>> y_pred is kmeans.labels_
True
• We can also take a look at the 5 centroids that the algorithm found:
>>> kmeans.cluster_centers_
array([[-2.80389616, 1.80117999],
[ 0.20876306, 2.25551336],
[-2.79290307, 2.79641063],
[-1.46679593, 2.28585348],
[-2.80037642, 1.30082566]])
• Of course, you can easily assign new instances to the cluster whose centroid is closest:
>>> import numpy as np
>>> X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
>>> kmeans.predict(X_new)
array([1, 1, 2, 2], dtype=int32)
• If you plot the clusters’ decision boundaries, you get a Voronoi diagram (see the figure, where each
centroid is represented with an X). In mathematics, a Voronoi diagram is a partition of a plane into
regions close to each of a given set of objects.
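• Assigning a new instance to a cluster is a nearest-centroid lookup, and the set of points that share the same nearest centroid is that centroid's Voronoi cell. A NumPy sketch that reproduces the assignment by hand, using the centroid values printed earlier; the variable names are illustrative.

```python
import numpy as np

# Centroids copied from the cluster_centers_ output shown earlier.
centers = np.array([[-2.80389616, 1.80117999],
                    [ 0.20876306, 2.25551336],
                    [-2.79290307, 2.79641063],
                    [-1.46679593, 2.28585348],
                    [-2.80037642, 1.30082566]])
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])

# Distance from each new instance to each centroid: shape (4, 5).
dist_to_centers = np.linalg.norm(X_new[:, None, :] - centers[None, :, :], axis=2)

# Hard assignment: index of the nearest centroid (the Voronoi cell the point falls in).
nearest_center = dist_to_centers.argmin(axis=1)
print(nearest_center)
```

• The result is [1 1 2 2], the same cluster indices that kmeans.predict(X_new) returned.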
The K-Means Algorithm
• Start by placing the centroids randomly (e.g., by
picking k instances at random and using their
locations as centroids).
• Then label the instances, update the centroids, label
the instances, update the centroids, and so on until
the centroids stop moving.
• The algorithm is guaranteed to converge in a finite
number of steps (usually quite small); it will not
oscillate forever.
• You can see the algorithm in action in Figure: the
centroids are initialized randomly (top left), then the
instances are labeled (top right), then the centroids
are updated (center left), the instances are relabeled
(center right), and so on.
• As you can see, in just 3 iterations the algorithm has
reached a clustering that seems close to optimal.
• Unfortunately, although the algorithm is guaranteed to converge, it may not converge to the right
solution (i.e., it may converge to a local optimum): this depends on the centroid initialization.
• For example, Figure below shows two sub-optimal solutions that the algorithm can converge to if
you are not lucky with the random initialization step:
K-Means - Centroid Initialization Methods
• If you happen to know approximately where the centroids should be (e.g., if you ran another
clustering algorithm earlier), then you can set the init hyperparameter to a NumPy array containing
the list of centroids, and set n_init to 1:
good_init = np.array([[-3, 3], [-3, 2], [-3, 1], [-1, 2], [0, 2]])
kmeans = KMeans(n_clusters=5, init=good_init, n_init=1)
• Another solution is to run the algorithm multiple times with different random initializations and
keep the best solution.
• This is controlled by the n_init hyperparameter: in scikit-learn versions before 1.4 it defaults to 10,
which means that the whole algorithm described earlier actually runs 10 times when you call fit(), and
Scikit-Learn keeps the best solution. (Newer versions default to n_init="auto", so set n_init=10
explicitly to get this behavior.)
• But how exactly does it know which solution is the best?
• It uses a performance metric: the model’s inertia.
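• What n_init automates can be written out by hand: fit the model several times from different random initializations and keep the run with the lowest inertia. A sketch under stated assumptions; the stand-in data, the seeds 0 to 9, and the variable names are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (an assumption, not the lecture's dataset): five blobs.
X_blobs, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.8, random_state=3)

# Run K-Means once per seed, each time with a single purely random initialization.
runs = []
for seed in range(10):
    model = KMeans(n_clusters=5, init="random", n_init=1, random_state=seed)
    model.fit(X_blobs)
    runs.append(model)

# Keep the run with the lowest inertia.
best = min(runs, key=lambda m: m.inertia_)
for seed, model in enumerate(runs):
    print(seed, round(model.inertia_, 1))
print("best inertia:", round(best.inertia_, 1))
```

• If the printed inertias differ between seeds, some runs ended in a local optimum, which is the reason for keeping only the best of several runs.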
• Inertia is the sum of the squared distances between each instance and its closest
centroid.
• It is roughly equal to 223.3 for the model on the left of the figure of sub-optimal solutions and 237.5 for the
model on the right of that figure.
• The KMeans class runs the algorithm n_init times and keeps the model with the
lowest inertia.
• If you are curious, a model’s inertia is accessible via the inertia_ instance variable:
>>> kmeans.inertia_
211.59853725816856
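• In scikit-learn, inertia_ is the sum of the squared distances from each instance to the centroid of its cluster, which can be checked by recomputing it by hand. A sketch on stand-in data; the dataset and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (an assumption, not the lecture's dataset).
X_in, _ = make_blobs(n_samples=500, centers=4, random_state=1)
km = KMeans(n_clusters=4, n_init=10, random_state=1).fit(X_in)

# Squared distance from each instance to the centroid of its assigned cluster.
assigned_centers = km.cluster_centers_[km.labels_]
sq_dist = ((X_in - assigned_centers) ** 2).sum(axis=1)

manual_inertia = sq_dist.sum()       # sum, not mean, of the squared distances
print(manual_inertia, km.inertia_)
print(km.score(X_in))                # score() returns the negative inertia
```

• score() is negated because scikit-learn follows a "greater is better" convention for scores.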
Finding the Optimal Number of Clusters
• So far, we have set the number of clusters k to 5 because it was obvious by looking at the data that
this is the correct number of clusters.
• But in general, it will not be so easy to know how to set k, and the result might be quite bad if you
set it to the wrong value.
• For example, as you can see in Figure, setting k to 3 or 8 results in fairly bad models:
• You might be thinking that we could just pick the model with the lowest inertia, right?
• Unfortunately, it is not that simple.
• The inertia for k=3 is 653.2, which is much higher than for k=5 (which was 211.6), but with k=8, the
inertia is just 119.1.
• The inertia is not a good performance metric when trying to choose k since it keeps getting lower as
we increase k.
• Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and
therefore the lower the inertia will be.
• Let’s plot the inertia as a function of k:
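• A sketch of how such a plot could be produced, on stand-in data rather than the lecture's dataset. The range of k, the seeds, and the output file name are hypothetical; the Agg backend is selected so the figure is written to a file instead of opening a window.

```python
import matplotlib
matplotlib.use("Agg")                # render to a file; no display needed
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data (an assumption): five blobs.
X_k, _ = make_blobs(n_samples=1000, centers=5, cluster_std=0.7, random_state=5)

# Fit one model per value of k and record its inertia.
ks = list(range(1, 10))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_k).inertia_
            for k in ks]

# Plot inertia as a function of k.
fig, ax = plt.subplots(figsize=(6, 3.5))
ax.plot(ks, inertias, "bo-")
ax.set_xlabel("k")
ax.set_ylabel("Inertia")
fig.savefig("inertia_vs_k.png")      # hypothetical output file name
```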
K-Means Implementation
Step-1: Data pre-processing
• We will import the libraries and dataset for our model, which is part of data pre-processing.
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
Step-4: Visualizing the Clusters
• The last step is to visualize the clusters.
• As our model has 5 clusters, we will visualize each cluster one by one.
• To visualize the clusters, we will draw a scatter plot using the mtp.scatter() function of matplotlib.
# visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
• The output image clearly shows the five different clusters in different colors.
• The clusters are formed from two features of the dataset: the customer's annual income and
spending score.
Homework
(1)
• https://www.kaggle.com/kanncaa1/machine-learning-tutorial-for-beginners
• Check the K-Means clustering part, rewrite all the code, and
create your own notebook.
(2)
• Find a new dataset on Kaggle and apply the K-Means
clustering algorithm.
CPE316 - Introduction to
Machine Learning
Week 11
LAB.