
LAB EXPERIMENT NO. 4
Name: Dhruv Jain
SAP ID: 60004190030
Div/Batch: A/A2
AIM:

Implementation of clustering algorithms using:

1. K-means clustering

2. Hierarchical Clustering (single/complete/average)

Perform the experiment in Python.

Read any dataset from UCI dataset repository

Part A:

Program using inbuilt functions.

Plot the clusters

Plot dendrogram (for hierarchical)

Part B:

Program the algorithm from scratch.

Plot the clusters.

Part C:

Find the optimum number of clusters using the elbow method.

Theory:

K-means clustering:
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e.,
data without defined categories or groups). The goal of this algorithm is to find groups in the data, with
the number of groups represented by the variable K. The algorithm works iteratively to assign each data
point to one of K groups based on the features that are provided. Data points are clustered based on
feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
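
As a minimal illustration of the iterative assign-and-update cycle, one K-means iteration can be sketched as below. This is a toy example with made-up 2-D points and K = 2, separate from the lab code that follows:

import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])  # toy 2-D points
centroids = X[:2].copy()                                         # pick K = 2 initial centroids

# Assignment step: each point goes to its nearest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(dists, axis=1)

# Update step: each centroid moves to the mean of its assigned points
for k in range(2):
    centroids[k] = X[labels == k].mean(axis=0)

Repeating these two steps until the assignments stop changing yields the final centroids and cluster labels.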

Hierarchical clustering:
Hierarchical clustering is a popular and easy-to-understand clustering technique. It is divided into
two types:
1. Agglomerative
2. Divisive
Agglomerative hierarchical clustering: In this technique, each data point initially forms its own
cluster. At every iteration, the most similar clusters are merged, and this continues until a single
cluster (or K clusters) remains.
The basic agglomerative algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix.
4. Until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; the single, complete, and
average linkage criteria differ only in how this proximity is defined (see the sketch below).
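
For reference, the linkage criterion is simply a parameter of SciPy's linkage function. A small sketch on made-up points (the data and the cut into 2 clusters are illustrative only):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])  # toy points

# 'single' = nearest-pair distance, 'complete' = farthest-pair, 'average' = mean pairwise distance
for method in ('single', 'complete', 'average'):
    Z = linkage(pts, method=method)
    print(method, fcluster(Z, t=2, criterion='maxclust'))  # cut the tree into 2 clusters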

Elbow Method:
In the elbow method, we vary the number of clusters (K), typically from 1 to 10. For each value of
K we calculate the WCSS (within-cluster sum of squares), i.e. the sum of squared distances between
each point and the centroid of its cluster. When WCSS is plotted against K, the curve resembles an
elbow: WCSS is largest at K = 1 and decreases as the number of clusters increases. The curve drops
rapidly up to a point and then flattens out, running almost parallel to the X-axis. The K value at
this bend is the optimal number of clusters.
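
In code, WCSS can be computed directly from the data, labels, and centroids. A small sketch (the function name is illustrative; scikit-learn exposes the same quantity as the fitted model's inertia_ attribute, which is what Part C uses):

import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: squared distances of each point to its cluster centroid."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))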

PART A & C
Code:

import matplotlib.pyplot as plt


import pandas as pd
import sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
# Load the Iris dataset, standardise the features and fit K-means with K = 3
iris = datasets.load_iris()
X = scale(iris.data)
y = pd.DataFrame(iris.target, columns=['Target'])
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
clustering = KMeans(n_clusters=3, random_state=5)
clustering.fit(X)
print(iris_df.head())
color_theme = np.array(['yellow', 'red', 'blue'])
plt.figure(facecolor='white')
plt.subplot(1,2,1)
plt.scatter(x=iris_df['petal length (cm)'], y=iris_df['petal width (cm)'],
            c=color_theme[iris.target], s=50)
plt.title('Ground Truth')

plt.subplot(1,2,2)
plt.scatter(x=iris_df['petal length (cm)'], y=iris_df['petal width (cm)'],
            c=color_theme[clustering.labels_], s=50)
plt.title('K-Means Clustering')
plt.show()
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(8, 8), facecolor='white')


plt.title('Visualising the data')
Dendrogram = shc.dendrogram((shc.linkage(X, method='complete')))

## Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering

clustering = AgglomerativeClustering(n_clusters=3, affinity='euclidean',
                                     linkage='complete')
clustering.fit(X)

color_theme = np.array(['yellow', 'red', 'blue'])


plt.figure(facecolor='white')
plt.subplot(1,2,1)
plt.scatter(x=iris_df['petal length (cm)'], y=iris_df['petal width (cm)'],
            c=color_theme[iris.target], s=50)
plt.title('Ground Truth')

plt.subplot(1,2,2)
plt.scatter(x=iris_df['petal length (cm)'], y=iris_df['petal width (cm)'],
            c=color_theme[clustering.labels_], s=50)

plt.title('Agglomerative Clustering')
plt.show()
# Elbow method: compute WCSS (inertia) for K = 1..6
wcss = []
for i in range(1, 7):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(iris_df)
    wcss.append(kmeans.inertia_)

number_clusters = range(1,7)
plt.figure(facecolor='white')
plt.plot(number_clusters,wcss)
plt.title('The Elbow curve')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Output:
Part B:
Code:
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import random
iris = datasets.load_iris()
X = iris.data
y = pd.DataFrame(iris.target)
variable = iris.feature_names
X = scale(iris.data)
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# K-means from scratch
m = X.shape[0]   # number of samples
n = X.shape[1]   # number of features
n_iter = 100     # number of iterations
K = 3            # number of clusters

# Initialise centroids by picking K random data points
Centroids = np.array([]).reshape(n, 0)
for i in range(K):
    rand = random.randint(0, m - 1)
    Centroids = np.c_[Centroids, X[rand]]

Output = {}
for i in range(n_iter):
    # Assignment step: squared distance of every point to every centroid
    EuclidianDistance = np.array([]).reshape(m, 0)
    for k in range(K):
        tempDist = np.sum((X - Centroids[:, k]) ** 2, axis=1)
        EuclidianDistance = np.c_[EuclidianDistance, tempDist]
    C = np.argmin(EuclidianDistance, axis=1) + 1   # cluster label (1..K) for each point

    # Regroup the points by their assigned cluster
    Y = {}
    for k in range(K):
        Y[k + 1] = np.array([]).reshape(n, 0)
    for j in range(m):
        Y[C[j]] = np.c_[Y[C[j]], X[j]]
    for k in range(K):
        Y[k + 1] = Y[k + 1].T

    # Update step: move each centroid to the mean of its cluster
    for k in range(K):
        Centroids[:, k] = np.mean(Y[k + 1], axis=0)
Output = Y

color_theme = np.array(['green', 'cyan', 'orange'])


plt.figure(facecolor='white')
plt.subplot(1,2,1)
plt.scatter(x=iris_df['petal length (cm)'], y=iris_df['petal width (cm)'],
            c=color_theme[iris.target], s=50)
plt.title('Ground Truth')

plt.subplot(1,2,2)
plt.scatter(x=iris_df['petal length (cm)'], y=iris_df['petal width (cm)'],
            c=color_theme[C-1], s=50)
plt.title('K means from scratch')
plt.show()
Output:
Conclusion:

When choosing a clustering algorithm, you should consider whether the algorithm scales to your
dataset. Datasets in machine learning can have millions of examples, but not all clustering
algorithms scale efficiently. Many clustering algorithms work by computing the similarity between
all pairs of examples, so their runtime grows as the square of the number of examples n, denoted
O(n^2) in complexity notation. O(n^2) algorithms are not practical when the number of examples is
in the millions. This experiment focuses on the k-means algorithm, which has a complexity of O(n),
meaning that it scales linearly with the number of examples n.
