21MCS0062 VL2021220506181 Ast05

CSE6005 - Machine Learning

Lab Assessment - 5
Name: Hariharan K
Reg.No.: 21MCS0062

Question:
In this exercise, you are required to implement the k-means algorithm and apply it to a
real-life data set.
Input:
The provided input file ("place.txt") consists of the locations of 300 places in the US.
Each location is a two-dimensional point that represents the longitude and latitude of
the place. For example, "-112.1,33.5" means the longitude of the place is -112.1, and
the latitude is 33.5.
Output:
You are required to implement the k-means algorithm and use it to cluster the 300
locations into three clusters, such that the locations in the same cluster are
geographically close to each other.
After reading in the 300 locations in "place.txt" and applying the k-means algorithm
(with k = 3), you are required to generate an output file named "clusters.txt". The
output file should contain exactly 300 lines, where each line represents the cluster
label of each location. Every line should be in the format: location_id cluster_label.

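1. Read the input file place.txt and build the longitude and latitude lists:

long = []
lat = []
# Each line has the form "longitude,latitude", e.g. "-112.1,33.5"
with open('place.txt', 'r') as fh:
    for line in fh:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        nums = line.split(',')
        # Store numeric values so the clustering works on floats, not strings
        long.append(float(nums[0]))
        lat.append(float(nums[1]))
print(len(long))
print(len(lat))
print(long[:5])
print(lat[:5])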

2. Create the data dictionary and initialize the data frame from the lists:

import pandas as pd

data = {'longitude': long, 'latitude': lat}
df = pd.DataFrame(data, columns=['longitude', 'latitude'])
df.head(10)
df.shape
df.info()

3. Importing required libraries for visualizing k-means clustering:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns; sns.set()

4. Visualize the optimal k with the elbow method:

sse = []
K = range(1, 10)
for k in K:
    # Fit k-means for each candidate k and record its inertia (SSE)
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    sse.append(kmeanModel.inertia_)
plt.figure(figsize=(15, 6))
plt.plot(K, sse, 'bx-')
plt.xlabel('Number of Clusters (k)', fontsize=15)
plt.ylabel('Sum of Squared Error', fontsize=15)
plt.title('The Elbow Method showing the optimal k', fontsize=20)
plt.show()
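
The "Sum of Squared Error" on the y-axis is the quantity scikit-learn exposes as inertia_: the sum of squared Euclidean distances from each point to its nearest cluster center. A minimal sketch in plain Python shows what is being plotted. The helper name sum_squared_error and the toy points are assumptions made for illustration and are not taken from place.txt.

```python
def sum_squared_error(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for px, py in points:
        total += min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
    return total

# Toy data (hypothetical): two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One center in the middle versus one center per group.
sse_k1 = sum_squared_error(points, [(5.0, 1.0)])
sse_k2 = sum_squared_error(points, [(0.0, 1.0), (10.0, 1.0)])
print(sse_k1, sse_k2)
```

The error drops sharply once each group has its own center. That drop is what produces the "elbow" in the plot.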

5. Measure the silhouette score for each number of clusters:

from sklearn.metrics import silhouette_score
X = df[['latitude', 'longitude']]
print("Clusters\tSilhouette Score\n")
for n_cluster in range(2, 11):
    kmeans = KMeans(n_clusters=n_cluster).fit(X)
    label = kmeans.labels_
    sil_coeff = silhouette_score(X, label, metric='euclidean')
    print("k = {} \t--> \t{}".format(n_cluster, sil_coeff))
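
For intuition about what the score measures, the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes the mean silhouette in plain Python. The function name silhouette and the toy points are assumptions for illustration. Singleton clusters are scored 0, which is the usual convention.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): two groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

A labelling that follows the natural groups scores close to 1, while one that cuts across them scores below 0. This is why the k with the highest score is preferred.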
6. Taking the number of clusters with the maximum silhouette score (k = 3) and
running k-means clustering with verbose set to 1:
kmeans = KMeans(n_clusters = 3, init ='k-means++', verbose = 1)
kmeans.fit(df)
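
Since the question asks for an implementation of k-means, a from-scratch version may be useful alongside the scikit-learn call. The following is a minimal sketch of Lloyd's algorithm in plain Python, not the scikit-learn implementation. It assumes random initial centers (not k-means++). The function name kmeans_scratch, its parameters, and the sample coordinates are hypothetical.

```python
import math
import random

def kmeans_scratch(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm on a list of tuples; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers, not k-means++
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: give each point the index of its nearest center.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers

# Made-up (longitude, latitude) pairs for illustration only.
points = [
    (-112.1, 33.5), (-112.0, 33.4), (-111.9, 33.6),
    (-80.2, 25.8), (-80.3, 25.7), (-80.1, 25.9),
    (-73.9, 40.7), (-74.0, 40.8), (-74.1, 40.6),
]
labels, centers = kmeans_scratch(points, k=3)
print(labels)
print(centers)
```

With random initialization the result can depend on the seed and may settle in a local optimum. Running several seeds and keeping the run with the lowest sum of squared error is a common remedy.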

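7. Adding the cluster label column to the data set:

# Fit on the coordinate columns only, so re-running this cell never
# feeds the label column back into the clustering
df['cluster_label'] = kmeans.fit_predict(df[['longitude', 'latitude']])
df.head(10)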
8. Calculating the locations of the cluster centers:
print("Locations of centers of the clusters: ")
centers = kmeans.cluster_centers_
print(centers)
# Count the locations per cluster (100 in each cluster for this data set)
print("Number of locations in each cluster: ")
df['cluster_label'].value_counts()

9. Checking the number of rows in the cluster label column:

print("Number of rows in Cluster-labels: ", df['cluster_label'].shape)
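10. Generating the output file clusters.txt:
# One "location_id cluster_label" line per location; the row index serves as location_id
df['cluster_label'].to_csv('clusters.txt', sep=' ', header=False)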
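
The question asks for a file named clusters.txt with one "location_id cluster_label" line per location. A minimal sketch of that format in plain Python follows. It assumes zero-based row positions serve as location ids, since the question does not say where ids start. The helper write_clusters and the sample labels are hypothetical.

```python
import os
import tempfile

def write_clusters(labels, path):
    """Write one 'location_id cluster_label' line per location."""
    with open(path, "w") as out:
        for location_id, label in enumerate(labels):
            out.write(f"{location_id} {label}\n")

# Hypothetical labels standing in for the values of df['cluster_label'].
labels = [0, 2, 1, 1, 0]

# Write to a temporary directory so the sketch does not overwrite real output.
path = os.path.join(tempfile.mkdtemp(), "clusters.txt")
write_clusters(labels, path)

# Read the file back to check the line count and the two-field format.
with open(path) as fh:
    lines = fh.read().splitlines()
print(len(lines), lines[:3])
```

For the actual data set, the same check should report exactly 300 lines.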
