
Ex. No: 07
DATE:
Data and Text Clustering using the K-means Algorithm

Objective:

The objective of clustering electricity consumers using the K-means algorithm is to identify homogeneous groups based on their electricity consumption patterns. This facilitates targeted strategies for energy management, such as implementing tailored conservation initiatives, optimizing resource allocation, and designing personalized pricing plans, ultimately enhancing overall efficiency and customer satisfaction in the energy sector.

Hardware and Software:

Hardware Specification:
Device name: MYPC-8757
Processor: Intel(R) Core(TM) i5-10210U CPU @ 1.60 GHz (2.11 GHz)
Installed RAM: 8.00 GB (7.84 GB usable)

Software Specification:
Python 3.12 with pandas, numpy, matplotlib, and scikit-learn

Algorithm:
The program implements the K-means clustering algorithm to group used cars based on their year, price, and mileage. K-means is an unsupervised machine learning algorithm that partitions similar data points into clusters.

1. Import the required libraries: pandas, numpy, matplotlib.pyplot, KMeans from sklearn.cluster, and StandardScaler from sklearn.preprocessing.
2. Load the dataset from the URL: "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv".
3. Preprocess the data:


   a. Select the relevant numeric features for clustering: 'year', 'price', and 'mileage'.
   b. Standardize the features using StandardScaler so that each feature contributes equally to the distance computation (see the sketch after this list).
4. Determine the optimal number of clusters using the Elbow Method:
   a. Initialize an empty list 'wcss' to store the Within-Cluster Sum of Squares (WCSS) values.
   b. Iterate over a range of cluster counts (1 to 10 in this case).
   c. For each number of clusters:
      i. Create a KMeans object with the specified number of clusters and other parameters (init='k-means++', max_iter=300, n_init=10, random_state=0).
      ii. Fit the KMeans object to the scaled data.
      iii. Append the WCSS value (the inertia_ attribute) to the 'wcss' list.
   d. Plot the Elbow Method graph with the number of clusters on the x-axis and WCSS on the y-axis.
5. Based on the Elbow Method graph, choose the optimal number of clusters (k=3 in this case).
6. Perform K-means clustering with the chosen number of clusters:
   a. Create a KMeans object with the specified number of clusters and the same parameters.
   b. Fit the KMeans object to the scaled data and obtain the cluster labels.
7. Add the cluster labels to the original dataset as a new column 'Cluster'.
8. Visualize the clusters:
   a. Create a scatter plot of the scaled 'year' and 'price' features, colored by cluster label.
   b. Plot the cluster centroids (the cluster_centers_ attribute) as red points.
   c. Add labels and a title to the plot.
9. Print the cluster centers by inverse-transforming the cluster_centers_ attribute with the StandardScaler.
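For concreteness, here is a minimal sketch of the standardization in step 3(b), using a few illustrative rows (the exact values are hypothetical, not the full dataset). StandardScaler rescales each feature to zero mean and unit variance, z = (x - mean) / std, so that 'price' and 'mileage', whose raw magnitudes are much larger, do not dominate 'year' in the Euclidean distances K-means uses:

import numpy as np

# Illustrative rows only: year, price, mileage
X = np.array([[2011.0, 21992.0, 7413.0],
              [2007.0, 10451.0, 65617.0],
              [2002.0, 6268.0, 102230.0]])
mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0), matching StandardScaler
X_scaled = (X - mean) / std  # equivalent to StandardScaler().fit_transform(X)
print(X_scaled)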

The link to the dataset used in the program is:
https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv

The K-means algorithm is an iterative algorithm that assigns data points to clusters based on their proximity to the cluster centroids. The algorithm aims to minimize the within-cluster sum of squares (WCSS), which is the sum of squared distances between each data point and its assigned cluster centroid. The Elbow Method is a heuristic technique used to determine the optimal number of clusters by plotting the WCSS values against the number of clusters and choosing the "elbow" point where the WCSS starts to level off.
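To make the WCSS definition concrete, the sketch below recomputes it by hand from a fitted model and compares it with scikit-learn's inertia_ attribute, which stores exactly this quantity (the synthetic data and variable names here are illustrative, not part of the experiment):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))  # stand-in for the scaled feature matrix

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(points)

# WCSS: sum of squared distances from each point to its assigned centroid
wcss = sum(np.sum((points[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
           for j in range(3))
print(wcss, km.inertia_)  # the two values agree up to floating-point rounding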


PROGRAM:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset
url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())

# Select relevant numeric features for clustering
X = data[['year', 'price', 'mileage']]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal number of clusters using the Elbow Method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')  # Within-Cluster Sum of Squares
plt.show()

# Based on the Elbow Method, choose the optimal number of clusters
# and perform clustering
k = 3  # Change this to the chosen number of clusters
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Add the cluster labels to the dataset
data['Cluster'] = clusters

# Visualize the clusters
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', label='Centroids')
plt.title('Clusters of Used Cars')
plt.xlabel('Year (Scaled)')
plt.ylabel('Price (Scaled)')
plt.legend()
plt.show()

# Print the cluster centers
print("Cluster Centers:")
print(scaler.inverse_transform(kmeans.cluster_centers_))
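As a small optional follow-up (not part of the original listing), the 'Cluster' column added in step 7 also lets pandas summarize each cluster directly in original units, which complements the inverse-transformed centers printed above:

# Per-cluster feature means computed from the labelled dataset
print(data.groupby('Cluster')[['year', 'price', 'mileage']].mean())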


OUTPUT

   year model  price  mileage   color transmission
0  2011   SEL  21992     7413  Yellow         AUTO
1  2011   SEL  20995    10926    Gray         AUTO
2  2011   SEL  19995     7351  Silver         AUTO
3  2011   SEL  17809    11613    Gray         AUTO
4  2012    SE  17500     8367   White         AUTO

Cluster Centers:
[[  2009.83673469  14720.68367347  29192.26530612]
 [  2007.64285714  10451.9047619   65617.80952381]
 [  2002.4          6268.3        102230.7       ]]


GRAPH:
(Figures: Elbow Method plot of WCSS versus number of clusters, and scatter plot of the clusters over scaled year and price with red centroids, as produced by the program above.)


INFERENCE:

The K-means clustering algorithm has effectively segmented the used car
dataset into three distinct clusters based on the year, price, and mileage features.
The Elbow Method graph indicates that the optimal number of clusters is three, the point where the within-cluster sum of squares (WCSS) starts to level off. The scatter
plot visualizes these three clusters, with the yellow cluster representing newer
and more expensive cars, the turquoise cluster representing mid-range cars in
terms of age and price, and the purple cluster consisting of older and less
expensive cars with higher mileage. The cluster centers provided in the output
quantify these differences, with Cluster 0 having the newest cars (around 2009-
2010 model years) with higher prices and lower mileage, Cluster 1 representing
slightly older cars (around 2007-2008) with moderate prices and higher
mileage, and Cluster 2 having the oldest cars (around 2002-2003) with the
lowest prices and highest mileage. This clustering effectively captures the
inherent patterns in the used car market, where newer cars with lower mileage
command higher prices, while older cars with higher mileage are generally less
expensive, aligning with intuitive expectations.


Outcome

Each parameter is scored 1-4 (Exemplary = 4, Proficient = 3, Apprentice = 2, Novice = 1):
1. Identifying clear goals for the experiment
2. Choosing the appropriate experimental test bed (Hardware, Software, Emulation, Simulation, or hybrid) to achieve the identified objectives of the experiment
3. Designing and conducting the experiment
4. Ability to analyze and interpret the data

RESULT:
The K-means clustering algorithm successfully grouped the used car dataset from https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv into three distinct clusters based on the year, price, and mileage features. Cluster 0 represented newer and more expensive cars, Cluster 1 captured mid-range cars, and Cluster 2 consisted of older and less expensive cars with higher mileage, effectively separating the used cars based on their age, price, and mileage characteristics.
