
Ex. No: 07
DATE:
Data and Text Clustering using the K-means Algorithm

Objective:

The objective of clustering electricity consumers using the K-means algorithm is to identify homogeneous groups based on their electricity consumption patterns. This facilitates targeted strategies for energy management, such as implementing tailored conservation initiatives, optimizing resource allocation, and designing personalized pricing plans, ultimately enhancing overall efficiency and customer satisfaction in the energy sector.

Hardware and Software:

Hardware Specification:
Device name: MYPC-8757
Processor: Intel(R) Core(TM) i5-10210U CPU @ 1.60 GHz (2.11 GHz)
Installed RAM: 8.00 GB (7.84 GB usable)

Software Specification:
Python 3.12 with pandas, numpy, matplotlib, and scikit-learn

Algorithm:
The program implements the K-means clustering algorithm to group used cars based on their year, price, and mileage. K-means is an unsupervised machine learning algorithm that partitions similar data points into clusters.

1. Import the required libraries: pandas, numpy, matplotlib.pyplot, KMeans from sklearn.cluster, and StandardScaler from sklearn.preprocessing.
2. Load the dataset from the URL: "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv".
3. Preprocess the data:


   a. Select the relevant numeric features for clustering: 'year', 'price', and 'mileage'.
   b. Standardize the features using StandardScaler so that each feature contributes equally to the distance computation (see the sketch after this list).
4. Determine the optimal number of clusters using the Elbow Method:
   a. Initialize an empty list 'wcss' to store the Within-Cluster Sum of Squares (WCSS) values.
   b. Iterate over a range of cluster counts (1 to 10 in this case).
   c. For each number of clusters:
      i. Create a KMeans object with the specified number of clusters and other parameters (init='k-means++', max_iter=300, n_init=10, random_state=0).
      ii. Fit the KMeans object to the scaled data.
      iii. Append the WCSS value (the inertia_ attribute) to the 'wcss' list.
   d. Plot the Elbow Method graph with the number of clusters on the x-axis and WCSS on the y-axis.
5. Based on the Elbow Method graph, choose the optimal number of clusters (k=3 in this case).
6. Perform K-means clustering with the chosen number of clusters:
   a. Create a KMeans object with the specified number of clusters and the same parameters.
   b. Fit the KMeans object to the scaled data and obtain the cluster labels.
7. Add the cluster labels to the original dataset as a new column 'Cluster'.
8. Visualize the clusters:
   a. Create a scatter plot of the scaled 'year' and 'price' features, colored by cluster label.
   b. Plot the cluster centroids (the cluster_centers_ attribute) as red points.
   c. Add labels and a title to the plot.
9. Print the cluster centers by inverse-transforming the cluster_centers_ attribute with the StandardScaler.
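For concreteness, here is a minimal sketch of the standardization in step 3(b), using a few illustrative rows (the exact values are hypothetical, not the full dataset). StandardScaler rescales each feature to zero mean and unit variance, z = (x - mean) / std, so that 'price' and 'mileage', whose raw magnitudes are much larger, do not dominate 'year' in the Euclidean distances K-means uses:

import numpy as np

# Illustrative rows only: year, price, mileage
X = np.array([[2011.0, 21992.0, 7413.0],
              [2007.0, 10451.0, 65617.0],
              [2002.0, 6268.0, 102230.0]])
mean = X.mean(axis=0)
std = X.std(axis=0)          # population std (ddof=0), matching StandardScaler
X_scaled = (X - mean) / std  # equivalent to StandardScaler().fit_transform(X)
print(X_scaled)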

The link to the dataset used in the program is:
https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv

The K-means algorithm is an iterative algorithm that assigns data points to clusters based on their proximity to the cluster centroids. The algorithm aims to minimize the within-cluster sum of squares (WCSS), which is the sum of squared distances between each data point and its assigned cluster centroid. The Elbow Method is a heuristic technique used to determine the optimal number of clusters by plotting the WCSS values against the number of clusters and choosing the "elbow" point where the WCSS starts to level off.
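To make the WCSS definition concrete, the sketch below recomputes it by hand from a fitted model and compares it with scikit-learn's inertia_ attribute, which stores exactly this quantity (the synthetic data and variable names here are illustrative, not part of the experiment):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 3))  # stand-in for the scaled feature matrix

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(points)

# WCSS: sum of squared distances from each point to its assigned centroid
wcss = sum(np.sum((points[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
           for j in range(3))
print(wcss, km.inertia_)  # the two values agree up to floating-point rounding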


PROGRAM:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the dataset
url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset
print(data.head())

# Select relevant numeric features for clustering
X = data[['year', 'price', 'mileage']]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine the optimal number of clusters using the Elbow Method
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)

# Plot the Elbow Method graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')  # Within-Cluster Sum of Squares
plt.show()

# Based on the Elbow Method, choose the optimal number of clusters
# and perform clustering
k = 3  # Change this to the chosen number of clusters
kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_scaled)

# Add the cluster labels to the dataset
data['Cluster'] = clusters

# Visualize the clusters
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=clusters, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='red', label='Centroids')
plt.title('Clusters of Used Cars')
plt.xlabel('Year (Scaled)')
plt.ylabel('Price (Scaled)')
plt.legend()
plt.show()

# Print the cluster centers
print("Cluster Centers:")
print(scaler.inverse_transform(kmeans.cluster_centers_))
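As a small optional follow-up (not part of the original listing), the 'Cluster' column added in step 7 also lets pandas summarize each cluster directly in original units, which complements the inverse-transformed centers printed above:

# Per-cluster feature means computed from the labelled dataset
print(data.groupby('Cluster')[['year', 'price', 'mileage']].mean())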


OUTPUT

   year model  price  mileage   color transmission
0  2011   SEL  21992     7413  Yellow         AUTO
1  2011   SEL  20995    10926    Gray         AUTO
2  2011   SEL  19995     7351  Silver         AUTO
3  2011   SEL  17809    11613    Gray         AUTO
4  2012    SE  17500     8367   White         AUTO

Cluster Centers:
[[  2009.83673469  14720.68367347  29192.26530612]
 [  2007.64285714  10451.9047619   65617.80952381]
 [  2002.4          6268.3        102230.7       ]]


GRAPH:
(Figures: Elbow Method plot of WCSS versus number of clusters, and scatter plot of the clusters over scaled year and price with red centroids, as produced by the program above.)


INFERENCE:

The K-means clustering algorithm has effectively segmented the used car
dataset into three distinct clusters based on the year, price, and mileage features.
The Elbow Method graph indicates that the optimal number of clusters is three, the point where the within-cluster sum of squares (WCSS) starts to level off. The scatter
plot visualizes these three clusters, with the yellow cluster representing newer
and more expensive cars, the turquoise cluster representing mid-range cars in
terms of age and price, and the purple cluster consisting of older and less
expensive cars with higher mileage. The cluster centers provided in the output
quantify these differences, with Cluster 0 having the newest cars (around 2009-
2010 model years) with higher prices and lower mileage, Cluster 1 representing
slightly older cars (around 2007-2008) with moderate prices and higher
mileage, and Cluster 2 having the oldest cars (around 2002-2003) with the
lowest prices and highest mileage. This clustering effectively captures the
inherent patterns in the used car market, where newer cars with lower mileage
command higher prices, while older cars with higher mileage are generally less
expensive, aligning with intuitive expectations.


Outcome

Each parameter is scored 1-4 (Exemplary = 4, Proficient = 3, Apprentice = 2, Novice = 1):
1. Identifying clear goals for the experiment
2. Choosing the appropriate experimental test bed (Hardware, Software, Emulation, Simulation, or hybrid) to achieve the identified objectives of the experiment
3. Designing and conducting the experiment
4. Ability to analyze and interpret the data

RESULT:
The K-means clustering algorithm successfully grouped the used car dataset from https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/usedcars.csv into three distinct clusters based on the year, price, and mileage features. Cluster 0 represented newer and more expensive cars, Cluster 1 captured mid-range cars, and Cluster 2 consisted of older and less expensive cars with higher mileage, effectively separating the used cars based on their age, price, and mileage characteristics.
