
21bce5695-knn-lab7

March 13, 2024

21BCE5695 M. Ashwin

1 K Nearest Neighbours
1.1 Importing required libraries
[ ]: from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import classification_report, mean_squared_error
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

1.2 Importing Dataset


[ ]: df = pd.read_csv('apple_quality.csv')

[ ]: print(df.head(2))

   A_id      Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness  \
0     0 -3.970049 -2.512336   5.346330    -1.012009   1.844900   0.32984
1     1 -1.195217 -2.839257   3.664059     1.588232   0.853286   0.86753

    Acidity Quality
0 -0.491590    good
1 -0.722809    good
Dropping the ID column since it is not relevant to the machine learning model
[ ]: df.drop(['A_id'], axis=1, inplace=True)

Splitting into input and output data

[ ]: x = df.drop(['Quality'], axis=1)
y = df['Quality']

1.3 Data Analysis


[ ]: plt.figure(figsize=(25,10))
for (i,v) in enumerate(x.columns):
    plt.subplot(3,df.shape[1],i+1)
    plt.hist(df.iloc[:,i],bins="sqrt")
    plt.title(df.columns[i],fontsize=9)

Encoding the categorical output values into binary values


[ ]: label = []
for i in tqdm(df['Quality']):
    if i=='bad':
        label.append(0)
    else:
        label.append(1)

df['Quality'] = label

100%|██████████| 4000/4000 [00:00<00:00, 945994.70it/s]
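
The same encoding can be expressed as a single vectorized lookup instead of a Python loop. The following is a minimal sketch on a small hypothetical frame (`toy` is not part of the lab data), assuming the column holds only the strings 'good' and 'bad':

```python
import pandas as pd

# Hypothetical stand-in for the lab's DataFrame; only 'Quality' matters here.
toy = pd.DataFrame({"Quality": ["good", "bad", "good", "bad"]})

# Map each category to its binary code in one vectorized step.
# Unlike the loop, a value other than 'bad'/'good' becomes NaN instead of 1.
encoded = toy["Quality"].map({"bad": 0, "good": 1})

print(encoded.tolist())
```

Because unexpected values surface as NaN, this variant also makes typos in the label column easier to spot.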

[ ]: dfinfo = pd.DataFrame(df.dtypes,columns=["dtypes"])
for (m,n) in zip([df.count(),df.isna().sum()],["count","isna"]):
    dfinfo = dfinfo.merge(pd.DataFrame(m,columns=[n]),
                          right_index=True,left_index=True,how="inner")

dfinfo.T.append(df.describe())

<ipython-input-65-4673ff7821a0>:4: FutureWarning: The frame.append method is
deprecated and will be removed from pandas in a future version. Use
pandas.concat instead.
  dfinfo.T.append(df.describe())
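
The warning recommends `pandas.concat`, and `DataFrame.append` no longer exists in pandas 2.x. A minimal sketch of an equivalent summary follows, built on a small hypothetical frame (`toy`, `info` and `summary` are illustrative names, not from the lab):

```python
import pandas as pd

# Hypothetical small frame standing in for the lab's df.
toy = pd.DataFrame({"Size": [0.5, -1.2, 2.0], "Quality": [1, 0, 1]})

# One row each for dtype, non-null count and missing-value count.
info = pd.DataFrame({
    "dtypes": toy.dtypes,
    "count": toy.count(),
    "isna": toy.isna().sum(),
}).T

# pd.concat stacks the two frames row-wise, as DataFrame.append used to.
summary = pd.concat([info, toy.describe()])
print(summary)
```

Building `info` from a dictionary of Series also avoids the repeated `merge` calls, since all three Series share the column names as their index.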

[ ]:          Size      Weight   Sweetness Crunchiness   Juiciness    Ripeness  \
dtypes     float64     float64     float64     float64     float64     float64
count         4000        4000        4000        4000        4000        4000
isna             0           0           0           0           0           0
count       4000.0      4000.0      4000.0      4000.0      4000.0      4000.0
mean     -0.503015   -0.989547   -0.470479    0.985478    0.512118    0.498277
std       1.928059    1.602507    1.943441    1.402757    1.930286    1.874427
min      -7.151703   -7.149848   -6.894485   -6.055058   -5.961897   -5.864599
25%      -1.816765    -2.01177   -1.738425    0.062764   -0.801286   -0.771677
50%      -0.513703   -0.984736   -0.504758    0.998249    0.534219    0.503445
75%       0.805526    0.030976    0.801922    1.894234    1.835976    1.766212
max       6.406367    5.790714    6.374916    7.619852    7.364403    7.237837

           Acidity     Quality
dtypes     float64       int64
count         4000        4000
isna             0           0
count       4000.0      4000.0
mean      0.076877       0.501
std        2.11027    0.500062
min      -7.010538         0.0
25%      -1.377424         0.0
50%       0.022609         1.0
75%       1.510493         1.0
max       7.404736         1.0

Correlation matrix
[ ]: df.corr().round(2).style.background_gradient(cmap="viridis")

[ ]: <pandas.io.formats.style.Styler at 0x78992d29c3d0>

[ ]: print(df.head(3))

       Size    Weight  Sweetness  Crunchiness  Juiciness  Ripeness   Acidity  \
0 -3.970049 -2.512336   5.346330    -1.012009   1.844900  0.329840 -0.491590
1 -1.195217 -2.839257   3.664059     1.588232   0.853286  0.867530 -0.722809
2 -0.292024 -1.351282  -1.738429    -0.342616   2.838636 -0.038033  2.621636

   Quality
0        1
1        1
2        0

1.4 Model building and testing


Splitting data into training and testing
[ ]: x_train,x_test,y_train,y_test = train_test_split(
    x,y,test_size=0.3,stratify=y,random_state=30)
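
Passing `stratify=y` asks `train_test_split` to keep the class proportions the same in both parts. A minimal sketch on hypothetical balanced labels (the arrays `features` and `labels` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels: 50 'bad' and 50 'good'.
labels = np.array(["bad"] * 50 + ["good"] * 50)
features = np.arange(100).reshape(-1, 1)

_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.3, stratify=labels, random_state=30
)

# With stratify, both parts keep the 50/50 class ratio.
train_counts = np.unique(lab_train, return_counts=True)[1]
test_counts = np.unique(lab_test, return_counts=True)[1]
print(train_counts, test_counts)
```

This is consistent with the near-equal `support` values per class that appear in the classification reports of this notebook.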

[ ]: model = KNeighborsClassifier(algorithm="auto")
parameters = {"n_neighbors":[1,3,5],
              "weights":["uniform","distance"]}
model_optim = GridSearchCV(model, parameters, cv=5,scoring="accuracy")

Training the model


[ ]: model_optim.fit(x_train,y_train)

[ ]: GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
param_grid={'n_neighbors': [1, 3, 5],
'weights': ['uniform', 'distance']},
scoring='accuracy')

[ ]: model_optim.best_estimator_

[ ]: KNeighborsClassifier(weights='distance')
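
The printed estimator lists only parameters that differ from their defaults, so `n_neighbors` here is the default value of 5. The selected grid point and its cross-validated score can also be read directly from the fitted search. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the apple data (`X`, `y_syn` and `search` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lab data (assumption: 7 numeric features, 2 classes).
X, y_syn = make_classification(n_samples=300, n_features=7, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y_syn)

# best_params_ holds the winning grid point; best_score_ is its mean CV accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

The cross-validated `best_score_` is usually a fairer summary than the training accuracy, because each fold is scored on data the model did not see during fitting.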

Model metrics
[ ]: # Distinct loop names keep the full x and y from being overwritten.
for (name,x_part,y_part) in zip(["Train","Test"],[x_train,x_test],[y_train,y_test]):
    print("Classification kNN",name," report:\n",
          classification_report(y_part,model_optim.predict(x_part)))

Classification kNN Train report:
               precision    recall  f1-score   support

         bad       1.00      1.00      1.00      1397
        good       1.00      1.00      1.00      1403

    accuracy                           1.00      2800
   macro avg       1.00      1.00      1.00      2800
weighted avg       1.00      1.00      1.00      2800

Classification kNN Test report:
               precision    recall  f1-score   support

         bad       0.91      0.90      0.91       599
        good       0.90      0.91      0.91       601

    accuracy                           0.91      1200
   macro avg       0.91      0.91      0.91      1200
weighted avg       0.91      0.91      0.91      1200
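
A perfect training report is expected with `weights='distance'` and says little about generalization: each training point is its own nearest neighbour at distance zero, so it decides its own prediction (assuming no duplicated feature rows with conflicting labels). A minimal sketch on synthetic data, used here as an assumption in place of the apple data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lab data (assumption: 7 numeric features, 2 classes).
X, y_syn = make_classification(n_samples=400, n_features=7, random_state=0)
X_train, X_test, y_train_syn, y_test_syn = train_test_split(
    X, y_syn, test_size=0.3, stratify=y_syn, random_state=30
)

knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train_syn)

# A training point sits at distance 0 from itself, so it dominates the
# distance-weighted vote and the training accuracy comes out perfect.
train_acc = knn.score(X_train, y_train_syn)
test_acc = knn.score(X_test, y_test_syn)
print(train_acc, test_acc)
```

For that reason the test report is the figure to judge the model by.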
