Email Classification: Roll No-41463 (LP-3)

Roll No- 41463 (LP-3)
Email Classification
Classify the email using binary classification method. Email Spam detection has two
states: a) Normal State Not Spam b) Abnormal State Spam. Use K-Nearest Neighbors and
Support Vector Machine for Classification. Analyze their performance.
Dataset used: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

(https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv)
In [1]: import numpy as np

import pandas as pd

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.metrics import accuracy_score

In [2]: df = pd.read_csv("emails.csv")
df.head()
Out[2]:
Email
the to ect and for of a you hou ... connevey jay valued lay infrastructu
No.
Email
0 0 0 1 0 0 0 2 0 0 ... 0 0 0 0
1
Email
1 8 13 24 6 6 2 102 1 27 ... 0 0 0 0
2
Email
2 0 0 1 0 0 0 8 0 0 ... 0 0 0 0
3
Email
3 0 5 22 0 5 1 51 2 10 ... 0 0 0 0
4
Email
4 7 6 17 1 5 2 57 0 9 ... 0 0 0 0
5
5 rows × 3002 columns

In [3]: df.tail()
Out[3]:
Email
the to ect and for of a you hou ... connevey jay valued lay infrastru
No.
Email
5167 2 2 2 3 0 0 32 0 0 ... 0 0 0 0
5168
Email
5168 35 27 11 2 6 5 151 4 3 ... 0 0 0 0
5169
Email
5169 0 0 1 1 0 0 11 0 0 ... 0 0 0 0
5170
Email
5170 2 7 1 0 2 1 28 2 0 ... 0 0 0 0
5171
Email
5171 22 24 5 1 6 5 148 8 2 ... 0 0 0 0
5172
In [4]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5172 entries, 0 to 5171
Columns: 3002 entries, Email No. to Prediction
dtypes: int64(3001), object(1)
memory usage: 118.5+ MB
In [5]: df.describe()
Out[5]:
the to ect and for of
count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.00000
mean 6.640565 6.188128 5.143852 3.075599 3.124710 2.627030 55.51740
std 11.745009 9.534576 14.101142 6.045970 4.680522 6.229845 87.57417
min 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000
25% 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 12.00000
50% 3.000000 3.000000 1.000000 1.000000 2.000000 1.000000 28.00000
75% 8.000000 7.000000 4.000000 3.000000 4.000000 2.000000 62.25000
max 210.000000 132.000000 344.000000 89.000000 47.000000 77.000000 1898.00000

In [6]: df.isnull().sum()
Out[6]: Email No. 0
the 0
to 0
ect 0
and 0
for 0
of 0
a 0
you 0
hou 0
in 0
on 0
is 0
this 0
enron 0
i 0
be 0
that 0
will 0
have 0
with 0
your 0
at 0
we 0
s 0
are 0
it 0
by 0
com 0
as 0
..
decisions 0
produced 0
ended 0
greatest 0
degree 0
solmonson 0
imbalances 0
fall 0
fear 0
hate 0
fight 0
reallocated 0
debt 0
reform 0
australia 0
plain 0
prompt 0
remains 0
ifhsc 0
enhancements 0
connevey 0
jay 0
valued 0
lay 0
infrastructure 0
military 0
allowing 0
ff 0
dry 0
Prediction 0
Length: 3002, dtype: int64
Splitting Train and Test dataset
In [7]: x = df.iloc[:,1:3001]
y = df.iloc[:,-1].values
In [8]: x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2,
a) Using K-Nearest Neighbours

In [9]: knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
In [ ]:
Analyzing performance
In [10]: print("MSE: ", mean_squared_error(y_test, y_pred))

print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))
MSE: 0.12560386473429952
MAE: 0.12560386473429952
RMSE: 0.3544063553807966
R2 Score: 0.40780091899790494
Accuracy Score for KNN: 0.8743961352657005
b) Using Support Vector Machine(SVM)

In [11]: svc = SVC(C=1.0, gamma='auto', kernel='rbf')
svc.fit(x_test, y_test)
y_pred = svc.predict(x_test)
Analyzing Performance
In [12]: print("MSE: ", mean_squared_error(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2 Score: ", metrics.r2_score(y_test, y_pred))
print("Accuracy Score for KNN: ", accuracy_score(y_test, y_pred))
MSE: 0.07149758454106281
MAE: 0.07149758454106281
RMSE: 0.2673903224521464
R2 Score: 0.6629020615834228
Accuracy Score for KNN: 0.9285024154589372
In [ ]:

Email Classification: Roll No-41463 (LP-3)

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Email Classification: Roll No-41463 (LP-3)

Uploaded by

Copyright:

Available Formats

Roll No- 41463 (LP-3)

Dataset used: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

In [1]: import numpy as np

5 rows × 3002 columns

5 rows × 3002 columns

RangeIndex: 5172 entries, 0 to 5171

Columns: 3002 entries, Email No. to Prediction

dtypes: int64(3001), object(1)

memory usage: 118.5+ MB

count 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.000000 5172.00000

mean 6.640565 6.188128 5.143852 3.075599 3.124710 2.627030 55.51740

std 11.745009 9.534576 14.101142 6.045970 4.680522 6.229845 87.57417

min 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.00000

25% 0.000000 1.000000 1.000000 0.000000 1.000000 0.000000 12.00000

50% 3.000000 3.000000 1.000000 1.000000 2.000000 1.000000 28.00000

75% 8.000000 7.000000 4.000000 3.000000 4.000000 2.000000 62.25000

max 210.000000 132.000000 344.000000 89.000000 47.000000 77.000000 1898.00000

8 rows × 3001 columns

Out[6]: Email No. 0

Length: 3002, dtype: int64

Splitting Train and Test dataset

In [8]: x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2,

a) Using K-Nearest Neighbours

In [10]: print("MSE: ", mean_squared_error(y_test, y_pred))

Accuracy Score for KNN: 0.8743961352657005

b) Using Support Vector Machine(SVM)

Accuracy Score for KNN: 0.9285024154589372

You might also like