Cars Project PDF

MACHINE LEARNING (PROJECT)
Part 1: Machine Learning Models

You work for an office transport company. You are in discussions with ABC Consulting Company about
providing transport for their employees. For this purpose, you are tasked with understanding how the
employees of ABC Consulting currently prefer to commute between home and office. Based on
parameters such as age, salary and work experience given in the data set ‘Transport.csv’, you are required to
predict the preferred mode of transport. The project requires you to build several Machine Learning
models and compare them so that a final model can be chosen.
1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3. Build the following models on the 70% training data and check the performance of these
models on the Training as well as the 30% Test data using the various inferences from the
Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.
4. Which model performs the best?
5. What are your business insights?
Importing Data file

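   Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license          Transport
0   28    Male         0    0         4    14.3       3.2        0   Public Transport
1   23  Female         1    0         4     8.3       3.3        0   Public Transport
2   29    Male         1    0         7    13.4       4.1        0   Public Transport
3   28  Female         1    1         5    13.4       4.5        0   Public Transport
4   27    Male         1    0         4    13.4       4.6        0   Public Transport
5   26    Male         1    0         4    12.3       4.8        1   Public Transport
6   28    Male         1    0         5    14.4       5.1        0  Private Transport
7   26  Female         1    0         3    10.5       5.1        0   Public Transport
8   22    Male         1    0         1     7.5       5.1        0   Public Transport
9   27    Male         1    0         4    13.5       5.2        0   Public Transport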
Performing Exploratory Data Analysis
              Age    Engineer         MBA    Work Exp      Salary    Distance     license
count  444.000000  444.000000  444.000000  444.000000  444.000000  444.000000  444.000000
mean    27.747748    0.754505    0.252252    6.299550   16.238739   11.323198    0.234234
std      4.416710    0.430866    0.434795    5.112098   10.453851    3.606149    0.423997
min     18.000000    0.000000    0.000000    0.000000    6.500000    3.200000    0.000000
25%     25.000000    1.000000    0.000000    3.000000    9.800000    8.800000    0.000000
50%     27.000000    1.000000    0.000000    5.000000   13.600000   11.000000    0.000000
75%     30.000000    1.000000    1.000000    8.000000   15.725000   13.425000    0.000000
max     43.000000    1.000000    1.000000   24.000000   57.000000   23.400000    1.000000
As per the information above, the data has two object-type columns, Gender and Transport, which do not appear in
the numeric summary. Later on we will convert these two to dummy variables to build and optimise the models.
It’s important to check for null and duplicate values in the data; if present, they would call for appropriate action.
dups = df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
Number of duplicate rows = 0
As the analysis shows, the data has neither missing values nor duplicate rows, hence
it doesn’t require any corrective action.
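The duplicate check is shown above, but the missing-value check is not. A minimal sketch of both checks follows, assuming a pandas DataFrame; the small `df` here is a hypothetical stand-in built from the first rows shown earlier, not the full 'Transport.csv'.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from 'Transport.csv';
# the values are the first three rows shown earlier in the report.
df = pd.DataFrame({
    "Age": [28, 23, 29],
    "Gender": ["Male", "Female", "Male"],
    "Salary": [14.3, 8.3, 13.4],
    "Transport": ["Public Transport"] * 3,
})

missing_per_column = df.isnull().sum()        # NaN count for each column
total_missing = int(missing_per_column.sum())  # NaN count for the whole frame
duplicate_rows = int(df.duplicated().sum())    # rows identical to an earlier row

print(missing_per_column)
print("Total missing values =", total_missing)
print("Number of duplicate rows =", duplicate_rows)
```

If either count were non-zero, imputation or `df.drop_duplicates()` would be the usual next step.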
Univariate / Bivariate analysis
The distribution graphs here show that each field is either normally or close to normally distributed.
The correlation matrix here shows a strong correlation between Age and Work Exp, which is expected in
normal circumstances, i.e. the higher the age, the higher the work experience. There is also a correlation between
Age and Salary, i.e. the higher the age, the higher the salary, which is consistent with the correlation between Age and
Work Exp. The correlation between being an Engineer or MBA with other fields like Salary, Work Exp,
Distance or license.
At this step we will take two Actions
1: Treat Outliers
2: Introduce dummy variables for Transport and Gender, with Transport being our dependent variable. In this
scenario, as the data is close to normally distributed, we treat scaling of the data as not required and skip that
process.
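The report does not show how the outliers were treated. One common option is to cap values at the IQR fences; the sketch below assumes that approach and uses a hypothetical helper, `cap_outliers_iqr`. The five-value sample is built from the min, quartiles and max of Salary in the summary table above, so it is illustrative only.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative sample: min, 25%, 50%, 75% and max of Salary from the summary table.
salary = pd.Series([6.5, 9.8, 13.6, 15.725, 57.0])
capped = cap_outliers_iqr(salary)

# Only the maximum (57.0) lies above the upper fence, so only it is capped.
print(capped.tolist())
```

Capping keeps all 444 rows, which matters for a data set this small; dropping the outlying rows is the alternative.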
# Encode the target: 1 = Public Transport, 0 = Private Transport
trans=[]
for y in df["Transport"]:
    if y=="Public Transport":
        trans.append(1)
    else:
        trans.append(0)
df["Transport"]=trans
df.head(15)
    Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license  Transport
0    28       1         0    0         4    14.3       3.2        0          1
1    23       0         1    0         4     8.3       3.3        0          1
2    29       1         1    0         7    13.4       4.1        0          1
3    28       0         1    1         5    13.4       4.5        0          1
4    27       1         1    0         4    13.4       4.6        0          1
5    26       1         1    0         4    12.3       4.8        1          1
6    28       1         1    0         5    14.4       5.1        0          0
7    26       0         1    0         3    10.5       5.1        0          1
8    22       1         1    0         1     7.5       5.1        0          1
9    27       1         1    0         4    13.5       5.2        0          1
10   25       0         1    0         4    11.5       5.2        0          1
11   27       1         1    0         4    13.5       5.3        1          1
12   24       1         1    0         2     8.5       5.4        0          1
13   27       1         1    0         4    13.4       5.5        1          1
14   32       1         1    0         9    15.5       5.5        0          1
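In the output above Gender is already numeric, but the code that encoded it is not shown. Comparing with the raw rows suggests Male became 1 and Female became 0; the sketch below assumes that mapping and also writes the Transport loop as a vectorised comparison.

```python
import pandas as pd

# First three rows of the raw data shown earlier in the report.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Transport": ["Public Transport", "Public Transport", "Public Transport"],
})

# Mapping inferred from the encoded output (Male -> 1, Female -> 0); an assumption.
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})

# Same target encoding as the loop: 1 for Public Transport, 0 otherwise.
df["Transport"] = (df["Transport"] == "Public Transport").astype(int)

print(df)
```

`map` leaves NaN for any label missing from the dictionary, which makes an unexpected category easy to spot.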
3. Build the following models on the 70% training data and check the performance of these models on the
Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an
AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance.
Now that we have treated the outliers and inserted dummy variables, we split the data in a 70:30 ratio to begin
our train and test exercise.
As we can observe the difference between Test & Train, we would go ahead with different models.
from sklearn.model_selection import train_test_split

X=df.drop(columns=['Transport'])
Y=df['Transport']
x_train, x_test,y_train,y_test = train_test_split(X,Y,test_size=0.3,random_state=0)
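from sklearn.linear_model import LogisticRegression

model_ac=[]  # test accuracy (%) of each model, in the order the models are fitted
# Assumed: the original does not show how `model` was created for this step.
model = LogisticRegression()
LR=model.fit(x_train,y_train)
LR_prob=model.predict_proba(x_test)
LR_ac=model.score(x_test,y_test)*100
model_ac.append(LR_ac)
print("Accuracy LR: ",LR_ac)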
Accuracy LR: 79.8507462686567
LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
LDA = model.fit(x_train, y_train)  # fit() returns the fitted model, as in the other sections
LDA_prob=model.predict_proba(x_test)
LDA_ac=model.score(x_test,y_test)*100
model_ac.append(LDA_ac)
print("Accuracy LDA: ",LDA_ac)
Accuracy LDA: 82.08955223880598
DTC
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
DTC = model.fit(x_train, y_train)
DTC_prob=model.predict_proba(x_test)
DTC_ac=model.score(x_test,y_test)*100
model_ac.append(DTC_ac)
print("Accuracy DTC: ",DTC_ac)
Accuracy DTC: 79.8507462686567
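The task asks for the models to be tuned where required, while the decision tree above uses default settings. A minimal sketch of a grid search follows; it runs on synthetic data from `make_classification` as a stand-in for the Transport data, and the parameter grid is hypothetical rather than taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Transport data (444 rows, 8 predictors).
X, y = make_classification(n_samples=444, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Hypothetical grid; the ranges are illustrative only.
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(x_train, y_train)  # 5-fold cross-validation on the training data only

print("Best parameters:", search.best_params_)
print("Test accuracy:  ", search.score(x_test, y_test) * 100)
```

Limiting `max_depth` or raising `min_samples_leaf` is a usual way to narrow the gap between training and test accuracy for a tree.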
GNB
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
GNB=model.fit(x_train, y_train)
GNB_prob=model.predict_proba(x_test)
GNB_ac=model.score(x_test,y_test)*100
model_ac.append(GNB_ac)
print("Accuracy GNB: ",GNB_ac)
Accuracy GNB: 85.07462686567165
KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
KNN=model.fit(x_train, y_train)
KNN_prob=model.predict_proba(x_test)
KNN_ac=model.score(x_test,y_test)*100
model_ac.append(KNN_ac)
print("Accuracy KNN: ",KNN_ac)
Accuracy KNN: 82.08955223880598
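The report skips scaling, but KNN classifies by distance between feature vectors, so a feature with a wide range (Salary runs from 6.5 to 57, while license is 0 or 1) can dominate the distance. One way to check whether scaling matters is to fit KNN with and without a `StandardScaler` and compare test accuracy. The sketch below does this on synthetic data, with one feature deliberately inflated; it is an illustration, not a result for the Transport data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Transport data.
X, y = make_classification(n_samples=444, n_features=8, random_state=0)
X[:, 0] *= 100  # inflate one feature so the features are on very different scales
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# KNN on the raw features.
knn_raw = KNeighborsClassifier().fit(x_train, y_train)
# KNN with standardisation; the scaler is fitted on the training data only.
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(
    x_train, y_train
)

raw_ac = knn_raw.score(x_test, y_test) * 100
scaled_ac = knn_scaled.score(x_test, y_test) * 100
print("Accuracy KNN (raw):   ", raw_ac)
print("Accuracy KNN (scaled):", scaled_ac)
```

Tree-based models such as DTC, RFC and GBC split on one feature at a time, so they are not affected by this kind of rescaling.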
RFC
from sklearn.ensemble import RandomForestClassifier
model= RandomForestClassifier()
RFC = model.fit(x_train, y_train)
RFC_prob=model.predict_proba(x_test)
RFC_ac=model.score(x_test,y_test)*100
model_ac.append(RFC_ac)
print("Accuracy RFC: ",RFC_ac)
Accuracy RFC: 85.82089552238806
GBC
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
GBC = model.fit(x_train,y_train)
GBC_prob=model.predict_proba(x_test)
GBC_ac=model.score(x_test,y_test)*100
model_ac.append(GBC_ac)
print("Accuracy GBC: ",GBC_ac)
Accuracy GBC: 79.8507462686567
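The sections above report test accuracy only, whereas the task also asks for confusion-matrix inferences on both the training and the test data. A minimal sketch for one model follows, again on synthetic stand-in data; the `results` dictionary is a hypothetical structure used here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Transport data.
X, y = make_classification(n_samples=444, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(x_train, y_train)

results = {}
for name, features, target in [("train", x_train, y_train), ("test", x_test, y_test)]:
    cm = confusion_matrix(target, model.predict(features))
    tn, fp, fn, tp = cm.ravel()  # binary case: rows are actual, columns are predicted
    results[name] = {
        "matrix": cm,
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "auc": roc_auc_score(target, model.predict_proba(features)[:, 1]),
    }
    print(name, results[name])
```

A large gap between the train and test rows of this output is the usual sign of overfitting, which is what tuning is meant to reduce.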
# print(LR_prob)
display(y_test)
327 1
233 0
122 1
102 1
71 1
..
213 1
416 1
423 0
325 0
403 1
Name: Transport, Length: 134, dtype: int64
from sklearn.metrics import roc_curve

from sklearn.metrics import roc_auc_score
LR_fpr, LR_tpr, LR_thresh = roc_curve(y_test, LR_prob[:,1], pos_label=1)

LDA_fpr, LDA_tpr, LDA_thresh = roc_curve(y_test, LDA_prob[:,1], pos_label=1)
DTC_fpr, DTC_tpr, DTC_thresh = roc_curve(y_test, DTC_prob[:,1], pos_label=1)
GNB_fpr, GNB_tpr, GNB_thresh = roc_curve(y_test, GNB_prob[:,1],pos_label=1)
KNN_fpr, KNN_tpr, KNN_thresh = roc_curve(y_test, KNN_prob[:,1],pos_label=1)
RFC_fpr, RFC_tpr, RFC_thresh = roc_curve(y_test, RFC_prob[:,1],pos_label=1)
GBC_fpr, GBC_tpr, GBC_thresh = roc_curve(y_test,GBC_prob[:,1],pos_label=1)
random_prob = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_prob, pos_label=1)
print(LR_fpr)
LR_auc_score = roc_auc_score(y_test, LR_prob[:,1])
LDA_auc_score = roc_auc_score(y_test, LDA_prob[:,1])
DTC_auc_score = roc_auc_score(y_test, DTC_prob[:,1])
GNB_auc_score = roc_auc_score(y_test, GNB_prob[:,1])
KNN_auc_score = roc_auc_score(y_test, KNN_prob[:,1])
RFC_auc_score = roc_auc_score(y_test, RFC_prob[:,1])
GBC_auc_score = roc_auc_score(y_test, GBC_prob[:,1])
print(LR_auc_score,LDA_auc_score,DTC_auc_score,GNB_auc_score,KNN_auc_score,GBC_auc_score)
[0. 0. 0. 0.02439024 0.02439024 0.04878049
0.04878049 0.07317073 0.07317073 0.09756098 0.09756098 0.09756098
0.09756098 0.14634146 0.14634146 0.14634146 0.14634146 0.19512195
0.19512195 0.2195122 0.2195122 0.24390244 0.24390244 0.26829268
0.26829268 0.29268293 0.29268293 0.29268293 0.29268293 0.31707317
0.31707317 0.3902439 0.3902439 0.41463415 0.41463415 0.46341463
0.46341463 0.48780488 0.48780488 0.6097561 0.6097561 0.63414634
0.63414634 0.68292683 0.68292683 0.85365854 0.90243902 0.92682927
0.97560976 1. ]
0.8173354314188302 0.8318908995541568 0.7866509310254393 0.8465774980330448 0.7932074482035143 0.8196957776029373
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
# plot roc curves
plt.plot(LR_fpr, LR_tpr, linestyle='--',color='orange', label='Logistic Regression')
plt.plot(LDA_fpr, LDA_tpr, linestyle='--',color='green', label='LinearDiscriminantAnalysis')
plt.plot(DTC_fpr, DTC_tpr, linestyle='--',color='purple', label='DecisionTreeClassifier')
plt.plot(GNB_fpr, GNB_tpr, linestyle='--',color='red', label='Naive Bayes')
plt.plot(KNN_fpr, KNN_tpr, linestyle='--',color='blue', label='KNeighborsClassifier')
plt.plot(RFC_fpr, RFC_tpr, linestyle='--',color='pink', label='RandomForestClassifier')
plt.plot(GBC_fpr, GBC_tpr, linestyle='--',color='black', label='GradientBoostingClassifier')
# no-skill baseline (diagonal)
plt.plot(p_fpr, p_tpr, linestyle='--', color='grey')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')
plt.legend(loc='best')
plt.show();
4. Which model performs the best?
label=["LR","LDA","DTC","GNB","KNN","RFC","GBC"]
plt.pie(model_ac,labels=label,explode=[0,0,0,0.1,0,0.1,0],autopct='%1.1f%%')
[Pie chart of model_ac by model: LR 13.9%, LDA 14.3%, DTC 13.9%, GNB 14.8%, KNN 14.3%, RFC 14.9%, GBC 13.9%; the GNB and RFC wedges are exploded.]
plt.bar(label,model_ac)
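Since the pie and bar charts only compare test accuracy, the same comparison can be made by ranking the values directly. The sketch below uses the accuracies printed in the model sections above, rounded to two decimals; `reported_ac` is a hypothetical name.

```python
# Test accuracies (%) as printed in the model sections above, rounded to two decimals.
reported_ac = {
    "LR": 79.85, "LDA": 82.09, "DTC": 79.85, "GNB": 85.07,
    "KNN": 82.09, "RFC": 85.82, "GBC": 79.85,
}

# Sort from highest to lowest accuracy.
ranked = sorted(reported_ac.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{name:<4}{accuracy:6.2f}")

best_model, best_accuracy = ranked[0]
print("Highest test accuracy:", best_model, best_accuracy)
```

On test accuracy alone the Random Forest ranks first and Gaussian Naive Bayes second. Of the six AUC values printed earlier, GNB has the highest; the RFC value was not printed.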
                 Age    Gender  Engineer       MBA  Work Exp     Salary   Distance   license
Transport
0          29.694444  0.645833  0.777778  0.201389  8.972222  22.618056  13.423611  0.493056
1          26.813333  0.743333  0.743333  0.276667  5.016667  13.176667  10.315000  0.110000
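The table above has the shape of a group-wise mean per Transport class (0 = Private Transport, 1 = Public Transport), although the code that produced it is not shown. A minimal sketch, assuming `groupby(...).mean()` was used, follows; it runs on the first eight encoded rows printed earlier, so its numbers differ from the full-data table.

```python
import pandas as pd

# First eight encoded rows from the df.head(15) output (subset of columns).
df = pd.DataFrame({
    "Age":       [28, 23, 29, 28, 27, 26, 28, 26],
    "Salary":    [14.3, 8.3, 13.4, 13.4, 13.4, 12.3, 14.4, 10.5],
    "Distance":  [3.2, 3.3, 4.1, 4.5, 4.6, 4.8, 5.1, 5.1],
    "Transport": [1, 1, 1, 1, 1, 1, 0, 1],
})

# Mean of every predictor within each Transport class.
group_means = df.groupby("Transport").mean()
print(group_means)
```

In the full-data table, the private transport group (0) has higher mean Age, Work Exp, Salary and Distance, and a much higher share of licence holders (0.49 against 0.11).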
5. What are your business insights?
Having compared all the models mentioned above, I found that they all perform decently well, with test accuracy
scores between roughly 80% and 86%, but I choose a different model for both classes in the training and test sets.
All the models have an accuracy score that is higher on the training set than on the test set.
Business Insights: As we have seen from the data, a lot of respondents tend.