
MACHINE LEARNING (PROJECT)

Part 1: Machine Learning Models


You work for an office transport company and are in discussions with ABC Consulting Company about
providing transport for their employees. For this purpose, you are tasked with understanding how the
employees of ABC Consulting currently prefer to commute between home and office. Based on
parameters such as age, salary and work experience given in the data set ‘Transport.csv’, you are required to
predict the preferred mode of transport. The project requires you to build several Machine Learning
models and compare them so that a final model can be chosen.

1. Basic data summary, univariate and bivariate analysis, graphs, checking correlations, outlier and
missing-value treatment (if necessary), and the basic descriptive statistics of the dataset.

2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?

3. Build the following models on the 70% training data and check the performance of these
models on the Training as well as the 30% Test data using the various inferences from the
Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.

4. Which model performs the best?

5. What are your business insights?

Importing Data file


   Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license          Transport
0   28    Male         0    0         4    14.3       3.2        0   Public Transport
1   23  Female         1    0         4     8.3       3.3        0   Public Transport
2   29    Male         1    0         7    13.4       4.1        0   Public Transport
3   28  Female         1    1         5    13.4       4.5        0   Public Transport
4   27    Male         1    0         4    13.4       4.6        0   Public Transport
5   26    Male         1    0         4    12.3       4.8        1   Public Transport
6   28    Male         1    0         5    14.4       5.1        0  Private Transport
7   26  Female         1    0         3    10.5       5.1        0   Public Transport
8   22    Male         1    0         1     7.5       5.1        0   Public Transport
9   27    Male         1    0         4    13.5       5.2        0   Public Transport

This study source was downloaded by 100000830201698 from CourseHero.com on 05-11-2023 02:46:25 GMT -05:00

https://www.coursehero.com/file/106688812/Cars-Projectpdf/
Performing Exploratory Data Analysis

              Age    Engineer         MBA    Work Exp      Salary    Distance     license
count  444.000000  444.000000  444.000000  444.000000  444.000000  444.000000  444.000000
mean    27.747748    0.754505    0.252252    6.299550   16.238739   11.323198    0.234234
std      4.416710    0.430866    0.434795    5.112098   10.453851    3.606149    0.423997
min     18.000000    0.000000    0.000000    0.000000    6.500000    3.200000    0.000000
25%     25.000000    1.000000    0.000000    3.000000    9.800000    8.800000    0.000000
50%     27.000000    1.000000    0.000000    5.000000   13.600000   11.000000    0.000000
75%     30.000000    1.000000    1.000000    8.000000   15.725000   13.425000    0.000000
max     43.000000    1.000000    1.000000   24.000000   57.000000   23.400000    1.000000

The summary above covers only the numeric columns; the two remaining columns, Gender and Transport, hold
categorical (object-type) data. Later on we will convert these two into dummy variables to build and
optimise the models. It is also important to check for null and duplicate values in the data, which, if
present, would call for appropriate action.

dups = df.duplicated()

print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

As the analysis shows, the data has neither missing values nor duplicate rows, so it does not require
any corrective action.
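
The duplicate check is shown above, but the missing-value check itself is not. A minimal sketch of how it might be done with pandas, using a small hand-made frame in place of `Transport.csv` (the sample values, including the deliberately injected gap, are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the real data; the NaN is injected on purpose
sample = pd.DataFrame({
    "Age": [28, 23, 29, 28],
    "Salary": [14.3, 8.3, np.nan, 13.4],
    "Distance": [3.2, 3.3, 4.1, 4.5],
})

# Count missing values per column; all zeros means nothing to treat
null_counts = sample.isnull().sum()
print(null_counts)

# Total number of missing cells across the frame
total_missing = int(null_counts.sum())
print("Total missing values =", total_missing)
```

On the real data the same two lines would be run on `df`; a total of zero would support the statement that no treatment is needed.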

Univariate / Bivariate analysis

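The distribution graphs here suggest that the continuous fields are normally or close to normally distributed,
although the summary statistics above point to a right skew in Salary (mean 16.24, median 13.60, maximum 57.00).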

The correlation matrix here shows a strong correlation between Age and Work Exp, which is expected under
normal circumstances: the higher the age, the higher the work experience. There is also a correlation between
Age and Salary (the higher the age, the higher the salary), which is consistent with the correlation between
Age and Work Exp. The matrix also covers the correlation of Engineer and MBA with the other fields such as
Salary, Work Exp, Distance and license.
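
The code behind the correlation matrix is not shown. One way to compute it is sketched here on the ten rows printed in the data preview rather than on the full 444-row data set, so the coefficients will differ from those in the report:

```python
import pandas as pd

# First ten rows from the data preview (a subset of the numeric columns)
sample = pd.DataFrame({
    "Age": [28, 23, 29, 28, 27, 26, 28, 26, 22, 27],
    "Work Exp": [4, 4, 7, 5, 4, 4, 5, 3, 1, 4],
    "Salary": [14.3, 8.3, 13.4, 13.4, 13.4, 12.3, 14.4, 10.5, 7.5, 13.5],
    "Distance": [3.2, 3.3, 4.1, 4.5, 4.6, 4.8, 5.1, 5.1, 5.1, 5.2],
})

# Pearson correlation between every pair of columns
corr = sample.corr()
print(corr.round(2))

# Pull out a single pair, e.g. Age vs. Work Exp
age_workexp = corr.loc["Age", "Work Exp"]
print("Age vs Work Exp:", round(age_workexp, 2))
```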

At this step we will take two actions:

1: Treat outliers

2: Introduce dummy variables for Transport and Gender, with Transport being our dependent variable. Because
the data is close to normally distributed, we treat scaling as unnecessary in this scenario and skip that
step.
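
The write-up does not show how the outliers were treated. A common option is capping at the 1.5 × IQR whiskers; the sketch below assumes that approach, and `cap_outliers_iqr` is a hypothetical helper, not the author's code:

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR beyond the quartiles (hypothetical helper)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative salaries; 57.0 mirrors the maximum in the summary table
salary = pd.Series([8.3, 9.8, 12.3, 13.4, 13.6, 14.3, 15.7, 57.0])
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps every row, which matters with only 444 observations; dropping the outlying rows would be the alternative.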

# Encode the target: 1 = Public Transport, 0 = any other value (e.g. Private Transport)
trans = []
for y in df["Transport"]:
    if y == "Public Transport":
        trans.append(1)
    else:
        trans.append(0)
df["Transport"] = trans
df.head(15)

    Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license  Transport
 0   28       1         0    0         4    14.3       3.2        0          1
 1   23       0         1    0         4     8.3       3.3        0          1
 2   29       1         1    0         7    13.4       4.1        0          1
 3   28       0         1    1         5    13.4       4.5        0          1
 4   27       1         1    0         4    13.4       4.6        0          1
 5   26       1         1    0         4    12.3       4.8        1          1
 6   28       1         1    0         5    14.4       5.1        0          0
 7   26       0         1    0         3    10.5       5.1        0          1
 8   22       1         1    0         1     7.5       5.1        0          1
 9   27       1         1    0         4    13.5       5.2        0          1
10   25       0         1    0         4    11.5       5.2        0          1
11   27       1         1    0         4    13.5       5.3        1          1
12   24       1         1    0         2     8.5       5.4        0          1
13   27       1         1    0         4    13.4       5.5        1          1
14   32       1         1    0         9    15.5       5.5        0          1
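
In the table above Gender already appears as 0/1, but the encoding step is not shown. Comparing with the first data preview, Male rows became 1 and Female rows 0, so a mapping like the following would reproduce it. This is a sketch on a few sample rows; `pd.get_dummies(df, columns=['Gender'], drop_first=True)` would be an alternative.

```python
import pandas as pd

# A few rows from the original preview
sample = pd.DataFrame({
    "Age": [28, 23, 29, 28],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# Male -> 1, Female -> 0, matching the encoded table
sample["Gender"] = sample["Gender"].map({"Male": 1, "Female": 0})
print(sample)
```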

3. Build the following models on the 70% training data and check the performance of these models on the
Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an
AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance.

Now that we have treated the outliers and inserted the dummy variables, we split the data in a 70:30 ratio to
begin our train and test exercise.
As there is a difference between test and train performance, we go ahead and try different models.

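from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Features and target
X = df.drop(columns=['Transport'])
Y = df['Transport']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

model_ac = []  # test accuracy (%) of each model, in fitting order
LR
# The model definition is missing in the source; a default LogisticRegression() is assumed here
model = LogisticRegression()
LR = model.fit(x_train, y_train)
LR_prob = model.predict_proba(x_test)
LR_ac = model.score(x_test, y_test) * 100
model_ac.append(LR_ac)
print("Accuracy LR: ", LR_ac)
Accuracy LR: 79.8507462686567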
LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
LDA = model.fit(x_train, y_train)
LDA_prob=model.predict_proba(x_test)
LDA_ac=model.score(x_test,y_test)*100
model_ac.append(LDA_ac)
print("Accuracy LDA: ",LDA_ac)
Accuracy LDA: 82.08955223880598

DTC
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
DTC = model.fit(x_train, y_train)
DTC_prob=model.predict_proba(x_test)
DTC_ac=model.score(x_test,y_test)*100
model_ac.append(DTC_ac)
print("Accuracy DTC: ",DTC_ac)
Accuracy DTC: 79.8507462686567
GNB
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
GNB=model.fit(x_train, y_train)
GNB_prob=model.predict_proba(x_test)
GNB_ac=model.score(x_test,y_test)*100
model_ac.append(GNB_ac)
print("Accuracy GNB: ",GNB_ac)
Accuracy GNB: 85.07462686567165
KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
KNN=model.fit(x_train, y_train)
KNN_prob=model.predict_proba(x_test)
KNN_ac=model.score(x_test,y_test)*100
model_ac.append(KNN_ac)
print("Accuracy KNN: ",KNN_ac)
Accuracy KNN: 82.08955223880598
RFC
from sklearn.ensemble import RandomForestClassifier
model= RandomForestClassifier()
RFC = model.fit(x_train, y_train)
RFC_prob=model.predict_proba(x_test)
RFC_ac=model.score(x_test,y_test)*100
model_ac.append(RFC_ac)
print("Accuracy RFC: ",RFC_ac)
Accuracy RFC: 85.82089552238806
GBC
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
GBC = model.fit(x_train,y_train)
GBC_prob=model.predict_proba(x_test)
GBC_ac=model.score(x_test,y_test)*100
model_ac.append(GBC_ac)
print("Accuracy GBC: ",GBC_ac)
Accuracy GBC: 79.8507462686567
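
Scaling was skipped for the models above. Distance-based models such as KNN are generally sensitive to feature scale regardless of how the features are distributed, so it may be worth comparing a scaled variant. A minimal sketch, assuming synthetic data in place of `Transport.csv` (the feature values and the `make_pipeline` setup are illustrative, not the author's method):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: two features on very different scales
rng = np.random.default_rng(0)
n = 200
salary = rng.normal(16, 10, n)       # wide range, loosely like Salary
licence = rng.integers(0, 2, n)      # binary, like license
X_demo = np.column_stack([salary, licence])
y_demo = (licence == 0).astype(int)  # label depends only on the small-scale feature

xtr, xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fitted on the training fold only, then applied to the test fold
raw_knn = KNeighborsClassifier().fit(xtr, ytr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(xtr, ytr)

raw_acc = raw_knn.score(xte, yte)
scaled_acc = scaled_knn.score(xte, yte)
print("KNN accuracy without scaling:", raw_acc)
print("KNN accuracy with scaling:   ", scaled_acc)
```

Whether scaling changes the ranking on the real data would need to be checked the same way, with `x_train` and `x_test` in place of the synthetic arrays.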

# print(LR_prob)
display(y_test)
327 1
233 0
122 1
102 1
71 1
..
213 1
416 1
423 0
325 0
403 1
Name: Transport, Length: 134, dtype: int64
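
The task also asks for inferences from the confusion matrix, which the write-up does not show. A sketch of how they could be derived with scikit-learn, using made-up labels in place of `y_test` and `model.predict(x_test)`:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (1 = Public Transport, 0 = other)
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, ordered [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(cm)
print(accuracy, precision, recall)
```

Running the same calls on both the training and the test predictions of each model would give the train-versus-test comparison that the task statement asks for.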

from sklearn.metrics import roc_curve


from sklearn.metrics import roc_auc_score

LR_fpr, LR_tpr, LR_thresh = roc_curve(y_test, LR_prob[:,1], pos_label=1)


LDA_fpr, LDA_tpr, LDA_thresh = roc_curve(y_test, LDA_prob[:,1], pos_label=1)
DTC_fpr, DTC_tpr, DTC_thresh = roc_curve(y_test, DTC_prob[:,1], pos_label=1)
GNB_fpr, GNB_tpr, GNB_thresh = roc_curve(y_test, GNB_prob[:,1],pos_label=1)
KNN_fpr, KNN_tpr, KNN_thresh = roc_curve(y_test, KNN_prob[:,1],pos_label=1)
RFC_fpr, RFC_tpr, RFC_thresh = roc_curve(y_test, RFC_prob[:,1],pos_label=1)
GBC_fpr, GBC_tpr, GBC_thresh = roc_curve(y_test,GBC_prob[:,1],pos_label=1)
random_prob = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_prob, pos_label=1)
print(LR_fpr)
LR_auc_score = roc_auc_score(y_test, LR_prob[:,1])
LDA_auc_score = roc_auc_score(y_test, LDA_prob[:,1])
DTC_auc_score = roc_auc_score(y_test, DTC_prob[:,1])
GNB_auc_score = roc_auc_score(y_test, GNB_prob[:,1])
KNN_auc_score = roc_auc_score(y_test, KNN_prob[:,1])
RFC_auc_score = roc_auc_score(y_test, RFC_prob[:,1])
GBC_auc_score = roc_auc_score(y_test, GBC_prob[:,1])
# Note: RFC_auc_score is computed above but is not included in this print
print(LR_auc_score,LDA_auc_score,DTC_auc_score,GNB_auc_score,KNN_auc_score,GBC_auc_score)
[0. 0. 0. 0.02439024 0.02439024 0.04878049
0.04878049 0.07317073 0.07317073 0.09756098 0.09756098 0.09756098
0.09756098 0.14634146 0.14634146 0.14634146 0.14634146 0.19512195
0.19512195 0.2195122 0.2195122 0.24390244 0.24390244 0.26829268
0.26829268 0.29268293 0.29268293 0.29268293 0.29268293 0.31707317
0.31707317 0.3902439 0.3902439 0.41463415 0.41463415 0.46341463
0.46341463 0.48780488 0.48780488 0.6097561 0.6097561 0.63414634
0.63414634 0.68292683 0.68292683 0.85365854 0.90243902 0.92682927
0.97560976 1. ]
0.8173354314188302 0.8318908995541568 0.7866509310254393 0.8465774980330448 0.7932074482035143 0.8196957776029373

import matplotlib.pyplot as plt


plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
# plot roc curves
plt.plot(LR_fpr, LR_tpr, linestyle='--', color='orange', label='Logistic Regression')
plt.plot(LDA_fpr, LDA_tpr, linestyle='--', color='green', label='LinearDiscriminantAnalysis')
plt.plot(DTC_fpr, DTC_tpr, linestyle='--', color='purple', label='DecisionTreeClassifier')
plt.plot(GNB_fpr, GNB_tpr, linestyle='--', color='red', label='Naive Bayes')
plt.plot(KNN_fpr, KNN_tpr, linestyle='--', color='blue', label='KNeighborsClassifier')
plt.plot(RFC_fpr, RFC_tpr, linestyle='--', color='pink', label='RandomForestClassifier')
plt.plot(GBC_fpr, GBC_tpr, linestyle='--', color='black', label='GradientBoostingClassifier')
# diagonal reference line from the constant (all-zero) scores
plt.plot(p_fpr, p_tpr, linestyle='--', color='grey', label='No-skill baseline')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()

4. Which model performs the best?

label=["LR","LDA","DTC","GNB","KNN","RFC","GBC"]
plt.pie(model_ac,labels=label,explode=[0,0,0,0.1,0,0.1,0],autopct='%1.1f%%')
[Pie chart: each model's share of the summed test accuracies. LR 13.9%, LDA 14.3%, DTC 13.9%, GNB 14.8%,
KNN 14.3%, RFC 14.9%, GBC 13.9%; the GNB and RFC wedges are exploded.]

plt.bar(label,model_ac)
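
A pie chart of accuracies is hard to read, since every wedge is close to one seventh. Ranking the test accuracies reported in the model runs is more direct. The sketch below hard-codes those printed values (rounded) rather than refitting anything:

```python
# Test accuracies (%) as printed in the model runs, rounded to two decimals
reported_acc = {
    "LR": 79.85, "LDA": 82.09, "DTC": 79.85, "GNB": 85.07,
    "KNN": 82.09, "RFC": 85.82, "GBC": 79.85,
}

# Sort from best to worst by accuracy
ranking = sorted(reported_acc.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranking:
    print(f"{name:>3}: {acc:.2f}%")

best_model, best_acc = ranking[0]
```

On these figures the Random Forest (85.82%) comes first and Gaussian Naive Bayes (85.07%) second; among the AUC values that were printed, Gaussian Naive Bayes is the highest (0.847). The tree-based models were fitted without a fixed `random_state`, so a rerun may shift the order slightly.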

                 Age    Gender  Engineer       MBA  Work Exp     Salary   Distance   license
Transport
0          29.694444  0.645833  0.777778  0.201389  8.972222  22.618056  13.423611  0.493056
1          26.813333  0.743333  0.743333  0.276667  5.016667  13.176667  10.315000  0.110000
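
The table above appears to hold the mean of each feature per Transport class (1 = Public Transport, 0 = other); the code that produced it is not shown. It could be generated along these lines, sketched on the first ten encoded rows only, so the numbers will not match the full-data table:

```python
import pandas as pd

# First ten encoded rows from the preview (a subset of the columns)
sample = pd.DataFrame({
    "Age": [28, 23, 29, 28, 27, 26, 28, 26, 22, 27],
    "Salary": [14.3, 8.3, 13.4, 13.4, 13.4, 12.3, 14.4, 10.5, 7.5, 13.5],
    "Distance": [3.2, 3.3, 4.1, 4.5, 4.6, 4.8, 5.1, 5.1, 5.1, 5.2],
    "license": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "Transport": [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# Mean of every feature within each Transport class
class_means = sample.groupby("Transport").mean()
print(class_means)
```

In the full-data table, the class 0 group is on average older (29.7 vs 26.8), better paid (22.6 vs 13.2), travels further (13.4 vs 10.3) and holds a license far more often (0.49 vs 0.11) than the public transport group.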

5. What are your business insights?

Having compared all the models mentioned above, I found that they all perform decently well, although I would
choose a different model for each of the two classes in the training and test sets. For all the models, the
accuracy score on the training set is higher than on the test set.

Business Insights: As we have seen from the data, a lot of respondents tend ...
