Cars Project PDF
1. Basic data summary, Univariate, Bivariate analysis, graphs, checking correlations, outliers and
missing values treatment (if necessary) and check the basic descriptive statistics of the dataset.
2. Split the data into train and test in the ratio 70:30. Is scaling necessary or not?
3. Build the following models on the 70% training data and check the performance of these
models on the Training as well as the 30% Test data using the various inferences from the
Confusion Matrix and plotting an AUC-ROC curve along with the AUC values. Tune the models
wherever required for optimum performance.
This study source was downloaded by 100000830201698 from CourseHero.com on 05-11-2023 02:46:25 GMT -05:00
https://www.coursehero.com/file/106688812/Cars-Projectpdf/
Performing Exploratory Data Analysis
As per the information above, two of the fields (Gender and Transport) hold categorical rather than numeric
data. Later on we convert these two into dummy variables to build and optimise the model. It is important to
check for null and duplicate values in the data, which, if present, would call for appropriate action.
dups = df.duplicated()
As the analysis shows, the data has neither missing values nor duplicate rows, hence
it does not require any corrective action.
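The null and duplicate checks described above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the project's data: the rows are made up, and one duplicate is planted on purpose so the check has something to find.

```python
import pandas as pd

# Toy stand-in for the project's DataFrame `df` (values are illustrative only)
df = pd.DataFrame({
    "Age": [28, 23, 29, 28],
    "Salary": [14.3, 8.3, 13.4, 14.3],
    "Transport": ["Public Transport", "Public Transport", "Car", "Public Transport"],
})

# Count missing values per column
null_counts = df.isnull().sum()

# Flag rows that exactly repeat an earlier row, then count them
dups = df.duplicated()
n_dups = int(dups.sum())

print(null_counts)
print("Duplicate rows:", n_dups)
```

If duplicates were present, `df.drop_duplicates()` would be one way to remove them.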
The distribution graphs here show that each field is either normally or close to normally distributed.
The correlation matrix here shows a strong correlation between Age and Work Exp, which is expected under
normal circumstances, i.e. the higher the age, the higher the work experience. There is also a correlation
between Age and Salary, i.e. the higher the age, the higher the salary, which is consistent with the
correlation between Age and Work Exp. The matrix also covers the correlation between being an Engineer or
an MBA and other fields such as Salary, Work Exp, Distance or license.
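A correlation matrix of this kind can be computed with pandas. The sketch below uses a handful of made-up rows with the same column names as the dataset, so the numbers it prints are illustrative and not the project's results.

```python
import pandas as pd

# Toy rows in the spirit of the dataset; values are illustrative only
df = pd.DataFrame({
    "Age": [22, 24, 26, 28, 32],
    "Work Exp": [1, 2, 4, 5, 9],
    "Salary": [7.5, 8.5, 12.3, 14.4, 15.5],
})

# Pairwise Pearson correlation between the numeric columns
corr = df.corr()
print(corr.round(2))
```

Each cell lies between -1 and 1; values near 1 for Age against Work Exp would match the pattern described above.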
1: Treat outliers.
2: Introduce dummy variables for Transport and Gender, with Transport being our dependent variable.
As the data is close to normally distributed, we skip scaling. Note that distance-based models such as
KNN remain sensitive to differences in feature ranges, so scaling can still affect them.
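The report does not show how the outliers were treated. One common option, assumed here purely for illustration, is to cap values at the interquartile-range (IQR) fences. The helper name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Illustrative salaries with one extreme value (57.0)
salary = pd.Series([8.3, 10.5, 12.3, 13.4, 13.5, 14.3, 57.0], name="Salary")
capped = cap_outliers_iqr(salary)
print(capped.tolist())
```

Capping keeps every row, which matters for a small dataset; dropping the rows instead would be an alternative with different trade-offs.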
# Encode the target: 1 for "Public Transport", 0 for every other mode
trans=[]
for y in df["Transport"]:
    if y=="Public Transport":
        trans.append(1)
    else:
        trans.append(0)
df["Transport"]=trans
df.head(15)
    Age  Gender  Engineer  MBA  Work Exp  Salary  Distance  license  Transport
 0   28       1         0    0         4    14.3       3.2        0          1
 1   23       0         1    0         4     8.3       3.3        0          1
 2   29       1         1    0         7    13.4       4.1        0          1
 3   28       0         1    1         5    13.4       4.5        0          1
 4   27       1         1    0         4    13.4       4.6        0          1
 5   26       1         1    0         4    12.3       4.8        1          1
 6   28       1         1    0         5    14.4       5.1        0          0
 7   26       0         1    0         3    10.5       5.1        0          1
 8   22       1         1    0         1     7.5       5.1        0          1
 9   27       1         1    0         4    13.5       5.2        0          1
10   25       0         1    0         4    11.5       5.2        0          1
11   27       1         1    0         4    13.5       5.3        1          1
12   24       1         1    0         2     8.5       5.4        0          1
13   27       1         1    0         4    13.4       5.5        1          1
14   32       1         1    0         9    15.5       5.5        0          1
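The same 0/1 encoding can be written without an explicit loop. The sketch below is an equivalent vectorized form on a toy frame; the raw category labels other than "Public Transport" ("Male", "Female", "Car", "2Wheeler") are assumptions, since the report does not list them.

```python
import pandas as pd

# Toy frame; labels other than "Public Transport" are assumed for illustration
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Transport": ["Public Transport", "Car", "2Wheeler"],
})

# Target: 1 for "Public Transport", 0 for every other mode
df["Transport"] = (df["Transport"] == "Public Transport").astype(int)

# Binary dummy for Gender (the mapping direction is an assumption)
df["Gender"] = df["Gender"].map({"Male": 1, "Female": 0})

print(df)
```

For columns with more than two categories that should all be kept, `pd.get_dummies` would be the usual alternative.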
3. Build the following models on the 70% training data and check the performance of these models on the
Training as well as the 30% Test data using the various inferences from the Confusion Matrix and plotting an
AUC-ROC curve along with the AUC values. Tune the models wherever required for optimum
performance.
Now that we have treated the outliers and inserted dummy variables, we split the data in a 70:30 ratio to
begin our train and test exercise.
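A 70:30 split can be sketched with scikit-learn as below. The data here is synthetic, and the `stratify` and `random_state` arguments are assumptions rather than settings taken from the report. The optional scaling step is included only to show where it would go: the scaler is fitted on the training rows and then applied to both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and an imbalanced binary target (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 30 + [1] * 70)

# 70% train, 30% test; stratify keeps the class ratio equal in both parts
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Optional scaling: fit on the training data only to avoid leaking test statistics
scaler = StandardScaler().fit(x_train)
x_train_s = scaler.transform(x_train)
x_test_s = scaler.transform(x_test)

print(x_train.shape, x_test.shape)
```

Tree-based models are largely indifferent to scaling, while KNN and regularised linear models usually are not, so the scaled arrays matter mainly for the latter.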
Having observed the difference between the test and train results, we go ahead with different models.
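model_ac=[]
LR
# Assumption: LR is logistic regression; the model definition is missing from the source
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
LR=model.fit(x_train,y_train)
LR_prob=model.predict_proba(x_test)
LR_ac=model.score(x_test,y_test)*100
model_ac.append(LR_ac)
print("Accuracy LR: ",LR_ac)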
Accuracy LR: 79.8507462686567
LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
model = LinearDiscriminantAnalysis()
LDA = model.fit(x_train, y_train)
LDA_prob=model.predict_proba(x_test)
LDA_ac=model.score(x_test,y_test)*100
model_ac.append(LDA_ac)
print("Accuracy LDA: ",LDA_ac)
Accuracy LDA: 82.08955223880598
DTC
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
DTC = model.fit(x_train, y_train)
DTC_prob=model.predict_proba(x_test)
DTC_ac=model.score(x_test,y_test)*100
model_ac.append(DTC_ac)
print("Accuracy DTC: ",DTC_ac)
Accuracy DTC: 79.8507462686567
GNB
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
GNB=model.fit(x_train, y_train)
GNB_prob=model.predict_proba(x_test)
GNB_ac=model.score(x_test,y_test)*100
model_ac.append(GNB_ac)
print("Accuracy GNB: ",GNB_ac)
Accuracy GNB: 85.07462686567165
KNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
KNN=model.fit(x_train, y_train)
KNN_prob=model.predict_proba(x_test)
KNN_ac=model.score(x_test,y_test)*100
model_ac.append(KNN_ac)
print("Accuracy KNN: ",KNN_ac)
Accuracy KNN: 82.08955223880598
RFC
from sklearn.ensemble import RandomForestClassifier
model= RandomForestClassifier()
RFC = model.fit(x_train, y_train)
RFC_prob=model.predict_proba(x_test)
RFC_ac=model.score(x_test,y_test)*100
model_ac.append(RFC_ac)
print("Accuracy RFC: ",RFC_ac)
Accuracy RFC: 85.82089552238806
GBC
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
GBC = model.fit(x_train,y_train)
GBC_prob=model.predict_proba(x_test)
GBC_ac=model.score(x_test,y_test)*100
model_ac.append(GBC_ac)
print("Accuracy GBC: ",GBC_ac)
Accuracy GBC: 79.8507462686567
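All seven models above are fitted with default settings. The task also asks for tuning, which the report does not show; one possible approach, sketched here on synthetic data, is a cross-validated grid search. The choice of KNN and the parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data standing in for the project's features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Illustrative grid: try several neighbourhood sizes with 5-fold cross-validation
param_grid = {"n_neighbors": [3, 5, 7, 9]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(x_train, y_train)

test_score = search.score(x_test, y_test)
print("Best parameters:", search.best_params_)
print("Test accuracy of tuned model:", test_score * 100)
```

The grid search only ever sees the training data, so the test score remains an honest estimate for the tuned model.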
# print(LR_prob)
display(y_test)
327 1
233 0
122 1
102 1
71 1
..
213 1
416 1
423 0
325 0
403 1
Name: Transport, Length: 134, dtype: int64
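The task asks for inferences from the confusion matrix, which the extracted report does not show. A minimal sketch on hand-made labels (not the project's predictions) is given below; in the project, `y_true` would be `y_test` and `y_pred` would come from `model.predict(x_test)`.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration (1 = Public Transport, 0 = other)
y_true = [1, 0, 1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# For binary labels, rows are actual classes and columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print(cm)
print("accuracy:", accuracy, "precision:", precision, "recall:", recall)
```

Accuracy alone can hide poor recall on the minority class, which is why precision and recall are read alongside it.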
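import matplotlib.pyplot as plt
# p_fpr and p_tpr are assumed to be arrays returned by sklearn.metrics.roc_curve
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive Rate')
plt.legend(loc='best')
plt.show()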
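The arrays plotted above come from a ROC computation that is not shown. A minimal sketch of how ROC points and the AUC value are obtained with scikit-learn follows, using four hand-made scores; in the project, `scores` would be the positive-class column of a probability array such as `LR_prob[:, 1]`.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hand-made labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
auc_value = roc_auc_score(y_true, scores)

print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc_value)
```

Passing `fpr` and `tpr` to `plt.plot` with a `label` argument gives `plt.legend` an entry to display for each model.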
label=["LR","LDA","DTC","GNB","KNN","RFC","GBC"]
plt.pie(model_ac,labels=label,explode=[0,0,0,0.1,0,0.1,0],autopct='%1.1f%%')
[Pie chart output: share of the summed accuracies per model. LR 13.9%, LDA 14.3%, DTC 13.9%, GNB 14.8%, KNN 14.3%, RFC 14.9%, GBC 13.9%]
plt.bar(label,model_ac)
[Figure not recovered; axis labels: Transport; Age, Gender, Engineer, MBA, Work Exp, Salary, Distance, license]
We have compared all the models mentioned above and found that all of them perform decently well, with test
accuracy between about 79.9% and 85.8%; RFC (85.8%) and GNB (85.1%) score highest on the test data. I would
nevertheless choose a different model for each of the two classes, based on the training and test sets. All
the models have an accuracy score that is higher on the training set than on the test set.
Business Insights: As we have seen from the data, a lot of respondents tend ...
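The train-versus-test comparison mentioned above can be checked by scoring each fitted model on both parts of the split. The sketch below does this for two of the models on synthetic data; the dataset and the `random_state` values are assumptions, so the printed numbers will not match the report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the project's features and target
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

models = {"DTC": DecisionTreeClassifier(random_state=1), "GNB": GaussianNB()}
results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    train_ac = model.score(x_train, y_train) * 100
    test_ac = model.score(x_test, y_test) * 100
    results[name] = (train_ac, test_ac)
    print(f"{name}: train {train_ac:.1f}%, test {test_ac:.1f}%, gap {train_ac - test_ac:.1f}")
```

A large gap, as an unpruned decision tree typically shows, points to overfitting and is a cue for tuning or for preferring a simpler model.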