Project 5 PDF
Megha Bhola
PROJECT OBJECTIVE
The objective is to identify the mode of transport most preferred by employees. The data set contains information about employees' mode of transport along with their professional and personal details. The project needs to predict whether or not an employee will use a car as the mode of transport, and to identify the variables that drive that decision.
We perform exploratory data analysis on the Cars.csv data set and then build predictive models to identify likely car users. This is done through histograms, box plots and the identification of outliers, which narrow down the employee segments of interest.
1. install.packages("caret")
2. install.packages("car")
3. install.packages("carData")
4. install.packages("DMwR")
5. install.packages("grid")
6. install.packages("rpart")
7. install.packages("rpart.plot")
8. install.packages("randomForest")
9. install.packages("lattice")
10. install.packages("ggplot2")
11. install.packages("scales")
12. library(ROCR)
13. library(ineq)
14. library(rattle)
15. library(RColorBrewer)
16. install.packages("AER")
17. install.packages("lmtest")
18. install.packages("zoo")
19. install.packages("sandwich")
20. install.packages("tidyr")
21. library('MASS')
carsbasedata<-read.csv("C:\\Users\\MEGHA\\Desktop\\Project Cars\\cars.csv",header=TRUE)
str(carsbasedata)
carsbasedata$Gender<-ifelse(carsbasedata$Gender =='Male',1,0)
table(carsbasedata$Gender)
0 1
128 316
carsbasedata$Transport<-ifelse(carsbasedata$Transport =='Car',1,0)
table(carsbasedata$Transport)
0 1
383 61
Our primary interest, as per the problem statement, is to understand the factors influencing car usage. Hence we recode the Transport column for car usage: it takes the value 0 for Public Transport and 2 Wheeler, and 1 for Car.
carsbasedata$Engineer<-as.factor(carsbasedata$Engineer)
carsbasedata$MBA<-as.factor(carsbasedata$MBA)
carsbasedata$license<-as.factor(carsbasedata$license)
sum(carsbasedata$Transport == 1)/nrow(carsbasedata)
[1] 0.1373874
carsbasedata$Transport<-as.factor(carsbasedata$Transport)
summary(carsbasedata)
The minimum age is 18 and the maximum is 43, with a mean of about 27.75.
KNN is an algorithm that matches a point with its closest k neighbours in a multi-dimensional space. It can be used for continuous, discrete, ordinal and categorical data, which makes it particularly useful for dealing with all kinds of missing data. The assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points closest to it, based on the other variables. Here a new logical column, MBA_imp, has been created with exactly one value set to TRUE, meaning one missing value has been imputed.
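As a hedged illustration of the idea behind knnImputation() from the DMwR package (which operates on the whole data frame at once), a minimal base-R sketch on a tiny hypothetical data set:

```r
# Minimal sketch of KNN imputation (hypothetical toy data, base R only):
# fill a missing MBA flag with the majority vote of the k rows closest in
# Age and Salary, the same idea knnImputation() applies across all columns.
knn_impute_mba <- function(df, row, k = 3) {
  complete <- df[!is.na(df$MBA), ]
  # Euclidean distance on the two numeric predictors (unscaled, for brevity;
  # knnImputation() scales the variables first)
  d <- sqrt((complete$Age - df$Age[row])^2 + (complete$Salary - df$Salary[row])^2)
  votes <- complete$MBA[order(d)[seq_len(k)]]
  as.numeric(names(which.max(table(votes))))   # majority vote among neighbours
}
toy <- data.frame(Age    = c(25, 26, 27, 40, 26),
                  Salary = c(10, 11, 12, 40, 10.5),
                  MBA    = c(0, 0, 1, 1, NA))
toy$MBA[5] <- knn_impute_mba(toy, 5)   # the three closest rows vote: 0
```

The toy row with the missing MBA value sits next to two MBA = 0 rows and one MBA = 1 row, so the vote imputes 0.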
summary(transportimputeddata)
## Age Gender Engineer MBA
## Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000
## Median :27.00 Median :1.0000 Median :0.0000
## Mean :27.75 Mean :0.7545 Mean :0.2523
## 3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :1.0000 Max. :1.0000
## Work.Exp Salary Distance license
## Min. : 0.0 Min. : 6.50 Min. : 3.20 Min. :0.0000
## 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000
## Median : 5.0 Median :13.60 Median :11.00 Median :0.0000
## Mean : 6.3 Mean :16.24 Mean :11.32 Mean :0.2342
## 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
## Max. :24.0 Max. :57.00 Max. :23.40 Max. :1.0000
## Transport MBA_imp
## 2Wheeler : 83 Mode :logical
## Car : 61 FALSE:443
## Public Transport:300 TRUE :1
nrow(transport_employee_aval_final)
## [1] 444
EDA
boxplot(carsbasedata$Age~carsbasedata$Engineer,main="Age vs Eng.")
boxplot(carsbasedata$Age~carsbasedata$MBA,main="Age vs MBA.")
As expected from the above, there is not much difference; people with different qualifications are employed across the firm.
boxplot(carsbasedata$Salary ~carsbasedata$Engineer, main = "Salary vs Eng.")
0 - Female
1 - Male
It shows an almost equal distribution of males and females, with not much difference in work experience.
table(carsbasedata$license,carsbasedata$Transport)
Assumption 1
Let us assume that the higher the salary, the higher the chance of commuting by car.
boxplot(carsbasedata$Salary~carsbasedata$Transport, main="Salary vs Transport")
The plot clearly shows that commuting by car increases as salary increases.
Assumption 2
With age, employees prefer commuting by car.
[1] 0.9322364
The plot depicts that the lower age group mostly prefers 2-wheelers/public transport; with age, the car becomes the preferred option.
Assumption 3
For greater distance car is preferred over 2-wheeler and public transport
Assumption 4
Gender based preference for transport
table(carsbasedata$Gender,carsbasedata$Transport)
From the above data it is visible that female employees use public transport more; only about 40% of females use private transport.
set.seed(400)
carindex<-createDataPartition(carsbasedata$Transport,p=0.7,list = FALSE,times = 1)
carsdatatrain<-carsbasedata[carindex,]
carsdatatest<-carsbasedata[-carindex,]
prop.table(table(carsdatatrain$Transport))
## 0 1
## 0.8621795 0.1378205
prop.table(table(carsdatatest$Transport))
## 0 1
## 0.8636364 0.1363636
carsdatatrain<-carsdatatrain[,c(1:8,10)]
carsdatatest<-carsdatatest[,c(1:8,10)]
## The train and test data have almost same percentage of cars usage as the base data
attach(carsdatatrain)
prop.table(table(carsdataSMOTE$Transport))
## 0 1
## 0.5 0.5
We now have an equal split between car users and non-car users. Let us proceed with building the models.
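The balanced split above comes from SMOTE (in the DMwR package): conceptually, each synthetic minority record is an interpolation between a real minority record and one of its nearest minority-class neighbours. A hedged base-R sketch with hypothetical values:

```r
# Conceptual SMOTE step (hypothetical minority records): place one synthetic
# car user at a random point on the segment between a record and a neighbour.
set.seed(400)
minority  <- data.frame(Age = c(34, 36, 38), Salary = c(30, 35, 40))
gap       <- runif(1)                                    # random position in (0, 1)
synthetic <- minority[1, ] + gap * (minority[2, ] - minority[1, ])
synthetic   # Age lies between 34 and 36, Salary between 30 and 35
```

Because the synthetic point lies between real minority records, SMOTE enlarges the minority class without simply duplicating rows.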
## Model Building: We will use the logistic regression method on the SMOTE data to understand the factors influencing car usage. Since we have only a limited number of variables, we will use them all in model building.
outcomevar<-'Transport'
regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")
summary(carsglm$finalModel)
## Deviance Residuals:
## (values truncated in the export)
## Coefficients:
## (coefficient table truncated in the export)
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## AIC: 35.959
carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")
varImp(object = carsglm)
## Overall
## Age 100.000
## license1 68.845
## GenderMale 64.056
## Work.Exp 59.351
## Salary 58.720
## MBA1 24.219
## Engineer1 5.813
## Distance 0.000
Model Interpretation
From the model we see that Age and license are the most significant variables. Looking at the odds and probabilities table, a one-year increase in age corresponds to roughly a 98% probability that the employee will use a car and, as expected, holding a license corresponds to roughly a 99% probability. A one-lakh increase in salary increases the probability of car usage by 72%. The null deviance of this model is 357.664 and the residual deviance is 17.959, giving a McFadden R-square of about 0.95, a very good fit. Accuracy and Kappa values are also high. We shall do the prediction based on this model.
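The fit statistic and the odds-to-probability conversion quoted above can be reproduced from the reported numbers (a quick base-R check; the license odds ratio of 99 used below is illustrative):

```r
# McFadden's pseudo R-square from the reported deviances
null_dev  <- 357.664
resid_dev <- 17.959
mcfadden  <- 1 - resid_dev / null_dev        # ~0.95
# Converting an odds ratio (exp(coefficient)) into a probability: p = OR/(1+OR)
or_to_prob <- function(or) or / (1 + or)
round(c(mcfadden = mcfadden, p_license = or_to_prob(99)), 3)
```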
## Reference
## Prediction 0 1
## 0 108 1
## 1 6 17
##
## Accuracy : 0.947
## 95% CI : (0.8938, 0.9784)
## No Information Rate : 0.8636
## P-Value [Acc > NIR] : 0.001692
##
## Kappa : 0.7984
## Mcnemar's Test P-Value : 0.130570
##
## Sensitivity : 0.9444
## Specificity : 0.9474
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.9908
## Prevalence : 0.1364
## Detection Rate : 0.1288
## Detection Prevalence : 0.1742
## Balanced Accuracy : 0.9459
##
## 'Positive' Class : 1
##
carusagepreddata<-carsdatatest
carusagepreddata$predictusage<-carusageprediction
Interpretation of Prediction
We see that the overall prediction accuracy is 94.7%, with almost all non-users predicted accurately and a 94.4% sensitivity in predicting car users.
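The headline metrics follow directly from the confusion-matrix counts above (rows are predictions, columns the reference; positive class = 1):

```r
# Recompute accuracy, sensitivity and specificity from the raw counts
cm <- matrix(c(108, 6, 1, 17), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))
accuracy    <- sum(diag(cm)) / sum(cm)        # (108 + 17) / 132
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 17 / 18  -> car users caught
specificity <- cm["0", "0"] / sum(cm[, "0"])  # 108 / 114 -> non-users caught
round(c(accuracy, sensitivity, specificity), 4)
```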
Let us perform the prediction for the two given cases
carunknown<-read.csv("C:\\Users\\MEGHA\\Desktop\\Project Cars\\cars.csv",header=TRUE)
carunknown$license<-as.factor(carunknown$license)
carunknown$Engineer<-as.factor(carunknown$Engineer)
carunknown$MBA<-as.factor(carunknown$MBA)
carunknown$predictcaruse<-predict.train(object = carsglm,carunknown[,regressors],type = "raw")
print(carunknown)
As seen, the model has predicted that both new rows are not car users.
Using glmnet method of caret package to try and run ridge Regression Model
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.008540753.
varImp(object = carsglmnet)
## glmnet variable importance
##
## Overall
## license1 100.000
## Age 99.068
## Distance 21.095
## Salary 8.684
## Work.Exp 0.000
plot(varImp(object = carsglmnet), main="Variable Importance for Logistic Regression - Post Ridge Regularization")
We get license and Age as the most significant variables, followed by the distance in the importance of variables. Let
us try prediction using the regularized model
We see that the accuracy of prediction is 94.7%, with almost all non-users getting predicted accurately. However, we have an 88% accuracy in predicting car users, slightly lower than the regular GLM model. The unknown cases are also predicted as non-car users.
NAÏVE BAYES
A-priori probabilities and conditional probability tables were printed for each predictor (Age, Gender, Engineer, MBA, Work.Exp, Salary, Distance, license); most of the numeric columns were lost in the export. One surviving row, for Engineer, shows 2Wheeler users split 0.288 / 0.712 between non-engineers (0) and engineers (1).
This gives us the factors that help explain an employee's decision to use a car or not. The general way to interpret this output is: for a factor variable such as license, 74% of 2-wheeler users have no license and 26% do; for a continuous variable such as Distance, 2-wheelers are used by people whose mean commute distance is 12.4 km with a standard deviation of 3.18.
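To make this interpretation concrete, a hedged sketch of the Bayes-rule update naiveBayes performs per predictor (the numbers below are hypothetical, chosen loosely in the range of the tables above):

```r
# Posterior over the three transport classes after observing "has a license",
# combining prior class probabilities with P(license = 1 | class).
prior       <- c(`2Wheeler` = 0.22, `Public Transport` = 0.67, Car = 0.11)
p_lic_given <- c(`2Wheeler` = 0.26, `Public Transport` = 0.18, Car = 0.76)
posterior   <- prior * p_lic_given
posterior   <- posterior / sum(posterior)     # normalise to sum to 1
round(posterior, 3)   # the license evidence shifts mass towards Car
```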
NB_Predictions=predict(Naive_Bayes_Model,cars_test)
table(NB_Predictions,cars_test$Transport)
##
## NB_Predictions 2Wheeler Car Public Transport
## 2Wheeler 6 0 6
## Car 2 14 3
## Public Transport 16 4 81
Though the problem statement is to more predominantly look at the factors influencing car usage, the dataset given
is a good case for a linear discriminant analysis to understand the factors driving the choice of transportation mode.
##Split the original base data into test and train samples again
carsbasedatalda<- read.csv("C:\\Users\\MEGHA\\Desktop\\Project Cars\\cars.csv",header=TRUE)
carsbasedatalda$Gender<-as.factor(carsbasedatalda$Gender)
carsbasedatalda$Engineer<-as.factor(carsbasedatalda$Engineer)
carsbasedatalda$MBA<-as.factor(carsbasedatalda$MBA)
carsbasedatalda<-knnImputation(carsbasedatalda)
set.seed(400)
carindexlda<-createDataPartition(carsbasedatalda$Transport, p=0.7,list = FALSE,times = 1)
carstrainlda<-carsbasedatalda[carindexlda,]
carstestlda<-carsbasedatalda[-carindexlda,]
carstrainlda$license<-as.factor(carstrainlda$license)
carstestlda$license<-as.factor(carstestlda$license)
cartrainlda.car<-carstrainlda[carstrainlda$Transport %in% c("Car", "Public Transport"),]
cartrainlda.twlr<-carstrainlda[carstrainlda$Transport %in% c("2Wheeler", "Public Transport"),]
cartrainlda.car$Transport<-as.character(cartrainlda.car$Transport)
cartrainlda.car$Transport<-as.factor(cartrainlda.car$Transport)
cartrainlda.twlr$Transport<-as.character(cartrainlda.twlr$Transport)
cartrainlda.twlr$Transport<-as.factor(cartrainlda.twlr$Transport)
prop.table(table(cartrainlda.car$Transport))
##
## Car Public Transport
## 0.1699605 0.8300395
prop.table(table(cartrainlda.twlr$Transport))
##
## 2Wheeler Public Transport
## 0.2193309 0.7806691
table(carsdatatrainldasm$Transport)
##
## 2Wheeler Public Transport Car
## 118 118 86
attach(carsdatatrainldasm)
## Call:
## lda(x, grouping = y)
##
## Prior probabilities of groups:
## 2Wheeler Public Transport Car
## 0.3664596 0.3664596 0.2670807
##
## Group means:
## Age Work.Exp Salary Distance license1 GenderMale
## 2Wheeler 25.25983 3.726144 12.36309 12.23362 0.2627119 0.5000000
## Public Transport 26.55085 4.983051 13.51864 10.68729 0.1440678 0.6949153
## Car 35.75833 15.854594 36.15677 15.59036 0.7093023 0.7093023
## Engineer1 MBA1
## 2Wheeler 0.6949153 0.2542373
## Public Transport 0.6864407 0.2796610
## Car 0.8488372 0.2558140
##
## Coefficients of linear discriminants:
## LD1 LD2
## Age 0.210574627 -0.1462080
## Work.Exp 0.081188593 -0.0836168
## Salary 0.007534406 0.0476583
## Distance 0.074968787 0.1906158
## license1 0.416869466 1.2676211
## GenderMale -0.186709183 -1.0865173
## Engineer1 0.150855710 0.1757128
## MBA1 -0.072490532 -0.1201606
##
## Proportion of trace:
## LD1 LD2
## 0.9477 0.0523
As per the output, the first discriminant function achieves about 95% of the separation (proportion of trace 0.9477); Age, Work Experience and Salary play an important part in the choice of transport, followed by Distance and license.
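As a hedged illustration (hypothetical employee profile; coefficients copied from the output above), an LD1 score is just a linear combination of the predictors. The actual lda() scores are additionally centred, so only relative comparisons are meaningful here:

```r
# Score a hypothetical car-like profile on LD1 with the printed coefficients
ld1 <- c(Age = 0.210574627, Work.Exp = 0.081188593, Salary = 0.007534406,
         Distance = 0.074968787, license1 = 0.416869466,
         GenderMale = -0.186709183, Engineer1 = 0.150855710, MBA1 = -0.072490532)
profile <- c(Age = 36, Work.Exp = 16, Salary = 36, Distance = 15,
             license1 = 1, GenderMale = 1, Engineer1 = 1, MBA1 = 0)
sum(ld1 * profile)   # higher LD1 pushes towards the Car group
```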
The overall accuracy is 78%, with car usage predicted at 83%. Let us understand the transportation choice of the two unknown cases.
As per the prediction by LDA, both male/female employees will take public transport as a means of transportation
## Call:
## mda::fda(formula = as.formula(".outcome ~ ."), data = dat, method = mda::gen.ridge, lambda = param$lambda)
##
## Dimension: 2
##
## Percent Between-Group Variance Explained:
## v1 v2
## 94.77 100.00
## Degrees of Freedom (per dimension): 7.992829
## Training Misclassification Error: 0.28882 ( N = 322 )
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.1.
As per this model's prediction, both employees will again take public transport; the two cases are predicted to choose Public Transport.
Decision tree method (CART) to infer and understand from the dataset
predictions_CART<-predict(carscart,carstestlda)
confusionMatrix(predictions_CART,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_CART, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 31
## Car 1 17 5
## Public Transport 8 1 54
##
## Overall Statistics
##
## Accuracy : 0.6515
## 95% CI : (0.5637, 0.7323)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.8007180
##
## Kappa : 0.4068
## Mcnemar's Test P-Value : 0.0006336
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.9444 0.6000
## Specificity 0.7130 0.9474 0.7857
## Pos Pred Value 0.3261 0.7391 0.8571
## Neg Pred Value 0.8953 0.9908 0.4783
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1288 0.4091
## Detection Prevalence 0.3485 0.1742 0.4773
## Balanced Accuracy 0.6690 0.9459 0.6929
We get an overall accuracy of 65%, with car usage predicted at 94% sensitivity. Let us predict for the unknown cases.
CART model also predicts the transport mode as Public Transport for the 2 cases
carsxgb$finalModel
## xgb.Booster
## raw: 38.5 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, num_class = length(lev),
## objective = "multi:softprob")
## params (as set within xgb.train):
## eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "1",
## num_class = "3", objective = "multi:softprob", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## niter: 50
## xNames: Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
## problemType: Classification
## tuneValue:
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 1 50 1 0.3 0 0.6 1 1
## obsLevels: 2Wheeler, Public Transport, Car
## param:
## list()
predictions_xgb<-predict(carsxgb,carstestlda)
confusionMatrix(predictions_xgb,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_xgb, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data to match
## Confusion Matrix and Statistics
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 18
## Car 1 17 1
## Public Transport 8 1 71
##
## Overall Statistics
##
## Accuracy : 0.7803
## 95% CI : (0.7, 0.8477)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.00823
##
## Kappa : 0.5789
## Mcnemar's Test P-Value : 0.18342
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.9444 0.7889
## Specificity 0.8333 0.9825 0.7857
## Pos Pred Value 0.4545 0.8947 0.8875
## Neg Pred Value 0.9091 0.9912 0.6346
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1288 0.5379
## Detection Prevalence 0.2500 0.1439 0.6061
## Balanced Accuracy 0.7292 0.9635 0.7873
The overall accuracy is 78%, with car usage predicted at 94% sensitivity (89% positive predictive value).
Interestingly, the XGBoost model has predicted the case with Male, license = yes and a distance of 5 km to be using a 2 Wheeler as the mode of transport, and the other case to be using Public Transport.
We will now use multinomial logistic regression to understand the factors driving transport mode choice. The data is releveled with respect to one of the three classes, and two independent logistic regression models are run; in our case, the 2-Wheeler class is taken as the reference class.
carsmlr$finalModel
nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)
## Coefficients:
## (Intercept) Age GenderMale Engineer1 MBA1
## Public Transport -2.924602 0.1867473 0.9073215 -0.1392119 0.1814413
## Car -73.157600 2.4072670 -0.8481166 1.4958898 -0.7794798
## Work.Exp Salary Distance license1
## Public Transport 0.1027101 -0.07091949 -0.1481504 -1.154896
## Car -1.0684135 0.17301043 0.3334671 1.378113
##
## Residual Deviance: 327.5019
## AIC: 363.5019
carmlrcoeff<-exp(coef(carsmlr$finalModel))
write.csv(carmlrcoeff,file = "Coeffsmlr.csv")
The model has a residual deviance of 327.50. Exponentiating the coefficients, an increase in Age by one year multiplies the odds of taking Public Transport rather than a 2 Wheeler by exp(0.187) ≈ 1.21, whereas it multiplies the odds of choosing a Car by exp(2.407) ≈ 11.1. Age and license are the two most important factors in deciding the mode of transport.
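The odds multipliers come from exponentiating the multinomial coefficients; a quick base-R check on the Age row of the output above:

```r
# Odds ratios for a one-year increase in Age, relative to the 2Wheeler class
age_coef <- c(PublicTransport = 0.1867473, Car = 2.4072670)  # from the output
round(exp(age_coef), 2)   # ~1.21 for Public Transport, ~11.10 for Car
```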
predictions_mlr<-predict(carsmlr,carstestlda)
confusionMatrix(predictions_mlr,carstestlda$Transport)
## Warning: Levels are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 16 0 8
## Car 1 18 2
## Public Transport 7 0 80
##
## Overall Statistics
## Accuracy : 0.8636
## 95% CI : (0.7931, 0.9171)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 1.242e-06
## Kappa : 0.725
## Mcnemar's Test P-Value : 0.3815
## Statistics by Class:
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6667 1.0000 0.8889
## Specificity 0.9259 0.9737 0.8333
## Pos Pred Value 0.6667 0.8571 0.9195
## Neg Pred Value 0.9259 1.0000 0.7778
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1212 0.1364 0.6061
## Detection Prevalence 0.1818 0.1591 0.6591
## Balanced Accuracy 0.7963 0.9868 0.8611
The overall accuracy is 86%, with car usage predicted at 100% sensitivity. Let us try and predict for the two unknown cases.
The out-of-bag error estimate is 16.7% on the training dataset. Age, Work Experience, Distance and Salary are the most significant variables. Let us try and predict for the test data.
BAGGING
# call reconstructed from the truncated export; training data name assumed
Cars.bagging <- bagging(Transport ~ ., data = cars_train,
                        control = rpart.control(maxdepth=5, minsplit=15))
cars_test$pred.Transport <- predict(Cars.bagging, cars_test)
table(cars_test$Transport,cars_test$pred.Transport)
##
## 2Wheeler Car Public Transport
## 2Wheeler 8 1 15
## Car 1 16 1
## Public Transport 7 3 80
predictions_rf<-predict(carsrf,carstestlda)
confusionMatrix(predictions_rf,carstestlda$Transport)
## Warning: Levels are not in the same order for reference and data. Refactoring data to match.
We have an overall accuracy of 72% and 94% accuracy for the prediction of car usage. Let us now predict the choice of transport for the two unknown cases.
We have one record (female, engineer) predicted to choose 2 Wheeler, and the other predicted to choose Public Transport.