
PROJECT #5

Predicting mode of Transport (ML)

Megha Bhola
PROJECT OBJECTIVE

The objective is to identify the mode of transport most preferred by the employees. The data includes information
about each employee's mode of transport along with professional and personal details. The project needs to predict
whether or not an employee will use a car as the mode of transport, and to identify the variables that drive that
decision.

PERFORM EXPLORATORY DATA ANALYSIS

We perform exploratory data analysis on the Cars.csv data set, carry out predictive analysis, and build a model to
identify likely car users. This is done through various histograms and the identification of outliers, to shorten the
process of identifying the car-user base.

Environment set up and data analysis

Installation of packages and library setup

1. install.packages("caret")
2. install.packages("car")
3. install.packages("carData")
4. install.packages("DMwR")
5. install.packages("grid")
6. install.packages("rpart")
7. install.packages("rpart.plot")
8. install.packages("randomForest")
9. install.packages("lattice")
10. install.packages("ggplot2")
11. install.packages("scales")
12. library(ROCR)
13. library(ineq)
14. library(rattle)
15. library(RColorBrewer)
16. install.packages("AER")
17. install.packages("lmtest")
18. install.packages("zoo")
19. install.packages("sandwich")
20. install.packages("tidyr")
21. library('MASS')

Import and read the data set

carsbasedata<-read.csv("C:\\Users\\MEGHA\\Desktop\\Project Cars\\cars.csv",header=TRUE)

str(carsbasedata)

carsbasedata$Gender<-ifelse(carsbasedata$Gender =='Male',1,0)

table(carsbasedata$Gender)
  0   1
128 316

carsbasedata$Transport<-ifelse(carsbasedata$Transport =='Car',1,0)

table(carsbasedata$Transport)
  0   1
383  61
Our primary interest, as per the problem statement, is to understand the factors influencing car usage. Hence we will
create a new column for car usage: it takes the value 0 for Public Transport and 2-Wheeler, and 1 for Car.
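A minimal sketch of that derived column (the name CarUsage is hypothetical, and it is assumed to be built from the
original three-level Transport labels; this becomes the tenth column referenced later when the train/test sets are
subset with c(1:8,10)):

carsbasedata$CarUsage <- as.factor(ifelse(carsbasedata$Transport == "Car", 1, 0))
table(carsbasedata$CarUsage)
## 0: 383, 1: 61, matching the counts above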

## Understand the proportion of cars in Transport Mode


str(carsbasedata)

'data.frame': 444 obs. of 9 variables:


$ Age : int 28 23 29 28 27 26 28 26 22 27 ...
$ Gender : num 1 0 1 0 1 1 1 0 1 1 ...
$ Engineer : int 0 1 1 1 1 1 1 1 1 1 ...
$ MBA : int 0 0 0 1 0 0 0 0 0 0 ...
$ Work.Exp : int 4 4 7 5 4 4 5 3 1 4 ...
$ Salary : num 14.3 8.3 13.4 13.4 13.4 12.3 14.4 10.5 7.5 13.5 ...
$ Distance : num 3.2 3.3 4.1 4.5 4.6 4.8 5.1 5.1 5.1 5.2 ...
$ license : int 0 0 0 0 0 1 0 0 0 0 ...
$ Transport: num 0 0 0 0 0 0 0 0 0 0 ...

## The columns Engineer, MBA and license need to be converted into factors

carsbasedata$Engineer<-as.factor(carsbasedata$Engineer)

carsbasedata$MBA<-as.factor(carsbasedata$MBA)

carsbasedata$license<-as.factor(carsbasedata$license)
sum(carsbasedata$Transport == 1)/nrow(carsbasedata)
## [1] 0.1373874

carsbasedata$Transport<-as.factor(carsbasedata$Transport)
summary(carsbasedata)

## Age Gender Engineer MBA


## Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000
## Median :27.00 Median :1.0000 Median :0.0000
## Mean :27.75 Mean :0.7545 Mean :0.2528
## 3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :1.0000 Max. :1.0000
## NA's :1
## Work.Exp Salary Distance license
## Min. : 0.0 Min. : 6.50 Min. : 3.20 Min. :0.0000
## 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000
## Median : 5.0 Median :13.60 Median :11.00 Median :0.0000
## Mean : 6.3 Mean :16.24 Mean :11.32 Mean :0.2342
## 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
## Max. :24.0 Max. :57.00 Max. :23.40 Max. :1.0000
## Transport
## 2Wheeler : 83
## Car : 61
## Public Transport:300

Here the minimum age is 18 and the maximum is 43, with an average age of about 27.75 years.

Males are approximately 71% of the sample, and Engineers outnumber MBAs.

Compared to the other modes, the most commonly used mode here is Public Transport.

Null Value Replacement

KNN is an algorithm that is useful for matching a point with its closest k neighbours in a multi-dimensional space. It
can be used for data that are continuous, discrete, ordinal or categorical, which makes it particularly useful for
dealing with all kinds of missing data. The assumption behind using KNN for missing values is that a point's value can
be approximated by the values of the points closest to it, based on the other variables. We will see below that a new
logical column MBA_imp is created with exactly one value set to TRUE, meaning one null value has been imputed.

transportimputeddata <- VIM::kNN(data = transport_employee_aval, variable = c("MBA"), k = 7)

## The package name has to be given explicitly here because the function name
## conflicts with the package that provides SMOTE

summary(transportimputeddata)
## Age Gender Engineer MBA
## Min. :18.00 Female:128 Min. :0.0000 Min. :0.0000
## 1st Qu.:25.00 Male :316 1st Qu.:1.0000 1st Qu.:0.0000
## Median :27.00 Median :1.0000 Median :0.0000
## Mean :27.75 Mean :0.7545 Mean :0.2523
## 3rd Qu.:30.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :43.00 Max. :1.0000 Max. :1.0000
## Work.Exp Salary Distance license
## Min. : 0.0 Min. : 6.50 Min. : 3.20 Min. :0.0000
## 1st Qu.: 3.0 1st Qu.: 9.80 1st Qu.: 8.80 1st Qu.:0.0000
## Median : 5.0 Median :13.60 Median :11.00 Median :0.0000
## Mean : 6.3 Mean :16.24 Mean :11.32 Mean :0.2342
## 3rd Qu.: 8.0 3rd Qu.:15.72 3rd Qu.:13.43 3rd Qu.:0.0000
## Max. :24.0 Max. :57.00 Max. :23.40 Max. :1.0000
## Transport MBA_imp
## 2Wheeler : 83 Mode :logical
## Car : 61 FALSE:443
## Public Transport:300 TRUE :1

Revised data set after missing value imputation


The null value has been imputed: the original data frame had 444 observations, and the new data frame has the same number of records.
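To confirm which record was imputed, the MBA_imp indicator column created by kNN() can be inspected:

which(transportimputeddata$MBA_imp)
## returns the row index of the single imputed record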

transport_employee_aval_final <- subset(transportimputeddata, select = Age:Transport)

transport_employee_aval_final_boost <- subset(transportimputeddata, select = Age:Transport)
transport_employee_aval_final_logit <- subset(transportimputeddata, select = Age:Transport)

nrow(transport_employee_aval_final)
## [1] 444
EDA
boxplot(carsbasedata$Age~carsbasedata$Engineer,main="Age vs Eng.")

boxplot(carsbasedata$Age~carsbasedata$MBA,main="Age vs MBA.")

As expected from the above, there is not much difference: people with different qualifications are employed across the
firm.
boxplot(carsbasedata$Salary ~carsbasedata$Engineer, main = "Salary vs Eng.")

boxplot(carsbasedata$Salary ~carsbasedata$MBA, main = "Salary vs MBA.")

The boxplots above show no major impact of qualification on salary.


hist(carsbasedata$Work.Exp, col = "blue", main = "Distribution of work exp")

hist(transport_employee_aval$license, col = 'green')


The work-experience distribution is skewed to the right; this is on expected lines, as there are more juniors than
seniors in any firm.
boxplot(carsbasedata$Work.Exp ~carsbasedata$Gender)

0-Female

1-Male

It shows an almost equal distribution of work experience between male and female employees.
table(carsbasedata$license,carsbasedata$Transport)

    0 (2Wheeler + Public Transport)   1 (Car)
 0                              327        13
 1                               56        48

Assumption 1

Let us assume that the higher the salary, the greater the chance of commuting by car.
boxplot(carsbasedata$Salary~carsbasedata$Transport, main="Salary vs Transport")
The plot clearly shows that commuting by car increases with salary.
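A quick numeric check on this reading of the boxplot (a sketch with base R; Transport here is the binary car flag):

aggregate(Salary ~ Transport, data = carsbasedata, FUN = median)
## median salary for non-car (0) vs car (1) commuters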

Assumption 2

Correlation between age/work experience and car transport


cor(carsbasedata$Age, carsbasedata$Work.Exp)

[1] 0.9322364

boxplot(carsbasedata$Age~carsbasedata$Transport, main="Age vs Transport")

The plot depicts that the lower age group mostly prefers 2-wheelers/public transport; with age, the car becomes the
preferred option.
Assumption 3

Impact of distance on preference for Car


boxplot(carsbasedata$Distance~carsbasedata$Transport, main="Distance vs Transport")

For greater distances, the car is preferred over the 2-wheeler and public transport.

Assumption 4
Gender based preference for transport
table(carsbasedata$Gender,carsbasedata$Transport)

         0 (2Wheeler + Public Transport)   1 (Car)
 Female                              115        13
 Male                                268        48

From the above data it is visible that female employees rely more heavily on public transport: only about 10% of
female employees (13 of 128) commute by car.

Considerations for the model building and data split


The number of records for people travelling by car is a small minority. Hence, we need to use an appropriate sampling
method on the train data. We will explore SMOTE, use logistic regression and decision trees to find the best-fitting
model, and also explore a couple of black-box models for prediction later on.

## Balancing the data

##Split the data into test and train

set.seed(400)

carindex<-createDataPartition(carsbasedata$Transport, p=0.7,list = FALSE,times = 1)

carsdatatrain<-carsbasedata[carindex,]

carsdatatest<-carsbasedata[-carindex,]

prop.table(table(carsdatatrain$Transport))

##         0         1
## 0.8621795 0.1378205

prop.table(table(carsdatatest$Transport))

##         0         1
## 0.8636364 0.1363636

carsdatatrain<-carsdatatrain[,c(1:8,10)]

carsdatatest<-carsdatatest[,c(1:8,10)]

## The train and test data have almost the same percentage of car usage as the base data

## Apply SMOTE on Training data set


library(DMwR)

attach(carsdatatrain)

carsdataSMOTE<-SMOTE(Transport~., carsdatatrain, perc.over = 250,perc.under = 150)

prop.table(table(carsdataSMOTE$Transport))

##   0   1
## 0.5 0.5
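A rough sketch of the SMOTE arithmetic behind this 50/50 split, assuming DMwR's convention that perc.over adds
floor(perc.over/100) synthetic cases per minority case and perc.under then samples majority cases in proportion to the
synthetic ones:

n_min <- 43                       ## minority (car) cases in the training split (0.1378 * 312)
synth <- floor(250/100) * n_min   ## perc.over = 250 -> 2 synthetic cases per original = 86
n_min + synth                     ## 129 minority cases after SMOTE
(150/100) * synth                 ## perc.under = 150 -> 129 majority cases, hence 129/129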

We now have an equal 50/50 split between car users and non-car users. Let us proceed with building the models.
## Model building: we will fit a logistic regression model on the SMOTE data to understand the
## factors influencing car usage. Since we have only a limited number of variables, we will use them all.

##Create control parameter for GLM

outcomevar<-'Transport'

regressors<-c("Age","Work.Exp","Salary","Distance","license","Engineer","MBA","Gender")

trainctrl<-trainControl(method = 'repeatedcv',number = 10,repeats = 3)

carsglm<-train(carsdataSMOTE[,regressors],carsdataSMOTE[,outcomevar],method = "glm",
               family = "binomial",trControl = trainctrl)

summary(carsglm$finalModel)

## NULL

## Deviance Residuals:

## Min 1Q Median 3Q Max

## -2.98326 -0.00024 0.00000 0.00000 1.11801

##

## Coefficients:

## Estimate Std. Error z value Pr(>|z|)

## (Intercept) -118.62275 51.84138 -2.288 0.0221 *

## Age 3.69904 1.69960 2.176 0.0295 *

## Work.Exp -1.20774 0.89958 -1.343 0.1794

## Salary 0.96031 0.72224 1.330 0.1836

## Distance -0.04112 0.32873 -0.125 0.9005

## license1 4.44597 2.89203 1.537 0.1242

## Engineer1 -0.72971 2.98679 -0.244 0.8070

## MBA1 -1.05817 1.70158 -0.622 0.5340

## GenderMale -3.03060 2.10593 -1.439 0.1501

## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##(Dispersion parameter for binomial family taken to be 1)

## Null deviance: 357.664 on 257 degrees of freedom

## Residual deviance: 17.959 on 249 degrees of freedom

## AIC: 35.959

## Number of Fisher Scoring iterations: 12

carglmcoeff<-exp(coef(carsglm$finalModel))
write.csv(carglmcoeff,file = "Coeffs.csv")

varImp(object = carsglm)

## glm variable importance

## Overall

## Age 100.000

## license1 68.845

## GenderMale 64.056

## Work.Exp 59.351

## Salary 58.720

## MBA1 24.219

## Engineer1 5.813

## Distance 0.000

plot(varImp(object = carsglm), main="Variable Importance for Logistic Regression")

Model Interpretation
From the model we see that Age and license are the most significant variables. When we look at the odds and
probabilities table, an increase in age by one year implies a 98% probability that the employee will use a car. As
expected, if the employee has a license, it implies a 99% probability that he/she will use a car. A one-lakh increase
in salary increases the probability of car usage by 72%. The null deviance of this model is 357.664 and the residual
deviance is 17.959, which gives a McFadden R-squared of almost 0.95, a very good fit. The Accuracy and Kappa values
are also high. We shall now do the prediction based on this model.
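As a worked example of how the probability figures above arise, the Age coefficient from the summary can be converted
from log-odds to odds and then to a probability:

exp(3.69904)          ## ~40.4: the odds multiplier of car use per extra year of age
40.4 / (1 + 40.4)     ## ~0.976: the "98% probability" quoted above is odds/(1+odds)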

carusageprediction<-predict.train(object = carsglm,carsdatatest[,regressors],type = "raw")


confusionMatrix(carusageprediction,carsdatatest[,outcomevar], positive='1')

## Confusion Matrix and Statistics

##           Reference
## Prediction   0   1
##          0 108   1
##          1   6  17
##
## Accuracy : 0.947
## 95% CI : (0.8938, 0.9784)
## No Information Rate : 0.8636
## P-Value [Acc > NIR] : 0.001692
##
## Kappa : 0.7984
## Mcnemar's Test P-Value : 0.130570
##
## Sensitivity : 0.9444

## Specificity : 0.9474
## Pos Pred Value : 0.7391
## Neg Pred Value : 0.9908
## Prevalence : 0.1364
## Detection Rate : 0.1288
## Detection Prevalence : 0.1742
## Balanced Accuracy : 0.9459
##
## 'Positive' Class : 1

##
carusagepreddata<-carsdatatest
carusagepreddata$predictusage<-carusageprediction

Interpretation of Prediction
We see that the overall accuracy of prediction is 94.7%, with almost all non-users predicted accurately, and a 94.4%
sensitivity in predicting the car users.
Let us perform the prediction for the two given cases

carunknown<-read.csv("C:\\Users\\MEGHA\\Desktop\\Project Cars\\cars.csv",header=TRUE)
carunknown$license<-as.factor(carunknown$license)
carunknown$Engineer<-as.factor(carunknown$Engineer)
carunknown$MBA<-as.factor(carunknown$MBA)
carunknown$predictcaruse<-predict.train(object = carsglm,carunknown[,regressors],type = "raw")
print(carunknown)

## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse


## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0

As seen, the model has predicted that neither of the two new cases will be a car user.

Improving the Model

Using the glmnet method of the caret package to run a regularized (elastic-net) regression model

trainctrlgn<-trainControl(method = 'cv',number = 10,returnResamp = 'none')


carsglmnet<-train(Transport~Age+Work.Exp+Salary+Distance+license, data = carsdataSMOTE,
                  method = 'glmnet', trControl = trainctrlgn)
carsglmnet
## glmnet
##
## 258 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 232, 232, 232, 232, 232, 232, ...
## Resampling results across tuning parameters:
##
## alpha lambda Accuracy Kappa
## 0.10 0.0008540753 0.9807692 0.9615385
## 0.10 0.0085407532 0.9730769 0.9461538
## 0.10 0.0854075320 0.9575385 0.9150897
## 0.55 0.0008540753 0.9807692 0.9615385
## 0.55 0.0085407532 0.9769231 0.9538462
## 0.55 0.0854075320 0.9536923 0.9073974
## 1.00 0.0008540753 0.9846154 0.9692308
## 1.00 0.0085407532 0.9846154 0.9692308
## 1.00 0.0854075320 0.9730769 0.9461538

## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda = 0.008540753.
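To inspect the shrunken coefficients at those selected tuning values, the underlying glmnet fit can be queried
directly (a quick sketch):

coef(carsglmnet$finalModel, s = carsglmnet$bestTune$lambda)
## sparse coefficient vector at the chosen lambda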
varImp(object = carsglmnet)
## glmnet variable importance
##
## Overall
## license1 100.000
## Age 99.068
## Distance 21.095
## Salary 8.684
## Work.Exp 0.000

plot(varImp(object = carsglmnet), main="Variable Importance for Logistic Regression - Post Ridge Regularization")
We get license and Age as the most significant variables, followed by Distance. Let us try prediction using the
regularized model.

carusagepredictiongn<-predict.train(object = carsglmnet,carsdatatest[,regressors],type = "raw")

confusionMatrix(carusagepredictiongn,carsdatatest[,outcomevar], positive='1')

## Confusion Matrix and Statistics


##
## Reference
## Prediction 0 1
## 0 109 2
## 1 5 16
##
## Accuracy : 0.947
## 95% CI : (0.8938, 0.9784)
## No Information Rate : 0.8636
## P-Value [Acc > NIR] : 0.001692
##
## Kappa : 0.7896
## Mcnemar's Test P-Value : 0.449692
##
## Sensitivity : 0.8889
## Specificity : 0.9561
## Pos Pred Value : 0.7619
## Neg Pred Value : 0.9820
## Prevalence : 0.1364
## Detection Rate : 0.1212
## Detection Prevalence : 0.1591
## Balanced Accuracy : 0.9225
##
## 'Positive' Class : 1
##
## Let us predict for unknown cases

carunknown$predictcarusage<-predict.train(object = carsglmnet,carunknown[,regressors],type = "raw")


print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusage
## 1 0
## 2 0

We see that the accuracy of prediction is 94.7%, with almost all non-users getting predicted accurately. However, we
have an 88.9% sensitivity in predicting the car users, slightly lower than the plain GLM model. The unknown cases are
also predicted as non-car users.

NAÏVE BAYES

library(e1071)
Naive_Bayes_Model=naiveBayes(Transport ~., data=cars_train)
Naive_Bayes_Model

Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)

A-priori probabilities:
        2Wheeler              Car Public Transport
       0.1891026        0.1378205        0.6730769

Conditional probabilities:
                  Age
Y                      [,1]     [,2]
  2Wheeler         25.25424 2.825946
  Car              36.00000 3.287784
  Public Transport 26.90000 2.963010

                  Gender
Y                     Female      Male
  2Wheeler         0.5084746 0.4915254
  Car              0.2093023 0.7906977
  Public Transport 0.2571429 0.7428571

                  Engineer
Y                          0         1
  2Wheeler         0.2881356 0.7118644
  Car              0.1395349 0.8604651
  Public Transport 0.2523810 0.7476190

                  MBA
Y                          0         1
  2Wheeler         0.7966102 0.2033898
  Car              0.8837209 0.1162791
  Public Transport 0.7619048 0.2380952

                  Work.Exp
Y                       [,1]     [,2]
  2Wheeler          4.033898 3.316449
  Car              15.976744 4.827936
  Public Transport  5.000000 3.018286

                  Salary
Y                      [,1]      [,2]
  2Wheeler         2.432305 0.3841553
  Car              3.550837 0.4303948
  Public Transport 2.514434 0.2949119

                  Distance
Y                      [,1]     [,2]
  2Wheeler         12.44576 3.185274
  Car              15.27442 3.800006
  Public Transport 10.31952 3.175135

                  license
Y                          0         1
  2Wheeler         0.7457627 0.2542373
  Car              0.1860465 0.8139535
  Public Transport 0.8857143 0.1142857

This gives us the rules, or factors, that help explain an employee's decision to use a car or not.

The general way to interpret this output: for a factor variable such as license, we can say that 74% of 2-wheeler
users do not hold a license while 26% do; for a continuous variable such as distance, 2-wheeler users have a mean
commute distance of 12.4 km with a standard deviation of 3.18.
NB_Predictions=predict(Naive_Bayes_Model,cars_test)

table(NB_Predictions,cars_test$Transport)

NB_Predictions     2Wheeler Car Public Transport
  2Wheeler                6   0                6
  Car                     2  14                3
  Public Transport       16   4               81

[1] Public Transport Public Transport Public Transport Public Transport
Levels: 2Wheeler Car Public Transport
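For a closer look at individual predictions, e1071's predict() can also return the posterior class probabilities (a
short sketch):

NB_probs <- predict(Naive_Bayes_Model, cars_test, type = "raw")
head(round(NB_probs, 3))   ## per-class posterior probabilities for the first test rows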

Inference and Prediction Using Linear Discriminant Analysis

Though the problem statement predominantly asks about the factors influencing car usage, the given dataset is also a
good case for a linear discriminant analysis to understand the factors driving the choice of transportation mode.

##Split the original base data into test and train samples again
carsbasedatalda<- read.csv("C:\\Users\\MEGHA\\Desktop\\Project Cars\\cars.csv",header=TRUE)
carsbasedatalda$Gender<-as.factor(carsbasedatalda$Gender)
carsbasedatalda$Engineer<-as.factor(carsbasedatalda$Engineer)
carsbasedatalda$MBA<-as.factor(carsbasedatalda$MBA)
carsbasedatalda<-knnImputation(carsbasedatalda)
set.seed(400)
carindexlda<-createDataPartition(carsbasedatalda$Transport, p=0.7,list = FALSE,times = 1)
carstrainlda<-carsbasedatalda[carindexlda,]
carstestlda<-carsbasedatalda[-carindexlda,]
carstrainlda$license<-as.factor(carstrainlda$license)
carstestlda$license<-as.factor(carstestlda$license)
cartrainlda.car<-carstrainlda[carstrainlda$Transport %in% c("Car", "Public Transport"),]
cartrainlda.twlr<-carstrainlda[carstrainlda$Transport %in% c("2Wheeler", "Public Transport"),]
cartrainlda.car$Transport<-as.character(cartrainlda.car$Transport)
cartrainlda.car$Transport<-as.factor(cartrainlda.car$Transport)
cartrainlda.twlr$Transport<-as.character(cartrainlda.twlr$Transport)
cartrainlda.twlr$Transport<-as.factor(cartrainlda.twlr$Transport)
prop.table(table(cartrainlda.car$Transport))
##
## Car Public Transport
## 0.1699605 0.8300395

prop.table(table(cartrainlda.twlr$Transport))
##
## 2Wheeler Public Transport
## 0.2193309 0.7806691

carldatwlrsm <- SMOTE(Transport~., data = cartrainlda.twlr, perc.over = 150, perc.under=200)


table(carldatwlrsm$Transport)
##
## 2Wheeler Public Transport
## 118 118

carldacarsm <- SMOTE(Transport~., data = cartrainlda.car, perc.over = 175, perc.under=200)


table(carldacarsm$Transport)
##
## Car Public Transport
## 86 86

carldacar<-carldacarsm[carldacarsm$Transport %in% c("Car"),]


carsdatatrainldasm<-rbind(carldatwlrsm,carldacar)
str(carsdatatrainldasm)

## 'data.frame': 322 obs. of 9 variables:


## $ Age : num 24 27 24 29 26 23 27 25 26 26 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 1 2 1 1 ...
## $ Engineer : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 1 2 2 ...
## $ MBA : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 2 ...
## $ Work.Exp : num 4 5 2 6 4 2 3 3 6 4 ...
## $ Salary : num 10.9 13.8 8.5 14.6 12.7 8.5 10.7 9.9 17.8 12.8 ...
## $ Distance : num 9 15.2 8.5 10.2 9 6.1 11.7 15.9 10.4 10.8 ...
## $ license : Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 1 1 ...
## $ Transport: Factor w/ 3 levels "2Wheeler","Public Transport",..: 2 2 2 2 2 2 2 2 2 2 ...

table(carsdatatrainldasm$Transport)
##
## 2Wheeler Public Transport Car
## 118 118 86

## Build the model

attach(carsdatatrainldasm)

## The following objects are masked from carsdatatrain:


## Age, Distance, Engineer, Gender, license, MBA, Salary,Work.Exp

trainctrllda<-trainControl(method = 'cv',number = 10)


carslda<-train(Transport~Age+Work.Exp+Salary+Distance+license+Gender+Engineer+MBA,
               data = carsdatatrainldasm, method="lda", trControl=trainctrllda)
carslda$finalModel

## Call:
## lda(x, grouping = y)
##
## Prior probabilities of groups:
## 2Wheeler Public Transport Car
## 0.3664596 0.3664596 0.2670807
##
## Group means:
## Age Work.Exp Salary Distance license1 GenderMale
## 2Wheeler 25.25983 3.726144 12.36309 12.23362 0.2627119 0.5000000
## Public Transport 26.55085 4.983051 13.51864 10.68729 0.1440678 0.6949153
## Car 35.75833 15.854594 36.15677 15.59036 0.7093023 0.7093023
## Engineer1 MBA1
## 2Wheeler 0.6949153 0.2542373
## Public Transport 0.6864407 0.2796610
## Car 0.8488372 0.2558140
##
## Coefficients of linear discriminants:
## LD1 LD2
## Age 0.210574627 -0.1462080
## Work.Exp 0.081188593 -0.0836168
## Salary 0.007534406 0.0476583
## Distance 0.074968787 0.1906158
## license1 0.416869466 1.2676211
## GenderMale -0.186709183 -1.0865173
## Engineer1 0.150855710 0.1757128
## MBA1 -0.072490532 -0.1201606
##
## Proportion of trace:
## LD1 LD2
## 0.9477 0.0523

plot(varImp(object = carslda),main="Variable Importance for Linear Discriminant Analysis" )

As per the output, the first discriminant function achieves about 95% of the between-group separation (proportion of
trace 0.9477). Age, Work Experience and Salary play an important part in the choice of transport, followed by Distance
and license.
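The proportion-of-trace figures can be recovered from the singular values stored on the MASS lda fit (a quick check):

svd_vals <- carslda$finalModel$svd
round(svd_vals^2 / sum(svd_vals^2), 4)   ## 0.9477 0.0523, as printed above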

carsldapredict<-predict.train(object = carslda,newdata = carstestlda)


confusionMatrix(carsldapredict,carstestlda[,9])
## Warning in confusionMatrix.default(carsldapredict, carstestlda[, 9]):
## Levels are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 12
## Car 1 14 1
## Public Transport 8 4 77
##
## Overall Statistics
##
## Accuracy : 0.803
## 95% CI : (0.7249, 0.8671)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.001323
##
## Kappa : 0.5952
## Mcnemar's Test P-Value : 0.308022
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.7778 0.8556
## Specificity 0.8889 0.9825 0.7143
## Pos Pred Value 0.5556 0.8750 0.8652
## Neg Pred Value 0.9143 0.9655 0.6977
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1061 0.5833
## Detection Prevalence 0.2045 0.1212 0.6742
## Balanced Accuracy 0.7569 0.8801 0.7849

The overall accuracy is 80.3%, with car usage predicted at 77.8% sensitivity. Let us understand the transportation
choice of the two unknown cases

predictTransportuk<-predict.train(object = carslda,newdata = carunknown)


carunknown$predictTransport<-predictTransportuk
print(carunknown)

## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse


## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport
## 1 0 Public Transport
## 2 0 Public Transport

As per the prediction by LDA, both the male and female employees will take public transport as their means of transportation

Improve LDA Model by regularization

Run penalised LDA for the model above


trainctrlpda<-trainControl(method = 'cv',number = 10, returnResamp = 'all')
carspda<-train(Transport~Age+Work.Exp+Salary+Distance+license+Gender+Engineer+MBA,
               data = carsdatatrainldasm, method="pda", trControl=trainctrlpda)
carspda$finalModel

## Call:
## mda::fda(formula = as.formula(".outcome ~ ."), data = dat, method = mda::gen.ridge, lambda = param$lambda)
##
## Dimension: 2
##
## Percent Between-Group Variance Explained:
## v1 v2
## 94.77 100.00
## Degrees of Freedom (per dimension): 7.992829
## Training Misclassification Error: 0.28882 ( N = 322 )

## Penalized Discriminant Analysis


##
## 322 samples
## 8 predictor
## 3 classes: '2Wheeler', 'Public Transport', 'Car'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 289, 290, 290, 290, 290, 290, ...
## Resampling results across tuning parameters:
##
## lambda Accuracy Kappa
## 0e+00 0.6977548 0.5410021
## 1e-04 0.6977548 0.5410021
## 1e-01 0.7007851 0.5457593
##

## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was lambda = 0.1.

plot(varImp(object = carspda), main="Variable Importance for Penalized Discriminant Analysis")


Variable importance under penalised discriminant analysis is the same as LDA.

carspdapredict<-predict.train(object = carspda,newdata = carstestlda)


confusionMatrix(carspdapredict,carstestlda[,9])

## Warning in confusionMatrix.default(carspdapredict, carstestlda[, 9]):
## Levels are not in the same order for reference and data. Refactoring data to match.

## Confusion Matrix and Statistics


##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 12
## Car 1 14 1
## Public Transport 8 4 77
##
## Overall Statistics
##
## Accuracy : 0.803
## 95% CI : (0.7249, 0.8671)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.001323
##
## Kappa : 0.5952
## Mcnemar's Test P-Value : 0.308022
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.7778 0.8556
## Specificity 0.8889 0.9825 0.7143
## Pos Pred Value 0.5556 0.8750 0.8652
## Neg Pred Value 0.9143 0.9655 0.6977
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1061 0.5833
## Detection Prevalence 0.2045 0.1212 0.6742
## Balanced Accuracy 0.7569 0.8801 0.7849

The result is very similar to LDA.

Let us predict the 2 unknown cases

predictTransportuk1<-predict.train(object = carspda,newdata = carunknown)


carunknown$predictTransportpda<-predictTransportuk1
print(carunknown)

## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse


## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0

## predictcarusage predictTransport predictTransportpda


## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport

As per the prediction by PDA, both employees are predicted to take public transport as their means of transportation,
the same as with LDA.

Prediction Using CART

We use the decision-tree method CART to infer and understand from the dataset.

carscart<-train(Transport~.,carsdatatrainldasm,method = 'rpart',
                trControl = trainControl(method = 'cv',number = 5,savePredictions = 'final'))
carscart$finalModel
## n= 322
##
## root 322 204 2Wheeler (0.366459627 0.366459627 0.267080745)
##   Age< 30.02888 228 111 2Wheeler (0.513157895 0.482456140 0.004385965)
##     Distance>=10.90546 127 44 2Wheeler (0.653543307 0.338582677 0.007874016) *
##     Distance< 10.90546 101 34 Public Transport (0.336633663 0.663366337 0.000000000) *
##   Age>=30.02888 94 9 Car (0.010638298 0.085106383 0.904255319) *
library(rattle)
fancyRpartPlot(carscart$finalModel)
Age and Distance are the major variables on which CART has split; the pruning behind this tree can be inspected as
shown below. Let us then check the accuracy of the model on the test data.
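A quick sketch of that inspection, via the rpart complexity-parameter table stored on the final model:

carscart$finalModel$cptable   ## CP, nsplit and cross-validated error for each candidate subtree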

predictions_CART<-predict(carscart,carstestlda)
confusionMatrix(predictions_CART,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_CART, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 31
## Car 1 17 5
## Public Transport 8 1 54
##
## Overall Statistics
##
## Accuracy : 0.6515
## 95% CI : (0.5637, 0.7323)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.8007180
##
## Kappa : 0.4068
## Mcnemar's Test P-Value : 0.0006336
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.9444 0.6000
## Specificity 0.7130 0.9474 0.7857
## Pos Pred Value 0.3261 0.7391 0.8571
## Neg Pred Value 0.8953 0.9908 0.4783
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1288 0.4091
## Detection Prevalence 0.3485 0.1742 0.4773
## Balanced Accuracy 0.6690 0.9459 0.6929

We get an overall accuracy of 65%, with car usage predicted at 94% sensitivity. Let us predict for the unknown
cases

predictTransportuk2<-predict.train(object = carscart,newdata = carunknown)


carunknown$predictTransportcart<-predictTransportuk2
print(carunknown)

## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse


## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusage predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart
## 1 Public Transport
## 2 Public Transport

The CART model also predicts the transport mode as Public Transport for the two cases

Prediction Using Boosting

boostcontrol <- trainControl(number=10)

xgbGrid <- expand.grid(
  eta = 0.3,
  max_depth = 1,
  nrounds = 50,
  gamma = 0,
  colsample_bytree = 0.6,
  min_child_weight = 1,
  subsample = 1)

carsxgb <- train(Transport ~ .,carsdatatrainldasm,trControl = boostcontrol,tuneGrid = xgbGrid,
                 metric = "Accuracy",method = "xgbTree")

carsxgb$finalModel
## xgb.Booster
## raw: 38.5 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, num_class = length(lev),
## objective = "multi:softprob")
## params (as set within xgb.train):
## eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.6", min_child_weight = "1", subsample = "1",
num_class = "3", objective = "multi:softprob", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## niter: 50
## xNames: Age GenderMale Engineer1 MBA1 Work.Exp Salary Distance license1
## problemType: Classification
## tuneValue:
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 1 50 1 0.3 0 0.6 1 1
## obsLevels: 2Wheeler, Public Transport, Car
## param:
## list()
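As a complement to caret's varImp, xgboost's own gain-based importance can be computed from the final booster (a
sketch using the xgboost package directly):

library(xgboost)
xgb.importance(model = carsxgb$finalModel)   ## Gain, Cover and Frequency per feature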

Let us predict using the test dataset

predictions_xgb<-predict(carsxgb,carstestlda)
confusionMatrix(predictions_xgb,carstestlda$Transport)
## Warning in confusionMatrix.default(predictions_xgb, carstestlda$Transport):
## Levels are not in the same order for reference and data. Refactoring data to match
## Confusion Matrix and Statistics
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 15 0 18
## Car 1 17 1
## Public Transport 8 1 71
##
## Overall Statistics
##
## Accuracy : 0.7803
## 95% CI : (0.7, 0.8477)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.00823
##
## Kappa : 0.5789
## Mcnemar's Test P-Value : 0.18342
##
## Statistics by Class:
##
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6250 0.9444 0.7889
## Specificity 0.8333 0.9825 0.7857
## Pos Pred Value 0.4545 0.8947 0.8875
## Neg Pred Value 0.9091 0.9912 0.6346
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1136 0.1288 0.5379
## Detection Prevalence 0.2500 0.1439 0.6061
## Balanced Accuracy 0.7292 0.9635 0.7873

The overall accuracy is 78%, with car usage predicted at 94.4% sensitivity.

Let us try and predict for the two unknown cases

predictTransportuk3<-predict.train(object = carsxgb,newdata = carunknown)


carunknown$predictTransportxgb<-predictTransportuk3
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb
## 1 Public Transport 2Wheeler
## 2 Public Transport 2Wheeler

Interestingly, the XGBoost model has predicted both cases, including the one with a male employee holding a license
and commuting 5 km, to be using a 2-wheeler as the mode of transport.

Prediction Using Multinomial Logistic Regression

We will now use multinomial logistic regression to understand the factors driving transport mode choice. Here, the
data is re-levelled with respect to one of the three classes and two binary logistic regression models are fitted. In
our case, the 2-Wheeler class is taken as the reference class.

carsmlr<-train(Transport ~.,carsdatatrainldasm,method = "multinom")


## # weights: 30 (18 variable)
## initial value 353.753157
## iter 10 value 237.192746
## iter 20 value 184.556974
## iter 30 value 169.855050
## iter 40 value 169.455498
## iter 50 value 169.455170
## final value 169.455131
## converged
## ... (the same optimizer trace repeats for every bootstrap resample and tuning value; output truncated) ...
## # weights: 30 (18 variable)
## initial value 353.753157
## iter 10 value 218.032720
## iter 20 value 173.350700
## iter 30 value 163.770891
## iter 40 value 163.750950
## final value 163.750950
## converged

carsmlr$finalModel
## Call:
## nnet::multinom(formula = .outcome ~ ., data = dat, decay = param$decay)

## Coefficients:
## (Intercept) Age GenderMale Engineer1 MBA1
## Public Transport -2.924602 0.1867473 0.9073215 -0.1392119 0.1814413
## Car -73.157600 2.4072670 -0.8481166 1.4958898 -0.7794798
## Work.Exp Salary Distance license1
## Public Transport 0.1027101 -0.07091949 -0.1481504 -1.154896
## Car -1.0684135 0.17301043 0.3334671 1.378113
##
## Residual Deviance: 327.5019
## AIC: 363.5019

carmlrcoeff<-exp(coef(carsmlr$finalModel))
write.csv(carmlrcoeff,file = "Coeffsmlr.csv")

plot(varImp(object=carsmlr), main = "Variable Importance for Multinomial Logit")

The model has a residual deviance of 327.5. It implies that an increase in age by one year multiplies the odds of
taking Public Transport rather than a 2-Wheeler by about 1.2, whereas it multiplies the odds of choosing Car rather
than a 2-Wheeler by about 11. Age and license are the two most important factors in deciding the mode of transport.
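A worked conversion of those Age coefficients to odds ratios (2-Wheeler is the base class):

exp(0.1867473)   ## ~1.21: odds multiplier for Public Transport vs 2Wheeler per extra year of age
exp(2.4072670)   ## ~11.1: odds multiplier for Car vs 2Wheeler per extra year of age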
predictions_mlr<-predict(carsmlr,carstestlda)
confusionMatrix(predictions_mlr,carstestlda$Transport)
## Warning: Levels are not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 16 0 8
## Car 1 18 2
## Public Transport 7 0 80
##
## Overall Statistics
## Accuracy : 0.8636
## 95% CI : (0.7931, 0.9171)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 1.242e-06
## Kappa : 0.725
## Mcnemar's Test P-Value : 0.3815
## Statistics by Class:
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.6667 1.0000 0.8889
## Specificity 0.9259 0.9737 0.8333
## Pos Pred Value 0.6667 0.8571 0.9195
## Neg Pred Value 0.9259 1.0000 0.7778
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1212 0.1364 0.6061
## Detection Prevalence 0.1818 0.1591 0.6591
## Balanced Accuracy 0.7963 0.9868 0.8611

The overall accuracy is 86.4%, with car usage predicted at 100% sensitivity on this test split. Let us try and predict
for the two unknown cases

predictTransportuk4<-predict.train(object = carsmlr,newdata = carunknown)


carunknown$predictTransportmlr<-predictTransportuk4
print(carunknown)

## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse


## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb predictTransportmlr
## 1 Public Transport 2Wheeler Public Transport
## 2 Public Transport 2Wheeler Public Transport

The model has predicted both cases to be using public transport.


Prediction Using Random Forest

rftrcontrol <- trainControl(method="repeatedcv", number=10, repeats=3)


mtry<-sqrt(ncol(carsdatatrainldasm))
tunegridrf <- expand.grid(.mtry=mtry)
carsrf<-train(Transport ~.,carsdatatrainldasm,method = "rf", trControl=rftrcontrol, tuneGrid = tunegridrf)
carsrf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 16.77%
## Confusion matrix:
## 2Wheeler Public Transport Car class.error
## 2Wheeler 85 32 1 0.27966102
## Public Transport 14 100 4 0.15254237
## Car 2 1 83 0.03488372
plot(varImp(object=carsrf), main = "Variable Importance for Random Forest")

The out-of-bag error estimate is 16.77% on the training dataset. Age, Work Experience, Distance and Salary are the
most significant variables. Let us try and predict for the test data.
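As an aside, instead of fixing mtry at sqrt(p), randomForest's tuneRF() can search over mtry values (a sketch; column
9 of the training frame is Transport):

set.seed(400)
randomForest::tuneRF(x = carsdatatrainldasm[,-9], y = carsdatatrainldasm$Transport,
                     ntreeTry = 500, stepFactor = 1.5, improve = 0.01)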

BAGGING

library(ipred)   ## bagging() here is assumed to come from the ipred package

Cars.bagging <- bagging(formula = Transport~., data=cars_train,
                        control=rpart.control(maxdepth=5, minsplit=15))
cars_test$pred.Transport <- predict(Cars.bagging, cars_test)

table(cars_test$Transport,cars_test$pred.Transport)

                   2Wheeler Car Public Transport
  2Wheeler                8   1               15
  Car                     1  16                1
  Public Transport        7   3               80
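The overall bagging accuracy can be read off the diagonal of this cross-tab:

sum(diag(table(cars_test$Transport, cars_test$pred.Transport))) / nrow(cars_test)
## (8 + 16 + 80) / 132 = ~0.79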

predictions_rf<-predict(carsrf,carstestlda)
confusionMatrix(predictions_rf,carstestlda$Transport)
## Warning: Levels are not in the same order for reference and data. Refactoring data to match.

## Confusion Matrix and Statistics


##
## Reference
## Prediction 2Wheeler Car Public Transport
## 2Wheeler 18 0 29
## Car 1 17 1
## Public Transport 5 1 60
## Overall Statistics
## Accuracy : 0.7197
## 95% CI : (0.6349, 0.7943)
## No Information Rate : 0.6818
## P-Value [Acc > NIR] : 0.2011606
## Kappa : 0.5123
## Mcnemar's Test P-Value : 0.0004523
## Statistics by Class:
## Class: 2Wheeler Class: Car Class: Public Transport
## Sensitivity 0.7500 0.9444 0.6667
## Specificity 0.7315 0.9825 0.8571
## Pos Pred Value 0.3830 0.8947 0.9091
## Neg Pred Value 0.9294 0.9912 0.5455
## Prevalence 0.1818 0.1364 0.6818
## Detection Rate 0.1364 0.1288 0.4545
## Detection Prevalence 0.3561 0.1439 0.5000
## Balanced Accuracy 0.7407 0.9635 0.7619

We have an overall accuracy of 72% and 94.4% sensitivity for the prediction of car usage. Let us now predict the
choice of transport for the two unknown cases

predictTransportuk5<-predict.train(object = carsrf,newdata = carunknown)


carunknown$predictTransportrf<-predictTransportuk5
print(carunknown)
## Age Gender Engineer MBA Work.Exp Salary Distance license predictcaruse
## 1 25 Male 0 0 2 10 5 1 0
## 2 25 Female 1 0 2 10 5 0 0
## predictcarusegn predictTransport predictTransportpda
## 1 0 Public Transport Public Transport
## 2 0 Public Transport Public Transport
## predictTransportcart predictTransportxgb predictTransportmlr
## 1 Public Transport 2Wheeler Public Transport
## 2 Public Transport 2Wheeler Public Transport
## predictTransportrf
## 1 Public Transport
## 2 2Wheeler

We have one record (female, engineer) predicted to choose a 2-wheeler and the other record predicted to choose Public
Transport.

Key observations for the critical variables selected after regularisation:

• Important variables are Age, Work.Exp, Distance and license.
• Age and Work.Exp are highly correlated.
• Employees with 10 or more years of work experience are likely to use a car.
• On license: 81% of those who commute by car hold a license, while 89% of those who commute by public transport do
not. But surprisingly, 74% of 2-wheeler commuters do not hold a license either.
• Box plots show that car users have a much higher median age than non-car commuters, so higher age appears to be a
driving factor in transport mode selection.
• Box plots show that car users have much higher median work experience than non-car commuters, so higher work
experience also appears to be a driving factor. This follows from age, as employees with more experience are older.
• Box plots show that users travelling longer distances prefer the car as the mode of transport.
• Higher salary appears to be a driving factor for choosing the car, although there are quite a few high-salary
outliers among non-car commuters as well.
• The number of men is much higher than the number of women for both car and non-car modes; this may be due to gender
imbalance in the workforce.
