kNN Using caret R package

Vijayakumar Jawaharlal
April 29, 2014
I recently became familiar with the caret package. caret is an R package that provides a general interface to nearly
150 ML algorithms. It also provides functions for sampling the data (for training and testing), preprocessing,
evaluating the model, and so on.
To get familiar with the caret package, please check the following URLs: http://cran.r-
I am going to use the same dataset as in the previous examples. The intention of this exercise is to get familiar with caret.



library(ISLR)  # assumed source of the Smarket data set used below
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2


#Splitting the data into training and test sets using createDataPartition() from caret

indxTrain <- createDataPartition(y = Smarket$Direction,p = 0.75,list = FALSE)

training <- Smarket[indxTrain,]

testing <- Smarket[-indxTrain,]

#Checking the class distribution in the original data and in the partitioned data

prop.table(table(training$Direction)) * 100


## Down Up

## 48.19 51.81

prop.table(table(testing$Direction)) * 100


## Down Up

## 48.08 51.92

prop.table(table(Smarket$Direction)) * 100

## Down Up

## 48.16 51.84

The createDataPartition function creates the samples effortlessly. We don’t need to write a complex function as in the previous examples.
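To make the idea behind the split concrete, here is a minimal base-R sketch of stratified sampling, which is my understanding of what createDataPartition() does for a factor outcome. The helper `stratified_index` and the synthetic outcome `y` are hypothetical names used only for illustration; this is not caret's implementation.

```r
# Synthetic two-class outcome; in the post this role is played by Smarket$Direction.
set.seed(1)
y <- factor(sample(c("Down", "Up"), size = 1000, replace = TRUE, prob = c(0.48, 0.52)))

# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train <- stratified_index(y, p = 0.75)
train_y  <- y[in_train]
test_y   <- y[-in_train]

# The class percentages should be close in all three sets.
round(prop.table(table(y)) * 100, 2)
round(prop.table(table(train_y)) * 100, 2)
round(prop.table(table(test_y)) * 100, 2)
```

Because the sampling happens inside each class, the Down/Up balance of the full data carries over to both partitions, which is the pattern visible in the percentages printed in this section.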

kNN requires variables to be normalized or scaled. caret provides facilities to preprocess data. I am going to choose
centring and scaling.

trainX <- training[,names(training) != "Direction"]

preProcValues <- preProcess(x = trainX,method = c("center", "scale"))



preProcValues

## Call:

## preProcess.default(x = trainX, method = c("center", "scale"))


## Created from 938 samples and 8 variables

## Pre-processing: centered, scaled
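As a rough illustration of what centring and scaling amount to, the sketch below learns the means and standard deviations from the training rows only and then applies those same constants to new rows. The data frames `train_x` and `test_x` are synthetic stand-ins, not the Smarket data. In caret, `predict(preProcValues, newdata)` plays the role of the second step.

```r
set.seed(2)
train_x <- data.frame(a = rnorm(50, mean = 10, sd = 3), b = runif(50, 0, 100))
test_x  <- data.frame(a = rnorm(20, mean = 10, sd = 3), b = runif(20, 0, 100))

# Learn the centring and scaling constants from the training rows only.
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same constants to both sets.
train_scaled <- scale(train_x, center = mu, scale = sigma)
test_scaled  <- scale(test_x,  center = mu, scale = sigma)

round(colMeans(train_scaled), 10)      # effectively 0 for every column
round(apply(train_scaled, 2, sd), 10)  # effectively 1 for every column
round(colMeans(test_scaled), 2)        # near 0, but not exactly 0
```

Reusing the training constants on the test rows keeps information from the test set out of the preprocessing step, which matters for distance-based methods such as kNN.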

Training and train control


ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary


knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c(
"center","scale"), tuneLength = 20)

#Output of kNN fit

knnFit

## k-Nearest Neighbors


## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'


## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)


## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...


## Resampling results across tuning parameters:


##   k   Accuracy  Kappa  Accuracy SD  Kappa SD
##   5   0.9       0.7    0.04         0.07
##   7   0.9       0.8    0.04         0.08
##   9   0.9       0.8    0.04         0.08
##   10  0.9       0.8    0.03         0.07
##   10  0.9       0.8    0.03         0.07
##   20  0.9       0.8    0.03         0.06
##   20  0.9       0.8    0.03         0.07
##   20  0.9       0.8    0.04         0.08
##   20  0.9       0.8    0.03         0.07
##   20  0.9       0.8    0.03         0.07
##   20  0.9       0.8    0.03         0.06
##   30  0.9       0.8    0.03         0.06
##   30  0.9       0.8    0.03         0.07
##   30  0.9       0.8    0.03         0.07
##   30  0.9       0.8    0.03         0.07
##   40  0.9       0.8    0.03         0.07
##   40  0.9       0.8    0.03         0.06
##   40  0.9       0.8    0.03         0.06
##   40  0.9       0.8    0.03         0.06
##   40  0.9       0.8    0.03         0.05


## Accuracy was used to select the optimal model using the largest value.

## The final value used for the model was k = 23.
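The resampling summary above can be read as the result of a loop like the following. This is a minimal sketch of repeated 10-fold cross-validation for kNN on synthetic data, using `knn()` from the `class` package. The helper `cv_accuracy` is hypothetical, the folds here are plain random folds rather than caret's stratified ones, and the grid of 20 odd values from 5 to 43 is an assumption about what `tuneLength = 20` produces.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(3)
n <- 200
x <- matrix(rnorm(n * 2), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "Up", "Down"))

# Hypothetical helper: mean accuracy of k-NN over `repeats` rounds of `folds`-fold CV.
cv_accuracy <- function(x, y, k, folds = 10, repeats = 3) {
  acc <- numeric(0)
  for (r in seq_len(repeats)) {
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
    for (f in seq_len(folds)) {
      held_out <- fold_id == f
      # Centre and scale using the training folds only.
      mu    <- colMeans(x[!held_out, , drop = FALSE])
      sigma <- apply(x[!held_out, , drop = FALSE], 2, sd)
      tr <- scale(x[!held_out, , drop = FALSE], center = mu, scale = sigma)
      te <- scale(x[held_out, , drop = FALSE],  center = mu, scale = sigma)
      pred <- knn(train = tr, test = te, cl = y[!held_out], k = k)
      acc  <- c(acc, mean(pred == y[held_out]))
    }
  }
  mean(acc)
}

k_grid  <- seq(5, 43, by = 2)  # 20 odd values of k
results <- sapply(k_grid, function(k) cv_accuracy(x, y, k))
best_k  <- k_grid[which.max(results)]

data.frame(k = k_grid, Accuracy = round(results, 3))
best_k
```

Each candidate k gets 30 held-out accuracy estimates (10 folds times 3 repeats), and the k with the largest mean is kept, which matches the selection rule stated in the output.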

#Plotting yields number of neighbours vs accuracy (based on repeated cross-validation)

plot(knnFit)

knnPredict <- predict(knnFit,newdata = testing )

#Get the confusion matrix to see the accuracy value and other statistics

confusionMatrix(knnPredict, testing$Direction)

## Confusion Matrix and Statistics


## Reference

## Prediction Down Up

## Down 123 8

## Up 27 154


## Accuracy : 0.888

## 95% CI : (0.847, 0.921)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : < 2e-16


## Kappa : 0.774

## Mcnemar's Test P-Value : 0.00235


## Sensitivity : 0.820

## Specificity : 0.951

## Pos Pred Value : 0.939

## Neg Pred Value : 0.851

## Prevalence : 0.481

## Detection Rate : 0.394

## Detection Prevalence : 0.420

## Balanced Accuracy : 0.885


## 'Positive' Class : Down


mean(knnPredict == testing$Direction)

## [1] 0.8878
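The headline statistics can be recomputed by hand from the four counts in the confusion matrix above, which may help when reading the report. This is a small base-R sketch; the variable names are my own, and "Down" is taken as the positive class, as the output states.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference).
cm <- matrix(c(123, 27, 8, 154), nrow = 2,
             dimnames = list(Prediction = c("Down", "Up"),
                             Reference  = c("Down", "Up")))

n  <- sum(cm)
tp <- cm["Down", "Down"]  # "Down" is the positive class
fn <- cm["Up",   "Down"]
fp <- cm["Down", "Up"]
tn <- cm["Up",   "Up"]

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv         <- tp / (tp + fp)
npv         <- tn / (tn + fn)

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - p_chance) / (1 - p_chance)

round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity,
        PPV = ppv, NPV = npv, Kappa = kappa_hat), 3)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value and Kappa lines reported by confusionMatrix().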

#Now verifying the two-class summary function

ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)


knnFit <- train(Direction ~ ., data = training, method = "knn", trControl = ctrl, preProcess = c(
"center","scale"), tuneLength = 20)

## Loading required package: pROC

## Type 'citation("pROC")' for a citation.


## Attaching package: 'pROC'


## The following objects are masked from 'package:stats':


## cov, smooth, var

## Warning: The metric "Accuracy" was not in the result set. ROC will be used instead.

#Output of kNN fit

knnFit


## k-Nearest Neighbors


## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'


## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)


## Summary of sample sizes: 844, 844, 845, 843, 844, 844, ...


## Resampling results across tuning parameters:


##   k   ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   5   0.9  0.8   0.9   0.03    0.07     0.05
##   7   1    0.8   0.9   0.02    0.06     0.05
##   9   1    0.8   0.9   0.02    0.07     0.05
##   10  1    0.8   0.9   0.02    0.07     0.05
##   10  1    0.8   0.9   0.02    0.07     0.05
##   20  1    0.9   0.9   0.02    0.06     0.04
##   20  1    0.9   0.9   0.02    0.06     0.04
##   20  1    0.9   0.9   0.02    0.07     0.03
##   20  1    0.9   0.9   0.01    0.06     0.03
##   20  1    0.9   0.9   0.01    0.06     0.03
##   20  1    0.9   0.9   0.01    0.07     0.03
##   30  1    0.9   0.9   0.01    0.06     0.03
##   30  1    0.8   0.9   0.01    0.07     0.03
##   30  1    0.8   0.9   0.01    0.06     0.03
##   30  1    0.8   0.9   0.01    0.07     0.03
##   40  1    0.8   0.9   0.01    0.06     0.03
##   40  1    0.8   0.9   0.01    0.06     0.03
##   40  1    0.8   0.9   0.01    0.06     0.02
##   40  1    0.8   0.9   0.01    0.06     0.02
##   40  1    0.8   0.9   0.01    0.06     0.03


## ROC was used to select the optimal model using the largest value.

## The final value used for the model was k = 43.
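The ROC, Sens and Spec columns can be understood from first principles. The sketch below computes the area under the ROC curve with the rank (Mann-Whitney) formula and then sensitivity and specificity at a 0.5 cut-off, on six made-up probabilities. The helper `auc_rank` and the toy vectors are hypothetical; this is not how caret or pROC compute these values internally.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen case
# scores higher than a randomly chosen control (ties count one half).
auc_rank <- function(score, is_case) {
  r  <- rank(score)  # average ranks handle ties
  n1 <- sum(is_case)
  n0 <- sum(!is_case)
  (sum(r[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy scores: estimated probability of the "Down" class for six rows.
p_down  <- c(0.9, 0.8, 0.35, 0.6, 0.3, 0.1)
truth   <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"))
is_down <- truth == "Down"

auc_rank(p_down, is_down)  # 8 of the 9 case/control pairs are ordered correctly

# Sensitivity and specificity at a 0.5 cut-off, with "Down" as the positive class.
pred_down <- p_down >= 0.5
c(Sens = mean(pred_down[is_down]), Spec = mean(!pred_down[!is_down]))
```

The AUC depends only on how the scores rank the rows, while Sens and Spec depend on the chosen cut-off, which is why a model can have an AUC near 1 and still show sensitivity around 0.8.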

#Plotting yields number of neighbours vs ROC (based on repeated cross-validation)

knnPredict <- predict(knnFit,newdata = testing )

#Get the confusion matrix to see the accuracy value and other statistics

confusionMatrix(knnPredict, testing$Direction)

## Confusion Matrix and Statistics


## Reference

## Prediction Down Up

## Down 123 9

## Up 27 153


## Accuracy : 0.885

## 95% CI : (0.844, 0.918)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : < 2e-16


## Kappa : 0.768

## Mcnemar's Test P-Value : 0.00461


## Sensitivity : 0.820

## Specificity : 0.944

## Pos Pred Value : 0.932

## Neg Pred Value : 0.850

## Prevalence : 0.481

## Detection Rate : 0.394

## Detection Prevalence : 0.423

## Balanced Accuracy : 0.882


## 'Positive' Class : Down


mean(knnPredict == testing$Direction)

## [1] 0.8846

Plotting the ROC curve to check specificity and sensitivity


knnPredict <- predict(knnFit,newdata = testing , type="prob")

# roc() expects levels to hold the two class labels as c(control, case).
# rev(levels(testing$Direction)) is c("Up", "Down"), so "Down" is treated as the case class.
knnROC <- roc(testing$Direction,knnPredict[,"Down"], levels = rev(levels(testing$Direction)))

# The output recorded below is from the call as first written, with levels = rev(testing$Direction).
# It reports 162 controls and 162 cases although the test set has 150 Down and 162 Up rows,
# which suggests both groups were the Up rows; the AUC of 0.5 therefore does not measure the model.

## Call:

## roc.default(response = testing$Direction, predictor = knnPredict[, "Down"], levels = rev(te


## Data: knnPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction

## Area under the curve: 0.5
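The `levels` argument of roc() is meant to hold the two class labels, control first and case second. A quick base-R check of what `rev(levels(x))` and `rev(x)` return shows why the two are not interchangeable. The factor `direction` below is a made-up five-row example.

```r
direction <- factor(c("Up", "Down", "Up", "Up", "Down"))

levels(direction)       # the two class labels, in factor order
rev(levels(direction))  # the same two labels, reversed
length(rev(direction))  # reversing the factor itself returns every observation

# A (control, case) pair suitable for a levels-style argument.
control_case <- rev(levels(direction))
control_case
```

Only the first form yields a length-two vector of distinct labels; the second yields one entry per row, so whatever happens to sit in its first two positions would be read as the control and case labels.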

Applying Random Forest to see the performance


ctrl <- trainControl(method="repeatedcv",repeats = 3) #,classProbs=TRUE,summaryFunction = twoClassSummary


# Random forest

rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)

## Loading required package: randomForest

## randomForest 4.6-7

## Type rfNews() to see new features/changes/bug fixes.

## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .


## Random Forest

## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'


## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)


## Summary of sample sizes: 844, 844, 844, 845, 844, 845, ...


## Resampling results across tuning parameters:


##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         1      0.004        0.009
##   3     1         1      0.004        0.009
##   4     1         1      0.004        0.009
##   5     1         1      0.004        0.009
##   6     1         1      0.004        0.009
##   7     1         1      0.004        0.009
##   8     1         1      0.004        0.009


## Accuracy was used to select the optimal model using the largest value.

## The final value used for the model was mtry = 2.
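The note about "only 7 unique complexity parameters" follows from the number of predictors. A rough sketch, assuming caret builds the mtry grid as evenly spaced values between 2 and the number of predictors, rounded down (my reading of its behaviour, not a quote of its code):

```r
p   <- 8   # number of predictors reported in the output above
len <- 20  # tuneLength passed to train()

# Assumed grid rule: len evenly spaced values between 2 and p, rounded down.
candidates <- floor(seq(2, p, length.out = len))
mtry_grid  <- unique(candidates)

mtry_grid
length(mtry_grid)
```

With 8 predictors there are only 7 distinct whole numbers from 2 to 8, so asking for 20 candidates cannot produce more than the 7 rows shown in the resampling table.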

rfPredict <- predict(rfFit,newdata = testing )

confusionMatrix(rfPredict, testing$Direction)

## Confusion Matrix and Statistics


## Reference

## Prediction Down Up

## Down 150 0

## Up 0 162


## Accuracy : 1

## 95% CI : (0.988, 1)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : <2e-16


## Kappa : 1

## Mcnemar's Test P-Value : NA


## Sensitivity : 1.000

## Specificity : 1.000

## Pos Pred Value : 1.000

## Neg Pred Value : 1.000

## Prevalence : 0.481

## Detection Rate : 0.481

## Detection Prevalence : 0.481

## Balanced Accuracy : 1.000


## 'Positive' Class : Down


mean(rfPredict == testing$Direction)

## [1] 1
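A perfect score on held-out data is worth a second look. One possible explanation, assuming Direction in Smarket is derived from the sign of the same-day return Today (which is among the 8 predictors), is that the model only has to learn a single threshold. The sketch below uses synthetic data with hypothetical names `today`, `lag1` and `direction` to show the effect; it is not run on Smarket itself.

```r
set.seed(4)
# Synthetic stand-in for the market data: `today` is the same-day return and
# `direction` is derived from its sign (an assumption about how Smarket was built).
today     <- rnorm(500, mean = 0, sd = 1.1)
lag1      <- rnorm(500, mean = 0, sd = 1.1)
direction <- factor(ifelse(today > 0, "Up", "Down"), levels = c("Down", "Up"))

# A one-line rule on `today` reproduces the label exactly.
rule_with_today <- factor(ifelse(today > 0, "Up", "Down"), levels = c("Down", "Up"))
mean(rule_with_today == direction)

# The same kind of rule on an unrelated lagged return stays near chance level.
rule_with_lag <- factor(ifelse(lag1 > 0, "Up", "Down"), levels = c("Down", "Up"))
mean(rule_with_lag == direction)
```

On the real data, a cross-tabulation such as `table(Smarket$Today > 0, Smarket$Direction)` would show whether the same relationship holds there.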

#With twoClassSummary

ctrl <- trainControl(method="repeatedcv",repeats = 3,classProbs=TRUE,summaryFunction = twoClassSummary)


# Random forest

rfFit <- train(Direction ~ ., data = training, method = "rf", trControl = ctrl, preProcess = c("center","scale"), tuneLength = 20)

## note: only 7 unique complexity parameters in default grid. Truncating the grid to 7 .

## Warning: The metric "Accuracy" was not in the result set. ROC will be used instead.

## Random Forest

## 938 samples

## 8 predictors
## 2 classes: 'Down', 'Up'


## Pre-processing: centered, scaled

## Resampling: Cross-Validated (10 fold, repeated 3 times)


## Summary of sample sizes: 844, 844, 845, 845, 843, 845, ...


## Resampling results across tuning parameters:


##   mtry  ROC  Sens  Spec  ROC SD  Sens SD  Spec SD
##   2     1    1     1     9e-05   0.007    0.005
##   3     1    1     1     8e-05   0.007    0.005
##   4     1    1     1     0       0.007    0.005
##   5     1    1     1     0       0.007    0.005
##   6     1    1     1     8e-05   0.007    0.005
##   7     1    1     1     0       0.007    0.005
##   8     1    1     1     0       0.007    0.005


## ROC was used to select the optimal model using the largest value.

## The final value used for the model was mtry = 4.

#Trying plot with some more parameters

rfPredict <- predict(rfFit,newdata = testing )

confusionMatrix(rfPredict, testing$Direction)

## Confusion Matrix and Statistics


## Reference

## Prediction Down Up

## Down 150 0

## Up 0 162


## Accuracy : 1

## 95% CI : (0.988, 1)

## No Information Rate : 0.519

## P-Value [Acc > NIR] : <2e-16


## Kappa : 1

## Mcnemar's Test P-Value : NA


## Sensitivity : 1.000

## Specificity : 1.000

## Pos Pred Value : 1.000

## Neg Pred Value : 1.000

## Prevalence : 0.481

## Detection Rate : 0.481

## Detection Prevalence : 0.481

## Balanced Accuracy : 1.000


## 'Positive' Class : Down


mean(rfPredict == testing$Direction)

## [1] 1

Plotting the ROC curve


rfPredict <- predict(rfFit,newdata = testing , type="prob")

# roc() expects levels to hold the two class labels as c(control, case).
# rev(levels(testing$Direction)) is c("Up", "Down"), so "Down" is treated as the case class.
rfROC <- roc(testing$Direction,rfPredict[,"Down"], levels = rev(levels(testing$Direction)))

# The output recorded below is from the call as first written, with levels = rev(testing$Direction).
# It reports 162 controls and 162 cases although the test set has 150 Down and 162 Up rows,
# which suggests both groups were the Up rows; the AUC of 0.5 therefore does not measure the model.

## Call:

## roc.default(response = testing$Direction, predictor = rfPredict[, "Down"], levels = rev(tes



## Data: rfPredict[, "Down"] in 162 controls (testing$Direction 2) < 162 cases (testing$Direction

## Area under the curve: 0.5

