
Caret Practice
Pham Dinh Khanh
May 15, 2018

Test Algorithm
Tuning is an important step in building a model. With a random forest you cannot know the result in advance, because training is a random process. Tuning lets you control that process and obtain better results. In this study we focus on the two main tuning parameters of a random forest: mtry and ntree. There are many other parameters, but these two probably have the biggest effect on model accuracy.
mtry: the number of variables randomly sampled as split candidates at each split.
ntree: the number of trees to grow in the forest.
In the results below we use the repeatedcv method to divide our dataset into 10 cross-validation folds, repeated only 3 times to keep the run time down. I will hold back a validation set for back testing.

#https://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/
library(randomForest)
library(mlbench)
library(caret)
library(e1071)

# Load dataset
data(Sonar)
dataset <- Sonar
x <- dataset[, 1:60]
y <- dataset[, 61]

# 10 folds, repeated 3 times
control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 3)

# Metric used to compare models is Accuracy
metric <- "Accuracy"

set.seed(123)

# Number of randomly selected variables is mtry
mtry <- sqrt(ncol(x))
tunegrid <- expand.grid(.mtry = mtry)

rf_default <- train(Class ~ .,
                    data = dataset,
                    method = 'rf',
                    metric = 'Accuracy',
                    tuneGrid = tunegrid,
                    trControl = control)

print(rf_default)


## Random Forest
##
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 187, 187, 187, 188, 188, 187, ...
## Resampling results:
##
##   Accuracy   Kappa
##   0.8408442  0.6765085
##
## Tuning parameter 'mtry' was held constant at a value of 7.745967
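For the record, the constant mtry reported here is simply the square root of the number of predictors, computed in the code above:

sqrt(ncol(x))  # 7.745967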

Random search
Caret can generate random parameter values for you if you do not declare them. The model below generates 15 random values of mtry during tuning; we get 15 values because tuneLength is 15 (random draws that coincide are collapsed, so the printout below shows 14 distinct values).

# library(doParallel)
# cores <- 7
# registerDoParallel(cores = cores)

# mtry: number of random variables considered at each split.
# By default this is roughly the square root of the number of columns.
mtry <- sqrt(ncol(x))

# ntree: number of trees to grow.
ntree <- 3

# (Note: the mtry and ntree values set above are not used by the random
# search below; caret draws mtry values itself.)
control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 3,
                        search = 'random')

# Randomly generate 15 mtry values with tuneLength = 15
set.seed(1)
rf_random <- train(Class ~ .,
                   data = dataset,
                   method = 'rf',
                   metric = 'Accuracy',
                   tuneLength = 15,
                   trControl = control)

print(rf_random)


## Random Forest
##
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 187, 188, 188, 186, 187, 187, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    4    0.8559380  0.7070142
##   11    0.8476840  0.6907062
##   13    0.8397547  0.6739463
##   23    0.8330159  0.6604453
##   24    0.8249206  0.6446857
##   30    0.8265079  0.6474770
##   35    0.8122150  0.6187142
##   38    0.8184848  0.6311962
##   40    0.8185642  0.6321423
##   42    0.8234127  0.6413732
##   47    0.8024459  0.5993188
##   54    0.8252309  0.6461357
##   55    0.8153824  0.6250915
##   57    0.8106205  0.6162190
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.

plot(rf_random)


The plot shows the highest accuracy, about 86% (0.8559), at mtry = 4.
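If you prefer to read this off the fitted object rather than the plot, the caret train object stores the winning parameter and the resampling results:

rf_random$bestTune                 # the selected mtry
max(rf_random$results$Accuracy)    # best cross-validated accuracy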

Grid search
We can also define a grid of parameter values to tune the model. Each axis of the grid is an algorithm parameter, and each point in the grid is a specific combination of parameters. In this example we tune only one parameter, so the grid search has only one dimension: a vector.

#Create a control function for training with 10 folds, repeated 3 times.
#The search method is grid.
control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 3,
                        search = 'grid')

#Create a tune grid with 15 values (1:15) for mtry. train() will change the
#number of variables tried at each split according to this grid.
tunegrid <- expand.grid(.mtry = (1:15))

#Note: trControl is not passed here, so caret falls back to its default
#bootstrap resampling (25 reps), as the output below shows.
rf_gridsearch <- train(Class ~ .,
                       data = dataset,
                       method = 'rf',
                       metric = 'Accuracy',
                       tuneGrid = tunegrid)

print(rf_gridsearch)


## Random Forest
##
## 208 samples
##  60 predictor
##   2 classes: 'M', 'R'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 208, 208, 208, 208, 208, 208, ...
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    1    0.8249973  0.6474343
##    2    0.8252196  0.6480826
##    3    0.8158734  0.6290701
##    4    0.8152652  0.6279022
##    5    0.8206327  0.6389178
##    6    0.8162241  0.6295473
##    7    0.8194694  0.6360085
##    8    0.8182547  0.6338760
##    9    0.8159730  0.6291169
##   10    0.8118015  0.6208063
##   11    0.8120447  0.6215262
##   12    0.8122707  0.6219964
##   13    0.8124745  0.6222660
##   14    0.8164327  0.6306590
##   15    0.8127498  0.6229256
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

plot(rf_gridsearch)


These results suggest that the optimal value is mtry = 2, with an accuracy of about 82.5%. Note that the resampling here was caret's default bootstrap (25 reps) rather than our repeated cross-validation, because trControl was not passed to train(); a corrected call is sketched below.
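A minimal fix, if you want the grid searched under the repeated cross-validation we defined (rf_gridsearch_cv is a new name introduced here for illustration):

set.seed(123)
rf_gridsearch_cv <- train(Class ~ .,
                          data = dataset,
                          method = 'rf',
                          metric = 'Accuracy',
                          tuneGrid = tunegrid,
                          trControl = control)  # 10-fold CV repeated 3 times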

Tune by tools
The randomForest package provides tuneRF() to search for the optimal mtry value for your data. It relies on the out-of-bag (OOB) error: the mtry with the lowest OOB error is considered the most accurate for our model.

set.seed(1)
bestMtry <- tuneRF(x, y, stepFactor = 1.5, improve = 1e-5, ntree = 500)

## mtry = 7  OOB error = 14.9%
## Searching left ...
## mtry = 5  OOB error = 12.5%
## 0.1612903 1e-05
## mtry = 4  OOB error = 14.9%
## -0.1923077 1e-05
## Searching right ...
## mtry = 10  OOB error = 15.38%
## -0.2307692 1e-05


print(bestMtry)

##        mtry  OOBError
## 4.OOB     4 0.1490385
## 5.OOB     5 0.1250000
## 7.OOB     7 0.1490385
## 10.OOB   10 0.1538462

According to these results, mtry = 5 is the best parameter for our model. This is quite different from the grid search method, which reached about 82.5% accuracy at mtry = 2.
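To pull the winning value out of the returned object programmatically (tuneRF returns a matrix with columns mtry and OOBError):

bestMtry[which.min(bestMtry[, "OOBError"]), "mtry"]  # 5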

Make your own parameter search

You can make your own parameter search adapted to your dataset.
There are two options:
Tune manually: write R code that creates many models and compares their accuracy using caret.
Extend caret: create an extension to caret that adds more parameters to the caret model for tuning.

Tuning manually
For manual tuning we keep using caret, because its results are aligned with the previous models and allow a fair comparison. Moreover, keeping repeated cross-validation in caret reduces the model's overfitting.


This approach creates many caret models with different manual parameters and compares their accuracy. Let's look at an example where we evaluate different values of ntree while holding mtry constant.

# library(doParallel)
# cores <- makeCluster(detectCores() - 1)
# registerDoParallel(cores = cores)

#Manual search: 10 folds, repeated 3 times
control <- trainControl(method = 'repeatedcv',
                        number = 10,
                        repeats = 3,
                        search = 'grid')

#Create the tune grid (mtry held constant at the square root of the column count)
tunegrid <- expand.grid(.mtry = c(sqrt(ncol(dataset))))

modellist <- list()

#Train with different ntree values
for (ntree in c(1000, 1500, 2000, 2500)) {
  set.seed(123)
  fit <- train(Class ~ .,
               data = dataset,
               method = 'rf',
               metric = 'Accuracy',
               tuneGrid = tunegrid,
               trControl = control,
               ntree = ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}

#Compare results
results <- resamples(modellist)
summary(results)

##
## Call:
## summary.resamples(object = results)
##
## Models: 1000, 1500, 2000, 2500
## Number of resamples: 30
##
## Accuracy
##           Min.   1st Qu.    Median      Mean   3rd Qu.     Max. NA's
## 1000 0.7000000 0.7738095 0.8571429 0.8376696 0.9047619 0.952381    0
## 1500 0.6666667 0.7738095 0.8535714 0.8329076 0.9047619 0.952381    0
## 2000 0.6500000 0.7738095 0.8535714 0.8294949 0.9047619 0.952381    0
## 2500 0.7000000 0.7738095 0.8535714 0.8312410 0.9035714 0.952381    0
##
## Kappa
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 1000 0.3814433 0.5431887 0.7096774 0.6700748 0.8073394 0.9041096    0
## 1500 0.3287671 0.5431887 0.6985887 0.6606111 0.8073394 0.9041096    0
## 2000 0.2708333 0.5431887 0.6985887 0.6529043 0.8073394 0.9041096    0
## 2500 0.3684211 0.5431887 0.6985887 0.6569477 0.8039582 0.9041096    0


dotplot(results)

# stopCluster(cores)

Our model reaches its highest accuracy, 83.77%, at ntree = 1000, which suggests that ntree <= 1000 is the best-adapted range for tuning. In this case we kept mtry equal to the square root of the number of columns in the dataset. We should also try other options, such as mtry = 5 or mtry = 2, in case mtry and ntree have interaction effects; a sketch follows.
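A minimal sketch of extending the same manual loop to vary mtry as well (the mtry values 2, 5, and 8 are hypothetical probes; the grid multiplies the run time accordingly):

modellist2 <- list()
for (ntree in c(1000, 1500)) {
  for (mtry in c(2, 5, 8)) {
    set.seed(123)
    fit <- train(Class ~ .,
                 data = dataset,
                 method = 'rf',
                 metric = 'Accuracy',
                 tuneGrid = expand.grid(.mtry = mtry),
                 trControl = control,
                 ntree = ntree)
    modellist2[[paste(ntree, mtry, sep = "_")]] <- fit  # e.g. "1000_2"
  }
}
summary(resamples(modellist2))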

Extend caret
In this method we create a new algorithm for caret to support. It is the same random forest we implemented before, but made more flexible by allowing tuning over multiple parameters. In this case we tune both the mtry and ntree parameters.
We create a custom list in our model to set up the tuning rules, defining the parameters, type, library, predict and prob functions, and so on. The caret package reads this list to drive the tuning process.


customRF <- list(type = "Classification",

library = "randomForest",

loop = NULL)

customRF$parameters <- data.frame(parameter = c("mtry", "ntree"),

class = rep("numeric", 2),

label = c("mtry", "ntree"))

customRF$grid <- function(x, y, len = NULL, search = "grid") {}

customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs) {

randomForest(x, y,

mtry = param$mtry,

ntree=param$ntree)

#Predict label

customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)

predict(modelFit, newdata)

#Predict prob

customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)

predict(modelFit, newdata, type = "prob")

customRF$sort <- function(x) x[order(x[,1]),]

customRF$levels <- function(x) x$classes

Now let's tune the model by calling caret's train() with this customRF list. The model will be tuned over different mtry and ntree values.

# library(doParallel)
# cores <- makeCluster(detectCores() - 1)
# registerDoParallel(cores = cores)

# Train the model
control <- trainControl(method = "repeatedcv",
                        number = 10,
                        repeats = 3,
                        allowParallel = TRUE)

tunegrid <- expand.grid(.mtry = c(1:15), .ntree = c(1000, 1500))

set.seed(123)
custom <- train(Class ~ .,
                data = dataset,
                method = customRF,
                metric = metric,
                tuneGrid = tunegrid,
                trControl = control)

summary(custom)


##                 Length Class      Mode
## call               5   -none-     call
## type               1   -none-     character
## predicted        208   factor     numeric
## err.rate        3000   -none-     numeric
## confusion          6   -none-     numeric
## votes            416   matrix     numeric
## oob.times        208   -none-     numeric
## classes            2   -none-     character
## importance        60   -none-     numeric
## importanceSD       0   -none-     NULL
## localImportance    0   -none-     NULL
## proximity          0   -none-     NULL
## ntree              1   -none-     numeric
## mtry               1   -none-     numeric
## forest            14   -none-     list
## y                208   factor     numeric
## test               0   -none-     NULL
## inbag              0   -none-     NULL
## xNames            60   -none-     character
## problemType        1   -none-     character
## tuneValue          2   data.frame list
## obsLevels          2   -none-     character
## param              0   -none-     list

plot(custom)


# stopCluster(cores)
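Note that summary(custom) above prints the components of the underlying randomForest fit rather than the tuning results; to see the winning parameter pair, query the train object directly:

custom$bestTune                                       # winning mtry/ntree combination
custom$results[which.max(custom$results$Accuracy), ]  # its resampled accuracy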

This run takes several hours to complete on my laptop (Core i7-3720QM, 8 GB RAM), because it actually trains thousands of models. Consider that before running it yourself; the commented doParallel lines above hint at the usual remedy, sketched below.
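A minimal sketch of registering a parallel backend before calling train() (trainControl already sets allowParallel = TRUE):

library(doParallel)
cores <- makeCluster(detectCores() - 1)  # leave one core free
registerDoParallel(cores)
# ... run train() as above ...
stopCluster(cores)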
References:
https://topepo.github.io/caret/model-training-and-tuning.html
https://machinelearningmastery.com/tune-machine-learning-algorithms-in-r/
