Machine Learning With Random Forests - by Knoldus Inc. - Knoldus - Technical Insights - Medium
To understand machine learning in a little more depth, I'd suggest you go through this introductory blog on Machine Learning.
In our previous blogs, we learned about the Decision Tree algorithm (Link) and its implementation (Link). In this blog, we will move on to the next Machine Learning algorithm, called Random Forest. Please go through those blogs before moving forward, as the Random Forest algorithm is based on Decision Trees.
‘Random Forest’, as the name suggests, is a forest, and a forest consists of trees. The trees being referred to here are Decision Trees, so the full definition would be: “A Random Forest is a random collection of Decision Trees”. Hence this algorithm is essentially just an extension of the Decision Tree algorithm.
Bagging is a general procedure that can be used to reduce the variance of algorithms that have high variance. In this process, sub-samples of the data set and subsets of the attributes are created and used to train our decision models; then we consider every model and choose the decision by voting (classification) or by taking the average (regression). For a random forest, we usually take about two thirds of the data with replacement (data points can be repeated across decision trees; there is no need for unique data) and a random subset of m attributes for each tree.
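The voting and averaging steps above can be sketched in plain Scala. This is only an illustration of the idea, not Smile's implementation; the per-tree predictions are made up, and a real forest would obtain them from trained Decision Trees.

```scala
import scala.util.Random

object BaggingSketch {
  // Draw a bootstrap sub-sample of roughly two thirds of the data, with replacement.
  def bootstrap[A](data: Vector[A], rng: Random): Vector[A] =
    Vector.fill(data.size * 2 / 3)(data(rng.nextInt(data.size)))

  // Classification: every tree votes and the majority class wins.
  def vote(predictions: Seq[Int]): Int =
    predictions.groupBy(identity).maxBy(_._2.size)._1

  // Regression: the forest answers with the average of the trees' predictions.
  def average(predictions: Seq[Double]): Double =
    predictions.sum / predictions.size

  def main(args: Array[String]): Unit = {
    println(vote(Seq(1, 0, 1, 1, 0)))    // majority class: 1
    println(average(Seq(2.0, 3.0, 2.5))) // mean prediction: 2.5
  }
}
```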
Advantages
Works well for both classification and regression.
Can handle large data sets with a large number of attributes, as these are divided among the trees.
It can model the importance of attributes, so it is also used for dimensionality reduction.
The Out of Bag error shows more or less the same error rate as a separate test data set would, so it removes the need for a separate test data set.
Disadvantages
Classification is good with Random Forest, but regression… not so much.
It works as a black box: one cannot control the inner workings other than by changing the input values and parameters.
Implementation
Now it's time to see the implementation of the Random Forest algorithm in Scala. Here we are going to use the Smile library, like we did for the implementation of Decision Trees.
We are going to use the same data for this implementation as we did for the Decision Tree, so we have an Array of Array of Double as the training instances and an Array of Int as the response values for these instances.
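For reference, the shapes look like this. The concrete numbers below are invented purely for illustration; the real values come from the weather ARFF file used in the Decision Tree blog.

```scala
object WeatherData {
  // Four encoded weather attributes per instance
  // (e.g. outlook, temperature, humidity, windy); values are made up.
  val trainingInstances: Array[Array[Double]] = Array(
    Array(0.0, 1.0, 1.0, 0.0),
    Array(2.0, 0.0, 0.0, 1.0)
  )
  // One response value per instance (e.g. play: 0 = no, 1 = yes).
  val responseValues: Array[Int] = Array(0, 1)
}
```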
Training
After getting the data, we have a method randomForest() in the smile.operators package that returns an instance of the RandomForest class. The important parameters are shown below:
val nTrees = 200
val maxNodes = 4
// hypothetical call; check the randomForest() signature in your Smile version:
val forest = randomForest(trainingInstances, responseValues, ntrees = nTrees, maxNodes = maxNodes)
nodeSize: Int — the number of instances in a node below which the tree will not split; by default the value is 1, but for very large data sets it should be more than one.
mtry: Int — the number of randomly selected attributes for each decision tree; by default its value is the square root of the number of attributes.
subsample: Double — if the value is 1.0, sampling is done with replacement; if less than 1.0, without replacement. By default the value is 1.0.
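What the subsample parameter controls can be sketched in plain Scala, under my reading of the description above (this is an illustration, not Smile's code):

```scala
import scala.util.Random

object SubsampleSketch {
  // ratio >= 1.0: draw n points with replacement (bootstrap, duplicates possible);
  // ratio <  1.0: draw ratio * n distinct points without replacement.
  def subsample[A](data: Vector[A], ratio: Double, rng: Random): Vector[A] =
    if (ratio >= 1.0) Vector.fill(data.size)(data(rng.nextInt(data.size)))
    else rng.shuffle(data).take(math.round(data.size * ratio).toInt)

  def main(args: Array[String]): Unit = {
    val rng = new Random(7)
    val data = Vector(1, 2, 3, 4, 5, 6)
    println(subsample(data, 1.0, rng).size) // 6, duplicates possible
    println(subsample(data, 0.5, rng).size) // 3, all distinct
  }
}
```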
Testing
Now our Random Forest is created. We can use its error() method to show
the out of bag error for our Random Forest.
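The idea behind the Out of Bag error can be sketched in plain Scala (illustrative only): the instances a tree never saw in its bootstrap sample form a free validation set for that tree.

```scala
object OobSketch {
  // For one tree: every instance not drawn into its bootstrap sample is "out of bag".
  def outOfBag(n: Int, inBag: Set[Int]): Seq[Int] =
    (0 until n).filterNot(inBag.contains)

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(1)
    val n = 10
    // Indices drawn (with replacement) for one tree's training sample.
    val inBag = Vector.fill(n)(rng.nextInt(n)).toSet
    // The OOB instances act as a validation set for that tree; averaging the
    // per-tree error over all trees gives the forest's Out of Bag error.
    println(outOfBag(n, inBag).size)
  }
}
```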
Accuracy
Our Random Forest is ready and we have also checked the out of bag error. We know that every prediction comes with some error, so how do we check the accuracy of the random forest we just built? For that we need:
trainingInstances: Array[Array[Double]]
responseValues: Array[Int]
testInstances: Array[Array[Double]]
testResponseValues: Array[Int]
Here the testInstances and testResponseValues are fetched from a testing data set, as shown below:
val weatherTest =
  read.arff("src/main/resources/weatherRF.nominal.arff", 4)
val (testInstances, testResponseValues) =
  data.pimpDataset(weatherTest).unzipInt
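The accuracy itself is just the fraction of test instances predicted correctly. A minimal sketch in plain Scala (the predicted labels below are placeholders; in practice they would come from the trained forest's predictions on testInstances):

```scala
object AccuracySketch {
  // Accuracy = correctly predicted test instances / all test instances, in percent.
  def accuracy(predicted: Seq[Int], actual: Seq[Int]): Double = {
    require(predicted.size == actual.size, "one prediction per test instance")
    val correct = predicted.zip(actual).count { case (p, a) => p == a }
    correct.toDouble / actual.size * 100
  }

  def main(args: Array[String]): Unit = {
    // 5 correct predictions out of 6 gives the 83.33% figure reported below.
    println(f"${accuracy(Seq(1, 0, 1, 1, 0, 1), Seq(1, 0, 1, 1, 0, 0))}%.2f")
  }
}
```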
Here is the output:
As we can see, it reports the accuracy of our random forest, which is 83.33% right now.
#mlforscalalovers
References:
Analytics Vidhya
Smile Github
Image Source