
Case study possible questions

1) To identify Classification Problems
2) Different cross validation methods
3) Variable selection methods
4) Random Forest

Random Forest Algorithm

a) Definition

Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of
decision trees, usually trained with the bagging method. The general idea of bagging is that
combining several learning models improves the overall result.
Random forests have three main tuning parameters (a code sketch follows this list).

1. Node Size: unlike in decision trees, the number of observations in the terminal nodes of each
tree of the forest can be very small. The goal is to grow trees with as little bias as possible.

2. Number of Trees: in practice, 500 trees is often a good choice.

3. Number of Predictors Sampled: the number of predictors sampled at each split is a key tuning
parameter that affects how well random forests perform. Sampling 2-5 predictors at each split
is often adequate.
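
As a minimal sketch, assuming scikit-learn and a toy dataset (both are my own choices, not part
of these notes), the three tuning parameters map onto RandomForestClassifier roughly as follows:

    # Illustrative mapping of the three tuning parameters onto scikit-learn's
    # RandomForestClassifier; the dataset and parameter values are assumptions.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # toy data

    clf = RandomForestClassifier(
        n_estimators=500,     # Number of Trees: 500 is often a good choice
        min_samples_leaf=1,   # Node Size: terminal nodes are allowed to be very small
        max_features=4,       # Number of Predictors Sampled at each split (2-5 is often adequate)
        random_state=0,
    )
    clf.fit(X, y)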

b) Algorithm

1. Take a random sample of size N with replacement from the data (bootstrap sample).

2. Take a random sample without replacement of the predictors.

3. Construct a split by using predictors selected in Step 2.

4. Repeat Steps 2 and 3 for each subsequent split until the tree is as large as desired. Do not
prune. Each tree is thus grown from a random sample of cases, with a random sample of
predictors considered at each split.

5. Drop the out-of-bag data down the tree. Store the class assigned to each observation along
with each observation's predictor values.

6. Repeat Steps 1-5 a large number of times (e.g., 500).

7. For each observation in the dataset, count the proportion of trees (among those for which it
was out-of-bag) that classify it into each category.

8. Assign each observation to a final category by a majority vote over the set of trees. Thus, if
51% of the time over a large number of trees a given observation is classified as a "1", that
becomes its classification. (A code sketch of these steps follows.)
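
The following is a simplified from-scratch sketch of Steps 1-8 for classification, assuming NumPy
and scikit-learn are available; the per-split predictor sampling of Steps 2-3 is delegated to
DecisionTreeClassifier's max_features option, and all function and variable names are illustrative:

    # Simplified sketch of Steps 1-8 above (illustrative, not a production implementation).
    # Per-split predictor sampling (Steps 2-3) is handled by DecisionTreeClassifier's
    # max_features argument; trees are grown fully and never pruned (Step 4).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_forest_oob_predict(X, y, n_trees=500, max_features=4, seed=0):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.default_rng(seed)
        n = len(X)
        classes = np.unique(y)
        # votes[i, c]: number of trees for which observation i was out-of-bag
        # and was classified into class c (Steps 5 and 7)
        votes = np.zeros((n, len(classes)))

        for _ in range(n_trees):                      # Step 6: repeat many times
            boot = rng.integers(0, n, size=n)         # Step 1: bootstrap sample of size N
            oob = np.setdiff1d(np.arange(n), boot)    # cases left out of the bootstrap sample
            tree = DecisionTreeClassifier(max_features=max_features)  # Steps 2-4
            tree.fit(X[boot], y[boot])
            if len(oob) > 0:
                preds = tree.predict(X[oob])          # Step 5: drop OOB data down the tree
                for i, p in zip(oob, preds):
                    votes[i, np.searchsorted(classes, p)] += 1

        # Step 8: assign each observation to the category with the most votes
        return classes[votes.argmax(axis=1)]

For an observation that never ends up out-of-bag, this sketch silently defaults to the first
class; a fuller implementation would handle that case explicitly.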
c) Advantages

1) It can be used for both classification and regression problems.
2) It reduces overfitting, because the output is based on majority voting or averaging.
3) It performs well even if the data contains null/missing values.
4) Each decision tree is built independently of the others, so the forest can be trained in
parallel.
5) It is highly stable, because predictions are averaged (or voted) over a large number of trees.
6) It maintains diversity, because not all attributes are considered when building each decision
tree, though this is not true in all cases.
7) It is relatively robust to the curse of dimensionality: since each tree does not consider all
the attributes, the effective feature space is reduced.
8) We do not have to segregate the data into train and test sets, because roughly one-third
(about 37%) of the observations are left out of each bootstrap sample and are never seen by
the tree grown from it; this out-of-bag data acts as a built-in validation set (see the sketch
after this list).
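
As an illustration of point 8 (the tooling and data here are assumptions, not something stated in
these notes), scikit-learn exposes this out-of-bag estimate directly via oob_score=True:

    # Illustration of point 8: the out-of-bag observations act as a built-in
    # validation set, so no separate train/test split is strictly needed here.
    # Dataset and settings are assumptions for the sake of the example.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    clf = RandomForestClassifier(n_estimators=500, bootstrap=True, oob_score=True,
                                 random_state=0)
    clf.fit(X, y)
    # Each observation is scored only by trees that never saw it (each tree leaves
    # out roughly 37% of the observations on average), giving an honest estimate.
    print("Out-of-bag accuracy estimate:", clf.oob_score_)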

d) Disadvantages

1) Random forest is highly complex compared to a single decision tree, where a decision can be
traced by following the path down the tree.
2) Training time is longer than for simpler models due to this complexity, and whenever the
forest has to make a prediction, every decision tree must generate an output for the given input.

e) Applications

The Random Forest algorithm is most commonly applied in the following four sectors:

A. Banking: It is mainly used in the banking industry to identify loan risk.
B. Medicine: To identify illness trends and risks.
C. Land Use: Random Forest Classifier is also used to classify places with similar land-use
patterns.
D. Market Trends: You can determine market trends using this algorithm.

5) SVD Problem
