
MODULE 4: REGRESSION SHRINKAGE METHODS

In statistics, shrinkage is the reduction in the effects of sampling variation. In regression analysis, a fitted relationship appears to perform less well on a new data set than on the data set used for fitting. In particular, the value of the coefficient of determination 'shrinks'. This idea is complementary to overfitting and, separately, to the standard adjustment made in the coefficient of determination to compensate for the effects that further sampling would be expected to have, such as controlling for the potential of new explanatory terms improving the model by chance: that is, the adjustment formula itself provides "shrinkage." But the adjustment formula yields an artificial shrinkage.

A shrinkage estimator is an estimator that, either explicitly or implicitly, incorporates the effects of shrinkage. In loose terms, this means that a naive or raw estimate is improved by combining it with other information.

Description

Many standard estimators can be improved, in terms of mean squared error (MSE), by shrinking
them towards zero (or any other finite constant value). In other words, the improvement in the
estimate from the corresponding reduction in the width of the confidence interval can outweigh
the worsening of the estimate introduced by biasing the estimate towards zero (see bias-variance
trade-off).

Assume that the expected value of the raw estimate is not zero and consider other estimators
obtained by multiplying the raw estimate by a certain parameter. A value for this parameter can
be specified so as to minimize the MSE of the new estimate. For this value of the parameter, the
new estimate will have a smaller MSE than the raw one. Thus it has been improved. An effect
here may be to convert an unbiased raw estimate to an improved biased one.

Prediction:

 Linear regression: Ŷ = X β^

 Or, for a more general regression function: Ŷ = f^(X)

 In a prediction context, there is less concern about the values of the individual components on the right-hand side; interest centers on the total contribution.
Variable Selection:

 The driving force behind variable selection:

o The desire for a parsimonious regression model (one that is simpler and easier to
interpret);
o The need for greater accuracy in prediction.

 The notion of what makes a variable "important" is still not well understood, but one
interpretation (Breiman, 2001) is that a variable is important if dropping it seriously
affects prediction accuracy.

 Selecting variables in regression models is a complicated problem, and there are many
conflicting views on which type of variable selection procedure is best, e.g. LRT, F-test,
AIC, and BIC.

There are two main types of stepwise procedures in regression:

 Backward elimination: eliminate the least important variable from the selected ones.

 Forward selection: add the most important variable from the remaining ones.

 A hybrid version that incorporates ideas from both main types: alternates backward and
forward steps, and stops when all variables have either been retained for inclusion or
removed.

Criticisms of Stepwise Methods:

 There is no guarantee that the subsets obtained from stepwise procedures will contain the
same variables or even be the "best" subset.

 When there are more variables than observations (p > n), backward elimination is
typically not a feasible procedure.

 The maximum or minimum of a set of correlated F statistics is not itself an F statistic.

 It produces a single answer (a very specific subset) to the variable selection problem,
although several different subsets may be equally good for regression purposes.

 The computation is easy using the R functions step() or regsubsets() (see the brief sketch below). However, to arrive at a practically good answer, you must know the practical context in which your inference will be used.
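To make this concrete, here is a brief R sketch (not part of the original notes) using the built-in mtcars data as a placeholder; the formulas and settings are illustrative only.

# Hybrid stepwise search with step() (AIC-guided) and exhaustive best-subset
# search with regsubsets(); mtcars and the response mpg are stand-ins.
library(leaps)

full <- lm(mpg ~ ., data = mtcars)                  # full OLS model
both <- step(full, direction = "both", trace = 0)   # alternating forward/backward steps
formula(both)                                       # the subset that step() settled on

subs <- regsubsets(mpg ~ ., data = mtcars, nvmax = 5)
summary(subs)$bic                                   # BIC of the best model of each size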

Scott Zeger on 'how to pick the wrong model': turn your scientific problem over to a computer that, knowing nothing about your science or your question, is very good at optimizing AIC or BIC.
Objectives

Upon successful completion of this lesson, you should be able to:

 Introduce biased regression methods to reduce variance.

 Implement ridge and lasso regression.

Ridge Regression
Ridge regression addresses some of the shortcomings of linear regression. It is an extension of the OLS method with an additional constraint. The OLS estimates are unconstrained and may therefore have a large magnitude, and hence a large variance. In ridge regression, a penalty is applied to the coefficients so that they are shrunk towards zero; this also has the effect of reducing the variance and hence the prediction error. As in the OLS approach, we choose the ridge coefficients to minimize a penalized residual sum of squares (RSS). In contrast to OLS, ridge regression provides biased estimators which have low variance.

Motivation: too many predictors


 It is not unusual to see the number of input variables greatly exceed the number of
observations, e.g. microarray data analysis, environmental pollution studies.
 With many predictors, fitting the full model without penalization will result in large prediction intervals, and the LS regression estimator may not exist uniquely.

One way out of this situation is to abandon the requirement of an unbiased estimator.
We assume only that X's and Y have been centered so that we have no need for a constant term
in the regression:
 X is an n by p matrix with centered columns,
 Y is a centered n-vector.
When initially developing predictive models, we often need to estimate the coefficients, since they are not explicitly stated in the training data. To estimate coefficients, we can use the standard ordinary least squares (OLS) matrix estimator:

β^ = (X'X)⁻¹ X'Y

Following this formula's operations requires some familiarity with matrix notation. Suffice it to say, the formula finds the best-fitting line for a given dataset by calculating, for each independent variable, the coefficient that collectively results in the smallest residual sum of squares (also called the sum of squared errors). Hoerl and Kennard (1970) observed that this LS estimator can be unstable, for example when the columns of X are highly correlated, and proposed adding a small constant value λ to the diagonal entries of X'X before taking its inverse; this is the idea behind ridge regression.
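As a quick illustration (not part of the original notes), the OLS matrix estimator can be computed directly on simulated, centered data; the variable names and dimensions below are placeholders.

# A minimal sketch of beta^ = (X'X)^{-1} X'Y on simulated, centered data.
set.seed(1)
n <- 50; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- drop(X %*% c(1, -2, 0.5) + rnorm(n))
y <- y - mean(y)                               # center the response as well

beta_ols <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
cbind(beta_ols, coef(lm(y ~ X - 1)))           # matches lm() fitted without an intercept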

Residual sum of squares (RSS) measures how well a linear regression model matches the training data. It is represented by the formulation:

RSS = Σᵢ (yᵢ − ŷᵢ)² = Σᵢ (yᵢ − β₀ − β₁xᵢ₁ − ... − βₚxᵢₚ)²

This quantity measures how closely the model's predictions match the ground-truth values in the training data. If RSS = 0, the model perfectly predicts the dependent variable. A score of zero is not always desirable, however, as it can indicate overfitting on the training data, particularly if the training dataset is small. Multicollinearity may be one cause of this.

• High coefficient estimates can often be symptomatic of overfitting. If two or more variables share a high, linear correlation, OLS may return erroneously high-value
coefficients. When one or more coefficients are too high, the model’s output becomes
sensitive to minor alterations in the input data. In other words, the model has overfitted
on a specific training set and fails to accurately generalize on new test sets. Such a model
is considered unstable.

• Ridge regression modifies OLS by calculating coefficients that account for potentially
correlated predictors. Specifically, ridge regression corrects for high-value coefficients by
introducing a regularization term (often called the penalty term) into the RSS function.
This penalty term is the sum of the squares of the model's coefficients, scaled by the hyperparameter lambda (λ):

λ * (β₁² + β₂² + ... + βₚ²)

The L2 penalty term is inserted at the end of the RSS function, resulting in a new formulation, the ridge regression estimator, whose objective is

Minimize: RSS + λ * (β₁² + β₂² + ... + βₚ²)

In matrix form, the solution is β^ridge = (X'X + λI)⁻¹ X'Y.

Remember that coefficients mark a given predictor's (i.e. independent variable's) effect on the predicted value (i.e. dependent variable). Once added to the RSS formula, the L2 penalty term counteracts especially high coefficients by reducing all coefficient values. In statistics, this is called coefficient shrinkage. The ridge estimator above thus calculates new regression coefficients that reduce a given model's RSS. This dampens every predictor's effect and reduces overfitting on the training data.

Note that ridge regression does not shrink every coefficient by the same amount. Rather, coefficients are shrunk in proportion to their initial size: as λ increases, high-value coefficients shrink at a greater rate than low-value coefficients, and are thus penalized more heavily.
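The following R sketch (an illustration, not part of the original notes) computes the ridge estimator both in closed form and with the glmnet package; the data, the value of λ, and the variable names are assumptions made for the example.

# Ridge estimator on simulated, centered data: closed form versus glmnet
# (alpha = 0 selects the ridge penalty).
library(glmnet)
set.seed(1)
n <- 50; p <- 3
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y <- drop(X %*% c(1, -2, 0.5) + rnorm(n)); y <- y - mean(y)

lambda <- 2
beta_ridge <- solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% y   # (X'X + λI)^{-1} X'y
beta_ridge

# glmnet penalizes RSS/(2n), so its lambda is on a different scale; dividing by
# n approximately matches the closed-form solution above.
fit <- glmnet(X, y, alpha = 0, lambda = lambda / n,
              standardize = FALSE, intercept = FALSE)
coef(fit)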

More on Coefficient Shrinkage

Let's illustrate why it might be beneficial in some cases to have a biased estimator. This is just an
illustration with some impractical assumptions. Let's assume that β^ follows a normal
distribution with mean 1 and variance 1. This means that the true β=1 and that the true variance
when you do least squares estimation is assumed to equal 1. In practice, we do not know this
distribution.

Instead of β^, we will use a shrinkage estimator for β, denoted β~, which is β^ shrunk by a factor of a (where a is a constant greater than one): β~ = β^ / a.

Take a look at the squared difference between β^ and the true β (= 1). Then compare with the new estimator, β~, and see how close it gets to the true value of 1. We compute the expected squared difference between β~ and 1 because β~ itself is random and we can only talk about it in an average sense. This expected squared loss is our measure of accuracy, and it decomposes into the variance of β~ plus the squared bias.

By shrinking the estimator by a factor of a, the bias is no longer zero: the mean of β~ is 1/a, so the bias is 1/a − 1 and β~ is not an unbiased estimator anymore. The variance of β~ is 1/a².

Therefore, the bigger a gets, the larger the squared bias (1 − 1/a)² becomes; as a goes to infinity, the squared bias approaches 1. At the same time, the variance 1/a² approaches zero. One term goes up while the other goes down, and the expected squared loss is the sum of the two.

The optimum is achieved at a = 2 rather than at a = 1: a = 1 gives the unbiased estimator, while a = 2 is biased but gives a smaller expected loss. In this case, a biased estimator yields better prediction accuracy.
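The calculation behind this illustration can be written out directly; the short R sketch below (not in the original notes) simply evaluates the expected squared loss over a grid of a values.

# With beta^ ~ N(1, 1) and beta~ = beta^ / a:
#   Var(beta~)  = 1 / a^2
#   Bias(beta~) = 1/a - 1
#   expected squared loss MSE(a) = 1/a^2 + (1/a - 1)^2
a   <- seq(1, 10, by = 0.01)
mse <- 1 / a^2 + (1 / a - 1)^2
a[which.min(mse)]   # the minimum is attained at a = 2, as stated above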

The original distribution of β^ has variance 1. When you shrink it by dividing by a constant greater than one, the distribution becomes more concentrated (spikier): the variance decreases because the distribution is squeezed. In the meantime, there is one negative thing going on: the mean has shifted away from 1.
Lasso Regression
Introduction

The word “LASSO” stands for Least Absolute Shrinkage and Selection Operator. It is a
statistical formula for the regularization of data models and feature selection.

LASSO regression, also known as L1 regularization, is a popular technique used in statistical modeling and machine learning to estimate the relationships between variables and make predictions.

The primary goal of LASSO regression is to find a balance between model simplicity and
accuracy. It achieves this by adding a penalty term to the traditional linear regression model,
which encourages sparse solutions where some coefficients are forced to be exactly zero. This
feature makes LASSO particularly useful for feature selection, as it can automatically identify
and discard irrelevant or redundant variables.

What is Lasso Regression?

Lasso regression is a regularization technique. It is used with regression methods to obtain more accurate predictions. This model uses shrinkage, where data values are shrunk towards a central point, such as the mean. The lasso procedure encourages simple, sparse models (i.e. models with fewer parameters). This particular type of regression is well-suited for models showing high levels of multicollinearity, or when you want to automate certain parts of model selection, like variable selection/parameter elimination.

Lasso regression uses the L1 regularization technique. It is particularly useful when we have many features, because it automatically performs feature selection.

Here’s a step-by-step explanation of how LASSO regression works:

1. Linear regression model: LASSO regression starts with the standard linear regression
model, which assumes a linear relationship between the independent variables (features)
and the dependent variable (target). The linear regression equation can be represented as
follows:y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
Where:
 y is the dependent variable (target).

 β₀, β₁, β₂, ..., βₚ are the coefficients (parameters) to be estimated.

 x₁, x₂, ..., xₚ are the independent variables (features).

 ε represents the error term.


2. L1 regularization: LASSO regression introduces an additional penalty term based on the absolute values of the coefficients. The L1 regularization term is the sum of the absolute values of the coefficients multiplied by a tuning parameter λ: L₁ = λ * (|β₁| + |β₂| + ... + |βₚ|) Where:
 λ is the regularization parameter that controls the amount of regularization
applied.
 β₁, β₂, ..., βₚ are the coefficients.

3. Objective function: The objective of LASSO regression is to find the values of the coefficients that minimize the sum of the squared differences between the predicted values and the actual values, while also minimizing the L1 regularization term: Minimize: RSS + L₁ Where:
 RSS is the residual sum of squares, which measures the error between the
predicted values and the actual values.

4. Shrinking coefficients: By adding the L1 regularization term, LASSO regression can shrink the coefficients towards zero. When λ is sufficiently large, some coefficients are
driven to exactly zero. This property of LASSO makes it useful for feature selection, as
the variables with zero coefficients are effectively removed from the model.

5. Tuning parameter λ: The choice of the regularization parameter λ is crucial in LASSO regression. A larger λ value increases the amount of regularization, leading to more
coefficients being pushed towards zero. Conversely, a smaller λ value reduces the
regularization effect, allowing more variables to have non-zero coefficients.

6. Model fitting: To estimate the coefficients in LASSO regression, an optimization algorithm is used to minimize the objective function. Coordinate descent is commonly employed, which iteratively updates each coefficient while holding the others fixed.
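A compact R sketch of these steps (illustrative only, not from the original notes) uses the glmnet package, which implements coordinate descent; the simulated data and settings are assumptions for the example.

# Lasso with glmnet: alpha = 1 gives the L1 penalty, and cv.glmnet() chooses
# lambda by 10-fold cross-validation. Only two of the ten predictors matter.
library(glmnet)
set.seed(2)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)

cvfit <- cv.glmnet(X, y, alpha = 1)       # cross-validation over a grid of lambda values
coef(cvfit, s = "lambda.min")             # irrelevant coefficients come out exactly zero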

Mathematical equation of Lasso Regression


Minimize: Residual Sum of Squares + λ * (sum of the absolute values of the coefficients), i.e. RSS + λ * (|β₁| + |β₂| + ... + |βₚ|)
Where,
 λ denotes the amount of shrinkage.
 λ = 0 implies all features are considered; this is equivalent to linear regression, where only the residual sum of squares is used to build the predictive model.
 λ = ∞ implies no feature is considered, i.e., as λ approaches infinity it eliminates more and more features.
 The bias increases as λ increases.
 The variance increases as λ decreases.
If a regression model uses the L1 regularization technique, it is called lasso regression; if it uses the L2 regularization technique, it is called ridge regression. We will study more about these in the later sections.

L1 regularization adds a penalty that is equal to the absolute value of the magnitude of each coefficient. This type of regularization can result in sparse models with few coefficients: some coefficients may become exactly zero and be eliminated from the model. Larger penalties result in coefficient values that are closer to zero (ideal for producing simpler models). L2 regularization, on the other hand, does not drive coefficients exactly to zero and so does not produce sparse models. Thus, lasso regression is often easier to interpret than ridge regression.

Geometric Interpretation

The lasso performs L1 shrinkage, so that there are "corners" in the constraint region, which in two dimensions corresponds to a diamond. If the sum of squares "hits" one of these corners, then the coefficient corresponding to that axis is shrunk to zero.

As p increases, the multidimensional diamond has an increasing number of corners, and so it is highly likely that some coefficients will be set equal to zero. Hence, the lasso performs shrinkage and (effectively) subset selection.

In contrast with subset selection, Lasso performs a soft thresholding: as the smoothing parameter
is varied, the sample path of the estimates moves continuously to zero.
Tree-based Methods

Decision trees can be used for both regression and classification problems. Here we focus on
classification trees. Classification trees are a very different approach to classification than
prototype methods such as k-nearest neighbors. The basic idea of these methods is to partition
the space and identify some representative centroids.

They also differ from linear methods, e.g., linear discriminant analysis, quadratic discriminant
analysis, and logistic regression. These methods use hyperplanes as classification boundaries.

Classification trees are a hierarchical way of partitioning the space. We start with the entire space
and recursively divide it into smaller regions. In the end, every region is assigned to a class label.

A Medical Example

One big advantage of decision trees is that the classifier generated is highly interpretable. For
physicians, this is an especially desirable feature.

In this example, patients are classified into one of two classes: high risk versus low risk. It is
predicted that the high-risk patients would not survive at least 30 days based on the initial 24-
hour data. There are 19 measurements taken from each patient during the first 24 hours. These
include blood pressure, age, etc.

Here a tree-structured classification rule is generated and can be interpreted as follows:

First, we look at the minimum systolic blood pressure within the initial 24 hours and determine
whether it is above 91. If the answer is no, the patient is classified as high-risk. We don't need to
look at the other measurements for this patient. If the answer is yes, then we can't make a
decision yet. The classifier will then look at whether the patient's age is greater than 62.5 years
old. If the answer is no, the patient is classified as low risk. However, if the patient is over 62.5
years old, we still cannot make a decision and then look at the third measurement, specifically,
whether sinus tachycardia is present. If the answer is no, the patient is classified as low risk. If
the answer is yes, the patient is classified as high risk.

Only three measurements are looked at by this classifier. For some patients, only one
measurement determines the final result. Classification trees operate similarly to a doctor's
examination.
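The medical data above are not available here, but a hedged R sketch with the rpart package and the built-in iris data (used purely as a stand-in) shows how such a tree is grown and read.

# A small classification tree with rpart; iris substitutes for the medical data.
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                           # each row is a split question ending in a leaf's class
plot(fit); text(fit, use.n = TRUE)   # the tree diagram with node counts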

Construct the Tree

Notation

We will denote the feature space by X. Normally X is a multidimensional Euclidean space.


However, sometimes some variables (measurements) may be categorical, such as gender (male or female). CART (Classification and Regression Trees) has the advantage of treating real-valued variables and categorical variables in a unified manner. This is not so for many other classification methods, for instance LDA (Linear Discriminant Analysis).

The input vector, indicated by X ∈ X, contains p features X1, X2, ⋯, Xp.

Tree-structured classifiers are constructed by repeated splits of the space X into smaller and
smaller subsets, beginning with X itself.

We will also need to introduce a few additional definitions:


node, terminal node (leaf node), parent node, child node.

One thing that we need to keep in mind is that the tree represents the recursive splitting of the
space. Therefore, every node of interest corresponds to one region in the original space. Two
child nodes will occupy two different regions and if we put the two together, we get the same
region as that of the parent node. In the end, every leaf node is assigned with a class and a test
point is assigned with the class of the leaf node it lands in.

Additional Notation:

 A node is denoted by t. We will also denote the left child node by tL and the right one
by tR .

 Denote the collection of all the nodes in the tree by T and the collection of all the leaf
nodes by T~.

 A split will be denoted by s. The set of splits is denoted by S.

Let's take a look at how these splits can take place.


The whole space is represented by X.

The Three Elements

The construction of a tree involves the following three general elements:

1. The selection of the splits, i.e., how do we decide which node (region) to split and how to
split it?

2. If we know how to make splits or 'grow' the tree, how do we decide when to declare a
node terminal and stop splitting?

3. We have to assign each terminal node to a class. How do we assign these class labels?

In particular, we need to decide upon the following:

1. The pool of candidate splits that we might select from involves a set Q of binary
questions of the form {Is x∈A?}, A⊆X. Basically, we ask whether our input x belongs to
a certain region, A. We need to pick one A from the pool.
2. The candidate split is evaluated using a goodness of split criterion Φ(s,t) that can be
evaluated for any split s of any node t.

3. A stop-splitting rule, i.e., we have to know when it is appropriate to stop splitting. One
can 'grow' the tree very big. In an extreme case, one could 'grow' the tree to the extent
that in every leaf node there is only a single data point. Then it makes no sense to split
any farther. In practice, we often don't go that far.

4. Finally, we need a rule for assigning every terminal node to a class.

Now, let's get into the details for each of these four decisions that we have to make...
The Impurity Function
Estimate the Posterior Probabilities of Classes in Each Node
Advantages of the Tree-Structured Approach
As we have mentioned many times, the tree-structured approach handles both categorical and
ordered variables in a simple and natural way. Classification trees sometimes do an automatic
stepwise variable selection and complexity reduction. They provide an estimate of the
misclassification rate for a test point. For every data point, we know which leaf node it lands in
and we have estimation for the posterior probabilities of classes for every leaf node. The
misclassification rate can be estimated using the estimated class posterior.

Classification trees are invariant under all monotone transformations of individual ordered
variables. The reason is that classification trees split nodes by thresholding. Monotone
transformations cannot change the possible ways of dividing data points by thresholding.
Classification trees are also relatively robust to outliers and misclassified points in the training
set. They do not calculate an average or anything else from the data points
themselves. Classification trees are easy to interpret, which is appealing especially in medical
applications.

Variable Combinations
So far, we have assumed that the classification tree only partitions the space by hyperplanes
parallel to the coordinate planes. In the two-dimensional case, we only divide the space either by
horizontal or vertical lines. How much do we suffer by such restrictive partitions?

Let's take a look at this example...

Consider, for example, a two-class dataset that is best separated by a sloped (diagonal) line. Splits parallel to the coordinate axes seem inefficient for such a data set: many steps of splits are needed to approximate the result generated by one split along a sloped line.

There are classification tree extensions which, instead of thresholding individual variables, perform LDA for every node.
Or we could use more complicated questions, for instance questions that use linear combinations of variables: {Is a₁X₁ + a₂X₂ + ... + aₚXₚ ≤ c?}

This would increase the amount of computation significantly. Research seems to suggest that using more flexible questions often does not lead to clearly better classification results, and may even lead to worse ones. Overfitting is more likely to occur with more flexible splitting questions. It seems that using the right-sized tree is more important than performing good splits at individual nodes.

Missing Values
We may have missing values for some variables in some training sample points. For instance,
gene-expression microarray data often have missing gene measurements.

Suppose each variable has a 5% chance of being missing, independently of the others. Then for a training data point with 50 variables, the probability of missing at least one variable is 1 − 0.95⁵⁰ ≈ 92.3%! This means that over 90% of the data points will have at least one missing value! Therefore, we cannot simply throw away data points whenever missing values occur.

A test point to be classified may also have missing variables.

Classification trees have a nice way of handling missing values by surrogate splits.

Suppose the best split for node t is s which involves a question on Xm. Then think about what to
do if this variable is not there. Classification trees tackle the issue by finding a replacement split.
To find another split based on another variable, classification trees look at all the splits using all
the other variables and search for the one yielding a division of training data points most similar
to the optimal split. Along the same line of thought, the second-best surrogate split can be found in case both the best variable and its top surrogate variable are missing, and so forth.

One thing to notice is that to find the surrogate split, classification trees do not try to find the
second-best split in terms of goodness measure. Instead, they try to approximate the result of the
best split. Here, the goal is to divide data as similarly as possible to the best split so that it is
meaningful to carry out the future decisions down the tree, which descend from the best split.
There is no guarantee the second best split divides data similarly as the best split although their
goodness measurements are close.
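As an aside (not part of the original notes), the rpart package computes surrogate splits by default; the sketch below, again using iris as a stand-in, shows where they appear in the output.

# Surrogate splits in rpart: summary() lists them for each primary split, and
# rpart.control() adjusts how many are kept and how they are used.
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))
summary(fit)   # each primary split is followed by its "Surrogate splits" section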
Right Sized Tree via Pruning
Bagging and Random Forests
In the past, we have focused on statistical learning procedures that produce a single set of results.
For example:
 A regression equation, with one set of regression coefficients or smoothing parameters.
 A classification or regression tree with one set of leaf nodes.
Model selection is often required, based on a measure of fit associated with each candidate model.

The Aggregating Procedure:

Here the discussion shifts to statistical learning building on many sets of outputs that are
aggregated to produce results. The aggregating procedure makes a number of passes over the
data.
On each pass, inputs X are linked with outputs Y just as before. However, of interest now is the
collection of all the results from all passes over the data. Aggregated results have several
important benefits:
 Averaging over a collection of fitted values can help to avoid overfitting. It tends to
cancel out the uncommon features of the data captured by a specific model. Therefore,
the aggregated results are more stable.
 A large number of fitting attempts can produce very flexible fitting functions.
 Putting the averaging and the flexible fitting functions together has the potential to break
the bias-variance tradeoff.

Revisit Overfitting:

Any attempt to summarize patterns in a dataset risks overfitting. All fitting procedures adapt to the data on hand, so that even if the results are applied to a new sample from the same population, fit quality will likely decline. Hence, generalization can be somewhat risky.

"Optimism increases linearly with the number of inputs or basis functions ..., but decreases as the training sample size increases." -- Hastie, Tibshirani and Friedman.

Decision Tree Example:

Consider decision trees as a key illustration. The overfitting often increases with (1) the number
of possible splits for a given predictor; (2) the number of candidate predictors; (3) the number of
stages which is typically represented by the number of leaf nodes.

When overfitting occurs in a classification tree, the classification error is underestimated; the
model may have a structure that will not generalize well. For example, one or more predictors
may be included in a tree that really does not belong.
Ideally, one would have two random samples from the same population: a training dataset and a test dataset. The fit measure from the test data would be a better indicator of how accurate the classification is. Often, however, there is only a single dataset. The data are then split up into several randomly chosen, non-overlapping partitions of about the same size. With ten partitions, each would be part of the training data in nine analyses and serve as the test data in one analysis. In the same way, 2-fold cross-validation can be used to estimate the cross-validation prediction error for model A and model B; model selection is then based on choosing the one with the smallest cross-validation prediction error, as sketched below.
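A minimal R sketch of this idea (illustrative only; mtcars and the two model formulas are placeholders) compares two candidate models by their cross-validated prediction error.

# 10-fold cross-validation comparing two candidate regression models A and B.
set.seed(3)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
err   <- matrix(NA, k, 2, dimnames = list(NULL, c("A", "B")))

for (i in 1:k) {
  train <- mtcars[folds != i, ]
  test  <- mtcars[folds == i, ]
  fitA  <- lm(mpg ~ wt + hp, data = train)                # candidate model A
  fitB  <- lm(mpg ~ wt + hp + disp + qsec, data = train)  # candidate model B
  err[i, "A"] <- mean((test$mpg - predict(fitA, test))^2)
  err[i, "B"] <- mean((test$mpg - predict(fitB, test))^2)
}
colMeans(err)   # choose the model with the smaller CV prediction error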

Bagging
There is a very powerful idea in the use of subsamples of the data and in averaging over
subsamples through bootstrapping.

Bagging exploits that idea to address the overfitting issue in a more fundamental manner. It was invented by Leo Breiman, who called it "bootstrap aggregating" or simply "bagging" (see the reference: "Bagging Predictors," Machine Learning, 24:123-140, 1996).

In a classification tree, bagging takes a majority vote from classifiers trained on bootstrap
samples of the training data.

Algorithm: Consider the following steps in a fitting algorithm with a dataset having N observations and a binary response variable.
1. Take a random sample of size N with replacement from the data (a bootstrap sample).

2. Construct a classification tree as usual but do not prune.

3. Assign a class to each terminal node, and store the class attached to each case coupled
with the predictor values for each observation.

4. Repeat Steps 1-3 a large number of times.

5. For each observation in the dataset, count the number of trees in which it is classified into each category, out of the total number of trees.

6. Assign each observation to a final category by a majority vote over the set of trees. Thus, if 51% of the time over a large number of trees a given observation is classified as a "1", that becomes its classification.

Although there remain some important variations and details to consider, these are the key steps to producing "bagged" classification trees. The idea of classifying by averaging over the results from a large number of bootstrap samples generalizes easily to a wide variety of classifiers beyond classification trees.
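One way to run these steps in R (a hedged sketch, not from the original notes) is with the randomForest package: setting mtry to the full number of predictors disables the random predictor sampling, which reduces the procedure to bagging. The iris data are a stand-in.

# Bagged classification trees via randomForest with mtry = all predictors.
library(randomForest)
set.seed(4)
bag <- randomForest(Species ~ ., data = iris,
                    mtry = ncol(iris) - 1,   # consider every predictor at each split
                    ntree = 500)
bag$err.rate[500, "OOB"]    # out-of-bag estimate of the misclassification rate
predict(bag, iris[1:5, ])   # majority vote over the 500 trees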

Margins:

Bagging introduces a new concept, "margins." Operationally, the "margin" is the difference between the proportion of times a case is correctly classified and the proportion of times it is incorrectly classified. For example, if, over all trees, an observation is correctly classified 75% of the time, the margin is 0.75 − 0.25 = 0.50.

Large margins are desirable because a more stable classification is implied. Ideally, there should
be large margins for all of the observations. This bodes well for generalization to new data.

Out-Of-Bag Observations:

For each tree, observations not included in the bootstrap sample are called "out-of-bag" observations. These "out-of-bag" observations can be treated as a test dataset and dropped down the tree.

To get a better evaluation of the model, the prediction error is estimated based only on the "out-of-bag" observations. In other words, the averaging for a given observation is done using only the trees for which that observation was not used in the fitting process.

From Bagging to Random Forests


Bagging constructs a large number of trees with bootstrap samples from a dataset. But now, as
each tree is constructed, take a random sample of predictors before each node is split. For
example, if there are twenty predictors, choose a random five as candidates for constructing the
best split. Repeat this process for each node until the tree is large enough. And as in bagging, do
not prune.

Random Forests Algorithm

The random forests algorithm is very much like the bagging algorithm. Let N be the number of
observations and assume for now that the response variable is binary.

1. Take a random sample of size N with replacement from the data (bootstrap sample).

2. Take a random sample without replacement of the predictors.

3. Construct a split by using predictors selected in Step 2.

4. Repeat Steps 2 and 3 for each subsequent split until the tree is as large as desired. Do not
prune. Each tree is produced from a random sample of cases and at each split a random
sample of predictors.

5. Drop the out-of-bag data down the tree. Store the class assigned to each observation
along with each observation's predictor values.

6. Repeat Steps 1-5 a large number of times (e.g., 500).

7. For each observation in the dataset, count the number of trees in which it is classified into each category, out of the total number of trees.

8. Assign each observation to a final category by a majority vote over the set of trees. Thus,
if 51% of the time over a large number of trees a given observation is classified as a "1",
that becomes its classification.
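The same package runs the full random forest algorithm; the sketch below (again using iris as a placeholder, not part of the original notes) differs from the bagging sketch above only in sampling mtry predictors before each split.

# A random forest with a random sample of 2 predictors considered at each split.
library(randomForest)
set.seed(5)
rf <- randomForest(Species ~ ., data = iris, mtry = 2, ntree = 500,
                   importance = TRUE)
rf               # prints the out-of-bag error rate and confusion matrix
importance(rf)   # per-predictor importance, aggregated over the 500 trees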

Why Random Forests Work

Variance reduction: the trees are more independent because of the combination of bootstrap
samples and random draws of predictors.

 It is apparent that random forests are a form of bagging, and the averaging over trees can
substantially reduce instability that might otherwise result. Moreover, by working with a
random sample of predictors at each possible split, the fitted values across trees are more
independent. Consequently, the gains from averaging over a large number of trees
(variance reduction) can be more dramatic.

Bias reduction: a very large number of predictors can be considered, and local feature predictors
can play a role in tree construction.
 Random forests are able to work with a very large number of predictors, even more predictors than there are observations. An obvious gain with random forests is that more information may be brought in to reduce the bias of fitted values and estimated splits.

 There are often a few predictors that dominate the decision tree fitting process because on
the average they consistently perform just a bit better than their competitors.
Consequently, many other predictors, which could be useful for very local features of the
data, are rarely selected as splitting variables. With random forests computed for a large
enough number of trees, each predictor will have at least several opportunities to be the
predictor defining a split. In those opportunities, it will have very few competitors. Much
of the time a dominant predictor will not be included. Therefore, local feature predictors
will have the opportunity to define a split.

Indeed, random forests are among the very best classifiers invented to date (Breiman, 2001a).

Random forests include 3 main tuning parameters.

 Node Size: unlike in decision trees, the number of observations in the terminal nodes of
each tree of the forest can be very small. The goal is to grow trees with as little bias as
possible.

 Number of Trees: in practice, 500 trees is often a good choice.

 Number of Predictors Sampled: the number of predictors sampled at each split would
seem to be a key tuning parameter that should affect how well random forests perform.
Sampling 2-5 each time is often adequate.

Taking Costs into Account

In the example of domestic violence, the following predictors were collected from 500+
households: Household size and number of children; Male / female age (years); Marital duration;
Male / female education (years); Employment status and income; The number of times the police
had been called to that household before; Alcohol or drug abuse, etc.

Our goal is not to forecast new domestic violence, but only those cases in which there is
evidence that serious domestic violence has actually occurred. There are 29 felony incidents, a very small fraction (4%) of all domestic violence calls for service, and they would be extremely difficult to forecast. When logistic regression was applied to the data, not a single incident of serious domestic violence was identified.

There is a need to consider the relative costs of false negatives (fail to predict a felony incident)
and false positives (predict a case to be a felony incident when it is not). Otherwise, the best
prediction would be assuming no serious domestic violence with an error rate of 4%. In random
forests, there are two common approaches. They differ by whether costs are imposed on the data
before each tree is built, or at the end when classes are assigned.

Weighted Classification Votes: After all of the trees are built, one can differentially weight the
classification votes over trees. For example, one vote for classification in the rare category might
count the same as two votes for classification in the common category.

Stratified Bootstrap Sampling: When each bootstrap sample is drawn before a tree is built, one
can oversample one class of cases for which forecasting errors are relatively more costly. The
procedure is much in the same spirit as disproportional stratified sampling used for data
collection (Thompson, 2002).

Using a cost ratio of 10 to 1 for false negatives to false positives favored by the police
department, random forests correctly identify half of the rare serious domestic violence incidents.
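A hedged sketch of the stratified-sampling approach with randomForest follows; the domestic-violence data are not available, so a binary outcome manufactured from iris stands in, and the per-stratum sample sizes are made-up numbers.

# Stratified bootstrap sampling: oversample the rarer class in each bootstrap draw.
library(randomForest)
set.seed(6)
y <- factor(ifelse(iris$Species == "virginica", "yes", "no"))  # artificial "rare" class
x <- iris[, 1:4]
rf_cost <- randomForest(x, y,
                        strata = y,
                        sampsize = c(60, 40),   # per-stratum draws, in the order levels(y) = "no", "yes"
                        ntree = 500)
rf_cost$confusion                               # class-specific error rates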

In summary, with forecasting accuracy as the criterion, bagging is in principle an improvement over decision trees: it constructs a large number of trees with bootstrap samples from a dataset. Random forests are in principle an improvement over bagging: they draw a random sample of predictors to define each split.

Boosting

Boosting, like bagging, is another general approach for improving prediction results for various
statistical learning methods. It is also particularly well suited to decision trees.
