Lecture Notes

Boosting

Boosting is one of the most powerful ideas introduced in the field of machine learning in the
past few decades. It was first introduced in 1997 by Freund and Schapire in the popular
algorithm AdaBoost, which was originally designed for classification problems. Since its
inception, many other boosting algorithms have been developed that also tackle regression
problems, and they have become famous for featuring in the top solutions of many
Kaggle competitions. We will go through the concepts of the most popular boosting
algorithms: AdaBoost, Gradient Boosting and XGBoost.

Introduction to Boosting

Ensemble learning methods combine several individual models into a strong learner, thus
reducing the bias and/or variance of the individual models.
Bagging:
Bagging is one such ensemble method. It creates different training subsets from
the training data by sampling with replacement. Then, a model with the same algorithm and
the same set of hyperparameters is built on each of these subsets. In this way, the same
algorithm with the same set of hyperparameters is exposed to different subsets of the training
data, resulting in slight differences between the individual models. The predictions of these
individual models are combined by taking the average of the values for a regression problem
or a majority vote for a classification problem. Random forest is an example of the bagging
method. Bagging works very well for high-variance models like decision trees.
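
For context, here is a minimal bagging sketch using scikit-learn's RandomForestClassifier on
synthetic data (the dataset and hyperparameter values are purely illustrative):

```python
# Minimal bagging illustration with scikit-learn (illustrative values).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data stands in for a real training set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the training data
# (sampling with replacement); predictions are combined by majority vote.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```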

Boosting:
Boosting is another popular approach to ensembling. This technique combines individual
models into a strong learner by building them sequentially, such that the final model has
higher accuracy than the individual models. The individual models are connected in such a
way that each subsequent model depends on the errors of the previous models and tries to
correct them.

A weak learner, on the other hand, refers to a simple model that performs at least slightly
better than a random guesser (its error rate should be less than 0.5). It primarily identifies only
the prominent pattern(s) present in the data and thus is not capable of overfitting. Weak
learners are combined sequentially such that each subsequent model corrects the mistakes
of the previous model, resulting in a strong overall model that gives good predictions.
Through weak learners we can:
 Reduce the variance of the final model, making it more robust (generalisable)
 Train the ensemble quickly, resulting in faster computation time

An overview of Adaboost:
Adaboost starts with a uniform distribution of weights over the training examples, i.e. it gives
equal weight to all its observations. These weights indicate the importance of each data point
being considered. We start with a single weak learner to make the initial predictions. Once
the initial predictions are made, patterns that were not captured by the previous weak
learner are taken care of by the next weak learner by giving more weightage to the
misclassified data points.

Apart from giving weightage to each observation, the model also gives weightage to each
weak learner: the more errors a weak learner makes, the less weightage it is given. This
helps when the ensembled model makes its final predictions.

After getting these two weights - for the observations and the individual weak learners, the
next model (weak learner) in the sequence trains on the resampled data (data sampled
according to the weights) to make the next prediction.

The model will iteratively continue the steps mentioned above for a pre-specified number
of weak learners. In the end, we take a weighted sum of the predictions from all these
weak learners to get an overall strong learner.

The say/importance each weak learner, in our case the decision tree stump, has in the final
classification depends on the total error it made.
α = 0.5 ln ((1 − Total error)/Total error)
You can see the relation between α (alpha) and the error rate in the following graph.
[Figure: plot of α against the total error rate.]

The error rate lies between 0 and 1, so let us see how α and the error are related. When
the base model performs with a low overall error then, as you can see in the plot above, α
is a large positive value, which means that the weak learner will have a high say in the final
model. If the error is 0.5, the learner is essentially unsure of its decisions, so α = 0, i.e. the
weak learner will have no say or significance in the final model.
If the model produces large errors (i.e. an error close to 1), then α is a large negative value,
meaning that the predictions it makes are incorrect most of the time. Hence, this weak learner
will have a very low say in the final model.
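
To make this relationship concrete, the short sketch below evaluates
α = 0.5 ln ((1 − error)/error) for a few error rates (the error values are purely illustrative):

```python
# alpha = 0.5 * ln((1 - error) / error): large and positive for accurate
# learners, zero at error = 0.5, negative for learners worse than random.
import numpy as np

for error in (0.1, 0.3, 0.5, 0.7, 0.9):
    alpha = 0.5 * np.log((1 - error) / error)
    print(f"error = {error:.1f} -> alpha = {alpha:+.3f}")
```
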
After calculating the say/importance of each weak learner, we find the new weights of
each observation present in the training dataset. The following formula is used to compute
the new weight for each observation:
New sample weight for the incorrectly classified = original sample weight * e^(α)
New sample weight for the correctly classified = original sample weight * e^(-α)
After calculating the new weights, we need to normalize them in order to proceed further,
using the following formula:
Normalized sample weight = new sample weight / (sum of all new sample weights)

The samples which the previous stump incorrectly classified will be given higher weights and
the ones which the previous stump classified correctly will be given lower weights.
Next, a new stump will be created by randomly sampling the weighted observations. Due to
the weights given to each observation, the new dataset will have a tendency to contain
multiple copies of the observations that were misclassified by the previous tree and may not
contain all observations which were correctly classified. This will essentially help the next
weak learner to give more importance to the incorrectly classified sample so that it can
correct the mistake and classify it correctly this time. This process is repeated until a pre-
specified number of trees is built, i.e. until the ensemble is complete.
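
A minimal NumPy sketch of one such weight update and resampling step; the labels, stump
predictions and random seed below are made up purely for illustration:

```python
# One AdaBoost weight update and resampling step (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
y_true = np.array([1, 1, -1, -1, 1])             # true labels
y_pred = np.array([1, -1, -1, 1, 1])             # stump predictions (2 mistakes)
weights = np.full(len(y_true), 1 / len(y_true))  # uniform initial weights

# Weighted error and the stump's say (alpha).
misclassified = y_true != y_pred
error = np.sum(weights[misclassified])
alpha = 0.5 * np.log((1 - error) / error)

# Increase weights of misclassified points, decrease the rest, then normalise.
weights = weights * np.exp(alpha * np.where(misclassified, 1.0, -1.0))
weights = weights / weights.sum()

# Resample the training indices according to the new weights; misclassified
# points tend to appear multiple times in the new sample.
resampled_idx = rng.choice(len(y_true), size=len(y_true), replace=True, p=weights)
print("new weights:", np.round(weights, 3))
print("resampled indices:", resampled_idx)
```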

The AdaBoost model makes predictions by having each tree in the ensemble classify the
sample. Then, we split the trees into groups according to their decisions and, for each group,
add up the say/significance of every tree inside the group. The class of the group with the
larger total say is the final prediction; for ±1 labels, this is simply the sign of the weighted
sum of the individual predictions.

Here is the summary of the AdaBoost algorithm we have studied until now:

1. Initialize the weights of the distribution as wi = 1/n, where n is the number of data
points.
2. For t = 1 to T, repeat the following (T is the total number of trees):
   a. Fit a tree ht(x) on the training data using the respective weights.
   b. Compute the weighted error εt as the sum of the weights of the misclassified points
      divided by the sum of all the weights.
   c. Compute the say of the tree, αt = 0.5 ln ((1 − εt)/εt).
   d. Update the weights: multiply the weights of the misclassified points by e^(αt) and the
      weights of the correctly classified points by e^(−αt), then normalize them so that they
      sum to 1.
3. Final model: F(x) = sign(α1·h1(x) + α2·h2(x) + … + αT·hT(x))

The point here is to force classifiers to concentrate on observations that are difficult to
classify correctly.
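
In practice, this whole loop is available off the shelf. Below is a minimal sketch using
scikit-learn's AdaBoostClassifier, whose default weak learner is a decision stump (the dataset
and the value of n_estimators are illustrative):

```python
# AdaBoost with decision stumps via scikit-learn (illustrative values).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# T = 50 sequential weak learners; each new stump focuses on the points the
# previous stumps misclassified, and the final prediction is a weighted vote.
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print("Test accuracy:", ada.score(X_test, y_test))
```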

Summary of the concept behind the AdaBoost algorithm:

 Adaboost starts with a uniform distribution of weights over training examples.
 These weights tell the importance of the data point being considered.
 We first start with a weak learner h1(x) to create the initial prediction.
 Patterns that were not captured by the previous models become the focus of the next
model; this is achieved by giving more weightage to the misclassified points.
 The next model (weak learner) trains on this resampled data to create the next
prediction.
 In the end, we take a weighted sum of all the weak classifiers to make a strong
classifier.

Gradient Boosting

In this session, we will learn about another popular boosting algorithm, Gradient Boosting, and
a modification of Gradient Boosting called XGBoost, which is widely used in the industry.

An overview of Gradient Boosting:

Gradient Boosting, like AdaBoost, trains many models in a gradual, additive and sequential
manner. In AdaBoost, we give more say or weight to the data points that were
misclassified/wrongly predicted earlier. Gradient Boosting achieves the same effect by using
the gradients of the loss function. The loss function is a measure indicating how well the model
is able to fit the underlying data. In the case of a regression problem, the loss function
would be the error between the true and predicted target values. For classification problems,
the loss function would be a measure of how good our predictive model is at classifying
each sample point that we have in our dataset.

Here are the broader points on how a GBM learns (a code sketch follows this list):

 Build the first weak learner using a sample from the training data; we will consider a
decision tree as the weak learner or the base model. It need not be a stump; it can be a
somewhat larger tree, but it will still be weak, i.e. not fully grown.
 Then the predictions are made on the training data using the decision tree just built.
 The gradients, in our case the residuals, are computed, and these residuals become the
new response or target values for the next weak learner.
 A new weak learner is built with the residuals as the target values and a sample of
observations from the original training data.
 Add the predictions obtained from the current weak learner to the predictions
obtained from all the previous weak learners. The predictions obtained at each step
are multiplied by the learning rate so that no single model makes a huge
contribution to the ensemble thereby avoiding overfitting. Essentially, with the
addition of each weak learner, the model takes a very small step in the right
direction.
 The next weak learner fits on the residuals obtained so far, and these steps are
repeated, either for a pre-specified number of weak learners or until the model starts
overfitting, i.e. it starts to capture the niche patterns of the training data.
 GBM makes the final prediction by simply adding up the predictions from all the
weak learners (multiplied by the learning rate).
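
A minimal from-scratch sketch of this loop for regression with the square loss, using shallow
scikit-learn regression trees as the weak learners (the dataset, tree depth, number of trees and
learning rate are all illustrative choices):

```python
# Gradient boosting for regression with the square loss, written out by hand.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

learning_rate = 0.1
n_trees = 100

# The crude initial model is simply the mean of the target values.
prediction = np.full(len(y), y.mean())
trees = []

for _ in range(n_trees):
    # The residuals (negative gradients of the square loss) become the target
    # for the next weak learner.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X, residuals)
    trees.append(tree)
    # Take a small step in the right direction.
    prediction += learning_rate * tree.predict(X)

print("Training MSE:", np.mean((y - prediction) ** 2))
```

For a new data point, the overall prediction is the initial mean plus the learning rate times the
sum of the individual tree predictions.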

The algorithm behind the GBM model:

At any iteration, we repeat the following steps in the Gradient Boosting scheme of things:

1. Initialize a crude initial function F0 as
   F0(x) = argmin over γ of Σi L(yi, γ)
   For the square loss, this is simply the average value of the target, which forms our first
   prediction.
2. For m = 1 to M (where M is the number of trees):
   a. Calculate the pseudo-residuals
      rim = − [∂L(yi, F(xi)) / ∂F(xi)], evaluated at F = Fm−1
      i.e. the pseudo-residuals are the negative gradients of the loss function for all the
      data points.
   b. Fit a base learner hm(x) to the pseudo-residuals, i.e. train it using the training
      set {(xi, rim)} for i = 1, …, n. Here the pseudo-residuals are used as the
      response variable.
   c. Compute the step magnitude multiplier
      γm = argmin over γ of Σi L(yi, Fm−1(xi) + γ·hm(xi))
      (in the case of tree models, a different γm is computed for every leaf/prediction). In
      other words, introducing the step multiplier results in a small step in the right
      direction.
   d. Compute the next model:
      Fm(x) = Fm−1(x) + γm·hm(x)
3. The final model is FM(x) = F0(x) + Σm γm·hm(x).

To summarise, the gradient boosting algorithm has two steps at each iteration:

1. Find the residuals and fit a new model on the residuals.
2. Add the new model to the older model and continue to the next iteration.

Gradient Descent:

You have learned that in the gradient descent algorithm, a function G(x) decreases fastest
around a point if one moves along the negative gradient of G(x). It follows the iterative
process:
xn+1 = xn − α · (dG/dx evaluated at xn)
where α is the learning rate. Now if the function is G(x, y), we have the following formulation:
xn+1 = xn − α · (∂G/∂x evaluated at (xn, yn))
yn+1 = yn − α · (∂G/∂y evaluated at (xn, yn))

Note that ∂/∂x refers to the partial derivative with respect to x.

We see that the negative gradient of the square loss function gives the residual. So the
strategy of closing in on the residuals at each step can be seen as taking a step along the
negative gradient of the loss function. We can observe here that gradient descent helps in
generalising the boosting process.
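
As a quick numeric sanity check of this claim (assuming the square loss is written as
L(y, F) = 0.5 (y − F)^2; the numbers below are arbitrary):

```python
# The negative gradient of the square loss 0.5 * (y - F)**2 with respect to
# the current prediction F equals the residual y - F.
y, F = 3.0, 2.2
eps = 1e-6

def loss(f):
    return 0.5 * (y - f) ** 2

numeric_grad = (loss(F + eps) - loss(F - eps)) / (2 * eps)  # central difference
print("residual        :", y - F)           # 0.8
print("-dL/dF (numeric):", -numeric_grad)   # approximately 0.8
```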

Gradient Boosting Algorithm:

The negative gradients at an iteration m act as the target variable for fitting the model hm; for
a regression setting with the square loss function, this target variable is simply the residual. At
each iteration we add an incremental model, which fits on the negative gradients of the loss
function evaluated at the current predictions. We stop when we see that the gradients are
very close to zero. For a regression setting, this means that when the residuals are very
close to zero, we stop iterating. Here, γm, known as the step multiplier, is typically between
0 and 1.

In other words, introducing the step multiplier results in a small step in the right direction.

γm corresponds to the hyperparameter α in terms of utility (both determine by what
amount an update should be made), but here γm is learned from the data while α is a
hyperparameter.

Here, at each iteration, a new model (weak learner) is added, and with each new addition we
can observe a reduction in the pseudo-residuals, which is an indication that we are moving in
the right direction, towards the target values.
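
For completeness, a minimal library-level sketch of the same procedure with scikit-learn's
GradientBoostingRegressor (the hyperparameter values are illustrative):

```python
# Gradient boosting via scikit-learn (illustrative hyperparameters).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 200 shallow trees fits the negative gradients (the residuals,
# for the square loss) of the current ensemble; learning_rate keeps each
# added tree a small step in the right direction.
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print("Test R^2:", gbm.score(X_test, y_test))
```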

Extreme Gradient Boosting (XGBoost):

Extreme Gradient Boosting (XGBoost) follows the same framework as gradient boosting but is
a more efficient and advanced implementation of the Gradient Boosting algorithm.

 AdaBoost is an iterative way of adding weak learners to form the final model. For
this, each model is trained to correct the errors made by the previous one. This
sequential model does this by adding more weight to cases with incorrect
predictions. By this approach, the ensemble model will correct itself while learning
by focusing on cases/data points that are hard to predict correctly.
 Next, we will talk about Gradient Boosting. You learned about gradient descent in
the earlier module. The same principle applies here as well, where the newly added
trees are trained to reduce the errors (loss function) of earlier models. So, overall, in
gradient boosting, we are optimizing the performance of the boosted model by
bringing down the loss one small step at a time.
 XGBoost is an extended version of gradient boosting, where it uses more accurate
approximations to tune the model and find the best fit.

XGBoost advantages:

1. Parallel computing: when you run XGBoost, by default it uses all the cores of
your laptop/machine, enabling parallel computation.
2. Regularization: the biggest advantage of XGBoost is that it uses regularization to
control overfitting and keep the model simple, which gives it better performance.
3. Enabled cross-validation: XGBoost comes with an internal cross-validation function.
4. Missing values: XGBoost is designed to handle missing values internally. The missing
values are treated in such a manner that if there exists any trend in the missing values, it
is captured by the model.
5. Flexibility: XGBoost is not just limited to regression, classification and ranking
problems; it supports user-defined objective functions and user-defined evaluation
metrics as well. (Some of these points are illustrated in the sketch after this list.)
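
A minimal usage sketch with the xgboost Python package illustrating the parallelism,
regularization and missing-value handling mentioned above (the data, injected NaNs and
parameter values are purely illustrative):

```python
# XGBoost usage sketch: parallel training, regularisation and native handling
# of missing values (all values below are illustrative).
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X[::17, 3] = np.nan                      # inject some missing values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = xgb.XGBClassifier(
    n_estimators=200,       # number of trees T
    learning_rate=0.1,      # shrinkage
    max_depth=3,
    reg_lambda=1.0,         # L2 regularisation on the leaf weights
    n_jobs=-1,              # use all available cores
    random_state=42,
)
clf.fit(X_train, y_train)   # NaNs are handled internally by XGBoost
print("Test accuracy:", clf.score(X_test, y_test))
```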

XGBoost Hyperparameters - Learning Rate, Number of Trees and Subsampling

λt, the learning rate, is also known as shrinkage. It can be used to regularize the gradient
tree boosting algorithm. λt typically varies from 0 to 1. Smaller values of λt require a larger
number of trees T (called n_estimators in the Python package XGBoost). This is
because, with a slower learning rate, you need a larger number of trees to reach the
minima. This, in turn, leads to longer training time.

On the other hand, if λt is large, we may reach the same point with a smaller number of trees
(n_estimators), but there is the risk that we might actually miss the minima altogether (i.e.
cross over it) because of the long stride we are taking at each iteration.

Some other ways of regularization are explicitly specifying the number of trees T and
doing subsampling. Note that you shouldn't tune both λt and number of trees T together
since a high λt implies a low value of T and vice-versa.

Subsampling is training the model in each iteration on a fraction of data (similar to how
random forests build each tree). A typical value of subsampling is 0.5 while it ranges from 0
to 1. In random forests, subsampling is critical to ensure diversity among the trees, since
otherwise, all the trees will start with the same training data and therefore look similar. This
is not a big problem in boosting since each tree is anyway built on the residual and gets a
significantly different objective function than the previous one.

Gamma is a parameter used for controlling the pruning of the tree. A node is split only
when the resulting split gives a positive reduction in the loss function. Gamma specifies the
minimum loss reduction required to make a split and makes the algorithm more conservative.
The values can vary depending on the loss function and should be tuned.
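
A sketch of tuning some of these hyperparameters with cross-validation using scikit-learn's
GridSearchCV (the grid values are illustrative; the learning rate is held fixed since, as noted
above, it should not be tuned together with the number of trees):

```python
# Tuning XGBoost hyperparameters with cross-validation (illustrative grid).
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

param_grid = {
    "n_estimators": [100, 300, 500],   # number of trees T
    "subsample": [0.5, 0.8, 1.0],      # fraction of rows sampled per tree
    "gamma": [0, 1, 5],                # minimum loss reduction to make a split
}

# The learning rate is held fixed: it trades off directly against the number
# of trees, so the two are usually not tuned together.
base = xgb.XGBClassifier(learning_rate=0.1, max_depth=3, n_jobs=-1, random_state=42)
search = GridSearchCV(base, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```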

Conclusion:

We went through Gradient Boosting and a modification of it, XGBoost. We were introduced
to the intuition of gradient descent in the regression setting, where we used the square loss
function to close in on the residuals. From there, we developed the Gradient Boosting
algorithm, in which each subsequent model trains on the negative gradients of the loss
function. XGBoost follows the same procedure on trees, with additional features like
regularization and a parallel tree learning algorithm for finding the best split.
