Professional Documents
Culture Documents
Boosting
Boosting
Boosting
Boosting
Boosting is one of the most powerful ideas introduced in the field of machine learning in the
past few years. It was first introduced in 1997 by Freund and Schapire in the popular
algorithm, AdaBoost. It was originally designed for classification problems. Since its
inception, many other boosting algorithms have been developed that tackle regression
problems also and have become famous as they are used in the top solutions of many
Kaggle competitions. We will go through the concepts of the most popular boosting
algorithms - AdaBoost, Gradient Boosting and XGBoost.
Introduction to Boosting
Ensemble learning methods create a strong learner thus reducing the bias and/or variance
of the individual models.
Bagging:
Bagging is one such ensemble model which creates different training subsets from
the training data, with replacement. Then, an algorithm with the same set of
hyperparameters is built on these different subsets of data. In this way, the same algorithm
with a similar set of hyperparameters is exposed to different subsets of the training data,
resulting in a slight difference between the individual models. The predictions of these
individual models are combined by taking the average of all the values for regression or a
majority vote for a classification problem. Random forest is an example of the bagging
method. Bagging works very well for high-variance models like decision trees.
Weak learner, on the other hand, refers to a simple model which performs at least better
than a random guesser (the error rate should be lesser than 0.5). It primarily identifies only
the prominent pattern(s) present in the data and thus is not capable of overfitting. Weak
learners are combined sequentially such that each subsequent model corrects the mistakes
of the previous model, resulting in a strong overall model that gives good predictions.
Through weak learners we can:
Reduce the variance of the final model, making it more robust(generalisable)
Train the ensemble quickly resulting in faster computation time
An overview of Adaboost:
Adaboost starts with a uniform distribution of weights over training examples i.e. it gives
equal weights to all its observations. These weights tell the importance of each datapoint
being considered. We start with a single weak learner to make the initial predictions. Once
the initial predictions are made, patterns that were not captured by the previous weak
learner are taken care of by the next weak learner by giving more weightage to the
Apart from giving weightage to each observation, the model also gives weightage to each
weak learner. More the errors in the weak learner, lesser is the weightage given to it. This
helps when the ensembled model makes final predictions.
After getting these two weights - for the observations and the individual weak learners, the
next model (weak learner) in the sequence trains on the resampled data (data sampled
according to the weights) to make the next prediction.
The model will iteratively continue the steps mentioned above for a pre-specified number
of weak learners. In the end, we take a weighted sum of the predictions from all these
weak learners to get an overall strong learner.
The value of error rate lies between 0 & 1 so let’s see how alpha & error is related. When
the base model performs with less error overall then, as you can see in the plot above, the α
is a large positive value, which means that the weak learner will have a high say in the final
model. If the error is 0.5, it means that it is not sure of the decision, then the α =0, i.e the
weak learner will have no say or significance in the final model.
If the model produces large errors (i.e close to 1), then α is a large negative value, meaning
that the predictions it makes is incorrect most of the time. Hence this weak learner will have
a very low say in the final model.
After calculating the say/importance of each weak learner, we find the new weights of
each observation present in the training dataset. The following formula is used to compute
the new weight for each observation:
New sample weight for the incorrectly classified = original sample weight * e^(α)
New sample weight for the correctly classified = original sample weight * e^(-α)
© Copyright 2020. upGrad Education Pvt. Ltd. All rights reserved
After calculating we need to normalize these values in order to proceed further, using the
following formula:
The samples which the previous stump incorrectly classified will be given higher weights and
the ones which the previous stump classified correctly will be given lower weights.
Next, a new stump will be created by randomly sampling the weighted observations. Due to
the weights given to each observation, the new dataset will have a tendency to contain
multiple copies of the observations that were misclassified by the previous tree and may not
contain all observations which were correctly classified. This will essentially help the next
weak learner to give more importance to the incorrectly classified sample so that it can
correct the mistake and correctly classify it now. This process will be repeated till a pre-
specified number of trees are built i.e. the ensemble is built.
The AdaBoost model makes predictions by having each tree in the ensemble classify the
sample. Then, we split the trees into groups according to their decisions. For each group, we
add up the significance of every tree inside the group. The final prediction made by the
ensemble as a whole is determined by the sign of the weighted sum.
b. Compute
c. Compute
d. Update
3. Final Model:
The point here is to force classifiers to concentrate on observations that are difficult to
classify correctly.
Gradient Boosting
In this session, we will learn another popular boosting algorithm - Gradient Boosting and a
modification of Gradient Boosting called XGBoost which is widely used in the industry.
Gradient Boosting like AdaBoost trains many models in a gradual, additive, and sequential
manner. In AdaBoost, we provide more say or weight to those data points which are
misclassified/ wrongly predicted earlier. Gradient boosting performs the same by using
gradients in the loss function. The loss function is a measure indicating how good the model
is able to fit the underlying data. In the case of a regression problem, the loss function
would be the error between true and predicted target values. For classification problems,
the loss function would be a measure of how good our predictive model is at classifying
each sample point that we have in our dataset.
Build the first weak learner using a sample from the training data; we will consider a
decision tree as the weak learner or the base model. It may not necessarily be a
stump, can grow a bigger tree but will still be weak i.e. still not be fully grown.
Then the predictions are made on the training data using the decision tree just built.
The gradient, in our case the residuals are computed and these residuals are the
new response or target values for the next weak learner.
A new weak learner is built with the residuals as the target values and a sample of
observations from the original training data.
Add the predictions obtained from the current weak learner to the predictions
obtained from all the previous weak learners. The predictions obtained at each step
are multiplied by the learning rate so that no single model makes a huge
contribution to the ensemble thereby avoiding overfitting. Essentially, with the
At any iteration t, we repeat the following steps in the Gradient Boosting scheme of things:
the pseudo residuals are the negative gradients for all data points
2. Fit a base learner hm(x) to the pseudo-residuals, i.e train it using the training
Gradient Descent:
You have learned that in the gradient descent algorithm, a function G(x) decreases fastest
around a point a if one goes along the negative gradient of G(x). It follows the iterative
process:
We see that the negative gradient of the square loss function gives the residue. So the
strategy of closing in on the residues at each step can be seen as taking a step towards the
negative gradient of the loss function. We can observe here that gradient descent helps in
generalising the boosting process.
The negative gradients at an iteration t act as a target variable to fit a model hm and for a
regression setting with a square loss function, this target variable was the residue. at each
iteration we add an incremental model, which fits on the negative gradients of the loss
function evaluated at current target values. We stop when we see that the gradients are
very close to zero. For a regression setting, this means that when the residuals are very
close to zero, we stop iterating. Here, , known as the step multiplier, is typically between
0 and 1.
Here at each iteration, a new model (weak learner) is added and with each new addition, we
can observe a reduction in the pseudo residuals which is an indication that we are moving in
the right direction of the target values.
Extreme Gradient Boosting (XGBoost) is similar to the gradient boosting framework but
more efficient and advanced implementation of the Gradient Boosting algorithm.
XGBoost advantages:
1. Parallel Computing: when you run xgboost, by default, it would use all the cores of
your laptop/machine enabling its capacity to do parallel computation
2. Regularization: The biggest advantage of xgboost is that it uses regularization and
controls the overfitting and simplicity of the model which gives it better
performance.
3. Enabled Cross-Validation: XGBoost is enabled with internal Cross Validation function
4. Missing Values: XGBoost is designed to handle missing values internally. The missing
values are treated in such a manner that if there exists any trend in missing values, it
is captured by the model.
5. Flexibility: XGBoost is not just limited to regression, classification, and ranking
problems, it supports user-defined objective functions as well. Furthermore, it
supports user-defined evaluation metrics as well.
λt, the learning rate, is also known as shrinkage. It can be used to regularize the gradient
tree boosting algorithm. λt typically varies from 0 to 1. Smaller values of λt lead to a larger
value of a number of trees T (called n_estimators in the Python package XGBoost). This is
because, with a slower learning rate, you need a larger number of trees to reach the
minima. This, in turn, leads to longer training time.
On the other hand, if λt is large, we may reach the same point with a lesser number of trees
(n_estimators), but there's the risk that we might actually miss the minima altogether (i.e.
cross over it) because of the long stride we are taking at each iteration.
Some other ways of regularization are explicitly specifying the number of trees T and
doing subsampling. Note that you shouldn't tune both λt and number of trees T together
since a high λt implies a low value of T and vice-versa.
,Gamma is a parameter used for controlling the pruning of the tree. A node is split only
when the resulting split gives a positive reduction in the loss function. Gamma specifies the
minimum loss reduction required to make a split and makes the algorithm conservative. The
values can vary depending on the loss function and should be tuned.
Conclusion:
We went through Gradient Boosting and a modification of it, XGBoost. We were introduced
to the intuition of gradient descent in the regression setting when we used the square loss
function to close in on the residues. We developed from here on the Gradient Boosting
algorithm in which each subsequent model trains on the gradients of the loss function.
XGBoost follows the same procedure on trees with additional features like regularization
and parallel tree learning algorithm for finding the best split.