Getting Started With Gradient Boosting Machines - Using XGBoost and LightGBM Parameters
“My only goal is to gradient boost over myself of yesterday. And to repeat this every day with an unconquerable spirit.” Photo by Jake Hills
Psst… A confession: I have, in the past, used and tuned models without really knowing what they do. I tried to do the same with Gradient Boosting Machines — LightGBM and XGBoost — and it was… frustrating!

This technique (or rather laziness) works fine for simpler models like linear regression and decision trees. They have only a few hyperparameters — learning_rate, no_of_iterations, alpha, lambda — and it’s easy to know what they mean.
GBMs, on the other hand, have a whole lot of hyperparameters. And to top it off, unlike Random Forest, their default settings are often not the optimal ones!
So, if you want to use GBMs for modelling your data, I believe you have to get at least a high-level understanding of what happens on the inside. You can’t get away with using it as a complete black box.
As you might have guessed already, I am not going to dive into the math in
this article. But if you are interested, I will post some good links that you
could follow if you want to make the jump.
. . .
You know how, with each passing day, we aim to improve ourselves by focusing on the mistakes of yesterday? Well, you know what? — GBMs do that too!
You: Wait, what “predictors” are you talking about? Isn’t GBM a predictor itself?

Me: Yes, it is. This is the first thing that you need to know about a Gradient Boosting Machine — it is a predictor built out of many smaller predictors, an ensemble of simpler predictors. These predictors can be any regression or classification model. Each GBM implementation, be it LightGBM or XGBoost, allows us to choose one such simple predictor. Oh hey! That brings us to our first parameter — boosting_type in LightGBM, booster in XGBoost.
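For a concrete picture, here is a minimal sketch of choosing the base predictor through each library's sklearn-style wrapper (the option strings below are from each library's docs; the settings shown are just for illustration):

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# LightGBM: boosting_type chooses the boosting flavour / base predictor.
# "gbdt" (the default) means gradient-boosted decision trees; "dart",
# "goss" and "rf" are the other documented options.
lgbm = LGBMClassifier(boosting_type="gbdt")

# XGBoost: booster plays the same role. "gbtree" (the default) uses
# decision trees; "gblinear" uses linear models, "dart" adds dropout.
xgb = XGBClassifier(booster="gbtree")
```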
Me: Yep, you guessed it right — this is the piece where we see how GBMs
improve themselves by “focusing on the mistakes of yesterday”!
So, a GBM basically creates a lot of individual predictors and each of them
tries to predict the true label. Then, it gives its final prediction by averaging
all those individual predictions.
But we aren’t talking about our normal 3rd-grade average over here; we mean a weighted average. GBM assigns a weight to each of the predictors, which determines how much it contributes to the final result — the higher the weight, the greater its contribution.
Each predictor in the ensemble is built sequentially, one after the other —
with each one focusing more on the mistakes of its predecessors.
Each time a new predictor is trained, the algorithm assigns higher weights (from a new set of weights) to the training instances that the previous predictor got wrong. So this new predictor now has more incentive to get the more difficult predictions right.
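In the gradient-boosting flavour specifically, those “mistakes” take the form of residuals: each new tree is fit to whatever error the ensemble built so far still makes. Here is a toy from-scratch sketch for squared error (an illustration of the idea, not either library's actual implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=100, learning_rate=0.1):
    """Toy gradient boosting for squared error: each weak tree is fit to
    the residuals (the 'mistakes') of the ensemble built so far."""
    base = y.mean()
    pred = np.full(len(y), base)        # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residuals = y - pred            # what we still get wrong
        tree = DecisionTreeRegressor(max_depth=3)  # a weak predictor
        tree.fit(X, residuals)
        pred += learning_rate * tree.predict(X)    # shrunken correction
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X, learning_rate=0.1):
    # Final prediction = base value + the sum of all the small corrections
    return base + learning_rate * sum(t.predict(X) for t in trees)
```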
You: But hey, “averaging the predictions made by lots of predictors”… isn’t that Random Forest?

Me: Ahha… nice catch! “Averaging the predictions made by lots of predictors” is in fact what an ensemble technique is. And random forests and gradient boosting machines are two types of ensemble techniques.

But unlike GBMs, the predictors built in a Random Forest are independent of each other. They aren’t built sequentially but rather in parallel. So there are no weights for the predictors in a Random Forest; they are all created equal.
You should check out the concept of bagging and boosting. Random Forest is a
bagging algorithm while Gradient Boosting Trees is a boosting algorithm.
Simple right?
How does the model decide the number of predictors to put in?
— Through a hyperparameter, of course: n_estimators in the sklearn-style wrappers (num_iterations in LightGBM’s native API, num_boost_round in XGBoost’s).
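In practice, a common way to set it is to give a generous budget of trees and let early stopping on a validation set pick the actual number. A sketch, assuming a reasonably recent LightGBM (the callbacks API):

```python
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Ask for many trees, then stop once validation loss hasn't improved
# for 50 consecutive rounds; best_iteration_ tells us how many were kept.
model = LGBMClassifier(n_estimators=1000, learning_rate=0.05)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
print(model.best_iteration_)
```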
You: But I have another question. Repeatedly building new models to focus on the mislabelled examples — isn’t that a recipe for overfitting?

Me: Oh, that is just ENLIGHTENED! If you were thinking about this… you got it! It is, in fact, one of the main concerns with using Gradient Boosting Machines. But modern implementations like LightGBM and XGBoost are pretty good at handling that by creating weak predictors.
3. “Weak Predictors”
A weak predictor is a simple prediction model that performs only slightly better than random guessing. Now, we want the individual predictors inside a GBM to be weak, so that the overall GBM model can be strong.
You: Wait, why would we want the predictors to be weak?

Me: Well, it might sound backwards, but this is how GBM creates a strong model. It also helps it avoid overfitting. Read on to see how…
Since every predictor focuses on the observations that the one preceding it got wrong, when we use a weak predictor, these mislabelled observations tend to still carry some learnable signal that the next predictor can pick up.

Whereas, if the predictor were already strong, the mislabelled observations would likely just be noise or nuances of that particular sample of data. In such a case, the model would simply be overfitting to the training data.

Also note that if the predictors are just too weak, it might not even be possible to build a strong ensemble out of them.
These are the parameters that we need to tune to make the right predictors
(as discussed before, these simple predictors are decision trees):
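The usual knobs, per each library's documentation, are the ones that cap how big and how fine-grained each tree can get. A minimal sketch with the sklearn-style wrappers (the values shown are the documented defaults):

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# LightGBM grows trees leaf-wise, so num_leaves is its main size control;
# min_child_samples blocks splits that would leave too few rows in a leaf.
lgbm = LGBMClassifier(
    num_leaves=31,         # max leaves per tree
    max_depth=-1,          # -1 means no explicit depth limit
    min_child_samples=20,  # min rows required in a leaf
)

# XGBoost grows trees depth-wise, so max_depth is its main size control;
# min_child_weight plays a similar "don't split too finely" role.
xgb = XGBClassifier(max_depth=6, min_child_weight=1)
```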
The subtree marked in red has a leaf node with a single sample in it, so that subtree can’t be generated, as 1 < `min_child_samples` in the case above.
Note: These are the parameters that you can tune to control
overfitting.
. . .
That is it. Now you have a nice overview of the whole story of how a GBM
works!
But before I leave you, I would like you to know about a few more parameters. These parameters don’t quite fit into the story I stitched together above, but they are used to squeeze out some extra performance.
I will also talk about how to tune this multitude of hyperparameters. And finally, I will point you to some links to dive deeper into the math and the theory behind GBMs.
[Extra]. Subsampling
Even after tuning all the above parameters correctly, it might just happen
that some trees in the ensemble are highly correlated.
You: Correlated trees? What do you mean by that?

Me: Sure. I mean decision trees that are similar in structure because of similar splits based on the same features. When you have such trees in the ensemble, the ensemble as a whole stores less information than it could have if the trees were different. So, we want our trees to be as uncorrelated as possible.
To combat this problem, we subsample the data rows and columns before each iteration and train the tree on this subsample, meaning that different trees are trained on different subsamples of the entire dataset.
Different Data -> Different Trees
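Both libraries expose this through row- and column-sampling fractions; a minimal sketch with the sklearn-style wrappers:

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# LightGBM: train each tree on 80% of the rows (resampled every
# iteration) and 80% of the columns.
lgbm = LGBMClassifier(
    subsample=0.8,         # row fraction (a.k.a. bagging_fraction)
    subsample_freq=1,      # resample rows at every iteration
    colsample_bytree=0.8,  # column fraction (a.k.a. feature_fraction)
)

# XGBoost: the equivalent row and column fractions.
xgb = XGBClassifier(subsample=0.8, colsample_bytree=0.8)
```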
You can check out the sklearn API for LightGBM here and that for XGBoost
here.
Since my whole story was sewn around these hyperparameters, I would like to give you a brief on how to go about hyperparameter tuning.
But if you wish to go even further, you could search the region around a promising hyperparameter set using GridSearchCV. Grid search will train the model using every possible hyperparameter combination and return the best set. Note that since it tries every possible combination, it can be expensive to run.
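A minimal sketch of what that looks like with sklearn's GridSearchCV wrapped around LightGBM (toy data and a deliberately small grid):

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

param_grid = {
    "num_leaves": [15, 31, 63],
    "min_child_samples": [10, 20, 40],
    "subsample": [0.8, 1.0],
}

# Tries every combination: 3 * 3 * 2 = 18 candidates, each cross-validated
# 3 times, so 54 fits in total. This is why grid search gets expensive fast.
search = GridSearchCV(
    LGBMClassifier(n_estimators=200, subsample_freq=1),
    param_grid,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```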
Speaking of using GBMs, I would like to tell you a little more about Random Forests. I briefly talked about them above — Random Forests are a different type of (tree-based) ensemble technique.

They are great because their default parameter settings are quite close to the optimal ones. So they will give you a good enough result even with the default settings, unlike XGBoost and LightGBM, which require tuning. But once tuned, XGBoost and LightGBM are likely to perform better.
. . .
Let me know your thoughts in the comments section below. You can also reach out to me on LinkedIn or Twitter (follow me there; I won’t spam your feed ;) )