Course 3: Boosting
Myriam Tami
myriam.tami@centralesupelec.fr
1 Motivations
5 AdaBoost
6 Summary
Reminder
Problem
Question: can weak learners $h_t$ be combined to generate a strong learner with low bias?
Solution: Yes [Schapire, 1990], create ensemble predictors
$$H_T(x) := \sum_{t=1}^{T} \alpha_t h_t(x)$$
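As a minimal illustration (not from the slides), the following Python sketch evaluates such an additive ensemble; the function name, the stump learners and the weights are all made up for the example.

```python
def ensemble_predict(x, weak_learners, alphas):
    """Evaluate H_T(x) = sum_t alpha_t * h_t(x) for a single input x.
    `weak_learners` is a list of callables h_t and `alphas` the weights alpha_t
    (illustrative names, not from the course)."""
    return sum(alpha * h(x) for alpha, h in zip(alphas, weak_learners))

# Toy usage: two decision "stumps" on a scalar input, combined with weights 0.7 and 0.3.
stumps = [lambda x: 1.0 if x > 0 else -1.0,
          lambda x: 1.0 if x > 2 else -1.0]
print(ensemble_predict(1.5, stumps, [0.7, 0.3]))  # 0.7 * 1 + 0.3 * (-1) = 0.4
```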
Suppose,
$$H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
Gradient-Based Method
Idea: $\nabla L_2(H_t(x_i))$ is subtracted from $H_t(x_i)$ to move against the gradient toward the local minimum
Consider a hypothesis space parametrized by a Euclidean 3-space
$\vec{H}$ is a prediction of $\vec{y}$
To improve the prediction we add $-\alpha \nabla L_2(\vec{H})$ to move against the gradient toward the local minimum
We repeat this step until we converge to the local minimum (a minimal numerical sketch follows below)
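Below is a tiny numerical sketch of this gradient step, assuming the square loss and a hand-picked step size; the targets and step size are toy choices, not values from the slides.

```python
import numpy as np

# Gradient descent on the vector of predictions H = (H(x_1), ..., H(x_n)).
# For the empirical square loss L2(H) = (1/n) * sum_i (y_i - H_i)^2, the gradient
# is grad_i = -(2/n) * (y_i - H_i), so H - alpha * grad moves H toward y.
y = np.array([1.0, -1.0, 2.0])        # toy targets
H = np.zeros_like(y)                  # current predictions
alpha = 0.5                           # step size (an arbitrary illustrative choice)

for step in range(20):
    grad = -(2.0 / len(y)) * (y - H)  # gradient of the empirical L2 risk at H
    H = H - alpha * grad              # move against the gradient
print(H)                              # close to y = [1, -1, 2] after 20 steps
```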
Another intuition
Let us come back to our construction process of the ensemble
Assume we have already finished t iterations and have an ensemble predictor $H_t(x)$
In iteration t + 1 we want to add one more weak learner $h_{t+1}$ to the ensemble
To this end, we assume $\alpha \in [0, 1]$ is fixed and we optimize
$$h_{t+1} = \operatorname*{argmin}_{h \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} L_2\big(H_t(x_i) + \alpha h(x_i),\, y_i\big) \qquad (1)$$
$$H_{t+1} := H_t + \alpha h_{t+1}$$
Questions:
I Is $h_{t+1}$ the negative gradient of $L_2$ at $H(x)$?
I How can we find such $h \in \mathcal{F}$? Solve Eq. (1)? What about $\alpha$? (a brute-force sketch over a finite family follows below)
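One naive way to approach Eq. (1), assuming $\mathcal{F}$ is a small finite family of candidate learners, is an exhaustive search; this sketch is purely illustrative and all names in it are made up.

```python
import numpy as np

def l2_risk(pred, y):
    """Empirical square loss (1/n) * sum_i (y_i - pred_i)^2."""
    return np.mean((y - pred) ** 2)

def next_weak_learner(H_t, X, y, candidates, alpha=0.1):
    """Greedy stagewise step of Eq. (1): among a finite family `candidates`,
    pick the h minimizing (1/n) sum_i L2(H_t(x_i) + alpha * h(x_i), y_i)."""
    base = np.array([H_t(x) for x in X])
    return min(candidates,
               key=lambda h: l2_risk(base + alpha * np.array([h(x) for x in X]), y))

# Usage with trivial constant learners as the family F:
X = [0.0, 1.0, 2.0]; y = np.array([1.0, 1.0, 1.0])
F = [lambda x, c=c: c for c in (-1.0, 0.0, 1.0)]
h = next_weak_learner(lambda x: 0.0, X, y, F)
print(h(0.0))  # 1.0, the constant that best reduces the L2 risk
```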
Overview
4 AdaBoost
5 Conclusion
1 Initialize $h_0(x) = 0$
2 For t = 1 to T
1 Compute
L2 -Boosting example
Suppose we use the square loss $L_2$
Then in each step we minimize the objective function J
$$J(\alpha, h) = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \big(H_{t-1}(x_i) + \underbrace{\alpha h(x_i)}_{\text{new piece}}\big)\Big)^2$$
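A minimal L2-boosting loop in this spirit, assuming depth-1 regression trees (stumps) from scikit-learn as the weak learners and a toy 1-D dataset; everything below is an illustrative sketch, not code from the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # toy regression data

alpha, T = 0.1, 200          # small fixed step size and number of rounds
H = np.zeros_like(y)         # H_0 = 0
stumps = []
for t in range(T):
    residuals = y - H        # for the square loss, the "new piece" should fit these
    h = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    H = H + alpha * h.predict(X)          # H_t = H_{t-1} + alpha * h_t
    stumps.append(h)

print(np.mean((y - H) ** 2))              # training L2 risk shrinks as T grows
```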
AdaBoost example
Figure: The horizontal axis corresponds to the margin and the vertical one to the loss value. The black curve is the misclassification loss, the blue one the exponential loss, the red one the logistic loss, and the green one the hinge loss.
Note that the exponential loss puts a very large weight on bad
misclassifications
When the Bayes error rate is high (e.g., P(y ≠ f(x)) = 0.25)
I e.g. there's some intrinsic randomness in the label
I e.g. training examples with the same input but different classifications
The best we can do is predict the most likely class for each x
Some training predictions should be wrong
I because the example doesn't have the majority class label
I AdaBoost / exponential loss puts a lot of focus on getting those right
Thus, empirically, AdaBoost has degraded performance in situations with
I a high Bayes error rate, or when there's
I high "label noise"
The next topic will be the resolution of the optimization problem (2)
We want to take the gradient w.r.t. "H", which isn't a variable but a function
J(H) only depends on H at the n training points
Let's view the function H as a vector and define $\mathbf{H} := (H(x_1), \ldots, H(x_n))^T$
and write the objective function as
$$J(\mathbf{H}) = \sum_{i=1}^{n} L(y_i, H_i)$$
The negative gradient is then
$$-g = -\nabla_{\mathbf{H}} J(\mathbf{H}) = -\big(\partial_{H_1} L(y_1, H_1), \ldots, \partial_{H_n} L(y_n, H_n)\big)$$
Figure: Gradient descent illustration. (source: Ensemble Methods in Data Mining, Seni and Elder, 2010)
Exercise 1
Compute −g for the square loss L2
The entries of −g are also called the "pseudo-residuals"
I for the square loss, they are exactly the residuals: $-\partial_{H_i}(y_i - H_i)^2 = 2(y_i - H_i)$
Find the base hypothesis $h \in \mathcal{F}$ most collinear with the negative gradient
Optimization problem (project the gradient onto $\mathcal{F}$):
$$h_t = \operatorname*{argmin}_{h \in \mathcal{F}} \langle g, h \rangle_n \qquad (3)$$
Figure: The base learner most collinear to the negative gradient vector is
chosen at every step (source: Ensemble Methods in Data Mining, Seni and Elder, 2010)
That is, we look for the learning rate (step size) that yields the maximum decrease of the empirical risk
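A hedged sketch of such a line search, using a bounded scalar minimizer from SciPy; the helper name, the loss argument, and the toy numbers are all illustrative assumptions, not material from the slides.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_step_size(y, H, h_pred, loss):
    """Line search: find alpha >= 0 minimizing the empirical risk of H + alpha * h.
    `H` and `h_pred` hold the current ensemble and weak-learner predictions on
    the training points; `loss(y, pred)` is any per-sample loss."""
    objective = lambda a: np.mean(loss(y, H + a * h_pred))
    return minimize_scalar(objective, bounds=(0.0, 10.0), method="bounded").x

# Example with the square loss: the optimum matches the closed-form projection.
y = np.array([1.0, 2.0, 0.5])
H = np.zeros(3)
h_pred = np.array([1.0, 1.0, 1.0])
sq = lambda y, p: (y - p) ** 2
print(best_step_size(y, H, h_pred, sq))  # approximately mean(y) = 1.1667
```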
Gradient Boosted Regression Trees (GBRT)
A regression case: Y = ℝ
I Classification is also possible, with Y = {+1, −1}
Weak learners $h \in \mathcal{F}$ are regressors: ∀x, h(x) ∈ ℝ
I typically fixed-depth (e.g., depth = 4) regression trees (hence the name)
The step size α is fixed to a small constant (a hyperparameter to tune)
The loss function can be any differentiable convex loss that decomposes over the samples: $L(H) = \sum_{i=1}^{n} L(H(x_i))$
4 Update rule: $H_t(x) = H_{t-1}(x) + \alpha h_t(x)$
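To make the recipe concrete, here is a small GBRT-style sketch for binary classification with the logistic loss, assuming scikit-learn regression trees of depth 4 as weak learners; the dataset and all names are toy choices, not code from the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# GBRT sketch for y in {-1, +1} with the logistic loss L(y, H) = log(1 + exp(-y * H)).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

alpha, T = 0.1, 100
H = np.zeros(len(y))                       # H_0 = 0
trees = []
for t in range(T):
    r = y / (1.0 + np.exp(y * H))          # pseudo-residuals: -dL/dH_i
    h = DecisionTreeRegressor(max_depth=4).fit(X, r)   # fit a depth-4 tree to -g
    H = H + alpha * h.predict(X)           # update rule: H_t = H_{t-1} + alpha * h_t
    trees.append(h)

print(np.mean(np.sign(H) == y))            # training accuracy of the ensemble
```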
AdaBoost
$w_i^t$ denotes the contribution (weight) of the i-th training point at the t-th iteration
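Consistent with the weight update in Algorithm 2 below (this formula is not on the extracted slide), these weights can be written, up to normalization, as
$$w_i^t \;\propto\; e^{-y_i H_{t-1}(x_i)}$$
so a point on which the current ensemble has a large negative margin receives a large weight.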
AdaBoost: Step-size
In the AdaBoost setting we can find the optimal step size by a line search every time we take a step direction
Given L, $H_{t-1}$, and $h_t$ we want to solve
$$\alpha_t = \operatorname*{argmin}_{\alpha \in \mathbb{R}^+} \frac{1}{n}\sum_{i=1}^{n} L\big(y_i,\, H_{t-1}(x_i) + \alpha h_t(x_i)\big) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^+} \frac{1}{n}\sum_{i=1}^{n} e^{-y_i \left(H_{t-1}(x_i) + \alpha h_t(x_i)\right)}$$
Now we have to extract $\alpha_t$ by solving the first-order condition
$$\sum_{i=1}^{n} y_i h_t(x_i)\, e^{-\left(y_i H_{t-1}(x_i) + y_i \alpha_t h_t(x_i)\right)} = 0$$
$$-\sum_{i:\, h_t(x_i) y_i = 1} e^{-\left(y_i H_{t-1}(x_i) + \alpha_t\right)} + \sum_{i:\, h_t(x_i) y_i = -1} e^{-\left(y_i H_{t-1}(x_i) - \alpha_t\right)} = 0$$
$$-\sum_{i:\, h_t(x_i) y_i = 1} w_i^t\, e^{-\alpha_t} + \sum_{i:\, h_t(x_i) y_i = -1} w_i^t\, e^{\alpha_t} = 0$$
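The closed form that results from the last equation is not on the extracted slide, but solving it is standard: with $\varepsilon_t$ the weighted training error of $h_t$,
$$e^{2\alpha_t} = \frac{\sum_{i:\, h_t(x_i) y_i = 1} w_i^t}{\sum_{i:\, h_t(x_i) y_i = -1} w_i^t} \quad\Longrightarrow\quad \alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad \varepsilon_t := \frac{\sum_{i:\, h_t(x_i) y_i = -1} w_i^t}{\sum_{i=1}^{n} w_i^t}$$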
8: $H_t = H_{t-1} + \alpha_t h_t$
9: $\forall i,\; w_i^{t+1} \leftarrow \dfrac{w_i^t\, e^{-\alpha_t y_i h_t(x_i)}}{\sum_{j=1}^{n} w_j^t\, e^{-\alpha_t y_j h_t(x_j)}}$
10: else
11: return $H_t$
12: end if
13: end for
return $H_T$
Algorithm 2: AdaBoost in pseudo-code (for classification)
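A compact Python sketch in the spirit of Algorithm 2, assuming depth-1 scikit-learn trees as weak learners and labels in {−1, +1}; the early-stopping test mirrors the "else: return H_t" branch above, and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """Minimal AdaBoost sketch for labels y in {-1, +1} with decision stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # initial uniform weights
    learners, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, None)   # weighted training error
        if eps >= 0.5:                             # no longer a weak learner: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)      # optimal step size from the derivation
        w = w * np.exp(-alpha * y * pred)          # re-weight the training points
        w /= w.sum()                               # normalize (line 9 of Algorithm 2)
        learners.append(h); alphas.append(alpha)
    return learners, alphas

def adaboost_predict(X, learners, alphas):
    return np.sign(sum(a * h.predict(X) for a, h in zip(alphas, learners)))
```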
To conclude
Boosting is a great way to turn a weak classifier into a strong classifier:
I Boosting algorithms combine weak models trained sequentially so as to give more importance to the training samples on which the predictions are bad
I Boosting defines a whole family of algorithms, including Gradient Boosting, AdaBoost and many others...
GBRT is one of the most popular algorithms for "Learning to Rank", the branch of machine learning focused on learning ranking functions (e.g., for web search engines)
Inspired by Breiman's Bagging, Stochastic Gradient Boosting subsamples the training data for each weak learner
I This combines the benefits of bagging and boosting
AdaBoost is an extremely powerful algorithm that turns any weak learner able to classify any weighted version of the training set with error below 0.5 into a strong learner whose training error decreases exponentially
If you are curious
References I
Bartlett, P., Freund, Y., Lee, W. S., and Schapire, R. E. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2):197–227.