
Ensemble learning from theory to practice

Course 3: Boosting

Myriam Tami

myriam.tami@centralesupelec.fr

January - March, 2023



1 Motivations

2 Forward stagewise additive modeling (FSAM)

3 Gradient Boosting - “Anyboost”

4 Gradient Boosted Regression Tree

5 AdaBoost

6 Summary


Reminder

Course 1: Decision trees suffer from a bias and variance problem (they are weak
learners: their error will not tend to zero)
Course 2: Bagging/Random forests tackle the variance problem by aggregating
decision trees,

    h(x) := (1/B) Σ_{b=1}^{B} f_b(x)

where f_b : X → Y are basis functions from the hypothesis space F, which is the
set of CART trees


Problem

If the hypothesis class F is a set of predictors having large bias and high
training error (e.g., CART trees with very limited depth)

Question:
Can weak learners h_t be combined to generate a strong learner with low bias?

Solution: Yes [Schapire, 1990], create ensemble predictors

    H_T(x) := Σ_{t=1}^{T} α_t h_t(x)

where α_t ∈ ℝ and h_t ∈ F are chosen based on the training data D


Adaptive Basis Function Model


Learning the adaptive basis function expansion H_T over F,

    H_T(x) = Σ_{t=1}^{T} α_t h_t(x)

means choosing α_1, . . . , α_T ∈ ℝ and h_1, . . . , h_T ∈ F to fit D

We can suppose our base hypothesis space F is parametrized by Θ, such that

    H_T(x) = Σ_{t=1}^{T} α_t h_t(x, Θ_t)

where (Θ_t)_{1 ≤ t ≤ T} ∈ Θ characterises h_t

E.g., h_t could be a decision tree where Θ_t characterises the t-th tree in
terms of split variables, cutpoints at each node and terminal-node values
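
For concreteness, here is a minimal Python sketch (illustrative only, with
made-up stumps and weights) of evaluating such an adaptive basis function
expansion:

    import numpy as np

    # Hypothetical weak learners: depth-1 trees ("stumps"), each described by a
    # split variable, a cutpoint and two terminal-node values (the role of Theta_t).
    def make_stump(feature, cutpoint, left_value, right_value):
        return lambda x: np.where(x[:, feature] <= cutpoint, left_value, right_value)

    stumps = [make_stump(0, 0.5, -1.0, 1.0),
              make_stump(1, 0.2, 0.5, -0.5)]
    alphas = [0.7, 0.3]                       # arbitrary weights alpha_t

    def H(x):
        # H_T(x) = sum_t alpha_t * h_t(x)
        return sum(a * h(x) for a, h in zip(alphas, stumps))

    x = np.array([[0.3, 0.9], [0.8, 0.1]])
    print(H(x))                               # ensemble predictions for two inputs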

An intuition for the process of construction

Suppose

    H_T(x) = Σ_{t=1}^{T} α_t h_t(x)

is built in an iterative fashion (“sequential ensemble”)

In iteration t, we add the predictor α_t h_t(x) to the ensemble,

    H_t(x) = H_{t−1}(x) + α_t h_t(x)

At test time, we evaluate all predictors and return the weighted sum

The process of construction is similar to gradient descent: instead of updating
the model parameters in each iteration, we add functions to our ensemble


Gradient-Based Method

We’ll consider learning by empirical risk minimization


Suppose our base hypothesis space F is parametrized by Θ
Write the empirical risk minimization objective function as

    R(H) = R(α_1, . . . , α_T, Θ_1, . . . , Θ_T) = (1/n) Σ_{i=1}^{n} L2( y_i, Σ_{t=1}^{T} α_t h_t(x_i, Θ_t) )

where L2 denotes the squared loss function

How to optimize R? How to learn?
Can we differentiate R w.r.t. the α_t’s and Θ_t’s? Optimize with the gradient
descent method?
For some hypothesis spaces (Θ = ℝⁿ) and typical loss functions, yes!


Parametrization by axes corresponding to the values taken by the training
points

H (and h) are now parametrized by the training points as axes of the Euclidean
n-space

Ex: n = 3, Θ = ℝ³, D_3 = {(x_1, 1), (x_2, −1), (x_3, 1)}, i.e. y = (1, −1, 1)ᵀ

The first training point can take its input values along the whole first axis

What if Gradient Based Method does not apply?

Even if we could parameterize trees with the Euclidean n-space ℝⁿ (h_t becomes
an n-length vector of its predicted values),
  - the predictions (of H) would not change continuously w.r.t. the
    observations x_i ∈ ℝⁿ,
  - and so are certainly not differentiable

In this course we’ll discuss gradient boosting, which applies whenever
  - the loss function is (sub)differentiable w.r.t. the training predictions
    H(x_i), and
  - it’s possible to do regression with the base hypothesis space F
    (e.g., regression trees)


Gradient-Based Method

Gradient descent is based on the following:

If the multi-variable function L2 is defined and differentiable in a
neighborhood of a point H(x_i), then L2 decreases fastest if one goes from
H(x_i) in the direction of the negative gradient of L2 at H(x_i), −∇L2(H(x_i)):

    H_{t+1}(x_i) = H_t(x_i) − α ∇L2(H_t(x_i))

for α ∈ ℝ+ small enough

Idea: ∇L2(H_t(x_i)) is subtracted from H_t(x_i) to move against the gradient,
toward the local minimum
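
A tiny numerical sketch of this update for the squared loss L2(y, H) = (y − H)²,
whose gradient w.r.t. the prediction is −2(y − H) (the targets and step size
below are made up):

    import numpy as np

    y = np.array([1.0, -1.0, 1.0])       # targets
    H = np.zeros(3)                      # current predictions H_t(x_i)
    alpha = 0.1                          # small step size

    for step in range(50):
        grad = -2.0 * (y - H)            # gradient of L2 at the current predictions
        H = H - alpha * grad             # H_{t+1}(x_i) = H_t(x_i) - alpha * grad
    print(H)                             # approaches y = (1, -1, 1)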


An illustration of the Gradient descent-based method

Consider a hypothesis space parametrized by a Euclidean 3-space

H is a prediction of y

To improve the prediction we add −α ∇L2(H), to move against the gradient
toward the local minimum

We repeat this step by step until convergence to the local minimum

Another intuition
Come back to our process of construction of the ensemble
Assume we have already finished t iterations and have an ensemble predictor
H_t(x)

In iteration t + 1 we want to add one more weak learner h_{t+1} to the ensemble

To this end, we assume α ∈ [0, 1] is fixed and we optimize

    h_{t+1} = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} L2( H_t(x_i) + α h(x_i), y_i )        (1)

Then we add it to our ensemble,

    H_{t+1} := H_t + α h_{t+1}

Questions:
  - Is h_{t+1} the negative gradient of L2 at H(x)?
  - How can we find such an h ∈ F? Solve Eq. (1)? What about α?

Overview

1 Forward stagewise additive modeling (FSAM)


  - L2-Boosting
  - Exponential loss gives AdaBoost

2 Gradient Boosting - “Anyboost”

3 Gradient Boosted Regression Trees (GBRT)

4 AdaBoost

5 Conclusion


Forward Stagewise Additive Modeling (FSAM)


FSAM is an iterative optimization algorithm for fitting adaptive basis function
models

Start with H_0 = 0 (the zero function)

After t − 1 iterations, we have

    H_{t−1}(x) = Σ_{τ=1}^{t−1} α_τ h_τ(x)

In the t-th iteration, we want to find
  - a step direction h_t ∈ F (i.e., a basis function) and
  - a step size α_t > 0
such that

    H_t = H_{t−1} + α_t h_t

improves the objective function value by as much as possible

FSAM for Empirical Risk Minimization

1 Initialize H_0(x) = 0
2 For t = 1 to T
    1 Solve the optimization problem (“FSAM step”)

          (α_t, h_t) = argmin_{α ∈ ℝ, h ∈ F} (1/n) Σ_{i=1}^{n} L( y_i, H_{t−1}(x_i) + α h(x_i) )        (2)

      for a loss function L; the term α h(x_i) is the new piece added to the
      ensemble
    2 Set H_t = H_{t−1} + α_t h_t
3 Return: H_T

L2-Boosting example

Suppose we use the square loss L2. Then in each step we minimize the objective
function J,

    J(α, h) = (1/n) Σ_{i=1}^{n} ( y_i − [ H_{t−1}(x_i) + α h(x_i) ] )²

where α h(x_i) is the new piece

If F is closed under rescaling (i.e., if h ∈ F, then αh ∈ F for all α ∈ ℝ),
then we don’t need α

Take α = 1 and minimize

    J(h) = (1/n) Σ_{i=1}^{n} ( [y_i − H_{t−1}(x_i)] − h(x_i) )²

This is just fitting the residuals with least-squares regression!
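
A minimal sketch of one such boosting step, using scikit-learn's
DecisionTreeRegressor as the base hypothesis space F (the data and tree depth
below are arbitrary choices for illustration):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

    H_prev = np.zeros(200)                  # predictions of H_{t-1}
    residuals = y - H_prev                  # y_i - H_{t-1}(x_i)

    # Minimizing J(h) is exactly least-squares regression on the residuals
    h_t = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    H_new = H_prev + h_t.predict(X)         # alpha = 1 here
    print("empirical risk before:", np.mean((y - H_prev) ** 2),
          "after:", np.mean((y - H_new) ** 2))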

AdaBoost example

AdaBoost is used for classification problems

Outcome space Y = {−1, 1}

A score function h : X → ℝ

Margin for an example (x, y) is m = y h(x):
  - m > 0 ⟺ classification correct
  - larger m is better


AdaBoost example: Exponential Loss


Introduce the exponential loss: L(y, H(x)) = exp(−y H(x))

Figure: The horizontal axis corresponds to the margin and the vertical one to
the loss value. The black curve is the Misclassification loss, the blue one is
the Exponential loss, the red one is the Logistic loss and the green one is the
Hinge loss.
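
The curves of this figure can be reproduced with a short script such as the
following sketch (the rescaling of the logistic loss so that it equals 1 at
zero margin is a plotting choice):

    import numpy as np
    import matplotlib.pyplot as plt

    m = np.linspace(-2, 2, 400)                      # margin m = y * h(x)
    misclassification = (m <= 0).astype(float)       # 0/1 loss
    exponential = np.exp(-m)
    logistic = np.log(1 + np.exp(-m)) / np.log(2)    # rescaled to equal 1 at m = 0
    hinge = np.maximum(0.0, 1 - m)

    for curve, label in [(misclassification, "misclassification"),
                         (exponential, "exponential"),
                         (logistic, "logistic"),
                         (hinge, "hinge")]:
        plt.plot(m, curve, label=label)
    plt.xlabel("margin m = y h(x)"); plt.ylabel("loss"); plt.legend(); plt.show()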

AdaBoost example

FSAM with Exponential loss

Consider the classification setting: Y = {−1, 1}

Take the loss function to be the exponential loss

    L(y, H(x)) = exp(−y H(x))

Let F be a base hypothesis space of binary classifiers h : X → {−1, 1}

AdaBoost is a version of FSAM


AdaBoost example: Exponential Loss

Note that the exponential loss puts a very large weight on bad
misclassifications


AdaBoost / Exponential Loss: Robustness Issues

When the Bayes error rate is high (e.g., P(y ≠ f(x)) = 0.25)
  - e.g. there’s some intrinsic randomness in the label
  - e.g. training examples with the same input, but different classifications

The best we can do is predict the most likely class for each x

Some training predictions should be wrong
  - because the example doesn’t have the majority class
  - AdaBoost / exponential loss puts a lot of focus on getting those right

Thus, empirically, AdaBoost has degraded performance in situations with
  - a high Bayes error rate, or when there’s
  - high “label noise”


FSAM for other loss functions

We know how to do FSAM for certain loss functions
  - e.g., the square loss, the exponential loss

In each case, it happens to reduce to another optimization problem
(cf. Eq. (2)) that we know how to solve

However, it is not clear how to do FSAM in general, for example for the
logistic loss or the cross-entropy loss (a generalization of the logistic
loss - Eq. (5), Course 1)

The next topic will be the resolution of the optimization problem (2)


FSAM is an iterative optimization

The FSAM step, Eq. (2), is

    (α_t, h_t) = argmin_{α ∈ ℝ, h ∈ F} (1/n) Σ_{i=1}^{n} L( y_i, H_{t−1}(x_i) + α h(x_i) )

where α h(x_i) is the new piece

Hard part: finding the best step direction h

What if we looked for the locally best step direction?
  - like in gradient descent


Concept of Functional Gradient


We want to minimize the objective function

    J(H) = Σ_{i=1}^{n} L( y_i, H(x_i) )

We want to take the gradient w.r.t. “H”, which isn’t a variable but a function

J(H) only depends on H at the n training points

Let’s see the function H as a vector and define

    H := ( H(x_1), . . . , H(x_n) )ᵀ

and write the objective function as

    J(H) = Σ_{i=1}^{n} L( y_i, H_i )

→ We will differentiate over each component of H, seen as a variable



Functional Gradient Descent

Consider gradient descent on

    J(H) = Σ_{i=1}^{n} L( y_i, H_i )

We want to optimize over H

The negative gradient step direction at H is

    −g = −∇_H J(H) = −( ∂_{H_1} L(y_1, H_1), . . . , ∂_{H_n} L(y_n, H_n) )

which we can easily calculate

−g ∈ ℝⁿ is the direction in which we want to change each of our n predictions
on the training data


Functional Gradient stepping

Figure: Gradient descent illustration. (source: Ensemble Methods in Data Mining, Seni and Elder, 2010)

The empirical risk R(H) is plotted as a function of H = (H(x_1), H(x_2)), which
are the predictions on two training points.

Starting with an initial guess, a sequence converging towards the minimum of
R(H) is generated by moving in the direction of the negative of the gradient
Functional Gradient Descent


The step direction is

    −g = −∇_H J(H) = −( ∂_{H_1} L(y_1, H_1), . . . , ∂_{H_n} L(y_n, H_n) )ᵀ

Exercise 1
Compute −g for the square loss L2

The components of −g are also called the “pseudo-residuals”
  - for the square loss, they are exactly the residuals (up to a constant
    factor): −∂_{H_i} (y_i − H_i)² = 2 (y_i − H_i)

Find the base hypothesis h ∈ F that is most collinear with the negative
gradient, i.e. solve the optimization problem (project the gradient on F)

    h_t = argmin_{h ∈ F} ⟨g, h⟩_n        (3)

Therefore the update rule is H_{t+1} = H_t + α_t h_t
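
A small sketch of the pseudo-residual computation, for the square loss and, for
comparison, the exponential loss used later by AdaBoost (the labels and
predictions below are made up):

    import numpy as np

    y = np.array([1.0, -1.0, 1.0])
    H = np.array([0.2, 0.3, -0.1])          # current training predictions H_i

    # Square loss L2(y, H) = (y - H)^2  ->  -dL/dH = 2 (y - H): the residuals
    neg_grad_l2 = 2.0 * (y - H)

    # Exponential loss L(y, H) = exp(-y H)  ->  -dL/dH = y * exp(-y H)
    neg_grad_exp = y * np.exp(-y * H)

    print(neg_grad_l2)    # [ 1.6 -2.6  2.2]
    print(neg_grad_exp)   # approx. [ 0.82 -1.35  1.11]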



“Projected" Functional Gradient Stepping

Figure: The base learner most collinear to the negative gradient vector is
chosen at every step (source: Ensemble Methods in Data Mining, Seni and Elder, 2010)

T (x, p) ∈ F is our actual step direction (i.e., ht )


T (x, p) is like the projection of −g = −∇R(H) onto F

Functional Gradient Descent: Step size

Finally, we choose a step-size

Option 1 (Line-search):

    α_t = argmin_{α ∈ ℝ+} (1/n) Σ_{i=1}^{n} L( y_i, H_{t−1}(x_i) + α h_t(x_i) )

i.e., we look for the learning rate that will imply the maximum decrease of the
empirical risk

Option 2 (Shrinkage parameter - more common):
  - We consider α = 1 to be the full gradient step
  - Choose a fixed α ∈ [0, 1] - called a shrinkage parameter
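
Option 1 can be sketched as follows, here with the square loss and scipy's
bounded scalar minimizer (the vectors y, H and h are made-up placeholders):

    import numpy as np
    from scipy.optimize import minimize_scalar

    y = np.array([1.2, -0.7, 0.4, 2.0])
    H = np.array([0.5, 0.0, 0.1, 1.0])        # H_{t-1}(x_i)
    h = np.array([1.0, -1.0, 0.5, 1.5])       # h_t(x_i), the chosen step direction

    def empirical_risk(alpha):
        # (1/n) sum_i L2( y_i, H_{t-1}(x_i) + alpha * h_t(x_i) )
        return np.mean((y - (H + alpha * h)) ** 2)

    result = minimize_scalar(empirical_risk, bounds=(0.0, 10.0), method="bounded")
    print("line-search step-size:", result.x)

    # Option 2 would instead fix a shrinkage parameter, e.g. alpha = 0.1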


Gradient Boosted Tree

A regression case: Y = ℝ
  - Classification is also possible: Y = {+1, −1}

Weak learners h ∈ F are regressors: ∀x, h(x) ∈ ℝ
  - typically, fixed-depth (e.g., depth = 4) regression trees (hence the name)

The step-size α is fixed to a small constant (a hyperparameter to tune)

The loss function can be any differentiable convex loss that decomposes over
the samples: L(H) = Σ_{i=1}^{n} L(H(x_i))


Gradient Boosted Regression Tree


The process is the same as presented before
1 We can fix the step-size α to a small constant or tune it as a
  hyperparameter (e.g., fixing α = 0.1; this is the more common choice)
2 The step direction is exactly the residuals for the square loss L2,

      −g = ( (y_1 − H_1), . . . , (y_n − H_n) )ᵀ

3 We project the gradient on F, i.e. fit the h ∈ F that best approximates −g,
  as the step direction,

      h_t = argmin_{h ∈ F} ⟨g, h⟩_n

4 Update rule,

      H_t(x) = H_{t−1}(x) + α h_t(x)


Functional Gradient Descent: Step direction


Detail the 3rd step (projection of the gradient on F):

    h_t = argmin_{h ∈ F} ⟨g, h⟩_n
        = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} −2 (y_i − H(x_i)) h(x_i)        [ (y_i − H(x_i)) = −g_i ]
        = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} [ (y_i − H(x_i))² − 2 (y_i − H(x_i)) h(x_i) + (h(x_i))² ]
          (the added terms (y_i − H(x_i))² and (h(x_i))² are constants w.r.t. the choice of h)
        = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} ( −g_i − h(x_i) )²

This is a least squares regression problem over the hypothesis space F

Remark: (h(x_i))² can be seen as a constant by decoupling the length of a
vector and its direction

GBRT: Step direction interpretation

The final optimization problem to solve,

    h_t = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} ( −g_i − h(x_i) )²

is simply fitting the residuals

We build a new tree for a different set of “labels” −g_1, . . . , −g_n

g is the vector pointing from y to H

Importantly, this works for any differentiable and convex loss function: the
next weak learner h will always be the regression tree minimizing the squared
loss to the pseudo-residuals


Gradient Boosted Regression Tree in Pseudo Code



1: Require: a dataset D = (x_i, y_i)_{1 ≤ i ≤ n}, a loss L, a number of
   iterations T, a value α
2: Initialization: H = 0
3: for t = 1 to T do
4:   for all i do
5:     −g_i = y_i − H(x_i)
6:   end for
7:   h = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} ( h(x_i) − (−g_i) )²
8:   H ← H + α h
9: end for
return: H

Algorithm 1: Pseudo-code describing the GBRT approach (without line search for
the step-size)
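
A compact Python sketch of Algorithm 1, assuming the square loss and
scikit-learn regression trees as weak learners (the synthetic dataset and
hyperparameter values are arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbrt_fit(X, y, T=100, alpha=0.1, max_depth=4):
        """Gradient Boosted Regression Trees for the square loss (no line search)."""
        H = np.zeros(len(y))                # H = 0
        trees = []
        for t in range(T):
            neg_grad = y - H                # -g_i = y_i - H(x_i) for the square loss
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
            H += alpha * h.predict(X)       # H <- H + alpha * h
            trees.append(h)
        return trees

    def gbrt_predict(trees, X, alpha=0.1):
        # alpha must match the value used during fitting
        return alpha * sum(h.predict(X) for h in trees)

    # Tiny usage example on synthetic data
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(300)
    trees = gbrt_fit(X, y)
    print("training MSE:", np.mean((y - gbrt_predict(trees, X)) ** 2))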


AdaBoost

A classification case: Y = {+1, −1}

Weak learners h ∈ F are binary: ∀x, h(x) ∈ {+1, −1}

Step-size: we perform line-search to obtain the best step-size α

Loss function: the exponential loss L(H) = Σ_{i=1}^{n} e^{−y_i H(x_i)}


AdaBoost: Step direction


1 We compute the gradient

      ∂_{H_i} L(y_i, H_i) = −y_i e^{−y_i H_i}

2 We project the gradient on F

      h_t = argmin_{h ∈ F} Σ_{i=1}^{n} −y_i e^{−y_i H_t(x_i)} h(x_i)
          = argmin_{h ∈ F} Σ_{i=1}^{n} −w_i^t y_i h(x_i)        (substituting w_i^t = e^{−y_i H_t(x_i)} / Σ_{i=1}^{n} e^{−y_i H_t(x_i)})
          = argmin_{h ∈ F} { Σ_{i=1}^{n} w_i^t 1{h(x_i) ≠ y_i} − Σ_{i=1}^{n} w_i^t 1{h(x_i) = y_i} }
          = argmin_{h ∈ F} Σ_{i=1}^{n} w_i^t 1{h(x_i) ≠ y_i}        (because Σ_{i: h(x_i) = y_i} w_i^t = 1 − Σ_{i: h(x_i) ≠ y_i} w_i^t)

This is the weighted classification error of the training samples



AdaBoost: Step direction

Projecting the gradient on F means fitting a classification tree h that
minimizes the weighted classification error

This tree is then h_t, and we denote this classification error

    ε_t = Σ_{i=1}^{n} w_i^t 1{h_t(x_i) ≠ y_i}

w_i^t is the contribution (weight) of the i-th training point at the t-th
iteration
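
For instance, a minimal computation of ε_t with made-up weights and predictions:

    import numpy as np

    y      = np.array([ 1, -1,  1,  1, -1])          # true labels
    h_pred = np.array([ 1,  1,  1, -1, -1])          # predictions of a candidate h
    w      = np.array([0.1, 0.2, 0.3, 0.2, 0.2])     # weights w_i^t (they sum to 1)

    eps_t = np.sum(w * (h_pred != y))                # weighted classification error
    print(eps_t)                                     # 0.2 + 0.2 = 0.4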


AdaBoost: Step-size
In the AdaBoost setting we can find the optimal step-size by line-search every
time we take a step direction

Given L, H_t, h_t we want to solve,

    α_t = argmin_{α ∈ ℝ+} (1/n) Σ_{i=1}^{n} L( y_i, H_t(x_i) + α h_t(x_i) )
        = argmin_{α ∈ ℝ+} (1/n) Σ_{i=1}^{n} e^{−y_i (H_t(x_i) + α h_t(x_i))}

We differentiate w.r.t. α and equate with zero,

    Σ_{i=1}^{n} −y_i h_t(x_i) e^{−(y_i H_t(x_i) + y_i α h_t(x_i))} = 0


AdaBoost: Step-size
Now we have to extract α_t by solving,

    Σ_{i=1}^{n} y_i h_t(x_i) e^{−(y_i H_t(x_i) + y_i α_t h_t(x_i))} = 0

    − Σ_{i: h_t(x_i) y_i = 1} e^{−(y_i H_t(x_i) + α_t)} + Σ_{i: h_t(x_i) y_i = −1} e^{−(y_i H_t(x_i) − α_t)} = 0

    − Σ_{i: h_t(x_i) y_i = 1} w_i^t e^{−α_t} + Σ_{i: h_t(x_i) y_i = −1} w_i^t e^{α_t} = 0

    −(1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 0

    e^{2 α_t} = (1 − ε_t) / ε_t

    α_t = (1/2) ln( (1 − ε_t) / ε_t )

Finding the optimal α_t in such a closed form (which is unusual) allows
AdaBoost to converge extremely fast

AdaBoost: Step-size illustration

α is also called the “learning rate”

    α_t(ε_t) = (1/2) ln( (1 − ε_t) / ε_t )

which is a decreasing function of the error ε_t

Also note that lim_{ε_t → 0} α_t = +∞, which indicates that for a small error
we can make a huge step

We observe that the maximal error value in this case is ε_t = 1/2 (because
α > 0); if ε_t = 1/2 the step-size α would be zero, indeed we don’t want to go
in this direction
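
A quick numerical check of this behaviour:

    import numpy as np

    def adaboost_alpha(eps):
        return 0.5 * np.log((1 - eps) / eps)

    for eps in [0.05, 0.2, 0.4, 0.5]:
        print(eps, adaboost_alpha(eps))
    # 0.05 -> 1.47, 0.2 -> 0.69, 0.4 -> 0.20, 0.5 -> 0.0 (useless direction)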


AdaBoost: weights update and re-normalization


After you take a step, i.e. H_t = H_{t−1} + α_t h_t, you need to re-compute all
the weights and then re-normalize

Weights behavior: up to the normalization constant,

    w_i^{t+1} ∝ e^{−y_i H_t(x_i)} = e^{−y_i (H_{t−1}(x_i) + α_t h_t(x_i))} = e^{−y_i H_{t−1}(x_i)} e^{−α_t y_i h_t(x_i)} ∝ w_i^t e^{−α_t y_i h_t(x_i)}

  - If a point (x_i, y_i) is correctly classified (y_i h_t(x_i) > 0), its
    weight is decreased
  - while if it is incorrectly classified (y_i h_t(x_i) < 0), its weight is
    increased

Then, the re-normalizing factor of the weights at round t + 1 is

    Σ_{i=1}^{n} w_i^t e^{−α_t y_i h_t(x_i)}
i=1

Pseudo-code describing the AdaBoost approach

1: Require: a dataset D = (x_i, y_i)_{1 ≤ i ≤ n}, the size T of the ensemble
2: Initialization: H_0 = 0 and ∀i, w_i^0 = 1/n
3: for t = 1 to T do
4:   h_t = argmin_{h ∈ F} Σ_{i: h(x_i) ≠ y_i} w_i^t
5:   ε_t = Σ_{i: h_t(x_i) ≠ y_i} w_i^t
6:   if ε_t < 1/2 then
7:     α_t = (1/2) ln( (1 − ε_t) / ε_t )
8:     H_t = H_{t−1} + α_t h_t
9:     ∀i, w_i^{t+1} ← w_i^t e^{−α_t y_i h_t(x_i)} / Σ_{i=1}^{n} w_i^t e^{−α_t y_i h_t(x_i)}
10:  else
11:    return: H_t
12:  end if
13: end for
return: H_T

Algorithm 2: AdaBoost in pseudo-code (for classification)
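
A compact Python sketch of Algorithm 2, assuming decision stumps fitted with
sample weights as the base classifiers (scikit-learn's DecisionTreeClassifier
plays the role of the argmin over F; the toy dataset is arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """AdaBoost with decision stumps; y must take values in {-1, +1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)                  # w_i^0 = 1/n
        learners, alphas = [], []
        for t in range(T):
            h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = h.predict(X)
            eps = np.sum(w * (pred != y))        # weighted error eps_t
            if eps >= 0.5 or eps == 0.0:         # stop if no better than chance (or already perfect)
                break
            alpha = 0.5 * np.log((1 - eps) / eps)
            w = w * np.exp(-alpha * y * pred)    # re-weight ...
            w /= w.sum()                         # ... and re-normalize
            learners.append(h)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
        return np.sign(scores)

    # Usage on a toy problem
    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    learners, alphas = adaboost_fit(X, y)
    print("training error:", np.mean(adaboost_predict(learners, alphas, X) != y))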

AdaBoost: Few remarks

Note that we necessarily have ε_t(h_t) < 1/2: otherwise, the classifier does
worse than random and −h_t is a better classifier, with ε_t(−h_t) < 1/2

To help you understand the idea, think about the following:
  - if F is negation closed (i.e. ∀h ∈ F we must also have −h ∈ F),
  - as h was found by minimizing the error, this is a contradiction!

The AdaBoost inner loop can terminate when the error ε_t = 1/2
  - in most cases it will converge to 1/2 over time

In that case the latest weak learner h is only as good as a coin toss and
cannot benefit the ensemble ⇒ boosting terminates


To conclude
Boosting is a great way to turn a weak classifier into a strong classifier:
  - Boosting algorithms combine weak models trained sequentially, giving more
    importance to the training samples on which the predictions are bad
  - Boosting defines a whole family of algorithms, including Gradient
    Boosting, AdaBoost and many others...

GBRT is one of the most popular algorithms for “Learning to Rank”, the branch
of machine learning focused on learning ranking functions (e.g., for web
search engines)

Inspired by Breiman’s Bagging, Stochastic Gradient Boosting subsamples the
training data for each weak learner
  - This combines the benefits of bagging and boosting

AdaBoost is an extremely powerful algorithm that turns any weak learner able
to classify any weighted version of the training set with below 0.5 error into
a strong learner whose training error decreases exponentially

If you are curious

The best-known implementation of Gradient Boosting is XGBoost (the name is
often treated as a synonym for gradient boosting)

The code is available at https://github.com/dmlc/xgboost
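
For example, a minimal use of its scikit-learn-style interface (assuming the
xgboost package is installed; the hyperparameter values below are arbitrary):

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.randn(500)

    # n_estimators = number of boosting rounds T, learning_rate = shrinkage alpha,
    # max_depth = depth of each regression tree
    model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
    model.fit(X, y)
    print("training MSE:", np.mean((y - model.predict(X)) ** 2))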


References I
Bartlett, P., Freund, Y., Lee, W. S., and Schapire, R. E. (1998).
Boosting the margin: A new explanation for the effectiveness of voting methods.
The annals of statistics, 26(5):1651–1686.

Chen, T. and Guestrin, C. (2016).


XGBoost: A scalable tree boosting system.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages
785–794.
Freund, Y. and Schapire, R. E. (1995).
A decision-theoretic generalization of on-line learning and an application to boosting.
In European conference on computational learning theory, pages 23–37. Springer.

Friedman, J. H. (2001).
Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232.

Hastie, T., Tibshirani, R., and Friedman, J. (2009).


The elements of statistical learning: data mining, inference, and prediction.
Springer Science & Business Media.

Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999).


Boosting algorithms as gradient descent.
Advances in neural information processing systems, 12.

Schapire, R. E. (1990).
The strength of weak learnability.
Machine learning, 5(2):197–227.
