
Ensemble learning from theory to practice

Course 3: Boosting

Myriam Tami

myriam.tami@centralesupelec.fr

January - March, 2023



1 Motivations

2 Forward stagewise additive modeling (FSAM)

3 Gradient Boosting - “Anyboost”

4 Gradient Boosted Regression Tree

5 AdaBoost

6 Summary


Reminder

Course 1: Decision trees suffer from a bias and variance problem (they are weak
learners: their error will not tend to zero)
Course 2: Bagging/Random forests tackle the variance problem by aggregating
decision trees,

    h(x) := (1/B) Σ_{b=1}^{B} f_b(x)

where f_b : X → Y are basis functions from the hypothesis space F, which is the
set of CART trees


Problem

If the hypothesis class F is a set of predictors having large bias and high
training error (e.g., CART trees with very limited depth)

Question:
Can weak learners h_t be combined to generate a strong learner with low bias?

Solution: Yes [Schapire, 1990], create ensemble predictors

    H_T(x) := Σ_{t=1}^{T} α_t h_t(x)

where α_t ∈ ℝ and h_t ∈ F are chosen based on the training data D


Adaptive Basis Function Model


Learning the adaptive basis function expansion H_T over F,

    H_T(x) = Σ_{t=1}^{T} α_t h_t(x)

means choosing α_1, . . . , α_T ∈ ℝ and h_1, . . . , h_T ∈ F to fit D

We can suppose our base hypothesis space F is parametrized by Θ, such that

    H_T(x) = Σ_{t=1}^{T} α_t h_t(x, Θ_t)

where (Θ_t)_{1 ≤ t ≤ T} ∈ Θ characterises h_t

E.g., h_t could be a decision tree where Θ_t characterises the t-th tree in
terms of split variables, cutpoints at each node and terminal-node values
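
For concreteness, here is a minimal Python sketch (illustrative only, with
made-up stumps and weights) of evaluating such an adaptive basis function
expansion:

    import numpy as np

    # Hypothetical weak learners: depth-1 trees ("stumps"), each described by a
    # split variable, a cutpoint and two terminal-node values (the role of Theta_t).
    def make_stump(feature, cutpoint, left_value, right_value):
        return lambda x: np.where(x[:, feature] <= cutpoint, left_value, right_value)

    stumps = [make_stump(0, 0.5, -1.0, 1.0),
              make_stump(1, 0.2, 0.5, -0.5)]
    alphas = [0.7, 0.3]                       # arbitrary weights alpha_t

    def H(x):
        # H_T(x) = sum_t alpha_t * h_t(x)
        return sum(a * h(x) for a, h in zip(alphas, stumps))

    x = np.array([[0.3, 0.9], [0.8, 0.1]])
    print(H(x))                               # ensemble predictions for two inputs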

An intuition for the process of construction

Suppose

    H_T(x) = Σ_{t=1}^{T} α_t h_t(x)

is built in an iterative fashion (“sequential ensemble”)

In iteration t, we add the predictor α_t h_t(x) to the ensemble,

    H_t(x) = H_{t−1}(x) + α_t h_t(x)

At test time, we evaluate all predictors and return the weighted sum

The process of construction is similar to gradient descent: instead of updating
the model parameters in each iteration, we add functions to our ensemble


Gradient-Based Method

We’ll consider learning by empirical risk minimization


Suppose our base hypothesis space F is parametrized by Θ
Write the empirical risk minimization objective function as

    R(H) = R(α_1, . . . , α_T, Θ_1, . . . , Θ_T) = (1/n) Σ_{i=1}^{n} L2( y_i, Σ_{t=1}^{T} α_t h_t(x_i, Θ_t) )

where L2 denotes the squared loss function

How to optimize R? How to learn?
Can we differentiate R w.r.t. the α_t’s and Θ_t’s? Optimize with the gradient
descent method?
For some hypothesis spaces (Θ = ℝⁿ) and typical loss functions, yes!


Parametrization by axes corresponding to the values taken by the training
points

H (and h) are now parametrized by the training points as axes of the Euclidean
n-space

Ex: n = 3, Θ = ℝ³, D_3 = {(x_1, 1), (x_2, −1), (x_3, 1)}, i.e. y = (1, −1, 1)ᵀ

The first training point can take its input values along the whole first axis

What if Gradient Based Method does not apply?

Even if we could parameterize trees with the Euclidean n-space ℝⁿ (h_t becomes
an n-length vector of its predicted values),
  - the predictions (of H) would not change continuously w.r.t. the
    observations x_i ∈ ℝⁿ,
  - and so are certainly not differentiable

In this course we’ll discuss gradient boosting, which applies whenever
  - the loss function is (sub)differentiable w.r.t. the training predictions
    H(x_i), and
  - it’s possible to do regression with the base hypothesis space F
    (e.g., regression trees)


Gradient-Based Method

Gradient descent is based on the following:

If the multi-variable function L2 is defined and differentiable in a
neighborhood of a point H(x_i), then L2 decreases fastest if one goes from
H(x_i) in the direction of the negative gradient of L2 at H(x_i), −∇L2(H(x_i)):

    H_{t+1}(x_i) = H_t(x_i) − α ∇L2(H_t(x_i))

for α ∈ ℝ+ small enough

Idea: ∇L2(H_t(x_i)) is subtracted from H_t(x_i) to move against the gradient,
toward the local minimum
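
A tiny numerical sketch of this update for the squared loss L2(y, H) = (y − H)²,
whose gradient w.r.t. the prediction is −2(y − H) (the targets and step size
below are made up):

    import numpy as np

    y = np.array([1.0, -1.0, 1.0])       # targets
    H = np.zeros(3)                      # current predictions H_t(x_i)
    alpha = 0.1                          # small step size

    for step in range(50):
        grad = -2.0 * (y - H)            # gradient of L2 at the current predictions
        H = H - alpha * grad             # H_{t+1}(x_i) = H_t(x_i) - alpha * grad
    print(H)                             # approaches y = (1, -1, 1)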


An illustration of the Gradient descent-based method

Consider a hypothesis space parametrized by a Euclidean 3-space

H is a prediction of y

To improve the prediction we add −α ∇L2(H), to move against the gradient
toward the local minimum

We repeat this step by step until convergence to the local minimum

Another intuition
Come back to our process of construction of the ensemble
Assume we have already finished t iterations and have an ensemble predictor
H_t(x)

In iteration t + 1 we want to add one more weak learner h_{t+1} to the ensemble

To this end, we assume α ∈ [0, 1] is fixed and we optimize

    h_{t+1} = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} L2( H_t(x_i) + α h(x_i), y_i )        (1)

Then we add it to our ensemble,

    H_{t+1} := H_t + α h_{t+1}

Questions:
  - Is h_{t+1} the negative gradient of L2 at H(x)?
  - How can we find such an h ∈ F? Solve Eq. (1)? What about α?

Overview

1 Forward stagewise additive modeling (FSAM)


  - L2-Boosting
  - Exponential loss gives AdaBoost

2 Gradient Boosting - “Anyboost”

3 Gradient Boosted Regression Trees (GBRT)

4 AdaBoost

5 Conclusion


Forward Stagewise Additive Modeling (FSAM)


FSAM is an iterative optimization algorithm for fitting adaptive basis function
models

Start with H_0 = 0 (the zero function)

After t − 1 iterations, we have

    H_{t−1}(x) = Σ_{τ=1}^{t−1} α_τ h_τ(x)

In the t-th iteration, we want to find
  - a step direction h_t ∈ F (i.e., a basis function) and
  - a step size α_t > 0
such that

    H_t = H_{t−1} + α_t h_t

improves the objective function value by as much as possible

FSAM for Empirical Risk Minimization

1 Initialize H_0(x) = 0
2 For t = 1 to T
    1 Solve the optimization problem (“FSAM step”)

          (α_t, h_t) = argmin_{α ∈ ℝ, h ∈ F} (1/n) Σ_{i=1}^{n} L( y_i, H_{t−1}(x_i) + α h(x_i) )        (2)

      for a loss function L; the term α h(x_i) is the new piece added to the
      ensemble
    2 Set H_t = H_{t−1} + α_t h_t
3 Return: H_T

L2-Boosting example

Suppose we use the square loss L2. Then in each step we minimize the objective
function J,

    J(α, h) = (1/n) Σ_{i=1}^{n} ( y_i − [ H_{t−1}(x_i) + α h(x_i) ] )²

where α h(x_i) is the new piece

If F is closed under rescaling (i.e., if h ∈ F, then αh ∈ F for all α ∈ ℝ),
then we don’t need α

Take α = 1 and minimize

    J(h) = (1/n) Σ_{i=1}^{n} ( [y_i − H_{t−1}(x_i)] − h(x_i) )²

This is just fitting the residuals with least-squares regression!
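
A minimal sketch of one such boosting step, using scikit-learn's
DecisionTreeRegressor as the base hypothesis space F (the data and tree depth
below are arbitrary choices for illustration):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

    H_prev = np.zeros(200)                  # predictions of H_{t-1}
    residuals = y - H_prev                  # y_i - H_{t-1}(x_i)

    # Minimizing J(h) is exactly least-squares regression on the residuals
    h_t = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    H_new = H_prev + h_t.predict(X)         # alpha = 1 here
    print("empirical risk before:", np.mean((y - H_prev) ** 2),
          "after:", np.mean((y - H_new) ** 2))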

AdaBoost example

AdaBoost is used for classification problems

Outcome space Y = {−1, 1}

A score function h : X → ℝ

Margin for an example (x, y) is m = y h(x):
  - m > 0 ⟺ classification correct
  - larger m is better


AdaBoost example: Exponential Loss


Introduce the exponential loss: L(y, H(x)) = exp(−y H(x))

Figure: The horizontal axis corresponds to the margin and the vertical one to
the loss value. The black curve is the Misclassification loss, the blue one is
the Exponential loss, the red one is the Logistic loss and the green one is the
Hinge loss.
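
The curves of this figure can be reproduced with a short script such as the
following sketch (the rescaling of the logistic loss so that it equals 1 at
zero margin is a plotting choice):

    import numpy as np
    import matplotlib.pyplot as plt

    m = np.linspace(-2, 2, 400)                      # margin m = y * h(x)
    misclassification = (m <= 0).astype(float)       # 0/1 loss
    exponential = np.exp(-m)
    logistic = np.log(1 + np.exp(-m)) / np.log(2)    # rescaled to equal 1 at m = 0
    hinge = np.maximum(0.0, 1 - m)

    for curve, label in [(misclassification, "misclassification"),
                         (exponential, "exponential"),
                         (logistic, "logistic"),
                         (hinge, "hinge")]:
        plt.plot(m, curve, label=label)
    plt.xlabel("margin m = y h(x)"); plt.ylabel("loss"); plt.legend(); plt.show()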

AdaBoost example

FSAM with Exponential loss

Consider the classification setting: Y = {−1, 1}

Take the loss function to be the exponential loss

    L(y, H(x)) = exp(−y H(x))

Let F be a base hypothesis space of binary classifiers h : X → {−1, 1}

AdaBoost is a version of FSAM


AdaBoost example: Exponential Loss

Note that the exponential loss puts a very large weight on bad
misclassifications


AdaBoost / Exponential Loss: Robustness Issues

When the Bayes error rate is high (e.g., P(y ≠ f(x)) = 0.25)
  - e.g. there’s some intrinsic randomness in the label
  - e.g. training examples with the same input, but different classifications

The best we can do is predict the most likely class for each x

Some training predictions should be wrong
  - because the example doesn’t have the majority class
  - AdaBoost / exponential loss puts a lot of focus on getting those right

Thus, empirically, AdaBoost has degraded performance in situations with
  - a high Bayes error rate, or when there’s
  - high “label noise”


FSAM for other loss functions

We know how to do FSAM for certain loss functions
  - e.g., the square loss, the exponential loss

In each case, it happens to reduce to another optimization problem
(cf. Eq. (2)) that we know how to solve

However, it is not clear how to do FSAM in general, for example for the
logistic loss or the cross-entropy loss (a generalization of the logistic
loss - Eq. (5), Course 1)

The next topic will be the resolution of the optimization problem (2)


FSAM is an iterative optimization

The FSAM step, Eq. (2), is

    (α_t, h_t) = argmin_{α ∈ ℝ, h ∈ F} (1/n) Σ_{i=1}^{n} L( y_i, H_{t−1}(x_i) + α h(x_i) )

where α h(x_i) is the new piece

Hard part: finding the best step direction h

What if we looked for the locally best step direction?
  - like in gradient descent


Concept of Functional Gradient


We want to minimize the objective function

    J(H) = Σ_{i=1}^{n} L( y_i, H(x_i) )

We want to take the gradient w.r.t. “H”, which isn’t a variable but a function

J(H) only depends on H at the n training points

Let’s see the function H as a vector and define

    H := ( H(x_1), . . . , H(x_n) )ᵀ

and write the objective function as

    J(H) = Σ_{i=1}^{n} L( y_i, H_i )

→ We will differentiate over each component of H, seen as a variable



Functional Gradient Descent

Consider gradient descent on

    J(H) = Σ_{i=1}^{n} L( y_i, H_i )

We want to optimize over H

The negative gradient step direction at H is

    −g = −∇_H J(H) = −( ∂_{H_1} L(y_1, H_1), . . . , ∂_{H_n} L(y_n, H_n) )

which we can easily calculate

−g ∈ ℝⁿ is the direction in which we want to change each of our n predictions
on the training data


Functional Gradient stepping

Figure: Gradient descent illustration. (source: Ensemble Methods in Data Mining, Seni and Elder, 2010)

The empirical risk R(H) is plotted as a function of H = (H(x_1), H(x_2)), which
are the predictions on two training points.

Starting with an initial guess, a sequence converging towards the minimum of
R(H) is generated by moving in the direction of the negative of the gradient
Functional Gradient Descent


The step direction is

    −g = −∇_H J(H) = −( ∂_{H_1} L(y_1, H_1), . . . , ∂_{H_n} L(y_n, H_n) )ᵀ

Exercise 1
Compute −g for the square loss L2

The components of −g are also called the “pseudo-residuals”
  - for the square loss, they are exactly the residuals (up to a constant
    factor): −∂_{H_i} (y_i − H_i)² = 2 (y_i − H_i)

Find the base hypothesis h ∈ F that is most collinear with the negative
gradient, i.e. solve the optimization problem (project the gradient on F)

    h_t = argmin_{h ∈ F} ⟨g, h⟩_n        (3)

Therefore the update rule is H_{t+1} = H_t + α_t h_t
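
A small sketch of the pseudo-residual computation, for the square loss and, for
comparison, the exponential loss used later by AdaBoost (the labels and
predictions below are made up):

    import numpy as np

    y = np.array([1.0, -1.0, 1.0])
    H = np.array([0.2, 0.3, -0.1])          # current training predictions H_i

    # Square loss L2(y, H) = (y - H)^2  ->  -dL/dH = 2 (y - H): the residuals
    neg_grad_l2 = 2.0 * (y - H)

    # Exponential loss L(y, H) = exp(-y H)  ->  -dL/dH = y * exp(-y H)
    neg_grad_exp = y * np.exp(-y * H)

    print(neg_grad_l2)    # [ 1.6 -2.6  2.2]
    print(neg_grad_exp)   # approx. [ 0.82 -1.35  1.11]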



“Projected" Functional Gradient Stepping

Figure: The base learner most collinear to the negative gradient vector is
chosen at every step (source: Ensemble Methods in Data Mining, Seni and Elder, 2010)

T (x, p) ∈ F is our actual step direction (i.e., ht )


T (x, p) is like the projection of −g = −∇R(H) onto F

Functional Gradient Descent: Step size

Finally, we choose a step-size

Option 1 (Line-search):

    α_t = argmin_{α ∈ ℝ+} (1/n) Σ_{i=1}^{n} L( y_i, H_{t−1}(x_i) + α h_t(x_i) )

i.e., we look for the learning rate that will imply the maximum decrease of the
empirical risk

Option 2 (Shrinkage parameter - more common):
  - We consider α = 1 to be the full gradient step
  - Choose a fixed α ∈ [0, 1] - called a shrinkage parameter
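
Option 1 can be sketched as follows, here with the square loss and scipy's
bounded scalar minimizer (the vectors y, H and h are made-up placeholders):

    import numpy as np
    from scipy.optimize import minimize_scalar

    y = np.array([1.2, -0.7, 0.4, 2.0])
    H = np.array([0.5, 0.0, 0.1, 1.0])        # H_{t-1}(x_i)
    h = np.array([1.0, -1.0, 0.5, 1.5])       # h_t(x_i), the chosen step direction

    def empirical_risk(alpha):
        # (1/n) sum_i L2( y_i, H_{t-1}(x_i) + alpha * h_t(x_i) )
        return np.mean((y - (H + alpha * h)) ** 2)

    result = minimize_scalar(empirical_risk, bounds=(0.0, 10.0), method="bounded")
    print("line-search step-size:", result.x)

    # Option 2 would instead fix a shrinkage parameter, e.g. alpha = 0.1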


Gradient Boosted Tree

A regression case: Y = ℝ
  - Classification is also possible: Y = {+1, −1}

Weak learners h ∈ F are regressors: ∀x, h(x) ∈ ℝ
  - typically, fixed-depth (e.g., depth = 4) regression trees (hence the name)

The step-size α is fixed to a small constant (a hyperparameter to tune)

The loss function can be any differentiable convex loss that decomposes over
the samples: L(H) = Σ_{i=1}^{n} L(H(x_i))


Gradient Boosted Regression Tree


The process is the same as presented before
1 We can fix the step-size α to a small constant or tune it as a
  hyperparameter (e.g., fixing α = 0.1; this is the more common choice)
2 The step direction is exactly the residuals for the square loss L2,

      −g = ( (y_1 − H_1), . . . , (y_n − H_n) )ᵀ

3 We project the gradient on F, i.e. fit the h ∈ F that best approximates −g,
  as the step direction,

      h_t = argmin_{h ∈ F} ⟨g, h⟩_n

4 Update rule,

      H_t(x) = H_{t−1}(x) + α h_t(x)


Functional Gradient Descent: Step direction


Detail the 3rd step (projection of the gradient on F):

    h_t = argmin_{h ∈ F} ⟨g, h⟩_n
        = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} −2 (y_i − H(x_i)) h(x_i)        [ (y_i − H(x_i)) = −g_i ]
        = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} [ (y_i − H(x_i))² − 2 (y_i − H(x_i)) h(x_i) + (h(x_i))² ]
          (the added terms (y_i − H(x_i))² and (h(x_i))² are constants w.r.t. the choice of h)
        = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} ( −g_i − h(x_i) )²

This is a least squares regression problem over the hypothesis space F

Remark: (h(x_i))² can be seen as a constant by decoupling the length of a
vector and its direction

GBRT: Step direction interpretation

The final optimization problem to solve,

    h_t = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} ( −g_i − h(x_i) )²

is simply fitting the residuals

We build a new tree for a different set of “labels” −g_1, . . . , −g_n

g is the vector pointing from y to H

Importantly, this works for any differentiable and convex loss function: the
next weak learner h will always be the regression tree minimizing the squared
loss to the pseudo-residuals


Gradient Boosted Regression Tree in Pseudo Code



1: Require: a dataset D = (x_i, y_i)_{1 ≤ i ≤ n}, a loss L, a number of
   iterations T, a value α
2: Initialization: H = 0
3: for t = 1 to T do
4:   for all i do
5:     −g_i = y_i − H(x_i)
6:   end for
7:   h = argmin_{h ∈ F} (1/n) Σ_{i=1}^{n} ( h(x_i) − (−g_i) )²
8:   H ← H + α h
9: end for
return: H

Algorithm 1: Pseudo-code describing the GBRT approach (without line search for
the step-size)
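
A compact Python sketch of Algorithm 1, assuming the square loss and
scikit-learn regression trees as weak learners (the synthetic dataset and
hyperparameter values are arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gbrt_fit(X, y, T=100, alpha=0.1, max_depth=4):
        """Gradient Boosted Regression Trees for the square loss (no line search)."""
        H = np.zeros(len(y))                # H = 0
        trees = []
        for t in range(T):
            neg_grad = y - H                # -g_i = y_i - H(x_i) for the square loss
            h = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
            H += alpha * h.predict(X)       # H <- H + alpha * h
            trees.append(h)
        return trees

    def gbrt_predict(trees, X, alpha=0.1):
        # alpha must match the value used during fitting
        return alpha * sum(h.predict(X) for h in trees)

    # Tiny usage example on synthetic data
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(300)
    trees = gbrt_fit(X, y)
    print("training MSE:", np.mean((y - gbrt_predict(trees, X)) ** 2))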


AdaBoost

A classification case: Y = {+1, −1}

Weak learners h ∈ F are binary: ∀x, h(x) ∈ {+1, −1}

Step-size: we perform line-search to obtain the best step-size α

Loss function: the exponential loss L(H) = Σ_{i=1}^{n} e^{−y_i H(x_i)}


AdaBoost: Step direction


1 We compute the gradient

      ∂_{H_i} L(y_i, H_i) = −y_i e^{−y_i H_i}

2 We project the gradient on F

      h_t = argmin_{h ∈ F} Σ_{i=1}^{n} −y_i e^{−y_i H_t(x_i)} h(x_i)
          = argmin_{h ∈ F} Σ_{i=1}^{n} −w_i^t y_i h(x_i)        (substituting w_i^t = e^{−y_i H_t(x_i)} / Σ_{i=1}^{n} e^{−y_i H_t(x_i)})
          = argmin_{h ∈ F} { Σ_{i=1}^{n} w_i^t 1{h(x_i) ≠ y_i} − Σ_{i=1}^{n} w_i^t 1{h(x_i) = y_i} }
          = argmin_{h ∈ F} Σ_{i=1}^{n} w_i^t 1{h(x_i) ≠ y_i}        (because Σ_{i: h(x_i) = y_i} w_i^t = 1 − Σ_{i: h(x_i) ≠ y_i} w_i^t)

This is the weighted classification error of the training samples



AdaBoost: Step direction

Projecting the gradient on F means fitting a classification tree h that
minimizes the weighted classification error

This tree is then h_t, and we denote this classification error

    ε_t = Σ_{i=1}^{n} w_i^t 1{h_t(x_i) ≠ y_i}

w_i^t is the contribution (weight) of the i-th training point at the t-th
iteration
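
For instance, a minimal computation of ε_t with made-up weights and predictions:

    import numpy as np

    y      = np.array([ 1, -1,  1,  1, -1])          # true labels
    h_pred = np.array([ 1,  1,  1, -1, -1])          # predictions of a candidate h
    w      = np.array([0.1, 0.2, 0.3, 0.2, 0.2])     # weights w_i^t (they sum to 1)

    eps_t = np.sum(w * (h_pred != y))                # weighted classification error
    print(eps_t)                                     # 0.2 + 0.2 = 0.4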


AdaBoost: Step-size
In the AdaBoost setting we can find the optimal step-size by line-search every
time we take a step direction

Given L, H_t, h_t we want to solve,

    α_t = argmin_{α ∈ ℝ+} (1/n) Σ_{i=1}^{n} L( y_i, H_t(x_i) + α h_t(x_i) )
        = argmin_{α ∈ ℝ+} (1/n) Σ_{i=1}^{n} e^{−y_i (H_t(x_i) + α h_t(x_i))}

We differentiate w.r.t. α and equate with zero,

    Σ_{i=1}^{n} −y_i h_t(x_i) e^{−(y_i H_t(x_i) + y_i α h_t(x_i))} = 0


AdaBoost: Step-size
Now we have to extract α_t by solving,

    Σ_{i=1}^{n} y_i h_t(x_i) e^{−(y_i H_t(x_i) + y_i α_t h_t(x_i))} = 0

    − Σ_{i: h_t(x_i) y_i = 1} e^{−(y_i H_t(x_i) + α_t)} + Σ_{i: h_t(x_i) y_i = −1} e^{−(y_i H_t(x_i) − α_t)} = 0

    − Σ_{i: h_t(x_i) y_i = 1} w_i^t e^{−α_t} + Σ_{i: h_t(x_i) y_i = −1} w_i^t e^{α_t} = 0

    −(1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 0

    e^{2 α_t} = (1 − ε_t) / ε_t

    α_t = (1/2) ln( (1 − ε_t) / ε_t )

Finding the optimal α_t in such a closed form (which is unusual) allows
AdaBoost to converge extremely fast

AdaBoost: Step-size illustration

α is also called the “learning rate”

    α_t(ε_t) = (1/2) ln( (1 − ε_t) / ε_t )

which is a decreasing function of the error ε_t

Also note that lim_{ε_t → 0} α_t = +∞, which indicates that for a small error
we can make a huge step

We observe that the maximal error value in this case is ε_t = 1/2 (because
α > 0); if ε_t = 1/2 the step-size α would be zero, indeed we don’t want to go
in this direction
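
A quick numerical check of this behaviour:

    import numpy as np

    def adaboost_alpha(eps):
        return 0.5 * np.log((1 - eps) / eps)

    for eps in [0.05, 0.2, 0.4, 0.5]:
        print(eps, adaboost_alpha(eps))
    # 0.05 -> 1.47, 0.2 -> 0.69, 0.4 -> 0.20, 0.5 -> 0.0 (useless direction)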


AdaBoost: weights update and re-normalization


After you take a step, i.e. H_t = H_{t−1} + α_t h_t, you need to re-compute all
the weights and then re-normalize

Weights behavior: up to the normalization constant,

    w_i^{t+1} ∝ e^{−y_i H_t(x_i)} = e^{−y_i (H_{t−1}(x_i) + α_t h_t(x_i))} = e^{−y_i H_{t−1}(x_i)} e^{−α_t y_i h_t(x_i)} ∝ w_i^t e^{−α_t y_i h_t(x_i)}

  - If a point (x_i, y_i) is correctly classified (y_i h_t(x_i) > 0), its
    weight is decreased
  - while if it is incorrectly classified (y_i h_t(x_i) < 0), its weight is
    increased

Then, the re-normalizing factor of the weights at round t + 1 is

    Σ_{i=1}^{n} w_i^t e^{−α_t y_i h_t(x_i)}
i=1

Pseudo-code describing the AdaBoost approach

1: Require: a dataset D = (x_i, y_i)_{1 ≤ i ≤ n}, the size T of the ensemble
2: Initialization: H_0 = 0 and ∀i, w_i^0 = 1/n
3: for t = 1 to T do
4:   h_t = argmin_{h ∈ F} Σ_{i: h(x_i) ≠ y_i} w_i^t
5:   ε_t = Σ_{i: h_t(x_i) ≠ y_i} w_i^t
6:   if ε_t < 1/2 then
7:     α_t = (1/2) ln( (1 − ε_t) / ε_t )
8:     H_t = H_{t−1} + α_t h_t
9:     ∀i, w_i^{t+1} ← w_i^t e^{−α_t y_i h_t(x_i)} / Σ_{i=1}^{n} w_i^t e^{−α_t y_i h_t(x_i)}
10:  else
11:    return: H_t
12:  end if
13: end for
return: H_T

Algorithm 2: AdaBoost in pseudo-code (for classification)
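
A compact Python sketch of Algorithm 2, assuming decision stumps fitted with
sample weights as the base classifiers (scikit-learn's DecisionTreeClassifier
plays the role of the argmin over F; the toy dataset is arbitrary):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        """AdaBoost with decision stumps; y must take values in {-1, +1}."""
        n = len(y)
        w = np.full(n, 1.0 / n)                  # w_i^0 = 1/n
        learners, alphas = [], []
        for t in range(T):
            h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            pred = h.predict(X)
            eps = np.sum(w * (pred != y))        # weighted error eps_t
            if eps >= 0.5 or eps == 0.0:         # stop if no better than chance (or already perfect)
                break
            alpha = 0.5 * np.log((1 - eps) / eps)
            w = w * np.exp(-alpha * y * pred)    # re-weight ...
            w /= w.sum()                         # ... and re-normalize
            learners.append(h)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(learners, alphas, X):
        scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
        return np.sign(scores)

    # Usage on a toy problem
    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
    learners, alphas = adaboost_fit(X, y)
    print("training error:", np.mean(adaboost_predict(learners, alphas, X) != y))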

AdaBoost: Few remarks

Note that we necessarily have ε_t(h_t) < 1/2: otherwise, the classifier does
worse than random and −h_t is a better classifier, with ε_t(−h_t) < 1/2

To help you understand the idea, think about the following:
  - if F is negation closed (i.e. ∀h ∈ F we must also have −h ∈ F),
  - as h was found by minimizing the error, this is a contradiction!

The AdaBoost inner loop can terminate when the error ε_t = 1/2
  - in most cases it will converge to 1/2 over time

In that case the latest weak learner h is only as good as a coin toss and
cannot benefit the ensemble ⇒ boosting terminates


To conclude
Boosting is a great way to turn a weak classifier into a strong classifier:
  - Boosting algorithms combine weak models trained sequentially, giving more
    importance to the training samples on which the predictions are bad
  - Boosting defines a whole family of algorithms, including Gradient
    Boosting, AdaBoost and many others...

GBRT is one of the most popular algorithms for “Learning to Rank”, the branch
of machine learning focused on learning ranking functions (e.g., for web
search engines)

Inspired by Breiman’s Bagging, Stochastic Gradient Boosting subsamples the
training data for each weak learner
  - This combines the benefits of bagging and boosting

AdaBoost is an extremely powerful algorithm that turns any weak learner able
to classify any weighted version of the training set with below 0.5 error into
a strong learner whose training error decreases exponentially

If you are curious

The best-known implementation of Gradient Boosting is XGBoost (the name is
often treated as a synonym for gradient boosting)

The code is available at https://github.com/dmlc/xgboost
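
For example, a minimal use of its scikit-learn-style interface (assuming the
xgboost package is installed; the hyperparameter values below are arbitrary):

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.randn(500)

    # n_estimators = number of boosting rounds T, learning_rate = shrinkage alpha,
    # max_depth = depth of each regression tree
    model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
    model.fit(X, y)
    print("training MSE:", np.mean((y - model.predict(X)) ** 2))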


References I
Bartlett, P., Freund, Y., Lee, W. S., and Schapire, R. E. (1998).
Boosting the margin: A new explanation for the effectiveness of voting methods.
The annals of statistics, 26(5):1651–1686.

Chen, T. and Guestrin, C. (2016).


XGBoost: A scalable tree boosting system.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages
785–794.
Freund, Y. and Schapire, R. E. (1995).
A decision-theoretic generalization of on-line learning and an application to boosting.
In European conference on computational learning theory, pages 23–37. Springer.

Friedman, J. H. (2001).
Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232.

Hastie, T., Tibshirani, R., and Friedman, J. (2009).


The elements of statistical learning: data mining, inference, and prediction.
Springer Science & Business Media.

Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999).


Boosting algorithms as gradient descent.
Advances in neural information processing systems, 12.

Schapire, R. E. (1990).
The strength of weak learnability.
Machine learning, 5(2):197–227.
