
Overview of AdaBoost: Reconciling its views to better

understand its dynamics

Perceval Beja-Battais
Centre Borelli
ENS Paris-Saclay, Université Paris-Saclay
91190 Gif-sur-Yvette
perceval.beja-battais@ens-paris-saclay.fr
arXiv:2310.18323v1 [cs.LG] 6 Oct 2023

October 31, 2023

Contents
1 Introduction, problem statement & notations 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Views of AdaBoost 5
2.1 The original view: a PAC learning algorithm . . . . . . . . . . . . . . . . . 5
2.1.1 What is a PAC learning algorithm? . . . . . . . . . . . . . . . . . . . 5
2.1.2 AdaBoost: the original formulation . . . . . . . . . . . . . . . . . . 5
2.1.3 Real AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 AdaBoost as successive optimization problems . . . . . . . . . . . . . . . . 7
2.2.1 AdaBoost as a gradient descent . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 AdaBoost as an additive model . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 AdaBoost as an entropy projection . . . . . . . . . . . . . . . . . . . 12
2.2.4 AdaBoost as a mirror descent (successive min-max optimization
problem) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 AdaBoost as traditional machine learning methods . . . . . . . . . . . . . 18
2.3.1 Why does AdaBoost generalize that well? . . . . . . . . . . . . . . . 18
2.3.2 AdaBoost as a Kernel method - Boosting for regression . . . . . . . 21
2.3.3 AdaBoost as a Product of Experts . . . . . . . . . . . . . . . . . . . 23
2.3.4 AdaBoost as a dynamical system: experiments and theoretical
insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Conclusion & Acknowledgements 30


3.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Appendices 34


A How to choose the base space of estimators? 34


A.1 Too weak or too strong estimators . . . . . . . . . . . . . . . . . . . . . . . 34
A.2 How to detect too good or too weak estimators? . . . . . . . . . . . . . . . 35


Abstract
Boosting methods were introduced in the late 1980s, growing out of the theoretical
framework of PAC learning. The main idea of boosting is to combine weak learners to
obtain a strong learner. The weak learners are obtained iteratively by a heuristic that
tries to correct the mistakes of the previous weak learner. In 1995, Freund and
Schapire [18] introduced AdaBoost, a boosting algorithm that is still widely used today.
Since then, many views of the algorithm have been proposed to properly tame its
dynamics. In this paper, we try to cover all the views that one can have on AdaBoost.
We start with the original view of Freund and Schapire before covering the other views
and unifying them within the same formalism. We hope this paper will help the
non-expert reader to better understand the dynamics of AdaBoost and how the different
views are equivalent and related to each other.

Keywords: Boosting, AdaBoost, dynamical systems, PAC learning, gradient descent,
mirror descent, additive models, entropy projection, diversity, margin, generalization
error, kernel methods, product of experts, interpolating classifiers, double descent

1 Introduction, problem statement & notations


1.1 Introduction
Machine learning is a dynamic field that encompasses a wide range of algorithms
and methodologies, aiming to enable computational systems to learn from data and
make predictions or decisions without being explicitly programmed. The growth of
computational power, the availability of vast amounts of data, and advancements in
statistical and algorithmic techniques have propelled the field of machine learning to
new heights [29, 25]. Applications of machine learning methods are ubiquitous in
our daily lives, from the personalized recommendations on internet platforms [51, 46,
45] to medicine [6, 49, 8], finance [12, 14, 9, 23], or autonomous driving [27, 48, 33]. One
prominent subset of machine learning methods that has gained considerable attention
and achieved remarkable success in various applications is boosting [18, 44].
At its core, machine learning involves the development of models that can
automatically extract patterns and insights from data. These models learn from
experience, iteratively refining their performance through exposure to labeled examples.
By analyzing patterns and relationships within the data, machine learning algorithms
can make accurate predictions on new, unseen instances, thereby uncovering hidden
knowledge and informing decision-making processes.
Boosting algorithms are a class of machine learning methods (more precisely of
ensemble methods [11, 53, 41]) that aim to enhance the performance of weak learners
by iteratively combining their predictions to create a more powerful and accurate
model [18, 44]. These methods were introduced in the 1990’s and have since emerged as a
cornerstone of contemporary machine learning research. Boosting algorithms iteratively
train a sequence of weak models, each focusing on the instances that were previously
misclassified or had the largest prediction errors. By assigning higher weights to these
challenging instances, subsequent models within the boosting framework can effectively
address their misclassification and improve overall accuracy.


One of the key advantages of boosting methods is their ability to adaptively allocate
resources to challenging examples, thereby mitigating the impact of noise and outliers.
By emphasizing the instances that are difficult to classify, boosting algorithms excel
at capturing complex decision boundaries and achieving high predictive performance.
Furthermore, boosting methods are versatile and can be applied to a variety of learning
tasks, including classification, regression, and ranking problems.
Boosting algorithms come in various flavors, each with its unique characteristics and
strengths. One of the earliest and most influential boosting methods is AdaBoost [18],
short for Adaptive Boosting. AdaBoost iteratively adjusts the weights of misclassified
instances, with subsequent models paying greater attention to these misclassified
examples. Another popular boosting method is Gradient Boosting [35, 4, 21], which
leverages the concept of gradient descent to optimize a loss function by adding weak
models in a stage-wise manner. XGBoost [7, 36, 10] and LightGBM [26, 47, 50] are notable
implementations of gradient boosting algorithms that have gained widespread adoption
due to their scalability and high performance.

1.2 Problem statement
The first boosting methods emerged in the early 1990s, and the proper modern
formalization of boosting was given in 1990 by Schapire [43]. At that time, boosting
methods were seen as PAC learning algorithms. Later, Freund and Schapire [17]
collaborated to dig deeper into boosting algorithms, and in 1995 they introduced
AdaBoost [18], which is still widely used today. Since then, many researchers have tried
to understand the dynamics of AdaBoost, which were not sufficiently understood. To
this day, there is still a lot that we do not know about AdaBoost, especially about the
convergence properties of the algorithm as the number of weak learners increases [3].
The goal of this paper was initially to address the question of the cyclic behavior of
AdaBoost, which was proven in some cases in 2004 by Rudin et al. [40] and can be
observed in many toy examples (see Fig. 2). However, the same authors posed an open
problem in 2012 [39] for the general case: does AdaBoost always cycle? Some answers
have been found very recently, in 2023, by Belanich and Ortiz [3]. By studying the
ergodic dynamics of AdaBoost, the authors proved this conjecture as an intermediate
result, which we will briefly present in the last subsection.
Beyond this initial question, this overview aims to cover all the views one can have on
AdaBoost. We have tried to use the same formalism for all views, in an attempt to unify them.

1.3 Notations
In all that follows, we will consider a classification problem. We will denote by 𝒳 the
input space, and by 𝒴 the output space. 𝒳 is a compact subset of R𝑑 , and 𝒴 is a finite set,
which will be {−1, 1} if the problem is binary, and {1, . . . , 𝐾} if the problem is multiclass.
The sequence of weights produced by AdaBoost will be denoted 𝑊0 , . . . , 𝑊𝑇 −1 , and the
sequence of classifiers will be denoted ℎ1 , . . . , ℎ𝑇 , where 𝑇 is the number of iterations of
the algorithm. The final classifier will be denoted 𝐻. We will denote by 1𝐴 the indicator
function of the set 𝐴. For binary problems, we will use the notation 𝜂 or 𝜂ℎ to denote the
vector (𝑦1 ℎ(𝑥1 ), . . . , 𝑦𝑚 ℎ(𝑥𝑚 )) which we call a dichotomy. 𝜂ℎ (𝑖) is 1 if the classifier ℎ is
right, and −1 if it is wrong. Also, we will denote by 𝜆 the standard Lebesgue measure.


2 Views of AdaBoost
2.1 The original view: a PAC learning algorithm
2.1.1 What is a PAC learning algorithm?

Definition 2.1.1 (PAC learning algorithm). In our setting, we say that a hypothesis space
ℋ is PAC learnable if there exists an algorithm 𝐴 such that:

• For any distribution 𝐷 over 𝒳 × 𝒴, and for any 0 < 𝜖, 𝛿 < 1/2, the algorithm 𝐴 outputs a
classifier ℎ ∈ ℋ such that, with probability at least 1 − 𝛿, we have:

P(𝑥,𝑦)∼𝐷(ℎ(𝑥) ≠ 𝑦) ≤ 𝜖    (1)

• The number of examples 𝑚 needed to achieve this bound is polynomial in 1/𝜖, 1/𝛿 and 𝑑.

2.1.2 AdaBoost: the original formulation


The first view, when AdaBoost was introduced, was to see it as a PAC learning algorithm.
The idea was to create a boosting algorithm that iteratively takes into account the errors
made by the previous classifiers and tries to correct them. Freund and Schapire [18]
proposed the following algorithm, which is the original formulation of AdaBoost.
They also verified that the algorithm is a PAC learning algorithm, and gave bounds
on the number of examples needed to achieve a given accuracy. The pseudo-code
implementation of the algorithm is given in Alg. 1.

Algorithm 1: Original formulation of AdaBoost (discrete) for a binary classification problem
Data: (𝑥1, 𝑦1), . . . , (𝑥𝑚, 𝑦𝑚) ∈ 𝒳 × {−1, 1}
Initialize 𝑊0(𝑖) = 1/𝑚 for 𝑖 = 1, . . . , 𝑚
for 𝑡 = 1, . . . , 𝑇 do
    Train a weak learner ℎ𝑡 : 𝒳 → {−1, 1} w.r.t. the distribution 𝑊𝑡−1
    Compute 𝜖𝑡 = ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖) 1_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖}, the weighted error of ℎ𝑡
    Compute 𝛼𝑡 = (1/2) log((1 − 𝜖𝑡)/𝜖𝑡)
    Update 𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖) exp(−𝛼𝑡 𝑦𝑖 ℎ𝑡(𝑥𝑖)) / 𝑍𝑡 for 𝑖 = 1, . . . , 𝑚,
        where 𝑍𝑡 = ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖) exp(−𝛼𝑡 𝑦𝑖 ℎ𝑡(𝑥𝑖))
end
return 𝐻 : 𝑥 ∈ 𝒳 ↦→ sign(∑_{𝑡=1}^{𝑇} 𝛼𝑡 ℎ𝑡(𝑥)) ∈ 𝒴
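To make the pseudo-code concrete, here is a minimal Python sketch of Alg. 1, assuming scikit-learn depth-one decision trees (decision stumps) as the weak learners; the function name and the early-stopping test on 𝜖𝑡 are ours, not part of the original algorithm.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_discrete(X, y, T=50):
    """X: (m, d) array; y: (m,) array with values in {-1, +1}."""
    m = len(y)
    W = np.full(m, 1.0 / m)                          # W_0(i) = 1/m
    learners, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)  # decision stump as weak learner
        stump.fit(X, y, sample_weight=W)             # train w.r.t. the distribution W_{t-1}
        pred = stump.predict(X)
        eps = W[pred != y].sum()                     # weighted error eps_t
        if eps <= 0.0 or eps >= 0.5:                 # no longer a weak learner: stop (our choice)
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        W = W * np.exp(-alpha * y * pred)            # exponential reweighting
        W = W / W.sum()                              # normalization by Z_t
        learners.append(stump)
        alphas.append(alpha)

    def H(X_new):
        votes = sum(a * h.predict(X_new) for a, h in zip(alphas, learners))
        return np.sign(votes)
    return H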

Remark 1. Freund and Schapire designed this algorithm for weak learners,
i.e. classifiers that have an error rate slightly better than random guessing. They
proved that if the weak learners' error rates are 𝜖1, . . . , 𝜖𝑇, then the error rate of the
final classifier is bounded by

𝜖 ≤ 2^𝑇 ∏_{𝑡=1}^{𝑇} √(𝜖𝑡(1 − 𝜖𝑡))    (2)

Remark 2. The first and most natural set of classifiers that comes to mind when we
think about weak learners is the set of decision stumps, i.e. classifiers that are constant
on a half-space and constant on its complement. Most practical applications of
AdaBoost use decision stumps as weak learners. We can also think of decision trees of
arbitrary depth as weak learners, but in practice they are less often used because they
are slower to train.

Remark 3. The first few things we can observe are that, first, the final classifier does
not take into consideration the potential relevance (to be defined) of each classifier
in the final vote (the weights of the classifiers are all equal to 1), and second, that the
sequence of weights is updated considering only the errors made by the previous
classifier, and not the ones before. We can easily see that both the sequence of weights
and the sequence of classifiers are order-1 Markov chains; we can therefore legitimately
ask ourselves whether the algorithm can get stuck in cycles, which is indeed the case in
some examples [38].

However, at that time, Freund and Schapire did not give any insight into how they
chose their updating rule for the weights. They simply said that they wanted to correct the
errors made by the previous classifier, and that they wanted to give more importance to
the examples that were misclassified. In short, they designed a heuristic algorithm and
proved that it is a PAC learning algorithm. Later on, it appeared that AdaBoost does
not only come from a heuristic, but also has theoretical justifications, as we will see
in the next sections.
Freund and Schapire also gave a multiclass version of AdaBoost in 1996 [19], which
is based on the same heuristic. We give the algorithm in Alg. 2.

Algorithm 2: Original formulation of AdaBoost (discrete) for a multiclass classification problem
Data: (𝑥1, 𝑦1), . . . , (𝑥𝑚, 𝑦𝑚) ∈ 𝒳 × {1, . . . , 𝐾}
Initialize 𝑊0(𝑖) = 1/𝑚 for 𝑖 = 1, . . . , 𝑚
for 𝑡 = 1, . . . , 𝑇 do
    Train a weak learner ℎ𝑡 : 𝒳 → {1, . . . , 𝐾} w.r.t. the distribution 𝑊𝑡−1
    Compute 𝜖𝑡 = ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖) 1_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖}, the weighted error of ℎ𝑡
    Compute 𝛼𝑡 = 𝜖𝑡 / (1 − 𝜖𝑡)
    Update 𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖)(𝛼𝑡 1_{ℎ𝑡(𝑥𝑖)=𝑦𝑖} + 1_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖}) / 𝑍𝑡 for 𝑖 = 1, . . . , 𝑚,
        where 𝑍𝑡 = ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖)(𝛼𝑡 1_{ℎ𝑡(𝑥𝑖)=𝑦𝑖} + 1_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖}) is a normalization factor
end
return 𝐻 : 𝑥 ∈ 𝒳 ↦→ arg max_{𝑦∈𝒴} ∑_{𝑡=1}^{𝑇} log(1/𝛼𝑡) 1_{ℎ𝑡(𝑥)=𝑦}


2.1.3 Real AdaBoost


The original formulation of AdaBoost is a discrete version of the algorithm. However,
it is possible to define a real version of AdaBoost, as Freund and Schapire did [19]. The
main difference is that the weak learners are no longer restricted to outputs in {−1, 1},
but are functions ℎ : 𝒳 × 𝒴 → [0, 1].

Definition 2.1.2 (Pseudo-loss and update rule for real AdaBoost). To each couple
(𝑥, 𝑦), we associate a plausibility ℎ(𝑥, 𝑦) that 𝑥 belongs to the class 𝑦 (a plausibility rather
than a probability, because it is not one). We then define the pseudo-loss 𝜖𝑡 of ℎ𝑡 at iteration 𝑡 as:

𝜖𝑡 = (1/2) ∑_{(𝑖,𝑦): 𝑦≠𝑦𝑖} 𝑊𝑡−1(𝑖, 𝑦)(1 − ℎ𝑡(𝑥𝑖, 𝑦𝑖) + ℎ𝑡(𝑥𝑖, 𝑦))    (3)

where 𝑊𝑡−1(𝑖, 𝑦) is the weight of the couple (𝑥𝑖, 𝑦) at iteration 𝑡 − 1. Finally, we can define
the weight update rule, following the same principle, as:

𝑊0(𝑖, 𝑦) = 1/(𝑚𝐾), ∀𝑖 ∈ {1, . . . , 𝑚}, ∀𝑦 ∈ 𝒴
𝑊𝑡(𝑖, 𝑦) = 𝑊𝑡−1(𝑖, 𝑦) 𝛼𝑡^{(1/2)(1 + ℎ𝑡(𝑥𝑖,𝑦𝑖) − ℎ𝑡(𝑥𝑖,𝑦))} / 𝑍𝑡    (4)

where 𝛼𝑡 = 𝜖𝑡/(1 − 𝜖𝑡) and 𝑍𝑡 is a normalization factor.

The algorithm now writes as in Alg. 3.

Algorithm 3: Original formulation of AdaBoost (real) for a multiclass classification problem
Data: (𝑥1, 𝑦1), . . . , (𝑥𝑚, 𝑦𝑚) ∈ 𝒳 × 𝒴
Initialize 𝑊0(𝑖, 𝑦) = 1/(𝑚𝐾) for 𝑖 = 1, . . . , 𝑚, 𝑦 ∈ 𝒴
for 𝑡 = 1, . . . , 𝑇 do
    Train a weak learner ℎ𝑡 : 𝒳 × 𝒴 → [0, 1] w.r.t. the distribution 𝑊𝑡−1
    Compute 𝜖𝑡 = (1/2) ∑_{(𝑖,𝑦): 𝑦≠𝑦𝑖} 𝑊𝑡−1(𝑖, 𝑦)(1 − ℎ𝑡(𝑥𝑖, 𝑦𝑖) + ℎ𝑡(𝑥𝑖, 𝑦))
    Compute 𝛼𝑡 = 𝜖𝑡 / (1 − 𝜖𝑡)
    Update 𝑊𝑡(𝑖, 𝑦) = 𝑊𝑡−1(𝑖, 𝑦) 𝛼𝑡^{(1/2)(1 + ℎ𝑡(𝑥𝑖,𝑦𝑖) − ℎ𝑡(𝑥𝑖,𝑦))} / 𝑍𝑡
end
return 𝐻 : 𝑥 ∈ 𝒳 ↦→ arg max_{𝑦∈𝒴} ∑_{𝑡=1}^{𝑇} log(1/𝛼𝑡) ℎ𝑡(𝑥, 𝑦)
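For illustration, here is a small sketch of how the pseudo-loss of Eq. (3) could be computed, assuming the plausibilities ℎ𝑡(𝑥𝑖, 𝑦) and the weights 𝑊𝑡−1(𝑖, 𝑦) are stored as (𝑚, 𝐾) arrays; the array layout and function name are assumptions of ours.

import numpy as np

def pseudo_loss(h, W, y_true):
    """h[i, k] = h_t(x_i, k) in [0, 1]; W[i, k] = W_{t-1}(i, k); y_true[i] in {0, ..., K-1}."""
    m, K = h.shape
    h_correct = h[np.arange(m), y_true]          # plausibilities h_t(x_i, y_i) of the true labels
    loss = 0.0
    for k in range(K):
        mask = y_true != k                       # only pairs (i, y) with y != y_i contribute
        loss += np.sum(W[mask, k] * (1.0 - h_correct[mask] + h[mask, k]))
    return 0.5 * loss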

2.2 AdaBoost as successive optimization problems


2.2.1 AdaBoost as a gradient descent
In 1999, Mason et al. [32] proposed a new view of boosting. They saw any boosting
algorithm, including AdaBoost, as a gradient descent on the set of linear combinations
of classifiers in ℋ.


Definition 2.2.1 (Optimization problem for AdaBoost as a gradient descent). Let ⟨·, ·⟩
be an inner product on Span(ℋ). Define a cost function 𝐶 : Span(ℋ) → R. We define the
optimization problem associated to the cost function 𝐶 as:

min_{𝐻∈Span(ℋ)} 𝐶(𝐻)    (5)

Definition 2.2.2 (Margin and margin cost-functionals). We define the margin of a
classifier ℎ ∈ ℋ as

𝛾(ℎ) = ⟨ℎ(𝑥), 𝑦⟩ = (1/𝑚) ∑_{𝑖=1}^{𝑚} ℎ(𝑥𝑖)𝑦𝑖    (6)

and the margin cost-functional as

𝐶(ℎ) = (1/𝑚) ∑_{𝑖=1}^{𝑚} 𝑐(ℎ(𝑥𝑖)𝑦𝑖)    (7)

where 𝑐 is a differentiable function of the margin 𝛾.

We can now prove Thm. 2.2.3.

Theorem 2.2.3 (AdaBoost as a gradient descent). AdaBoost is a gradient descent
algorithm on the optimization problem defined in Def. 2.2.1 for the margin cost-functional
𝐶 with 𝑐(𝛾) = exp(−𝛾).

Proof. At each iteration 𝑡, we will denote by 𝐻𝑡 the current resulting classifier,
which writes

𝐻𝑡 = ∑_{𝑖=1}^{𝑡} 𝛼𝑖 ℎ𝑖    (8)

Now, consider the inner product (recall that 𝑚 is the number of training examples)

⟨ℎ, 𝑔⟩ = (1/𝑚) ∑_{𝑖=1}^{𝑚} ℎ(𝑥𝑖)𝑔(𝑥𝑖).    (9)

For AdaBoost, fix 𝑐(𝛾) = exp(−𝛾); thus the cost function we want to optimize
over the set of classifiers ℋ is

𝐶(ℎ) = (1/𝑚) ∑_{𝑖=1}^{𝑚} exp(−𝑦𝑖 ℎ(𝑥𝑖))    (10)

To find the next classifier in the sequence, we have to find the classifier ℎ𝑡
that minimizes the weighted error 𝜖𝑡. With a reasonable cost function 𝑐 of the
margin (i.e. monotonically decreasing), this is equivalent to finding the classifier ℎ𝑡 that
maximizes −⟨∇𝐶(𝐻𝑡−1), ℎ𝑡⟩. Indeed, we have on the one hand

∇𝐶(ℎ) = −(1/𝑚) ∑_{𝑖=1}^{𝑚} 𝑦𝑖 𝑒^{−𝑦𝑖ℎ(𝑥𝑖)} 1_{𝑥=𝑥𝑖}    (11)

which leads to

−⟨∇𝐶(𝐻𝑡−1), ℎ𝑡⟩ = (1/𝑚²) ∑_{𝑖=1}^{𝑚} 𝑦𝑖 ℎ𝑡(𝑥𝑖) 𝑒^{−𝑦𝑖𝐻𝑡−1(𝑥𝑖)} ∝ (1/𝑚²) ∑_{𝑖=1}^{𝑚} 𝑦𝑖 ℎ𝑡(𝑥𝑖) 𝑊𝑡−1(𝑥𝑖)    (12)

up to a normalization constant. That means that maximizing −⟨∇𝐶(𝐻𝑡−1), ℎ𝑡⟩
(i.e. taking a gradient descent step) is equivalent to minimizing ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑥𝑖) 1_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖},
which is the weighted error of ℎ𝑡. Thus, the update rule for this gradient descent
is the same as the one of AdaBoost.

This new view of AdaBoost is interesting because it gives both a theoretical and an
intuitive justification of the algorithm. For some cost function 𝑐 that weights how
important it is to classify each observation correctly, we look for the direction in which
the cost function decreases the most, and we take a step in that direction. We can now
formulate a new (equivalent) version of discrete AdaBoost, written in Alg. 4.

Algorithm 4: AdaBoost as a gradient descent
Data: (𝑥1, 𝑦1), . . . , (𝑥𝑚, 𝑦𝑚) ∈ 𝒳 × {−1, 1}
Initialize 𝑊0(𝑖) = 1/𝑚 for 𝑖 = 1, . . . , 𝑚
Initialize 𝐻0(𝑥) = 0
for 𝑡 = 1, . . . , 𝑇 do
    Compute ℎ𝑡 = arg max_{ℎ∈ℋ} −⟨∇𝐶(𝐻𝑡−1), ℎ⟩
    Let 𝛼𝑡 = (1/2) log((1 − 𝜖𝑡)/𝜖𝑡)
    Let 𝐻𝑡 = (𝐻𝑡−1 + 𝛼𝑡 ℎ𝑡) / ∑_{𝑠=1}^{𝑡} |𝛼𝑠|
    Update 𝑊𝑡(𝑖) = 𝑐′(𝑦𝑖 𝐻𝑡(𝑥𝑖)) / ∑_{𝑗=1}^{𝑚} 𝑐′(𝑦𝑗 𝐻𝑡(𝑥𝑗)) for 𝑖 = 1, . . . , 𝑚
end
return 𝐻𝑇
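As a quick illustration of the selection step of Alg. 4, the sketch below (with a hypothetical pool of candidate predictions of our own making) checks that, for 𝑐(𝛾) = exp(−𝛾), maximizing −⟨∇𝐶(𝐻𝑡−1), ℎ⟩ picks the same candidate as minimizing the weighted error under 𝑊𝑡−1(𝑖) ∝ exp(−𝑦𝑖𝐻𝑡−1(𝑥𝑖)).

import numpy as np

def select_weak_learner(preds, y, H_prev):
    """preds: (n_candidates, m) array of candidate predictions in {-1, +1};
    y: (m,) labels in {-1, +1}; H_prev: (m,) current scores H_{t-1}(x_i)."""
    W = np.exp(-y * H_prev)
    W = W / W.sum()                                          # W_{t-1}(i) ∝ exp(-y_i H_{t-1}(x_i))
    edge = preds @ (y * W)                                   # -<∇C(H_{t-1}), h>, up to a positive constant
    werr = ((preds != y[None, :]) * W[None, :]).sum(axis=1)  # weighted errors of the candidates
    assert np.allclose(edge, 1.0 - 2.0 * werr)               # the two criteria are affinely related
    return int(np.argmax(edge))                              # same candidate as argmin of werr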

2.2.2 AdaBoost as an additive model


2.2.2.1 Additive models As seen in the previous subsection, at each iteration the
classifier which is produced by AdaBoost can be written:
𝐻𝑡 = ∑_{𝑖=1}^{𝑡} 𝛼𝑖 ℎ𝑖    (13)

This is an example of what we call an additive model. A subset of the additive models
are the additive regression models.

Definition 2.2.4 (Additive regression models). These models have the form:

𝐻(𝑥) = ∑_{𝑗=1}^{𝑑} ℎ𝑗(𝑥𝑗)    (14)

where 𝑥𝑗 denotes the 𝑗-th coordinate of the input 𝑥 and each ℎ𝑗 is a function of that single
coordinate.

Remark 4. A way to find good functions ℎ𝑗 is to use the backfitting algorithm [22].
A backfitting update writes:

ℎ𝑗(𝑥𝑗) = E(𝑦 − ∑_{𝑘≠𝑗} ℎ𝑘(𝑥𝑘) | 𝑥𝑗)    (15)

In our case, for a more general additive model such as the one we have in AdaBoost, we can
use the following update:

(𝛼𝑡, ℎ𝑡) = arg min_{𝛼∈R, ℎ∈ℋ} E(𝐶(𝑦 − 𝐻𝑡−1(𝑥) − 𝛼ℎ(𝑥)))    (16)

for a given cost function 𝐶. This update is called the greedy forward stagewise approach.

Remark 5. For classification problems, we can use Bayes' theorem, because all we
need to know in order to produce a good classifier is the conditional probabilities
P(𝑦 = 𝑘|𝑥) for all classes 𝑘. For instance, for a binary classification problem, a subset
of additive models are additive logistic binary classification models, which write:

𝐻(𝑥) = ∑_{𝑡=1}^{𝑇} ℎ𝑡(𝑥) = log(P(𝑦 = 1|𝑥) / P(𝑦 = −1|𝑥))    (17)

This directly implies that

P(𝑦 = 1|𝑥) = 𝑒^{𝐻(𝑥)} / (1 + 𝑒^{𝐻(𝑥)})    (18)

2.2.2.2 AdaBoost as an additive model In [20], Friedman, Hastie and Tibshirani show
two main results concerning AdaBoost. The first one is the following:

Proposition 2.2.5 (AdaBoost as an additive model). Discrete AdaBoost can be seen as
an additive logistic regression model fitted via Newton updates for the loss 𝐶(𝛾) = 𝑒^{−𝛾}.

Proof. Suppose we are at iteration 𝑡 of AdaBoost, and that we have already
computed the classifier 𝐻𝑡−1. We want to find the classifier ℎ𝑡 such that:

(𝛼𝑡, ℎ𝑡) = arg min_{𝛼,ℎ} E(𝑒^{−𝑦(𝛼ℎ(𝑥)+𝐻𝑡−1(𝑥))})
        = arg min_{𝛼,ℎ} E(𝑒^{−𝛼𝜂ℎ} 𝑒^{−𝑦𝐻𝑡−1(𝑥)})
        ≈ arg min_{𝛼,ℎ} E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} (1 − 𝛼𝜂ℎ + 𝛼²/2))    (19)

since 𝑦²ℎ²(𝑥) = 1 and where we did a Taylor expansion at the last line (recall from
the notations subsection that 𝜂ℎ = 𝑦ℎ(𝑥)). Now, to minimize both in 𝛼𝑡 and ℎ𝑡,
we first fix 𝛼𝑡 = 𝛼 and minimize over ℎ𝑡, then take this minimizer ℎ𝑡,𝛼 to
minimize over 𝛼𝑡.
First, we have

ℎ𝑡,𝛼(𝑥) = arg min_{ℎ} E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} (1 − 𝛼𝜂ℎ + 𝛼²/2))    (20)

which has a solution independent of 𝛼:

ℎ𝑡(𝑥) = 1 if E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 𝑦 | 𝑥) > 0, and −1 otherwise.    (21)

Then, to minimize over 𝛼, we can show that:

𝛼𝑡 = arg min_{𝛼} E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 𝑒^{−𝛼𝜂ℎ𝑡}) = (1/2) log((1 − 𝜖𝑡)/𝜖𝑡)    (22)

where 𝜖𝑡 = E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 1_{ℎ𝑡(𝑥)≠𝑦}) is the weighted error of ℎ𝑡. Finally, this update
rule writes:

𝐻𝑡(𝑥) = 𝐻𝑡−1(𝑥) + (1/2) log((1 − 𝜖𝑡)/𝜖𝑡) ℎ𝑡(𝑥)
𝑊𝑡(𝑥) = 𝑊𝑡−1(𝑥) 𝑒^{−𝛼𝑡𝜂ℎ𝑡}    (23)

We can notice that this update rule is the same as the one in the original
formulation of AdaBoost.

The second result of [20] is the following.

Proposition 2.2.6 (Real AdaBoost as an additive model). Real AdaBoost can be seen as
an additive logistic regression by stagewise and approximate optimization of 𝐶(𝛾) = 𝑒−𝛾 .

Proof. Again, we start at iteration 𝑡 of (real) AdaBoost, and we have already
computed the classifier 𝐻𝑡−1. We look for the classifier ℎ𝑡 such that:

ℎ𝑡 = arg min_{ℎ} E(𝑒^{−𝑦(ℎ(𝑥)+𝐻𝑡−1(𝑥))} | 𝑥)
   = arg min_{ℎ} 𝑒^{−ℎ(𝑥)} E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 1_{𝑦=1} | 𝑥) + 𝑒^{ℎ(𝑥)} E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 1_{𝑦=−1} | 𝑥)    (24)

Setting the derivative w.r.t. ℎ𝑡 to 0, we get:

ℎ𝑡(𝑥) = (1/2) log( E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 1_{𝑦=1} | 𝑥) / E(𝑒^{−𝑦𝐻𝑡−1(𝑥)} 1_{𝑦=−1} | 𝑥) )    (25)

Thus, the weight update rule writes:

𝑊𝑡(𝑥) = 𝑊𝑡−1(𝑥) 𝑒^{−𝜂ℎ𝑡}    (26)

which is the same as the one in the original formulation of real AdaBoost.


Remark 6. The authors of [20] also propose a new algorithm, based on what we
have just shown, that uses as its loss function the negative binomial log-likelihood
𝐶(𝛾) = log(1 + 𝑒^{−2𝛾}). The algorithm is called LogitBoost. It has a similar form to
AdaBoost because, with a Taylor expansion of the loss function at 𝐹(𝑥) = 0, we have,
with 𝛾 = 𝑦𝐹(𝑥):

log(1 + 𝑒^{−2𝑦𝐹(𝑥)}) ≈ 𝑒^{−𝑦𝐹(𝑥)} + log 2 − 1    (27)
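A quick numerical check of this Taylor relation (purely illustrative, over an arbitrary range of margins of our choosing):

import numpy as np

gammas = np.linspace(-0.5, 0.5, 101)
logit_loss = np.log(1.0 + np.exp(-2.0 * gammas))    # negative binomial log-likelihood loss
shifted_exp = np.exp(-gammas) + np.log(2.0) - 1.0   # exponential loss, shifted by log 2 - 1
print(np.max(np.abs(logit_loss - shifted_exp)))     # small: the two losses agree up to O(gamma^3) near 0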

2.2.3 AdaBoost as an entropy projection


This view was proposed in 1999 by Kivinen and Warmuth [28]. It relies on the idea
that the corrective updates of AdaBoost, and of boosting more generally, can be seen
as the solution to a constrained relative entropy minimization problem.

2.2.3.1 Relative entropy minimization theorem

Definition 2.2.7 (Relative entropy). Define the relative entropy between two distributions
𝑝 and 𝑞 as the Kullback-Leibler divergence:
𝑚
∑︁ 𝑝𝑖
𝐾𝐿(𝑝||𝑞) = 𝑝𝑖 log (28)
𝑖=1
𝑞𝑖

At every iteration 𝑡 of AdaBoost, we want to find a weight distribution 𝑊𝑡 that is not
correlated with the mistakes of the previous classifier ℎ𝑡, which was trained on 𝑊𝑡−1. We
can enforce this through the constraint 𝑊𝑡^T 𝜂ℎ𝑡 = 0. Moreover, the updated distribution 𝑊𝑡
should not differ too much from the previous distribution, both in order to retain the
knowledge we already have about the data and in order not to overreact to possible
noise. To do so, we minimize the relative entropy between 𝑊𝑡−1 and 𝑊𝑡 under the
constraint 𝑊𝑡^T 𝜂ℎ𝑡 = 0.

Definition 2.2.8 (Relative entropy minimization problem). The optimization problem
writes:

min_{𝑊∈Δ𝑚} ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖))
s.t. 𝑊^T 𝜂ℎ𝑡 = 0    (29)

Kivinen and Warmuth [28] showed that minimizing the relative entropy is the same
problem as maximizing −log 𝑍𝑡(𝛼), where

𝑍𝑡(𝛼) = ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖) 𝑒^{−𝛼𝑦𝑖ℎ𝑡(𝑥𝑖)}    (30)

In short, we have the following result:


Theorem 2.2.9 (Relative entropy minimization theorem).

min_{𝑊∈Δ𝑚: 𝑊^T𝜂ℎ𝑡=0} ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖)) = max_{𝛼∈R} −log 𝑍𝑡(𝛼)    (31)

Furthermore, if 𝑊𝑡, 𝛼𝑡 are the two solutions of these problems, we have that

𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖) 𝑒^{−𝛼𝑡𝑦𝑖ℎ𝑡(𝑥𝑖)} / 𝑍𝑡(𝛼𝑡), ∀𝑖 ∈ {1, . . . , 𝑚}    (32)

Proof. Let us prove these two results. Let ℒ(𝑊, 𝛼) be the Lagrangian of the
optimization problem:

ℒ(𝑊, 𝛼) = ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖)) + 𝛼 𝑊^T𝜂ℎ𝑡    (33)

First we show that the following minimax equality holds:

min_{𝑊∈Δ𝑚} max_{𝛼∈R} ℒ(𝑊, 𝛼) = max_{𝛼∈R} min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼)    (34)

We always have (see [15], or convince yourself with the quote "the tallest of the
dwarfs is shorter than the shortest of the giants"):

max_{𝛼∈R} min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼) ≤ min_{𝑊∈Δ𝑚} max_{𝛼∈R} ℒ(𝑊, 𝛼)    (35)

Now, on the one side we have:

max_{𝛼∈R} min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼) = min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼*)    (36)

where 𝛼* = arg max_{𝛼∈R} (min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼)). On the other side, for a fixed 𝑊,
max_{𝛼∈R} ℒ(𝑊, 𝛼) = 𝐾𝐿(𝑊||𝑊𝑡−1) if 𝑊^T𝜂ℎ𝑡 = 0, and max_{𝛼∈R} ℒ(𝑊, 𝛼) = ∞ otherwise.
Therefore,

min_{𝑊∈Δ𝑚} max_{𝛼∈R} ℒ(𝑊, 𝛼) = min_{𝑊∈Δ𝑚: 𝑊^T𝜂ℎ𝑡=0} 𝐾𝐿(𝑊||𝑊𝑡−1)    (37)

Moreover, ℒ is convex in 𝑊 over the compact convex set Δ𝑚 and affine (hence concave)
in 𝛼, so the minimax theorem [15] gives the reverse inequality:

min_{𝑊∈Δ𝑚} max_{𝛼∈R} ℒ(𝑊, 𝛼) ≤ max_{𝛼∈R} min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼)    (38)

Combining Eq. 35 and Eq. 38 yields

max_{𝛼∈R} min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼) = min_{𝑊∈Δ𝑚} max_{𝛼∈R} ℒ(𝑊, 𝛼)    (39)

and we have equality between these two terms.


Now we show Eq. 31. We need to show that min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼) = −log 𝑍𝑡(𝛼). As
𝐾𝐿(·||𝑊𝑡−1) is a convex differentiable function, we can obtain the minimum by
setting the gradient of the Lagrangian to 0:

∇𝑊 ℒ(𝑊, 𝛼) = ∇𝑊 𝐾𝐿(𝑊||𝑊𝑡−1) + 𝛼∇𝑊 𝑊^T𝜂ℎ𝑡 = 0
⇐⇒ ∇𝑊 𝐾𝐿(𝑊||𝑊𝑡−1) = −𝛼∇𝑊 𝑊^T𝜂ℎ𝑡
⇐⇒ ∇𝑊 ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖)) = −𝛼∇𝑊 𝑊^T𝜂ℎ𝑡    (40)

Componentwise, this writes:

∀𝑖 ∈ {1, . . . , 𝑚}, log 𝑊(𝑖) − log 𝑊𝑡−1(𝑖) + 1 = −𝛼𝑦𝑖ℎ𝑡(𝑥𝑖)
⇐⇒ ∀𝑖 ∈ {1, . . . , 𝑚}, 𝑊(𝑖) ∝ 𝑊𝑡−1(𝑖) 𝑒^{−𝛼𝑦𝑖ℎ𝑡(𝑥𝑖)}    (41)

We then need to normalize 𝑊 to obtain the distribution 𝑊𝑡(𝛼)(𝑖) = 𝑊𝑡−1(𝑖)𝑒^{−𝛼𝑦𝑖ℎ𝑡(𝑥𝑖)}/𝑍𝑡(𝛼)
(note that this distribution depends on 𝛼). We then have:

min_{𝑊∈Δ𝑚} ℒ(𝑊, 𝛼) = ℒ(𝑊𝑡(𝛼), 𝛼)
    = ∑_{𝑖=1}^{𝑚} 𝑊𝑡(𝛼)(𝑖) log(𝑊𝑡(𝛼)(𝑖)/𝑊𝑡−1(𝑖)) + 𝛼 ∑_{𝑖=1}^{𝑚} 𝑊𝑡(𝛼)(𝑖) 𝑦𝑖ℎ𝑡(𝑥𝑖)
    = ∑_{𝑖=1}^{𝑚} 𝑊𝑡(𝛼)(𝑖) (−𝛼𝑦𝑖ℎ𝑡(𝑥𝑖) − log 𝑍𝑡(𝛼)) + 𝛼 ∑_{𝑖=1}^{𝑚} 𝑊𝑡(𝛼)(𝑖) 𝑦𝑖ℎ𝑡(𝑥𝑖)
    = −log 𝑍𝑡(𝛼)    (42)

since the two terms in 𝛼 cancel. Taking the maximum over 𝛼 on both sides and using
Eq. 37 and Eq. 39 gives Eq. 31. Finally, at the optimum 𝛼𝑡, setting the derivative of
log 𝑍𝑡(𝛼) to zero shows that 𝑊𝑡 = 𝑊𝑡(𝛼𝑡) satisfies the constraint 𝑊𝑡^T𝜂ℎ𝑡 = 0, which
gives Eq. 32 and completes the proof.
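A small numerical sketch of Thm. 2.2.9: maximizing −log 𝑍𝑡(𝛼) over 𝛼 (here with scipy's scalar minimizer, on toy weights and a toy dichotomy of our own choosing) and plugging the maximizer into the exponential update indeed yields a distribution satisfying the constraint 𝑊𝑡^T𝜂ℎ𝑡 = 0.

import numpy as np
from scipy.optimize import minimize_scalar

m = 10
W_prev = np.full(m, 1.0 / m)                                    # W_{t-1}
eta = np.array([1, 1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)  # a toy dichotomy eta_{h_t}

log_Z = lambda a: np.log(np.sum(W_prev * np.exp(-a * eta)))     # log Z_t(alpha)
alpha = minimize_scalar(log_Z).x                                # maximizing -log Z_t == minimizing log Z_t
W_new = W_prev * np.exp(-alpha * eta)
W_new = W_new / W_new.sum()                                     # exponential update of Eq. 32
print(abs(W_new @ eta))                                         # ~ 0: the constraint W_t^T eta = 0 holds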

Remark 7. We can also show that the same result holds (up to the admissible
space) if we update our weights according to the totally corrective update, which does
not only correct the mistakes of the previous classifier, but the mistakes of all previous
classifiers. This update is defined by the same optimization problem but with
𝑡 constraints instead of one:

min_{𝑊∈Δ𝑚} ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖))
s.t. 𝑊^T 𝜂ℎ𝑠 = 0, ∀𝑠 ∈ {1, . . . , 𝑡}    (43)

2.2.3.2 The geometric insights Thinking of the relative entropy as a distance (with some
care, since it is not truly one), we can see this minimization problem over a linear
admissible space as a projection onto the hyperplane H𝑡 := {𝑊 s.t. 𝑊^T𝜂ℎ𝑡 = 0} (or, for
the totally corrective update, onto the intersection of 𝑡 hyperplanes).
In [28], the authors show that as soon as a weight distribution 𝑊𝑡 can be written in the
exponential form, i.e.

𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖) 𝑒^{−𝛼𝑡𝑦𝑖ℎ𝑡(𝑥𝑖)} / 𝑍𝑡(𝛼𝑡), ∀𝑖 ∈ {1, . . . , 𝑚}    (44)

we have, for any 𝛼 ∈ R:

arg min_{𝑊∈Δ𝑚: 𝑊^T𝜂ℎ𝑡=0} 𝐾𝐿(𝑊||𝑊𝑡−1) = arg min_{𝑊∈Δ𝑚: 𝑊^T𝜂ℎ𝑡=0} 𝐾𝐿(𝑊||𝑊𝑡(𝛼)) = 𝑊𝑡    (45)

Geometrically, this means that for any weight distribution 𝑊 on the curve 𝛼 ↦→ 𝑊𝑡(𝛼),
the projection of 𝑊 onto the hyperplane H𝑡 is the only point where the curve
intersects the hyperplane.
The authors finish by giving an adapted formulation of Pythagoras' theorem in this setup:
letting 𝑊𝑡 be the projection of 𝑊𝑡−1 onto H𝑡, we have:

𝐾𝐿(𝑊𝑡−1||𝑊𝑡) = 𝐾𝐿(𝑊𝑡−1||𝑊𝑡(𝛼𝑡)) + 𝐾𝐿(𝑊𝑡(𝛼𝑡)||𝑊𝑡)    (46)

for any 𝛼𝑡 ∈ R. We now truly see the meaning of 𝑊𝑡(𝛼): it is the update the
algorithm would like to make, but since it does not belong to the admissible space,
𝑊𝑡(𝛼) is projected onto the admissible space and becomes 𝑊𝑡.

2.2.4 AdaBoost as a mirror descent (successive min-max optimization problem)


Another, more recent, view of AdaBoost is to see it as a mirror descent algorithm. Mirror
descent is a generalization of subgradient descent, which itself is a generalization of
gradient descent.

2.2.4.1 Optimization background Let 𝑓 : 𝒳 → R be a convex function, and let 𝒳 be


a convex subset of R𝑑. Suppose we want to minimize 𝑓 over 𝒳. The problem with
gradient descent is that it requires 𝑓 to be differentiable, which is not always the case.
We therefore define the subgradient of 𝑓 at 𝑥 ∈ 𝒳 as a generalization of the gradient
of 𝑓 at 𝑥.

Definition 2.2.10 (Subgradient). We say 𝑔 is a subgradient of 𝑓 at 𝑥 if for all 𝑦 ∈ 𝒳 ,

𝑓 (𝑦) ≥ 𝑓 (𝑥) + ⟨𝑔, 𝑦 − 𝑥⟩ (47)

We denote by 𝜕𝑓 (𝑥) the set of subgradients of 𝑓 at 𝑥. When 𝑓 is differentiable, 𝜕𝑓 (𝑥) is


a singleton, and 𝜕𝑓 (𝑥) = {∇𝑓 (𝑥)}. When 𝑓 is not differentiable, 𝜕𝑓 (𝑥) is a convex set
representing the set of affine functions that are below 𝑓 at 𝑥.
That being said, we can now define the subgradient descent algorithm, the analog of
gradient descent in the non-differentiable case. Let 𝑓 be a convex function. Instead of
iteratively following the direction of the negative gradient of 𝑓, we iteratively follow the
direction of the negative of a subgradient of 𝑓. If needed, we project the resulting
point onto 𝒳 to ensure that we stay in 𝒳. The algorithm is described in Alg. 5.

Algorithm 5: Subgradient descent
Data: 𝑥0 ∈ 𝒳
Result: 𝑥𝑡 that minimizes 𝑓 over 𝒳
𝑡 ← 0
while Not Stopping Criterion and 0 ∉ 𝜕𝑓(𝑥𝑡) do
    Let 𝑔𝑡 ∈ 𝜕𝑓(𝑥𝑡)
    Let 𝑥𝑡+1 = Π𝒳(𝑥𝑡 − 𝜂𝑡𝑔𝑡), with 𝜂𝑡 > 0 and Π𝒳 the projection onto 𝒳
    𝑡 ← 𝑡 + 1
end
return 𝑥𝑡
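A toy sketch of Alg. 5 on a deliberately non-differentiable one-dimensional problem; the function, the domain and the step sizes are arbitrary choices of ours.

import numpy as np

def subgradient_descent(x0=0.0, T=500):
    """Minimize f(x) = |x - 3| over X = [0, 10] with step sizes eta_t = 1/(t+1)."""
    x = x0
    for t in range(T):
        g = np.sign(x - 3.0)          # a subgradient of f at x (any value in [-1, 1] is valid at x = 3)
        x = x - g / (t + 1.0)         # subgradient step
        x = min(max(x, 0.0), 10.0)    # projection onto [0, 10]
    return x

print(subgradient_descent())          # ~ 3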


Now, the mirror descent algorithm is a generalization of the subgradient descent


algorithm that can be applied to min-max optimization problems. Let 𝑓 : 𝒳 → R a
convex function we want to minimize. Here, we need 𝒳 to be a convex subset of R𝑑 .
Without losing too much generality, we can suppose that 𝑓 is 𝐿𝑓 Lipschitz continuous
(typically, this holds as soon as 𝒳 is compact). Our interest lies in the cases where 𝑓
writes:
𝑓 (𝑥) = max 𝜑(𝑥, 𝜆) (48)
𝜆∈𝒟

where 𝒟 is a convex compact subset of R𝑑 , and 𝜑 : 𝒳 ×𝒟 → R is convex in 𝑥 and concave


in 𝜆. The dual function of 𝑓 is defined as

𝑓 * (𝜆) = min 𝜑(𝑥, 𝜆) (49)


𝑥∈𝒳

and we may solve the dual problem

min 𝑓 * (𝜆) (50)


𝜆∈𝒟

to find our solution, thanks to Von Neumann’s minimax theorem [15]. Indeed, under
our hypothesis, we have no duality gap, i.e.

∃(𝑥* , 𝜆* ) ∈ 𝒳 × 𝒟, ∀(𝑥, 𝜆) ∈ 𝒳 × 𝒟, 𝑓 * (𝜆) ≤ 𝑓 * (𝜆* ) = 𝑓 (𝑥* ) ≤ 𝑓 (𝑥) (51)

The way in which mirror descent generalizes subgradient descent is the following:
consider a differentiable 1-strongly convex function 𝑑 : 𝒳 → R.¹ The function 𝑑 is used
to define the Bregman distance between two points 𝑥, 𝑦 ∈ 𝒳 as

𝐷(𝑥, 𝑦) = 𝑑(𝑥) − 𝑑(𝑦) − ⟨∇𝑑(𝑦), 𝑥 − 𝑦⟩    (52)

The idea of mirror descent is to successively iterate on the primal and dual variables,
the primal update being a gradient descent step on a slightly modified function
(involving the Bregman distance), and the dual update being a gradient ascent step on
the dual function. The algorithm writes as in Alg. 6.

Algorithm 6: Mirror descent
Data: 𝑥0 ∈ 𝒳
Let 𝜆0 = 0
for 𝑡 = 1, . . . , 𝑇 do
    Compute 𝜆̃𝑡 ∈ arg max_{𝜆∈𝒟} 𝜑(𝑥𝑡−1, 𝜆)
    Compute 𝑔𝑡 = ∇𝑥 𝜑(𝑥𝑡−1, 𝜆̃𝑡)
    Choose 𝛼𝑡 > 0
    Compute 𝑥𝑡 ∈ arg min_{𝑥∈𝒳} {𝛼𝑡⟨𝑔𝑡, 𝑥⟩ + 𝐷(𝑥, 𝑥𝑡−1)}
    𝜆𝑡 = ∑_{𝑠=1}^{𝑡−1} 𝛼𝑠 𝜆̃𝑠 / ∑_{𝑠=1}^{𝑡−1} 𝛼𝑠
end
return 𝑥𝑇

¹ Recall that a function 𝑑 is 1-strongly convex if for all 𝑥, 𝑦 ∈ 𝒳, 𝑑(𝑦) ≥ 𝑑(𝑥) + ⟨∇𝑑(𝑥), 𝑦 − 𝑥⟩ + (1/2)‖𝑦 − 𝑥‖².


2.2.4.2 AdaBoost as a mirror descent Here, suppose the possibilities for our
classifiers are contained in a finite set of classifiers (the base of classifiers) ℋ =
{ℎ1 , . . . , ℎ𝑛 }. This is not a big assumption, as we will see later in Sec. 2.3.4 (because
AdaBoost eventually converges to a cycle). We also suppose that given a weight
distribution over the data 𝑊 ∈ Δ𝑚 , we can find the classifier in ℋ that minimizes the
weighted error.

Theorem 2.2.11 (AdaBoost as a mirror descent). AdaBoost is a mirror descent algorithm


with the following parameters:

• 𝒳 = Δ𝑚

• 𝒟 = Δ𝑛

• 𝜑(𝑊, 𝜆) = 𝑊^T 𝐴𝜆, with 𝐴𝑖𝑗 = 𝑦𝑖 ℎ𝑗(𝑥𝑖)

Proof. Denote by 𝐴𝑗 the 𝑗-th column of 𝐴. For any distribution 𝑊, the edge of the
classifier ℎ𝑗 w.r.t. 𝑊 is defined as 𝑊^T𝐴𝑗 = ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) 𝑦𝑖 ℎ𝑗(𝑥𝑖). For a given weight
distribution 𝑊, the maximum edge over classifiers is thus

𝑓(𝑊) = max_{𝑗=1,...,𝑛} 𝑊^T𝐴𝑗 = max_{𝜆∈Δ𝑛} 𝑊^T𝐴𝜆 = max_{𝜆∈Δ𝑛} 𝜑(𝑊, 𝜆)    (53)

To ensure that the data is well classified by the final classifier no matter what the
true underlying data distribution is, we want to minimize this maximum edge
over classifiers. We are now in the format of a min-max optimization problem,
and we can apply the mirror descent algorithm. The dual function is given here by

𝑓*(𝜆) = min_{𝑊∈Δ𝑚} 𝑊^T𝐴𝜆 = min_{𝑊∈Δ𝑚} 𝜑(𝑊, 𝜆)    (54)

and the dual problem writes:

max_{𝜆∈Δ𝑛} 𝑓*(𝜆)    (55)

The main result from [16] is that the sequence of weights and classifiers produced
by AdaBoost is exactly the sequence of primal and dual variables produced by the mirror
descent algorithm, setting 𝑑(𝑊) := ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log 𝑊(𝑖) + log(𝑚).
Indeed, let us follow step by step what the mirror descent produces in this case.
We fix 𝑊0 ∈ Δ𝑚 and 𝜆0 = 0. Successively, in the loop for 𝑡 = 1 to 𝑡 = 𝑇:

𝜆̃𝑡 ∈ arg max_{𝜆∈Δ𝑛} 𝜑(𝑊𝑡−1, 𝜆) = arg max_{𝜆∈Δ𝑛} 𝑊𝑡−1^T𝐴𝜆,
i.e. 𝜆̃𝑡 = 𝑒𝑗𝑡 with 𝑗𝑡 ∈ arg max_{𝑗=1,...,𝑛} 𝑊𝑡−1^T𝐴𝑗    (56)

where 𝑗𝑡 is the index of the classifier that minimizes the weighted error at iteration
𝑡. Such a maximizer exists because the problem is linear on the simplex, so
the optimum is reached at one of the vertices. This step corresponds exactly
to fitting the classifier ℎ𝑡 to the data, i.e. finding the classifier that minimizes
the weighted error, which is precisely the first phase of AdaBoost. Then, since
𝑗𝑡 ∈ arg max_{𝑗} 𝑊𝑡−1^T𝐴𝑗 implies 𝐴𝑗𝑡 ∈ 𝜕𝑓(𝑊𝑡−1), we have that

𝑔𝑡 = ∇𝑊 𝜑(𝑊𝑡−1, 𝜆̃𝑡) = 𝐴𝑗𝑡 ∈ 𝜕𝑓(𝑊𝑡−1)    (57)

The next step in the mirror descent is to choose 𝛼𝑡 > 0 and to compute

𝑊𝑡 ∈ arg min_{𝑊∈Δ𝑚} {𝛼𝑡⟨𝑔𝑡, 𝑊⟩ + 𝐷(𝑊, 𝑊𝑡−1)}
   = arg min_{𝑊∈Δ𝑚} {𝛼𝑡 𝑊^T𝐴𝑗𝑡 + ∑_{𝑖=1}^{𝑚} 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖)) + ∑_{𝑖=1}^{𝑚} (𝑊𝑡−1(𝑖) − 𝑊(𝑖))}    (58)

The first-order optimality condition on each component of 𝑊 implies that:

∀𝑖, d/d𝑊(𝑖) ( 𝛼𝑡 𝑦𝑖 ℎ𝑗𝑡(𝑥𝑖) 𝑊(𝑖) + 𝑊(𝑖) log(𝑊(𝑖)/𝑊𝑡−1(𝑖)) + 𝑊𝑡−1(𝑖) − 𝑊(𝑖) ) = 0
⇐⇒ ∀𝑖, 𝛼𝑡 𝑦𝑖 ℎ𝑗𝑡(𝑥𝑖) + log(𝑊(𝑖)/𝑊𝑡−1(𝑖)) = 0
⇐⇒ ∀𝑖, 𝑊(𝑖) = 𝑊𝑡−1(𝑖) exp(−𝛼𝑡 𝑦𝑖 ℎ𝑗𝑡(𝑥𝑖))    (59)

which is exactly the weight update of the original AdaBoost. The last
normalization step is straightforward.

The fact that AdaBoost iterations are actually exactly the same as a mirror descent for 𝑑
fixed earlier allows the authors to prove new properties and complexity bounds on the
algorithm [16].
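To see the correspondence concretely, here is a minimal sketch of one mirror-descent step with the entropy 𝑑(𝑊) fixed above; the toy numbers are ours, and the closed-form multiplicative update is exactly the AdaBoost reweighting when 𝑔(𝑖) = 𝑦𝑖ℎ𝑡(𝑥𝑖).

import numpy as np

def mirror_step(W_prev, g, alpha):
    """One mirror-descent step with d(W) = sum_i W(i) log W(i):
    argmin_W { alpha <g, W> + D(W, W_prev) } over the simplex has the closed form below."""
    W = W_prev * np.exp(-alpha * g)       # multiplicative (exponential) update
    return W / W.sum()                    # normalization keeps W in the simplex

W_prev = np.full(4, 0.25)
g = np.array([1.0, -1.0, 1.0, 1.0])       # g(i) = y_i h_t(x_i): +1 if correct, -1 if misclassified
print(mirror_step(W_prev, g, alpha=0.5))  # the misclassified point gets the largest weight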

2.3 AdaBoost as traditional machine learning methods


2.3.1 Why does AdaBoost generalize that well?
Recently, there has been a surge of interest in what are called interpolating classifiers. It has
been shown that they can generalize very well [31]. In 2017, Wyner et al. [52] introduced
the term interpolating classifiers, and showed that Random Forest and AdaBoost
classifiers can be seen as interpolating classifiers. This is part of the explanation of
why AdaBoost generalizes so well. Another part of the explanation is the link between
AdaBoost and margin theory, as explained in Sec. 2.3.1.3.

2.3.1.1 Interpolating classifiers

Definition 2.3.1 (Interpolating classifier). A classifier ℎ is said to be an interpolating
classifier if it assigns the correct label to every sample of the training set,
i.e. ∀𝑖 ∈ {1, . . . , 𝑚}, ℎ(𝑥𝑖) = 𝑦𝑖.


Remark 8. The term interpolating can be explained geometrically by thinking of a
set of points that we want to interpolate with a class of smooth functions, say
polynomials. Most people would expect such a classifier to overfit: achieving 100%
accuracy on the training set while having a poor generalization error (e.g. the
one-nearest-neighbor classifier).

However, Random Forests and AdaBoost are examples of interpolating classifiers that
still seem to generalize well.

2.3.1.2 AdaBoost as an interpolating classifier & the double descent Experimentally,
it is shown in [52] that AdaBoost tends to smooth the decision boundary as the number
of iterations grows. In simple words, AdaBoost may need 𝑇 iterations to reach
100% accuracy on the training set, yet after those 𝑇 iterations the generalization error
may still be very poor. However, continuing the iterations will eventually reduce the
generalization error, as the resulting classifier acts as an average of several interpolating
classifiers (i.e. classifiers that reach 100% accuracy on the training set). In some way, this
can be linked to the deep learning double descent phenomenon (see [34]): the first
phase consists in fitting the training data, and the second phase consists in letting the
algorithm run more iterations to reduce the generalization error.

To put it more formally, suppose we need 𝑇 iterations to reach 100% accuracy
on the training set. Then the resulting classifier 𝐻𝑇 = ∑_{𝑡=1}^{𝑇} 𝛼𝑡ℎ𝑡 is an interpolating
classifier. Suppose we run 𝐾𝑇 iterations of AdaBoost. It is reasonable to think that
the following classifiers are interpolating classifiers as well:

𝐻𝑇^𝑘 = ∑_{𝑡=1}^{𝑇} 𝛼𝑘𝑇+𝑡 ℎ𝑘𝑇+𝑡, ∀𝑘 ∈ {1, . . . , 𝐾 − 1}    (60)

We can then see the resulting classifier 𝐻𝐾𝑇 as an average of the interpolating classifiers
𝐻𝑇^𝑘 (with the convention 𝐻𝑇^0 = 𝐻𝑇):

𝐻𝐾𝑇 = ∑_{𝑘=0}^{𝐾−1} 𝐻𝑇^𝑘 = ∑_{𝑡=1}^{𝐾𝑇} 𝛼𝑡 ℎ𝑡    (61)

Experimentally (see [52]), 𝐻𝐾𝑇 seems to generalize much better than any of the 𝐻𝑇^𝑘. This
suggests that AdaBoost can be seen as an average of interpolating classifiers.
In [52], this property is called the self-averaging property of boosting.

Remark 9. This behavior of boosting algorithms was actually identified more
than 20 years ago by Schapire et al. [42]. They observed that the generalization error
keeps decreasing and that AdaBoost does not overfit as would be predicted
by VC-dimension theory. A few theoretical arguments have been proposed to
explain this phenomenon. The authors of [2] show that the generalization error of
the resulting classifier 𝐻𝑇 can be upper bounded. Indeed, let 𝑑 be the VC-dimension
of ℋ, let 𝒟 be the underlying distribution of the data, and let 𝒮 be the training set of
size 𝑚 (i.e. the (discrete) distribution of the data we have observed). Then, we have:

P𝒟(𝑦𝐻𝑇(𝑥) ≤ 0) ≤ P𝒮(𝑦𝐻𝑇(𝑥) ≤ 𝜃) + 𝑂(√(𝑑/(𝑚𝜃²)))    (62)

for any 𝜃 > 0, with high probability. Note that this bound is independent of the
number of iterations 𝑇 of AdaBoost. It can partially explain why boosting algorithms
do not overfit.

However, quantitatively, these bounds are weak and do not explain the double descent
phenomenon². An older view of AdaBoost also partially explains this phenomenon, by
showing that AdaBoost increases the 𝑙1 margin of the model over the iterations
(see Sec. 2.3.1.3).

2.3.1.3 AdaBoost as a regularized path to a maximum margin classifier This idea was
proposed in 2004 by Rosset et al. [37], but it builds on results directly underlined by
Schapire et al. [2]. This formulation combines two points of view: the first one is the
gradient descent formulation explained in Sec. 2.2.1, and the second one concentrates
on the effect of boosting on the margins 𝑦𝑖𝐻𝑡(𝑥𝑖) of the classifier 𝐻𝑡. It is probably one
of the most useful views for understanding why AdaBoost converges and why it
generalizes well.
Recall the two loss functions 𝑐(𝛾) presented in Sec. 2.2.1:

Exponential loss: 𝑐(𝛾) = 𝑒^{−𝛾}
Binomial log-likelihood loss: 𝑐(𝛾) = log(1 + 𝑒^{−𝛾})    (63)

These two loss functions are actually very similar: if 𝜂𝐻𝑇 ≥ 0, they both behave like the
exponential loss [37].
In parallel, consider the additive model 𝐻𝑇(𝑥) = ∑_{𝑡=1}^{𝑇} 𝛼𝑡ℎ𝑡(𝑥). We can link this additive
model to SVM theory by considering the margin of the classifier w.r.t. observation 𝑥𝑖:

𝑦𝑖𝐻𝑇(𝑥𝑖) = ∑_{𝑡=1}^{𝑇} 𝛼𝑡 𝑦𝑖 ℎ𝑡(𝑥𝑖)    (64)

Suppose our classifier achieves 100% accuracy on the training set, i.e. 𝑦𝑖𝐻𝑇(𝑥𝑖) ≥ 0 for all
𝑖 ∈ {1, . . . , 𝑚}. Then we can define the margin of the classifier as the minimum margin
over the observations, normalized by a certain norm (e.g. the 𝑙𝑝 norm):

𝑚𝑝(𝐻𝑇) = min_{𝑖∈{1,...,𝑚}} 𝑦𝑖𝐻𝑇(𝑥𝑖) / ||(𝛼1, . . . , 𝛼𝑇)||𝑝    (65)

Now the interesting part is that there is a link between AdaBoost and the 𝑙1 margin. In
the case of separable training data, AdaBoost produces a non-decreasing sequence of 𝑙1
margins:

∀𝑡, 𝑚1(𝐻𝑡+1) ≥ 𝑚1(𝐻𝑡)    (66)
² Note that this term echoes the phenomenon we all know about in deep learning from Nakkiran
et al. [34]. However, the term is slightly ill-suited here: there is no double descent because there is
no ascent. We only observe a phenomenon where the generalization error continues to decrease after
achieving perfect classification.


Figure 1: Illustration of the increase of the 𝑙1 margin of the model over the iterations of
AdaBoost (𝑇 → ∞). The decision boundaries of successive AdaBoost iterates are shown
in red; the blue line represents the decision boundary of the maximum margin classifier.
Eventually, AdaBoost converges to the maximum margin classifier.

What matters here is the generalization error: if we increase the 𝑙1 margin of the model,
as for SVMs, we are likely to decrease the generalization error.
This should be put in parallel with the double descent phenomenon explained in
Sec. 2.3.1.2. These two views of AdaBoost are indeed very similar: the double descent
phenomenon can be explained here by the regularization effect of AdaBoost, which
increases the 𝑙1 margin of the model over the iterations.
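As an illustration, one could track the 𝑙1 margin of Eq. (65) along the iterations with a sketch like the following, assuming lists alphas and learners produced by a run such as the one sketched after Alg. 1; names and structure are ours.

import numpy as np

def l1_margins(X, y, learners, alphas):
    """Return m_1(H_t) of Eq. (65) for t = 1, ..., T; y has values in {-1, +1}."""
    margins = []
    scores = np.zeros(len(y))                    # y_i H_t(x_i), built incrementally
    norm = 0.0                                   # ||(alpha_1, ..., alpha_t)||_1
    for a, h in zip(alphas, learners):
        scores = scores + a * y * h.predict(X)
        norm = norm + abs(a)
        margins.append(scores.min() / norm)
    return margins   # on separable data, the paper states this sequence is non-decreasing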

2.3.2 AdaBoost as a Kernel method - Boosting for regression


Very recently, in 2019, Aravkin et al. [1] proposed a new view of boosting methods. This
view is better suited to regression problems, which is why we adapt our earlier notations.
However, the authors affirm that this view can be adapted to classification problems
as well.

2.3.2.1 General algorithm The way to see boosting for regression is actually very
close to the way we see boosting as successive optimization problems. Indeed, given
data 𝑦1, . . . , 𝑦𝑚 ∈ R, we want to find a function 𝐻 : 𝒳 → R that minimizes the regression
error

𝐻 = arg min_{𝐻} 𝐿(𝐻(𝑋) − 𝑦)    (67)

where 𝐿 is a loss function and 𝑋 = (𝑥1, . . . , 𝑥𝑚)^T. As is often the case for regression
problems, we want to prevent overfitting by adding a regularization term to the loss
function.

Definition 2.3.2 (Optimization problem in boosting for regression). Choosing a kernel
matrix 𝐾 ∈ R^{𝑚×𝑚}, we can write our optimization problem as

𝐻 = arg min_{𝐻} 𝐿(𝐻(𝑋) − 𝑦) + 𝛾𝑋^T𝐾𝑋 =: 𝐽(𝑋, 𝑦)    (68)

where 𝛾 > 0 is a regularization parameter.
Then, we can see boosting as stated in Algorithm 7.

Algorithm 7: Boosting for regression
Set 𝐻0 = arg min_{𝐻} 𝐽(𝑋, 𝑦)
for 𝑡 = 1, . . . , 𝑇 do
    Update 𝑦𝑡 = 𝑦 − 𝐻𝑡−1(𝑋)
    (𝑦𝑡 is the residual error at iteration 𝑡. We can see it as the new target for the next
    iteration: it boosts the importance of the examples we did not approximate well at the
    previous iteration, and reduces the importance of the examples we approximated correctly.)
    Compute ℎ𝑡 = arg min_{𝐻} 𝐽(𝑋, 𝑦𝑡)
    Compute 𝐻𝑡 = 𝐻𝑡−1 + ℎ𝑡
end
return 𝐻𝑇

2.3.2.2 Link with kernel methods for linear regression What is the link with kernel
methods? Fix 𝐿(𝐻(𝑋) − 𝑦) = ||𝑦 − 𝐻(𝑋)||² where ||·|| is the Euclidean norm. Suppose
as well that ℋ is the set of linear functions, i.e. ℋ = {𝐻 : 𝑋 ↦→ 𝛽𝑋, 𝛽 ∈ R^{𝑚×𝑚}}, and
𝑦 ∼ 𝒩(0, 𝜎²𝐼𝑚). Let 𝜆 be the kernel scale parameter related to 𝐾. Then we can identify ℎ𝑡
with 𝛽𝑡, and we can write, setting 𝛾 = 𝜎²/𝜆:

𝛽𝑡 = arg min_{𝛽} ||𝑦𝑡 − 𝛽𝑋||² + 𝛾𝑋^T𝐾𝑋 = 𝜆𝐾𝛽^T(𝜆𝛽𝐾𝛽^T + 𝜎²𝐼𝑚)^{−1} 𝑦𝑡    (69)

which has an explicit form. Setting 𝑃𝜆 = 𝜆𝛽𝐾𝛽^T and ℎ = 𝛽𝑋, the predicted data output
writes:

ℎ𝑡(𝑋) = arg min_{ℎ∈R𝑚} ||𝑦𝑡 − ℎ||² + 𝜎²ℎ^T𝑃𝜆^{−1}ℎ = 𝑃𝜆(𝑃𝜆 + 𝜎²𝐼)^{−1} 𝑦𝑡    (70)

Definition 2.3.3 (Boosting kernel). For all 𝑡 ≥ 1, we call boosting kernel the quantity
𝑃𝜆,𝑡 defined by:

𝑃𝜆,𝑡 = 𝜎²(𝐼 − 𝑃𝜆(𝑃𝜆 + 𝜎²𝐼)^{−1})^{−𝑡} − 𝜎²𝐼    (71)

Proposition 2.3.4 (𝐻𝑡 is a kernel-based estimator). One can show [1] that the updated
classifier at each iteration 𝐻𝑡 is a kernel-based estimator where the kernel is 𝑃𝜆,𝑡 .

Centre Borelli 22/39 Perceval Beja-Battais


Overview of AdaBoost: Reconciling its views to better understand its dynamics

Proof. We only give a sketch of the proof here.
Following the boosting scheme defined for regression, set 𝑆𝜆 = 𝑃𝜆(𝑃𝜆 + 𝜎²𝐼)^{−1}
to simplify the notations. Then we have:

𝐻0(𝑋) = 𝑆𝜆𝑦
𝐻1(𝑋) = 𝑆𝜆𝑦 + 𝑆𝜆(𝐼 − 𝑆𝜆)𝑦
. . .
𝐻𝑡(𝑋) = 𝑆𝜆 ∑_{𝑖=0}^{𝑡−1} (𝐼 − 𝑆𝜆)^𝑖 𝑦    (72)

However, we also have that 𝑆𝜆,𝑡 := 𝑃𝜆,𝑡(𝑃𝜆,𝑡 + 𝜎²𝐼)^{−1} simplifies to
𝑆𝜆 ∑_{𝑖=0}^{𝑡−1} (𝐼 − 𝑆𝜆)^𝑖 for all 𝑡 ≥ 1. Therefore, at iteration 𝑡, boosting returns the same
estimator as the kernel-based estimator with kernel 𝑃𝜆,𝑡.

The authors provide more detailed proof and further insights of boosting as a kernel
method in [1].
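A small numerical sanity check of this equivalence (with a random positive semi-definite matrix standing in for 𝑃𝜆 and random targets, purely for illustration): the boosted estimator of Eq. (72) coincides with the one-shot kernel estimator built from the boosting kernel of Eq. (71).

import numpy as np

rng = np.random.default_rng(1)
m, t, sigma2 = 5, 4, 0.5
B = rng.standard_normal((m, m))
P = B @ B.T                                          # a symmetric PSD matrix standing in for P_lambda
y = rng.standard_normal(m)
I = np.eye(m)

S = P @ np.linalg.inv(P + sigma2 * I)                # S_lambda
H_boost = sum(S @ np.linalg.matrix_power(I - S, i) @ y for i in range(t))    # Eq. (72)

P_t = sigma2 * np.linalg.matrix_power(np.linalg.inv(I - S), t) - sigma2 * I  # boosting kernel, Eq. (71)
H_kernel = P_t @ np.linalg.inv(P_t + sigma2 * I) @ y                         # one-shot kernel estimator

print(np.max(np.abs(H_boost - H_kernel)))            # ~ 0: the two estimators coincide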

2.3.3 AdaBoost as a Product of Experts


This subsection is based on [13].

2.3.3.1 Product of Experts models The idea behind Product of Experts (PoE) models
is that we have access to several expert models and combine their predictions to get
a better prediction. This is a very general idea, the same one that motivates every
ensemble method. PoE differs from other ensemble methods in that we suppose we
already have access to the expert models and look for a way to combine them, rather
than generating them in the best way. It is a more probabilistic approach to ensemble
methods.
According to [13], boosting can be seen as incremental learning in PoE. Incremental
learning in PoE is an algorithm close to boosting that generates an estimator iteratively
using PoE.

Algorithm 8: Incremental learning in PoE
Data: (𝑥1, 𝑦1), . . . , (𝑥𝑚, 𝑦𝑚) ∈ 𝒳 × 𝒴
Initialize 𝑊0 = (1/𝑚, . . . , 1/𝑚)
for 𝑡 = 1, . . . , 𝑇 do
    Find a hypothesis ℎ𝑡 such that ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖)/P(𝑦𝑖|𝑥𝑖, ℎ𝑡) ≤ 2 (for binary classification)
    Update 𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖)(1 − P(𝑦𝑖|𝑥𝑖, ℎ𝑡)) for 𝑖 = 1, . . . , 𝑚
    Normalize 𝑊𝑡
end
return 𝐻(𝑥) = sign(P(𝑦 = 1|𝑥, ℎ1, . . . , ℎ𝑇) − P(𝑦 = −1|𝑥, ℎ1, . . . , ℎ𝑇))

2.3.3.2 Boosting as incremental learning in PoE


Theorem 2.3.5 (Boosting as incremental learning in PoE). The AdaBoost algorithm is


equivalent to incremental learning in PoE.

Proof. Consider the estimator ℎ𝑡 at iteration 𝑡 of AdaBoost for a binary
classification problem. We can write the probability of classifying example 𝑥𝑖
correctly at iteration 𝑡 as

P(𝑦𝑖 = 𝑦|𝑥𝑖, ℎ𝑡) = P(𝑦𝑖 = 𝑦|ℎ𝑡(𝑥𝑖) = 𝑦) P(ℎ𝑡(𝑥𝑖) = 𝑦|𝑥𝑖) + P(𝑦𝑖 = 𝑦|ℎ𝑡(𝑥𝑖) = −𝑦) P(ℎ𝑡(𝑥𝑖) = −𝑦|𝑥𝑖)    (73)

Suppose that the error of classifier ℎ𝑡 is symmetric, i.e.

P(𝑦𝑖 = 𝑦|ℎ𝑡(𝑥𝑖) = −𝑦) = P(𝑦𝑖 = −𝑦|ℎ𝑡(𝑥𝑖) = 𝑦) = 𝑃𝑡    (74)

Then we have

P(𝑦𝑖 = 𝑦|𝑥𝑖, ℎ𝑡) = (1 − P(ℎ𝑡(𝑥𝑖) ≠ 𝑦𝑖|𝑥𝑖))(1 − 𝑃𝑡) + P(ℎ𝑡(𝑥𝑖) ≠ 𝑦𝑖|𝑥𝑖) 𝑃𝑡    (75)

Now, on the one hand, we have

∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖)/P(𝑦𝑖|𝑥𝑖, ℎ𝑡)
    = ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖) / ((1 − P(ℎ𝑡(𝑥𝑖) ≠ 𝑦𝑖|𝑥𝑖))(1 − 𝑃𝑡) + P(ℎ𝑡(𝑥𝑖) ≠ 𝑦𝑖|𝑥𝑖) 𝑃𝑡)
    = ∑_{ℎ𝑡(𝑥𝑖)=𝑦𝑖} 𝑊𝑡−1(𝑖)/(1 − 𝑃𝑡) + ∑_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖} 𝑊𝑡−1(𝑖)/𝑃𝑡    (76)

by supposing our classifier satisfies P(ℎ𝑡(𝑥𝑖) = 1|𝑥𝑖) ∈ {0, 1}, which means that
it produces decisions with no random component (which is almost always
the case for AdaBoost, as we use decision trees). On the other hand, we must
impose the condition ∑_{𝑖=1}^{𝑚} 𝑊𝑡−1(𝑖)/P(𝑦𝑖|𝑥𝑖, ℎ𝑡) ≤ 2, which implies:

𝜖𝑡 := ∑_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖} 𝑊𝑡−1(𝑖) ≤ 𝑃𝑡 ≤ 1/2    (77)

This hypothesis is reasonable since we hope our classifier performs better than
random guessing. Setting 𝑃𝑡 = 𝜖𝑡, we have

𝑃𝑡 = 𝑒^{−𝛼𝑡} / (𝑒^{−𝛼𝑡} + 𝑒^{𝛼𝑡})    (78)

with 𝛼𝑡 = (1/2) log((1 − 𝜖𝑡)/𝜖𝑡). We finally need to update the weights, which is done in
Alg. 8 by

𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖)(1 − P(𝑦𝑖|𝑥𝑖, ℎ𝑡))    (79)

and in our case,

1 − P(𝑦𝑖|𝑥𝑖, ℎ𝑡) = 𝑃𝑡 if ℎ𝑡(𝑥𝑖) = 𝑦𝑖, and 1 − 𝑃𝑡 otherwise
                = 𝑒^{−𝛼𝑡𝑦𝑖ℎ𝑡(𝑥𝑖)} / (𝑒^{−𝛼𝑡} + 𝑒^{𝛼𝑡})    (80)

Therefore, after normalization, we recover the weight update of AdaBoost.
We can now write the AdaBoost algorithm as incremental learning in PoE (see Alg. 9).


Algorithm 9: AdaBoost as incremental learning in PoE
Data: (𝑥1, 𝑦1), . . . , (𝑥𝑚, 𝑦𝑚) ∈ 𝒳 × {−1, 1}
Initialize 𝑊0 = (1/𝑚, . . . , 1/𝑚)
for 𝑡 = 1, . . . , 𝑇 do
    Find a hypothesis ℎ𝑡 minimizing 𝜖𝑡 = ∑_{ℎ𝑡(𝑥𝑖)≠𝑦𝑖} 𝑊𝑡−1(𝑖), with 𝜖𝑡 ≤ 1/2
    Set 𝛼𝑡 = (1/2) log((1 − 𝜖𝑡)/𝜖𝑡)
    Set P(𝑦𝑖|𝑥𝑖, ℎ𝑡) = 1 − 𝑒^{−𝛼𝑡𝑦𝑖ℎ𝑡(𝑥𝑖)} / (𝑒^{−𝛼𝑡} + 𝑒^{𝛼𝑡})
    Update 𝑊𝑡(𝑖) = 𝑊𝑡−1(𝑖)(1 − P(𝑦𝑖|𝑥𝑖, ℎ𝑡)) for 𝑖 = 1, . . . , 𝑚
    Normalize 𝑊𝑡
end
return 𝐻(𝑥) = sign(∑_{𝑡=1}^{𝑇} 𝛼𝑡ℎ𝑡(𝑥))

In practice, it is shown in [13] that Alg. 9 can be adapted to obtain significantly better
results than the original AdaBoost algorithm.
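A tiny check of the equivalence used in the proof above, on toy numbers of our own choosing: the PoE update and the AdaBoost update give the same weights after normalization, since the factor 𝑒^{−𝛼𝑡} + 𝑒^{𝛼𝑡} is constant across examples.

import numpy as np

W_prev = np.array([0.1, 0.4, 0.3, 0.2])                  # W_{t-1}
yh = np.array([1.0, -1.0, 1.0, 1.0])                     # y_i h_t(x_i)
eps = W_prev[yh < 0].sum()
alpha = 0.5 * np.log((1.0 - eps) / eps)

poe = W_prev * np.exp(-alpha * yh) / (np.exp(-alpha) + np.exp(alpha))   # update of Eq. (79)-(80)
ada = W_prev * np.exp(-alpha * yh)                                      # original AdaBoost update
print(np.allclose(poe / poe.sum(), ada / ada.sum()))                    # True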

2.3.4 AdaBoost as a dynamical system: experiments and theoretical insights

2.3.4.1 Diversity and cycling behavior Some may see boosting algorithms as a way to
increase the diversity of a set of weak learners to produce a better additive model. We
discuss here how diversity plays a role in boosting algorithms, how it can be measured,
but also how it is limited. Indeed, by seeking an estimator that must be very different
from the previous one, AdaBoost often falls into a cycling behavior, which is not the
purpose for which it was designed.

2.3.4.2 What is diversity, and how do we measure it? The key for boosting
algorithms to work is being able to combine different weak learners. Indeed, if
the set of weak learners from which we choose our ℎ𝑡 is not diverse enough, we may not
be able to improve much on the accuracy of a single weak learner with an additive model
of weak learners.
Thus, it is important that the weak learners iteratively chosen by AdaBoost are different
enough from each other. This is what we call diversity.
There are several ways to measure diversity, just as there are several ways to measure the
efficiency of an estimator [24] (precision, recall, F1-score, etc.). Let us fix a definition here.
We will use the definition of diversity from [30]:

Definition 2.3.6 (Diversity of a set of weak learners). Given a set of weak learners
ℋ = {ℎ1, . . . , ℎ𝑇}, we define the diversity of ℋ as

div(ℋ) = 1 − (2/(𝑇(𝑇 + 1))) ∑_{1≤𝑡≠𝑠≤𝑇} sim(ℎ𝑡, ℎ𝑠)    (81)

where sim(ℎ𝑡, ℎ𝑠) is a measure of the similarity between ℎ𝑡 and ℎ𝑠 defined as

sim(ℎ𝑡, ℎ𝑠) = (1/𝑚) ∑_{𝑖=1}^{𝑚} ℎ𝑡(𝑥𝑖)ℎ𝑠(𝑥𝑖)    (82)

We can see that sim(ℎ𝑡, ℎ𝑠) ∈ [−1, 1], that sim(ℎ𝑡, ℎ𝑠) = 1 if and only if ℎ𝑡 = ℎ𝑠 on the
training set, and that sim(ℎ𝑡, ℎ𝑠) = −1 if and only if ℎ𝑡 = −ℎ𝑠 on the training set.
Therefore, the diversity of ℋ increases if we add weak learners that are very different from
the ones already in ℋ, and decreases if we add weak learners that are very similar to them.
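For concreteness, a short sketch of Def. 2.3.6, assuming the weak learners' predictions on the training set are available as a (𝑇, 𝑚) array of ±1 values; the normalization follows the definition above.

import numpy as np

def diversity(preds):
    """preds[t, i] = h_t(x_i) in {-1, +1}; returns div(H) of Eq. (81)."""
    T, m = preds.shape
    sim = preds @ preds.T / m                 # sim(h_t, h_s) for all pairs, Eq. (82)
    off_diag = sim.sum() - np.trace(sim)      # sum over t != s
    return 1.0 - 2.0 * off_diag / (T * (T + 1))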

Thus, for AdaBoost as for any other ensemble method, the key for the algorithm to
work is to have a diverse enough set of weak learners (e.g. decision trees).
However, as we will illustrate in the next subsubsection, even with a diverse enough
set of estimators, the AdaBoost iterations do not necessarily produce a subset of diverse
estimators.

2.3.4.3 Diversity is not always the key: does AdaBoost always cycle? In [40], the
authors show that for some problems, the AdaBoost iterations become cyclic. This means
that the new weak learners we keep adding to our ensemble model decrease the diversity
of the set (in the sense of Def. 2.3.6).
To illustrate this, consider the following example. Let 𝒳 = [0, 1] × [0, 1] and
𝒴 = {−1, 1}. For each 𝑥, we assign a label 𝑦 as follows:

𝑦(𝑥) = 1 if 𝑥1 ≤ 1/4 or 𝑥2 ≤ 1/4 or 𝑥2 ≥ 3/4, and −1 otherwise.    (83)

Consider ℋ the set of all decision stumps on 𝒳. ℋ is a diverse set of weak learners.
However, if we run AdaBoost on this problem, we will see that the algorithm produces a
cyclic sequence very quickly. In fact, we will likely get a sequence of 3 different
decision stumps that are repeated over and over again.
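Before looking at Fig. 2, the reader can reproduce this behavior with a sketch along the following lines, using scikit-learn stumps on a grid sampling of the toy problem; the grid resolution, the number of rounds and the clipping of 𝜖𝑡 are arbitrary choices of ours.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

grid = np.linspace(0.0, 1.0, 21)
X = np.array([(a, b) for a in grid for b in grid])
y = np.where((X[:, 0] <= 0.25) | (X[:, 1] <= 0.25) | (X[:, 1] >= 0.75), 1.0, -1.0)  # labels of Eq. (83)

m = len(y)
W = np.full(m, 1.0 / m)
chosen = []
for t in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=W)
    pred = stump.predict(X)
    eps = float(np.clip(W[pred != y].sum(), 1e-12, 1 - 1e-12))
    alpha = 0.5 * np.log((1.0 - eps) / eps)
    W = W * np.exp(-alpha * y * pred)
    W = W / W.sum()
    tree = stump.tree_
    chosen.append((int(tree.feature[0]), round(float(tree.threshold[0]), 3)))  # (split feature, threshold)
print(chosen)   # after a few rounds, the same few (feature, threshold) pairs recur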

Figure 2: Example of cyclic AdaBoost iterations on the toy problem defined in Eq. 83.
The first panel shows the decision stumps chosen by AdaBoost; the second panel
illustrates them in the space of decision stumps, where the distance between classifiers
can be seen as the opposite of the similarity defined in Def. 2.3.6.


This is illustrated in Fig. 2. More difficult problems do not necessarily exhibit this
phenomenon as quickly, but we can still observe it in higher-dimensional, real-world
problems.
The behavior we tend to see is that AdaBoost first tries to cover a base space of
estimators, and then seems to look for the best combination of the estimators in this base
space by repeating the same estimators over and over again, though not necessarily all
with the same frequency.

This intuition was formalized in the open problem presented in [39] in 2012, and this
conjecture was very recently proven in [3] in 2023. We present the main ideas in the next
subsubsection.

2.3.4.4 Theoretical insights This subsubsection is based on [3]. This article proposes
an original way to see AdaBoost. It is very formal, and we will try to keep most of
its notations. We also deliberately simplify the work done there, which is far more
rigorous and precise than this subsection; our aim here is only to give intuition
about the results they have proven.
A dynamical system can be defined as follows:

Definition 2.3.7 (Dynamical system). A dynamical system is a tuple (𝒳 , Σ, 𝜇, 𝑓 ) where:

• 𝒳 is a compact metric space

• Σ is a 𝜎-algebra on 𝒳

• 𝜇 is a probability measure on 𝒳

• 𝑓 is a measurable function from 𝒳 to 𝒳 which describes the evolution of the system

We will consider AdaBoost as a dynamical system over the weights of the data at each
iteration. Thus, in our case, the dynamical system we consider is (Δ𝑚 , Σ𝑚 , 𝜇, 𝒜) where:
• Δ𝑚 is the 𝑚-dimensional simplex, i.e. Δ𝑚 = {𝑝 ∈ R^𝑚 | ∑_{𝑖=1}^{𝑚} 𝑝𝑖 = 1, 𝑝𝑖 ≥ 0}

• Σ𝑚 is the Borel 𝜎-algebra on Δ𝑚

• 𝜇 is a measure we will define later

• 𝒜 is the AdaBoost update over the weights

The main goal of the paper is to be able to apply the Birkhoff Ergodic Theorem 2.3.8 [5] to
the previous dynamical system.

Theorem 2.3.8 (Birkhoff Ergodic Theorem). Let (𝒳 , Σ, 𝜇, 𝒜) be a dynamical system and
𝑓 a measurable function from 𝒳 to ℝ. Then, there exists a measurable function 𝑓* such
that:

$$\lim_{T \to \infty} \frac{1}{T} \sum_{i=1}^{T} f(\mathcal{A}^i(x)) = f^*(x) \qquad (84)$$

for 𝜇-almost every 𝑥 ∈ 𝒳 . Moreover, 𝑓* is 𝒜-invariant, i.e. 𝑓* ∘ 𝒜 = 𝑓*.

Applying this theorem to our dynamical system would allow us to prove the
convergence of AdaBoost (w.r.t. the weights), and to prove that there exists a fixed point
of the system.
To apply such a theorem, we need to ensure that there exists a measure 𝜇 such that 𝒜 is
𝜇-preserving. This is what the Krylov-Bogoliubov Theorem 2.3.9 states.

Theorem 2.3.9 (Krylov-Bogoliubov Theorem). If (𝑊, 𝑁) is a compact metric space and
𝑔 : 𝑊 → 𝑊 is a continuous function, then there exists a Borel probability measure 𝜇 on 𝑊
such that 𝑔 is 𝜇-preserving.

We now see that in order to apply the Birkhoff Ergodic Theorem, we only have to apply
the Krylov-Bogoliubov Theorem to our (well-chosen) dynamical system. A first
observation is that it is only relevant to set the initial weight $w_1$ in the interior of the
simplex, $\Delta_m^\circ = \{w \in \mathbb{R}^m \mid \sum_{i=1}^m w_i = 1,\ w_i > 0\}$. Indeed, if we set $w_1(i)$ to be 0 for some
𝑖, then that weight will stay 0 for all iterations.
Let $\eta$ be a dichotomy of the data, i.e. $\eta = (y_1 h(x_1), \ldots, y_m h(x_m)) \in \{-1, 1\}^m$, which is
positive at the $i$th component only if $h$ classifies sample $x_i$ correctly. We can define the
set of weights $\pi(\eta) := \{w \in \Delta_m,\ \eta \in \arg\min_{\eta'} \eta'^T w\}$, i.e. the weights such that the
estimator $h$ that minimizes the weighted error $w^T \eta_h$ verifies $\eta_h = \eta$.
Define also $\pi^+(\eta) = \{w \in \pi(\eta),\ \eta^T w > 0\}$, the set of weights in $\pi(\eta)$ that make non-zero
error on the mistake dichotomy $\eta$. Then, we can define $\Delta_m^+ = \bigcup_{\eta \in \{-1,1\}^m} \pi^+(\eta)$, the set
of weights that make non-zero error on at least one mistake dichotomy.
The first big result that is established in the paper is the following:

Proposition 2.3.10 (Continuity of AdaBoost Update). The AdaBoost update $\mathcal{A}$ is
continuous over the set $\bigcup_{\eta \in \{-1,1\}^m} \pi^\circ(\eta)$.

This is a key result to apply the Krylov-Bogoliubov Theorem.

Proof. Let $(w_s)$ be a sequence of weights in $\bigcup_{\eta \in \{-1,1\}^m} \pi^\circ(\eta)$ with $w_s \to w$. We have:

$$\mathcal{A}(w_s)(i) = \frac{1}{2} w_s(i) \left(\frac{1}{\eta^T w_s}\right)^{\eta(i)} \left(\frac{1}{1 - \eta^T w_s}\right)^{1-\eta(i)} \xrightarrow[s \to \infty]{} \frac{1}{2} w(i) \left(\frac{1}{\eta^T w}\right)^{\eta(i)} \left(\frac{1}{1 - \eta^T w}\right)^{1-\eta(i)} = \mathcal{A}(w)(i) \qquad (85)$$

because $w_s \to w$ and $\eta^T w_s \to \eta^T w$. Hence $\mathcal{A}(w_s) \to \mathcal{A}(w)$, which proves
the continuity of $\mathcal{A}$ over $\bigcup_{\eta \in \{-1,1\}^m} \pi^\circ(\eta)$.
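
To make the update map concrete, here is a minimal sketch (ours, not the authors' code) of 𝒜 acting on the simplex when the hypothesis space is a finite list of dichotomies. For simplicity we encode a dichotomy with 0/1 entries (1 meaning a mistake) rather than the ±1 convention above, and we use the standard AdaBoost reweighting, which gives total mass 1/2 to the mistakes and 1/2 to the correctly classified points.

```python
# A toy sketch of the AdaBoost weight map A on the simplex.
# eta[h, i] = 1 if hypothesis h errs on sample i, 0 otherwise (assumption of this sketch).
import numpy as np

def adaboost_map(w, eta):
    """One application of A: pick the best hypothesis for w, then reweight."""
    errors = eta @ w                      # weighted error of every hypothesis
    h = int(np.argmin(errors))            # optimal AdaBoost picks the minimizer
    eps = errors[h]
    w_next = np.where(eta[h] == 1, w / (2 * eps), w / (2 * (1 - eps)))
    return w_next, h                      # mistakes get mass 1/2, correct points 1/2

# Iterate the map from uniform weights and record which hypothesis is chosen.
rng = np.random.default_rng(0)
eta = rng.integers(0, 2, size=(8, 20))    # 8 hypotheses, 20 samples
w = np.full(20, 1 / 20)
chosen = []
for _ in range(200):
    w, h = adaboost_map(w, eta)
    chosen.append(h)
print("last 30 chosen hypotheses:", chosen[-30:])   # a repeating pattern suggests a cycle
```

Iterating this map and watching the sequence of chosen hypotheses is a convenient way to visualize the cyclic behavior that the results below formalize.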

Their second main result is that the relative error of each weak classifier produced by
AdaBoost can be lower bounded.


Proposition 2.3.11 (Lower bound on the relative error of weak classifiers).

$$\forall t \in \mathbb{N}, \quad w_t^T \eta_{h_t} \ge \frac{1}{2^{t+1}} \qquad (86)$$

The proof is by induction; it is slightly more technical and requires additional notation
that we will not detail here.
In [3], the authors do not apply the Birkhoff Ergodic Theorem to the whole set $\Delta_m^+$, but
only to a subset of it: the limit set of AdaBoost that can be reached by an infinite number
of iterations starting from weights in $\Delta_m^+$, which they denote $\Omega_\infty^+ = \bigcap_{t=1}^{\infty} \mathcal{A}^t(\Delta_m^+)$.

Proposition 2.3.12 (Compactness of the AdaBoost limit set). $\Omega_\infty^+$ is compact.

The compactness of this set allows the authors to apply the Birkhoff Ergodic Theorem
over $\Omega_\infty^+$. We thus have the following proposition:

Proposition 2.3.13 (AdaBoost is ergodic over $\Omega_\infty^+$). The average over any AdaBoost
sequence starting at $w_1 \in \Omega_\infty^+$ converges. More precisely,

$$\forall w_1 \in \Omega_\infty^+, \ \forall f \in L^1(\mu), \quad \frac{1}{T} \sum_{t=0}^{T-1} f(\mathcal{A}^t(w_1)) \xrightarrow[T \to \infty]{} L(f) \in \mathbb{R} \qquad (87)$$

Again, the proof is technical and does not present much interest for our purpose. Finally,
the authors show a last big result, which they prove under different hypotheses. We will
not go too deep into the details here either, but the theorem resembles the following:

Theorem 2.3.14 (AdaBoost is Ergodic and Converges to a Cycle). Let $(\mathcal{A}_\tau)$ be a specific
sequence of functions that converges towards the AdaBoost update $\mathcal{A}$ uniformly over $\Delta_m$
(in practice, these functions are explicit in the paper, and the authors show the uniform
convergence). Then:

• $(w_t)$ converges in finite time to a cycle in $\Delta_m^+$ of period $p$.

• The AdaBoost system is ergodic.

• Let $T_0$ be the first time at which $w_t$ enters the cycle. Then, for any $f \in L^1(\mu)$, we have
that:

$$\frac{1}{T} \sum_{t=0}^{T-1} f(\mathcal{A}^t(w_1)) \xrightarrow[T \to \infty]{} \frac{1}{p} \sum_{t=0}^{p-1} f(\mathcal{A}^{T_0 + t}(w_1)) \qquad (88)$$

This shows that after a large number of iterations, AdaBoost becomes perfectly cyclic,
which confirms the intuition we built in Sec. 2.3.4.1.
However, this theorem is not completely satisfying. It shows that AdaBoost cycles
w.r.t. the weights (and thus w.r.t. the estimators ℎ), but the cycle may be very long and
may take many iterations to be reached. In practice, we want to know whether AdaBoost
cycles w.r.t. the estimators ℎ. The difficulty is that a cycle w.r.t. the estimators ℎ does not
necessarily imply a cycle w.r.t. the weights, since the weights are not unique for a given
estimator ℎ. This is what motivates a discussion that we will present in an upcoming
paper.
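
As a purely illustrative sketch (reusing the toy mistake-dichotomy setup above, with our own rounding-based test), one can probe the two notions of cycling numerically: a recurrence of the weight vectors and a recurrence of the chosen estimators. In floating point the weight orbit need not revisit a point exactly, so rounding is only a crude proxy for exact recurrence.

```python
# Illustrative only: cycling w.r.t. the weights vs. w.r.t. the chosen estimators.
import numpy as np

def adaboost_map(w, eta):
    errors = eta @ w                                   # weighted error per hypothesis
    h = int(np.argmin(errors))
    eps = errors[h]
    return np.where(eta[h] == 1, w / (2 * eps), w / (2 * (1 - eps))), h

rng = np.random.default_rng(0)
eta = rng.integers(0, 2, size=(8, 20))                 # 8 hypotheses, 20 samples
w = np.full(20, 1 / 20)

seen, chosen = {}, []
for t in range(5000):
    key = tuple(np.round(w, 10))                       # crude proxy for exact recurrence
    if key in seen:
        print(f"weights revisited at t={t}, apparent period {t - seen[key]}")
        break
    seen[key] = t
    w, h = adaboost_map(w, eta)
    chosen.append(h)
else:
    print("no (rounded) weight recurrence within 5000 iterations")
print("tail of the chosen-estimator sequence:", chosen[-20:])
# The estimator sequence may look periodic even when no weight recurrence is detected.
```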

3 Conclusion & Acknowledgements


3.1 Conclusion
The primary objective of this paper has been to provide a comprehensive and expansive
overview of AdaBoost, exploring the diverse interpretations and facets that extend
beyond its initial introduction as a PAC learning algorithm. We have uncovered
various ways to perceive and understand AdaBoost, highlighting the significance of
comprehending each interpretation to attain a comprehensive understanding of the
algorithm’s inner workings. In particular, the ergodic dynamics of AdaBoost have
emerged as a compelling avenue of research, offering a new perspective that can foster
theoretical advancements and shed light on its long-term behavior.
Although AdaBoost’s iterative dynamics are explicitly accessible at each step,
comprehending its overall behavior on a global scale remains a challenging task.
Consequently, studying the ergodic dynamics of AdaBoost has become an active and
fruitful research domain for the past 30 years. By viewing AdaBoost as an ergodic
dynamical system, novel theoretical frameworks and perspectives can be developed,
enhancing our comprehension of its convergence properties, generalization bounds, and
optimization landscape. This approach opens doors to explore the interplay between
AdaBoost and related fields, such as deep learning, where phenomena like double
descent have garnered significant attention.
Remarkably, the presence of a similar behavior to the double descent phenomenon
was observed soon after AdaBoost’s introduction, yet it has not received extensive study
except for a few notable papers. Given the recent surge of interest in double descent
within the realm of deep learning, establishing connections and parallels between
the dynamics of AdaBoost iterations and deep learning can significantly advance our
understanding of both domains. By bridging these two fields of research, we can gain
valuable insights into the intricate dynamics governing AdaBoost, leading to a deeper
appreciation of its predictive power and potential applications.
This paper aims to serve various purposes for its readers. For those unfamiliar with
AdaBoost, it offers a clear and concise introduction to the algorithm, elucidating its
multiple interpretations and shedding light on its fundamental principles. By presenting
the algorithm’s different perspectives in an accessible manner, readers can develop a
solid foundation in AdaBoost and its significance within the broader machine learning
landscape. Furthermore, experienced readers will find value in the paper’s ability to
establish connections and unify diverse views of AdaBoost, presenting a cohesive and
comprehensive understanding of the algorithm. This synthesis of perspectives provides
a launching pad for future research endeavors centered around AdaBoost’s dynamics,
enabling scholars to delve deeper into its behavior and explore novel research directions.
In conclusion, this paper has endeavored to provide a thorough exploration of
AdaBoost, transcending its initial formulation as a PAC learning algorithm. By
unifying the diverse interpretations and facets of AdaBoost, we have unveiled the
intriguing concept of ergodic dynamics, which holds promise for advancing theoretical
frameworks and deepening our comprehension of the algorithm's behavior. We have
also highlighted the significance of investigating the double descent phenomenon and
of establishing connections between AdaBoost and other fields. By presenting AdaBoost in a
multi-faceted manner, this paper caters to both novice and experienced readers, fostering
a comprehensive understanding of the algorithm and paving the way for future research
endeavors in AdaBoost’s dynamics and its broader implications in machine learning.

3.2 Acknowledgements
I would like to deeply thank Nicolas Vayatis and Argyris Kalogeratos for the time and
effort they put in this project. I would also like to thank the IDAML Chair of Centre
Borelli and its private partners for the funding of this project, without which this work
would not have been possible.

References
[1] Aleksandr Y Aravkin, Giulio Bottegal, and Gianluigi Pillonetto. “Boosting as a
kernel-based method”. In: Machine Learning 108.11 (2019), pp. 1951–1974.
[2] Peter Bartlett et al. “Boosting the margin: A new explanation for the effectiveness
of voting methods”. In: The annals of statistics 26.5 (1998), pp. 1651–1686.
[3] Joshua Belanich and Luis E Ortiz. “On the convergence properties of optimal
AdaBoost”. In: arXiv preprint arXiv:1212.1108 (2023).
[4] Candice Bentejac, Anna Csorgo, and Gonzalo Martinez-Munoz. “A comparative
analysis of gradient boosting algorithms”. In: Artificial Intelligence Review 54 (2021),
pp. 1937–1967.
[5] George D Birkhoff. “Proof of the ergodic theorem”. In: Proceedings of the National
Academy of Sciences 17.12 (1931), pp. 656–660.
[6] Hongming Chen et al. “The rise of deep learning in drug discovery”. In: Drug
discovery today 23.6 (2018), pp. 1241–1250.
[7] Tianqi Chen and Carlos Guestrin. “Xgboost: A scalable tree boosting system”. In:
Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and
data mining. 2016, pp. 785–794.
[8] Joseph A Cruz and David S Wishart. “Applications of machine learning in cancer
prediction and prognosis”. In: Cancer informatics 2 (2006), p. 117693510600200030.
[9] Robert Culkin and Sanjiv R Das. “Machine learning in finance: the case of deep
learning for option pricing”. In: Journal of Investment Management 15.4 (2017),
pp. 92–100.
[10] Sukhpreet Singh Dhaliwal, Abdullah-Al Nahid, and Robert Abbas. “Effective
intrusion detection system using XGBoost”. In: Information 9.7 (2018), p. 149.
[11] Thomas G Dietterich. “Ensemble methods in machine learning”. In: Multiple
Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21–23,
2000 Proceedings 1. Springer. 2000, pp. 1–15.
[12] Matthew F Dixon, Igor Halperin, and Paul Bilokon. Machine learning in Finance.
Vol. 1170. Springer, 2020.


[13] Narayanan U Edakunni, Gary Brown, and Tim Kovacs. “Boosting as a product of
experts”. In: arXiv preprint arXiv:1202.3716 (2012).
[14] Sophie Emerson et al. “Trends and applications of machine learning in quantitative
finance”. In: 8th international conference on economics and finance research (ICEFR
2019). 2019.
[15] Ky Fan. “Minimax theorems”. In: Proceedings of the National Academy of Sciences 39.1
(1953), pp. 42–47.
[16] Robert M. Freund, Paul Grigas, and Rahul Mazumder. “AdaBoost and Forward
Stagewise Regression are First-Order Convex Optimization Methods”. en. In: (July
2013). arXiv:1307.1192 [cs, math, stat]. URL: http://arxiv.org/abs/1307.
1192 (visited on 05/11/2023).
[17] Yoav Freund. “Boosting a weak learning algorithm by majority”. In: Information
and computation 121.2 (1995), pp. 256–285.
[18] Yoav Freund and Robert E Schapire. “A Decision-Theoretic Generalization of On-
Line Learning and an Application to Boosting”. en. In: Journal of Computer and
System Sciences 55.1 (Aug. 1997), pp. 119–139. ISSN: 00220000. DOI: 10 . 1006 /
jcss.1997.1504. URL: https://linkinghub.elsevier.com/retrieve/
pii/S002200009791504X (visited on 05/11/2023).
[19] Yoav Freund, Robert E Schapire, et al. “Experiments with a new boosting
algorithm”. In: icml. Vol. 96. Citeseer. 1996, pp. 148–156.
[20] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. “Additive logistic
regression: a statistical view of boosting (with discussion and a rejoinder by the
authors)”. In: The annals of statistics 28.2 (2000), pp. 337–407.
[21] Jerome H Friedman. “Stochastic gradient boosting”. In: Computational statistics &
data analysis 38.4 (2002), pp. 367–378.
[22] Jerome H Friedman and Werner Stuetzle. “Projection pursuit regression”. In:
Journal of the American statistical Association 76.376 (1981), pp. 817–823.
[23] Periklis Gogas and Theophilos Papadimitriou. “Machine learning in economics
and finance”. In: Computational Economics 57 (2021), pp. 1–4.
[24] Margherita Grandini, Enrico Bagli, and Giorgio Visani. “Metrics for multi-class
classification: an overview”. In: arXiv preprint arXiv:2008.05756 (2020).
[25] Michael I Jordan and Tom M Mitchell. “Machine learning: Trends, perspectives,
and prospects”. In: Science 349.6245 (2015), pp. 255–260.
[26] Guolin Ke et al. “Lightgbm: A highly efficient gradient boosting decision tree”. In:
Advances in neural information processing systems 30 (2017).
[27] B Ravi Kiran et al. “Deep reinforcement learning for autonomous driving: A
survey”. In: IEEE Transactions on Intelligent Transportation Systems 23.6 (2021),
pp. 4909–4926.
[28] Jyrki Kivinen and Manfred K. Warmuth. “Boosting as entropy projection”. en. In:
Proceedings of the twelfth annual conference on Computational learning theory. Santa
Cruz California USA: ACM, July 1999, pp. 134–144. ISBN: 978-1-58113-167-3. DOI:
10.1145/307400.307424. URL: https://dl.acm.org/doi/10.1145/
307400.307424 (visited on 05/11/2023).


[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: nature
521.7553 (2015), pp. 436–444.
[30] Nan Li, Yang Yu, and Zhi-Hua Zhou. “Diversity Regularized Ensemble Pruning.”
In: ECML/PKDD (1). 2012, pp. 330–345.
[31] Tengyuan Liang and Benjamin Recht. “Interpolating classifiers make few
mistakes”. In: arXiv preprint arXiv:2101.11815 (2021).
[32] Llew Mason et al. “Boosting algorithms as gradient descent”. In: Advances in neural
information processing systems 12 (1999).
[33] Sajjad Mozaffari et al. “Deep learning-based vehicle behavior prediction for
autonomous driving applications: A review”. In: IEEE Transactions on Intelligent
Transportation Systems 23.1 (2020), pp. 33–47.
[34] Preetum Nakkiran et al. “Deep double descent: Where bigger models and more
data hurt”. In: Journal of Statistical Mechanics: Theory and Experiment 2021.12 (2021),
p. 124003.
[35] Alexey Natekin and Alois Knoll. “Gradient boosting machines, a tutorial”. In:
Frontiers in neurorobotics 7 (2013), p. 21.
[36] Adeola Ogunleye and Qing-Guo Wang. “XGBoost model for chronic kidney
disease diagnosis”. In: IEEE/ACM transactions on computational biology and
bioinformatics 17.6 (2019), pp. 2131–2140.
[37] Saharon Rosset, Ji Zhu, and Trevor Hastie. “Boosting as a Regularized Path to a
Maximum Margin Classifier”. en. In: ().
[38] Cynthia Rudin, Ingrid Daubechies, and Robert E Schapire. “The Dynamics of
AdaBoost: Cyclic Behavior and Convergence of Margins”. en. In: ().
[39] Cynthia Rudin, Robert E Schapire, and Ingrid Daubechies. “Open Problem: Does
AdaBoost Always Cycle?” In: Conference on Learning Theory. JMLR Workshop and
Conference Proceedings. 2012, pp. 46–1.
[40] Cynthia Rudin et al. “The dynamics of AdaBoost: cyclic behavior and convergence
of margins.” In: Journal of Machine Learning Research 5.10 (2004).
[41] Omer Sagi and Lior Rokach. “Ensemble learning: A survey”. In: Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.4 (2018), e1249.
[42] Robert E Schapire. “The boosting approach to machine learning: An overview”.
In: Nonlinear estimation and classification (2003), pp. 149–171.
[43] Robert E Schapire. “The strength of weak learnability”. In: Machine learning 5
(1990), pp. 197–227.
[44] Robert E Schapire and Yoav Freund. “Boosting: Foundations and algorithms”. In:
Kybernetes 42.1 (2013), pp. 164–166.
[45] Brent Smith and Greg Linden. “Two decades of recommender systems at Amazon.
com”. In: Ieee internet computing 21.3 (2017), pp. 12–18.
[46] Harald Steck et al. “Deep learning for recommender systems: A Netflix case
study”. In: AI Magazine 42.3 (2021), pp. 7–18.
[47] Xiaolei Sun, Mingxi Liu, and Zeqian Sima. “A novel cryptocurrency price trend
forecasting model based on LightGBM”. In: Finance Research Letters 32 (2020),
p. 101084.


[48] Ayşegül Uçar, Yakup Demir, and Cüneyt Güzeliş. “Object recognition and
detection with deep learning for autonomous driving applications”. In: Simulation
93.9 (2017), pp. 759–769.
[49] Jessica Vamathevan et al. “Applications of machine learning in drug discovery and
development”. In: Nature reviews Drug discovery 18.6 (2019), pp. 463–477.
[50] Dehua Wang, Yang Zhang, and Yi Zhao. “LightGBM: an effective miRNA
classification method in breast cancer patients”. In: Proceedings of the 2017
international conference on computational biology and bioinformatics. 2017, pp. 7–11.
[51] Jian Wei et al. “Collaborative filtering and deep learning based recommendation
system for cold start items”. In: Expert Systems with Applications 69 (2017), pp. 29–
39.
[52] Abraham J Wyner et al. “Explaining the success of adaboost and random forests
as interpolating classifiers”. In: The Journal of Machine Learning Research 18.1 (2017),
pp. 1558–1590.
[53] Zhi-Hua Zhou. “Ensemble methods”. In: Combining pattern classifiers. Wiley,
Hoboken (2014), pp. 186–229.

Appendices
Appendix A How to choose the base space of estimators?
From the original to the most modern versions of AdaBoost, it is always mentioned that
AdaBoost, like any other boosting method, should be run on a set of weak estimators.
The definition of a weak estimator is not precise. In the original papers, boosting was
considered as a PAC learning algorithm, meaning that each estimator performs at least
slightly better than a purely random estimator. Mathematically, for a classification
problem with 𝐾 classes, this can be written as follows:
If the observation 𝑥 has the (true) label 𝑘, then there exists 𝛿 > 0 such that

P(𝑦 = 𝑘|ℎ(𝑥) = 𝑘) ≥ P𝑝𝑟𝑖𝑜𝑟 (𝑦 = 𝑘) + 𝛿 (89)

where $\mathbb{P}_{prior}$ is a fixed prior that we choose according to our knowledge. For instance, we
could fix $\mathbb{P}_{prior}(y = k) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{y_i = k}$.
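
As a small illustration (our own sketch, with hypothetical names `y_true` and `y_pred` for the true labels and the predictions of a candidate weak estimator), one can check this condition empirically by comparing, for each class 𝑘, the precision P(𝑦 = 𝑘 | ℎ(𝑥) = 𝑘) with the empirical prior:

```python
# Sketch: per-class margin over the empirical prior, cf. Eq. 89.
import numpy as np

def weak_learner_margin(y_true, y_pred, k):
    """P(y = k | h(x) = k) minus the empirical prior of class k; a positive value
    means condition (89) holds for class k with some delta > 0."""
    prior = float(np.mean(y_true == k))
    predicted_k = (y_pred == k)
    if not predicted_k.any():
        return -prior                                   # h never predicts class k
    precision_k = float(np.mean(y_true[predicted_k] == k))
    return precision_k - prior

# Hypothetical example: 3 classes, a classifier forced to the true label 45% of
# the time and guessing uniformly at random otherwise.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=1000)
y_pred = np.where(rng.random(1000) < 0.45, y_true, rng.integers(0, 3, size=1000))
for k in range(3):
    print(f"class {k}: margin = {weak_learner_margin(y_true, y_pred, k):+.3f}")
```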

A.1 Too weak or too strong estimators


However, would boosting be relevant if our estimators were already strong in that sense?
To see this properly, let us take ℋ𝑑 (𝒳 ), the set of decision trees of depth 𝑑 over the set 𝒳 .
Suppose we have 𝑚 observations in 𝒳 . It is of course possible to build a tree ℎ ∈ ℋ𝑚 (𝒳 )
which achieves 100% accuracy on the set. Nonetheless, AdaBoost is not even designed
to generate the next iteration after such a classifier, as it is supposed to have an error 𝜖 > 0.
But even when considering decision trees of depth small enough to ensure that none of
them can classify the whole set, boosting can lose its purpose. Indeed, we can face two
opposite phenomena:


• Too weak estimators

• Too strong estimators

How can we have too weak estimators? For a classification task with 𝐾 ≫ 2 classes, we
need sufficiently complex weak estimators for boosting to work properly. If ℋ1 (𝒳 ) is our
set of weak estimators, each tree of this set will achieve a very low accuracy, probably
much lower than a random guesser if the data is complex enough, for instance non-linearly
separable.
Now, we can also have estimators that are too strong. Indeed, if each tree in the
boosting sequence achieves an accuracy close to perfect, the benefits of boosting vanish
automatically. Of course, it still depends on what you aim for in your problem, but using
a boosting method to fit 100 trees of depth 10 on a complex task can require much more
time than training a single strong classifier. Moreover, this will likely not prevent
overfitting, as the sequence of decision trees will not vary much (because each tree
achieves very satisfying accuracy on the training set), meaning that the boosting classifier
will truly use only a few different decision trees when you expect it to use tens or
hundreds of different ones to aggregate the result.

A.2 How to detect too good or too weak estimators?


The easiest case to detect is when your estimators are too strong; it is also probably the
most likely to happen. In that case, you can observe several things when running boosting
(see the sketch after this list):

• Each estimator in the sequence achieves high accuracy on the training set.

• The similarity between the estimators is too high.

• The difference between the accuracy of each estimator and the accuracy of the
aggregated estimator is small.
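
A hedged sketch of these diagnostics is given below for an already-fitted scikit-learn AdaBoostClassifier; the helper name `boosting_diagnostics` is ours, and the `estimator` keyword is called `base_estimator` in older scikit-learn releases.

```python
# Sketch: per-estimator accuracy vs. ensemble accuracy, and pairwise Cohen's
# kappa between the weak learners' predictions, for a fitted AdaBoostClassifier.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def boosting_diagnostics(clf, X, y):
    preds = [est.predict(X) for est in clf.estimators_]
    mean_weak_acc = float(np.mean([np.mean(p == y) for p in preds]))
    ensemble_acc = float(np.mean(clf.predict(X) == y))
    T = len(preds)
    kappa = np.ones((T, T))
    for i in range(T):
        for j in range(i + 1, T):
            kappa[i, j] = kappa[j, i] = cohen_kappa_score(preds[i], preds[j])
    return mean_weak_acc, ensemble_acc, kappa

# Example on synthetic data (Fashion-MNIST would be used as in the text below).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                         n_estimators=30, random_state=0).fit(X, y)
weak, ens, K = boosting_diagnostics(clf, X, y)
off_diag = K[np.triu_indices_from(K, k=1)]
print(f"mean weak accuracy {weak:.3f}, ensemble accuracy {ens:.3f}, "
      f"mean pairwise kappa {off_diag.mean():.3f}")
# Warning signs: weak accuracy close to the ensemble accuracy together with kappa
# near 1 suggests estimators that are too strong; very low weak accuracy with a
# block/cyclic kappa pattern suggests estimators that are too weak.
```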

Let us take an example. Consider the Fashion-MNIST dataset and several sets of
estimators ℋ1 (𝒳 ), . . . , ℋ15 (𝒳 ). For each depth, we run AdaBoost for 𝑇 ∈
{10, 20, . . . , 100} iterations. We then compare the mean accuracy of the individual
decision trees with the accuracy of the ensemble model, and we compute the similarity
between the estimators of the sequence.
In Figs. 3, 4, 5, 6, 7, 8 and 9, we see that with a depth of 1 the estimators are too
weak and we cannot learn properly, as the algorithm gets stuck in a cycle too quickly.
We observe the opposite with depth 15, for which boosting seems quite useless
and all estimators seem to be the same. A good set of classifiers here is thus trees of
depth 4 or 5, for example. We can also observe that the more complex the set of
estimators is, the more similar the sequence of estimators becomes.


Figure 3: Similarity matrix for depth 1. Entry (𝑖, 𝑗) is the similarity between classifiers
𝑖 and 𝑗, measured by the kappa statistic. We recognize here the cyclic patterns due to the
lack of complexity of our set of estimators.

[Plot: train and test accuracy of the ensemble as a function of the number of estimators kept, with the mean accuracy of a single estimator shown for reference.]

Figure 4: Accuracy for depth 1. On the 𝑥-axis, the number of estimators kept to build the
ensemble, and on the 𝑦-axis, the accuracy of the considered estimator.


Figure 5: Similarity matrix for depth 4. Entry (𝑖, 𝑗) is the similarity between classifiers
𝑖 and 𝑗, measured by the kappa statistic. Here, the produced sequence is very diverse and
all estimators are sufficiently different from each other for boosting to bring a true benefit.

[Plot: train and test accuracy of the ensemble as a function of the number of estimators kept, with the mean accuracy of a single estimator shown for reference.]

Figure 6: Accuracy for depth 4. On the 𝑥-axis, the number of estimators kept to build the
ensemble, and on the 𝑦-axis, the accuracy of the considered estimator.


Figure 7: Similarity matrix for depth 15. Entry (𝑖, 𝑗) is the similarity between classifiers
𝑖 and 𝑗, measured by the kappa statistic. The similarity between estimators is very high,
meaning that the boosting sequence is not diverse enough because the estimators are too
strong.

[Plot: train and test accuracy of the ensemble as a function of the number of estimators kept, with the mean accuracy of a single estimator shown for reference; the training error reaches 0%.]

Figure 8: Accuracy for depth 15. On the 𝑥-axis, the number of estimators kept to build
the ensemble, and on the 𝑦-axis, the accuracy of the considered estimator.


[Plot: mean pairwise similarity within the sequence of estimators as a function of tree depth.]

Figure 9: Mean similarity inside the sequence of estimators. We notice that the more
complex the set of estimators is, the more similar the sequence of estimators becomes.
