
1. INTRODUCTION

Ensemble learning is a machine learning paradigm in which multiple models (often called "weak learners") are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined, we can obtain more accurate and/or more robust models.
1.1 What is Ensemble Learning?
Ensemble learning is a procedure in which multiple learner modules are applied to a dataset to extract multiple predictions, which are then combined into one composite prediction. The ensemble learning process is commonly broken down into two tasks: first, constructing a set of base learners from the training data; second, combining some or all of these models to form a unified prediction model. Ensemble methods are mathematical procedures that start with a set of base learner models. Multiple forecasts based on the different base learners are constructed and combined into an enhanced composite model that is superior to the individual base models. The final composite model provides higher prediction accuracy than the average of the individual base model predictions, and this integration of good individual models into one improved composite model generally leads to higher accuracy levels. Ensemble learning therefore provides a critical boost to forecasting ability and decision-making accuracy. Ensemble methods attempt to reduce forecasting bias while simultaneously increasing robustness and reducing variance. They combine all the individual base model forecasts to produce the final prediction. Ensemble methods are expected to be useful when there is uncertainty in choosing the best prediction model and when it is critical to avoid large prediction errors.
1.2 Model Selection
This is perhaps the primary reason why ensemble-based systems are used in practice: what is the most appropriate classifier for a given classification problem? This question can be interpreted in two different ways: i) what type of classifier should be chosen among many competing models, such as the multilayer perceptron (MLP), support vector machine (SVM), decision tree, or naive Bayes classifier; ii) given a particular classification algorithm, which realization of this algorithm should be chosen - for example, different initializations of MLPs can give rise to different decision boundaries, even if all other parameters are kept constant.
The most commonly used procedure - choosing the classifier with the smallest error on training data - is unfortunately a flawed one. Performance on a training dataset - even when computed using a cross-validation approach - can be misleading with respect to classification performance on previously unseen data. Then, of all (possibly infinitely many) classifiers that may have the same training performance - or even the same (pseudo-)generalization performance as computed on the validation data (the part of the training data set aside for evaluating classifier performance) - which one should be chosen? Everything else being equal, one may be tempted to choose at random, but with that decision comes the risk of selecting a particularly poor model. Using an ensemble of such models - instead of choosing just one - and combining their outputs, for example by simply averaging them, can reduce the risk of an unfortunate selection of a particularly poorly performing classifier. It is important to emphasize that there is no guarantee that the combination of multiple classifiers will always perform better than the best individual classifier in the ensemble. Combining classifiers may not beat the performance of the best classifier in the ensemble, but it certainly reduces the overall risk of making a particularly poor selection.
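
As a concrete illustration of this idea, the following minimal sketch (assuming scikit-learn and NumPy are available; the three candidate models and parameter values are illustrative choices, not prescribed by the text) trains several competing classifiers and simply averages their predicted class probabilities instead of committing to any single one.

# Minimal sketch: average the outputs of several candidate classifiers
# rather than committing to a single one (assumes scikit-learn / NumPy).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three competing models that may all look equally good on validation data.
models = [
    LogisticRegression(max_iter=1000),
    MLPClassifier(max_iter=1000, random_state=0),
    DecisionTreeClassifier(random_state=0),
]
for m in models:
    m.fit(X_train, y_train)

# Simple averaging of predicted class probabilities ("soft voting").
avg_proba = np.mean([m.predict_proba(X_test) for m in models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())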

Figure 1.1: Combining an ensemble of classifiers.


Using an ensemble allows individual classifiers to generate different decision boundaries. If proper diversity is achieved, each classifier makes different errors, and a strategic combination of these classifiers can then reduce the total error. Figure 1.1 graphically illustrates this concept: each classifier - trained on a different subset of the available training data - makes different errors (shown as instances with dark borders), but the combination of the (three) classifiers provides the best decision boundary.

2. HISTORY

Ensemble learning was originally proposed for classification tasks in a supervised learning setting (Nilsson, 1965). It combines two or more classification models in order to produce one optimal predictive model.

Boosting was proposed by Schapire (1990). It is an iterative technique that adjusts the weight of an observation based on the last classification: if an observation was classified incorrectly, boosting increases its weight, and vice versa. Boosting generally decreases the bias error and builds strong predictive models; however, it may sometimes overfit the training data.

Wolpert (1992) introduced stacked generalization, which involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data; then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs.

Breiman (1996) proposed bootstrap aggregating, also called bagging, a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.

3. ENSEMBLE LEARNING METHODS

Ensemble methods are a machine learning technique that combines several base models in order to produce one optimal predictive model. Ensemble modeling is a powerful way to improve the performance of a model. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model. Ensemble learning usually produces more accurate solutions than a single model would. Ensemble learning methods are applied to regression as well as classification. Ensemble learning for regression creates multiple regressors, i.e., multiple regression models such as linear or polynomial regression.

Types of Ensemble Methods

1. Boosting

2. Stacking

3. Bagging

3.1 Boosting, which often considers homogeneous weak learners, learns them sequentially in a very adaptive way (a base model depends on the previous ones) and combines them following a deterministic strategy.
3.2 Stacking, which often considers heterogeneous weak learners, learns them in parallel and combines them by training a meta-model to output a prediction based on the different weak models' predictions.
3.3 Bagging, which often considers homogeneous weak learners, learns them independently from each other in parallel and combines them following some kind of deterministic averaging process (a short sketch instantiating all three approaches is given below).
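
To make these three flavors concrete, the sketch below is a non-authoritative illustration assuming scikit-learn is installed; the particular base learners and estimator counts are arbitrary choices. Each object exposes the usual fit(X, y) / predict(X) interface.

# Sketch: one ensemble of each type, built with scikit-learn's standard classes.
# The chosen base learners and estimator counts are illustrative, not prescriptive.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# 1. Boosting: homogeneous weak learners (decision stumps by default),
#    fitted sequentially, each round focusing on the errors of its predecessors.
boosting = AdaBoostClassifier(n_estimators=50)

# 2. Stacking: heterogeneous learners trained in parallel; a meta-model
#    (here logistic regression) is trained on their predictions.
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),
)

# 3. Bagging: homogeneous learners (decision trees by default) trained
#    independently on bootstrap samples and combined by voting/averaging.
bagging = BaggingClassifier(n_estimators=50)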

4. BOOSTING

Boosting was proposed by Schapire (1990). It is an iterative technique that adjusts the weight of an observation based on the last classification: if an observation was classified incorrectly, boosting increases its weight, and vice versa. Boosting generally decreases the bias error and builds strong predictive models; however, it may sometimes overfit the training data (see the example in Figure 4.1).

Figure 4.1: Example of Boosting.

Boosting consists of iteratively fitting a weak learner, aggregating it into the ensemble model, and updating the training dataset to better take into account the strengths and weaknesses of the current ensemble model when fitting the next base model.
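
The following sketch makes this weight-update loop explicit. It is an AdaBoost-style illustration only (assuming NumPy and scikit-learn for the decision-stump weak learner, and binary labels coded as -1/+1); the function names are illustrative, not a reference implementation.

# AdaBoost-style boosting sketch: observation weights are increased for
# misclassified points before the next weak learner is fitted.
# Assumes binary labels y in {-1, +1}; helper names are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_boosting(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                   # start with uniform weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)      # weak learner on weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # weighted error
        alpha = 0.5 * np.log((1.0 - err) / (err + 1e-10))
        w *= np.exp(-alpha * y * pred)        # up-weight the misclassified points
        w /= w.sum()                          # renormalize the weights
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict_boosting(learners, alphas, X):
    # Final model: weighted vote of all the weak learners.
    return np.sign(sum(a * l.predict(X) for l, a in zip(learners, alphas)))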

5. STACKED GENERALIZATION

In Wolpert's stacked generalization (or stacking), an ensemble of classifiers is first trained using bootstrapped samples of the training data, creating Tier 1 classifiers, whose outputs are then used to train a Tier 2 classifier (meta-classifier) (Wolpert 1992). The underlying idea is to learn whether training data have been properly learned. For example, if a particular classifier incorrectly learned a certain region of the feature space, and hence consistently misclassifies instances coming from that region, then the Tier 2 classifier may be able to learn this behavior and, along with the learned behaviors of other classifiers, correct such improper training. A cross-validation type of partitioning is typically used for training the Tier 1 classifiers: the entire training dataset is divided into T blocks, and each Tier 1 classifier is first trained on (a different set of) T-1 blocks of the training data. Each classifier is then evaluated on the Tth (pseudo-test) block, not seen during training. The outputs of these classifiers on their pseudo-test blocks, along with the actual correct labels for those blocks, constitute the training dataset for the Tier 2 classifier (see Figure 5.1).

Figure 5.1: Stacked generalization.
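
A compact way to reproduce this block-wise scheme in code is shown below. It is a rough sketch under the assumptions that scikit-learn is available, the problem is binary, and T=5 blocks are used; the Tier 1 and Tier 2 models and the helper names are arbitrary choices.

# Stacking sketch: Tier 1 classifiers produce out-of-fold ("pseudo-test")
# predictions, which become the training inputs of a Tier 2 meta-classifier.
# Assumes a binary classification problem; model choices are illustrative.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def fit_stack(X, y, T=5):
    tier1 = [DecisionTreeClassifier(random_state=0), GaussianNB()]
    # Each block's predictions come from models trained on the other T-1 blocks.
    meta_features = np.column_stack([
        cross_val_predict(m, X, y, cv=T, method="predict_proba")[:, 1]
        for m in tier1
    ])
    tier2 = LogisticRegression().fit(meta_features, y)   # Tier 2 combiner
    for m in tier1:
        m.fit(X, y)        # refit Tier 1 models on the full training set
    return tier1, tier2

def predict_stack(tier1, tier2, X_new):
    z = np.column_stack([m.predict_proba(X_new)[:, 1] for m in tier1])
    return tier2.predict(z)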

6. BAGGING
Bagging, which stands for bootstrap aggregating, is one of the earliest, most intuitive, and perhaps the simplest ensemble-based algorithms, with a surprisingly good performance (Breiman 1996). Diversity of classifiers in bagging is obtained by using bootstrapped replicas of the training data. That is, different training data subsets are randomly drawn – with replacement – from the entire training dataset. Each training data subset is used to train a different classifier of the same type. Individual classifiers are then combined by taking a simple majority vote of their decisions. For any given instance, the class chosen by the largest number of classifiers is the ensemble decision. Since the training datasets may overlap substantially, additional measures can be used to increase diversity, such as using a subset of the training data for training each classifier, or using relatively weak classifiers (such as decision stumps). An example of bagging is shown in Figure 6.1.

Figure 6.1: Example of Bagging.
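
As a rough sketch of the procedure just described (assuming NumPy and scikit-learn, with decision trees as the base classifiers and non-negative integer class labels; all names are illustrative), bagging can be written in a few lines:

# Bagging sketch: bootstrap replicas of the training set, one classifier per
# replica, and a simple majority vote. Assumes non-negative integer labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagging(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    classifiers = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)   # draw n indices with replacement
        clf = DecisionTreeClassifier()
        clf.fit(X[idx], y[idx])            # train on the bootstrap replica
        classifiers.append(clf)
    return classifiers

def predict_bagging(classifiers, X_new):
    votes = np.stack([clf.predict(X_new) for clf in classifiers])
    # Majority vote: for each instance, the class chosen by most classifiers wins.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)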

7. CONCLUSION

Although ensemble methods can help win machine learning competitions by devising sophisticated algorithms and producing results with high accuracy, they are often not preferred in industries where interpretability is more important. Nonetheless, the effectiveness of these methods is undeniable, and their benefits in appropriate applications can be tremendous. In fields such as healthcare, even the smallest improvement in the accuracy of machine learning algorithms can be truly valuable.

References
[1] N. J. Nilsson, Learning Machines: Foundations of Trainable Pattern-Classifying Systems. New York: McGraw-Hill, 1965.

[2] J. Rocca, "Ensemble methods: bagging, boosting and stacking", Accessed: August 2019. [Online]. Available: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205

[3] R. E. Schapire, "The strength of weak learnability", Machine Learning, 5(2):197-227, 1990.

[4] D. H. Wolpert, "Stacked generalization", Neural Networks, 5:241-259, 1992.

[5] L. Breiman, "Bagging predictors", Machine Learning, 24(2):123-140, 1996.
