WINSEM2021-22 CSE4020 ETH VL2021220502460 Reference Material I 02-03-2022 Ensemble Classifier Intro 22

Ensembles of Classifiers

Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind an ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model. When we try to predict a target variable with any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias. Ensembles help to reduce these factors (except noise, which is irreducible error).

Another way to think about ensemble learning is the fable of the blind men and the elephant. Each blind man had his own description of the elephant. Even though each description was true, it would have been better for them to come together and discuss their understanding before reaching a final conclusion. This story captures the spirit of ensemble learning.

Techniques such as bagging and boosting help to decrease the variance and increase the robustness of the model. Combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.
Different Classifiers

• Performance
  – None of the classifiers is perfect
  – Complementary
    • Examples which are not correctly classified by one classifier may be correctly classified by the other classifiers
• Potential Improvements?
  – Utilize the complementary property
Ensembles of Classifiers

• Idea
  – Combine the classifiers to improve the performance
• Ensembles of Classifiers
  – Combine the classification results from different classifiers to produce the final output
    • Unweighted voting
    • Weighted voting
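As a rough illustration of the two combination rules, here is a minimal Python sketch; the classifier predictions and weights below are made up for illustration:

    from collections import Counter

    def unweighted_vote(predictions):
        """Return the class predicted most often (ties broken arbitrarily)."""
        return Counter(predictions).most_common(1)[0][0]

    def weighted_vote(predictions, weights):
        """Return the class with the largest total weight across classifiers."""
        totals = {}
        for label, w in zip(predictions, weights):
            totals[label] = totals.get(label, 0.0) + w
        return max(totals, key=totals.get)

    # Hypothetical predictions from three classifiers for one test instance.
    preds = ["spam", "ham", "spam"]
    print(unweighted_vote(preds))                 # -> "spam"
    print(weighted_vote(preds, [0.2, 0.9, 0.3]))  # -> "ham" (the heavily weighted classifier wins)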
Ensembles of Classifiers
• Basic idea: learn a set of classifiers (experts) and allow them to vote
• Advantage: improvement in predictive accuracy
• Disadvantage: it is difficult to understand an ensemble of classifiers
Why Does Majority Voting Work?
• Suppose there are 25 base classifiers
  – Each classifier has an error rate ε = 0.35
  – Assume the errors made by the classifiers are uncorrelated
  – The ensemble makes a wrong prediction only if at least 13 of the 25 classifiers are wrong, so the probability that the ensemble classifier makes a wrong prediction is

    P(X \geq 13) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1 - \varepsilon)^{25 - i} \approx 0.06
Methods for Independently Constructing Ensembles

One way to force a learning algorithm to construct multiple hypotheses is to run the algorithm several times and provide it with somewhat different data in each run. This idea is used in the following methods:
• Majority Voting
• Bagging
• Randomness Injection
• Feature-Selection Ensembles
• Error-Correcting Output Coding
Majority Vote

[Diagram: the original training data D is used to build multiple classifiers C1, C2, ..., Ct-1, Ct (Step 1: Build Multiple Classifiers), which are then combined into a single ensemble classifier C* (Step 2: Combine Classifiers).]
Hard Voting Classifier: aggregate the predictions of each classifier and predict the class that gets the most votes. This is called a "majority-voting" or "hard-voting" classifier.
Soft Voting Classifier: if all classifiers (algorithms) in the ensemble can estimate class probabilities (i.e., they all have a predict_proba() method), then Scikit-Learn can be told to predict the class with the highest class probability, averaged over all the individual classifiers.
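A minimal Scikit-Learn sketch of the two voting modes; the dataset and base estimators below are illustrative assumptions, not taken from the slides:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    base = [("lr", LogisticRegression(max_iter=1000)),
            ("dt", DecisionTreeClassifier(max_depth=3)),
            ("nb", GaussianNB())]

    # Hard voting: each classifier casts one vote for its predicted class.
    hard = VotingClassifier(estimators=base, voting="hard").fit(X_train, y_train)
    # Soft voting: class probabilities from predict_proba() are averaged.
    soft = VotingClassifier(estimators=base, voting="soft").fit(X_train, y_train)

    print("hard voting accuracy:", hard.score(X_test, y_test))
    print("soft voting accuracy:", soft.score(X_test, y_test))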
Why does bagging work?
• Bagging reduces variance by voting/averaging, thus reducing the overall expected error
  – In the case of classification there are pathological situations where the overall error might increase
  – Usually, the more classifiers the better
Bagging
• Employs the simplest way of combining predictions that belong to the same type
• Combining can be realized with voting or averaging
• Each model receives equal weight
• "Idealized" version of bagging:
  – Sample several training sets of size n (instead of just having one training set of size n)
  – Build a classifier for each training set
  – Combine the classifiers' predictions
• This improves performance in almost all cases if the learning scheme is unstable (e.g., decision trees)
Bagging classifiers

Classifier generation
  Let n be the size of the training set.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting classifier.

Classification
  For each of the t classifiers:
    Predict the class of the instance using the classifier.
  Return the class that was predicted most often.
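A minimal Python sketch of the generation/classification pseudocode above; the base learner (a decision tree) and the number of iterations t are illustrative assumptions:

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, t=25, random_state=0):
        # X, y are NumPy arrays; train t classifiers on bootstrap samples.
        rng = np.random.default_rng(random_state)
        n = len(X)
        classifiers = []
        for _ in range(t):
            idx = rng.integers(0, n, size=n)            # sample n instances with replacement
            clf = DecisionTreeClassifier().fit(X[idx], y[idx])
            classifiers.append(clf)                     # store the resulting classifier
        return classifiers

    def bagging_predict(classifiers, x):
        votes = [clf.predict(x.reshape(1, -1))[0] for clf in classifiers]
        return Counter(votes).most_common(1)[0][0]      # class predicted most often

Scikit-Learn packages the same idea as sklearn.ensemble.BaggingClassifier (and, with extra tree-specific randomness, RandomForestClassifier).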
Example: Weather Forecast

[Figure: five individual weather forecasters are compared against reality over several days; each forecaster makes a few errors (marked X) on different days, and the "Combine" row shows the result of combining their forecasts by majority vote.]
Different Learners: Bagging

Different Learners: Boosting
• Boosting: train the next learner on mistakes made by previous learner(s)
• Want "weak" learners
  – Weak: P(correct) > 50%, but not necessarily by a lot
  – Idea: solve easy problems with a simple model
  – Save the complex model for hard problems

Original Boosting
1. Split the data X into {X1, X2, X3}
2. Train L1 on X1
   – Test L1 on X2
3. Train L2 on L1's mistakes on X2 (plus some instances L1 got right)
   – Test L1 and L2 on X3
4. Train L3 on the disagreements between L1 and L2
• Testing: apply L1 and L2; if they disagree, use L3
• Drawback: needs a large X
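A rough Python sketch of this three-learner scheme, under simplifying assumptions (decision stumps as the weak learner, equal thirds for the split, an equal number of "right" instances added in step 3, and at least one mistake and one disagreement in each stage):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def original_boost_fit(X, y, random_state=0):
        # X, y are NumPy arrays; split the data into three parts.
        rng = np.random.default_rng(random_state)
        X1, X2, X3 = np.array_split(X, 3)
        y1, y2, y3 = np.array_split(y, 3)
        L1 = DecisionTreeClassifier(max_depth=1).fit(X1, y1)
        # Train L2 on L1's mistakes on X2, plus an equal number of correctly classified instances.
        wrong = L1.predict(X2) != y2
        n_right = min(int(wrong.sum()), int((~wrong).sum()))
        right_idx = rng.choice(np.flatnonzero(~wrong), size=n_right, replace=False)
        idx2 = np.concatenate([np.flatnonzero(wrong), right_idx])
        L2 = DecisionTreeClassifier(max_depth=1).fit(X2[idx2], y2[idx2])
        # Train L3 on the instances of X3 where L1 and L2 disagree.
        disagree = L1.predict(X3) != L2.predict(X3)
        L3 = DecisionTreeClassifier(max_depth=1).fit(X3[disagree], y3[disagree])
        return L1, L2, L3

    def original_boost_predict(models, X):
        L1, L2, L3 = models
        p1, p2, p3 = L1.predict(X), L2.predict(X), L3.predict(X)
        return np.where(p1 == p2, p1, p3)   # if L1 and L2 disagree, use L3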

Methods for Coordinated Construction of Ensembles

The key idea is to learn complementary classifiers so that instance classification is realized by taking a weighted sum of the classifiers. This idea is used in two methods:
• Boosting
• Stacking
Boosting
• Also uses voting/averaging, but models are weighted according to their performance
• Iterative procedure: new models are influenced by the performance of previously built ones
  – The new model is encouraged to become an expert for instances classified incorrectly by earlier models
  – Intuitive justification: models should be experts that complement each other
• There are several variants of this algorithm
AdaBoost.M1

Classifier generation
  Assign equal weight to each training instance.
  For each of t iterations:
    Learn a classifier from the weighted dataset.
    Compute the error e of the classifier on the weighted dataset.
    If e is equal to zero, or e is greater than or equal to 0.5:
      Terminate classifier generation.
    For each instance in the dataset:
      If the instance is classified correctly by the classifier:
        Multiply the weight of the instance by e / (1 - e).
    Normalize the weights of all instances.

Classification
  Assign a weight of zero to all classes.
  For each of the t classifiers:
    Add -log(e / (1 - e)) to the weight of the class predicted by the classifier.
  Return the class with the highest weight.
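For comparison, a minimal Scikit-Learn sketch using AdaBoostClassifier, whose boosting scheme is a close relative of the AdaBoost.M1 pseudocode above; the dataset and the number of boosting rounds are illustrative assumptions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # The default base learner is a decision stump (a depth-1 decision tree),
    # i.e., a deliberately weak classifier that is re-weighted over 50 boosting rounds.
    boost = AdaBoostClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
    print("AdaBoost accuracy:", boost.score(X_test, y_test))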
Remarks on Boosting
• Boosting can be applied without weights, using re-sampling with probabilities determined by the weights
• Boosting decreases the training error exponentially in the number of iterations
• Boosting works well if the base classifiers are not too complex and their error does not become too large too quickly
• Boosting reduces the bias component of the error of simple classifiers
Some Practical Advice
• If the classifier is unstable (high variance), then apply bagging
• If the classifier is stable and simple (high bias), then apply boosting
• If the classifier is stable and complex, then apply randomness injection
• If you have many classes and a binary classifier, then try error-correcting output codes; if that does not work, then use a complex binary classifier
