
Lecture 8

Ensembles

Machine Learning
Andrey Filchenkov

17.06.2016
Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical properties

• The presentation is prepared using materials from K. V. Vorontsov’s course “Machine Learning”.
Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Weak and strong learnability

Weak learnability means that one can find in polynomial time an algorithm whose accuracy is better than 0.5, i.e. better than random guessing.
Strong learnability means that one can find in polynomial time an algorithm whose accuracy can be made arbitrarily high.

What is true: weak or strong learnability?



Weak and strong learnability

Weak learnability means that one can find in polynomial time an algorithm whose accuracy is better than 0.5, i.e. better than random guessing.
Strong learnability means that one can find in polynomial time an algorithm whose accuracy can be made arbitrarily high.

Theorem (Schapire, 1990)


Strong learnability is equivalent to weak learnability: any weak model can be strengthened by a composition of algorithms.



Simple example

We have n algorithms whose probabilities of a correct classification answer are p_1, p_2, …, p_n ≈ p, and these probabilities are independent.
The resulting algorithm chooses the class label preferred by the majority of these algorithms.
Then the probability of a correct classification answer is

P(correct) = Σ_{k = ⌊n/2⌋ + 1}^{n} C(n, k) p^k (1 − p)^{n − k}.
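A quick numerical check of this formula (a sketch with illustrative values n = 11 and p = 0.7):

from math import comb

n, p = 11, 0.7
# the majority vote is correct when more than n/2 algorithms are correct
P_correct = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(n // 2 + 1, n + 1))
print(P_correct)  # ~0.92, noticeably better than a single algorithm with p = 0.7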



Problem formulation

Object set X, answer set Y.
Training sample X^ℓ = {x_i}_{i=1}^{ℓ} with known labels {y_i}_{i=1}^{ℓ}.
Family of basic algorithms
H = { h(x, θ) : X → R | θ ∈ Θ },
where θ is a parameter vector describing an algorithm and R is the space of estimates (usually ℝ or ℝ^{|Y|}).
Problem: find (synthesize) an algorithm that forecasts the labels of objects from X as precisely as possible.



Composition of algorithms

A composition of basic algorithms
h_1, …, h_T : X → R is
a(x) = C( F( h_1(x), …, h_T(x) ) ),
where C : R → Y is the decision rule and F : R^T → R is the adjusting function.

R is usually wider than Y.
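A minimal sketch of the idea in Python, assuming binary labels in {−1, +1}: the adjusting function F is a weighted sum of the base algorithms' outputs and the decision rule C is sign(·); all names are illustrative.

import numpy as np

def compose(base_algorithms, alphas):
    """Composition a(x) = C(F(h_1(x), ..., h_T(x))) with F = weighted sum, C = sign."""
    def a(X):
        F = sum(alpha * h.predict(X) for h, alpha in zip(base_algorithms, alphas))
        return np.sign(F)
    return a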



Voting

The simplest example of an adjusting function is voting: the different algorithms vote for a class label.
Two types of voting:
• majority (hard) voting — count the votes;
• soft voting — sum the predicted class probabilities.
We can also assign weights to the voters (this works better with soft voting).



Implementation (in Python)

Python: VotingClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# 'hard' = majority voting over the predicted class labels
eclf1 = VotingClassifier(estimators=
    [('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='hard')
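The same ensemble with probability-based (soft) voting and voter weights, as mentioned on the previous slide (the weight values are illustrative):

eclf2 = VotingClassifier(estimators=
    [('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='soft', weights=[2, 1, 1])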



Implementation (in WEKA)

WEKA: Vote
Classifier[] classifiers = {
    new J48(),
    new NaiveBayes(),
    new RandomForest()
};

Vote voter = new Vote();
voter.setClassifiers(classifiers);



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Key idea

Idea: we can build diverse algorithms by training the same model under different conditions.
More precisely: we can train it on different parts of the dataset.



Synthesis of random algorithms

• Subsampling: learn each algorithm on a subsample.
• Bagging: learn each algorithm on a bootstrap subsample of the same length (random sampling with replacement).
• Random subspace method: learn the algorithms on random subspaces of features.
• Filtering (next slide).



Filtering

Suppose we have a sample of unlimited size.
Learn the first algorithm on X₁, which consists of the first ℓ objects.
Then toss a coin ℓ times:
• heads: add to X₂ the first (not yet used) object that the first algorithm misclassifies;
• tails: add to X₂ the first (not yet used) object that it classifies correctly.
Learn the second algorithm on X₂.
Add to X₃ the first ℓ objects on which the first two classifiers give different answers.
Learn the third algorithm on X₃.
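A rough sketch of this filtering procedure, assuming the sample is given as a practically unlimited Python generator of (x, y) pairs and decision stumps are used as base algorithms; all names and parameters are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def next_such_that(stream, keep):
    # scan the stream until an object satisfying keep(x, y) appears
    while True:
        x, y = next(stream)
        if keep(x, y):
            return x, y

def boosting_by_filtering(stream, n, seed=0):
    rng = np.random.default_rng(seed)
    # first algorithm: the first n objects of the stream
    X1, y1 = zip(*(next(stream) for _ in range(n)))
    h1 = DecisionTreeClassifier(max_depth=1).fit(np.array(X1), np.array(y1))
    # second algorithm: toss a coin n times; heads -> next object misclassified
    # by h1, tails -> next correctly classified object
    pairs = []
    for _ in range(n):
        heads = bool(rng.integers(2))
        pairs.append(next_such_that(
            stream, lambda x, y: (h1.predict([x])[0] != y) == heads))
    X2, y2 = zip(*pairs)
    h2 = DecisionTreeClassifier(max_depth=1).fit(np.array(X2), np.array(y2))
    # third algorithm: the first n objects on which h1 and h2 disagree
    X3, y3 = zip(*(next_such_that(
        stream, lambda x, y: h1.predict([x])[0] != h2.predict([x])[0])
        for _ in range(n)))
    h3 = DecisionTreeClassifier(max_depth=1).fit(np.array(X3), np.array(y3))
    return h1, h2, h3  # the final answer is the majority vote of the three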



Implementation

Python: BaggingClassifier
BaggingRegressor

WEKA: Bagging
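A minimal usage sketch for the scikit-learn classes named above; the base estimator, dataset, and parameter values are illustrative (max_features adds a random-subspace flavour on top of bagging):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,
                        max_samples=1.0,   # bootstrap subsamples of full length
                        bootstrap=True,    # sampling with replacement
                        max_features=0.7,  # random subset of features per estimator
                        random_state=1).fit(X, y)
print(bag.score(X, y))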



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Random Forest

For a sample X^ℓ = {(x_i, y_i)}_{i=1}^{ℓ} with n features:

1. Choose a bootstrap subsample of size ℓ.
2. Build a decision tree on that subsample; at each vertex consider only n′ (usually n′ = √n) randomly chosen features.
3. No pruning is applied.
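A minimal from-scratch sketch of these steps, repeated for many trees whose majority vote gives the forest's answer; it assumes scikit-learn decision trees (max_features='sqrt' plays the role of n′ = √n, and the trees are grown without pruning) and integer class labels. All names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, seed=1):
    rng = np.random.default_rng(seed)
    ell = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, ell, size=ell)                 # 1. bootstrap subsample of size ell
        tree = DecisionTreeClassifier(max_features='sqrt')   # 2. sqrt(n) random features per split
        tree.fit(X[idx], y[idx])                             # 3. grown fully, no pruning
        forest.append(tree)
    return forest

def random_forest_predict(forest, X):
    votes = np.array([tree.predict(X) for tree in forest]).astype(int)
    # majority vote of the trees for every object
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)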



Implementation

Python: RandomForestClassifier
RandomForestRegressor
ExtraTreesClassifier
ExtraTreesRegressor
WEKA: RandomForest



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Boosting problem formulation

Let’s synthesize an algorithm of the form

a(x) = Σ_{t=1}^{T} α_t h(x, θ_t),

where α_t ∈ ℝ are the coefficients minimizing the empirical risk

Q = Σ_{i=1}^{ℓ} L( a(x_i), y_i ) → min

for a loss function L( a(x_i), y_i ).



Gradient descent


It is hard to find the exact solution {α_t, θ_t}_{t=1}^{T} at once.
Instead, we build the function step by step:

F_t = F_{t−1} + α_t h(x, θ_t).

To choose the increment, we estimate the gradient of the error functional
Q(F) = Σ_{i=1}^{ℓ} L( F(x_i), y_i ).



Gradient

Gradient:

g_i = ∂Q / ∂F_{t−1}(x_i) = ∂L( F_{t−1}(x_i), y_i ) / ∂F_{t−1}(x_i),   i = 1, …, ℓ.

Thus, we make the gradient step

F_t = F_{t−1} − η g.
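For example, for the squared loss L(F, y) = ½ (F − y)² (an illustrative choice of L), the components are g_i = F_{t−1}(x_i) − y_i, so the negative gradient −g_i = y_i − F_{t−1}(x_i) is exactly the residual that the next base algorithm has to fit.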



Parameter selection

F_t = F_{t−1} − η g,

η = argmin_η Σ_{i=1}^{ℓ} L( F_{t−1}(x_i) − η g_i, y_i ).

The vector g itself is not a basic algorithm, so we approximate it by one:

θ_t = argmin_{θ ∈ Θ} Σ_{i=1}^{ℓ} ( h(x_i, θ) + g_i )² ≡
  ≡ LEARN( {x_i}_{i=1}^{ℓ}, {−g_i}_{i=1}^{ℓ} ).

Then find α_t with a linear search:

α_t = argmin_α Σ_{i=1}^{ℓ} L( F_{t−1}(x_i) + α h(x_i, θ_t), y_i ).



Generalized algorithm

Input: training sample X^ℓ, number of steps T;
F_0 = LEARN( {x_i}_{i=1}^{ℓ}, {y_i}_{i=1}^{ℓ} )

1. for t = 1 to T do
2.   g_i = ∂L( F_{t−1}(x_i), y_i ) / ∂F_{t−1}(x_i),   i = 1, …, ℓ
3.   θ_t = LEARN( {x_i}_{i=1}^{ℓ}, {−g_i}_{i=1}^{ℓ} )
4.   α_t = argmin_α Σ_{i=1}^{ℓ} L( F_{t−1}(x_i) + α h(x_i, θ_t), y_i )
5.   F_t = F_{t−1} + α_t h(x, θ_t)
6. return F_T
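A minimal from-scratch sketch of this loop for the squared loss, with depth-limited regression trees as the base algorithms h(·, θ); for brevity, the line search of step 4 is replaced by a fixed learning rate and F_0 is simply the mean of y. All names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, T=100, lr=0.1, max_depth=3):
    F = np.full(len(y), y.mean())   # F_0: a constant initial model
    trees = []
    for t in range(T):
        g = F - y                   # step 2: gradient of 0.5*(F - y)^2 w.r.t. F
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, -g)  # step 3: fit h to -g
        F = F + lr * h.predict(X)   # step 5 with alpha_t fixed to lr
        trees.append(h)
    return y.mean(), lr, trees

def gradient_boosting_predict(model, X):
    F0, lr, trees = model
    return F0 + lr * sum(h.predict(X) for h in trees)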



Decision rule
Decision rule C : R → Y:
• for regression, Y = ℝ:
  C(F) = F, possibly with a monotone transformation;
• for classification into M classes, Y = {1, …, M}:
  C(F_1, …, F_M) = argmax_{m ∈ Y} F_m;
• for binary classification, Y = {−1, +1}:
  C(F) = sign(F).
Usually the misclassification loss L(a(x), y) = [ a(x) ≠ y ] is used.



Smoothness of Q

Typical Q is piecewise constant:

Q(a) = Σ_{i=1}^{ℓ} L( a(x_i), y_i ) = Σ_{i=1}^{ℓ} [ y_i Σ_{t=1}^{T} α_t h(x_i, θ_t) < 0 ].

Smooth approximations L̃(M) of the margin loss [M ≤ 0]:

• L̃(M) = exp(−M) — exponential (in AdaBoost)
• L̃(M) = log₂(1 + exp(−M)) — logarithmic (in LogitBoost)
• L̃(M) = (1 − M)² — quadratic (in GentleBoost)
• a Gaussian-shaped loss (in BrownBoost)

Replacing [M < 0] with L̃(M) gives a smooth upper bound Q̃(a) ≥ Q(a).
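Illustrative Python definitions of three of these losses; each upper-bounds the threshold loss [M ≤ 0] for every margin M.

import numpy as np

exp_loss   = lambda M: np.exp(-M)                # exponential (AdaBoost)
logit_loss = lambda M: np.log2(1 + np.exp(-M))   # logarithmic (LogitBoost)
quad_loss  = lambda M: (1 - M) ** 2              # quadratic (GentleBoost)
zero_one   = lambda M: np.where(M <= 0, 1.0, 0.0)  # the threshold loss itself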



Well-known algorithms

• AdaBoost
• AnyBoost
• LogitBoost
• BrownBoost
• ComBoost
• Stochastic gradient boosting



Implementations

Python: GradientBoostingClassifier
GradientBoostingRegressor
AdaBoostClassifier
AdaBoostRegressor
WEKA: AdaBoostM1
LogitBoost
AdditiveRegression
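A minimal usage sketch for the scikit-learn gradient boosting implementation; the dataset and parameter values are illustrative (n_estimators is the number of boosting steps T, learning_rate scales every α_t).

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=1)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=1).fit(X, y)
print(gbm.predict(X[:5]))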



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



AdaBoost Basis

a(x) = Σ_{t=1}^{T} α_t h(x, θ_t).
It is classification, therefore y, h(x, θ) ∈ {−1, +1}.
The loss function is L(M) = exp(−M), where M_i = y_i a(x_i) is the margin of object x_i.
The term “weights” is historically earlier than “gradient”.

For a weight vector w:

• P(h, w) = Σ_{i=1}^{ℓ} w_i [ h(x_i) = y_i ] is the (weighted) number of correctly classified objects (TP + TN);
• N(h, w) = Σ_{i=1}^{ℓ} w_i [ h(x_i) ≠ y_i ] is the (weighted) number of incorrectly classified objects (FP + FN).
Object weights

For L(M) = exp(−M):

g_i = ∂ exp(−y_i F_{t−1}(x_i)) / ∂F_{t−1}(x_i) = −y_i exp(−y_i F_{t−1}(x_i)) = −y_i w_i,

where w_i = exp(−y_i F_{t−1}(x_i)) is the weight of object x_i.

Then the learning step θ_t = LEARN( {x_i}_{i=1}^{ℓ}, {−g_i}_{i=1}^{ℓ} ) becomes

h(·, θ_t) = argmin_{θ ∈ Θ} Σ_{i=1}^{ℓ} L( h(x_i, θ), y_i ) w_i =
  = argmin_{θ ∈ Θ} Σ_{i=1}^{ℓ} w_i [ y_i h(x_i, θ) < 0 ],

i.e. the base classifier minimizes the weighted classification error.



AdaBoost
Input: X^ℓ, T

1. for i = 1 to ℓ do
2.   w_i = 1/ℓ
3. for t = 1 to T do
4.   θ_t = argmin_θ N( h(·, θ), w )
5.   ε_t = Σ_{i=1}^{ℓ} w_i [ y_i h(x_i, θ_t) < 0 ]
6.   α_t = ½ ln( (1 − ε_t) / ε_t )
7.   for i = 1 to ℓ do
8.     w_i = w_i exp( −α_t y_i h(x_i, θ_t) )
9.   NORMALIZE( {w_i}_{i=1}^{ℓ} )
10. return a(x) = Σ_{t=1}^{T} α_t h(x, θ_t)
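A minimal from-scratch sketch of this pseudocode, assuming labels y_i ∈ {−1, +1} and decision stumps as the base classifiers h(·, θ); all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):                       # y must be in {-1, +1}
    ell = len(y)
    w = np.full(ell, 1.0 / ell)                     # steps 1-2: uniform weights
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)     # step 4: weighted fit of a stump
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                    # step 5: weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)       # step 6
        w = w * np.exp(-alpha * y * pred)           # steps 7-8: reweight the objects
        w = w / w.sum()                             # step 9: NORMALIZE
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    F = sum(alpha * h.predict(X) for h, alpha in zip(stumps, alphas))
    return np.sign(F)                               # decision rule for binary classification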
Main boosting theorem

Theorem (Freund, Schapire, 1995)


Let for every normalized weight vector w there exist an algorithm h = h(·, θ) that classifies the sample better than at random:
N(h, w) < 1/2.
Then the minimum of the bound Q̃ is reached with

θ_t = argmin_θ N( h(·, θ), w ),

α_t = ½ ln( (1 − N(h_t, w)) / N(h_t, w) ).



Classification refusals
Now let P(h, w) + N(h, w) ≠ 1: the algorithm may refuse to classify some objects.

Theorem (Freund, Schapire, 1996)


Let for every normalized weight vector w there exist an algorithm h = h(·, θ) that classifies the sample at least slightly better than at random:
N(h, w) < P(h, w).
Then the minimum of the bound Q̃ is reached with

θ_t = argmin_θ ( √N(h(·, θ), w) − √P(h(·, θ), w) ),

α_t = ½ ln( P(h_t, w) / N(h_t, w) ).



Boosting fundamentals

ν_θ(a, X^ℓ) = (1/ℓ) Σ_{i=1}^{ℓ} [ y_i a(x_i) ≤ θ ] — the share of training objects with margin at most θ.

Theorem (Freund, Schapire, Bartlett, 1998)


If |H| < ∞, then ∀θ > 0, ∀δ ∈ (0, 1), with probability at least 1 − δ

Pr[ y · a_AdaBoost(x) < 0 ] ≤ ν_θ(a, X^ℓ) + O( √( (ln|H| · ln ℓ) / (ℓ θ²) + (1/ℓ) ln(1/δ) ) ).

The bound does not depend on the number T of base algorithms.


Boosting discussion

Advantages:
• hard to overfit
• can be used with different loss functions
Disadvantages:
• handles noisy data poorly
• gives little benefit with strong (already powerful) base algorithms
• the resulting composition is hard to interpret

