
Lecture 8

Ensembles

Machine Learning
Andrey Filchenkov

17.06.2016
Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical properties

• The presentation is prepared using materials from K. V. Vorontsov’s course “Machine Learning”.
Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Weak and strong learnability

Weak learnability means that one can find in polynomial time an algorithm whose accuracy is better than 0.5, i.e. better than random guessing.
Strong learnability means that one can find in polynomial time an algorithm whose accuracy can be made arbitrarily high.

What is true: weak or strong learnability?



Weak and strong learnability

Weak learnability means that one can find in polynomial time an algorithm whose accuracy is better than 0.5, i.e. better than random guessing.
Strong learnability means that one can find in polynomial time an algorithm whose accuracy can be made arbitrarily high.

Theorem (Schapire, 1990)


Strong learnability is equivalent to weak learnability: any weak model can be strengthened by a composition of algorithms.



Simple example

We have n algorithms whose probabilities of a correct classification answer are p_1, p_2, …, p_n ≈ p, and these probabilities are independent.
The resulting algorithm chooses the class label preferred by the majority of these algorithms.
Then the probability of a correct classification answer is

P(correct) = Σ_{k = ⌊n/2⌋ + 1}^{n} C(n, k) p^k (1 − p)^{n − k}.
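A quick numerical check of this formula (a sketch with illustrative values n = 11 and p = 0.7):

from math import comb

n, p = 11, 0.7
# the majority vote is correct when more than n/2 algorithms are correct
P_correct = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                for k in range(n // 2 + 1, n + 1))
print(P_correct)  # ~0.92, noticeably better than a single algorithm with p = 0.7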



Problem formulation

Object set X, answer set Y.
Training sample X^ℓ = {x_i}_{i=1}^{ℓ} with known labels {y_i}_{i=1}^{ℓ}.
Family of basic algorithms
H = { h(x, θ) : X → R | θ ∈ Θ },
where θ is a parameter vector describing an algorithm and R is the space of estimates (usually ℝ or ℝ^{|Y|}).
Problem: find (synthesize) an algorithm that forecasts the labels of objects from X as precisely as possible.



Composition of algorithms

A composition of basic algorithms
h_1, …, h_T : X → R is
a(x) = C( F( h_1(x), …, h_T(x) ) ),
where C : R → Y is the decision rule and F : R^T → R is the adjusting function.

R is usually wider than Y.
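A minimal sketch of the idea in Python, assuming binary labels in {−1, +1}: the adjusting function F is a weighted sum of the base algorithms' outputs and the decision rule C is sign(·); all names are illustrative.

import numpy as np

def compose(base_algorithms, alphas):
    """Composition a(x) = C(F(h_1(x), ..., h_T(x))) with F = weighted sum, C = sign."""
    def a(X):
        F = sum(alpha * h.predict(X) for h, alpha in zip(base_algorithms, alphas))
        return np.sign(F)
    return a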



Voting

The simplest example of an adjusting function is voting: the different algorithms vote for a class label.
Two types of voting:
• majority (hard) voting — count the votes;
• soft voting — sum the predicted class probabilities.
We can also assign weights to the voters (this works better with soft voting).



Implementation (in Python)

Python: VotingClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# 'hard' = majority voting over the predicted class labels
eclf1 = VotingClassifier(estimators=
    [('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='hard')
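The same ensemble with probability-based (soft) voting and voter weights, as mentioned on the previous slide (the weight values are illustrative):

eclf2 = VotingClassifier(estimators=
    [('lr', clf1), ('rf', clf2), ('gnb', clf3)],
    voting='soft', weights=[2, 1, 1])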



Implementation (in WEKA)

WEKA: Vote
Classifier[] classifiers = {
    new J48(),
    new NaiveBayes(),
    new RandomForest()
};

Vote voter = new Vote();
voter.setClassifiers(classifiers);



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Key idea

Idea: we can build diverse algorithms by training the same model under different conditions.
More precisely: we can train it on different parts of the dataset.



Synthesis of random algorithms

• Subsampling: learn each algorithm on a subsample.
• Bagging: learn each algorithm on a bootstrap subsample of the same length (random sampling with replacement).
• Random subspace method: learn the algorithms on random subspaces of features.
• Filtering (next slide).



Filtering

Suppose we have a sample of unlimited size.
Learn the first algorithm on X₁, which consists of the first ℓ objects.
Then toss a coin ℓ times:
• heads: add to X₂ the first (not yet used) object that the first algorithm misclassifies;
• tails: add to X₂ the first (not yet used) object that it classifies correctly.
Learn the second algorithm on X₂.
Add to X₃ the first ℓ objects on which the first two classifiers give different answers.
Learn the third algorithm on X₃.
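A rough sketch of this filtering procedure, assuming the sample is given as a practically unlimited Python generator of (x, y) pairs and decision stumps are used as base algorithms; all names and parameters are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def next_such_that(stream, keep):
    # scan the stream until an object satisfying keep(x, y) appears
    while True:
        x, y = next(stream)
        if keep(x, y):
            return x, y

def boosting_by_filtering(stream, n, seed=0):
    rng = np.random.default_rng(seed)
    # first algorithm: the first n objects of the stream
    X1, y1 = zip(*(next(stream) for _ in range(n)))
    h1 = DecisionTreeClassifier(max_depth=1).fit(np.array(X1), np.array(y1))
    # second algorithm: toss a coin n times; heads -> next object misclassified
    # by h1, tails -> next correctly classified object
    pairs = []
    for _ in range(n):
        heads = bool(rng.integers(2))
        pairs.append(next_such_that(
            stream, lambda x, y: (h1.predict([x])[0] != y) == heads))
    X2, y2 = zip(*pairs)
    h2 = DecisionTreeClassifier(max_depth=1).fit(np.array(X2), np.array(y2))
    # third algorithm: the first n objects on which h1 and h2 disagree
    X3, y3 = zip(*(next_such_that(
        stream, lambda x, y: h1.predict([x])[0] != h2.predict([x])[0])
        for _ in range(n)))
    h3 = DecisionTreeClassifier(max_depth=1).fit(np.array(X3), np.array(y3))
    return h1, h2, h3  # the final answer is the majority vote of the three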



Implementation

Python: BaggingClassifier
BaggingRegressor

WEKA: Bagging
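A minimal usage sketch for the scikit-learn classes named above; the base estimator, dataset, and parameter values are illustrative (max_features adds a random-subspace flavour on top of bagging):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
bag = BaggingClassifier(DecisionTreeClassifier(),
                        n_estimators=50,
                        max_samples=1.0,   # bootstrap subsamples of full length
                        bootstrap=True,    # sampling with replacement
                        max_features=0.7,  # random subset of features per estimator
                        random_state=1).fit(X, y)
print(bag.score(X, y))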



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Random Forest

For a sample X^ℓ = {(x_i, y_i)}_{i=1}^{ℓ} with n features:

1. Choose a bootstrap subsample of size ℓ.
2. Build a decision tree on that subsample; at each vertex consider only n′ (usually n′ = √n) randomly chosen features.
3. No pruning is applied.
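A minimal from-scratch sketch of these steps, repeated for many trees whose majority vote gives the forest's answer; it assumes scikit-learn decision trees (max_features='sqrt' plays the role of n′ = √n, and the trees are grown without pruning) and integer class labels. All names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, n_trees=100, seed=1):
    rng = np.random.default_rng(seed)
    ell = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, ell, size=ell)                 # 1. bootstrap subsample of size ell
        tree = DecisionTreeClassifier(max_features='sqrt')   # 2. sqrt(n) random features per split
        tree.fit(X[idx], y[idx])                             # 3. grown fully, no pruning
        forest.append(tree)
    return forest

def random_forest_predict(forest, X):
    votes = np.array([tree.predict(X) for tree in forest]).astype(int)
    # majority vote of the trees for every object
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)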



Implementation

Python: RandomForestClassifier
RandomForestRegressor
ExtraTreesClassifier
ExtraTreesRegressor
WEKA: RandomForest



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



Boosting problem formulation

Let’s synthesize an algorithm of the form

a(x) = Σ_{t=1}^{T} α_t h(x, θ_t),

where α_t ∈ ℝ are the coefficients minimizing the empirical risk

Q = Σ_{i=1}^{ℓ} L( a(x_i), y_i ) → min

for a loss function L( a(x_i), y_i ).



Gradient descent


It is hard to find the exact solution {α_t, θ_t}_{t=1}^{T} at once.
Instead, we build the function step by step:

F_t = F_{t−1} + α_t h(x, θ_t).

To choose the increment, we estimate the gradient of the error functional
Q(F) = Σ_{i=1}^{ℓ} L( F(x_i), y_i ).



Gradient

Gradient:

g_i = ∂Q / ∂F_{t−1}(x_i) = ∂L( F_{t−1}(x_i), y_i ) / ∂F_{t−1}(x_i),   i = 1, …, ℓ.

Thus, we make the gradient step

F_t = F_{t−1} − η g.
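For example, for the squared loss L(F, y) = ½ (F − y)² (an illustrative choice of L), the components are g_i = F_{t−1}(x_i) − y_i, so the negative gradient −g_i = y_i − F_{t−1}(x_i) is exactly the residual that the next base algorithm has to fit.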



Parameter selection

F_t = F_{t−1} − η g,

η = argmin_η Σ_{i=1}^{ℓ} L( F_{t−1}(x_i) − η g_i, y_i ).

The vector g itself is not a basic algorithm, so we approximate it by one:

θ_t = argmin_{θ ∈ Θ} Σ_{i=1}^{ℓ} ( h(x_i, θ) + g_i )² ≡
  ≡ LEARN( {x_i}_{i=1}^{ℓ}, {−g_i}_{i=1}^{ℓ} ).

Then find α_t with a linear search:

α_t = argmin_α Σ_{i=1}^{ℓ} L( F_{t−1}(x_i) + α h(x_i, θ_t), y_i ).



Generalized algorithm

Input: training sample X^ℓ, number of steps T;
F_0 = LEARN( {x_i}_{i=1}^{ℓ}, {y_i}_{i=1}^{ℓ} )

1. for t = 1 to T do
2.   g_i = ∂L( F_{t−1}(x_i), y_i ) / ∂F_{t−1}(x_i),   i = 1, …, ℓ
3.   θ_t = LEARN( {x_i}_{i=1}^{ℓ}, {−g_i}_{i=1}^{ℓ} )
4.   α_t = argmin_α Σ_{i=1}^{ℓ} L( F_{t−1}(x_i) + α h(x_i, θ_t), y_i )
5.   F_t = F_{t−1} + α_t h(x, θ_t)
6. return F_T
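A minimal from-scratch sketch of this loop for the squared loss, with depth-limited regression trees as the base algorithms h(·, θ); for brevity, the line search of step 4 is replaced by a fixed learning rate and F_0 is simply the mean of y. All names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, T=100, lr=0.1, max_depth=3):
    F = np.full(len(y), y.mean())   # F_0: a constant initial model
    trees = []
    for t in range(T):
        g = F - y                   # step 2: gradient of 0.5*(F - y)^2 w.r.t. F
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, -g)  # step 3: fit h to -g
        F = F + lr * h.predict(X)   # step 5 with alpha_t fixed to lr
        trees.append(h)
    return y.mean(), lr, trees

def gradient_boosting_predict(model, X):
    F0, lr, trees = model
    return F0 + lr * sum(h.predict(X) for h in trees)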



Decision rule
Decision rule C : R → Y:
• for regression, Y = ℝ:
  C(F) = F, possibly with a monotone transformation;
• for classification into M classes, Y = {1, …, M}:
  C(F_1, …, F_M) = argmax_{m ∈ Y} F_m;
• for binary classification, Y = {−1, +1}:
  C(F) = sign(F).
Usually the misclassification loss L(a(x), y) = [ a(x) ≠ y ] is used.



Smoothness of Q

Typical Q is piecewise constant:

Q(a) = Σ_{i=1}^{ℓ} L( a(x_i), y_i ) = Σ_{i=1}^{ℓ} [ y_i Σ_{t=1}^{T} α_t h(x_i, θ_t) < 0 ].

Smooth approximations L̃(M) of the margin loss [M ≤ 0]:

• L̃(M) = exp(−M) — exponential (in AdaBoost)
• L̃(M) = log₂(1 + exp(−M)) — logarithmic (in LogitBoost)
• L̃(M) = (1 − M)² — quadratic (in GentleBoost)
• a Gaussian-shaped loss (in BrownBoost)

Replacing [M < 0] with L̃(M) gives a smooth upper bound Q̃(a) ≥ Q(a).
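Illustrative Python definitions of three of these losses; each upper-bounds the threshold loss [M ≤ 0] for every margin M.

import numpy as np

exp_loss   = lambda M: np.exp(-M)                # exponential (AdaBoost)
logit_loss = lambda M: np.log2(1 + np.exp(-M))   # logarithmic (LogitBoost)
quad_loss  = lambda M: (1 - M) ** 2              # quadratic (GentleBoost)
zero_one   = lambda M: np.where(M <= 0, 1.0, 0.0)  # the threshold loss itself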



Well-known algorithms

• AdaBoost
• AnyBoost
• LogitBoost
• BrownBoost
• ComBoost
• Stochastic gradient boosting



Implementations

Python: GradientBoostingClassifier
GradientBoostingRegressor
AdaBoostClassifier
AdaBoostRegressor
WEKA: AdaBoostM1
LogitBoost
AdditiveRegression
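A minimal usage sketch for the scikit-learn gradient boosting implementation; the dataset and parameter values are illustrative (n_estimators is the number of boosting steps T, learning_rate scales every α_t).

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=1)
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 max_depth=3, random_state=1).fit(X, y)
print(gbm.predict(X[:5]))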



Lecture plan
• Composition of algorithms
• Algorithm diversity
• Random forest
• Gradient Boosting
• AdaBoost and its theoretical
properties



AdaBoost Basis

a(x) = Σ_{t=1}^{T} α_t h(x, θ_t).
It is classification, therefore y, h(x, θ) ∈ {−1, +1}.
The loss function is L(M) = exp(−M), where M_i = y_i a(x_i) is the margin of object x_i.
The term “weights” is historically earlier than “gradient”.

For a weight vector w:

• P(h, w) = Σ_{i=1}^{ℓ} w_i [ h(x_i) = y_i ] is the (weighted) number of correctly classified objects (TP + TN);
• N(h, w) = Σ_{i=1}^{ℓ} w_i [ h(x_i) ≠ y_i ] is the (weighted) number of incorrectly classified objects (FP + FN).
Object weights

For L(M) = exp(−M):

g_i = ∂ exp(−y_i F_{t−1}(x_i)) / ∂F_{t−1}(x_i) = −y_i exp(−y_i F_{t−1}(x_i)) = −y_i w_i,

where w_i = exp(−y_i F_{t−1}(x_i)) is the weight of object x_i.

Then the learning step θ_t = LEARN( {x_i}_{i=1}^{ℓ}, {−g_i}_{i=1}^{ℓ} ) becomes

h(·, θ_t) = argmin_{θ ∈ Θ} Σ_{i=1}^{ℓ} L( h(x_i, θ), y_i ) w_i =
  = argmin_{θ ∈ Θ} Σ_{i=1}^{ℓ} w_i [ y_i h(x_i, θ) < 0 ],

i.e. the base classifier minimizes the weighted classification error.



AdaBoost
Input: X^ℓ, T

1. for i = 1 to ℓ do
2.   w_i = 1/ℓ
3. for t = 1 to T do
4.   θ_t = argmin_θ N( h(·, θ), w )
5.   ε_t = Σ_{i=1}^{ℓ} w_i [ y_i h(x_i, θ_t) < 0 ]
6.   α_t = ½ ln( (1 − ε_t) / ε_t )
7.   for i = 1 to ℓ do
8.     w_i = w_i exp( −α_t y_i h(x_i, θ_t) )
9.   NORMALIZE( {w_i}_{i=1}^{ℓ} )
10. return a(x) = Σ_{t=1}^{T} α_t h(x, θ_t)
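A minimal from-scratch sketch of this pseudocode, assuming labels y_i ∈ {−1, +1} and decision stumps as the base classifiers h(·, θ); all names are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):                       # y must be in {-1, +1}
    ell = len(y)
    w = np.full(ell, 1.0 / ell)                     # steps 1-2: uniform weights
    stumps, alphas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1)     # step 4: weighted fit of a stump
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                    # step 5: weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)       # step 6
        w = w * np.exp(-alpha * y * pred)           # steps 7-8: reweight the objects
        w = w / w.sum()                             # step 9: NORMALIZE
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    F = sum(alpha * h.predict(X) for h, alpha in zip(stumps, alphas))
    return np.sign(F)                               # decision rule for binary classification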
Main boosting theorem

Theorem (Freund, Schapire, 1995)


Let for every normalized weight vector w there exist an algorithm h = h(·, θ) that classifies the sample better than at random:
N(h, w) < 1/2.
Then the minimum of the bound Q̃ is reached with

θ_t = argmin_θ N( h(·, θ), w ),

α_t = ½ ln( (1 − N(h_t, w)) / N(h_t, w) ).



Classification refusals
Now let P(h, w) + N(h, w) ≠ 1: the algorithm may refuse to classify some objects.

Theorem (Freund, Schapire, 1996)


Let for every normalized weight vector w there exist an algorithm h = h(·, θ) that classifies the sample at least slightly better than at random:
N(h, w) < P(h, w).
Then the minimum of the bound Q̃ is reached with

θ_t = argmin_θ ( √N(h(·, θ), w) − √P(h(·, θ), w) ),

α_t = ½ ln( P(h_t, w) / N(h_t, w) ).



Boosting fundamentals

ν_θ(a, X^ℓ) = (1/ℓ) Σ_{i=1}^{ℓ} [ y_i a(x_i) ≤ θ ] — the share of training objects with margin at most θ.

Theorem (Freund, Schapire, Bartlett, 1998)


If |H| < ∞, then ∀θ > 0, ∀δ ∈ (0, 1), with probability at least 1 − δ

Pr[ y · a_AdaBoost(x) < 0 ] ≤ ν_θ(a, X^ℓ) + O( √( (ln|H| · ln ℓ) / (ℓ θ²) + (1/ℓ) ln(1/δ) ) ).

The bound does not depend on the number T of base algorithms.


Boosting discussion

Advantages:
• hard to overfit
• can be used with different loss functions
Disadvantages:
• handles noisy data poorly
• gives little benefit with strong (already powerful) base algorithms
• the resulting composition is hard to interpret

