03 Hypothesis Spaces
Bernd Bischl, Julia Moosbauer, Andreas Groll (Advanced Statistical Learning, Winter term 2020/21)
CAPACITY
HYPOTHESIS SPACES
Recall that the representation of a learner is its space of allowed models, also called the hypothesis space. The hypothesis space H is a set of functions of a certain functional form:
Examples of hypothesis spaces:
Linear regression: f(x | θ0, θ) = x^⊤θ + θ0 with θ ∈ R^p, θ0 ∈ R
Decision trees: f(x) = \sum_{i=1}^{m} c_i \mathbb{1}(x \in Q_i) (recursively divides the feature space into axis-aligned rectangles Q_i)
Ensemble methods: f(x | β^{[j]}) = \sum_{j=1}^{m} β^{[j]} b^{[j]}(x) (the predictions of several models b^{[j]} are aggregated in some manner)
Neural networks:
f(x) = τ ◦ φ ◦ σ^{(h)} ◦ φ^{(h)} ◦ σ^{(h−1)} ◦ φ^{(h−1)} ◦ … ◦ σ^{(1)} ◦ φ^{(1)}(x)
Each neuron in a given layer performs a two-step computation: a weighted sum of its inputs from the previous layer,
φ^{(j)}(z) = w_j^⊤ z + b_j,  with w_j ∈ R^p, b_j ∈ R,
followed by a non-linear transformation σ of the sum,
σ^{(j)}(φ^{(j)}(z)) = σ^{(j)}(w_j^⊤ z + b_j).
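To make the two-step computation concrete, here is a minimal NumPy sketch of one hidden layer followed by a linear output transformation τ; the layer sizes, the tanh activation and the random weights are illustrative assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                                  # input in R^3

# layer 1: affine map phi^{(1)} followed by non-linearity sigma^{(1)}
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # 4 neurons, p = 3 inputs
phi1 = W1 @ x + b1                                      # weighted sums w_j^T z + b_j
z1 = np.tanh(phi1)                                      # sigma^{(1)} applied elementwise

# output layer: affine map phi, then output transformation tau (here the identity)
w_out, b_out = rng.normal(size=4), 0.1
f_x = w_out @ z1 + b_out
print(f_x)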
OVERFITTING
OVERFITTING: POLYNOMIAL REGRESSION
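The original slides illustrate this with figures. As a stand-in, the following small numerical sketch (NumPy assumed) shows how increasing the polynomial degree drives the training error down while the test error eventually grows; the true function, noise level and degrees are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(x)
x_tr = rng.uniform(0, 6, 15); y_tr = f_true(x_tr) + rng.normal(0, 0.2, 15)
x_te = rng.uniform(0, 6, 500); y_te = f_true(x_te) + rng.normal(0, 0.2, 500)

for degree in (1, 3, 10):
    coefs = np.polyfit(x_tr, y_tr, degree)              # fit a degree-d polynomial
    mse = lambda x, y: np.mean((y - np.polyval(coefs, x)) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse(x_tr, y_tr):.3f}, test MSE = {mse(x_te, y_te):.3f}")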
OVERFITTING: DECISION TREES
OVERFITTING: K-NEAREST NEIGHBOURS
THE COMPLEXITY OF HYPOTHESIS SPACES
A general measure of the complexity of a function space is the
Vapnik-Chervonenkis (VC) dimension.
Theorem: The VC dimension of the class of homogeneous halfspaces, H = {h : R^p → {−1, 1} | h(x) = sign(x^⊤θ)}, in R^p is p.

p as an upper bound:
Let x^{(1)}, x^{(2)}, …, x^{(p+1)} be a set of p + 1 vectors in R^p.
Because any set of p + 1 vectors in R^p is linearly dependent, there must exist real numbers a_1, a_2, …, a_{p+1} ∈ R, not all of them zero, such that \sum_{i=1}^{p+1} a_i x^{(i)} = 0.
Let I = {i : a_i > 0} and J = {j : a_j < 0}. Either I or J is nonempty. If we assume both I and J are nonempty, then
\sum_{i \in I} a_i x^{(i)} = \sum_{j \in J} |a_j| x^{(j)}.
Now suppose this set were shattered by homogeneous halfspaces. Then, in particular, some θ would label every x^{(i)}, i ∈ I, with +1 and every x^{(j)}, j ∈ J, with −1, i.e. θ^⊤x^{(i)} > 0 and θ^⊤x^{(j)} < 0. This implies

0 < \sum_{i \in I} a_i \, θ^⊤ x^{(i)} = \Big(\sum_{i \in I} a_i x^{(i)}\Big)^{\top} θ = \Big(\sum_{j \in J} |a_j| x^{(j)}\Big)^{\top} θ = \sum_{j \in J} |a_j| \, θ^⊤ x^{(j)} < 0,

which is a contradiction.
On the other hand, if we assume J (respectively, I) is empty, then the right-most (respectively, left-most) inequality becomes an equality, which is still a contradiction. Hence no set of p + 1 points in R^p can be shattered, so the VC dimension is at most p.
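The argument can be checked numerically for p = 2. The sketch below (NumPy assumed) computes dependence coefficients a for three arbitrary points and verifies by brute force over many directions θ that the labeling prescribed by the signs of the a_i is never realized by a homogeneous halfspace; the specific points and the grid size are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))                      # any 3 points in R^2 (p = 2)

# non-trivial coefficients with sum_i a_i x^(i) = 0 (null space of X^T)
a = np.linalg.svd(X.T)[2][-1]
target = tuple(np.sign(a).astype(int))           # the labeling from the proof: +1 on I, -1 on J

# brute force over many directions theta in R^2
t = np.linspace(0, 2 * np.pi, 10_000, endpoint=False)
thetas = np.column_stack([np.cos(t), np.sin(t)])
realized = {tuple(np.sign(X @ th).astype(int)) for th in thetas}

# False: no homogeneous halfspace produces this labeling (nor its negation, by symmetry)
print(target in realized)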
Theorem: The VC dimension of the class of non-homogeneous halfspaces, H = {h : R^p → {−1, 1} | h(x | θ) = sign(x^⊤θ + θ0)}, in R^p is p + 1.

p + 1 as an upper bound:
Let's assume that p + 2 vectors x^{(1)}, …, x^{(p+2)} are shattered by the class of non-homogeneous halfspaces.
If we denote θ̃ = (θ0, θ1, θ2, …, θp)^⊤ ∈ R^{p+1}, where θ0 is the bias/intercept, and x̃ = (1, x1, x2, …, xp)^⊤ ∈ R^{p+1}, then h(x | θ) = sign(x^⊤θ + θ0) = sign(x̃^⊤θ̃). Any affine function in R^p thus corresponds to a homogeneous halfspace in R^{p+1}, so x̃^{(1)}, …, x̃^{(p+2)} would be p + 2 points in R^{p+1} shattered by homogeneous halfspaces.
By the previous theorem, the class of homogeneous halfspaces in R^{p+1} cannot shatter any set of p + 2 points. This is a contradiction.
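The reduction used in the proof, prepending a constant 1 to the feature vector so that the intercept is absorbed into the weight vector, is easy to verify numerically. A minimal sketch (NumPy assumed; the points and parameters are illustrative):

import numpy as np

def to_homogeneous(X):
    """Prepend a constant 1 to every row: x -> x_tilde = (1, x1, ..., xp)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

X = np.array([[2.0, -1.0], [0.5, 3.0]])        # two points in R^2
theta, theta0 = np.array([1.0, -2.0]), 0.5     # non-homogeneous halfspace
theta_tilde = np.concatenate([[theta0], theta])

# sign(x^T theta + theta0) equals sign(x_tilde^T theta_tilde)
print(np.sign(X @ theta + theta0))
print(np.sign(to_homogeneous(X) @ theta_tilde))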
Example: For k-nearest neighbours with k = 1, the VC dimension is infinite.

Example: The class of axis-aligned rectangles in R^2, H = {h_{(a_1,a_2,b_1,b_2)} : a_1 ≤ a_2, b_1 ≤ b_2}, has VC dimension 4, where

h_{(a_1,a_2,b_1,b_2)}(x) = \begin{cases} 1 & \text{if } a_1 \le x_1 \le a_2 \text{ and } b_1 \le x_2 \le b_2 \\ 0 & \text{otherwise} \end{cases}
4 as a lower bound: There exists a set of 4 points that can be shattered, e.g. the four points (1, 0), (−1, 0), (0, 1), (0, −1): every subset of them can be cut out by a suitable axis-aligned rectangle.
4 as an upper bound: For any set of 5 points x^{(1)}, x^{(2)}, x^{(3)}, x^{(4)}, x^{(5)} ∈ R^2:
Assign the leftmost point (lowest x_1), the rightmost point (highest x_1), the highest point (highest x_2) and the lowest point (lowest x_2) to class 1. The point not chosen, x^{(5)}, is assigned to class 0.
Any rectangle realizing this labeling must contain x^{(1)}, x^{(2)}, x^{(3)} and x^{(4)}.
But then x^{(5)} is classified as 1 as well, since its coordinates lie within the intervals defined by the other 4 points.
Hence no set of 5 points can be shattered, and the VC dimension of axis-aligned rectangles is 4.
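This argument is easy to check in code: take the bounding box of the four extreme points of any 5-point set and observe that it necessarily covers the remaining point(s). A minimal sketch (NumPy assumed; the random points are only for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                      # any 5 points in R^2

# indices of the leftmost, rightmost, lowest and highest points (class 1)
extremes = {X[:, 0].argmin(), X[:, 0].argmax(), X[:, 1].argmin(), X[:, 1].argmax()}
others = [i for i in range(5) if i not in extremes]

# the smallest axis-aligned rectangle containing the class-1 points ...
lo, hi = X[list(extremes)].min(axis=0), X[list(extremes)].max(axis=0)

# ... necessarily contains every remaining point as well, so the labeling
# "extremes = 1, rest = 0" cannot be realized
for i in others:
    print(np.all((lo <= X[i]) & (X[i] <= hi)))   # always True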
Recall that the training error is an optimistic estimate of the generalization (or test) error. For a classification model with VC dimension d, the VC dimension yields a probabilistic upper bound on the test error:

P\left(\mathrm{err}_{\mathrm{test}} \le \mathrm{err}_{\mathrm{train}} + \underbrace{\sqrt{\frac{1}{|\mathcal{D}_{\mathrm{train}}|}\left(d\left(\log\frac{2|\mathcal{D}_{\mathrm{train}}|}{d} + 1\right) - \log\frac{\delta}{4}\right)}}_{=:\ \varepsilon}\right) = 1 - \delta

for δ ∈ [0, 1], if the training data set is large enough (d < |D_train| required).
The probability of the true error err_test being greater than err_train + ε is bounded by δ.
This probability is over all possible datasets of size |D_train| drawn from P_xy (which can be arbitrary).
If d is finite, both δ and ε can be made arbitrarily small by increasing the sample size.
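To get a feeling for the bound, the small sketch below evaluates ε for a hypothetical model with d = 11 (e.g. a linear classifier in R^10) and δ = 0.05 at several training set sizes; both values are illustrative choices, not taken from the slides.

import math

def vc_bound_epsilon(d, n_train, delta):
    """Width of the VC confidence term: err_test <= err_train + eps with prob. 1 - delta."""
    return math.sqrt((d * (math.log(2 * n_train / d) + 1) - math.log(delta / 4)) / n_train)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(vc_bound_epsilon(11, n, 0.05), 3))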
As a corollary, if the training error is low, the test error is also (probably) low.
In other words, an algorithm which minimizes the training error reliably picks a "good" hypothesis from the hypothesis space.
Such an algorithm is known as a Probably Approximately Correct (PAC) algorithm.
In general, the VC dimension of a hypothesis space increases as the number
of learnable parameters increases. However, it is not always possible to judge
the capacity of a model based solely on the number of parameters.
(Figure: a one-dimensional toy data set with binary class labels 0 and 1 plotted against the feature x.)
A single-parameter sine classifier h(x) = I[sin(θx) > 0], for x ∈ R, however, has infinite VC dimension: for any number of points m, one can find a set of m points that is shattered if the frequency θ is chosen appropriately (large enough).
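A small numerical check of this statement, assuming NumPy: for the standard construction with points x_i = 2^{-i}, every one of the 2^m labelings can be realized by an explicitly chosen frequency θ. The closed-form choice of θ below is one common construction, stated here as an assumption rather than taken from the slides.

import numpy as np
from itertools import product

m = 5
x = 2.0 ** -np.arange(1, m + 1)                  # points x_i = 2^{-i}

def sine_classifier(theta, x):
    return (np.sin(theta * x) > 0).astype(int)

all_ok = True
for y in product([0, 1], repeat=m):              # every labeling y in {0,1}^m
    y = np.array(y)
    theta = np.pi * (1 + np.sum((1 - y) * 2.0 ** np.arange(1, m + 1)))
    all_ok &= np.array_equal(sine_classifier(theta, x), y)

print(all_ok)                                    # True: the m points are shattered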
The bounds derived by the VC analysis are extremely loose and
pessimistic.
Because they have to hold for an arbitrary Pxy , tightening the
bounds to a desired level requires the training sets to be extremely
large.
In practice, complex models (such as neural networks) often
perform much better than these bounds suggest.
Other measures of model capacity such as Rademacher
complexity may be easier to compute or provide tighter bounds.
In addition to the hypothesis space, the effective capacity of a
learning algorithm also depends in a complicated way on the
optimizer.
A better estimate of the generalization error can be obtained
simply by evaluating the learned hypothesis on the test set.
BIAS-VARIANCE DECOMPOSITION
Let's take a closer look at the generalization error of a learning algorithm I_{L,O}:

GE_n(I_{L,O}) = E_{D_n \sim P_{xy},\, (x,y) \sim P_{xy}}\big[ L(y, \hat f_{D_n}(x)) \big] = E_{D_n, xy}\big[ L(y, \hat f_{D_n}(x)) \big].

We assume that the data are generated as

y = f_{true}(x) + ε,

with normally distributed error ε ∼ N(0, σ²) independent of x.
By plugging in the L2-loss L(y, f(x)) = (y − f(x))² we get

GE_n(I_{L,O}) = E_{D_n, xy}\big[ L(y, \hat f_{D_n}(x)) \big] = E_{D_n, xy}\big[ (y - \hat f_{D_n}(x))^2 \big] = E_{xy}\Big[ \underbrace{E_{D_n}\big[ (y - \hat f_{D_n}(x))^2 \mid x, y \big]}_{(*)} \Big]

Let us consider the error (∗) conditioned on one fixed test observation (x, y) first. (We omit the conditioning on x, y for better readability for now.)

(∗) = E_{D_n}\big[ (y - \hat f_{D_n}(x))^2 \big] = \underbrace{E_{D_n}[y^2]}_{=\, y^2} + \underbrace{E_{D_n}\big[\hat f_{D_n}(x)^2\big]}_{(1)} - 2\, \underbrace{E_{D_n}\big[ y\, \hat f_{D_n}(x) \big]}_{(2)}
By using that Var(y) = E(y²) − E²(y), we see that

(1) = E_{D_n}\big[\hat f_{D_n}(x)^2\big] = Var_{D_n}\big(\hat f_{D_n}(x)\big) + E^2_{D_n}\big(\hat f_{D_n}(x)\big)

(2) = E_{D_n}\big[ (f_{true}(x) + ε)\, \hat f_{D_n}(x) \big] = E_{D_n}\big[ f_{true}(x)\, \hat f_{D_n}(x) + ε\, \hat f_{D_n}(x) \big]
    = E_{D_n}\big[ f_{true}(x)\, \hat f_{D_n}(x) \big] + \underbrace{E_{D_n}\big[ ε\, \hat f_{D_n}(x) \big]}_{=\, E(ε)\, E(\hat f) \,=\, 0}
    = E_{D_n}\big[ f_{true}(x)\, \hat f_{D_n}(x) \big] = E_{D_n}\big(f_{true}(x)\big) · E_{D_n}\big(\hat f_{D_n}(x)\big)

(∗) = y^2 + Var_{D_n}\big(\hat f_{D_n}(x)\big) + E^2_{D_n}\big(\hat f_{D_n}(x)\big) - 2 · E_{D_n}\big(f_{true}(x)\big) · E_{D_n}\big(\hat f_{D_n}(x)\big)
Let's come back to the generalization error by taking the expectation over all possible random observations (x, y) ∼ P_xy. Adding and subtracting f_true(x)² to complete the square in the bias term yields

GE_n(I_{L,O}) = \underbrace{E_{xy}\big[ y^2 - f_{true}(x)^2 \big]}_{=\, σ^2, \text{ variance of the data}} + \underbrace{E_{xy}\big[ Var_{D_n}\big(\hat f_{D_n}(x) \mid x, y\big) \big]}_{\text{variance of the inducer at } (x, y)} + \underbrace{E_{xy}\big[ E^2_{D_n}\big(f_{true}(x) - \hat f_{D_n}(x) \mid x, y\big) \big]}_{\text{squared bias of the inducer at } (x, y)}
BIAS-VARIANCE DECOMPOSITION
If one test observation is fixed, how does the prediction vary
w.r.t. the training data used?
We then take the expectation over all test points (x, y ) ∼ Pxy
Expresses also the learner’s tendency to learn random things
irrespective of the real signal (overfitting)
3 The third term E2Dn ftrue (x) − f̂Dn (x) | x, y is the squared bias at
a (single) random observation (x, y ).
Learner’s tendency to consistently misclassify certain
instances (underfitting)
Models with high capacity have low bias and models with low
capacity have high bias
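The decomposition can be estimated by simulation: train the same learner on many independent training sets and look at the spread and the average of its predictions at a fixed test point. A minimal sketch (NumPy assumed; the true function, noise level and polynomial degrees are illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
f_true = lambda x: np.sin(2 * x)             # illustrative true function
sigma, n, n_rep = 0.3, 50, 200               # noise sd, train size, number of training sets
x_test = 1.0                                 # one fixed test point

for degree in (1, 3, 9):                     # low / moderate / high capacity
    preds = np.empty(n_rep)
    for r in range(n_rep):
        x = rng.uniform(-2, 2, n)
        y = f_true(x) + rng.normal(0, sigma, n)
        coefs = np.polyfit(x, y, degree)     # fit hat f_{D_n}
        preds[r] = np.polyval(coefs, x_test)
    bias2 = (f_true(x_test) - preds.mean()) ** 2
    variance = preds.var()
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")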
Figure: Left: A model with high bias is unable to fit the curved relationship
present in the data. Right: A model with no bias and high variance can, in
principle, learn the true pattern in the data. However, in practice, the learner
outputs wildly different hypotheses for different training sets.
LEARNING CURVES
LEARNING CURVES
↓ Reduce underfitting by
decreasing bias. Make
model more flexible (or
choose another).
← Reduce overfitting by
decreasing variance. Make
model less flexible
(regularization), or add
more data.
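Learning curves like the ones discussed here can be computed by training on increasingly large subsets of the data and evaluating on held-out data. A minimal sketch, assuming scikit-learn is available (the synthetic dataset and the decision tree are illustrative choices):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# training and cross-validated test accuracy for increasing training set sizes
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(max_depth=5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="accuracy")

for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n = {n}: train error = {1 - tr:.3f}, test error = {1 - te:.3f}")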
ML AS AN ILL-POSED PROBLEM
NO FREE LUNCH
For a specific problem, that is, for a specific distribution over target
functions, it is possible for some algorithms to perform better than
some others.
When averaged over all possible problems, however, no algorithm
is better than any other (including random guessing). This
important result is known as the No Free Lunch theorem.
The essence of the NFL theorem is that an algorithm which performs well on one set of problems necessarily has to perform badly on another set of problems. There is no "free lunch".