
Advanced Statistical Learning

Chapter 3: Hypothesis Spaces and Capacity


Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
CAPACITY

credit: Ian Goodfellow

The performance of a learner depends on its ability to:
Minimize the training error
Generalize well to new data
Failure to obtain a sufficiently low training error is known as underfitting.
On the other hand, if there is a large difference between training and test error, this is known as overfitting.

CAPACITY

credit: Ian Goodfellow

The tendency of a model to over/underfit is a function of its capacity, determined by the type of hypotheses it can learn.
Loosely speaking, a model with low capacity can only learn a few simple hypotheses, whereas a model with a large capacity can learn many, possibly complex hypotheses.
As the figure shows, the test error is minimized when the model neither underfits nor overfits, that is, when it has the right capacity.

HYPOTHESIS SPACES
Recall, the representation of a learner is the space of allowed models (also called hypothesis space). The hypothesis space H is a space of functions that have a certain functional form:

H := {f : X → R^g | f has a specific form}.

Often f is parametrized by θ ∈ Θ. We write f(x) = f(x | θ).

Note: If we are explicitly talking about hard classifiers outputting a discrete class, we write h instead of f. If a model class can output discrete classes, scores, or probabilities (e.g. neural networks or ensemble methods), which then needs to be specified, we just write f, which subsumes all those cases.

HYPOTHESIS SPACES
Examples of hypothesis spaces:
Linear regression: f(x | θ0, θ) = x^T θ + θ0 with θ ∈ R^p, θ0 ∈ R
Separating hyperplanes: h(x | θ0, θ) = 1(x^T θ − θ0 > 0)
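To make the notation concrete, here is a minimal Python sketch of these two parametric families; the function names and the toy parameter values are illustrative and not part of the slides:

```python
import numpy as np

def linear_regression(x, theta, theta0):
    """f(x | theta0, theta) = x^T theta + theta0 (a regression hypothesis)."""
    return x @ theta + theta0

def separating_hyperplane(x, theta, theta0):
    """h(x | theta0, theta) = 1(x^T theta - theta0 > 0) (a hard classifier)."""
    return (x @ theta - theta0 > 0).astype(int)

# Each concrete choice of (theta, theta0) picks one hypothesis from H;
# learning means searching this space for a good one.
x = np.array([1.0, 2.0])
theta, theta0 = np.array([0.5, -1.0]), 0.3
print(linear_regression(x, theta, theta0))      # approximately -1.2
print(separating_hyperplane(x, theta, theta0))  # 0
```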

HYPOTHESIS SPACES
Decision trees: f(x) = Σ_{i=1}^m c_i 1(x ∈ Q_i) (the tree recursively divides the feature space into axis-aligned rectangles Q_i).

Ensemble methods: f(x | β^[j]) = Σ_{j=1}^m β^[j] b^[j](x) (the predictions of several base models b^[j] are aggregated in some manner).

HYPOTHESIS SPACES
Neural networks:
f(x) = τ ◦ φ ◦ σ^(h) ◦ φ^(h) ◦ σ^(h−1) ◦ φ^(h−1) ◦ . . . ◦ σ^(1) ◦ φ^(1)(x)

A neural network consists of layers of simple computational units known as “neurons” (there are h + 1 such layers in the formula above).

HYPOTHESIS SPACES
Each neuron in a given layer performs a two-step computation: a weighted sum of its inputs from the previous layer,

φ^(j)(z) = w_j^T z + b_j,   with w_j ∈ R^p, b_j ∈ R,

followed by a non-linear transformation (σ) of the sum,

σ^(j)( φ^(j)(z) ) = σ^(j)( w_j^T z + b_j ).

The network as a whole represents a nested composition of such operations.
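As a small illustration of this nested composition, the following NumPy sketch builds a toy network with one hidden layer; the layer sizes, the random weights, and the choice of tanh and identity activations are arbitrary assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def affine(z, W, b):
    """phi(z) = W z + b: the weighted-sum step of a layer."""
    return W @ z + b

def layer(z, W, b, sigma):
    """sigma(phi(z)): affine map followed by a non-linear activation."""
    return sigma(affine(z, W, b))

# A toy network with one hidden layer (h = 1) and a linear output:
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # phi^(1)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # output-layer phi
hidden = layer(x, W1, b1, np.tanh)   # sigma^(1)(phi^(1)(x))
f_x = affine(hidden, W2, b2)         # tau is the identity here
print(f_x)
```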

OVERFITTING

The capacity (or “complexity”) of a model can be increased by increasing the size of the hypothesis space.
This (usually) also increases the number of learnable parameters.
Examples: Increasing the degree of the polynomial in linear regression, increasing the depth of a decision tree or a neural network, adding additional predictors, etc.
As the size of the hypothesis space increases, the tendency of a model to overfit also increases (see the sketch below).
Such a model might fit even the random quirks in the training data, thereby failing to generalize.
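The following scikit-learn sketch illustrates this effect with polynomial regression; the data-generating process (a noisy sine curve) and the degrees tried are made up for the example. Training error keeps decreasing with the degree, while test error typically follows a U-shape:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=60)   # f_true = sin, plus noise
x_tr, x_te, y_tr, y_te = x[:40], x[40:], y[:40], y[40:]

for degree in [1, 3, 9, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(x_tr))
    test_mse = mean_squared_error(y_te, model.predict(x_te))
    print(degree, round(train_mse, 3), round(test_mse, 3))
# Exact numbers depend on the random seed, but the training error shrinks
# monotonically with the degree while the test error eventually grows again.
```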

OVERFITTING: POLYNOMIAL REGRESSION

OVERFITTING: DECISION TREES

OVERFITTING: K-NEAREST NEIGHBOURS

[Figure-only slides illustrating overfitting for these three model classes.]
THE COMPLEXITY OF HYPOTHESIS SPACES
A general measure of the complexity of a function space is the Vapnik-Chervonenkis (VC) dimension.

The VC dimension of a class of binary-valued functions H = {h : X → {0, 1}} is defined to be the largest number of points in X (in some configuration) that can be “shattered” by members of H. We write VC_p(H), where p denotes the dimension of the input space.

A set of points is said to be shattered by a class of functions if a member of this class can perfectly separate them no matter how we assign binary labels to the points.

Note: If the VC dimension of a hypothesis class is d, this doesn’t mean that all sets of size d can be shattered. Rather, it simply means that there is at least one set of size d which can be shattered and no set of size d + 1 which can be shattered.

THE COMPLEXITY OF HYPOTHESIS SPACES
Example:

For x ∈ R^2, the class of linear indicator functions
H = {h : R^2 → {0, 1} | h(x | θ0, θ) = 1(x^T θ − θ0 > 0)}
can shatter 3 points: no matter how we assign labels to a configuration of three points in general position (not all on one line), we can find a separating line that classifies them perfectly;
cannot shatter any configuration of 4 points.
Hence, VC_2(H) = 3. (A small brute-force check of this claim is sketched below.)
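As a sanity check of this example (not part of the lecture material), the sketch below enumerates all labelings of a point set and tests, by brute force over a grid of candidate directions θ and thresholds θ0, whether each labeling is realizable by 1(x^T θ − θ0 > 0):

```python
import numpy as np

def can_shatter(points, n_angles=720):
    """Brute-force check: can 1(x^T theta - theta0 > 0) realize every labeling?"""
    n = len(points)
    realized = set()
    for alpha in np.linspace(0, 2 * np.pi, n_angles, endpoint=False):
        theta = np.array([np.cos(alpha), np.sin(alpha)])
        proj = points @ theta
        # candidate thresholds: below, between, and above the projected points
        mids = (np.sort(proj)[:-1] + np.sort(proj)[1:]) / 2
        cuts = np.concatenate(([proj.min() - 1], mids, [proj.max() + 1]))
        for theta0 in cuts:
            realized.add(tuple((proj - theta0 > 0).astype(int)))
    return len(realized) == 2 ** n

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # a unit square
print(can_shatter(three))  # True
print(can_shatter(four))   # False (the "XOR" labeling of the diagonals is not realizable)
```

The grid search is only a heuristic check, but for these small configurations it confirms the claim.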

THE COMPLEXITY OF HYPOTHESIS SPACES
Theorem: The VC dimension of the class of homogeneous halfspaces, H = {h : R^p → {−1, 1} | h(x) = sign(x^T θ)}, in R^p is p.

Proof, p as a lower bound:
Consider the set of standard basis vectors e^(1), e^(2), ..., e^(p) in R^p. For every possible labelling y^(1), y^(2), ..., y^(p) ∈ {−1, +1}, if we set θ = (y^(1), y^(2), ..., y^(p))^T, then
h(e^(i)) = sgn(θ^T e^(i)) = sgn(y^(i)) = y^(i) for all i.
Therefore, the p points are shattered.

p as an upper bound:
Let x^(1), x^(2), ..., x^(p+1) be a set of p + 1 vectors in R^p.
Because any set of p + 1 vectors in R^p is linearly dependent, there must exist real numbers a_1, a_2, ..., a_{p+1} ∈ R, not all of them zero, such that Σ_{i=1}^{p+1} a_i x^(i) = 0.

THE COMPLEXITY OF HYPOTHESIS SPACES
Let I = {i : a_i > 0} and J = {j : a_j < 0}. Either I or J is nonempty. If we assume both I and J are nonempty, then

Σ_{i∈I} a_i x^(i) = Σ_{j∈J} |a_j| x^(j).

Let’s assume x^(1), x^(2), ..., x^(p+1) are shattered by H. Then there must exist a vector θ ∈ R^p such that

h(x^(i) | θ) = 1 ⇔ θ^T x^(i) > 0 for all i ∈ I, and
h(x^(j) | θ) = −1 ⇔ θ^T x^(j) < 0 for all j ∈ J.

THE COMPLEXITY OF HYPOTHESIS SPACES
This implies

0 < Σ_{i∈I} a_i · θ^T x^(i) = ( Σ_{i∈I} a_i x^(i) )^T θ = ( Σ_{j∈J} |a_j| x^(j) )^T θ = Σ_{j∈J} |a_j| · θ^T x^(j) < 0,

which is a contradiction.
On the other hand, if we assume J (respectively, I) is empty, then the right-most (respectively, left-most) inequality has to be replaced by an equality, which is still a contradiction.

THE COMPLEXITY OF HYPOTHESIS SPACES
Theorem: The VC dimension of the class of non-homogeneous halfspaces, H = {h : R^p → {−1, 1} | h(x | θ) = sign(x^T θ + θ0)}, in R^p is p + 1.

Proof, p + 1 as a lower bound: Similar to the proof of the previous theorem, the set of basis vectors together with the origin, that is, 0, e^(1), ..., e^(p), can be shattered by non-homogeneous halfspaces.

p + 1 as an upper bound:
Let’s assume that p + 2 vectors x^(1), ..., x^(p+2) are shattered by the class of non-homogeneous halfspaces.
If we denote θ̃ = (θ0, θ1, θ2, ..., θp)^T ∈ R^{p+1}, where θ0 is the bias/intercept, and x̃ = (1, x1, x2, ..., xp)^T ∈ R^{p+1}, then
h(x | θ) = sgn(x^T θ + θ0) = sgn(x̃^T θ̃).
That is, any affine function in R^p can be rewritten as a homogeneous linear function in R^{p+1}.

THE COMPLEXITY OF HYPOTHESIS SPACES
By the previous theorem, the set of homogeneous halfspaces in
Rp+1 cannot shatter any set of p + 2 points. This leads us to a
contradiction.

THE COMPLEXITY OF HYPOTHESIS SPACES
Example: For k-nearest neighbours with k = 1, the VC dimension is infinite.

Example: Let H be the class of axis-aligned rectangles in R^2,

H = { h_(a1,a2,b1,b2) : R^2 → {0, 1} | a1 ≤ a2 and b1 ≤ b2 },

where

h_(a1,a2,b1,b2)(x) = 1 if a1 ≤ x1 ≤ a2 and b1 ≤ x2 ≤ b2, and 0 otherwise.

Claim: VC_2(H) = 4

Proof: (next slide)

THE COMPLEXITY OF HYPOTHESIS SPACES
4 as a lower bound: There exists a set of 4 points that can be shattered.

THE COMPLEXITY OF HYPOTHESIS SPACES
4 as an upper bound: For any set of 5 points x^(1), x^(2), x^(3), x^(4), x^(5) ∈ R^2:
Assign the leftmost point (lowest x1), the rightmost point (highest x1), the highest point (highest x2) and the lowest point (lowest x2) to class 1. The remaining point, say x^(5), is assigned to class 0.
Any rectangle that assigns class 1 to x^(1), x^(2), x^(3) and x^(4) must contain all four of them.
But then x^(5) is classified as 1 as well, since its coordinates lie within the intervals defined by the other 4 points.

credit: Shai Shalev-Shwartz and Shai Ben-David

Therefore, the VC dimension of axis-aligned rectangles is 4. (A small numerical check of both claims is sketched below.)
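Both steps can be checked numerically. The sketch below (illustrative, not from the slides) uses the fact that a labeling is realizable by an axis-aligned rectangle iff the bounding box of the positively labeled points contains no negatively labeled point:

```python
import itertools
import numpy as np

def rect_realizable(points, labels):
    """A labeling is realizable by an axis-aligned rectangle iff the bounding
    box of the positive points contains no negative point."""
    pos, neg = points[labels == 1], points[labels == 0]
    if len(pos) == 0:
        return True  # a degenerate rectangle far away labels everything 0
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    inside = np.all((neg >= lo) & (neg <= hi), axis=1)
    return not inside.any()

def can_shatter(points):
    return all(rect_realizable(points, np.array(lab))
               for lab in itertools.product([0, 1], repeat=len(points)))

four = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 0.0], [-1.0, 0.0]])  # a diamond
five = np.vstack([four, [[0.25, 0.25]]])                             # plus an interior point
print(can_shatter(four))  # True: 4 points in this configuration are shattered
print(can_shatter(five))  # False: the labeling from the argument above fails
```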


THE COMPLEXITY OF HYPOTHESIS SPACES
Recall that the training error is an optimistic estimate of the generalization (or test) error. For a classification model with VC dimension d, the VC dimension yields a probabilistic upper bound on the test error:

P( err_test ≤ err_train + sqrt( ( d (log(2|D_train|/d) + 1) − log(δ/4) ) / |D_train| ) ) = 1 − δ,

for δ ∈ [0, 1], if the training data set is large enough (d < |D_train| is required). We denote the square-root term by ε.

The probability of the true error, err_test, being greater than err_train + ε is bounded by some constant δ.
This probability is over all possible datasets of size |D_train| drawn from P_xy (which can be arbitrary).
If d is finite, both δ and ε can be made arbitrarily small by increasing the sample size. (A small numerical evaluation of the bound is sketched below.)
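To get a feeling for the numbers, the following sketch evaluates the square-root term ε for a few made-up sample sizes and VC dimensions at δ = 0.05:

```python
import numpy as np

def vc_epsilon(n, d, delta=0.05):
    """Width of the VC confidence term:
    sqrt((d * (log(2n/d) + 1) - log(delta/4)) / n)."""
    return np.sqrt((d * (np.log(2 * n / d) + 1) - np.log(delta / 4)) / n)

for n in [1_000, 100_000, 10_000_000]:
    for d in [3, 100]:
        print(n, d, round(vc_epsilon(n, d), 3))
# For d = 100, roughly 10^5 samples are needed before the additive slack drops
# below 0.1 -- and the bound must hold for an arbitrary P_xy, which is one
# reason it is so loose in practice.
```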

THE COMPLEXITY OF HYPOTHESIS SPACES
Consider the VC bound from the previous slide once more.
As a corollary, if the training error is low, the test error is also (probably)
low.
In other words, an algorithm which minimizes training error reliably picks
a "good" hypothesis from the hypothesis set.
Such an algorithm is known as a Probably Approximately Correct (or
PAC) algorithm.

THE COMPLEXITY OF HYPOTHESIS SPACES
In general, the VC dimension of a hypothesis space increases as the number of learnable parameters increases. However, it is not always possible to judge the capacity of a model based solely on the number of parameters.

Example: A single-parametric threshold classifier h(x) = 1(x ≥ θ) has VC dimension 1:
It can shatter a single point.
It cannot shatter any set of 2 points (for every set of 2 numbers, if the smaller is labeled 1, the larger must also be labeled 1).

[Figure: the indicator 1(x > θ) plotted over x ∈ (−2, 4), showing the two classes 0 and 1.]

THE COMPLEXITY OF HYPOTHESIS SPACES
A single-parametric sine classifier h(x) = 1(sin(θx) > 0), for x ∈ R, however, has infinite VC dimension: for a suitably chosen set of points of any size, every labeling can be realized by choosing the frequency θ large enough (see the sketch below).

Hastie, The Elements of Statistical Learning, 2009
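A standard construction (found, e.g., in Shalev-Shwartz & Ben-David) places the points at x_i = 10^(−i); a single frequency θ then realizes any desired labeling. The sketch below is an illustrative check of that construction, not material from the slides:

```python
import itertools
import numpy as np

def shatter_frequency(labels):
    """For points x_i = 10^(-i), return a frequency theta such that
    1(sin(theta * x_i) > 0) reproduces the given 0/1 labels."""
    return np.pi * (1 + sum((1 - y) * 10 ** (i + 1) for i, y in enumerate(labels)))

m = 4
points = np.array([10.0 ** -(i + 1) for i in range(m)])
for labels in itertools.product([0, 1], repeat=m):
    theta = shatter_frequency(labels)
    predictions = (np.sin(theta * points) > 0).astype(int)
    assert tuple(predictions) == labels   # every labeling of the m points is realized
print("all", 2 ** m, "labelings realized")
```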

THE COMPLEXITY OF HYPOTHESIS SPACES
The bounds derived by the VC analysis are extremely loose and
pessimistic.
Because they have to hold for an arbitrary Pxy , tightening the
bounds to a desired level requires the training sets to be extremely
large.
In practice, complex models (such as neural networks) often
perform much better than these bounds suggest.
Other measures of model capacity such as Rademacher
complexity may be easier to compute or provide tighter bounds.
In addition to the hypothesis space, the effective capacity of a
learning algorithm also depends in a complicated way on the
optimizer.
A better estimate of the generalization error can be obtained
simply by evaluating the learned hypothesis on the test set.

BIAS-VARIANCE DECOMPOSITION
Let’s take a closer look at the generalization error of a learning algorithm I_{L,O}:

GE_n(I_{L,O}) = E_{Dn ∼ P_xy, (x,y) ∼ P_xy}[ L(y, f̂_Dn(x)) ] = E_{Dn, xy}[ L(y, f̂_Dn(x)) ].

We assume that the data is generated by

y = f_true(x) + ε,

with normally distributed error ε ∼ N(0, σ^2) independent of x.

BIAS-VARIANCE DECOMPOSITION
By plugging in the L2 loss L(y, f(x)) = (y − f(x))^2 we get

GE_n(I_{L,O}) = E_{Dn, xy}[ L(y, f̂_Dn(x)) ] = E_{Dn, xy}[ (y − f̂_Dn(x))^2 ]
             = E_xy[ E_Dn[ (y − f̂_Dn(x))^2 | x, y ] ],

and we denote the inner expectation by (∗).

Let us consider the error (∗) conditioned on one fixed test observation (x, y) first. (We omit the “| x, y” for better readability for now.)

(∗) = E_Dn[ (y − f̂_Dn(x))^2 ]
    = E_Dn[ y^2 ] + E_Dn[ f̂_Dn(x)^2 ] − 2 E_Dn[ y f̂_Dn(x) ]
    = y^2 + (1) − 2 · (2),

by using the linearity of the expectation. Since (x, y) is fixed, E_Dn[y^2] = y^2; we abbreviate (1) = E_Dn[ f̂_Dn(x)^2 ] and (2) = E_Dn[ y f̂_Dn(x) ].

BIAS-VARIANCE DECOMPOSITION
By using that Var(y) = E(y^2) − E(y)^2, we see that

(1) = E_Dn[ f̂_Dn(x)^2 ] = Var_Dn( f̂_Dn(x) ) + ( E_Dn[ f̂_Dn(x) ] )^2

(2) = E_Dn[ (f_true(x) + ε) · f̂_Dn(x) ] = E_Dn[ f_true(x) f̂_Dn(x) + ε f̂_Dn(x) ]
    = E_Dn[ f_true(x) f̂_Dn(x) ] + E_Dn[ ε f̂_Dn(x) ]        (the last term is E(ε) E(f̂) = 0)
    = E_Dn[ f_true(x) f̂_Dn(x) ] = E_Dn[ f_true(x) ] · E_Dn[ f̂_Dn(x) ]

Plugging this in and adding and subtracting E_Dn[ f_true(x) ]^2 = f_true(x)^2 gives

(∗) = y^2 + Var_Dn( f̂_Dn(x) ) + ( E_Dn[ f̂_Dn(x) ] )^2 − 2 · E_Dn[ f_true(x) ] · E_Dn[ f̂_Dn(x) ]
      + E_Dn[ f_true(x) ]^2 − E_Dn[ f_true(x) ]^2
    = y^2 − f_true(x)^2 + Var_Dn( f̂_Dn(x) ) + ( E_Dn[ f_true(x) − f̂_Dn(x) ] )^2.

BIAS-VARIANCE DECOMPOSITION
Let’s come back to the generalization error by taking the expectation over all possible random test observations (x, y) ∼ P_xy:

GE_n(I_{L,O}) = E_xy[ y^2 − f_true(x)^2 ] + E_xy[ Var_Dn( f̂_Dn(x) | x, y ) ] + E_xy[ ( E_Dn[ f_true(x) − f̂_Dn(x) | x, y ] )^2 ]

The first term equals σ^2, so

GE_n(I_{L,O}) =   σ^2                                                   (variance of the data)
                + E_xy[ Var_Dn( f̂_Dn(x) | x, y ) ]                      (variance of the inducer at (x, y))
                + E_xy[ ( E_Dn[ f_true(x) − f̂_Dn(x) | x, y ] )^2 ]       (squared bias of the inducer at (x, y))

BIAS-VARIANCE DECOMPOSITION

GE_n(I_{L,O}) = σ^2 + E_xy[ Var_Dn( f̂_Dn(x) | x, y ) ] + E_xy[ ( E_Dn[ f_true(x) − f̂_Dn(x) | x, y ] )^2 ]
              = variance of the data + variance of the inducer at (x, y) + squared bias of the inducer at (x, y)

1. The first term, σ^2, expresses the variance of the data.
This is noise present in the data.
This error source is also called intrinsic or unavoidable error.
No matter which learner we use, we will never get below this error.
2. The second term expresses the expected variance Var_Dn( f̂_Dn(x) | x, y ) at a (single) random test observation (x, y):
BIAS-VARIANCE DECOMPOSITION
If one test observation is fixed, how does the prediction vary w.r.t. the training data used?
We then take the expectation over all test points (x, y) ∼ P_xy.
This also expresses the learner’s tendency to learn random things irrespective of the real signal (overfitting).
3. The third term, ( E_Dn[ f_true(x) − f̂_Dn(x) | x, y ] )^2, is the squared bias at a (single) random test observation (x, y).
It reflects the learner’s tendency to consistently mispredict certain instances (underfitting).
Models with high capacity have low bias and models with low capacity have high bias.

A small simulation illustrating the decomposition is sketched below.
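The decomposition can be checked empirically. The sketch below uses made-up settings (f_true = sin, σ = 0.3, and polynomial least-squares learners of two different degrees as stand-ins for low- and high-capacity models); it repeatedly draws training sets, refits the learner, and estimates variance and squared bias over a grid of test points:

```python
import numpy as np

rng = np.random.default_rng(0)
f_true = np.sin
sigma, n_train, n_reps = 0.3, 30, 500
x_test = np.linspace(-3, 3, 50)

for degree in [1, 9]:                        # low capacity vs. high capacity
    preds = np.empty((n_reps, x_test.size))
    for r in range(n_reps):
        x = rng.uniform(-3, 3, n_train)
        y = f_true(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)     # the learner: polynomial least squares
        preds[r] = np.polyval(coefs, x_test)
    variance = preds.var(axis=0).mean()                            # E_x[Var_D(f_hat(x))]
    bias_sq = ((preds.mean(axis=0) - f_true(x_test)) ** 2).mean()  # E_x[squared bias]
    print(degree, round(bias_sq, 3), round(variance, 3), round(sigma ** 2, 3))
# Typically: the degree-1 learner has high bias and low variance, the degree-9
# learner has low bias and higher variance; sigma^2 is the irreducible part.
```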

BIAS-VARIANCE DECOMPOSITION

Figure: Left: A model with high bias is unable to fit the curved relationship
present in the data. Right: A model with no bias and high variance can, in
principle, learn the true pattern in the data. However, in practice, the learner
outputs wildly different hypotheses for different training sets.

LEARNING CURVES

credit: digitalvidya

As the capacity of the model increases,
the error due to variance increases,
the error due to bias decreases, and
in the overfitting region, the generalization error increases because the increase in variance is much greater than the decrease in bias.

LEARNING CURVES

As the size of the training set grows,
the error due to variance vanishes, and
both the generalization error and the training error converge to the bias of the algorithm (assuming the noise is zero).
As a result, the generalization gap also vanishes. (A learning-curve sketch follows below.)
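A minimal learning-curve sketch along these lines, using scikit-learn's learning_curve on a synthetic regression problem chosen purely for illustration:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=2000)

sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=[50, 100, 500, 1000], cv=5,
    scoring="neg_mean_squared_error")

for n, tr, te in zip(sizes, -train_scores.mean(axis=1), -test_scores.mean(axis=1)):
    print(n, round(tr, 3), round(te, 3))   # training and test MSE per training-set size
# The gap between the two errors shrinks as n grows; here both approach the sum
# of the (high) squared bias of the linear model and the noise level sigma^2 = 0.09.
```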

LEARNING CURVES
↓ Reduce underfitting by
decreasing bias. Make
model more flexible (or
choose another).

← Reduce overfitting by
decreasing variance. Make
model less flexible
(regularization), or add
more data.

ML AS AN ILL-POSED PROBLEM

Recall that a learning algorithm must perform well on previously unseen data.
Let’s assume we’re trying to learn a boolean-valued function over 4 boolean features. Since there are 2^4 = 16 possible inputs, there are 2^16 = 65536 possible functions.
The training set (below) contains 7 examples.

source: USC

ML AS AN ILL-POSED PROBLEM

Even after observing 7 examples, there are still 2^9 = 512 possible functions left, one for each labeling of the 9 unseen inputs.
Therefore, machine learning is an ill-posed problem.
Given that the unseen data points can have any labels, can a machine learning algorithm really perform better than random guessing? (A small counting sketch follows below.)
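The counting argument can be reproduced directly. The sketch below uses an arbitrary made-up training set of 7 of the 16 possible inputs, enumerates all 2^16 boolean functions, and counts those consistent with the training labels:

```python
import itertools

inputs = list(itertools.product([0, 1], repeat=4))   # all 16 possible feature vectors
train_x = inputs[:7]                                 # an arbitrary set of 7 training inputs
train_y = [0, 1, 1, 0, 1, 0, 0]                      # arbitrary training labels

consistent = 0
for outputs in itertools.product([0, 1], repeat=16): # each choice is one boolean function
    f = dict(zip(inputs, outputs))
    if all(f[x] == y for x, y in zip(train_x, train_y)):
        consistent += 1

print(consistent)   # 512 = 2^9: one function per labeling of the 9 unseen inputs
```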

NO FREE LUNCH

credit: Leon Fedden

For a specific problem, that is, for a specific distribution over target
functions, it is possible for some algorithms to perform better than
some others.
When averaged over all possible problems, however, no algorithm
is better than any other (including random guessing). This
important result is known as the No Free Lunch theorem.
The essence of the NFL theorem is that an algorithm that performs
well on one set of problems has to necessarily perform badly on
another set of problems. There is no "free lunch".

NO FREE LUNCH

In other words, there is no algorithm that is universally better than all other algorithms.
The main takeaway is that a learning algorithm must be tailored to
a specific prior. Without making assumptions about the problem at
hand, machine learning simply isn’t possible!
If these assumptions are very specific, it’s possible for an algorithm
to perform very well on a narrow set of problems. If these
assumptions are very broad, we end up with an algorithm that is a
"jack of all trades, master of none".

NO FREE LUNCH

In the real world, however, we encounter only an infinitesimal subset of all possible problems.
This is because data-generating distributions are constrained by the laws of physics.
Additionally, for a given task, one also has domain-specific knowledge about the nature of the target function.
Therefore, by incorporating assumptions about the task (such as smoothness, linearity, etc.), machine learning algorithms indeed perform quite well in practice.

