Advanced Statistical Learning

Chapter 5: Univariate and Linear Modelling


Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
UNIVARIATE AND LINEAR MODELING
In this chapter we will formally introduce some very basic hypothesis
spaces.

For regression, we will introduce

uni- and multivariate linear model classes
polynomial model classes
splines
regression stumps

and, for classification,

decision stumps
uni- and multivariate logistic regression.

UNIVARIATE AND LINEAR MODELING
We will talk about those model classes for two reasons:
1 to deepen our understanding of the risk minimization principle;
2 to use those basic model classes as components of other models: simple model classes can serve as the building blocks of “stronger” models (bagging, boosting).

Univariate and Linear Regression Modelling

UNIVARIATE REGRESSION
Let us consider a univariate regression task with one-dimensional input $x \in \mathbb{R}$ and continuous response $y \in \mathbb{R}$.

(Figure: scatterplot of the example data, $y$ against $x$.)
HYPOTHESIS SPACE: LINEAR MODELS
Assume we would like to find the straight line that fits our data “best”.
We consider the hypothesis space of univariate linear models:

$\mathcal{H} = \{ f(x) = \theta_0 + \theta_1 x \mid \theta_0, \theta_1 \in \mathbb{R} \}$

(Figure: the same data, $y$ against $x$.)
LINEAR MODEL WITH L2-LOSS
Let us consider the L2-loss as loss function:

$L(y, f(x)) = (y - f(x))^2 = (y - \theta_0 - \theta_1 x)^2$

(Figure: data points, the fitted line $\hat{y}$, and the squared residuals $(y^{(i)} - f(x^{(i)}))^2$.)
LINEAR MODEL WITH L2-LOSS
Given $\mathcal{D} = \left( \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right) \subset (\mathcal{X} \times \mathcal{Y})^n$, we minimize the empirical risk

$\min_{\theta \in \mathbb{R}^2} \mathcal{R}_{\text{emp}}(\theta) = \min_{\theta \in \mathbb{R}^2} \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) = \min_{\theta_0, \theta_1 \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \left(y^{(i)} - \theta_0 - \theta_1 x^{(i)}\right)^2$

We minimize this expression analytically by setting the first derivatives (w.r.t. $\theta_0$, $\theta_1$) to 0 and rearranging terms (the derivation is left as an exercise).

LINEAR MODEL WITH L2-LOSS
We get
$\hat{\theta}_0 \overset{(*)}{=} \bar{y} - \hat{\theta}_1 \bar{x}$

$\hat{\theta}_1 \overset{(*)}{=} \frac{\sum_{i=1}^n \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^n \left(x^{(i)} - \bar{x}\right)^2} = \rho_{xy} \cdot \frac{s_y}{s_x}$

with

$\bar{y} := \frac{1}{n} \sum_{i=1}^n y^{(i)}$ and $\bar{x} := \frac{1}{n} \sum_{i=1}^n x^{(i)}$ denoting the empirical means of $y$ and $x$, respectively,

$\rho_{xy} := \frac{\sum_{i=1}^n (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{n \cdot s_x s_y}$ being the sample correlation coefficient,

$s_x := \sqrt{\frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})^2}$ and $s_y := \sqrt{\frac{1}{n} \sum_{i=1}^n (y^{(i)} - \bar{y})^2}$ denoting the uncorrected sample standard deviations of $x$ and $y$.

(*) Remark: The derivation of the above statements is left as an exercise (optional).
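The closed-form estimates above are easy to check numerically. The following minimal sketch (assuming NumPy is available; the simulated data and variable names are our own) computes $\hat{\theta}_0, \hat{\theta}_1$ and verifies the identity $\hat{\theta}_1 = \rho_{xy} \cdot s_y / s_x$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=100)   # simulated linear data

x_bar, y_bar = x.mean(), y.mean()
theta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta_0 = y_bar - theta_1 * x_bar

# identity: theta_1 = rho_xy * s_y / s_x (uncorrected standard deviations)
s_x, s_y = x.std(), y.std()
rho_xy = np.sum((x - x_bar) * (y - y_bar)) / (len(x) * s_x * s_y)
assert np.isclose(theta_1, rho_xy * s_y / s_x)
print(theta_0, theta_1)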

LINEAR MODEL WITH L2-LOSS AND
CORRELATION
In turn, we see that the correlation coefficient captures the linear correlation between the random variables $x$ and $y$:

$\rho_{xy} = \frac{\hat{\theta}_1 \cdot s_x}{s_y}$

where
$\hat{\theta}_1$ is the slope of the “best” straight line w.r.t. the L2 loss,
$s_y$ describes how the data points are scattered along the $y$-axis,
$s_x$ describes how the data points are scattered along the $x$-axis.

EVALUATION MEASURES: R-SQUARED
Let us evaluate the performance of a (univariate) linear model $f\left(x^{(i)} \mid \theta\right)$. One common measure is the $R^2$ measure:

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2}$

The $R^2$ measures the quality of a fit by comparing how the data points are scattered around the model predictions $f(x) = \theta_0 + \theta_1 x$ (orange lines) with how they are scattered along the $y$-axis (blue lines).
(Figure: two example datasets; blue lines indicate the scatter along the $y$-axis, orange lines the scatter around the model $f(x)$.)
EVALUATION MEASURES: R-SQUARED

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2}$

What we described informally as “how the data points are scattered” is measured by a sample variance.
Thus, the $R^2$ is usually introduced as the fraction of variance explained by the model.
Equivalently, we can interpret it as the SSE w.r.t. $f\left(x^{(i)} \mid \theta\right)$ vs. the SSE w.r.t. $\bar{y}$.
We arrive at a much simpler explanation.

EVALUATION MEASURES: R-SQUARED
The $R^2$ compares the SSE of a constant model fitted w.r.t. the L2 loss, $c(x) = \bar{y}$ (baseline), with that of a more complex model (the linear model), on some data, usually the same data as used for model fitting:

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2} = 1 - \frac{\mathcal{R}_{\text{emp}}(\theta)}{\mathcal{R}_{\text{emp}}(c)}$
(Figure: the same data shown twice, once with the constant baseline $\bar{y}$ and once with the fitted line $\hat{y}$, together with the corresponding residuals.)
EVALUATION MEASURES: R-SQUARED
$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2} = 1 - \frac{\mathcal{R}_{\text{emp}}(\theta)}{\mathcal{R}_{\text{emp}}(c)}$

$R^2 = 1$ implies that all residuals are 0 and we predict perfectly; $R^2 = 0$ implies that we predict no better than the constant baseline.
If measured on the training data, $R^2 \in [0, 1]$, as the linear model (LM) must be at least as good as the constant, and both SSEs are non-negative.
On other data (test data), it can even be negative, as there is no guarantee that the LM generalizes better than a constant (overfitting is possible).
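A minimal numerical sketch of these properties (assuming NumPy; the data and the train/test split are made up for illustration). On the training data the value lies in $[0, 1]$; on fresh test data it can be smaller and, for a badly generalizing model, even negative.

import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE(model) / SSE(constant baseline ybar)."""
    sse_model = np.sum((y - y_hat) ** 2)
    sse_const = np.sum((y - y.mean()) ** 2)
    return 1 - sse_model / sse_const

rng = np.random.default_rng(1)
x = rng.uniform(0, 6, 50)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=50)

# closed-form univariate L2 fit (see the estimates derived above)
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()
print(r_squared(y, theta_0 + theta_1 * x))            # training R^2, in [0, 1]

x_test = rng.uniform(0, 6, 50)
y_test = 2 + 0.5 * x_test + rng.normal(scale=1.0, size=50)
print(r_squared(y_test, theta_0 + theta_1 * x_test))  # test R^2, no lower bound in general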

R-SQUARED AND THE LINEAR MODEL
Claim: $\rho_{xy}^2 = R^2$.
Let us plug the estimates of the linear model

$\hat{f}\left(x^{(i)} \mid \theta_0, \theta_1\right) = \hat{\theta}_0 + \hat{\theta}_1 x^{(i)} = \bar{y} - \rho_{xy} \frac{s_y}{s_x} \bar{x} + \rho_{xy} \frac{s_y}{s_x} x^{(i)} = \bar{y} + \rho_{xy} \frac{s_y}{s_x} \left(x^{(i)} - \bar{x}\right)$

into the formula for the $R^2$:

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta_0, \theta_1\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2} = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - \bar{y} - \rho_{xy} \frac{s_y}{s_x} \left(x^{(i)} - \bar{x}\right)\right)^2}{n s_y^2}$

R-SQUARED AND THE LINEAR MODEL

Pn s s2
i =1 ( y (i ) ȳ )2 2⇢xy syx (x (i ) x̄ )(y (i ) ȳ ) + ⇢2xy sy2 (x (i ) x̄ )2
2 x
R = 1
nsy2
n· sy2 2n · ⇢ 2 2
xy sy +n·⇢ 2 2
xy sy
= 1
nsy2
nsy2 n⇢2xy · sy2
= 1 = ⇢2xy
nsy2

In the context of linear regression, the R 2 equals the squared


correlation.

That allows us to interpret / define correlation as a measure of how


good a linear model fits the data points.

R-SQUARED AND THE LINEAR MODEL
Remark:
The $R^2$ was originally introduced in the normal linear regression context (linear model with L2 loss).
However, the measure

$1 - \frac{\mathcal{R}_{\text{emp}}(f)}{\mathcal{R}_{\text{emp}}(c)}$

can be used for any model $f$, any baseline model $c$ ($c$ is typically less complex than $f$) and any loss function $L$ (though losing some interpretability and the nice equivalence of $R^2$ and correlation).
The $R^2$ can also be computed on test data, if one is more interested in the generalization error than in the fit.

LINEAR MODEL WITH L1-LOSS

Let us consider the L1 loss as loss function:

$L(y, f(x)) = |y - f(x)| = |y - \theta_0 - \theta_1 x|$

Given $\mathcal{D}$, we minimize the empirical risk

$\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f) = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)}\right)\right) = \min_{\theta_0, \theta_1 \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \left|y^{(i)} - \theta_0 - \theta_1 x^{(i)}\right|$

Minimizing this expression results in the conditional median of the response variable given $x$.
Optimization is harder due to the non-differentiability of the loss function, and there is no closed-form solution.
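Because there is no closed-form solution, the L1 fit has to be computed numerically. A small sketch (assuming SciPy; the data and starting values are made up) that minimizes the empirical L1 risk with a derivative-free optimizer:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 100)
y = 1.0 + 0.7 * x + rng.standard_t(df=2, size=100)   # heavy-tailed noise with outliers

def l1_risk(theta):
    # empirical risk under the L1 loss for the linear model theta_0 + theta_1 * x
    return np.mean(np.abs(y - theta[0] - theta[1] * x))

res = minimize(l1_risk, x0=np.zeros(2), method="Nelder-Mead")
print(res.x)   # estimates of (theta_0, theta_1), robust against the outliers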
LINEAR MODEL WITH HUBER LOSS
The linear model fitted w.r.t. the Huber loss (as introduced in Chapter 4),

$L(y, f(x)) = \begin{cases} \frac{1}{2} \left(y - f(x)\right)^2 & \text{for } |y - f(x)| \le \delta \\ \delta \, |y - f(x)| - \frac{\delta^2}{2} & \text{otherwise,} \end{cases}$

is used in robust regression. The Huber loss is quadratic for small residuals but behaves like a linear function for large residuals.
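A sketch of the Huber loss and of a numerical fit under it (assuming NumPy/SciPy; the threshold name delta, the data, and the injected outliers are our choices):

import numpy as np
from scipy.optimize import minimize

def huber(residual, delta=1.0):
    # quadratic for small residuals, linear for large ones
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

rng = np.random.default_rng(3)
x = rng.uniform(0, 6, 100)
y = 1.0 + 0.7 * x + rng.normal(scale=0.3, size=100)
y[:5] += 8.0                                   # a few gross outliers

def huber_risk(theta):
    return np.mean(huber(y - theta[0] - theta[1] * x))

print(minimize(huber_risk, x0=np.zeros(2)).x)  # the fit is barely affected by the outliers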

LINEAR MODEL: L1 VS. L2 LOSS

Optimization: The L2 loss is differentiable and the empirical risk minimization problem has a closed-form solution; the L1 loss is not differentiable and has no closed-form solution.
Robustness: The L1 loss penalizes large residuals less than the L2 loss; thus, it is more robust to outliers.
The Huber loss combines the robustness of the L1 loss where residuals are large with the differentiability of the L2 loss where residuals are small.

(Figure: regression fits on the same data under the Huber, L1, and L2 losses.)
MULTIVARIATE REGRESSION
In a multivariate regression task, we have $\mathbf{x} \in \mathbb{R}^p$ and a continuous response $y \in \mathbb{R}$. We aim at “laying” a plane through the observations.

(Figure: 3D scatterplot of $y$ against the features $x_1$ and $x_2$.)
MULTIVARIATE REGRESSION
We define:

$\mathbf{X} = \begin{pmatrix} x_1^{(1)} & \cdots & x_p^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(n)} & \cdots & x_p^{(n)} \end{pmatrix}, \quad \text{or} \quad \mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & \cdots & x_p^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_p^{(n)} \end{pmatrix}$

$\mathbf{x}_j = \left(x_j^{(1)}, \ldots, x_j^{(n)}\right)^T$ is the $j$-th observed feature vector.

$\mathbf{y} = \left(y^{(1)}, \ldots, y^{(n)}\right)^T$ is the vector of target values.

The right design matrix demonstrates the trick of encoding the intercept via an additional constant-1 feature, so the feature space becomes $(p + 1)$-dimensional. This simplifies notation, e.g., we can write $f(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x}$ instead of $f(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x} + \theta_0$.

MULTIVARIATE REGRESSION
Note: Here we assume that $\theta_0$ is included in $\boldsymbol{\theta}$.

Using matrix notation, the L2 loss can be written down compactly as

$L(\mathbf{y}, f(\mathbf{x})) = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}).$

Differentiating w.r.t. $\boldsymbol{\theta}$ yields the so-called normal equations

$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = 0.$

The optimal $\boldsymbol{\theta}$ is

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$
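A minimal sketch of solving the normal equations (assuming NumPy; in practice one would prefer np.linalg.lstsq or a QR decomposition to forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # constant-1 column encodes the intercept
theta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # numerically preferable, same solution
print(theta_hat, theta_lstsq)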

MULTIVARIATE REGRESSION

(Figure: the same 3D scatterplot of $y$ against $x_1$ and $x_2$, now with the fitted regression plane.)
Univariate nonlinear modeling

POLYNOMIAL REGRESSION
Let $x \in \mathbb{R}$ and $y \in \mathbb{R}$ (univariate regression).
In the beginning, we fitted a linear model to describe the relationship between $x$ and $y$.
However, it is unlikely that the relationship is always linear.
$\rightarrow$ Use a more complex hypothesis space.
Idea: Transform the predictor variable to increase the flexibility in modeling.
The hypothesis space of polynomial regression models of degree $d$ is:

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \theta_0 + \theta_1 x^1 + \ldots + \theta_d x^d, \; \theta_j \in \mathbb{R}\right\}$

FITTING POLYNOMIAL REGRESSION
Using matrix notation makes it easy to fit a polynomial regression model with the L2 loss.

Design matrix (including the intercept term):

$\mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & \left(x_1^{(1)}\right)^2 & \cdots & \left(x_1^{(1)}\right)^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \left(x_1^{(n)}\right)^2 & \cdots & \left(x_1^{(n)}\right)^d \end{pmatrix}$

The L2 loss is then

$L(\mathbf{y}, f(\mathbf{x})) = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}),$

and the optimal $\boldsymbol{\theta}$ is, as shown before,

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
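A sketch of polynomial regression via the same least-squares machinery (assuming NumPy; np.vander builds the polynomial design matrix, the data are simulated):

import numpy as np

def fit_polynomial(x, y, degree):
    # design matrix with columns 1, x, x^2, ..., x^degree
    X = np.vander(x, N=degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_polynomial(theta, x):
    return np.vander(x, N=len(theta), increasing=True) @ theta

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)

theta = fit_polynomial(x, y, degree=5)
print(predict_polynomial(theta, np.array([0.1, 0.5, 0.9])))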

EXAMPLE POLYNOMIAL REGRESSION

(Figure: example data for polynomial regression, $y$ against $x$ on $[0, 1]$.)
POLYNOMIAL REGRESSION: PROBLEMS
Influence of the choice of d?
Higher degree: more flexibility and better fit
But:
Too much flexibility might result in fitting the noise rather than
the actual relationship.
If the degree is too high, the polynomial might oscillate
strongly, in particular at the edges of the domain.

EXAMPLE POLYNOMIAL REGRESSION

(Figure: polynomial regression fits of degree 2 and degree 20 on the example data, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Goal: Keep the flexibility, but don’t fit models with too high a degree $d$.

Idea:
Divide the domain of $x$ into $K$ intervals

$I_k(x) := \mathbb{1}(\zeta_{k-1} \le x \le \zeta_k) = \begin{cases} 1 & \text{if } \zeta_{k-1} \le x \le \zeta_k \\ 0 & \text{otherwise,} \end{cases} \quad k = 1, \ldots, K$

($\zeta_0$ and $\zeta_K$ being the left and the right border of the domain of $\mathcal{X}$).
$\zeta_k$, $k = 1, \ldots, K - 1$, are called knots.
Fit a polynomial model $f_k$ of degree $d$ on each interval:

$\mathcal{H} = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = f_1(x) I_1(x) + \ldots + f_K(x) I_K(x)\}$

PIECEWISE POLYNOMIALS
Example: Piecewise constant model ($d = 0$)
Divide $\mathcal{X}$ into three equally spaced intervals ($K = 3$).
Fit a constant model to each interval w.r.t. a loss function.
Here: We use the L2 loss, which means that we fit the mean of the target variable per interval.

(Figure: piecewise constant fit with $K = 3$ intervals, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Example: Piecewise linear model ($d = 1$)
Divide $\mathcal{X}$ into three equally spaced intervals ($K = 3$).
Fit a linear model $f_k(x) = \theta_{k0} + \theta_{k1} x$ to each interval w.r.t. a loss function.

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \sum_{k=1}^K f_k(x) I_k(x)\right\}$

(Figure: piecewise linear fit with $K = 3$ intervals, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Example: Piecewise quadratic model ($d = 2$)
Divide $\mathcal{X}$ into three equally spaced intervals ($K = 3$).
Fit a quadratic model $f_k(x) = \theta_{k0} + \theta_{k1} x + \theta_{k2} x^2$ to each interval w.r.t. a loss function.

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \sum_{k=1}^K f_k(x) I_k(x)\right\}$

(Figure: piecewise quadratic fit with $K = 3$ intervals, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Instead of defining the hypothesis space of piecewise polynomials as

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \sum_{k=1}^K f_k(x) I_k(x)\right\},$

with $f_k(x) = \theta_{k0} + \theta_{k1} x^1 + \ldots + \theta_{kd} x^d$, we could rearrange the summands and get

$f(x) = \theta_{10} I_1(x) + \ldots + \theta_{K0} I_K(x)$
$\quad\; + \theta_{11} \cdot x \cdot I_1(x) + \ldots + \theta_{K1} \cdot x \cdot I_K(x)$
$\quad\; + \ldots$
$\quad\; + \theta_{1d} \cdot x^d \cdot I_1(x) + \ldots + \theta_{Kd} \cdot x^d \cdot I_K(x).$

PIECEWISE POLYNOMIALS
We can write the hypothesis space as

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \;\middle|\; f(x) = \sum_m \beta^{[m]} b^{[m]}(x)\right\},$

with $\beta^{[j \cdot K + k]} := \theta_{kj} \in \mathbb{R}$ and

$b^{[j \cdot K + k]}(x) = x^j \cdot I_k(x)$

for $k = 1, \ldots, K$, $j = 0, \ldots, d$.

The functions $b^{[m]}(x)$ are called basis functions.
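A sketch that builds this basis explicitly and fits a piecewise polynomial by ordinary least squares on the basis expansion (assuming NumPy; the interval boundaries are our choice):

import numpy as np

def piecewise_poly_basis(x, boundaries, degree):
    """Columns x^j * I_k(x) for j = 0, ..., degree and intervals defined by boundaries.

    boundaries must include the left and right border of the domain."""
    cols = []
    for k in range(len(boundaries) - 1):
        # make the last interval right-closed so every point belongs to exactly one interval
        if k < len(boundaries) - 2:
            ind = (boundaries[k] <= x) & (x < boundaries[k + 1])
        else:
            ind = (boundaries[k] <= x) & (x <= boundaries[k + 1])
        for j in range(degree + 1):
            cols.append((x ** j) * ind)
    return np.column_stack(cols)

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=100)

B = piecewise_poly_basis(x, boundaries=[0.0, 1/3, 2/3, 1.0], degree=1)   # K = 3, d = 1
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
print(beta)   # one intercept and one slope per interval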

SPLINES

Problem: Prediction functions are in general not continuous.
Therefore, constraints should be introduced to guarantee continuity at the knots:

$f_k(\zeta_k) = f_{k+1}(\zeta_k)$
$\theta_{k0} + \theta_{k1} \zeta_k + \ldots + \theta_{kd} \zeta_k^d = \theta_{(k+1)0} + \theta_{(k+1)1} \zeta_k + \ldots + \theta_{(k+1)d} \zeta_k^d$

SPLINES
Example: For $d = 1$ and one split point $\zeta_1 \in \mathcal{X}$ we aim at finding a model

$f(x) = \begin{cases} \theta_{10} + \theta_{11} x & \text{if } x < \zeta_1 \\ \theta_{20} + \theta_{21} x & \text{if } x \ge \zeta_1 \end{cases}$

that fulfills $\theta_{10} + \theta_{11} \zeta_1 = \theta_{20} + \theta_{21} \zeta_1$.

A more direct way to proceed in this case is to use a basis that directly incorporates the constraints:

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x, \quad b^{[3]}(x) = (x - \zeta_1)_+$

with $(x)_+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$
SPLINES
We see that $f(x) = \sum_{k=1}^3 \beta^{[k]} b^{[k]}(x)$ is equivalent to the above formulation with the constraint $\theta_{10} + \theta_{11} \zeta_1 = \theta_{20} + \theta_{21} \zeta_1$:

$f(x) = \begin{cases} \beta^{[1]} + \beta^{[2]} x & \text{if } x \le \zeta_1 \\ \beta^{[1]} + \beta^{[2]} x + \beta^{[3]} (x - \zeta_1) & \text{if } x > \zeta_1 \end{cases}$

$= \left(\beta^{[1]} + \beta^{[2]} x\right) I_1 + \left(\left(\beta^{[1]} - \beta^{[3]} \zeta_1\right) + \left(\beta^{[2]} + \beta^{[3]}\right) x\right) I_2$
$= \left(\theta_{10} + \theta_{11} x\right) I_1 + \left(\theta_{20} + \theta_{21} x\right) I_2$

with $\theta_{10} := \beta^{[1]}$, $\theta_{11} := \beta^{[2]}$, $\theta_{20} := \beta^{[1]} - \beta^{[3]} \zeta_1$, $\theta_{21} := \beta^{[2]} + \beta^{[3]}$, which indeed fulfills the constraint:

$\theta_{20} + \theta_{21} \zeta_1 = \beta^{[1]} - \beta^{[3]} \zeta_1 + \left(\beta^{[2]} + \beta^{[3]}\right) \zeta_1 = \beta^{[1]} + \beta^{[2]} \zeta_1 = \theta_{10} + \theta_{11} \zeta_1$

SPLINES
This equivalent formulation can be generalized to higher degrees $d$ and more split points. We can reformulate the hypothesis space as follows:

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \;\middle|\; f(x) = \sum_m \beta^{[m]} b^{[m]}(x)\right\}$

with

$b^{[j]}(x) = x^j, \quad j = 0, \ldots, d$
$b^{[d+k]}(x) = (x - \zeta_k)_+^d, \quad k = 1, \ldots, K - 1,$

where $d$ is the order of the piecewise functions, $K$ is the number of knots, and $(\cdot)_+$ denotes the positive part. The functions $b^{[j]}$ are the basis functions.
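A sketch of a regression-spline fit using this truncated power basis (assuming NumPy; the knots and the cubic degree are our choices):

import numpy as np

def truncated_power_basis(x, knots, degree):
    # columns: 1, x, ..., x^degree, (x - zeta_1)_+^degree, ..., (x - zeta_{K-1})_+^degree
    poly = [x ** j for j in range(degree + 1)]
    trunc = [np.maximum(x - z, 0.0) ** degree for z in knots]
    return np.column_stack(poly + trunc)

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=100)

B = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], degree=3)   # cubic regression spline
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta   # fit is continuous (with continuous derivatives) at the knots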

SPLINES
So for $d = 1$ and two knots $\zeta_1 < \zeta_2$ this would be:

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x, \quad b^{[3]}(x) = (x - \zeta_1)_+, \quad b^{[4]}(x) = (x - \zeta_2)_+$

Those fixed-knot splines are also known as regression splines.

SPLINES
(Figure: a regression spline fit on the example data, $y$ against $x$.)
CUBIC SPLINES

It is claimed that cubic splines are the lowest-order splines for which “non-smoothness” is not visible to the human eye.
Cubic splines yield models $f(x)$ that have continuous first and second order derivatives at the knots.
There is seldom any good reason to go beyond cubic splines, unless one is interested in smooth derivatives.
The following basis functions represent a cubic spline with knots $\zeta_1$ and $\zeta_2$:

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x, \quad b^{[3]}(x) = x^2, \quad b^{[4]}(x) = x^3, \quad b^{[5]}(x) = (x - \zeta_1)_+^3, \quad b^{[6]}(x) = (x - \zeta_2)_+^3$

EXAMPLE: CUBIC SPLINES

(Figure: a cubic spline fit on the example data, $y$ against $x$.)
NATURAL CUBIC SPLINES

The behavior of polynomials tends to be erratic near the boundaries; extrapolation can be dangerous.
A natural cubic spline adds additional constraints, namely that the function is linear beyond the boundary knots.
The basis for a natural cubic spline with $K$ knots can be written as

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x,$
$b^{[k+2]}(x) = d^{[k]}(x) - d^{[K-1]}(x), \quad k = 1, \ldots, K - 2,$
$d^{[k]}(x) = \frac{(x - \zeta_k)_+^3 - (x - \zeta_K)_+^3}{\zeta_K - \zeta_k}.$

The proof that these are basis functions for cubic splines fulfilling the additional constraints is left as an exercise to the reader.

SMOOTHING SPLINES

It is hard to choose the knots “appropriately”.
A method to avoid the knot selection problem is to start with a maximal set of knots.
Introduce a “penalty” for the complexity of the model.
Consider the problem: Among all functions $f(x)$ (with two continuous derivatives), find the one that minimizes the penalized sum of squares

$\mathcal{R}_{\text{reg}} = \sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)}\right)\right)^2 + \lambda \int \{f''(t)\}^2 \, dt$

The first term is the empirical risk; the second term penalizes curvature in the function; $\lambda$ is called the smoothing (or regularization) parameter and establishes a trade-off between the two.
$\lambda = 0$: $f(\cdot)$ can be any function that interpolates the data.
$\lambda = \infty$: the normal linear regression model (no curvature is tolerated).

SMOOTHING SPLINES
Remarkably, it can be shown that this problem has an explicit, unique minimizer, which is a natural cubic spline with knots at $x^{(i)}$, $i = 1, \ldots, n$ (for more details, see Hastie et al. (2009), The Elements of Statistical Learning, Section 5.4).

SMOOTHING SPLINES
Different values of $\lambda$ lead to different fits on the data. The right $\lambda$ can be found by cross-validation over $\lambda$, for example. The parameter $\lambda$ is often called the regularization parameter. For more details on regularization, see also the lecture Introduction to Statistical Learning.
(Figure: smoothing spline fits for $\lambda = 0.001$, $\lambda = 0.01$, and $\lambda = 0.1$.)
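SciPy offers a practical way to fit such smoothing splines. As a caveat, scipy.interpolate.UnivariateSpline controls smoothing via a smoothing factor s (an upper bound on the residual sum of squares that determines how many knots are used) rather than via $\lambda$ directly, so this sketch only approximates the penalized formulation above:

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.15, size=100)

# larger s -> smoother fit (fewer effective degrees of freedom),
# analogous to increasing the regularization parameter lambda above
for s in [0.5, 2.0, 8.0]:
    spl = UnivariateSpline(x, y, k=3, s=s)
    print(s, np.mean((y - spl(x)) ** 2))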

(UNIVARIATE) REGRESSION STUMPS
Scenario: Univariate regression with $\mathcal{X} \subset \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}$.

Idea:
We divide our data into two groups $\mathcal{N}_1$ and $\mathcal{N}_2$, depending on the value of the input variable $x$:

$\left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_1 \iff x^{(i)} \le t$
$\left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_2 \iff x^{(i)} > t.$

The two groups $\mathcal{N}_1$ and $\mathcal{N}_2$ are called nodes. The variable $x$ is called the split variable and $t$ is called the split point.
If we have multidimensional inputs, i.e. $\mathbf{x} \in \mathcal{X}$ with $\dim(\mathcal{X}) = p$, $p > 1$, we additionally have to decide on a splitting variable $x_j$, $j = 1, \ldots, p$.

(UNIVARIATE) REGRESSION STUMPS
In each of the nodes, we make a constant prediction:

$f\left(x^{(i)}\right) = c_1 \iff \left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_1$
$f\left(x^{(i)}\right) = c_2 \iff \left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_2$

with $c_1, c_2 \in \mathbb{R}$.
The hypothesis space of regression stumps is:

$\mathcal{H} = \left\{f : \mathcal{X} \to \mathbb{R} \;\middle|\; f(x) = \begin{cases} c_1 & \text{if } x \le t \\ c_2 & \text{if } x > t \end{cases}, \; c_1, c_2 \in \mathcal{Y}, \; t \in \mathcal{X}\right\}$

Regression stumps can thus predict two values $c_1, c_2$ only.
Regression stumps are regression trees with a single split.

REGRESSION STUMPS
The figure shows an example regression tree for the cars dataset. When the feature speed is smaller than 18, the predicted stopping distance (dist) is 29; when it is 18 or larger, the predicted stopping distance is 65.

(Figure: regression stump for the cars data. Root node: n = 50, mean dist = 43; split "speed < 18": left node with n = 31 and prediction 29, right node with n = 19 and prediction 65. The right panel shows the resulting step function of dist against speed.)
REGRESSION STUMPS: RISK MINIMIZATION
Let $\mathcal{D}$ be the training data. “Fitting” a regression stump means finding an “optimal” splitting variable $x_j$ and an “optimal” split point $t$,

$\mathcal{N}_1 = \left\{\left(\mathbf{x}^{(i)}, y^{(i)}\right) \in \mathcal{D} : x_j^{(i)} \le t\right\} \quad \text{and} \quad \mathcal{N}_2 = \left\{\left(\mathbf{x}^{(i)}, y^{(i)}\right) \in \mathcal{D} : x_j^{(i)} > t\right\},$

by trying all possible split points $t$. In order to quantify the (negative) quality of a considered split, we compute the empirical risks of both child nodes and sum them up:

$\mathcal{R}(j, t) = \mathcal{R}(\mathcal{N}_1) + \mathcal{R}(\mathcal{N}_2).$

The risk $\mathcal{R}(\mathcal{N})$ of a node is simply the summed loss over the data contained in that node under a selected loss function $L$,

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}, y) \in \mathcal{N}} L(y, c),$

if we model the data in that node with an optimal constant $c = \arg\min_c \mathcal{R}(\mathcal{N})$.
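A sketch of this fitting procedure for a univariate regression stump under the L2 loss (assuming NumPy; candidate split points are taken as midpoints between consecutive sorted x values):

import numpy as np

def fit_regression_stump(x, y):
    """Exhaustive search over split points; under the L2 loss the node predictions are means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        t = 0.5 * (xs[i - 1] + xs[i])          # candidate split point
        left, right = ys[:i], ys[i:]
        risk = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or risk < best[0]:
            best = (risk, t, left.mean(), right.mean())
    return best                                 # (risk, t, c1, c2)

rng = np.random.default_rng(9)
x = rng.uniform(0, 25, 80)
y = np.where(x < 15, 2.0, 6.0) + rng.normal(scale=0.5, size=80)
print(fit_regression_stump(x, y))               # the split point t should be close to 15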

REGRESSION STUMPS: SPLITTING CRITERIA

For regression, we usually use the L2 loss:

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}, y) \in \mathcal{N}} (y - c)^2$

The best constant under the L2 loss is the mean $c = \bar{y}_{\mathcal{N}} = \frac{1}{|\mathcal{N}|} \sum_{(\mathbf{x}, y) \in \mathcal{N}} y$.
Up to a constant factor, we have just computed the variance of the label distribution in $\mathcal{N}$; we can also interpret this as a way of measuring the impurity of the distribution / the fluctuation around the constant.

(Figure: label distribution in a node, with the mean $\pm$ 1 standard deviation indicated.)
REGRESSION STUMPS: SPLITTING CRITERIA
We could use any other type of regression loss to measure the quality of our nodes:

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}, y) \in \mathcal{N}} L(y, c).$

$c$ is then the optimal constant model w.r.t. the chosen loss function.
In Chapter 4, we derived the optimal constant models w.r.t. different loss functions.
Accordingly, within each node $\mathcal{N}$, $c$ would be the median for the L1 loss, a quantile for the pinball loss, etc.
If there is no closed-form solution (e.g. for the Huber loss), we can obtain the optimal constant model for each node by numerical optimization methods such as gradient descent.

REGRESSION TREES

If we recursively split the nodes again and again, we end up with regression trees (see the lecture Introduction to Statistical Learning).

(Figure: visualization of a regression tree for the ozone data set.)

Classification

DECISION STUMPS

Similar to the idea of regression stumps, decision stumps can only predict two score / probability vectors $c_1, c_2 \in \mathbb{R}^g$.
Based on a splitting variable $x_j$ and a split point $t$, the data is split into two groups.
In each of the two groups, we make a constant prediction.
The hypothesis space of decision stumps is

$\mathcal{H} = \left\{f : \mathcal{X} \to \mathbb{R}^g \;\middle|\; f(\mathbf{x}) = \begin{cases} c_1 & \text{if } x_j \ge t \\ c_2 & \text{if } x_j < t \end{cases}, \; c_1, c_2 \in \mathbb{R}^g, \; j \in \{1, \ldots, p\}\right\}$

DECISION STUMPS

(Figure: decision stump on the iris data. The root node contains 50 observations of each class (setosa, versicolor, virginica); the split "Petal.Length < 2" sends all 50 setosa observations to one leaf and the 50 versicolor and 50 virginica observations to the other. The right panel shows the data in the Petal.Width / Petal.Length plane with the split.)
DECISION STUMPS: SPLITTING CRITERIA

Common choices for the loss function are the Brier score (L2 loss on probabilities) and the Bernoulli loss.
We usually model the constant prediction in node $\mathcal{N}$ by simply calculating the class proportions: the probability for class $k$ is

$\pi_k^{(\mathcal{N})} = \frac{1}{|\mathcal{N}|} \sum_{(\mathbf{x}, y) \in \mathcal{N}} \mathbb{1}\{y = k\}$

This is the optimal constant under the two losses mentioned above.


(Figure: bar plot of the class proportions in a node for labels 1, 2, 3.)
SPLITTING CRITERIA: COMMENTS

Tree splitting is usually introduced via the concept of “impurity reduction”.
Based on an impurity measure $I$, we maximize

$\max_{\mathcal{N}_1, \mathcal{N}_2} \; I(\mathcal{N}) - \frac{n_{\mathcal{N}_1}}{n_{\mathcal{N}}} I(\mathcal{N}_1) - \frac{n_{\mathcal{N}_2}}{n_{\mathcal{N}}} I(\mathcal{N}_2),$

where $n_{\mathcal{N}} := |\mathcal{N}|$ denotes the number of observations in $\mathcal{N}$.

As $n_{\mathcal{N}}$ and $I(\mathcal{N})$ are constant, impurity reduction is equivalent to our approach, which is more in line with empirical risk minimization and our previous concepts:

$n_{\mathcal{N}_1} I(\mathcal{N}_1) + n_{\mathcal{N}_2} I(\mathcal{N}_2) = \mathcal{R}(\mathcal{N}_1) + \mathcal{R}(\mathcal{N}_2).$

Whereas the risk grows with the number of observations in a node according to our definition (the losses are simply summed up), the impurity of a node is an “average” measure.

SPLITTING CRITERIA: COMMENTS
A typical impurity measure is the Gini impurity

$I(\mathcal{N}) = \sum_{k \ne k'} \pi_k^{(\mathcal{N})} \pi_{k'}^{(\mathcal{N})} = \sum_{k=1}^g \pi_k^{(\mathcal{N})} \left(1 - \pi_k^{(\mathcal{N})}\right),$

which is equivalent to risk minimization w.r.t. the Brier score.

Another typical impurity measure is the entropy

$I(\mathcal{N}) = -\sum_{k=1}^g \pi_k^{(\mathcal{N})} \log \pi_k^{(\mathcal{N})},$

which is equivalent to risk minimization w.r.t. the log loss.

GINI SPLITTING = BRIER SCORE MINIMIZATION
Let us denote by $n_{\mathcal{N},k}$ the number of class-$k$ observations in node $\mathcal{N}$ and by $n_{\mathcal{N}}$ the overall number of observations in node $\mathcal{N}$.

We show that

$\mathcal{R}(\mathcal{N}) = n_{\mathcal{N}} \, I(\mathcal{N}),$

where $I$ is the Gini impurity and $\mathcal{R}(\mathcal{N})$ is calculated w.r.t. the Brier score

$L(y, \pi(\mathbf{x})) = \sum_{k=1}^g \left(\mathbb{1}\{y = k\} - \pi_k(\mathbf{x})\right)^2.$

GINI SPLITTING = BRIER SCORE MINIMIZATION

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \sum_{k=1}^g \left(\mathbb{1}\{y^{(i)} = k\} - \pi_k\left(\mathbf{x}^{(i)}\right)\right)^2$
$= \sum_{k=1}^g \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \left(\mathbb{1}\{y^{(i)} = k\} - \pi_k\left(\mathbf{x}^{(i)}\right)\right)^2$
$= \sum_{k=1}^g \left[ n_{\mathcal{N},k} \left(1 - \frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 + \left(n_{\mathcal{N}} - n_{\mathcal{N},k}\right) \left(\frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 \right]$

In the last step we plugged in the optimal prediction w.r.t. the Brier score (the fraction of class-$k$ observations)

$\hat{\pi}_k(\mathbf{x}) = \pi_k^{(\mathcal{N})} = \frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}.$

We further simplify the expression to

GINI SPLITTING = BRIER SCORE MINIMIZATION

$\mathcal{R}(\mathcal{N}) = \sum_{k=1}^g \left[ n_{\mathcal{N},k} \left(\frac{n_{\mathcal{N}} - n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 + \left(n_{\mathcal{N}} - n_{\mathcal{N},k}\right) \left(\frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 \right]$
$= \sum_{k=1}^g \frac{n_{\mathcal{N},k} \left(n_{\mathcal{N}} - n_{\mathcal{N},k}\right)}{n_{\mathcal{N}} \cdot n_{\mathcal{N}}} \left(n_{\mathcal{N}} - n_{\mathcal{N},k} + n_{\mathcal{N},k}\right)$
$= n_{\mathcal{N}} \sum_{k=1}^g \pi_k^{(\mathcal{N})} \left(1 - \pi_k^{(\mathcal{N})}\right) = n_{\mathcal{N}} \, I(\mathcal{N}).$
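A quick numerical sanity check of this identity (assuming NumPy; the class counts of the node are made up):

import numpy as np

counts = np.array([12, 5, 3])          # class counts n_{N,k} in a node
n = counts.sum()
pi = counts / n                        # class proportions = optimal Brier-score prediction

# Brier-score risk of the node: sum over observations and classes of (1{y=k} - pi_k)^2
one_hot = np.repeat(np.eye(len(counts)), counts, axis=0)   # one row per observation
brier_risk = np.sum((one_hot - pi) ** 2)

gini = np.sum(pi * (1 - pi))           # Gini impurity
assert np.isclose(brier_risk, n * gini)
print(brier_risk, n * gini)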

ENTROPY SPLITTING = LOG LOSS MINIMIZATION
The logarithmic loss for multiple classes $y \in \{1, 2, \ldots, g\}$ is defined as

$L(y, \pi(\mathbf{x})) = -\sum_{k=1}^g \mathbb{1}\{y = k\} \cdot \log\left(\pi_k(\mathbf{x})\right).$

The risk of a node $\mathcal{N}$ w.r.t. the logarithmic loss is then

$\mathcal{R}(\mathcal{N}) = -\sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \sum_{k=1}^g \mathbb{1}\{y^{(i)} = k\} \log \pi_k\left(\mathbf{x}^{(i)}\right)$
$= -\sum_{k=1}^g \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \mathbb{1}\{y^{(i)} = k\} \log \pi_k^{(\mathcal{N})}$
$= -\sum_{k=1}^g n_{\mathcal{N},k} \log \pi_k^{(\mathcal{N})} = -n_{\mathcal{N}} \sum_{k=1}^g \pi_k^{(\mathcal{N})} \log \pi_k^{(\mathcal{N})} = n_{\mathcal{N}} \, I(\mathcal{N}),$

ENTROPY SPLITTING = LOG LOSS MINIMIZATION
plugging in $\pi_k(\mathbf{x}) = \pi_k^{(\mathcal{N})}$ for $k = 1, \ldots, g$ (which is optimal for the log loss) in the second step.

So splitting w.r.t. the logarithmic loss is equivalent to entropy splitting.

The entropy will be introduced in a later chapter on “Information Theory”.

Remark: Stumps/trees with entropy splitting use the same loss function
as logistic regression (binary) / softmax regression (multiclass). While
logistic regression is based on the hypothesis space of linear
functions, stumps/trees use step functions as hypothesis spaces.

LOGISTIC REGRESSION
Multivariate logistic regression combines the hypothesis space of linear functions

$\mathcal{H} = \left\{f : \mathcal{X} \to \mathbb{R} \mid f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}\right\}$

with the logistic loss

$L(y, f(\mathbf{x})) = \log\left[1 + \exp\left(-y f(\mathbf{x})\right)\right].$

We transform scores into probabilities by

$\pi(\mathbf{x}) = \mathbb{P}(y = 1 \mid \mathbf{x}) = s(f(\mathbf{x})) = \frac{1}{1 + \exp(-f(\mathbf{x}))},$

with $s(\cdot)$ being the logistic sigmoid function as introduced in Chapter 2.

LOGISTIC REGRESSION
As already shown before, an equivalent approach that directly outputs probabilities $\pi(\mathbf{x})$ is to minimize the Bernoulli loss

$L(y, \pi(\mathbf{x})) = -y \log(\pi(\mathbf{x})) - (1 - y) \log(1 - \pi(\mathbf{x}))$

for $\pi(\mathbf{x})$ in the hypothesis space

$\mathcal{H} = \left\{\pi : \mathcal{X} \to [0, 1] \mid \pi(\mathbf{x}) = s\left(\boldsymbol{\theta}^{\top} \mathbf{x}\right)\right\}$

with $s(\cdot)$ again being the logistic function.
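A minimal sketch of fitting logistic regression by gradient descent on the mean Bernoulli loss (assuming NumPy; labels are coded 0/1 here, and the learning rate and iteration count are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the mean Bernoulli loss; X should contain a constant-1 column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ theta)
        grad = X.T @ (pi - y) / len(y)    # gradient of the mean negative log-likelihood
        theta -= lr * grad
    return theta

rng = np.random.default_rng(10)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, sigmoid(X @ theta_true))

print(fit_logreg(X, y))                   # roughly recovers theta_true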

LOGISTIC REGRESSION
Logistic regression with one feature $x \in \mathbb{R}$. The figure shows the mapping $x \mapsto \pi(x)$.

(Figure: the sigmoid-shaped curve $\pi(x)$, rising from 0 to 1 over the range of $x$.)

LOGISTIC REGRESSION
Logistic regression with two features:

(Figure: logistic regression fit with two features.)
LOGISTIC REGRESSION: DISCRIMINANT
FUNCTIONS AND DECISION BOUNDARY
Logistic regression gives us a linear classifier.

If we apply $g(y) = \log\left(\frac{y}{1 - y}\right)$ (which is a monotone, rank-preserving function) to $\pi(\mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^{\top} \mathbf{x})}$, we get

$g(\pi(\mathbf{x})) = \log(\pi(\mathbf{x})) - \log(1 - \pi(\mathbf{x}))$
$= -\log\left(1 + \exp(-f(\mathbf{x}))\right) - \log\left(1 - \frac{1}{1 + \exp(-f(\mathbf{x}))}\right)$
$= -\log\left(1 + \exp(-f(\mathbf{x}))\right) - \log\left(\frac{\exp(-f(\mathbf{x}))}{1 + \exp(-f(\mathbf{x}))}\right)$
$= -\log\left(1 + \exp(-f(\mathbf{x}))\right) - \log\left(\exp(-f(\mathbf{x}))\right) + \log\left(1 + \exp(-f(\mathbf{x}))\right)$
$= f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}.$

LOGISTIC REGRESSION: DISCRIMINANT
FUNCTIONS AND DECISION BOUNDARY
$f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}$ is the discriminant function of logistic regression, and $f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x} = 0$ represents the decision boundary, which is a hyperplane in the $p$-dimensional space.

The discriminant function can be interpreted as a log-odds ratio:

$\log\left(\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right) = \log\left(\frac{\mathbb{P}(y = 1 \mid \mathbf{x})}{\mathbb{P}(y = 0 \mid \mathbf{x})}\right) = \boldsymbol{\theta}^{\top} \mathbf{x} = f(\mathbf{x}).$

EXCURSUS: HYPERPLANES
A hyperplane in $\mathcal{X} = \mathbb{R}^p$ is a $(p - 1)$-dimensional linear subspace defined by a normal vector $\boldsymbol{\theta}$ (usually with $\|\boldsymbol{\theta}\| = 1$), perpendicular to the hyperplane, and an offset $\theta_0$.

For $f(\mathbf{x}) := \boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0$, the hyperplane is defined as

$\{\mathbf{x} \in \mathcal{X} : \boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0 = 0\} = \{\mathbf{x} \in \mathcal{X} \mid f(\mathbf{x}) = 0\}$

(Figure: illustrations of separating hyperplanes in two and three dimensions, with the normal vector $\boldsymbol{\theta}$ and points labeled + and − on either side.)
EXCURSUS: HYPERPLANES
We call $\{\mathbf{x} \in \mathcal{X} : f(\mathbf{x}) > 0\}$ (the half-space $\boldsymbol{\theta}$ is pointing at) the positive half-space, and $\{\mathbf{x} \in \mathcal{X} : f(\mathbf{x}) < 0\}$ the negative half-space.

The distance between a point $\mathbf{x} \in \mathcal{X}$ and a hyperplane $f$ is

$d(f, \mathbf{x}) = \frac{|\boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0|}{\|\boldsymbol{\theta}\|} = \frac{|f(\mathbf{x})|}{\|\boldsymbol{\theta}\|},$

i.e., $d(f, \mathbf{0}) = |\theta_0| / \|\boldsymbol{\theta}\|$.

For unit-length $\boldsymbol{\theta}$, these simplify to

$d(f, \mathbf{x}) = |f(\mathbf{x})| \quad \text{and} \quad d(f, \mathbf{0}) = |\theta_0|.$
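A tiny sketch of the distance formula (assuming NumPy; the hyperplane coefficients are made up):

import numpy as np

theta = np.array([3.0, 4.0])       # normal vector, ||theta|| = 5
theta_0 = -10.0

def signed_distance(x):
    # positive in the half-space theta points at, negative in the other one
    return (theta @ x + theta_0) / np.linalg.norm(theta)

print(signed_distance(np.array([2.0, 1.0])))   # 0.0 -> the point lies on the hyperplane
print(signed_distance(np.array([0.0, 0.0])))   # -2.0 = -|theta_0| / ||theta|| here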
LOGISTIC REGRESSION: OPTIMIZATION
Optimization:
There is no closed-form solution to the risk minimization problem defined by the above loss function and hypothesis space. Optimization can be done via gradient descent or second-order methods such as BFGS.

Claim: The problem of fitting a linear hyperplane $f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}$ w.r.t. the logistic loss does not have a bounded solution if the data are linearly separable.

LOGISTIC REGRESSION: OPTIMIZATION
(Figure: the logistic loss $L(y f(\mathbf{x}))$ plotted against the margin $y f(\mathbf{x})$.)

More generally, the problem of fitting a linear hyperplane w.r.t. any loss function that never becomes exactly 0 (logistic loss, exponential loss, ...), i.e. with

$L(y, f(\mathbf{x})) \to 0 \;\text{ for }\; y f(\mathbf{x}) \to \infty \qquad \text{and} \qquad L(y, f(\mathbf{x})) > 0 \;\text{ for all }\; y, f(\mathbf{x}),$

does not have a bounded solution.

LOGISTIC REGRESSION: OPTIMIZATION
Proof:
Assume we are fitting a linear hyperplane $f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0$.

If the data are linearly separable, there exist $\boldsymbol{\theta}, \theta_0$ such that

$y^{(i)} \left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + \theta_0\right) > 0 \quad \forall i \in \{1, 2, \ldots, n\}.$

(Figure: the logistic loss with the positive margins $y^{(1)} f(\mathbf{x}^{(1)})$, $y^{(2)} f(\mathbf{x}^{(2)})$, $y^{(3)} f(\mathbf{x}^{(3)})$ of three training points marked on the horizontal axis.)

Multiplying the coefficients by a constant $c > 1$ preserves linear separability:

LOGISTIC REGRESSION: OPTIMIZATION

$y^{(i)} \left(c \cdot \boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + c \cdot \theta_0\right) = c \cdot y^{(i)} \left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + \theta_0\right) > 0$

However, since $y^{(i)} \left(c \cdot \boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + c \cdot \theta_0\right) > y^{(i)} \left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + \theta_0\right)$, multiplying the parameters by a constant $c > 1$ will decrease the pointwise loss for all $i \in \{1, \ldots, n\}$.

(Figure: the logistic loss again; after scaling by $c$, the margins $y^{(i)} f(\mathbf{x}^{(i)})$ move further to the right, where the loss is smaller.)

LOGISTIC REGRESSION: OPTIMIZATION
Thus

$\mathcal{R}_{\text{emp}}(f) = \sum_{i=1}^n L\left(y^{(i)}, c \cdot f\left(\mathbf{x}^{(i)}\right)\right) \to 0 \quad \text{for } c \to \infty.$

The solution to the problem is not bounded. A standard optimizer such as gradient descent will not converge.
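A small sketch illustrating this divergence on linearly separable data (assuming NumPy; the data, learning rate, and iteration counts are made up): the empirical risk keeps shrinking while $\|\boldsymbol{\theta}\|$ keeps growing.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# perfectly separable 1D data (y coded 0/1), plus a constant-1 column for the intercept
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

theta = np.zeros(2)
for it in range(1, 50001):
    pi = sigmoid(X @ theta)
    if it % 10000 == 0:
        risk = -np.mean(y * np.log(pi) + (1 - y) * np.log(1 - pi))
        print(it, round(risk, 6), round(np.linalg.norm(theta), 2))
    theta -= 1.0 * X.T @ (pi - y) / len(y)   # gradient step on the mean Bernoulli loss
# the printed risk approaches 0 while ||theta|| grows without bound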

LOGISTIC REGRESSION: OPTIMIZATION
Ways out:
Instead of the logistic loss we could use a loss function that becomes exactly 0 for large margins $y f(\mathbf{x})$. This, however, implies that the loss function is not smooth and more difficult to optimize.
An example of such a loss function is the hinge loss (see the SVM chapter in the lecture Introduction to Statistical Learning).

(Figure: the hinge loss plotted against the margin $y f(\mathbf{x})$; it is exactly 0 for sufficiently large margins.)

Another possibility is to penalize large parameters $\boldsymbol{\theta}$. This option is further discussed in the context of regularization (see also the lecture Introduction to Statistical Learning).
