Advanced Statistical Learning

Chapter 5: Univariate and Linear Modelling


Bernd Bischl, Julia Moosbauer, Andreas Groll

Department of Statistics – TU Dortmund


Winter term 2020/21
UNIVARIATE AND LINEAR MODELING
In this chapter we will formally introduce some very basic hypothesis
spaces.

For regression, we will introduce

uni- and multivariate linear model classes
polynomial model classes
splines
regression stumps

and, for classification,

decision stumps
uni- and multivariate logistic regression.

UNIVARIATE AND LINEAR MODELING
We will talk about those model classes for two reasons:
1 to deepen our understanding of the risk minimization principle;
2 to use those basic model classes as components of other models: simple model classes can serve as the building blocks of “stronger” models (bagging, boosting).

Univariate and Linear Regression Modelling

UNIVARIATE REGRESSION
Let us consider a univariate regression task with one-dimensional input $x \in \mathbb{R}$ and continuous response $y \in \mathbb{R}$.

(Figure: scatterplot of the example data, $y$ against $x$.)
HYPOTHESIS SPACE: LINEAR MODELS
Assume we would like to find the straight line that fits our data “best”.
We consider the hypothesis space of univariate linear models:

$\mathcal{H} = \{ f(x) = \theta_0 + \theta_1 x \mid \theta_0, \theta_1 \in \mathbb{R} \}$

(Figure: the same data, $y$ against $x$.)
LINEAR MODEL WITH L2-LOSS
Let us consider the L2-loss as loss function:

$L(y, f(x)) = (y - f(x))^2 = (y - \theta_0 - \theta_1 x)^2$

(Figure: data points, the fitted line $\hat{y}$, and the squared residuals $(y^{(i)} - f(x^{(i)}))^2$.)
LINEAR MODEL WITH L2-LOSS
Given $\mathcal{D} = \left( \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right) \right) \subset (\mathcal{X} \times \mathcal{Y})^n$, we minimize the empirical risk

$\min_{\theta \in \mathbb{R}^2} \mathcal{R}_{\text{emp}}(\theta) = \min_{\theta \in \mathbb{R}^2} \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)} \mid \theta\right)\right) = \min_{\theta_0, \theta_1 \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \left(y^{(i)} - \theta_0 - \theta_1 x^{(i)}\right)^2$

We minimize this expression analytically by setting the first derivatives (w.r.t. $\theta_0$, $\theta_1$) to 0 and rearranging terms (the derivation is left as an exercise).

LINEAR MODEL WITH L2-LOSS
We get
$\hat{\theta}_0 \overset{(*)}{=} \bar{y} - \hat{\theta}_1 \bar{x}$

$\hat{\theta}_1 \overset{(*)}{=} \frac{\sum_{i=1}^n \left(x^{(i)} - \bar{x}\right)\left(y^{(i)} - \bar{y}\right)}{\sum_{i=1}^n \left(x^{(i)} - \bar{x}\right)^2} = \rho_{xy} \cdot \frac{s_y}{s_x}$

with

$\bar{y} := \frac{1}{n} \sum_{i=1}^n y^{(i)}$ and $\bar{x} := \frac{1}{n} \sum_{i=1}^n x^{(i)}$ denoting the empirical means of $y$ and $x$, respectively,

$\rho_{xy} := \frac{\sum_{i=1}^n (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}{n \cdot s_x s_y}$ being the sample correlation coefficient,

$s_x := \sqrt{\frac{1}{n} \sum_{i=1}^n (x^{(i)} - \bar{x})^2}$ and $s_y := \sqrt{\frac{1}{n} \sum_{i=1}^n (y^{(i)} - \bar{y})^2}$ denoting the uncorrected sample standard deviations of $x$ and $y$.

(*) Remark: The derivation of the above statements is left as an exercise (optional).
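The closed-form estimates above are easy to check numerically. The following minimal sketch (assuming NumPy is available; the simulated data and variable names are our own) computes $\hat{\theta}_0, \hat{\theta}_1$ and verifies the identity $\hat{\theta}_1 = \rho_{xy} \cdot s_y / s_x$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=100)   # simulated linear data

x_bar, y_bar = x.mean(), y.mean()
theta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta_0 = y_bar - theta_1 * x_bar

# identity: theta_1 = rho_xy * s_y / s_x (uncorrected standard deviations)
s_x, s_y = x.std(), y.std()
rho_xy = np.sum((x - x_bar) * (y - y_bar)) / (len(x) * s_x * s_y)
assert np.isclose(theta_1, rho_xy * s_y / s_x)
print(theta_0, theta_1)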

LINEAR MODEL WITH L2-LOSS AND
CORRELATION
In turn, we see that the correlation coefficient captures the linear correlation between the random variables $x$ and $y$:

$\rho_{xy} = \frac{\hat{\theta}_1 \cdot s_x}{s_y}$

where
$\hat{\theta}_1$ is the slope of the “best” straight line w.r.t. the L2 loss,
$s_y$ describes how the data points are scattered along the $y$-axis,
$s_x$ describes how the data points are scattered along the $x$-axis.

EVALUATION MEASURES: R-SQUARED
Let us evaluate the performance of a (univariate) linear model $f\left(x^{(i)} \mid \theta\right)$. One common measure is the $R^2$ measure:

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2}$

The $R^2$ measures the quality of a fit by comparing how the data points are scattered around the model predictions $f(x) = \theta_0 + \theta_1 x$ (orange lines) with how they are scattered along the $y$-axis (blue lines).
(Figure: two example datasets; blue lines indicate the scatter along the $y$-axis, orange lines the scatter around the model $f(x)$.)
EVALUATION MEASURES: R-SQUARED

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2}$

What we described informally as “how the data points are scattered” is measured by a sample variance.
Thus, the $R^2$ is usually introduced as the fraction of variance explained by the model.
Equivalently, we can interpret it as the SSE w.r.t. $f\left(x^{(i)} \mid \theta\right)$ vs. the SSE w.r.t. $\bar{y}$.
We arrive at a much simpler explanation.

EVALUATION MEASURES: R-SQUARED
The $R^2$ compares the SSE of a constant model fitted w.r.t. the L2 loss, $c(x) = \bar{y}$ (baseline), with that of a more complex model (the linear model), on some data, usually the same data as used for model fitting:

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2} = 1 - \frac{\mathcal{R}_{\text{emp}}(\theta)}{\mathcal{R}_{\text{emp}}(c)}$
(Figure: the same data shown twice, once with the constant baseline $\bar{y}$ and once with the fitted line $\hat{y}$, together with the corresponding residuals.)
EVALUATION MEASURES: R-SQUARED
$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2} = 1 - \frac{\mathcal{R}_{\text{emp}}(\theta)}{\mathcal{R}_{\text{emp}}(c)}$

$R^2 = 1$ implies that all residuals are 0 and we predict perfectly; $R^2 = 0$ implies that we predict no better than the constant baseline.
If measured on the training data, $R^2 \in [0, 1]$, as the linear model (LM) must be at least as good as the constant, and both SSEs are non-negative.
On other data (test data), it can even be negative, as there is no guarantee that the LM generalizes better than a constant (overfitting is possible).
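A minimal numerical sketch of these properties (assuming NumPy; the data and the train/test split are made up for illustration). On the training data the value lies in $[0, 1]$; on fresh test data it can be smaller and, for a badly generalizing model, even negative.

import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SSE(model) / SSE(constant baseline ybar)."""
    sse_model = np.sum((y - y_hat) ** 2)
    sse_const = np.sum((y - y.mean()) ** 2)
    return 1 - sse_model / sse_const

rng = np.random.default_rng(1)
x = rng.uniform(0, 6, 50)
y = 2 + 0.5 * x + rng.normal(scale=1.0, size=50)

# closed-form univariate L2 fit (see the estimates derived above)
theta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta_0 = y.mean() - theta_1 * x.mean()
print(r_squared(y, theta_0 + theta_1 * x))            # training R^2, in [0, 1]

x_test = rng.uniform(0, 6, 50)
y_test = 2 + 0.5 * x_test + rng.normal(scale=1.0, size=50)
print(r_squared(y_test, theta_0 + theta_1 * x_test))  # test R^2, no lower bound in general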

R-SQUARED AND THE LINEAR MODEL
Claim: $\rho_{xy}^2 = R^2$.
Let us plug the estimates of the linear model

$\hat{f}\left(x^{(i)} \mid \theta_0, \theta_1\right) = \hat{\theta}_0 + \hat{\theta}_1 x^{(i)} = \bar{y} - \rho_{xy} \frac{s_y}{s_x} \bar{x} + \rho_{xy} \frac{s_y}{s_x} x^{(i)} = \bar{y} + \rho_{xy} \frac{s_y}{s_x} \left(x^{(i)} - \bar{x}\right)$

into the formula for the $R^2$:

$R^2 = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)} \mid \theta_0, \theta_1\right)\right)^2}{\sum_{i=1}^n \left(y^{(i)} - \bar{y}\right)^2} = 1 - \frac{\sum_{i=1}^n \left(y^{(i)} - \bar{y} - \rho_{xy} \frac{s_y}{s_x} \left(x^{(i)} - \bar{x}\right)\right)^2}{n s_y^2}$

R-SQUARED AND THE LINEAR MODEL

Pn s s2
i =1 ( y (i ) ȳ )2 2⇢xy syx (x (i ) x̄ )(y (i ) ȳ ) + ⇢2xy sy2 (x (i ) x̄ )2
2 x
R = 1
nsy2
n· sy2 2n · ⇢ 2 2
xy sy +n·⇢ 2 2
xy sy
= 1
nsy2
nsy2 n⇢2xy · sy2
= 1 = ⇢2xy
nsy2

In the context of linear regression, the R 2 equals the squared


correlation.

That allows us to interpret / define correlation as a measure of how


good a linear model fits the data points.

R-SQUARED AND THE LINEAR MODEL
Remark:
The $R^2$ was originally introduced in the normal linear regression context (linear model with L2 loss).
However, the measure

$1 - \frac{\mathcal{R}_{\text{emp}}(f)}{\mathcal{R}_{\text{emp}}(c)}$

can be used for any model $f$, any baseline model $c$ ($c$ is typically less complex than $f$) and any loss function $L$ (though losing some interpretability and the nice equivalence of $R^2$ and correlation).
The $R^2$ can also be computed on test data, if one is more interested in the generalization error than in the fit.

LINEAR MODEL WITH L1-LOSS

Let us consider the L1 loss as loss function:

$L(y, f(x)) = |y - f(x)| = |y - \theta_0 - \theta_1 x|$

Given $\mathcal{D}$, we minimize the empirical risk

$\min_{f \in \mathcal{H}} \mathcal{R}_{\text{emp}}(f) = \min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n L\left(y^{(i)}, f\left(x^{(i)}\right)\right) = \min_{\theta_0, \theta_1 \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^n \left|y^{(i)} - \theta_0 - \theta_1 x^{(i)}\right|$

Minimizing this expression results in the conditional median of the response variable given $x$.
Optimization is harder due to the non-differentiability of the loss function, and there is no closed-form solution.
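Because there is no closed-form solution, the L1 fit has to be computed numerically. A small sketch (assuming SciPy; the data and starting values are made up) that minimizes the empirical L1 risk with a derivative-free optimizer:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 6, 100)
y = 1.0 + 0.7 * x + rng.standard_t(df=2, size=100)   # heavy-tailed noise with outliers

def l1_risk(theta):
    # empirical risk under the L1 loss for the linear model theta_0 + theta_1 * x
    return np.mean(np.abs(y - theta[0] - theta[1] * x))

res = minimize(l1_risk, x0=np.zeros(2), method="Nelder-Mead")
print(res.x)   # estimates of (theta_0, theta_1), robust against the outliers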
LINEAR MODEL WITH HUBER LOSS
The linear model fitted w.r.t. the Huber loss (as introduced in Chapter 4),

$L(y, f(x)) = \begin{cases} \frac{1}{2} \left(y - f(x)\right)^2 & \text{for } |y - f(x)| \le \delta \\ \delta \, |y - f(x)| - \frac{\delta^2}{2} & \text{otherwise,} \end{cases}$

is used in robust regression. The Huber loss is quadratic for small residuals but behaves like a linear function for large residuals.
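A sketch of the Huber loss and of a numerical fit under it (assuming NumPy/SciPy; the threshold name delta, the data, and the injected outliers are our choices):

import numpy as np
from scipy.optimize import minimize

def huber(residual, delta=1.0):
    # quadratic for small residuals, linear for large ones
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

rng = np.random.default_rng(3)
x = rng.uniform(0, 6, 100)
y = 1.0 + 0.7 * x + rng.normal(scale=0.3, size=100)
y[:5] += 8.0                                   # a few gross outliers

def huber_risk(theta):
    return np.mean(huber(y - theta[0] - theta[1] * x))

print(minimize(huber_risk, x0=np.zeros(2)).x)  # the fit is barely affected by the outliers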

LINEAR MODEL: L1 VS. L2 LOSS

Optimization: The L2 loss is differentiable and the empirical risk minimization problem has a closed-form solution; the L1 loss is not differentiable and has no closed-form solution.
Robustness: The L1 loss penalizes large residuals less than the L2 loss; thus, it is more robust to outliers.
The Huber loss combines the robustness of the L1 loss where residuals are large with the differentiability of the L2 loss where residuals are small.

(Figure: regression fits on the same data under the Huber, L1, and L2 losses.)
MULTIVARIATE REGRESSION
In a multivariate regression task, we have $\mathbf{x} \in \mathbb{R}^p$ and a continuous response $y \in \mathbb{R}$. We aim at “laying” a plane through the observations.

(Figure: 3D scatterplot of $y$ against the features $x_1$ and $x_2$.)
MULTIVARIATE REGRESSION
We define:

$\mathbf{X} = \begin{pmatrix} x_1^{(1)} & \cdots & x_p^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(n)} & \cdots & x_p^{(n)} \end{pmatrix}, \quad \text{or} \quad \mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & \cdots & x_p^{(1)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_p^{(n)} \end{pmatrix}$

$\mathbf{x}_j = \left(x_j^{(1)}, \ldots, x_j^{(n)}\right)^T$ is the $j$-th observed feature vector.

$\mathbf{y} = \left(y^{(1)}, \ldots, y^{(n)}\right)^T$ is the vector of target values.

The right design matrix demonstrates the trick of encoding the intercept via an additional constant-1 feature, so the feature space becomes $(p + 1)$-dimensional. This simplifies notation, e.g., we can write $f(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x}$ instead of $f(\mathbf{x}) = \boldsymbol{\theta}^T \mathbf{x} + \theta_0$.

MULTIVARIATE REGRESSION
Note: Here we assume that $\theta_0$ is included in $\boldsymbol{\theta}$.

Using matrix notation, the L2 loss can be written down compactly as

$L(\mathbf{y}, f(\mathbf{x})) = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}).$

Differentiating w.r.t. $\boldsymbol{\theta}$ yields the so-called normal equations

$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = 0.$

The optimal $\boldsymbol{\theta}$ is

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.$
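A minimal sketch of solving the normal equations (assuming NumPy; in practice one would prefer np.linalg.lstsq or a QR decomposition to forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # constant-1 column encodes the intercept
theta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ theta_true + rng.normal(scale=0.1, size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # normal equations
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)    # numerically preferable, same solution
print(theta_hat, theta_lstsq)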

MULTIVARIATE REGRESSION

(Figure: the same 3D scatterplot of $y$ against $x_1$ and $x_2$, now with the fitted regression plane.)
Univariate nonlinear modeling

POLYNOMIAL REGRESSION
Let $x \in \mathbb{R}$ and $y \in \mathbb{R}$ (univariate regression).
In the beginning, we fitted a linear model to describe the relationship between $x$ and $y$.
However, it is unlikely that the relationship is always linear.
$\rightarrow$ Use a more complex hypothesis space.
Idea: Transform the predictor variable to increase the flexibility in modeling.
The hypothesis space of polynomial regression models of degree $d$ is:

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \theta_0 + \theta_1 x^1 + \ldots + \theta_d x^d, \; \theta_j \in \mathbb{R}\right\}$

FITTING POLYNOMIAL REGRESSION
Using matrix notation makes it easy to fit a polynomial regression model with the L2 loss.

Design matrix (including the intercept term):

$\mathbf{X} = \begin{pmatrix} 1 & x_1^{(1)} & \left(x_1^{(1)}\right)^2 & \cdots & \left(x_1^{(1)}\right)^d \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \left(x_1^{(n)}\right)^2 & \cdots & \left(x_1^{(n)}\right)^d \end{pmatrix}$

The L2 loss is then

$L(\mathbf{y}, f(\mathbf{x})) = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\theta}),$

and the optimal $\boldsymbol{\theta}$ is, as shown before,

$\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
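A sketch of polynomial regression via the same least-squares machinery (assuming NumPy; np.vander builds the polynomial design matrix, the data are simulated):

import numpy as np

def fit_polynomial(x, y, degree):
    # design matrix with columns 1, x, x^2, ..., x^degree
    X = np.vander(x, N=degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_polynomial(theta, x):
    return np.vander(x, N=len(theta), increasing=True) @ theta

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=60)

theta = fit_polynomial(x, y, degree=5)
print(predict_polynomial(theta, np.array([0.1, 0.5, 0.9])))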

EXAMPLE POLYNOMIAL REGRESSION

(Figure: example data for polynomial regression, $y$ against $x$ on $[0, 1]$.)
POLYNOMIAL REGRESSION: PROBLEMS
Influence of the choice of d?
Higher degree: more flexibility and better fit
But:
Too much flexibility might result in fitting the noise rather than
the actual relationship.
If the degree is too high, the polynomial might oscillate
strongly, in particular at the edges of the domain.

EXAMPLE POLYNOMIAL REGRESSION

(Figure: polynomial regression fits of degree 2 and degree 20 on the example data, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Goal: Keep the flexibility, but don’t fit models with too high a degree $d$.

Idea:
Divide the domain of $x$ into $K$ intervals

$I_k(x) := \mathbb{1}(\zeta_{k-1} \le x \le \zeta_k) = \begin{cases} 1 & \text{if } \zeta_{k-1} \le x \le \zeta_k \\ 0 & \text{otherwise,} \end{cases} \quad k = 1, \ldots, K$

($\zeta_0$ and $\zeta_K$ being the left and the right border of the domain of $\mathcal{X}$).
$\zeta_k$, $k = 1, \ldots, K - 1$, are called knots.
Fit a polynomial model $f_k$ of degree $d$ on each interval:

$\mathcal{H} = \{f : \mathbb{R} \to \mathbb{R} \mid f(x) = f_1(x) I_1(x) + \ldots + f_K(x) I_K(x)\}$

PIECEWISE POLYNOMIALS
Example: Piecewise constant model ($d = 0$)
Divide $\mathcal{X}$ into three equally spaced intervals ($K = 3$).
Fit a constant model to each interval w.r.t. a loss function.
Here: We use the L2 loss, which means that we fit the mean of the target variable per interval.

(Figure: piecewise constant fit with $K = 3$ intervals, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Example: Piecewise linear model ($d = 1$)
Divide $\mathcal{X}$ into three equally spaced intervals ($K = 3$).
Fit a linear model $f_k(x) = \theta_{k0} + \theta_{k1} x$ to each interval w.r.t. a loss function.

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \sum_{k=1}^K f_k(x) I_k(x)\right\}$

(Figure: piecewise linear fit with $K = 3$ intervals, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Example: Piecewise quadratic model ($d = 2$)
Divide $\mathcal{X}$ into three equally spaced intervals ($K = 3$).
Fit a quadratic model $f_k(x) = \theta_{k0} + \theta_{k1} x + \theta_{k2} x^2$ to each interval w.r.t. a loss function.

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \sum_{k=1}^K f_k(x) I_k(x)\right\}$

(Figure: piecewise quadratic fit with $K = 3$ intervals, $y$ against $x$.)
PIECEWISE POLYNOMIALS
Instead of defining the hypothesis space of piecewise polynomials as

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \mid f(x) = \sum_{k=1}^K f_k(x) I_k(x)\right\},$

with $f_k(x) = \theta_{k0} + \theta_{k1} x^1 + \ldots + \theta_{kd} x^d$, we could rearrange the summands and get

$f(x) = \theta_{10} I_1(x) + \ldots + \theta_{K0} I_K(x)$
$\quad\; + \theta_{11} \cdot x \cdot I_1(x) + \ldots + \theta_{K1} \cdot x \cdot I_K(x)$
$\quad\; + \ldots$
$\quad\; + \theta_{1d} \cdot x^d \cdot I_1(x) + \ldots + \theta_{Kd} \cdot x^d \cdot I_K(x).$

PIECEWISE POLYNOMIALS
We can write the hypothesis space as

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \;\middle|\; f(x) = \sum_m \beta^{[m]} b^{[m]}(x)\right\},$

with $\beta^{[j \cdot K + k]} := \theta_{kj} \in \mathbb{R}$ and

$b^{[j \cdot K + k]}(x) = x^j \cdot I_k(x)$

for $k = 1, \ldots, K$, $j = 0, \ldots, d$.

The functions $b^{[m]}(x)$ are called basis functions.
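A sketch that builds this basis explicitly and fits a piecewise polynomial by ordinary least squares on the basis expansion (assuming NumPy; the interval boundaries are our choice):

import numpy as np

def piecewise_poly_basis(x, boundaries, degree):
    """Columns x^j * I_k(x) for j = 0, ..., degree and intervals defined by boundaries.

    boundaries must include the left and right border of the domain."""
    cols = []
    for k in range(len(boundaries) - 1):
        # make the last interval right-closed so every point belongs to exactly one interval
        if k < len(boundaries) - 2:
            ind = (boundaries[k] <= x) & (x < boundaries[k + 1])
        else:
            ind = (boundaries[k] <= x) & (x <= boundaries[k + 1])
        for j in range(degree + 1):
            cols.append((x ** j) * ind)
    return np.column_stack(cols)

rng = np.random.default_rng(6)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=100)

B = piecewise_poly_basis(x, boundaries=[0.0, 1/3, 2/3, 1.0], degree=1)   # K = 3, d = 1
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
print(beta)   # one intercept and one slope per interval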

SPLINES

Problem: Prediction functions are in general not continuous.
Therefore, constraints should be introduced to guarantee continuity at the knots:

$f_k(\zeta_k) = f_{k+1}(\zeta_k)$
$\theta_{k0} + \theta_{k1} \zeta_k + \ldots + \theta_{kd} \zeta_k^d = \theta_{(k+1)0} + \theta_{(k+1)1} \zeta_k + \ldots + \theta_{(k+1)d} \zeta_k^d$

SPLINES
Example: For $d = 1$ and one split point $\zeta_1 \in \mathcal{X}$ we aim at finding a model

$f(x) = \begin{cases} \theta_{10} + \theta_{11} x & \text{if } x < \zeta_1 \\ \theta_{20} + \theta_{21} x & \text{if } x \ge \zeta_1 \end{cases}$

that fulfills $\theta_{10} + \theta_{11} \zeta_1 = \theta_{20} + \theta_{21} \zeta_1$.

A more direct way to proceed in this case is to use a basis that directly incorporates the constraints:

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x, \quad b^{[3]}(x) = (x - \zeta_1)_+$

with $(x)_+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$
SPLINES
We see that $f(x) = \sum_{k=1}^3 \beta^{[k]} b^{[k]}(x)$ is equivalent to the above formulation with the constraint $\theta_{10} + \theta_{11} \zeta_1 = \theta_{20} + \theta_{21} \zeta_1$:

$f(x) = \begin{cases} \beta^{[1]} + \beta^{[2]} x & \text{if } x \le \zeta_1 \\ \beta^{[1]} + \beta^{[2]} x + \beta^{[3]} (x - \zeta_1) & \text{if } x > \zeta_1 \end{cases}$

$= \left(\beta^{[1]} + \beta^{[2]} x\right) I_1 + \left(\left(\beta^{[1]} - \beta^{[3]} \zeta_1\right) + \left(\beta^{[2]} + \beta^{[3]}\right) x\right) I_2$
$= \left(\theta_{10} + \theta_{11} x\right) I_1 + \left(\theta_{20} + \theta_{21} x\right) I_2$

with $\theta_{10} := \beta^{[1]}$, $\theta_{11} := \beta^{[2]}$, $\theta_{20} := \beta^{[1]} - \beta^{[3]} \zeta_1$, $\theta_{21} := \beta^{[2]} + \beta^{[3]}$, which indeed fulfills the constraint:

$\theta_{20} + \theta_{21} \zeta_1 = \beta^{[1]} - \beta^{[3]} \zeta_1 + \left(\beta^{[2]} + \beta^{[3]}\right) \zeta_1 = \beta^{[1]} + \beta^{[2]} \zeta_1 = \theta_{10} + \theta_{11} \zeta_1$

SPLINES
This equivalent formulation can be generalized to higher degrees $d$ and more split points. We can reformulate the hypothesis space as follows:

$\mathcal{H} = \left\{f : \mathbb{R} \to \mathbb{R} \;\middle|\; f(x) = \sum_m \beta^{[m]} b^{[m]}(x)\right\}$

with

$b^{[j]}(x) = x^j, \quad j = 0, \ldots, d$
$b^{[d+k]}(x) = (x - \zeta_k)_+^d, \quad k = 1, \ldots, K - 1,$

where $d$ is the order of the piecewise functions, $K$ is the number of knots, and $(\cdot)_+$ denotes the positive part. The functions $b^{[j]}$ are the basis functions.
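A sketch of a regression-spline fit using this truncated power basis (assuming NumPy; the knots and the cubic degree are our choices):

import numpy as np

def truncated_power_basis(x, knots, degree):
    # columns: 1, x, ..., x^degree, (x - zeta_1)_+^degree, ..., (x - zeta_{K-1})_+^degree
    poly = [x ** j for j in range(degree + 1)]
    trunc = [np.maximum(x - z, 0.0) ** degree for z in knots]
    return np.column_stack(poly + trunc)

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=100)

B = truncated_power_basis(x, knots=[0.25, 0.5, 0.75], degree=3)   # cubic regression spline
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta   # fit is continuous (with continuous derivatives) at the knots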

SPLINES
So for $d = 1$ and two knots $\zeta_1 < \zeta_2$ this would be:

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x, \quad b^{[3]}(x) = (x - \zeta_1)_+, \quad b^{[4]}(x) = (x - \zeta_2)_+$

Those fixed-knot splines are also known as regression splines.

SPLINES
(Figure: a regression spline fit on the example data, $y$ against $x$.)
CUBIC SPLINES

It is claimed that cubic splines are the lowest-order splines for which “non-smoothness” is not visible to the human eye.
Cubic splines yield models $f(x)$ that have continuous first and second order derivatives at the knots.
There is seldom any good reason to go beyond cubic splines, unless one is interested in smooth derivatives.
The following basis functions represent a cubic spline with knots $\zeta_1$ and $\zeta_2$:

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x, \quad b^{[3]}(x) = x^2, \quad b^{[4]}(x) = x^3, \quad b^{[5]}(x) = (x - \zeta_1)_+^3, \quad b^{[6]}(x) = (x - \zeta_2)_+^3$

EXAMPLE: CUBIC SPLINES

(Figure: a cubic spline fit on the example data, $y$ against $x$.)
NATURAL CUBIC SPLINES

The behavior of polynomials tends to be erratic near the boundaries; extrapolation can be dangerous.
A natural cubic spline adds additional constraints, namely that the function is linear beyond the boundary knots.
The basis for a natural cubic spline with $K$ knots can be written as

$b^{[1]}(x) = 1, \quad b^{[2]}(x) = x,$
$b^{[k+2]}(x) = d^{[k]}(x) - d^{[K-1]}(x), \quad k = 1, \ldots, K - 2,$
$d^{[k]}(x) = \frac{(x - \zeta_k)_+^3 - (x - \zeta_K)_+^3}{\zeta_K - \zeta_k}.$

The proof that these are basis functions for cubic splines fulfilling the additional constraints is left as an exercise to the reader.

SMOOTHING SPLINES

It is hard to choose the knots “appropriately”.
A method to avoid the knot selection problem is to start with a maximal set of knots.
Introduce a “penalty” for the complexity of the model.
Consider the problem: Among all functions $f(x)$ (with two continuous derivatives), find the one that minimizes the penalized sum of squares

$\mathcal{R}_{\text{reg}} = \sum_{i=1}^n \left(y^{(i)} - f\left(x^{(i)}\right)\right)^2 + \lambda \int \{f''(t)\}^2 \, dt$

The first term is the empirical risk; the second term penalizes curvature in the function; $\lambda$ is called the smoothing (or regularization) parameter and establishes a trade-off between the two.
$\lambda = 0$: $f(\cdot)$ can be any function that interpolates the data.
$\lambda = \infty$: the normal linear regression model (no curvature is tolerated).

SMOOTHING SPLINES
Remarkably, it can be shown that this problem has an explicit, unique minimizer, which is a natural cubic spline with knots at $x^{(i)}$, $i = 1, \ldots, n$ (for more details, see Hastie et al. (2009), The Elements of Statistical Learning, Section 5.4).

SMOOTHING SPLINES
Different values of $\lambda$ lead to different fits on the data. The right $\lambda$ can be found by cross-validation over $\lambda$, for example. The parameter $\lambda$ is often called the regularization parameter. For more details on regularization, see also the lecture Introduction to Statistical Learning.
(Figure: smoothing spline fits for $\lambda = 0.001$, $\lambda = 0.01$, and $\lambda = 0.1$.)
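SciPy offers a practical way to fit such smoothing splines. As a caveat, scipy.interpolate.UnivariateSpline controls smoothing via a smoothing factor s (an upper bound on the residual sum of squares that determines how many knots are used) rather than via $\lambda$ directly, so this sketch only approximates the penalized formulation above:

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.15, size=100)

# larger s -> smoother fit (fewer effective degrees of freedom),
# analogous to increasing the regularization parameter lambda above
for s in [0.5, 2.0, 8.0]:
    spl = UnivariateSpline(x, y, k=3, s=s)
    print(s, np.mean((y - spl(x)) ** 2))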

(UNIVARIATE) REGRESSION STUMPS
Scenario: Univariate regression with $\mathcal{X} \subset \mathbb{R}$ and $\mathcal{Y} = \mathbb{R}$.

Idea:
We divide our data into two groups $\mathcal{N}_1$ and $\mathcal{N}_2$, depending on the value of the input variable $x$:

$\left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_1 \iff x^{(i)} \le t$
$\left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_2 \iff x^{(i)} > t.$

The two groups $\mathcal{N}_1$ and $\mathcal{N}_2$ are called nodes. The variable $x$ is called the split variable and $t$ is called the split point.
If we have multidimensional inputs, i.e. $\mathbf{x} \in \mathcal{X}$ with $\dim(\mathcal{X}) = p$, $p > 1$, we additionally have to decide on a splitting variable $x_j$, $j = 1, \ldots, p$.

(UNIVARIATE) REGRESSION STUMPS
In each of the nodes, we make a constant prediction:

$f\left(x^{(i)}\right) = c_1 \iff \left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_1$
$f\left(x^{(i)}\right) = c_2 \iff \left(x^{(i)}, y^{(i)}\right) \in \mathcal{N}_2$

with $c_1, c_2 \in \mathbb{R}$.
The hypothesis space of regression stumps is:

$\mathcal{H} = \left\{f : \mathcal{X} \to \mathbb{R} \;\middle|\; f(x) = \begin{cases} c_1 & \text{if } x \le t \\ c_2 & \text{if } x > t \end{cases}, \; c_1, c_2 \in \mathcal{Y}, \; t \in \mathcal{X}\right\}$

Regression stumps can thus predict two values $c_1, c_2$ only.
Regression stumps are regression trees with a single split.

REGRESSION STUMPS
The figure shows an example regression tree for the cars dataset. When the feature speed is smaller than 18, the predicted stopping distance (dist) is 29; when it is 18 or larger, the predicted stopping distance is 65.

(Figure: regression stump for the cars data. Root node: n = 50, mean dist = 43; split "speed < 18": left node with n = 31 and prediction 29, right node with n = 19 and prediction 65. The right panel shows the resulting step function of dist against speed.)
REGRESSION STUMPS: RISK MINIMIZATION
Let $\mathcal{D}$ be the training data. “Fitting” a regression stump means finding an “optimal” splitting variable $x_j$ and an “optimal” split point $t$,

$\mathcal{N}_1 = \left\{\left(\mathbf{x}^{(i)}, y^{(i)}\right) \in \mathcal{D} : x_j^{(i)} \le t\right\} \quad \text{and} \quad \mathcal{N}_2 = \left\{\left(\mathbf{x}^{(i)}, y^{(i)}\right) \in \mathcal{D} : x_j^{(i)} > t\right\},$

by trying all possible split points $t$. In order to quantify the (negative) quality of a considered split, we compute the empirical risks of both child nodes and sum them up:

$\mathcal{R}(j, t) = \mathcal{R}(\mathcal{N}_1) + \mathcal{R}(\mathcal{N}_2).$

The risk $\mathcal{R}(\mathcal{N})$ of a node is simply the summed loss over the data contained in that node under a selected loss function $L$,

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}, y) \in \mathcal{N}} L(y, c),$

if we model the data in that node with an optimal constant $c = \arg\min_c \mathcal{R}(\mathcal{N})$.
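A sketch of this fitting procedure for a univariate regression stump under the L2 loss (assuming NumPy; candidate split points are taken as midpoints between consecutive sorted x values):

import numpy as np

def fit_regression_stump(x, y):
    """Exhaustive search over split points; under the L2 loss the node predictions are means."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        t = 0.5 * (xs[i - 1] + xs[i])          # candidate split point
        left, right = ys[:i], ys[i:]
        risk = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or risk < best[0]:
            best = (risk, t, left.mean(), right.mean())
    return best                                 # (risk, t, c1, c2)

rng = np.random.default_rng(9)
x = rng.uniform(0, 25, 80)
y = np.where(x < 15, 2.0, 6.0) + rng.normal(scale=0.5, size=80)
print(fit_regression_stump(x, y))               # the split point t should be close to 15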

REGRESSION STUMPS: SPLITTING CRITERIA

For regression, we usually use the L2 loss:

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}, y) \in \mathcal{N}} (y - c)^2$

The best constant under the L2 loss is the mean $c = \bar{y}_{\mathcal{N}} = \frac{1}{|\mathcal{N}|} \sum_{(\mathbf{x}, y) \in \mathcal{N}} y$.
Up to a constant factor, we have just computed the variance of the label distribution in $\mathcal{N}$; we can also interpret this as a way of measuring the impurity of the distribution / the fluctuation around the constant.

(Figure: label distribution in a node, with the mean $\pm$ 1 standard deviation indicated.)
REGRESSION STUMPS: SPLITTING CRITERIA
We could use any other type of regression loss to measure the quality of our nodes:

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}, y) \in \mathcal{N}} L(y, c).$

$c$ is then the optimal constant model w.r.t. the chosen loss function.
In Chapter 4, we derived the optimal constant models w.r.t. different loss functions.
Accordingly, within each node $\mathcal{N}$, $c$ would be the median for the L1 loss, a quantile for the pinball loss, etc.
If there is no closed-form solution (e.g. for the Huber loss), we can obtain the optimal constant model for each node by numerical optimization methods such as gradient descent.

REGRESSION TREES

If we recursively split the nodes again and again, we end up with regression trees (see the lecture Introduction to Statistical Learning).

(Figure: visualization of a regression tree for the ozone data set.)

Classification

DECISION STUMPS

Similar to the idea of regression stumps, decision stumps can only predict two score / probability vectors $c_1, c_2 \in \mathbb{R}^g$.
Based on a splitting variable $x_j$ and a split point $t$, the data is split into two groups.
In each of the two groups, we make a constant prediction.
The hypothesis space of decision stumps is

$\mathcal{H} = \left\{f : \mathcal{X} \to \mathbb{R}^g \;\middle|\; f(\mathbf{x}) = \begin{cases} c_1 & \text{if } x_j \ge t \\ c_2 & \text{if } x_j < t \end{cases}, \; c_1, c_2 \in \mathbb{R}^g, \; j \in \{1, \ldots, p\}\right\}$

DECISION STUMPS

(Figure: decision stump on the iris data. The root node contains 50 observations of each class (setosa, versicolor, virginica); the split "Petal.Length < 2" sends all 50 setosa observations to one leaf and the 50 versicolor and 50 virginica observations to the other. The right panel shows the data in the Petal.Width / Petal.Length plane with the split.)
DECISION STUMPS: SPLITTING CRITERIA

Common choices for the loss function are the Brier score (L2 loss on probabilities) and the Bernoulli loss.
We usually model the constant prediction in node $\mathcal{N}$ by simply calculating the class proportions: the probability for class $k$ is

$\pi_k^{(\mathcal{N})} = \frac{1}{|\mathcal{N}|} \sum_{(\mathbf{x}, y) \in \mathcal{N}} \mathbb{1}\{y = k\}$

This is the optimal constant under the two losses mentioned above.


(Figure: bar plot of the class proportions in a node for labels 1, 2, 3.)
SPLITTING CRITERIA: COMMENTS

Tree splitting is usually introduced via the concept of “impurity reduction”.
Based on an impurity measure $I$, we maximize

$\max_{\mathcal{N}_1, \mathcal{N}_2} \; I(\mathcal{N}) - \frac{n_{\mathcal{N}_1}}{n_{\mathcal{N}}} I(\mathcal{N}_1) - \frac{n_{\mathcal{N}_2}}{n_{\mathcal{N}}} I(\mathcal{N}_2),$

where $n_{\mathcal{N}} := |\mathcal{N}|$ denotes the number of observations in $\mathcal{N}$.

As $n_{\mathcal{N}}$ and $I(\mathcal{N})$ are constant, impurity reduction is equivalent to our approach, which is more in line with empirical risk minimization and our previous concepts:

$n_{\mathcal{N}_1} I(\mathcal{N}_1) + n_{\mathcal{N}_2} I(\mathcal{N}_2) = \mathcal{R}(\mathcal{N}_1) + \mathcal{R}(\mathcal{N}_2).$

Whereas the risk grows with the number of observations in a node according to our definition (the losses are simply summed up), the impurity of a node is an “average” measure.

SPLITTING CRITERIA: COMMENTS
A typical impurity measure is the Gini impurity

$I(\mathcal{N}) = \sum_{k \ne k'} \pi_k^{(\mathcal{N})} \pi_{k'}^{(\mathcal{N})} = \sum_{k=1}^g \pi_k^{(\mathcal{N})} \left(1 - \pi_k^{(\mathcal{N})}\right),$

which is equivalent to risk minimization w.r.t. the Brier score.

Another typical impurity measure is the entropy

$I(\mathcal{N}) = -\sum_{k=1}^g \pi_k^{(\mathcal{N})} \log \pi_k^{(\mathcal{N})},$

which is equivalent to risk minimization w.r.t. the log loss.

GINI SPLITTING = BRIER SCORE MINIMIZATION
Let us denote by $n_{\mathcal{N},k}$ the number of class-$k$ observations in node $\mathcal{N}$ and by $n_{\mathcal{N}}$ the overall number of observations in node $\mathcal{N}$.

We show that

$\mathcal{R}(\mathcal{N}) = n_{\mathcal{N}} \, I(\mathcal{N}),$

where $I$ is the Gini impurity and $\mathcal{R}(\mathcal{N})$ is calculated w.r.t. the Brier score

$L(y, \pi(\mathbf{x})) = \sum_{k=1}^g \left(\mathbb{1}\{y = k\} - \pi_k(\mathbf{x})\right)^2.$

GINI SPLITTING = BRIER SCORE MINIMIZATION

$\mathcal{R}(\mathcal{N}) = \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \sum_{k=1}^g \left(\mathbb{1}\{y^{(i)} = k\} - \pi_k\left(\mathbf{x}^{(i)}\right)\right)^2$
$= \sum_{k=1}^g \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \left(\mathbb{1}\{y^{(i)} = k\} - \pi_k\left(\mathbf{x}^{(i)}\right)\right)^2$
$= \sum_{k=1}^g \left[ n_{\mathcal{N},k} \left(1 - \frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 + \left(n_{\mathcal{N}} - n_{\mathcal{N},k}\right) \left(\frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 \right]$

In the last step we plugged in the optimal prediction w.r.t. the Brier score (the fraction of class-$k$ observations)

$\hat{\pi}_k(\mathbf{x}) = \pi_k^{(\mathcal{N})} = \frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}.$

We further simplify the expression to

GINI SPLITTING = BRIER SCORE MINIMIZATION

$\mathcal{R}(\mathcal{N}) = \sum_{k=1}^g \left[ n_{\mathcal{N},k} \left(\frac{n_{\mathcal{N}} - n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 + \left(n_{\mathcal{N}} - n_{\mathcal{N},k}\right) \left(\frac{n_{\mathcal{N},k}}{n_{\mathcal{N}}}\right)^2 \right]$
$= \sum_{k=1}^g \frac{n_{\mathcal{N},k} \left(n_{\mathcal{N}} - n_{\mathcal{N},k}\right)}{n_{\mathcal{N}} \cdot n_{\mathcal{N}}} \left(n_{\mathcal{N}} - n_{\mathcal{N},k} + n_{\mathcal{N},k}\right)$
$= n_{\mathcal{N}} \sum_{k=1}^g \pi_k^{(\mathcal{N})} \left(1 - \pi_k^{(\mathcal{N})}\right) = n_{\mathcal{N}} \, I(\mathcal{N}).$
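A quick numerical sanity check of this identity (assuming NumPy; the class counts of the node are made up):

import numpy as np

counts = np.array([12, 5, 3])          # class counts n_{N,k} in a node
n = counts.sum()
pi = counts / n                        # class proportions = optimal Brier-score prediction

# Brier-score risk of the node: sum over observations and classes of (1{y=k} - pi_k)^2
one_hot = np.repeat(np.eye(len(counts)), counts, axis=0)   # one row per observation
brier_risk = np.sum((one_hot - pi) ** 2)

gini = np.sum(pi * (1 - pi))           # Gini impurity
assert np.isclose(brier_risk, n * gini)
print(brier_risk, n * gini)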

ENTROPY SPLITTING = LOG LOSS MINIMIZATION
The logarithmic loss for multiple classes $y \in \{1, 2, \ldots, g\}$ is defined as

$L(y, \pi(\mathbf{x})) = -\sum_{k=1}^g \mathbb{1}\{y = k\} \cdot \log\left(\pi_k(\mathbf{x})\right).$

The risk of a node $\mathcal{N}$ w.r.t. the logarithmic loss is then

$\mathcal{R}(\mathcal{N}) = -\sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \sum_{k=1}^g \mathbb{1}\{y^{(i)} = k\} \log \pi_k\left(\mathbf{x}^{(i)}\right)$
$= -\sum_{k=1}^g \sum_{(\mathbf{x}^{(i)}, y^{(i)}) \in \mathcal{N}} \mathbb{1}\{y^{(i)} = k\} \log \pi_k^{(\mathcal{N})}$
$= -\sum_{k=1}^g n_{\mathcal{N},k} \log \pi_k^{(\mathcal{N})} = -n_{\mathcal{N}} \sum_{k=1}^g \pi_k^{(\mathcal{N})} \log \pi_k^{(\mathcal{N})} = n_{\mathcal{N}} \, I(\mathcal{N}),$

ENTROPY SPLITTING = LOG LOSS MINIMIZATION
plugging in $\pi_k(\mathbf{x}) = \pi_k^{(\mathcal{N})}$ for $k = 1, \ldots, g$ (which is optimal for the log loss) in the second step.

So splitting w.r.t. the logarithmic loss is equivalent to entropy splitting.

The entropy will be introduced in a later chapter on “Information Theory”.

Remark: Stumps/trees with entropy splitting use the same loss function
as logistic regression (binary) / softmax regression (multiclass). While
logistic regression is based on the hypothesis space of linear
functions, stumps/trees use step functions as hypothesis spaces.

LOGISTIC REGRESSION
Multivariate logistic regression combines the hypothesis space of linear functions

$\mathcal{H} = \left\{f : \mathcal{X} \to \mathbb{R} \mid f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}\right\}$

with the logistic loss

$L(y, f(\mathbf{x})) = \log\left[1 + \exp\left(-y f(\mathbf{x})\right)\right].$

We transform scores into probabilities by

$\pi(\mathbf{x}) = \mathbb{P}(y = 1 \mid \mathbf{x}) = s(f(\mathbf{x})) = \frac{1}{1 + \exp(-f(\mathbf{x}))},$

with $s(\cdot)$ being the logistic sigmoid function as introduced in Chapter 2.

LOGISTIC REGRESSION
As already shown before, an equivalent approach that directly outputs probabilities $\pi(\mathbf{x})$ is to minimize the Bernoulli loss

$L(y, \pi(\mathbf{x})) = -y \log(\pi(\mathbf{x})) - (1 - y) \log(1 - \pi(\mathbf{x}))$

for $\pi(\mathbf{x})$ in the hypothesis space

$\mathcal{H} = \left\{\pi : \mathcal{X} \to [0, 1] \mid \pi(\mathbf{x}) = s\left(\boldsymbol{\theta}^{\top} \mathbf{x}\right)\right\}$

with $s(\cdot)$ again being the logistic function.
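A minimal sketch of fitting logistic regression by gradient descent on the mean Bernoulli loss (assuming NumPy; labels are coded 0/1 here, and the learning rate and iteration count are arbitrary choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the mean Bernoulli loss; X should contain a constant-1 column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = sigmoid(X @ theta)
        grad = X.T @ (pi - y) / len(y)    # gradient of the mean negative log-likelihood
        theta -= lr * grad
    return theta

rng = np.random.default_rng(10)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
theta_true = np.array([-0.5, 2.0, -1.0])
y = rng.binomial(1, sigmoid(X @ theta_true))

print(fit_logreg(X, y))                   # roughly recovers theta_true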

LOGISTIC REGRESSION
Logistic regression with one feature $x \in \mathbb{R}$. The figure shows the mapping $x \mapsto \pi(x)$.

(Figure: the sigmoid-shaped curve $\pi(x)$, rising from 0 to 1 over the range of $x$.)

LOGISTIC REGRESSION
Logistic regression with two features:

(Figure: logistic regression fit with two features.)
LOGISTIC REGRESSION: DISCRIMINANT
FUNCTIONS AND DECISION BOUNDARY
Logistic regression gives us a linear classifier.

If we apply $g(y) = \log\left(\frac{y}{1 - y}\right)$ (which is a monotone, rank-preserving function) to $\pi(\mathbf{x}) = \frac{1}{1 + \exp(-\boldsymbol{\theta}^{\top} \mathbf{x})}$, we get

$g(\pi(\mathbf{x})) = \log(\pi(\mathbf{x})) - \log(1 - \pi(\mathbf{x}))$
$= -\log\left(1 + \exp(-f(\mathbf{x}))\right) - \log\left(1 - \frac{1}{1 + \exp(-f(\mathbf{x}))}\right)$
$= -\log\left(1 + \exp(-f(\mathbf{x}))\right) - \log\left(\frac{\exp(-f(\mathbf{x}))}{1 + \exp(-f(\mathbf{x}))}\right)$
$= -\log\left(1 + \exp(-f(\mathbf{x}))\right) - \log\left(\exp(-f(\mathbf{x}))\right) + \log\left(1 + \exp(-f(\mathbf{x}))\right)$
$= f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}.$

LOGISTIC REGRESSION: DISCRIMINANT
FUNCTIONS AND DECISION BOUNDARY
$f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}$ is the discriminant function of logistic regression, and $f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x} = 0$ represents the decision boundary, which is a hyperplane in the $p$-dimensional space.

The discriminant function can be interpreted as a log-odds ratio:

$\log\left(\frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})}\right) = \log\left(\frac{\mathbb{P}(y = 1 \mid \mathbf{x})}{\mathbb{P}(y = 0 \mid \mathbf{x})}\right) = \boldsymbol{\theta}^{\top} \mathbf{x} = f(\mathbf{x}).$

EXCURSUS: HYPERPLANES
A hyperplane in $\mathcal{X} = \mathbb{R}^p$ is a $(p - 1)$-dimensional linear subspace defined by a normal vector $\boldsymbol{\theta}$ (usually with $\|\boldsymbol{\theta}\| = 1$), perpendicular to the hyperplane, and an offset $\theta_0$.

For $f(\mathbf{x}) := \boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0$, the hyperplane is defined as

$\{\mathbf{x} \in \mathcal{X} : \boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0 = 0\} = \{\mathbf{x} \in \mathcal{X} \mid f(\mathbf{x}) = 0\}$

(Figure: illustrations of separating hyperplanes in two and three dimensions, with the normal vector $\boldsymbol{\theta}$ and points labeled + and − on either side.)
EXCURSUS: HYPERPLANES
We call $\{\mathbf{x} \in \mathcal{X} : f(\mathbf{x}) > 0\}$ (the half-space $\boldsymbol{\theta}$ is pointing at) the positive half-space, and $\{\mathbf{x} \in \mathcal{X} : f(\mathbf{x}) < 0\}$ the negative half-space.

The distance between a point $\mathbf{x} \in \mathcal{X}$ and a hyperplane $f$ is

$d(f, \mathbf{x}) = \frac{|\boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0|}{\|\boldsymbol{\theta}\|} = \frac{|f(\mathbf{x})|}{\|\boldsymbol{\theta}\|},$

i.e., $d(f, \mathbf{0}) = |\theta_0| / \|\boldsymbol{\theta}\|$.

For unit-length $\boldsymbol{\theta}$, these simplify to

$d(f, \mathbf{x}) = |f(\mathbf{x})| \quad \text{and} \quad d(f, \mathbf{0}) = |\theta_0|.$
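A tiny sketch of the distance formula (assuming NumPy; the hyperplane coefficients are made up):

import numpy as np

theta = np.array([3.0, 4.0])       # normal vector, ||theta|| = 5
theta_0 = -10.0

def signed_distance(x):
    # positive in the half-space theta points at, negative in the other one
    return (theta @ x + theta_0) / np.linalg.norm(theta)

print(signed_distance(np.array([2.0, 1.0])))   # 0.0 -> the point lies on the hyperplane
print(signed_distance(np.array([0.0, 0.0])))   # -2.0 = -|theta_0| / ||theta|| here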
LOGISTIC REGRESSION: OPTIMIZATION
Optimization:
There is no closed-form solution to the risk minimization problem defined by the above loss function and hypothesis space. Optimization can be done via gradient descent or second-order methods such as BFGS.

Claim: The problem of fitting a linear hyperplane $f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x}$ w.r.t. the logistic loss does not have a bounded solution if the data are linearly separable.

LOGISTIC REGRESSION: OPTIMIZATION
(Figure: the logistic loss $L(y f(\mathbf{x}))$ plotted against the margin $y f(\mathbf{x})$.)

More generally, the problem of fitting a linear hyperplane w.r.t. any loss function that never becomes exactly 0 (logistic loss, exponential loss, ...), i.e. with

$L(y, f(\mathbf{x})) \to 0 \;\text{ for }\; y f(\mathbf{x}) \to \infty \qquad \text{and} \qquad L(y, f(\mathbf{x})) > 0 \;\text{ for all }\; y, f(\mathbf{x}),$

does not have a bounded solution.

LOGISTIC REGRESSION: OPTIMIZATION
Proof:
Assume we are fitting a linear hyperplane $f(\mathbf{x}) = \boldsymbol{\theta}^{\top} \mathbf{x} + \theta_0$.

If the data are linearly separable, there exist $\boldsymbol{\theta}, \theta_0$ such that

$y^{(i)} \left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + \theta_0\right) > 0 \quad \forall i \in \{1, 2, \ldots, n\}.$

(Figure: the logistic loss with the positive margins $y^{(1)} f(\mathbf{x}^{(1)})$, $y^{(2)} f(\mathbf{x}^{(2)})$, $y^{(3)} f(\mathbf{x}^{(3)})$ of three training points marked on the horizontal axis.)

Multiplying the coefficients by a constant $c > 1$ preserves linear separability:

LOGISTIC REGRESSION: OPTIMIZATION

$y^{(i)} \left(c \cdot \boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + c \cdot \theta_0\right) = c \cdot y^{(i)} \left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + \theta_0\right) > 0$

However, since $y^{(i)} \left(c \cdot \boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + c \cdot \theta_0\right) > y^{(i)} \left(\boldsymbol{\theta}^{\top} \mathbf{x}^{(i)} + \theta_0\right)$, multiplying the parameters by a constant $c > 1$ will decrease the pointwise loss for all $i \in \{1, \ldots, n\}$.

(Figure: the logistic loss again; after scaling by $c$, the margins $y^{(i)} f(\mathbf{x}^{(i)})$ move further to the right, where the loss is smaller.)

LOGISTIC REGRESSION: OPTIMIZATION
Thus

$\mathcal{R}_{\text{emp}}(f) = \sum_{i=1}^n L\left(y^{(i)}, c \cdot f\left(\mathbf{x}^{(i)}\right)\right) \to 0 \quad \text{for } c \to \infty.$

The solution to the problem is not bounded. A standard optimizer such as gradient descent will not converge.
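A small sketch illustrating this divergence on linearly separable data (assuming NumPy; the data, learning rate, and iteration counts are made up): the empirical risk keeps shrinking while $\|\boldsymbol{\theta}\|$ keeps growing.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# perfectly separable 1D data (y coded 0/1), plus a constant-1 column for the intercept
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])

theta = np.zeros(2)
for it in range(1, 50001):
    pi = sigmoid(X @ theta)
    if it % 10000 == 0:
        risk = -np.mean(y * np.log(pi) + (1 - y) * np.log(1 - pi))
        print(it, round(risk, 6), round(np.linalg.norm(theta), 2))
    theta -= 1.0 * X.T @ (pi - y) / len(y)   # gradient step on the mean Bernoulli loss
# the printed risk approaches 0 while ||theta|| grows without bound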

LOGISTIC REGRESSION: OPTIMIZATION
Ways out:
Instead of the logistic loss we could use a loss function that becomes exactly 0 for large margins $y f(\mathbf{x})$. This, however, implies that the loss function is not smooth and more difficult to optimize.
An example of such a loss function is the hinge loss (see the SVM chapter in the lecture Introduction to Statistical Learning).

(Figure: the hinge loss plotted against the margin $y f(\mathbf{x})$; it is exactly 0 for sufficiently large margins.)

Another possibility is to penalize large parameters $\boldsymbol{\theta}$. This option is further discussed in the context of regularization (see also the lecture Introduction to Statistical Learning).
