
Regression & other methods for Functional Data

Fabian Scheipl
Institut für Statistik
Ludwig-Maximilians-Universität München

adidas - March 2019


Who is he?

I lecturer at LMU Munich


I MCML working group “Functional Data Analysis”
I maintainer CRAN task view “Functional Data Analysis”
I contributor: refund, mboost, lme4, gamm4, ...
I Github: fabian-s

2 / 398
Who are you?

3 / 398
Credits

Slides and material in the following based, amongst others, on slides and
figures by:

Sarah Brockhaus, Jona Cederbaum, Sonja Greven, Clara Happ, Torsten


Hothorn, Thomas Kneib, Eva Maier, David Rügamer, Lisa Steyer, Almond
Stöcker, Alexander Volkmann

(All errors & omissions mine, obviously....)

4 / 398
Global Outline

I. Background: Functional Data


II. Background: Regression
III. Functional Principal Component Analysis
IV. Background: Boosting
V. Functional Regression Models: Theory
VI. Functional Regression Models: Implementation
VII. Modeling Functional Data: Issues, Outlook & Advanced Topics
VIII. Interlude: Exploratory FDA with tidyfun
IX. Case Studies: DTI & Multiple Sclerosis
X. Case Studies: Canadian Weather
XI. Case Studies: EEG Vigilance & Depression

5 / 398
Part I

Background: Functional Data

6 / 398
Introduction

Descriptive Statistics for Functional Data

Basis Representation of Functional Data

Summary
Introduction
Overview
From high-dimensional to functional data

Descriptive Statistics for Functional Data

Basis Representation of Functional Data

Summary

8 / 398
Introduction
Overview
Examples of functional data: Berkeley growth study

[Figure: Berkeley growth study — height (cm) vs. age (years) for a sample of growth curves.]
8 / 398
Introduction
Overview
Examples of functional data: Handwriting

[Figure: handwriting data — pen position y(t) plotted against x(t).]

9 / 398
Introduction
Overview
Examples of functional data: Brain scan images

10 / 398
Introduction
Overview
Characteristics of functional data:
[Figure: Berkeley growth curves (height vs. age) and handwriting trajectories (y(t) vs. x(t)), repeated from above.]

I Several measurements for the same statistical unit, often over time
I Sampling grid is not necessarily equally spaced, sparse data
I Smooth variation that could (in principle) be assessed as often as desired
I Noisy observations
I Many observations of the same data generating process
↔ time series analysis
J. Ramsay and Silverman 2005
11 / 398
Introduction
Overview

Aims of functional data analysis:


I Represent the data → interpolation, smoothing
I Display the data → registration, outlier detection
I Study sources of pattern and variation → functional principal
component analysis, canonical correlation analysis
I Explain variation in a dependent variable by using independent
variable information → functional regression models
I No forecasting / extrapolation ↔ time series analysis

Scalar-on-Function: yi = µ + ∫ xi(s) β(s) ds + ε

Function-on-Scalar: yi(t) = µ(t) + xi β(t) + ε(t)

Function-on-Function: yi(t) = µ(t) + ∫ xi(s) β(s, t) ds + ε(t)

J. Ramsay and Silverman 2005

12 / 398
Introduction
From high-dimensional to functional data

Standard setting in multivariate data analysis:

[Schematic: an n × p data matrix — n observations (rows), p variables (columns).]

I Observations xi = (xi1 , . . . , xip ) for i = 1, . . . , n


I Model complexity increases with p (Curse of Dimensionality )

13 / 398
Introduction
From high-dimensional to functional data
Data with natural ordering:

[Schematic: measurements xi1, xi2, xi3, ..., xip recorded at ordered time points t1 < t2 < t3 < · · · < tp.]

I Longitudinal data
I Ordering along time domain (one-dimensional)
Functional data:

[Schematic: the same measurements xi1, ..., xip viewed as evaluations of a function at t1, ..., tp on the domain T.]

I Basic idea: Model discretely observed data by functions on domain T


14 / 398
Introduction
From high-dimensional to functional data
Functional data:

I Observations xi (t), t ∈ T for i = 1, . . . , n


I Number of observable values xi (t1 ), . . . , xi (tp )
I in theory: p → ∞
I in practice: p < ∞
I Domain T
I Realizations x1 , . . . , xn of X are curves (d = 1), images (d = 2), 3D
arrays (d = 3), etc.

15 / 398
Introduction

Descriptive Statistics for Functional Data


Pointwise measures
Covariance and Correlation Functions

Basis Representation of Functional Data

Summary

16 / 398
Descriptive Statistics for Functional Data
Pointwise measures
Example: Growth curves of 54 girls

[Figure: growth curves of 54 girls — height (cm) vs. age (years).]

Summary Statistics:
I Based on observed functions x1 (t), . . . , xn (t)
I Characterize location, variability, dependence between time points, ...
16 / 398
Descriptive Statistics for Functional Data
Pointwise measures
Example: Growth curves of 54 girls
[Figure: raw growth curves (height in cm) and centered curves (deviation from mean height in cm) vs. age (years).]

Sample mean function:
µ̂X(t) = (1/n) Σ_{i=1}^n xi(t)

Centered curves:
xi(t) − µ̂X(t)

I Pointwise calculation for each value t ∈ T
I Analogous to multivariate case 17 / 398
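
A minimal sketch (not from the slides) of how these pointwise summaries can be computed in R for curves stored row-wise in a matrix; the simulated heights below are purely hypothetical stand-ins for the growth data.

set.seed(1)
ages <- seq(1, 18, by = 0.5)
# 54 hypothetical growth curves, one per row, evaluated on a common age grid
X <- t(replicate(54, 80 + 100 * plogis((ages - 11) / 2) + rnorm(length(ages), sd = 2)))

mean_fun   <- colMeans(X)             # pointwise sample mean function mu_hat_X(t)
X_centered <- sweep(X, 2, mean_fun)   # centered curves x_i(t) - mu_hat_X(t)

matplot(ages, t(X_centered), type = "l", lty = 1,
        xlab = "age (years)", ylab = "deviation from mean height (cm)")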
Descriptive Statistics for Functional Data
Pointwise measures
Example: Growth curves of 54 girls
[Figure: pointwise variance (cm²) and standard deviation (cm) of the growth curves vs. age (years).]

Sample variance function:
σ̂²X(t) = (1/(n − 1)) Σ_{i=1}^n (xi(t) − µ̂X(t))²

Standard deviation function:
σ̂X(t) = √(σ̂²X(t))

18 / 398
Descriptive Statistics for Functional Data
Covariance and Correlation Functions

Covariance / Correlation functions:


I Measure dependence between different (time) points s, t ∈ T
I Sample covariance function:

v̂X(s, t) = (1/(n − 1)) Σ_{i=1}^n (xi(s) − µ̂X(s)) · (xi(t) − µ̂X(t))

I Sample correlation function:

ĉX(s, t) = v̂X(s, t) / √( σ̂²X(s) σ̂²X(t) )

19 / 398
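
Continuing the hypothetical matrix X from the sketch above (curves in rows, grid points in columns), the sample covariance and correlation functions are simply cov() and cor() applied to the wide-format data — again a sketch, not code from the slides.

v_hat  <- cov(X)            # v_hat[j, k] = sample covariance of x(t_j) and x(t_k)
c_hat  <- cor(X)            # sample correlation function on the grid
sd_fun <- sqrt(diag(v_hat)) # pointwise standard deviation function
image(ages, ages, c_hat, xlab = "age (years)", ylab = "age (years)",
      main = "sample correlation function")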
Descriptive Statistics for Functional Data
Covariance and Correlation Functions

Example: Growth curves of 54 girls

Sample covariance function

[Figure: perspective and contour plots of the sample covariance function over age (years) × age (years).]

20 / 398
Descriptive Statistics for Functional Data
Covariance and Correlation Functions

Example: Growth curves of 54 girls

Sample correlation function

[Figure: perspective and contour plots of the sample (auto-)correlation function over age (years) × age (years); correlations range from about 0.55 to 1.]

21 / 398
Introduction

Descriptive Statistics for Functional Data

Basis Representation of Functional Data


Regularly and irregularly sampled functional data
Basis functions
Basis representations for functional data
Most popular choices of basis functions
Smoothness and regularization
Other representations of functional data

Summary

22 / 398
Basis Representation of Functional Data
Regularly and irregularly sampled functional data

Example bacterial growth curve:

[Figure: i-th bacterial growth curve xi(t) — y vs. t.]

Observed measurements:

  t1   xi(t1)
  ...  ...
  tp   xi(tp)


22 / 398
Basis Representation of Functional Data
Regularly and irregularly sampled functional data
Example bacterial growth curve:

[Figure: sample of bacterial growth curves — y vs. t.]

Sample of curves x1(t), ..., xN(t), observed measurements in 'wide format':

  t1   x1(t1)   x2(t1)   ...  xN(t1)
  ...  ...      ...           ...
  tp   x1(tp)   x2(tp)   ...  xN(tp)


⇒ Regular functional data:


I functions observed on common grid (often equi-distant)
I simpler case
I to some extent, methods of multivariate statistics can be directly applied
23 / 398
Basis Representation of Functional Data
Regularly and irregularly sampled functional data

Example bacterial growth curve:

[Figure: sample of bacterial growth curves — y vs. t.]

Sample of curves x1(t), ..., xN(t), observed measurements in 'long format':

  t1,1     x1(t1,1)
  ...      ...
  t1,p1    x1(t1,p1)
  ...      ...
  tN,1     xN(tN,1)
  ...      ...
  tN,pN    xN(tN,pN)

⇒ Irregular functional data:
I functions observed on different time points
I sometimes only sparsely sampled
I more difficult, but often given in practice
24 / 398
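
A rough sketch of how the two storage formats might look in R (object and column names are my own, not from the slides): regular data fit naturally into one wide matrix or data frame, irregular data into a long table or a list with one (t, x) block per curve.

# regular data: one column per curve, rows indexed by the common grid
wide <- data.frame(t = ages, t(X))

# irregular data in 'long format': different, curve-specific grids
long <- data.frame(
  id = rep(1:3, times = c(4, 6, 5)),                       # curve index i
  t  = c(sort(runif(4)), sort(runif(6)), sort(runif(5))),  # t_{i,1}, ..., t_{i,p_i}
  x  = rnorm(15)                                           # x_i(t_{i,j})
)
curves <- split(long[, c("t", "x")], long$id)              # one (t, x) block per curve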
Basis Representation of Functional Data
Basis functions

Basis representation: construct functions as a weighted sum of basis functions bk(t), k = 1, ..., K:

f(t) = Σ_{k=1}^K θk bk(t)

with basis coefficients θ1, ..., θK.

[Figure: scaled basis functions θk·bk(t) and their weighted sum f(t).]

25 / 398
Basis Representation of Functional Data
Basis functions

Basis representation: functional shape determined via basis coefficients:

  1    θ1
  2    θ2
  3    θ3
  ...  ...
  K    θK

Function given by f(t) = Σ_{k=1}^K θk bk(t).

[Figure: basis functions θk·bk(t) and the resulting curve f(t).]

26 / 398
Basis Representation of Functional Data
Basis functions

(The following slides repeat the figure above with different example coefficient vectors — all θk = 1, a single coefficient changed to 2, and θk = k — to illustrate how the basis coefficients determine the functional shape, with f(t) = Σ_{k=1}^K θk bk(t) as before.)

26 / 398
Basis Representation of Functional Data
Basis representations for functional data
Basis representation: approximate data with basis functions.

[Figure: observed curve and its basis-function approximation.]

⇒ seek to specify θ̂i,1, ..., θ̂i,K such that

xi(t) ≈ Σ_{k=1}^K θ̂i,k bk(t).

⇒ Popular criterion:
Specify θ̂i,1, ..., θ̂i,K such that the quadratic distance becomes minimal, i.e.

Σ_j ( xi(tj) − Σ_{k=1}^K θi,k bk(tj) )²  →  min over θi,k
27 / 398
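
A possible sketch of this least-squares criterion in R, using a cubic B-spline basis from the splines package; the example curve, the basis dimension K and the knots are arbitrary choices for illustration.

library(splines)
t_obs <- seq(0, 50, length.out = 60)
x_i   <- 1 + 4 * plogis((t_obs - 20) / 4) + rnorm(60, sd = 0.2)  # noisy growth-type curve

B         <- bs(t_obs, df = 10, intercept = TRUE)  # basis evaluations b_k(t_j), K = 10
theta_hat <- coef(lm(x_i ~ B - 1))                 # minimizes sum_j (x_i(t_j) - sum_k theta_k b_k(t_j))^2
x_smooth  <- B %*% theta_hat                       # fitted curve values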
Basis Representation of Functional Data
Basis representations for functional data

Basis representation: sample of curves x1(t), ..., xN(t).

Basis representations of the observed measurements:

  1    θ̂1,1   θ̂2,1   ...  θ̂N,1
  ...  ...     ...          ...
  K    θ̂1,K   θ̂2,K   ...  θ̂N,K

Functional observations represented as xi(t) ≈ Σ_{k=1}^K θ̂i,k bk(t).

28 / 398
Basis Representation of Functional Data
Most popular choices of basis functions

B-spline bases:

[Figure: B-spline fits of degree 1, 2, and 3 to the example growth curve.]

I piece-wise polynomials of degree d
I basis functions consist of (d − 1)-times differentiably connected polynomials
I connection at knots determining the number of basis functions
I cheap to compute & numerically stable
I local support: sparse matrix of basis function evaluations

29 / 398
Basis Representation of Functional Data
Most popular choices of basis functions

Other popular bases:


I Fourier basis: containing harmonics with different frequencies
⇒ periodic functions
I Wavelets:
⇒ for peaked, ragged functions.
I Thin-plate splines
⇒ better theory, also for surfaces.

30 / 398
Basis Representation of Functional Data
Smoothness and regularization

[Figure: basis fit with many knots, over-fitting the noisy observations.]

I how many knots for the basis?
I trade-off between over-fitting and under-fitting

31 / 398
Basis Representation of Functional Data
Smoothness and regularization

Penalization:
I minimize quadratic difference from data
+ a roughness penalty term
Specify θ̂i,1, ..., θ̂i,K to minimize

Σ_{j=1}^p ( xi(tj) − Σ_{k=1}^K θi,k bk(tj) )² + λ · pen(θi)  →  min over θi,k

I with, e.g., a quadratic penalty on second-order differences, i.e.
pen(θi) = Σ_{k=3}^K ((θi,k − θi,k−1) − (θi,k−1 − θi,k−2))²  and λ > 0 a smoothing parameter

32 / 398
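
For this quadratic difference penalty the minimizer has a closed form, θ̂i = (BᵀB + λDᵀD)⁻¹Bᵀxi with D the second-order difference matrix — a sketch continuing the objects B and x_i from the previous example, with λ fixed by hand.

K      <- ncol(B)
D      <- diff(diag(K), differences = 2)   # (K-2) x K matrix of second-order differences
lambda <- 10
theta_pen <- solve(crossprod(B) + lambda * crossprod(D), crossprod(B, x_i))
x_pen     <- B %*% theta_pen               # visibly smoother than the unpenalized fit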
Basis Representation of Functional Data
Smoothness and regularization
[Figure: penalized basis fits with λ = 0 (very wiggly), λ = 1, and λ = 1000 (very smooth).]

I λ is typically estimated from the data, e.g. using cross validation

33 / 398
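
In practice one rarely fixes λ by hand; a sketch using the P-spline smooths in mgcv, where the smoothing parameter is selected automatically (here via REML), again on the toy curve from above.

library(mgcv)
fit_ps <- gam(x_i ~ s(t_obs, bs = "ps", k = 10), method = "REML")
fit_ps$sp                 # estimated smoothing parameter
x_auto <- fitted(fit_ps)  # automatically smoothed curve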
Basis Representation of Functional Data
Other representations of functional data

I Functional principal components: (J.-L. Wang et al. 2016)


I basis representation learned from observed data
I “optimal” (low-dimensional) basis
I more on this later
I Gaussian processes: x(t) ∼ GP(µX(t), σX(t, t′)) (Shi and Choi 2011)
I Gaussianity assumption
I σX(t, t′) from some parametric family
I µX, σX estimated from data
I Differential equations / dynamics: (J. Ramsay and Hooker 2017)
I represent functional data in terms of differential equations describing
their behavior: d/dt x(t) = f(x(t))
I seems very useful for physical systems, motion data etc.
I (available literature uses spline representations internally)

34 / 398
Introduction

Descriptive Statistics for Functional Data

Basis Representation of Functional Data

Summary

35 / 398
Summary
Functional Data:
I Arises in many different contexts and in many applications (curves,
images,...)
I Observation unit represents the full curve, typically discretized, i.e.
observed on a grid
I Important analysis techniques:
I Smoothing and basis representation
I Functional principal component analysis
I Functional regression

Summary Statistics:
I Give insights into location, variability and time dependence in a
sample of curves
I Pointwise calculation, mostly analogous to multivariate case

35 / 398
Summary

Basis representation:
I Different types of raw functional data: regularly and irregularly
sampled
I (Approximate) representation via bases of functions
I ’true functional representation’
I smoothing / vector representation
I Represent a functional datum in terms of a global, fixed, known
dictionary of basis functions and an observation-specific coefficient
vector.
I Different types of basis functions for different purposes / applications
I Obtain desired ’smoothness’ via penalization

36 / 398
Part II

Background: Regression

37 / 398
Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization


Recap: Linear Models
Linear Model: Basics
Inference
Model Diagnostics
R-Implementation: LM

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization

39 / 398
Data & Model

Data:
I (yi, xi1, ..., xip); i = 1, ..., n
I metric target variable y
I metric or categorical covariates x1 , . . . , xp (categorical data in binary
coding)
Model:
I yi = β0 + β1 xi1 + · · · + βp xip + εi ; i = 1, . . . , n
⇒ y = Xβ + ε; X = [1, x1 , . . . , xp ]
I i.i.d. residuals/errors εi ∼ N(0, σ²); i = 1, . . . , n
I estimates ŷi = β̂0 + β̂1 xi1 + · · · + β̂p xip

39 / 398
Interpreting the coefficients

Intercept:
β̂0 : estimated expected value of y if all metric x = 0 and all categorical x are in their reference category.
metric covariates:
β̂m : estimated expected change in y if xm increases by 1 (ceteris
paribus).
categorical covariates: (dummy-/one-hot-encoding)
β̂mc : estimated expected difference in y between observations in
category c and the reference category of xm (ceteris paribus).

40 / 398
Linear Model Estimation

β̂ minimizes the sum of squared errors (OLS estimate):

Σ_{i=1}^n (yi − xiᵀβ)² → min over β,   or equivalently   (y − Xβ)ᵀ(y − Xβ) → min over β

⇒ β̂ = (XᵀX)⁻¹Xᵀy

Estimated error variance:

σ̂²ε = 1/(n − p) Σ_{i=1}^n (yi − xiᵀβ̂)² = 1/(n − p) ε̂ᵀε̂

41 / 398
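
A sketch of these formulas written out with matrix algebra and checked against lm(), on hypothetical simulated data (lm() itself uses a QR decomposition rather than the explicit inverse).

set.seed(1)
n  <- 100; x1 <- runif(n); x2 <- runif(n)
X  <- cbind(1, x1, x2)
y  <- drop(X %*% c(2, 1, -1)) + rnorm(n)

beta_hat   <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
res        <- y - drop(X %*% beta_hat)
sigma2_hat <- sum(res^2) / (n - ncol(X))             # 1/(n - p) * e'e
all.equal(drop(beta_hat), coef(lm(y ~ x1 + x2)), check.attributes = FALSE)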
Properties of β̂

I unbiased: E(β̂) = β
I Cov(β̂) = σ²(XᵀX)⁻¹
for Gaussian ε:
β̂ ∼ N(β, σ²(XᵀX)⁻¹)

42 / 398
Tests

Possible settings:
1. Testing for significance of a single coefficient:
H0: βj = 0 vs HA: βj ≠ 0
2. Testing for significance of a subvector βt = (βt1, ..., βtr)ᵀ:
H0: βt = 0 vs HA: βt ≠ 0
3. Testing for equality: H0: βj − βr = 0 vs HA: βj − βr ≠ 0
General:
Testing linear hypotheses H0: Cβ = d

43 / 398
Tests
F-Test:
Compare the sum of squared errors (SSE) of the full model with the SSE under the restriction H0:

F = (n − p)/r · (SSE_H0 − SSE)/SSE
  = (Cβ̂ − d)ᵀ [σ̂² C(XᵀX)⁻¹Cᵀ]⁻¹ (Cβ̂ − d) / r   ∼ F(r, n − p) under H0

t-Test:
Test significance of a single coefficient:

t = β̂j / √(Var̂(β̂j))   ∼ t(n − p) under H0

F = t² = β̂j² / Var̂(β̂j)   ∼ F(1, n − p) under H0

44 / 398
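
The general linear hypothesis H0: Cβ = d can be tested by plugging into the F-statistic above; a sketch on the toy data from the previous example, with C and d chosen purely for illustration (testing β1 = β2 = 0).

fit_lm <- lm(y ~ x1 + x2)
C <- rbind(c(0, 1, 0),
           c(0, 0, 1))
d <- c(0, 0)
r <- nrow(C)
V  <- vcov(fit_lm)                    # sigma_hat^2 (X'X)^{-1}
Cb <- C %*% coef(fit_lm) - d
F_stat <- drop(t(Cb) %*% solve(C %*% V %*% t(C)) %*% Cb) / r
pf(F_stat, r, df.residual(fit_lm), lower.tail = FALSE)   # p-value of the F-test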
Residuals in the linear model

Observed residuals ε̂ are typically neither uncorrelated nor of identical variance:

ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy,   with hat matrix H = X(XᵀX)⁻¹Xᵀ

⇒ ε̂ = y − ŷ = (I − H)y
⇒ Cov(ε̂) = σ²(I − H)

45 / 398
Types of Residuals

I ordinary residuals: ε̂ (not independent, no constant variance)
I standardized residuals: ri = ε̂i / (σ̂ √(1 − hii)) (constant variance)
I studentized residuals: ri* = ε̂i / (σ̂(−i) √(1 − hii)):
use for anomaly / outlier detection.
I partial residuals: ε̂xj,i = ε̂i + β̂j xij:
check linearity, additivity.

46 / 398
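
Base R provides these residual types directly; a small sketch, continuing the toy fit fit_lm from above.

e_hat <- resid(fit_lm)       # ordinary residuals
r_std <- rstandard(fit_lm)   # standardized: e_i / (sigma_hat * sqrt(1 - h_ii))
r_stu <- rstudent(fit_lm)    # studentized: leave-one-out estimate of sigma
h     <- hatvalues(fit_lm)   # diagonal elements h_ii of the hat matrix
e_par <- resid(fit_lm) + coef(fit_lm)["x1"] * x1   # partial residuals for x1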
Graphical model checks:

I model structure: ri vs ŷi


I linearity: ε̂xj ,i vs xj
I variance homogeneity: ri vs ŷi , xj
I autocorrelation: ri , ε̂i vs i (i = time, e.g.)

47 / 398
Linear Model in R:

Linear Models in R: lm model specification:


I m <- lm(y ~ x1 + x2, data=XY)
interactions:
I lm(y ~ x1*x2) equivalent to lm(y ~ x1 + x2 + x1:x2)
methods for lm-objects:
I summary(),anova(),fitted(),predict(),resid()
I coef(), confint(), vcov(), influence()
I plot()
etc...

48 / 398
Example: Munich Rents 1999

I data: 3082 apartments


I target: net rent (DM/sqm)
I metric covariates: size, year of construction
I categorical covariates: area (normal/good/best), central heating
(yes/no), bathroom / kitchen fittings (normal/superior)

49 / 398
Model in R
no interaction:
y = β0 + β1 ∗ x1 + β2 ∗ x2.2 + β3 ∗ x2.3
miet1 <- lm(rentsqm ~ size + area)
(beta.miet1 <- coef(miet1))

## (Intercept) size areagood areabest


## 18.2429185 -0.0715132 0.9059416 3.4196824

with interaction:
y = β0 + β1 ∗ x1 + β2 ∗ x2.2 + β3 ∗ x2.3 + β4 x1 x2.2 + β5 x1 x2.3
miet2 <- lm(rentsqm ~ size * area)
(beta.miet2 <- coef(miet2))

## (Intercept) size areagood areabest


## 18.67890804 -0.07817872 0.11145940 0.87292650
## size:areagood size:areabest
## 0.01182596 0.03302475
Model Visualisation

[Figure: fitted regression lines for miet1 (no interaction, parallel lines) and miet2 (interaction) — net rent (DM/sqm) vs. size (sqm), by area (normal/good/best).]

Tests

anova(update(miet2, . ~ -.), miet2)

## Analysis of Variance Table


##
## Model 1: rentsqm ~ 1
## Model 2: rentsqm ~ size * area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3081 69521
## 2 3076 60064 5 9457.3 96.866 < 2.2e-16 ***
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Tests
round(summary(miet2)$coefficients, 3)

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 18.679 0.329 56.736 0.000
## size -0.078 0.005 -16.376 0.000
## areagood 0.111 0.494 0.226 0.821
## areabest 0.873 1.542 0.566 0.571
## size:areagood 0.012 0.007 1.716 0.086
## size:areabest 0.033 0.018 1.797 0.072

round(anova(miet2), 3)

## Analysis of Variance Table


##
## Response: rentsqm
## Df Sum Sq Mean Sq F value Pr(>F)
## size 1 8071 8071.3 413.346 <2e-16 ***
## area 2 1284 641.9 32.875 <2e-16 ***
## size:area 2 102 51.1 2.617 0.073 .
## Residuals 3076 60064 19.5
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Model Comparison

anova(miet1, miet2)

## Analysis of Variance Table


##
## Model 1: rentsqm ~ size + area
## Model 2: rentsqm ~ size * area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3078 60166
## 2 3076 60064 2 102.19 2.6168 0.0732 .
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

complex hypotheses / multiple testing: package multcomp


Model diagnostics: plot.lm()
par(mfrow = c(2, 2))
plot(miet2)

[Figure: the four standard lm diagnostic plots for miet2 — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance contours).]
Model criticism: Linear effect of size?
[Figure: residuals vs. size for the rent model, suggesting a possibly non-linear effect of size.]
Model criticism: Linear effect of size?
plot(size, res_size)
points(sort(unique(size)), tapply(res_size, size, mean), col = "red")
[Figure: scatterplot of residuals vs. size with pointwise means (red), showing systematic deviations from a linear effect.]
Alternative Representation of linear models

I y = Xβ + ε
I Gaussian errors: ε ∼ N(0, σ 2 I)
⇒ y ∼ N(Xβ, σ 2 I)
⇒ E(y) = Xβ; Var(y) = σ 2 I

58 / 398
Recap: Linear Models

Recap: Generalized Linear Models


Motivation
GLMs: The General Approach
Inference

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization

59 / 398
Binary Target: Naive Approach

Data:
I binary target y (0 or 1)
I metric and/or categorical x1 , . . . , xp
naive estimates:
ŷi = β̂0 + β̂1 xi1 + · · · + β̂p xip
I ŷi not binary
I could try to interpret ŷi as P̂(yi = 1)
I no variance homogeneity
I ŷi < 0 ? ŷi > 1? ⇒ ŷi must be between 0 and 1
Idea:
P̂(yi = 1) = h(xiᵀβ̂) with h: (−∞, +∞) → [0, 1]

59 / 398
Binary Target: GLM Approach

I yi ∼ B(1, πi)
I model for E(yi) = P(yi = 1) = πi
I use response function h: π̂i = h(xiᵀβ̂)
or link function g: g(π̂i) = xiᵀβ̂, where g = h⁻¹

Logit model:

π̂i = h(xiᵀβ̂) = exp(xiᵀβ̂) / (1 + exp(xiᵀβ̂))

60 / 398
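
The logit response and link functions are available in base R as plogis() and qlogis(), and the same pair sits inside the binomial() family object — a small sketch, not from the slides.

eta    <- seq(-4, 4, by = 0.5)     # linear predictor values x'beta
pi_hat <- plogis(eta)              # h(eta) = exp(eta) / (1 + exp(eta))
all.equal(qlogis(pi_hat), eta)     # g = h^{-1}: the logit log(pi / (1 - pi))
binomial()$linkfun(0.5)            # link function used by glm(family = binomial())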
Binary Target: Coefficients of the Logitmodel

πi = exp(xiᵀβ) / (1 + exp(xiᵀβ))   ⇔   log( πi / (1 − πi) ) = xiᵀβ

⇔   πi / (1 − πi) = exp(β0) exp(β1 xi1) ... exp(βp xip)

I linear model for the log-odds (logits)

⇒ exp(β̂r) is the factor by which the odds π̂i / (1 − π̂i) change if xir increases by 1.

exp( (x − x̃)ᵀβ̂ ) = [P̂(y = 1|x)/P̂(y = 0|x)] / [P̂(y = 1|x̃)/P̂(y = 0|x̃)]

is the odds ratio between two observations with covariates x and x̃.

61 / 398
Binary Target: Probit- & cloglog-Models

Probit model:
use the standard Gaussian CDF Φ as response function:

π̂i = Φ(xiᵀβ̂)

cloglog model:
response function:

π̂i = 1 − exp(−exp(xiᵀβ̂))

62 / 398
Binary Targets: Expectation and Variance

I no direct connection between expectation (x> β) and variance (σ 2 ) in


linear model
I for binary y ∼ B(1, π):
E(y ) = π = P(y = 1) determines Var(y ) = π(1 − π)
Overdispersion:
observed variability greater than theory assumes:
I unobserved heterogeneity
I positively correlated observations
Solution: add dispersion φ : Var(y ) = φπ(1 − π)

63 / 398
Example: Patent Injunctions

I Data: 4832 European patents (European Patent Office)


I Target: patent injunctions (yes/no)
I covariates (metric):
I year of patent (0=1980)
I citations (azit)
I scope (no. of countries; aland)
I patent claims (ansp)
I covariates (categorical):
I sector (Biotech&Pharma, IT&Semiconductor) (branche)
I US patent (uszw)
I patent holder origin (US/D, CH, GB/others; (herkunft))

64 / 398
Binary Target: R-Implementation

## The following objects are masked from patent (pos = 4):


##
## aland, ansp, azit, branche, einspruch, herkunft,
## jahr, uszw

pat1 <- glm(einspruch ~ ., data = patent, family = binomial())


round(summary(pat1)$coefficients, 3)

## Estimate Std. Error z value Pr(>|z|)


## (Intercept) -0.771 0.134 -5.765 0.000
## uszwUSPatent -0.392 0.068 -5.795 0.000
## jahr -0.071 0.009 -8.194 0.000
## azit 0.118 0.014 8.297 0.000
## aland 0.084 0.011 7.915 0.000
## ansp 0.018 0.003 5.219 0.000
## brancheBioPharma 0.681 0.084 8.128 0.000
## herkunftD/CH/GB 0.323 0.083 3.897 0.000
## herkunftUS -0.152 0.076 -2.002 0.045
Binary Target: R-Implementation
round(exp(cbind(coef(pat1), confint(pat1))), 3)

## Waiting for profiling to be done...

## 2.5 % 97.5 %
## (Intercept) 0.462 0.355 0.601
## uszwUSPatent 0.676 0.592 0.772
## jahr 0.931 0.915 0.947
## azit 1.125 1.095 1.157
## aland 1.088 1.066 1.111
## ansp 1.018 1.011 1.025
## brancheBioPharma 1.975 1.676 2.328
## herkunftD/CH/GB 1.381 1.174 1.625
## herkunftUS 0.859 0.741 0.997

table(einspruch, estimated = round(fitted(pat1)))

## estimated
## einspruch 0 1
## nein 2223 624
## ja 925 1094
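
Returning to the overdispersion remark above: one common option in practice is the quasibinomial family, which estimates the dispersion φ instead of fixing it at 1 — a sketch reusing the patent model formula (pat_quasi is my own object name, not from the slides).

pat_quasi <- glm(einspruch ~ ., data = patent, family = quasibinomial())
summary(pat_quasi)$dispersion   # estimated phi in Var(y) = phi * pi * (1 - pi)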
Count Data as Targets

Data:
I positive, whole-number target y (counts, frequencies)
I metric and/or categorical x1 , . . . , xp
⇒ naive estimates Ê(yi) = xiᵀβ̂ could become negative
⇒ model log(Ê(yi)) instead, i.e.,

Ê(yi) = exp(xiᵀβ̂) = exp(β̂0) exp(β̂1 xi1) ... exp(β̂p xip)

⇒ exponential-multiplicative covariate effects on target

67 / 398
Count Data as Targets: log-linear Model

Distributional assumption:
I yi | xi ∼ Po(λi);  λi = exp(xiᵀβ)
⇒ E(yi) = Var(yi) = λi
Overdispersion:
I Frequently Var(yi) ≠ λi:
⇒ more flexible model with dispersion parameter φ:
Var(yi) = φλi
⇒ alternative distributions: Tweedie, Negative Binomial

68 / 398
Example: Patent Citations

pat2 <- glm(azit ~ ., family = poisson, data = patent)


pat3 <- MASS::glm.nb(azit ~ ., data = patent)
AIC(pat2, pat3)

## df AIC
## pat2 9 21021.23
## pat3 10 16341.48

round(cbind(
summary(pat2)$coefficients[2:5, -c(3, 4)],
summary(pat3)$coefficients[2:5, -c(3, 4)]
), 3)

## Estimate Std. Error Estimate Std. Error


## einspruchja 0.442 0.024 0.422 0.046
## uszwUSPatent -0.079 0.024 -0.047 0.046
## jahr -0.070 0.003 -0.079 0.006
## aland -0.026 0.004 -0.029 0.008

⇒ similar estimates, much bigger variability, better fit.


Definition: GLM

I Structural assumption: Connect conditional expectation and linear predictor Xβ via link/response function:

E(yi | xi) = µi = h(xiᵀβ)   ⇔   g(E(yi | xi)) = g(µi) = xiᵀβ

I logit regression: E(yi | xi) = P(yi = 1 | xi) = exp(xiᵀβ) / (1 + exp(xiᵀβ))
I log-linear model: E(yi | xi) = exp(xiᵀβ)
I Distributional assumption: Given independent (xi, yi) with exponential family density f(yi):

⇒ f(yi | θi) = exp( (yi θi − b(θi)) / φ · ωi − c(yi, φ, ωi) );   θi = θ(µi)

I E(yi | xi) = µi = b′(θi) = h(xiᵀβ)
I Var(yi | xi) = φ b″(θi) / ωi;   ωi = ni

⇒ Connect mean structure and variance structure (and higher moments)

70 / 398
Simple Exponential Families

Distribution                θ(µ)              b(θ)              φ

Normal N(µ, σ²)             µ                 θ²/2              σ²
Bernoulli B(1, µ)           log(µ/(1 − µ))    log(1 + exp(θ))   1
Poisson Po(µ)               log(µ)            exp(θ)            1
Gamma G(µ, ν)               −1/µ              −log(−θ)          1/ν
Inverse Gauss IG(µ, σ²)     1/µ²              −√(−2θ)           σ²

71 / 398
Simple Exponential Families

Distribution      E(y) = b′(θ)                 b″(θ)       Var(y) = b″(θ)φ/ω

Normal            µ = θ                        1           σ²/ω
Bernoulli         µ = exp(θ)/(1 + exp(θ))      µ(1 − µ)    µ(1 − µ)/ω
Poisson           µ = exp(θ)                   µ           µ/ω
Gamma             µ = −1/θ                     µ²          µ²/(νω)
Inverse Gauss     µ = 1/√(−2θ)                 µ³          µ³σ²/ω

72 / 398
R-Implementation: glm()

glm(formula, family , data, ...)


I formula: as in lm
I family: specify distribution (binomial, gamma, etc.)
and link function g (µ) = Xβ
(family=binomial(link=’probit’)).

73 / 398
Advantages of GLM-Formulation

I unified approach for a variety of data situations


⇒ unified methodology for
I estimation
I tests
I model choice and diagnostics
⇒ asymptotics
via Maximum Likelihood approach.

74 / 398
Recent Extensions:

GLM idea in combination with ML inference works similarly for many


other non-exponential family distributions, implemented in mgcv:
I t-distribution
I Tweedie
I Beta
I models for ordinal categorical responses
(Wood, Pya & Säfken, 2016)

75 / 398
ML Estimation: Idea

I OLS estimate in the linear model: Σ_{i=1}^n (yi − xiᵀβ)² → min
I density for y: ∏_{i=1}^n f(yi | β, xi) = (2πσ²)^(−n/2) exp( −Σ_{i=1}^n (yi − xiᵀβ)² / (2σ²) )

⇒ the OLS estimate maximizes the joint density of the observed data over the model parameters
⇒ Maximum Likelihood principle:
maximize the (log-)likelihood l(β) = Σ_{i=1}^n log(f(yi | β, xi))

76 / 398
ML Estimation: Procedure

I log-likelihood l(β) = Σ_{i=1}^n log(f(yi | β, xi))

I score function s(β) = ∂l(β)/∂β
I (iterative) solution of s(β) = 0
via Fisher scoring or IWLS

77 / 398
ML Estimation: Fisher-Scoring
I basically the Newton method:

β^(k+1) = β^(k) − ( ∂s(β)/∂βᵀ )⁻¹ s(β)

[Figure: Newton method — successive iterates β^(0), β^(1), β^(2), ... along the score function s(β).]

β
78 / 398
ML Estimation: Fisher-Scoring & IWLS

I basically the Newton method:

β̂^(k+1) = β̂^(k) − ( ∂s(β̂^(k))/∂βᵀ )⁻¹ s(β̂^(k))

I the observed information matrix H(β̂^(k)) = ∂s(β̂^(k))/∂βᵀ is expensive to compute
⇒ use the expected Fisher information F(β) = E(H(β)),
which is very efficiently computable:
represent the update in terms of iteratively re-weighted LS estimation (IWLS) with a
diagonal weight matrix W^(k) and working observations ỹi^(k).

79 / 398
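
A bare-bones sketch of the IWLS idea for logistic regression on toy simulated data; glm() is of course what one would use in practice, this is only meant to make the working observations and weights concrete.

set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1))))

beta <- c(0, 0)
for (k in 1:25) {
  eta  <- drop(X %*% beta)
  mu   <- plogis(eta)
  W    <- mu * (1 - mu)                 # diagonal weights
  z    <- eta + (y - mu) / W            # working observations
  beta <- solve(crossprod(X, W * X), crossprod(X, W * z))   # weighted LS update
}
cbind(IWLS = drop(beta), glm = coef(glm(y ~ X[, 2], family = binomial())))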
Properties of ML Estimators

β̂ML is consistent, efficient, and asymptotically Gaussian:

β̂ML ∼ N(β, F⁻¹(β))   (asymptotically)

80 / 398
Tests
Linear hypotheses H0: Cβ = d vs HA: Cβ ≠ d
Estimate β̃ for β under the restriction H0:
I LR test:
lq = −2( l(β̃) − l(β̂) )

I Wald test:

w = (Cβ̂ − d)ᵀ (C F⁻¹(β̂) Cᵀ)⁻¹ (Cβ̂ − d)

I Score test:
u = s(β̃)ᵀ F⁻¹(β̃) s(β̃)

Under H0, lq, w, u are asymptotically χ²_r distributed, r = rank(C) (no. of restrictions)
⇒ reject H0 if lq, w, u > χ²_r(1 − α).
81 / 398
Tests in R
summary.glm uses √w ∼ N(0, 1) (asymptotically) for H0: βj = 0:
round(summary(pat2)$coefficients[8:9, ], 3)

## Estimate Std. Error z value Pr(>|z|)


## herkunftD/CH/GB -0.236 0.031 -7.524 0.000
## herkunftUS 0.061 0.026 2.358 0.018

anova.glm(..., test=’Chisq’) for LR-Tests:


anova(update(pat2, . ~ . - herkunft), pat2, test = "Chisq")

## Analysis of Deviance Table


##
## Model 1: azit ~ einspruch + uszw + jahr + aland + ansp + branche
## Model 2: azit ~ einspruch + uszw + jahr + aland + ansp + branche + herkunft
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 4859 13954
## 2 4857 13859 2 95.155 < 2.2e-16 ***
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Model Choice

Which probabilistic model offers best trade-off between fidelity to training


data (more complexity) and parsimony?
⇒ Information criteria:
I Akaike: AIC = −2l(β̂) + 2p → min (AIC())
I Bayes: BIC = −2l(β̂) + log(n)p → min

83 / 398
Model Diagnostics: Residuals

I Pearson residuals: (resid option: type='pearson')
I riP = (yi − µ̂i) / √(v(µ̂i))
I for grouped data approx. N(0, 1)
I deviance residuals (resid default)
I riD = sgn(yi − µ̂i) √( 2(li(yi) − li(µ̂i)) )
I for grouped data approx. N(0, 1) distributed
I partial residuals (type='partial')
I prediction errors yi − ŷi (type='response')

84 / 398
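
A sketch of how these residual types are extracted from the fitted Poisson model pat2 defined above.

r_pearson  <- resid(pat2, type = "pearson")
r_deviance <- resid(pat2)                     # type = "deviance" is the default
r_response <- resid(pat2, type = "response")  # prediction errors y_i - mu_hat_i
r_partial  <- resid(pat2, type = "partial")   # one column per model term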
Model validation: plot.glm()

par(mfrow = c(2, 2))


plot(pat2)

[Figure: glm diagnostic plots for pat2 — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance); observations 1871, 2796 and 4743 flagged as conspicuous.]


Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects


Transformation of Covariates
Polynomial Splines

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization

86 / 398
Motivation
[Figure: residuals vs. size for the rent model (repeated from above) — motivation for non-linear effects.]
Motivation
par(mfrow = c(1, 2))
plot(size, res_size)
points(sort(unique(size)), tapply(res_size, size, mean), col = "red")
plot(log(size, base = 2), res_size)
points(log(sort(unique(size)), base = 2), tapply(res_size, size, mean), col = "red")
[Figure: residuals vs. size and vs. log2(size), with pointwise means in red — the effect of size is clearly non-linear.]
Simple Transformation

I linearity is often too restrictive an assumption
I gain flexibility without complex models by using the log or polynomials of x
⇒ replace y = βx + ε
by y = βf(x) + ε;  f(x) = log(x), x³, √x, etc.
I Issues:
I interpretation of β
I choice/selection of f(x)

88 / 398
Polynomial Transformation

I Polynomial Model
I y = f(x) + ε = β0 + β1 x + β2 x² + · · · + βl x^l + ε
I In R: Use poly(x,degree) to avoid collinearity

89 / 398
Polynomial Transformation: Collinearity
x <- seq(0, 1, l = 200)
X <- outer(x, 1:5, "^")
X.c <- poly(x, 5)
round(cor(X), 2)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1.00 0.97 0.92 0.87 0.82
## [2,] 0.97 1.00 0.99 0.96 0.93
## [3,] 0.92 0.99 1.00 0.99 0.97
## [4,] 0.87 0.96 0.99 1.00 0.99
## [5,] 0.82 0.93 0.97 0.99 1.00

round(cor(X.c), 2)

## 1 2 3 4 5
## 1 1 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 0 0 1

⇒ use orthogonal polynomials


Polynomial Transformation: Synthetic example

x <- seq(0, 1, l = 300)


fx <- function(x) {
sin(2 * (4 * x - 2)) + 2 * exp(-16^2 * (x - 0.5)^2)
}
y <- fx(x) + rnorm(300, sd = .3)
X.c <- poly(x, 15)
m.poly3 <- lm(y ~ X.c[, 1:3])
m.poly7 <- lm(y ~ X.c[, 1:7])
m.poly11 <- lm(y ~ X.c[, 1:11])
m.poly15 <- lm(y ~ X.c)
plot(x, y, pch = 19, col = "grey")
lines(x, fx(x), col = 1, lwd = 2)

91 / 398
Polynomial Transformation: Synthetic example


[Figure: simulated data y (grey points) with the true function f(x) (black line) overlaid.]
92 / 398
Piecewise Polynomials

I polynomial transformations have problems:
I choice of degree (= flexibility)
I oscillations, boundary effects for higher degrees
⇒ piece-wise polynomials:
I decompose the range of x into sub-intervals
I approximate f(x) by a low-degree polynomial in each sub-interval
⇒ removes oscillations, boundary effects

93 / 398
Piecewise Polynomials
[Figure: simulated data with the true f, a global polynomial of degree 15, and 5 piecewise quadratic polynomial fits.]

⇒ f̂(x) for piecewise polynomials is not continuous


Definition: Polynomial Splines

I better piecewise polynomials:
I require continuous differentiability at subinterval boundaries
I formally:
f: [a, b] → R is a polynomial spline of degree l ≥ 0 with knots
a = κ1 < · · · < κm = b if
1. f(x) is (l − 1)-times continuously differentiable
2. f(x) is a polynomial of degree l on each [κj, κj+1)
⇒ the choice of degree l determines the smoothness of the function
⇒ the knot set κ defines the flexibility/complexity of f

95 / 398
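
A sketch of fitting such a cubic polynomial spline to the synthetic example above with lm() and a B-spline design from the splines package; the knot placement is chosen by hand for illustration.

library(splines)
knots_inner <- seq(0.2, 0.8, by = 0.2)        # interior knots on [0, 1]
m_spline <- lm(y ~ bs(x, knots = knots_inner, degree = 3))
plot(x, y, pch = 19, col = "grey")
lines(x, fitted(m_spline), lwd = 2)           # fitted cubic spline
lines(x, fx(x), col = 2, lwd = 2)             # true function for comparison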
Polynomial Splines: Example
[Figure: polynomial spline fits of degree 0, 1, 2, and 3 with 5 + 2 knots each — higher degree yields visually smoother fits.]

96 / 398


Polynomial Splines: Discussion

I Standard: cubic splines:


I visually smooth
I twice continuously differentiable (i.e., curvature well defined)
I knot set:
I size: trade-off between flexibility and overfitting
I positioning: equidistant? quantile-based? domain knowledge?
→ more on this in the context of penalization

97 / 398
Truncated Polynomials

I simplest polynomial splines
I basis representation for degree l and knots κ = (κ1 , . . . , κm ):
f (x) = γ1 + γ2 x + · · · + γl+1 x^l + γl+2 (x − κ2)_+^l + · · · + γl+m−1 (x − κm−1)_+^l
I first l + 1 coefficients determine a global polynomial of degree l
I coefficient of the highest power can change at each knot
⇒ f is of degree l everywhere and continuously differentiable
(see the sketch below)

98 / 398
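A minimal sketch (not from the original slides) of building such a truncated power basis by hand and fitting it as a linear model; the simulated data vectors x and y and the domain [0, 1] are assumed from the preceding figures.

# truncated power basis of degree l = 3 with interior knots kappa_2, ..., kappa_{m-1}
l <- 3
kappa <- seq(0, 1, length.out = 7)[-c(1, 7)]            # interior knots on the assumed domain [0, 1]
X_poly <- outer(x, 0:l, `^`)                            # global polynomial part: 1, x, ..., x^l
X_trunc <- sapply(kappa, function(k) pmax(x - k, 0)^l)  # truncated parts (x - kappa_j)_+^l
B_tp <- cbind(X_poly, X_trunc)                          # truncated power design matrix
m_tp <- lm(y ~ B_tp - 1)                                # fit as an ordinary linear model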
Truncated Polynomials: Example
[Figure: truncated polynomial basis functions, and the basis functions scaled by their estimated coefficients, overlaid on the simulated data]
99 / 398
Truncated Polynomials: Discussion

Numerical disadvantages
I Basis function values can become very large
I strong collinearity of basis functions
⇒ numerically preferable: B-spline basis functions

100 / 398
B-splines: Idea

I B-Spline-basis function is itself a piecewise polynomial, connecting


I (l + 1) polynomial fragments
I of degree l
I (l − 1)-times continuously differentiable at connection points.
⇒ a weighted sum of such basis functions is of degree l and (l − 1)-times continuously differentiable everywhere

101 / 398
B-Splines: Basis Functions
[Figure: B-spline basis functions B(x) of degree l = 0 over knots κ1 , . . . , κ11]
102 / 398


B-Splines: Properties

I local basis: basis functions ≠ 0 only between l + 2 knots


I bounded range
⇒ avoids problems of truncated polynomials
I overlap with 2l adjacent basis functions

103 / 398
(B-)Splines as Linear Models

Model: y = f (x) + ε
How to estimate f (x)?
I define basis functions bk (x), k = 1, . . . , K
I f (x) ≈ Σ_{k=1}^K θk bk (x)
⇒ ŷ = fˆ(x) = Σ_{k=1}^K θ̂k bk (x)
⇒ this is a linear model ŷ = Bθ̂ with design matrix
B = [ b1 (x1 ) . . . bK (x1 )
      ...            ...
      b1 (xn ) . . . bK (xn ) ]
I analogously applicable to GLMs: g (µ̂) = Bθ̂

104 / 398
B-Splines: R-Implementation
bs in splines package creates a B-spline Designmatrix B:
library("splines")
B <- bs(x, df = 12, intercept = T)
m_bspline <- lm(y ~ B - 1)
B_scaled <- t(t(B) * coef(m_bspline))
plot(x, y, pch = 19, cex = .5, col = "grey")
matlines(x, B, lty = 1, col = 1, lwd = 2)


[Figure: simulated data with the unscaled B-spline basis functions drawn by matlines()]
B-Splines: R-Implementation
library("splines")
B <- bs(x, df = 12, intercept = T)
m_bspline <- lm(y ~ B - 1)
B_scaled <- t(t(B) * coef(m_bspline))
plot(x, y, pch = 19, cex = .5, col = "grey")
matlines(x, B, lty = 1, col = scales::alpha(1, .7), lwd = .5)
matlines(x, B_scaled, lty = 1, col = 2, lwd = 2)


[Figure: simulated data with faint unscaled basis functions and the coefficient-scaled basis functions highlighted]
B-Splines: R-Implementation
library("splines")
B <- bs(x, df = 12, intercept = T)
m_bspline <- lm(y ~ B - 1)
B_scaled <- t(t(B) * coef(m_bspline))
plot(x, y, pch = 19, cex = .5, col = "grey")
matlines(x, B, lty = 1, col = scales::alpha(1, .7), lwd = .5)
matlines(x, B_scaled, lty = 1, col = scales::alpha(2, .7), lwd = 1)
lines(x, fitted(m_bspline), lty = 1, col = 3, lwd = 2)


[Figure: simulated data with basis functions, scaled basis functions, and the resulting fit fitted(m_bspline) overlaid]
Splines: Summary

I basis function representation linearizes problem of function


estimation
I dimension of basis controls maximal complexity
I basis type determines properties of function estimate: continuity,
differentiability, periodicity, . . .

108 / 398
Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects


Exemplary Longitudinal Study: Sleep Deprivation
Motivation: From LM to LMM
Advantages of a Mixed Models Representation
Linear Mixed Models
LMM Estimation
Generalized Linear Mixed Models
GLMM Estimation

Recap: Additive Models and Penalization

109 / 398
Example: Sleep Deprivation Data

I laboratory experiment to measure effect of sleep deprivation on


cognitive performance
I 18 subjects, restricted to 3 hours of sleep per night for 10 days
I operationalization of cognitive performance: reaction time

109 / 398
Example: Sleep Deprivation Data

data(sleepstudy, package = "lme4")


summary(sleepstudy)

## Reaction Days Subject


## Min. :194.3 Min. :0.0 308 : 10
## 1st Qu.:255.4 1st Qu.:2.0 309 : 10
## Median :288.7 Median :4.5 310 : 10
## Mean :298.5 Mean :4.5 330 : 10
## 3rd Qu.:336.8 3rd Qu.:7.0 331 : 10
## Max. :466.4 Max. :9.0 332 : 10
## (Other):120

110 / 398
Example: Sleep Deprivation Data

[Figure: average reaction time [ms] vs. days of sleep deprivation, one panel per subject]
Example: Sleep Deprivation Data
Model global trend: Reactionij ≈ β0 + β1 Daysij

m_sleep_global <- lm(Reaction ~ Days, data = sleepstudy)


summary(m_sleep_global)

##
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.848 -27.483 1.546 26.142 139.953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 251.405 6.610 38.033 < 2e-16 ***
## Days 10.467 1.238 8.454 9.89e-15 ***
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 47.71 on 178 degrees of freedom
## Multiple R-squared: 0.2865, Adjusted R-squared: 0.2825
## F-statistic: 71.46 on 1 and 178 DF, p-value: 9.894e-15
Example: Sleep Deprivation Data
With estimated global level and trend added:
[Figure: per-subject reaction times with the estimated global level and trend overlaid]
⇒ obviously inappropriate model
Example: Sleep Deprivation Data

I subjects obviously differ in level and trend for reaction time


I idea: model subject-specific levels and trends
Reactionij ≈ β0i + β1i Daysij
# similar: m_sleep_indiv <- lm(Reaction ~ 0 + Subject + Subject:Days)
m_sleep_indiv <- lmList(Reaction ~ Days | Subject, data = sleepstudy)
head(coef(m_sleep_indiv))
## (Intercept) Days
## 308 244.1927 21.764702
## 309 205.0549 2.261785
## 310 203.4842 6.114899
## 330 289.6851 3.008073
## 331 285.7390 5.266019
## 332 264.2516 9.566768
Example: Sleep Deprivation Data
With estimated individual level and trend added:
[Figure: per-subject reaction times with the estimated subject-specific levels and trends overlaid]
⇒ better fit
Motivation: From LM to LMM

I global model yij = β0 + β1 x1ij + εij :


I ignores within-subject correlation
⇒ variability of coefficients underestimated since correlated data
contain less information than independent data
⇒ invalid inference (tests, CIs)
⇒ complete pooling
I subject-specific models yij = β0i + β1i x1ij + εij :
I can be interpreted only with regard to the data in the sample
⇒ no generalization to “typical” subjects / population
I many many parameters to estimate
⇒ estimates may be unstable, imprecise
⇒ no pooling

116 / 398
Motivation: From LM to LMM

alternative representation of subject-specific models:
yij = β̄0 + (β0i − β̄0 ) + β̄1 x1ij + (β1i − β̄1 )x1ij + εij
with means of subject-specific parameters
β̄0 = (1/18) Σ_{i=1}^{18} β0i ;   β̄1 = (1/18) Σ_{i=1}^{18} β1i

117 / 398
Motivation: From LM to LMM
I idea of a random effect model
I β̄ is the population level effect β.
I express subject-specific deviations βi − β̄ as Gaussian random variables bi ∼ N(0, σb²).
I this yields
yij = β0 + β1 x1ij + b0i + b1i x1ij + εij
with εij ∼ N(0, σ²), (b0i , b1i )> ∼ N2 (0, Σ).
I or alternatively:
yij = b0i + b1i x1ij + εij
with εij ∼ N(0, σ²), (b0i , b1i )> ∼ N2 ((β0 , β1 )>, Σ).
(see the lme4 sketch below)

118 / 398
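A minimal sketch (not part of the original slides) fitting exactly this random-intercept-and-slope model to the sleepstudy data with lme4:

library("lme4")
# Reaction_ij = beta_0 + beta_1 Days_ij + b_0i + b_1i Days_ij + eps_ij
m_sleep_lmm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fixef(m_sleep_lmm)     # population-level beta_0, beta_1
VarCorr(m_sleep_lmm)   # estimated Sigma for (b_0i, b_1i) and residual sd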
Partial Pooling

I regression coefficients β0 , β1 , . . . in random effects models retain


their interpretation as population level parameters.
I subject-specific deviations from the population mean are modeled by
random effects – the implicit assumption is that subjects are a
random sample from the population of interest
⇒ partial pooling, with strength of pooling determined by random effect
variance.

119 / 398
Advantages of the Random Effects Approach
I decomposition of random variability in data into
I subject-specific deviations from population mean
I deviation of observations from subject means
⇒ more precise estimates of population trends
I some degree of protection against bias caused by drop-out
I random effects serve as surrogates for effects of unobserved
subject-level covariates
⇒ control for unobserved heterogeneity
I distributional assumption bi ∼ N stabilizes estimates b̂i (shrinkage
effect) compared to fixed subject-specific estimates β̂i without
distributional assumption
I intuition: estimates are stabilized by including prior knowledge in the
model, i.e., assuming that subjects from the population are mostly
similar to each other

120 / 398
Advantages of the Random Effects Approach
I random effects model the correlation structure between observations:
yij = β0 + bi + εij with bi i.i.d. ∼ N(0, σb²); εij i.i.d. ∼ N(0, σε²)
=⇒ Corr(yij , yij' ) = Cov(bi , bi ) / sqrt( Var(yij ) Var(yij' ) ) = σb² / (σb² + σε²).
yij = β0 + b0i + b1i tj + εij with (b0i , b1i )> i.i.d. ∼ N2 (0, Σ); εij i.i.d. ∼ N(0, σε²)
=⇒ Var(yij ) = σb0² + 2σb01 tj + σb1² tj² + σε²
=⇒ Cov(yij , yij' ) = σb0² + σb01 (tj + tj' ) + σb1² tj tj'
I independence between observations on different subjects is retained (for the kind of correlation structure we discuss here).
121 / 398
General Form of Linear Mixed Models

Linear Mixed Model:

y = Xβ + Ub + ε
b ∼ N(0, G)
ε ∼ N(0, R)

I U: design matrix for random effects


I independence between ε and b.
I entries in G, R determined by (co-)variance parameters ϑ
I we’ll focus on independent errors with R = σ 2 I

122 / 398
Conditional and Marginal Perspective

Conditional perspective:

y|b ∼ N(Xβ + Ub, R); b ∼ N(0, G)

Interpretation:
random effects b are subject-specific effects that vary across the
population.
Hierarchical formulation:
expected response is a function of population-level effects (fixed effects)
and subject-level effects (random effects).

123 / 398
Conditional and Marginal Perspective

Marginal perspective:

y ∼ N(Xβ, V) V = Cov(y) = UGU> + R

Interpretation:
random effects b induce a correlation structure in y defined by U and G,
and thereby allow valid analyses of correlated data.
Marginal formulation:
model is concerned with the marginal expectation of y averaged over the
population as a function of population-level effects.

The marginal model is more general than the hierarchical model.

generalized estimating equations: geepack

124 / 398
Linear Mixed Model for Longitudinal Data
For subjects i = 1, . . . , m, each with observations j = 1, . . . , ni :
yij = xij β + uij bi + εij ,   bi ∼ Nq (0, Σ)
⇔ y = Xβ + Ub + ε
with
I y = (y1>, y2>, . . . , ym>)> (n = Σ_{i=1}^m ni entries)
I ε = (ε1>, ε2>, . . . , εm>)> (n entries)
I β = (β0 , β1 , . . . , βp )>
I X = [1 x1 . . . xp ]
I b = (b1 , b2 , . . . , bm ) of length mq, with b ∼ Nmq (0, G)
I G = diag(Σ, . . . , Σ)
I U = diag(U1 , . . . , Um ) with dimension n × mq
I Ui = [1 u1i . . . u(q−1)i ] with dimension ni × q. Variables in Ui are typically a subset of those in X.
125 / 398
Other Types of Mixed Models

I hierarchical/multi-level model:
e.g., test score yijk of a pupil i in class j in school k:
yijk = β0 + xijk>β + b1j + b2k + εijk
with random intercepts for class (b1j ∼ N(0, σ1²)) and school (b2k ∼ N(0, σ2²))
I crossed designs:
e.g., score yij of a subject i on an item j:
yij = β0 + xij>β + b1i + b2j + εij
with random intercepts for subject (b1i ∼ N(0, σ1²), subject ability) and item (b2j ∼ N(0, σ2²), item difficulty)
(lme4 formulas for both designs are sketched below)

126 / 398
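A sketch (not from the original slides) of how these two designs would be specified in lme4; all variable names and the simulated data are hypothetical and only there to make the calls runnable (with pure-noise responses, expect boundary/singular-fit messages).

library("lme4")
# hierarchical: 200 pupils in 40 classes nested in 10 schools
pupils <- data.frame(
  score = rnorm(200), x = rnorm(200),
  school = factor(rep(1:10, each = 20)),
  class = factor(rep(1:40, each = 5))     # class labels unique across schools
)
m_nested <- lmer(score ~ x + (1 | school) + (1 | class), data = pupils)
# crossed: every subject answers every item
items <- expand.grid(subject = factor(1:30), item = factor(1:20))
items$score <- rnorm(nrow(items))
m_crossed <- lmer(score ~ 1 + (1 | subject) + (1 | item), data = items)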
Likelihood-Based Estimation of Linear Mixed Models
ML-Estimation
I determine ϑ̂ML so that the profile likelihood in V of the marginal model is maximal:
y ∼ N(Xβ, V(ϑ))
l(β, ϑ) = −(1/2){ log |V(ϑ)| + (y − Xβ)>V(ϑ)−1 (y − Xβ) }
β̂(ϑ) = arg max_β l(β, ϑ) = (X>V(ϑ)−1 X)−1 X>V(ϑ)−1 y
lP (ϑ) = −(1/2){ log |V(ϑ)| + (y − Xβ̂(ϑ))>V(ϑ)−1 (y − Xβ̂(ϑ)) } → max_ϑ
I for given ϑ, closed form solutions for β̂ and b̂.
I simple generalized least squares: b̂(ϑ̂) = GZ>V(ϑ̂)−1 (y − Xβ̂(ϑ̂)).
I Ĉov(β̂) and Ĉov(b̂) computable for tests, CIs.
127 / 398
Likelihood-Based Estimation of Linear Mixed Models
REML estimation:
I ML-estimates ϑ̂ are biased; unbiased variance component estimates from the marginal (“restricted”) likelihood of ϑ:
lR (ϑ) = log( ∫ L(β, ϑ) dβ ) ∝ lP (ϑ) − (1/2) log |X>V−1 X| → max_ϑ
I closed form solutions for β̂ and b̂ and their covariances given ϑ still apply
I both are tricky optimization problems:
I positivity constraints for most entries in ϑ
I computationally expensive, numerically unstable log-determinants
I SOTA implementation for large sub-class: mgcv
(REML vs. ML in lme4 sketched below)

128 / 398
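A small sketch (not from the original slides) of the ML/REML switch in lme4, reusing m_sleep_lmm from the sketch a few slides back; lmer() uses REML by default.

m_reml <- m_sleep_lmm                          # REML fit (lmer default)
m_ml <- update(m_sleep_lmm, REML = FALSE)      # ML fit: variance components tend to be too small
VarCorr(m_reml)
VarCorr(m_ml)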
Generalized Linear Mixed Models

I GLM generalizes LM via addition of a link function


I mapping the linear predictor to a range appropriate for the response
distribution,
I and linking the variance to the expected value in a way appropriate for
the response distribution.
I carries over directly for a generalized linear mixed model (GLMM):
E(y|b) = h(Xβ + Ub)
with known response function h()
I BUT: estimation much harder problem than for LMMs or GLMs,
especially for binary responses (more later).
I BUT: GLMMs can only be interpreted in the conditional/hierarchical
perspective. Use GEEs for marginal models.

129 / 398
Generalized Linear Mixed Models

Model:

y|b : yi |b ∼ Expo.fam.(E(yi |b) = h(Xβ + Zb), φ)


b|ϑ : b|ϑ ∼ N(0, G(ϑ))

130 / 398
Caveat: Effect Attenuation in GLMMs
[Figure: conditional subject-specific response curves h(xi β + b0i ) and the attenuated marginal curve, for an LMM (left) and a logit-GLMM (right)]
For random intercept logit-models: βmar ≈ βcond / sqrt(1 + 0.346 σb²)
131 / 398
GLMM Estimation
LMM estimation exploits the analytically accessible marginal likelihood:
L(β, ϑ, φ) = ∫ L(b, β, φ, ϑ) db
is the density of
y|β, φ, ϑ ∼ N(Xβ, ZG(ϑ)Z> + R(φ, ϑ)).
For GLMMs:
L(β, ϑ, φ) = ∫ ( Π_{i=1}^n f (yi |β, φ, b, ϑ) ) f (b|ϑ) db
(... sucks: no closed form, a high-dimensional integral over b)

132 / 398
GLMM Estimation Algorithms

I Laplace approximation based: iterate
1. Compute b̂ = arg max_b L(β, φ, ϑ, b) for given β, φ, ϑ via a penalized IWLS algorithm (P-IRLS).
2. Maximize a Laplace approximation L̃(β, φ, ϑ) of L(β, φ, ϑ) in b̂ (numerically, typically gradient based)
(mgcv, with lots of tricksy tricks; lme4 for large b)
I (Gaussian) quadrature based methods: more accurate, much slower (lme4, gamm4)
I penalized quasi likelihood: replace GLMM by LMM with IWLS working responses and weights. Biased, not guaranteed to converge, fairly fast. (nlme, mgcv::gamm)
I do (full) Bayes: flexible choice of effect distributions, hyperpriors, likelihoods; very slow (Stan: rstanarm, brms)
(see the glmer sketch below)

133 / 398
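A minimal sketch (not from the original slides) of the corresponding calls for a binary-response random-intercept model; the data frame dat is simulated here just so the calls run.

library("lme4")
set.seed(1)
dat <- data.frame(subject = factor(rep(1:20, each = 10)), x = rnorm(200))
dat$y <- rbinom(200, 1, plogis(0.5 * dat$x + rnorm(20)[dat$subject]))
# Laplace approximation (glmer default, nAGQ = 1):
m_laplace <- glmer(y ~ x + (1 | subject), family = binomial, data = dat)
# adaptive Gauss-Hermite quadrature with more nodes (single scalar random effect only):
m_agq <- glmer(y ~ x + (1 | subject), family = binomial, data = dat, nAGQ = 10)
# penalized quasi-likelihood would be e.g.:
# MASS::glmmPQL(y ~ x, random = ~ 1 | subject, family = binomial, data = dat)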
Mixed Models in a Nutshell
I standard regression models can model only the structure of the
expected values of the response
I mixed models are regression models in which a subset of coefficients
are assumed to be random unknown quantities from a known
distribution instead of fixed unknowns, and this means we can
I model the covariance structure of the data (marginal perspective)
I estimate (a large number of) subject-level coefficients without too
much trouble (conditional perspective)
I random intercepts can be used to model subject-specific differences in
the level of the response
→ grouping variable as a special kind of nominal covariate
I a random slope for a covariate is like an interaction between the
grouping variable and that covariate
→ grouping variable as a special kind of effect modifier for that
covariate
I hard estimation problems: variance components difficult to optimize,
often very high-dim. b
134 / 398
Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization


Penalization: Controlling smoothness
Smoothing Parameter Optimization
Generalized Additive Models
Surface Estimation
Varying coefficients

135 / 398
Splines

I Splines
I piecewise polynomials with smoothness properties at knot locations
I can be embedded into (generalized) linear models (e.g. ML estimates)
I Problem: choice of optimal knot setting.
I Two-fold problem:
I how many knots?
I where to put them?
I two possible solutions, one of them good:
I adaptive knot choice: make no. of knots and their positioning part of
optimization procedure
I penalization: use large number of knots to guarantee sufficient model
capacity, but add a cost (penalty) for wiggliness / complexity to
optimization procedure

135 / 398
Function estimation example: climate reconstruction

[Figure: climate reconstruction data (temperature anomaly vs. year, 0-2000), four panels with different function estimates]

136 / 398
Sensitivity to number of basis functions
[Figure: spline fits to the climate reconstruction data with 5, 10, 20 and 40 basis functions]

137 / 398
Penalized ML-Estimation

I Main idea: Add a penalty term to the likelihood for regularization:
lpen (f ) = l(f ) − λ pen(f ) → max_f .
I l(f ) measures fit of estimated function f (x) = B(x)θ to data
I penalty term pen(f ) ∈ R+_0 measures wiggliness of estimated function
I f smooth ⇒ pen(f ) small.
I f rough ⇒ pen(f ) large
I smoothing parameter λ ∈ R+_0 controls influence of penalty
I λ → 0: unpenalized estimate.
I λ → ∞: maximally smooth estimate (regardless of data)

138 / 398
Penalties
I Frequently used:
pen(f ) = ∫ (f''(z))² dz.
I since second derivative is curvature: “total wiggliness squared”, in some sense.
I minimal for linear functions.
I Less frequently used:
pen(f ) = ∫ (f'(z))² dz.
I since first derivative is slope: “total rate of change squared”, in some sense.
I minimal for constant functions.
I So:
pen(f ) = ∫ (f''(z))² dz or ∫ (f'(z))² dz
I order of derivatives expresses which kind of complexity to measure
139 / 398


Penalties

I penalty term can be simplified to quadratic form
pen(f ) = θ>Pθ,
with P = D>D and differencing matrix D, i.e. for first differences
D = [ −1  1
           −1  1
                ...
                  −1  1 ]  ∈ R^(K−1)×K

140 / 398
Penalized LS-Estimation

I penalized ML-criterion for Gaussian errors equivalent to
(y − Bθ)>(y − Bθ) + λθ>Pθ → min_θ .
I analytic solution: penalized LS-estimator
θ̂ = (B>B + λP)−1 B>y.
I penalized ML-estimate for non-Gaussian errors/targets via penalized IWLS:
θ̂(k+1) = (B>W(k) B + λP)−1 B>W(k) ỹ(k) .
(see the sketch below)

141 / 398
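A minimal sketch (not from the original slides) of this penalized LS estimator, assuming the B-spline design matrix B and the data x, y from the spline slides earlier; the value of lambda is chosen ad hoc.

D <- diff(diag(ncol(B)), differences = 2)       # 2nd-order difference matrix
P <- crossprod(D)                               # penalty matrix P = D'D
lambda <- 10
theta_pen <- solve(crossprod(B) + lambda * P, crossprod(B, y))
f_pen <- B %*% theta_pen                        # penalized fit at the observed x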
Penalization as Prior

I penalty term motivated pragmatically / heuristically


I we may want a probabilistic interpretation
⇒ use a prior distribution concentrated on simple, smooth functions
I pen(f ) = θ>Pθ is the (negative log-)kernel of that prior
I prior: θ ∼ N (0, λ2 P−1 )
I prior density p(θ; λ, P) = c(λ, P) exp(−λθ>Pθ)
I since posterior ∝ likelihood · prior: adding −λ pen(f ) to the log-likelihood is equivalent to using the prior above
I λ is an inverse scale parameter: more diffuse prior as λ → 0 (no effect of prior, no penalization)

142 / 398
Penalization as Prior

Another perspective:
I penalty/prior encodes assumption about likely differences between
coefficients of adjacent basis functions
I random walk prior: e.g., (θj+1 − θj ) ∼ N(0, λ2 )
I obvious question: what about the first d coefficients, for d-th
differences?

143 / 398
Penalization as Prior

I Since f (x) = Bθ and Gaussian distribution is stable w.r.t. linear


functions, we impose a (low-rank) Gaussian process prior on the
function estimate:

f (x) ∼ GP(0, λ2 B(x)P−1 B(x 0 )> )

I not really: P is not full-rank =⇒ P−1 does not exist, use


Moore-Penrose.
I improper prior: infinite variance for functions in null(P)
I interpretation: no regularization of maximally smooth functions
I intuition: polynomial functions up to order (difference order − 1) are estimated unregularized.

144 / 398
Penalization as Prior

I powerful idea: connects probabilistic models and function


approximation heuristics
I very useful for bringing probabilistic “machinery” to bear: penalized
inference equivalent to estimating coefficients that are random
quantities.
I very general idea: can be applied to any penalty that can be written
as quadratic form in coefficients.
I very general idea: can be applied to any type of effect estimation that
can be linearized in terms of a weighted sum of basis functions (over
arbitrary domains, for arbitrary outcomes)
I e.g.: any type of random effect with fixed correlation structure,
Markov Random Fields, and arbitrary combinations of them (cf.
tensor products, later)

145 / 398
Mixed Model Decomposition of Penalized Terms
Decompose a regularized effect into its penalized (“random”) and
unpenalized (“fixed”) components for better numerical stability & direct
applicability of mixed model algorithms:
Let h = rank(P) and p = dim(θ). Decompose
θ = X̃β + Z̃b,   with X̃ : p × (p − h), β : (p − h) × 1, Z̃ : p × h, b : h × 1.
Choose X̃ and Z̃ so that
I PX̃ = 0: β is not penalized by P,
I Z̃>PZ̃ = Ih , so that pen(θ) = b>b:
θ>Pθ = (X̃β + Z̃b)>P(X̃β + Z̃b) = b>Z̃>PZ̃ b = b>b.
=⇒ simple i. i. d. random effects b associated with design matrix BZ̃; fixed effects β with BX̃:
Bθ = (BX̃)β + (BZ̃)b   (unpenalized part + penalized part)
146 / 398
Mixed Model Decomposition of Penalized Terms

I X̃, Z̃ not unique
I for example: X̃ from a basis of null(P): eigenvectors of P with 0 eigenvalues.
I for example: Z̃ = L(L>L)−1 with P = ΓΩ+Γ> and L = ΓΩ+^{1/2}, where Γ is p × h and Ω+ is the h × h diagonal matrix of positive eigenvalues.
I (can use any matrix root of P for Z̃)
(numerical sketch below)

147 / 398
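A small numerical sketch (not from the original slides) of this decomposition for a 2nd-order difference penalty, using the eigen-decomposition route described above.

K <- 10
D <- diff(diag(K), differences = 2)
P <- crossprod(D)                               # penalty matrix, rank h = K - 2
eP <- eigen(P, symmetric = TRUE)
h <- sum(eP$values > 1e-8)
Xtilde <- eP$vectors[, (h + 1):K]               # basis of null(P): unpenalized part
L <- eP$vectors[, 1:h] %*% diag(sqrt(eP$values[1:h]))   # L = Gamma Omega_+^{1/2}
Ztilde <- L %*% solve(crossprod(L))             # Z~ = L (L'L)^{-1}
range(P %*% Xtilde)                             # ~ 0: the X~-part is not penalized
range(t(Ztilde) %*% P %*% Ztilde - diag(h))     # ~ 0: penalty reduces to b'b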
Influence of λ (1st differences)
[Figure: penalized spline fits to the climate reconstruction data with λ = 0.001, 1, 1000 (1st-order difference penalty)]

148 / 398
Influence of λ (2nd differences)
[Figure: penalized spline fits to the climate reconstruction data with λ = 0.001, 1, 1000 (2nd-order difference penalty)]

149 / 398
Optimizing smoothing parameters

I optimizing smoothing parameter instead of number, locations of knots


I much easier: positive scalar instead of variable length vector of
locations
I typical strategies:
I optimizing (pseudo-)predictive criteria: AIC, GCV, test set error, ...
I empirical Bayes: estimation via mixed models
I fully Bayes: hyperparameter λ with a suitable hyperprior.

150 / 398
Implementation in R

library("mgcv")
formula <- temp ~ s(year, bs = "ps", m = c(2, 2), k = 20)

gam1 <- gam(formula,


method = "REML", data = tempdata,
family = gaussian
)
gam2 <- gam(formula,
method = "GCV.Cp", data = tempdata,
family = gaussian
)

plot(gam1, residuals = TRUE)


plot(gam2, residuals = TRUE)

151 / 398
Implementation in R

[Figure: estimated smooth effects of year with partial residuals, for the REML fit and the GCV fit (edf 16.5 and 18.78)]
152 / 398
I gam similar syntax to glm:
I s(z) requests a smooth effect of z
I bs="ps" specifies a P-spline basis
I m=c(2,2) controls order of spline (spline order+1, m[1]) and
difference order of penalty (m[2]).
I k=20 controls number of basis functions
I method="REML" to control optimization criteria

153 / 398
I equivalent degrees of freedom (edf) measure complexity of a function estimate
I for the linear model,
tr( X(X>X)−1 X> ) = dim(β),
where H := X(X>X)−1 X> is the hat matrix with Hy = ŷ.
I equivalently, for the penalized fit:
edf(λ) = tr( B(B>B + λP)−1 B> ).
I for P-splines with d-th difference penalty:
d ≤ edf(λ) ≤ dim(θ).
(computation sketched below)

154 / 398
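A one-line sketch of this trace computation, with B, P and lambda assumed as in the penalized-LS sketch above:

edf <- sum(diag(B %*% solve(crossprod(B) + lambda * P, t(B))))
edf    # lies between d = 2 and dim(theta) = ncol(B)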
Generalized Additive Models
I Generalized Additive Models (GAM) extend generalized linear models as:
E(y|η) = h(η),   η = x>β + f1 (z1 ) + . . . + fq (zq ).
I f1 , . . . , fq are smooth effect functions modeled via penalized splines
I not identifiable as such:
f̃1 (z1 ) = f1 (z1 ) + c, f̃2 (z2 ) = f2 (z2 ) − c.
I so center effect functions:
∫ fj (zj ) dzj = 0, i.e., Σ_{i=1}^n fj (zji ) = 0.

155 / 398
GAM estimation

I estimate via maximization of penalized likelihood:
l(β, θ1 , . . . , θq ) − Σ_{j=1}^q λj θj>Pj θj .
I uses penalized IWLS
I multiple smoothing parameters to optimize: much harder optimization problem

156 / 398
I additivity possibly too strong assumption: ignores interactions
I More general:
η = x0 β + f (z1 , . . . , zq )

I ... infeasible even for fairly small q.


I but: can try to estimate at least lower-order interactions like f (z1 , z2 )
I interpret and visualize as effect surfaces. Direct analogy to and useful
for spatial effects.

157 / 398
Surface Estimation

I Two flavors:
I tensor product splines
I radial basis functions: basis functions over (subsets of) R2

158 / 398
Tensor Products

I linear models represent interaction effects via elementwise products of


the respective design matrix column vectors.
I analogous idea for surface estimation: create spline bases for z1 and
z2 and compute all pairwise products of their basis functions.
⇒ tensor product basis functions

159 / 398
Tensor Products

I define univariate bases b1^(1) (z1 ), . . . , bK^(1) (z1 ) and b1^(2) (z2 ), . . . , bK'^(2) (z2 ) for z1 and z2 .
I define tensor product basis for surface f (z1 , z2 ) by
Bjk (z1 , z2 ) = bj^(1) (z1 ) bk^(2) (z2 ), j = 1, . . . , K , k = 1, . . . , K'.
I represent surface as
f (z1 , z2 ) = Σ_{j=1}^K Σ_{k=1}^{K'} θjk Bjk (z1 , z2 ).

160 / 398
[Figure-only slides: 161 / 398 - 163 / 398]
Penalty terms for Tensor Product Splines

I need sensible penalties for tensor product P-splines
I consider coefficient vector arranged as a K × K array:
θ = (θ11 , . . . , θK1 , . . . , θ1K , . . . , θKK )>
I every row θr = (θ1r , . . . , θKr ) represents a univariate spline in z1 -direction
I every column θc = (θc1 , . . . , θcK )> represents a univariate spline in z2 -direction

164 / 398
I let P1 be the penalty matrix for a spline in z1 . Then
θr P1 θr>
quantifies wiggliness of the r -th row and “total row-wiggliness” is
Σ_{r=1}^K θr P1 θr> .
I more compact with Kronecker product:
Σ_{r=1}^K θr P1 θr> = θ>(I ⊗ P1 )θ.

165 / 398
I let P2 be the penalty matrix for a spline in z2 . Then
θc>P2 θc
quantifies wiggliness of the c-th column and “total column-wiggliness” is
Σ_{c=1}^K θc>P2 θc .
I more compact with Kronecker product:
Σ_{c=1}^K θc>P2 θc = θ>(P2 ⊗ I)θ.

166 / 398
I in combination we get a penalty term
θ>( λ1 I ⊗ P1 + λ2 P2 ⊗ I )θ =: θ>P(λ1 , λ2 )θ.
I and, as usual, we can optimize the penalized likelihood
l(θ) − θ>P(λ1 , λ2 )θ.
(mgcv sketch below)

167 / 398
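In mgcv, this anisotropic penalty (one λ per direction) is what te() constructs; a sketch with hypothetical simulated data, not taken from the original slides.

library("mgcv")
set.seed(1)
dat2 <- data.frame(z1 = runif(300), z2 = runif(300))
dat2$y <- sin(2 * pi * dat2$z1) * dat2$z2 + rnorm(300, sd = 0.2)
m_te <- gam(y ~ te(z1, z2, bs = "ps", k = c(8, 8)), data = dat2, method = "REML")
m_te$sp                                  # two smoothing parameters: lambda_1, lambda_2
vis.gam(m_te, plot.type = "contour")     # estimated surface f(z1, z2)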
Radial Basis Functions
I For a given knot κ = (κ1 , κ2 ), a radial basis function is defined as
Bκ (z) = B(||z − κ||).
I contour lines of such basis functions are circles ⇒ radial functions
I examples (with d = ||z − κ||):
B(d) = d² log(d)   (thin plate spline),
B(d) = d^k , k even,
B(d) = sqrt(d² + c²) with c > 0.
I penalty terms typically based on integrals of (squared) bivariate derivatives

168 / 398
Tensor Product Splines vs. Radial Basis Functions

I tensor products:
I invariant against linear covariable transformations
I allow for combination of covariables on different domains, scales, units.
I allow for anisotropy: different roughness over different axes.
I radial basis functions:
I invariant against rotations of covariate space
I useful for spatial/isotropic effects
I only have a single smoothing parameter

169 / 398
I surface estimates can represent interactions of metric covariates
I how to represent interactions between categorical and metric
covariates?

170 / 398
I model equation

η = . . . + u1 f1 (z1 ) + u2 f2 (z2 ) + . . .

where u1 , u2 etc are dummy variables for the different levels of a


categorical variable u
I separate effect functions for each level of u

171 / 398
I more generally, varying coefficient effects are written as

η = . . . + uf (z) + . . .

I effect of u varies over the domain of z.


I “effect modifier” z
I useful also, e.g. for time-varying effects etc:

η = . . . + uf (t) + . . .

172 / 398
I model representation: actually a simpler special case of tensor product bases:
I “basis” matrix for a categorical covariate u is a matrix of dummy variables
I “basis” matrix for the linear effect of a metric covariate u is simply u
I in both cases, the design matrix for the varying coefficient term is created by the tensor product of the effect modifier’s spline basis and the covariate’s design matrix.
I for f = (f (z1 ), . . . , f (zn ))> = Bθ, multiplication with u simply means
u · f = diag(u1 , . . . , un )Bθ :
a rescaled basis matrix.
I for categorical u, we do the same thing for each dummy variable, to end up with “copies” of the original spline basis that are set to zero in rows which don’t belong to the respective level of u
(mgcv sketch below)
173 / 398
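A sketch (not from the original slides) of how both variants map onto the by argument of mgcv::s(); the data are hypothetical and simulated only so the calls run.

library("mgcv")
set.seed(1)
n <- 400
dat3 <- data.frame(z = runif(n), u_num = rnorm(n),
                   u_fac = factor(sample(c("A", "B"), n, replace = TRUE)))
dat3$y <- with(dat3, u_num * sin(2 * pi * z) + rnorm(n, sd = 0.3))
# metric effect modifier: eta = ... + u_num * f(z)
m_vc1 <- gam(y ~ s(z, by = u_num), data = dat3, method = "REML")
# categorical u: one smooth f_level(z) per level of u_fac (plus the factor main effect)
m_vc2 <- gam(y ~ u_fac + s(z, by = u_fac), data = dat3, method = "REML")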
Part III

Functional Principal Component


Analysis

174 / 398
Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis

Functional PCA for Sparse Functional Data


Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis

Functional PCA for Sparse Functional Data

176 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data
Idea:
I Find normalized weight vectors φk ∈ Rp that maximize the sample variance of ξik = φk>xi for x1 , . . . , xn ∈ Rp .
I Identify most important modes of variation in the data

Definition
Given observations x1 , . . . , xn ∈ Rp with zero mean, the first principal component (PC) is defined by
φ1 = arg max_{||φ||2 =1} (1/n) Σ_{i=1}^n (φ>xi )².
The subsequent principal components φk can be found analogously subject to the additional orthogonality constraint
φj>φk = 0 for all j < k.

176 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data

Remarks:
I Principal components φk ∈ Rp have same length/structure as the
data
I Normalizing restriction kφk k2 = 1 makes sure that PCs are well
defined, but no unique specification (for any solution φ the vector
−φ is a solution, too)
I Orthogonality constraint φ>j φk = 0 ensures that φk indicates a new
mode of variation that is not explained by the preceding components
φ1 , . . . , φk−1
I PCA can be seen as a dimension reduction tool: use ξi = (ξi1 , . . . , ξiK ) with K ≪ p instead of the observed values xi = (xi1 , . . . , xip ).

177 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data
Alternative characterization:
Theorem (Multivariate PCA as eigenanalysis problem)
Let V = (1/n) X>X be the sample covariance matrix of (mean-centered) x1 , . . . , xn with eigenvalues ν1 ≥ ν2 ≥ . . . ≥ νm > 0. Then
φ1 = arg max_{||φ||2 =1} (1/n) Σ_{i=1}^n (φ>xi )² = arg max_{||φ||2 =1} φ>Vφ
is equivalent to the solution of
Vφ1 = ν1 φ1 , ||φ1 || = 1.
The principal components are hence orthonormal solutions of
Vφk = νk φk , k = 1, . . . , m.

178 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data

Remarks:
I Eigenanalysis of V is equivalent to singular value decomposition
(SVD) of X
I computationally vastly more efficient
I usually only need to compute first few leading singular vectors and
values
I very efficient algos using e.g. random projections, for sparse matrices
I variants for partially observed data (→ simple recommender systems)
I The eigenvalues ν1 , . . . , νm describe the amount of variability
explained by their principal components.
I Proportion of variability explained by the k-th principal component:
0 < νk / Σ_{j=1}^m νj ≤ 1.
(see the sketch below)

179 / 398
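A small base-R sketch (not from the original slides) illustrating the equivalence between the eigen-decomposition of V and the SVD of X, on simulated data.

set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)  # centered data
V <- crossprod(X) / nrow(X)                      # sample covariance matrix
eig <- eigen(V, symmetric = TRUE)
sv <- svd(X)
max(abs(abs(eig$vectors) - abs(sv$v)))           # ~ 0: PCs = right singular vectors (up to sign)
eig$values / sum(eig$values)                     # proportion of variability per PC
scores <- X %*% eig$vectors[, 1:2]               # first two PC scores xi_i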
Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis


Definition
Estimation in basis representation:
Example: FPCA
Regularized Functional Principal Component Analysis

Functional PCA for Sparse Functional Data

180 / 398
Functional Principal Component Analysis
Definition

Basic setting:
I Data generating process: smooth random function X (t) with
I unknown mean function µX (t) = E[X (t)]
I unknown covariance function vX (s, t) = Cov(X (s), X (t))
I Observed functions: x1 (t), . . . , xn (t)
I often given on a set of sampling points t1 , . . . , tp
I preprocessing, smoothing
I For simplicity, assume that the functions are centered, i.e.
µ̂X (t) = (1/n) Σ_{i=1}^n xi (t) ≡ 0

180 / 398
Functional Principal Component Analysis
Definition

Example: Berkeley growth study


[Figure: Berkeley growth curves (height vs. age) for girls and boys]
I 39 boys, 54 girls
I p = 31 measurements between 1 and 18 years (same timepoints for each child)
I measurements not equally spaced

181 / 398
Functional Principal Component Analysis
Definition

Functional PCA: Idea


Extend multivariate PCA to the functional case.

I Functional data xi (t) require functional weights φ(t)
I Scalar product for functional data
⟨φ, xi ⟩ = ∫ φ(t)xi (t)dt
induces the norm ||φ|| = ⟨φ, φ⟩^{1/2} = ( ∫ φ²(t)dt )^{1/2}

182 / 398
Functional Principal Component Analysis
Definition

Definition (Functional Principal Components)
The first functional principal component φ1 (t) of X is defined by
φ1 (t) = arg max_{||φ||2 =1} (1/n) Σ_{i=1}^n ( ∫ φ(t)xi (t)dt )².
The k-th functional principal component φk (t) is found analogously, subject to the additional constraint
∫ φj (t)φk (t)dt = 0 for all j < k.
The values ξik = ∫ φk (t)xi (t)dt are called functional principal component scores.

183 / 398
Functional Principal Component Analysis
Definition
Remarks:
I The definition of the functional principal components is equivalent to the multivariate case:
I vectors xi → functions xi (t)
I scalar product ⟨xi , φ⟩ = Σ_{j=1}^p xij φj → ⟨xi , φ⟩ = ∫ xi (t)φ(t)dt
I Functional principal components φk (t) are again only defined up to a sign change
I One can show that φk is the k-th eigenfunction of the sample covariance operator
(Vf )(s) = ∫ (1/n) Σ_{i=1}^n xi (s)xi (t) · f (t)dt = ∫ v̂X (s, t) f (t)dt.
The eigenvalue νk quantifies the amount of variability represented by the k-th functional principal component φk (t).
184 / 398
I Expand xi (t) in known basis functions bk (t):
xi (t) ≈ Σ_{k=1}^K θik bk (t)
I Estimation results in a standard matrix eigenanalysis problem involving the coefficients Θ = [θik ], i = 1, . . . , n, k = 1, . . . , K , and the basis functions b(t) = (b1 (t), . . . , bK (t)):
vX (s, t) = (1/n) b(s)Θ>Θ b(t)>
(V b(·)θ̃)(s) = ∫ vX (s, t) b(t)θ̃ dt = (1/n) b(s)Θ>ΘWθ̃   with W = [ ∫ bk (t)bk' (t)dt ]_{k,k'=1,...,K}
I eigenvectors uj of (1/n) W^{1/2}Θ>ΘW^{1/2}
I coefficient vectors for eigenfunctions φj (t) = b(t)θj with θj = W^{−1/2}uj
I results depend e.g. on the choice of the basis functions bk (t) and their number (numerical sketch below)
185 / 398
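A numerical sketch of this eigenanalysis (not from the original slides); the coefficient matrix Theta is simulated here just to make the code self-contained, and the integrals in W are approximated on a fine grid.

library("splines")
set.seed(1)
tgrid <- seq(0, 1, length.out = 501)
Bgrid <- bs(tgrid, df = 10, intercept = TRUE)                  # basis functions evaluated on a fine grid
W <- crossprod(Bgrid) * (tgrid[2] - tgrid[1])                  # W[k, k'] ~ int b_k(t) b_k'(t) dt (Riemann sum)
Theta <- scale(matrix(rnorm(50 * 10), 50, 10), scale = FALSE)  # n = 50 centered coefficient vectors (simulated)
eW <- eigen(W, symmetric = TRUE)
Wsqrt <- eW$vectors %*% diag(sqrt(eW$values)) %*% t(eW$vectors)   # symmetric square root W^{1/2}
M <- Wsqrt %*% crossprod(Theta) %*% Wsqrt / nrow(Theta)
eig <- eigen(M, symmetric = TRUE)                              # eig$values estimate the eigenvalues nu_j
theta_pc <- solve(Wsqrt, eig$vectors)                          # theta_j = W^{-1/2} u_j
phi_grid <- Bgrid %*% theta_pc                                 # eigenfunctions phi_j(t) = b(t) theta_j on the grid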
Functional Principal Component Analysis
Example: FPCA

Berkeley growth study: R-Code


library(fda)
data(growth)
age <- growth$age
height <- cbind(growth$hgtm, growth$hgtf)

# choose basis (cubic B-spline functions)


bb <- create.bspline.basis(age, norder = 4)

# create functional data object


growth.fd <- Data2fd(y = height, argvals = age, basis = bb)

# functional PCA
growth.pca <- pca.fd(growth.fd, nharm = 2, centerfns = T)

186 / 398
Functional Principal Component Analysis
Example: FPCA
Berkeley growth study: Principal components

0.4
0.2
0.0
−0.2
−0.4

PC 1
PC 2

5 10 15

Age

I Proportion of variance explained: 80.9% (PC 1), 13.6% (PC 2)


I Rather hard to interpret

187 / 398
Functional Principal Component Analysis
Example: FPCA
Interpretation:
I Karhunen-Loève representation:
xi (t) = µX (t) + Σ_{k=1}^∞ ξik φk (t)
The score ξik describes the weight of the k-th principal component for the i-th observation.
I Effect of the k-th principal component:
Assume new scores ξ̃j = ck for j = k and ξ̃j = 0 otherwise.
Then x̃(t) = µX (t) + Σ_{j=1}^∞ ξ̃j φj (t) = µX (t) + ck φk (t)
I Typical choices for ck : ±√νk , ±2√νk , as eigenvalues νk reflect the variability that is explained by φk
(see the sketch below)
188 / 398
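A sketch (not from the original slides) of how such “effect of PC k” curves can be computed from the pca.fd output; growth.pca and age are assumed from the R code two slides back.

k <- 1
mu_hat <- eval.fd(age, growth.pca$meanfd)                 # estimated mean function
phi_hat <- eval.fd(age, growth.pca$harmonics[k])          # k-th eigenfunction
c_k <- 2 * sqrt(growth.pca$values[k])                     # scaling by 2 * sqrt(nu_k)
matplot(age, cbind(mu_hat, mu_hat + c_k * phi_hat, mu_hat - c_k * phi_hat),
        type = "l", lty = c(1, 2, 2), xlab = "Age", ylab = "Height")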
Functional Principal Component Analysis
Example: FPCA
Berkeley growth study: Principal components
[Figure: effect of PC 1 and PC 2 on the growth curves]
I Effect shows µ̂X ± 2√νk φk .
I PC 1 as individual growth effect
189 / 398
Functional Principal Component Analysis
Example: FPCA

Dimension reduction:
I Use K -dimensional individual score vectors

ξi = (ξi1 , . . . , ξiK )

instead of the (infinite dimensional!) functions xi (t)


I If we choose K large enough, we lose only little information

190 / 398
Functional Principal Component Analysis
Example: FPCA
Berkeley growth study: Principal component scores


[Figure: scatterplot of the estimated scores on PC 1 vs. PC 2, girls and boys]
I PC 2 as gender-specific effect
191 / 398
Idea
Aim at smooth functional principal components
for better interpretation.

I smooth functions before doing FPCA:


e.g. via (penalized / low-rank) basis representation.
Simple, ad-hoc, can work well for regular data, scales well.
I penalize roughness of eigenfunctions:
for data in basis representation / on equidistant grids.
refund::fpca2s, refund::fpca.ssvd, fda::pca.fd
I smooth estimated covariance function before doing eigenanalysis:
computationally more challenging, applicable to sparse or irregular
data without pre-smoothing / basis representation.
refund::fpca.face, refund::fpca.sc
Enforcing smoothness of eigenfunctions acts as a low-pass filter:
leading eigenfunctions will be smoothed to represent only low-frequency
variation.
Truncated basis representation using only first few eigenfunctions can then
be used for smoothing.
192 / 398
Penalization of Eigenfunctions

I Penalization of second derivatives (cf. regularized regression)
PEN(φ) = ∫ ( φ''(t) )² dt
I Maximize penalized sample variance
PSV(φ) = 1 / ( ||φ||² + α PEN(φ) ) · (1/n) Σ_{i=1}^n ( ∫ φ(t)xi (t)dt )²
I Smoothing parameter α controls influence of the penalty.
(fda sketch below)

193 / 398
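In the fda package, such a roughness penalty on the eigenfunctions can be requested via the harmfdPar argument of pca.fd(); a sketch reusing growth.fd and the basis bb from the earlier R code. The penalty parameter (lambda below, playing the role of α) is chosen ad hoc.

harm_par <- fdPar(bb, Lfdobj = int2Lfd(2), lambda = 10)   # penalize squared 2nd derivative
growth.pca.smooth <- pca.fd(growth.fd, nharm = 2, harmfdPar = harm_par)
plot(growth.pca.smooth$harmonics)                         # smoothed eigenfunctions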
Influence of α:
[Figure: effect of PC 1 (top row) and PC 2 (bottom row) on the growth curves for α = 0.1, 10, 1000]
I An appropriate value of α can be found by cross-validation (one-curve-leave-out)
194 / 398
Smoothing the eigenfunctions

SSVD: Smooth SVD (Xiao et al. 2016, e.g.)


Simpler implementation for raw data on an equidistant grid:
Center the function evaluations X = [xi (tj )], i = 1, . . . , n, j = 1, . . . , T , then for k = 1, . . . , K do:
1. compute first right singular vector of X (first eigenfunction of X>X)
2. smooth with a simple difference penalty and cross-validated smoothing parameter λk to get φk = (φk (t1 ), . . . , φk (tT ))
3. compute loadings ξk
4. update X ← X − ξk φk

195 / 398
Smoothing the covariance surface

FACE: Fast Covariance Estimation (Xiao et al. 2016)


I uses lots of computational shortcuts to efficiently smooth covariance
surface via penalized tensor product splines: symmetry, array
arithmetic, clever optimization of smoothing parameter
I never actually computes covariance or tensor product
⇒ scales to large datasets with high resolution
I only easily applicable for regular data

196 / 398
R-Code: FACE & SSVD

library(refund)
growthmat <- rbind(t(growth$hgtm), t(growth$hgtf))
growth_face <- fpca.face(growthmat, argvals = growth$age, knots = 25, npc = 2)
growth_ssvd <- fpca.ssvd(growthmat, argvals = growth$age, npc = 2)

[Figure: first two estimated principal components from FACE (left) and SSVD (right) over age]

197 / 398
Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis

Functional PCA for Sparse Functional Data


Sparse Functional Data
Borrowing Strength across Observations
Functional Principal Component Scores as Conditional Expectations
Example

198 / 398
Functional PCA for Sparse Functional Data
Sparse Functional Data

Sparse functional data:


I Many functional data sets are sparse (or irregular)
I e.g. most longitudinal data sets
I Number and location of observation points ti1 , . . . , tiTi for each curve
may vary
I Can result in bad approximation for smoothed functions and scores ξik = ∫ (xi (t) − µX (t)) φk (t)dt, in particular if Ti is small (numerical integration fails!)

198 / 398
Functional PCA for Sparse Functional Data
Sparse Functional Data

Example: Sparsified Berkeley growth data


[Figure: four randomly selected growth curves (height vs. age) with their sparsified observations]
Artificial sparsification:
I Observations per child: Ti ∈ {2, . . . , 6}, median of 4
I Time points: tij ∈ {t1 , . . . , tTi }
I Figure shows sparse versions for 4 random children
I In total 370 observations (full dataset: 2883)

199 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations

Principal component Analysis through Conditional Expectation:


(Yao, H.-G. Müller, et al. 2005, PACE)
Idea
Develop a new FPCA method that is applicable to sparse functional data.

Basic setting:
I Use observation points directly without previous smoothing
I Account for additional measurement errors εij ∼ N (0, σ 2 )
I Model:
X∞
yij = xi (tij ) + εij = µX (tij ) + ξik φk (tij ) +εij
| {zk=1 }
Karhunen-Loève

200 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations
Estimation:
I Estimate mean and covariance functions using the pooled data
(”borrowing strength”)
I µ̂(t): by local linear smoother (or splines)
I v̂ (s, t) and σ̂²:
Cov(yij , yil ) = Cov(xi (tij ), xi (til )) + σ²δjl = vX (tij , til ) + σ²δjl
→ Smooth ”raw” covariances
v̂i (tij , til ) = (yij − µ̂(tij ))(yil − µ̂(til )); tij ≠ til
→ Consider diagonal values separately
I Estimate φk (t) and νk using the smoothed covariance estimate v̂X (tij , til )
201 / 398
Smoothing the crossproduct surface
“Interesting” problem:
I scales very badly: quadratic in n, Ti
I but: symmetric, so only “need” upper/lower triangle.

[Figure: covariance function estimates Ĉsq (left) and Ĉtr (right) for the cortical thickness data, with the locations (tij1 , tij2 ) of the raw products that are smoothed; below, scaled eigenfunction estimates λ̂j φ̂j , j = 1, 2, 3, 4, of the resulting estimated covariances (P. T. Reiss and Xu 2018; Cederbaum, Scheipl, et al. 2018)]
I surface estimate under positive-definiteness constraint!
202 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations
Berkeley Growth Study:
Mean estimation based on pooled data

[Figure: pooled sparsified observations (height vs. age) with the estimated mean function]

203 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations
Berkeley Growth Study: Sparsified Data
Covariance estimation based on pooled data (raw & smoothed):

[Figure: raw (left) and smoothed (right) covariance surface estimates over age × age]

I Diagonal values removed in raw covariance


204 / 398
Functional PCA for Sparse Functional Data
Functional Principal Component Scores as Conditional Expectations
Theorem (FPCA through conditional expectation)
Assuming ξik and εij to be independent and jointly Gaussian, the best prediction for ξik is given by
ξ̃ik = E(ξik | yi , T ) = νk φik> Σyi^{−1} (yi − µi ),
with Σyi = Cov(yi , yi | T ).
I Notation:
yi = (yi1 , . . . , yiTi )>, µi = (µX (ti1 ), . . . , µX (tiTi ))>
φik = (φk (ti1 ), . . . , φk (tiTi ))>, T = {tij , i = 1, . . . , n, j = 1, . . . , Ti }
I Intuition: ξ̃ik is the best prediction of the true score ξik given the observed values yi and the pooled information from all observation points (T ).
205 / 398
Functional PCA for Sparse Functional Data
Example

Berkeley growth study: R-Code


library("refund")
# calculate the first two principal components
sparsePCA <- fpca.sc(argvals = age, Y = height_sparse, npc = 2)

I The matrix height sparse contains the artificially sparsified heights


(row-wise).
I Data is fed directly into the function fpca.sc, no pre-smoothing
with basis functions as in fda package.

206 / 398
Functional PCA for Sparse Functional Data
Example
Berkeley growth study: Estimated principal components
Effect of PC 1 Effect of PC 2
200

200
−−
−−
180

180
−−−
−−− +
+−
−− ++−
+ + −−
+ −−

+− +−
+−
−− −
160

160
− −
+ + +++++++ + +−
− +
−− ++ +−+−
− + + −
− + +−
140

140

+ ++−
Height

Height
− + −
+ ++
− −

+ ++ +

120

120
− ++ +

− + −
+
+ −
+
100

100
− +
+ −
+
− +
−−+ +−
−−
+
−− ++
80

80
++ +−

++
60

60
5 10 15 5 10 15

Age Age

I Proportion of variance explained: 93.2% (PC 1), 6.7% (PC 2)


I Reminder: Principal components are only defined up to a sign change
I Results for first PC very similar to full data analysis, even using only
13% of the original data!
207 / 398
Functional PCA for Sparse Functional Data
Example
Berkeley growth study: Estimated scores


[Figure: estimated FPC scores (PC 1 vs. PC 2) for the sparsified data, girls and boys]

I PC 2: Separation of girls/boys less clear


208 / 398
Summary

PACE approach by Yao, H.-G. Müller, et al. (2005):


I Suitable for sparse functional data
I Estimation of mean and variance function based on all observations
(”borrowing strength”)
I Deals with (white-noise) measurement errors
I BLUP estimates of functional principal component scores via
conditional expectation

209 / 398
Summary
Functional PCA:
I Directly extends multivariate PCA to functional data
I Dimension reduction technique
I ”optimal” low-rank basis representation for given data:
most variance explained with smallest K
I eigenvalue decay gives indication of inherent complexity of the data
I ”low-pass” filter via truncated FPC basis representations
I clustering
I Key tool for exploring functional data and further analyses
I clustering, anomaly detection, ...
I supervised learning with functional features
⇒ simply use FPC scores ξi as scalar feature vectors (see the sketch below)
I Penalized, smoothed versions often show clearer effects and facilitate
interpretation
I Phase variation can be a problem: slow eigenvalue decay

210 / 398
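A sketch (not from the original slides) of that last point, reusing the sparse-FPCA fit sparsePCA from the example; the gender indicator is_girl is hypothetical and assumes the rows are ordered boys first, then girls, as in the earlier growthmat code.

is_girl <- rep(c(0, 1), times = c(39, 54))      # assumed row order: 39 boys, then 54 girls
scores <- sparsePCA$scores                      # n x npc matrix of estimated FPC scores
m_class <- glm(is_girl ~ scores, family = binomial)
summary(m_class)                                # FPC scores used as ordinary scalar features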
Extensions:
I “functional fragment” data: via (low-rank) matrix completion methods (Descary and Panaretos 2018, e.g.)
[Figure: electricity spot price curves over 24 hours: a subsample of the original dataset, a fragmented subsample (d = 0.5), and a fragmented, discretized subsample, together with the associated covariance surfaces]
211 / 398
Part IV

Background: Boosting

212 / 398
Introduction

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation
Introduction
Motivation
Why boosting?

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation

214 / 398
Aims and scope

I Consider a sample containing the values of a response variable y and the


values of some predictor variables x = (x1 , . . . , xp )>
I Aim: Find the “optimal” function f ∗ (x) to predict y
I f ∗ (x) should have a “nice” structure, for example,

f ∗ (x) = β0 + β1 x1 + · · · + βp xp (GLM) or
f ∗ (x) = β0 + f1 (x1 ) + · · · + fp (xp ) (GAM)

⇒ f ∗ should be interpretable

214 / 398
Example 1 - Birth weight data

I Prediction of birth weight by means of ultrasound measures


(Schild et al. 2008)
I Outcome: birth weight (BW) in g
I Predictor variables:
I abdominal volume (volABDO)
I biparietal diameter (BPD, ”cranium width”)
I head circumference (HC)
I other predictors (measured one week before delivery)
I Data from n = 150 children with birth weight ≤ 1600g
⇒ Find f ∗ to predict BW

215 / 398
Birth weight data (2)

I Idea: Use 3D ultrasound measurements (left) in addition to conventional 2D


ultrasound measurements (right)

www.yourultrasound.com, www.fetalultrasoundutah.com

⇒ Improve established formulas for weight prediction

216 / 398
Example 2 - Breast cancer data

I Data collected by the Netherlands Cancer Institute


(van de Vijver et al. 2002)
I 295 female patients younger than 53 years
I Outcome: time to death after surgery (in years)
I Predictor variables: microarray data (4919 genes) + 9 clinical variables
(age, tumor diameter, ...)
⇒ Select a small set of marker genes (“sparse model”)
⇒ Use clinical variables and marker genes to predict survival

217 / 398
Classical modeling approaches
I Classical approach to obtain predictions from birth weight data and breast
cancer data:
Fit additive regression models (Gaussian regression, Cox regression) using
maximum likelihood (ML) estimation
I Example: Additive Gaussian model with smooth effects (represented by
P-splines) for birth weight data
⇒ f ∗ (x) = β0 + f1 (x1 ) + · · · + fp (xp )
abdominal volume biparietal diameter
200

60
100

40
f(volabdo)

f(bpd)

20
0
−100

0
−200

−20

50 100 150 200 250 5 6 7 8 9

volabdo bpd

218 / 398
Problems with ML estimation

I Predictor variables are highly correlated


⇒ Variable selection is of interest because of multicollinearity
(“Do we really need 9 highly correlated predictor variables?”)
I In case of the breast cancer data: Maximum (partial) likelihood estimates
for Cox regression do not exist (there are 4928 predictor variables but only
295 observations, p ≫ n)
⇒ Variable selection because of extreme multicollinearity
⇒ We want to have a sparse (interpretable) model including the relevant
predictor variables only
I Classical methods for variable selection (univariate, forward, backward, etc.)
are known to be unstable and/or require the model to be fitted multiple
times: post-selection inference?

219 / 398
Boosting - General properties

I Gradient boosting (boosting for short) is a fitting method to minimize


general types of risk functions w.r.t. a prediction function f
I Examples of risk functions: Squared error loss in Gaussian regression,
negative log likelihood loss, quantile/pinball loss, ...
I Boosting generally results in an additive prediction function, i.e.,
f ∗ (x) = β0 + f1 (x1 ) + · · · + fp (xp )
⇒ Prediction function is interpretable
⇒ If run until convergence, boosting can be regarded as a more generally
applicable alternative to conventional fitting methods (Fisher scoring,
backfitting) for generalized additive (mixed) models.

220 / 398
Why boosting?

In contrast to conventional fitting methods, ...


... boosting is applicable to many different risk functions (absolute loss,
quantile regression)
... boosting can be used to carry out variable selection during the fitting process
⇒ No separation of model fitting and variable selection
... boosting is applicable even if p ≫ n
... boosting addresses multicollinearity problems (by shrinking effect estimates
towards zero)
... boosting directly optimizes prediction accuracy (w.r.t. the risk function)

221 / 398
Introduction

Definition and Properties of Gradient Boosting


Problem Statement
Functional Gradient Descent
Componentwise Gradient Boosting

Regularization & Selection

Implementation

222 / 398
Gradient boosting - estimation problem

I Consider a one-dimensional response variable y and a p-dimensional set of


predictors x = (x1 , . . . , xp )>
I Aim: Estimation of

  f ∗ := arg min_{f (·)} EXY [ρ(y , f (x))] ,

where ρ is a loss function that is assumed to be differentiable (almost


everywhere) with respect to a prediction function f (x)
I Examples of loss functions:
I ρ := (y − f (x))2 → squared error loss in Gaussian regression
I Negative log likelihood function of a statistical model
I ρ := (1 − τ )(f (x) − y ) if y < f (x), and τ (y − f (x)) if y ≥ f (x)
→ pinball loss for τ -quantile regression
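A minimal R sketch of these loss functions (illustration only):

rho_l2      <- function(y, f) (y - f)^2                 # squared error loss
rho_l1      <- function(y, f) abs(y - f)                # absolute loss (median regression)
rho_pinball <- function(y, f, tau) {                    # pinball/check loss for the tau-quantile
  ifelse(y >= f, tau * (y - f), (1 - tau) * (f - y))
}
risk <- function(y, f, rho, ...) mean(rho(y, f, ...))   # empirical risk: mean loss over the sample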

222 / 398
Gradient boosting - estimation problem (2)

I In practice, we usually have a set of realizations


X = (x1 , . . . , xn ), y = (y1 , . . . , yn ) of x and y , respectively
⇒ Minimization of the empirical risk

  R(f ) = (1/n) Σ_{i=1}^n ρ(yi , f (xi )) → min over f

I Example: R(f ) = (1/n) Σ_{i=1}^n (yi − f (xi ))² corresponds to minimizing the
  expected squared error loss
I optimization over a function space =⇒ we’re in trouble...

223 / 398
Naive functional gradient descent (FGD)

I Idea: use gradient descent methods to minimize


R(f ) = R(f(1) , . . . , f(n) ) w.r.t. f(1) = f (x1 ), . . . , f(n) = f (xn )
=⇒ optimization over standard vector space!
I Start with offset values fˆ_(1)^[0], . . . , fˆ_(n)^[0]
I In iteration m:

  fˆ_(i)^[m] = fˆ_(i)^[m−1] + ν · ( −∂R/∂f_(i) (fˆ_(i)^[m−1]) ),   i = 1, . . . , n,

  where ν is a step length factor

⇒ Principle of steepest descent

224 / 398
Naive functional gradient descent (2)
(Very) simple example: n = 2, y1 = y2 = 0, ρ = squared error loss

  =⇒ R(f ) = 1/2 [ (f_(1) − 0)² + (f_(2) − 0)² ]
  =⇒ ∂R/∂f_(i) (fˆ_(i)^[m−1]) = fˆ_(i)^[m−1]

[Figure: contour plot of z = f1² + f2² over (f1, f2) — steepest descent moves toward the minimum at (0, 0)]
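A minimal R sketch of this naive steepest descent for the toy example above:

y  <- c(0, 0)
f  <- c(3, -2)          # arbitrary starting values f^[0]
nu <- 0.1               # step length factor
for (m in 1:100) {
  grad <- f - y         # dR/df_(i) for the squared error loss (up to a constant)
  f <- f - nu * grad    # steepest descent update
}
f                       # converges towards y = (0, 0)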

225 / 398
Naive functional gradient descent (3)

I Increase m until the algorithm converges to some values
  fˆ_(1)^[mstop], . . . , fˆ_(n)^[mstop]

I Problem with naive gradient descent:

  I No predictor variables involved
  I Structural relationship between fˆ_(1)^[mstop], . . . , fˆ_(n)^[mstop] is ignored
    (fˆ_(1)^[m] → y1 , . . . , fˆ_(n)^[m] → yn)
  I “Predictions” only for observed values y1 , . . . , yn

226 / 398
Componentwise Gradient Boosting

I Solution: Estimate the negative gradient in each iteration


I Estimation is performed by some base-learning procedure regressing the
negative gradient on the predictor variables
[m ] [m ]
=⇒ base-learning procedure ensures that fˆ(1) stop , . . . , fˆ(n) stop are
predictions from a statistical model depending on the predictor variables.
=⇒ fˆ is a learnable function of x
I To do this, we specify a set of regression models (base-learners) with the
negative gradient as the dependent variable
I In many applications, the set of base-learners will consist of p̃ = p simple
regression models
(e.g. one univariate (linear) model for each of the p predictor variables)

227 / 398
Componentwise Gradient Boosting (2)
Functional gradient descent (FGD) boosting algorithm:
[0]
1. Initialize the n-dimensional vector f̂ with some offset values (e.g., ȳ ).
Set m = 0 and specify the set of base-learners.
Denote the number of base-learners by p̃.
2. Increase m by 1.

Compute the negative gradient −∂/∂f ρ(y , f ) and evaluate it at
fˆ^[m−1](xi ), i = 1, . . . , n.
This yields the negative gradient vector

  u^[m−1] = ( −∂/∂f ρ(y , f ) |_{y = yi , f = fˆ^[m−1](xi )} )_{i=1,...,n}

..
.

228 / 398
Componentwise Gradient Boosting (3)

..
.

3. Approximate the negative gradient u[m−1] by each base-learner specified in


Step 1 by a simple LS(!) fit.

This yields p̃ vectors, where each vector is an estimate of the negative


gradient vector u[m−1] in terms of (parts of) x.

Select the base-learner that fits u[m−1] best (→ min. SSE).


Set û[m−1] equal to the fitted values from the corresponding best model.

..
.

229 / 398
Componentwise Gradient Boosting (4)

..
.
[m] [m−1]
4. Update f̂ = f̂ + ν û[m−1] , where 0 < ν ≤ 1 is a real-valued step
length factor.
5. Iterate Steps 2 - 4 until m = mstop .

Hothorn, Bühlmann, et al. 2010, e.g.
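A minimal, self-contained R sketch of these steps for the squared error loss with componentwise simple linear base-learners (simulated data; illustration only, not the mboost implementation):

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - 1 * X[, 3] + rnorm(n, sd = 0.5)

nu <- 0.1; mstop <- 200
f    <- rep(mean(y), n)        # Step 1: offset
beta <- rep(0, p)              # aggregated coefficient per base-learner
for (m in 1:mstop) {
  u <- y - f                                     # Step 2: negative gradient (= residuals for L2 loss)
  b_hat <- sapply(1:p, function(j)               # Step 3: fit each base-learner by LS
    sum(X[, j] * u) / sum(X[, j]^2))
  sse <- sapply(1:p, function(j) sum((u - X[, j] * b_hat[j])^2))
  j_star <- which.min(sse)                       # best fitting base-learner
  f <- f + nu * X[, j_star] * b_hat[j_star]      # Step 4: update with step length nu
  beta[j_star] <- beta[j_star] + nu * b_hat[j_star]
}
round(beta, 2)   # informative variables dominate; uninformative ones are rarely selected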

230 / 398
Simple example
I In case of Gaussian regression, gradient boosting is equivalent to iteratively
re-fitting the residuals of the model.
I use a B-spline basis with 20 basis functions and ridge penalty as base-learner:

Data simulated from y = (0.5 − 0.9 e^(−50 x²)) x + 0.02 ε

[Figure: data with current model fit (left) and corresponding residuals (right) for boosting iterations m = 0 and m = 1]

231 / 398
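A hedged mboost sketch of this example (data simulated from the formula above; B-spline base-learner with 20 knots and fixed degrees of freedom):

library(mboost)
set.seed(1)
x <- runif(150, -0.25, 0.25)
y <- (0.5 - 0.9 * exp(-50 * x^2)) * x + 0.02 * rnorm(150)
dat <- data.frame(x = x, y = y)
fit <- gamboost(y ~ bbs(x, knots = 20, df = 4), data = dat,
                control = boost_control(mstop = 100, nu = 0.1))
plot(x, y)
lines(sort(x), predict(fit)[order(x)], col = 2)   # current fit after 100 iterations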

Properties of gradient boosting

I It is clear from Step 4 that the predictions of y1 , . . . , yn in iteration mstop


take the form of an additive function:
  f̂^[mstop] = f̂^[0] + ν û^[0] + · · · + ν û^[mstop −1]

I The structure of the prediction function depends on the choice of the


base-learners
I For example, linear base-learners result in linear prediction functions
I Smooth base-learners result in additive prediction functions with
smooth components
⇒ final fˆ[mstop ] (x) has a meaningful interpretation

232 / 398
Properties of gradient boosting

I gradient boosting can optimize any loss function via a series of simple LS
steps
=⇒ huge flexibility, scales well
I local linear approximation to loss surface good enough for ν ≪ 1

I The step length factor ν could be chosen adaptively. Legend has it that
adaptive strategies do not improve the estimates of f ∗ and lead to an
increase in running time
=⇒ set ν small (ν = 0.1) but fixed.
Fixed ν also required for (unbiased) variable selection, easy tuning.

233 / 398
Introduction

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation

234 / 398
Gradient boosting with early stopping

I Gradient boosting has a "built-in" mechanism for base-learner selection in
  each iteration.
⇒ This mechanism will carry out variable selection.
I Gradient boosting is applicable even if p > n.
I In case p > n, it is usually desirable to select a small number of informative
predictor variables (“sparse solution”).
I If m → ∞, the algorithm will select non-informative predictor variables.
=⇒ Overfitting can be avoided if the algorithm is stopped early, i.e., if
mstop is considered as a tuning parameter of the algorithm

234 / 398
Illustration of variable selection and early stopping
I Very simple example: 3 predictor variables x1 , x2 , x3 ,
[m]
3 linear base-learners with coefficient estimates β̂j , j = 1, 2, 3
I Assume that mstop = 5
I Assume that x1 was selected in iteration 1, 2, 5
I Assume that x3 was selected in iteration 3 & 4
  f̂^[mstop] = f̂^[0] + ν û^[0] + ν û^[1] + ν û^[2] + ν û^[3] + ν û^[4]
            = β̂0^[0] + ν (β̂0^[0] + β̂1^[0] x1 ) + ν (β̂0^[1] + β̂1^[1] x1 ) +
              ν (β̂0^[2] + β̂3^[2] x3 ) + ν (β̂0^[3] + β̂3^[3] x3 ) + ν (β̂0^[4] + β̂1^[4] x1 )
            = β̂0* + ν (β̂1^[0] + β̂1^[1] + β̂1^[4]) x1 + ν (β̂3^[2] + β̂3^[3]) x3
            = β̂0* + β̂1* x1 + β̂3* x3

=⇒ Linear prediction function


I x2 is not included in the model, since its base-learner was never selected
=⇒ variable selection
235 / 398
How should the stopping iteration be chosen?

I Use cross-validation techniques to determine mstop

(Hofner, Mayr, Robinzonov, et al. 2014)

⇒ The stopping iteration is chosen such that it maximizes prediction accuracy.
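A minimal sketch with mboost (assuming a fitted boosting model fit, e.g. from the gamboost sketch above):

# 25-fold bootstrap cross-validation of the empirical risk over boosting iterations
cvr <- cvrisk(fit, folds = cv(model.weights(fit), type = "bootstrap", B = 25))
plot(cvr)                 # out-of-bag risk as a function of m
mstop(cvr)                # prediction-optimal stopping iteration
fit <- fit[mstop(cvr)]    # set the model to the chosen stopping iteration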

236 / 398
Shrinkage

I Early stopping will not only result in sparse solutions but will also lead to
shrunken effect estimates (→ only a small fraction of û is added to the
estimates in each iteration).
I Shrinkage leads to a downward bias (in absolute value) but to a smaller
variance of the effect estimates (similar to Lasso or Ridge regression).
⇒ Multicollinearity problems are addressed.

237 / 398
Variable selection: complications & improvements

I selection is biased towards more flexible base learners:


better able to fit gradient in each iteration, so get picked as winners
more often
=⇒ need to handicap flexible base learners accordingly:
e.g., adding penalty with fixed number of edf for spline base learners
(Hofner, Hothorn, et al. 2011)
I theoretical guarantees on FWER using stability selection:
(Hofner, Boccuto, et al. 2015; Meinshausen and Bühlmann 2010)
Idea: Use inclusion frequencies on resampled datasets as a function of
m to determine a set of stable, relevant baselearners.

238 / 398
Introduction

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation

239 / 398
mboost

Package mboost:
I baselearners: linear, (tensor product) splines, trees, radial basis
functions, random effects, ...
I wide variety of loss functions
I parallelized cross validation, stability selection
I computationally fairly efficient: sparse matrix algebra, index
compression, array arithmetic for tensor product designs (I. D. Currie
et al. 2006)
I ... but creates huge model objects ...
(Hothorn, Buehlmann, et al. 2018)
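A sketch of how different base-learner types can be mixed in one model formula (the data set dat and the variables x1–x4, id are hypothetical):

library(mboost)
mod <- gamboost(y ~ bols(x1) +              # linear effect
                    bbs(x2, df = 4) +       # P-spline smooth effect
                    brandom(id) +           # ridge-penalized random intercept (id a factor)
                    btree(x3, x4),          # tree base-learner for an interaction
                data = dat, family = Gaussian(),
                control = boost_control(mstop = 500))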

239 / 398
gamboostLSS

Package gamboostLSS :
I extensions to models with multiple additive predictors
I loss is a negative log-likelihood, additive predictors for different
distribution parameters
I e.g.: model conditional variances and means
I e.g.: bivariate Poisson for modeling soccer scores (Groll et al. 2018):
model rates (attacking strengths) and association (tactic effects).
I mboost as computational engine
(Mayr et al. 2012)
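A minimal gamboostLSS sketch for a Gaussian location-scale model (the data set dat and its variables are hypothetical):

library(gamboostLSS)
# separate additive predictors for the mean (mu) and standard deviation (sigma)
mod <- gamboostLSS(list(mu    = y ~ bols(x1) + bbs(x2),
                        sigma = y ~ bbs(x2)),
                   data = dat, families = GaussianLSS(),
                   control = boost_control(mstop = 500, nu = 0.1))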

240 / 398
Summary

I very general, performant method for additive regression, classification


I yields regularized point estimates with feature selection
I uncertainty quantification and tuning require resampling
I good R implementation: mboost and its extensions
I many other variants (e.g. tree-based: gbm (Greenwell et al. 2019),
XGboost (T. Chen et al. 2019))

241 / 398
Part V

Functional Regression Models: Theory

242 / 398
Introduction

Model

Covariate Effects

Effect Representation

Estimation & Inference

Applications
Introduction
Motivation
Framework

Model

Covariate Effects

Effect Representation

Estimation & Inference

Applications

244 / 398
Functional Data

I Size and complexity of data on the rise. Increasingly, data collected


for which observations are curves.
I may be sampled on a (dense) regular grid or sparsely/irregularly.
I Examples: Spectroscopy, medical imaging, accelerometers,
longitudinal blood marker profiles, . . . .

244 / 398
Functional Data
[Figure: Diffuse Reflectance [%] over Wavelength [nm], colored by Tissue type (Corticalis, Nerve, S.Gland, Spongiosa)]

Average spectra for 12 animals and four tissue types.


245 / 398
Functional Data

[Figure: Total CD4 Cell Count over Months since seroconversion]

CD4 cell count trajectories in 366 HIV infected individuals.

246 / 398
Structured Functional Data

I More and more often, functional data exhibit additional structures


known from scalar data, e.g. longitudinal studies, spatial,
hierarchical or crossed designs.
I Examples: Longitudinal neuroimaging study on MS, precipitation
curves at spatial locations in Canada, . . . .

247 / 398
Structured Functional Data: Longitudinal

[Figure: Fractional Anisotropy profiles along the tract for Patient B, colored by visit 1–6]

248 / 398
Structured Functional Data: Spatial

[Figure: log(Precipitation) curves over Months at Canadian weather stations, colored by climate zone (Arctic, Atlantic, Continental, Pacific), with station locations on a map]

249 / 398
Non-Gaussian Functional Data

Data with latent functional structure:


I Time-series of counts, absence/presence, states
I Examples: Sleep studies (REM/Non-REM etc), feeding behavior of
pigs, . . . .

250 / 398
Application: The Piggy Panopticon
PIGWISE Project: RFID surveillance of pig behaviour (Maselyne et al. 2014)
I measure proximity to trough (yes-no) every 10 secs over 102 days for
100 pigs
I additionally: humidity, temperature over time
⇒ models of feeding behaviour potentially useful for ethology (porcine
sociology, clustering) & quality control (disease, quality of feed stock)
[Figure: trough-proximity indicator for pig 57 over time of day (02:00–22:00) and day (1–100)]

251 / 398
Functional Data Analysis

Questions of interest similar to those for scalar data:


I Regression with functional responses and/or functional covariates
I Quantification of uncertainty, structured variability
I Prediction
I (Not our focus right now: classification, clustering, description, . . . )
⇒ Functional data analysis

252 / 398
Aims and Means

I Functional data analogon of Generalized Additive Mixed Models:


Flexible, modular regression models with
I non-Gaussian functional or scalar responses
I multiple scalar and functional covariates
I linear and non-linear covariate effects and interactions
I correlated data (longitudinal/spatial/hierarchical)
I valid inference tests, CIs
I feature selection and model choice
I dense or sparse/irregular functional responses
I FPC- and spline-based techniques
I implementation in R packages: refund’s pffr(), FDboost
I Key idea:
Represent functional regression in terms of solved problems for
scalar data.

253 / 398
Key Idea

I Flexible, modular regression models for functional responses and/or


covariates
I Represent functional regression in terms of solved problems for
scalar data.
I Key idea for functional responses:
model observations within curves,
shift all functional structure into additive predictor → penalized scalar
regression
I model raw functional data, not basis representations
I use point-wise loss functions , integrated over functional domain
I recycle/adapt existing methodology and algorithms for scalar data
models
(Scheipl, Staicu, et al. 2015; Scheipl, Gertheiss, et al. 2016; Brockhaus, Scheipl, et al. 2015;
Greven and Scheipl 2017)

254 / 398
Introduction

Model
Generic Framework
Generalized Functional Additive Mixed Models

Covariate Effects

Effect Representation

Estimation & Inference

Applications

255 / 398
Generic Additive Regression Model
Observations (Yi , Xi ), i = 1, . . . , N, with
I Yi a functional (scalar) response over interval T = [a, b], [t, t]
I Xi a set of scalar and/or functional covariates.

Generic additive regression model


  ξ(Yi |Xi = xi ) = f (xi ) = Σ_{r=1}^R fr (xi ),

I ξ the modeled feature of the conditional response distribution, e.g.


expectation (with link function), median, a quantile, . . . .
I Partial effects fr (xi ) are real valued functions over T depending on
one or more covariates.

(Greven and Scheipl 2017)


255 / 398
Some transformation functions ξ and loss functions ρ
Model: ξ(Yi |Xi = xi ) = f (xi ) = Σ_{r=1}^R fr (xi )

Choose loss function ρ corresponding to transformation function ξ.


For scalars e.g.:
                          ξ                ρ(Y , h(x))
  mean regression         E                L2 -loss
  median regression       q0.5             L1 -loss
  quantile regression     qτ               check function
  generalized regression  g ◦ E            neg. log-likelihood
  GAMLSS                  (E, Var), e.g.   neg. log-likelihood

Loss for functional responses: Integrate loss ρ(Y , h(x))(t) over T .


Goal: Minimize the expected loss, the risk, w.r.t. f :
  ∫ ρ(Y , f (x)) dµ → min over f

Typically, dµ(t) = v (t)dt for some weight function v (t) ≥ 0.


256 / 398
Partial effects fr (x)
P
Model: ξ(Y |X = x) = f (x) = r fr (x)

covariate(s)                 type of effect                  fr (x)(t)
(none)                       smooth intercept                α(t)
scalar covariate z           linear effect                   zβ(t)
                             smooth effect                   γ(z, t)
functional covariate x(t)    linear concurrent effect        x(t)β(t)
functional covariate x(s)    linear functional effect        ∫_S x(s)β(s, t)ds
                             historical effect               ∫_{l(t)}^{u(t)} x(s)β(s, t)ds
                             smooth functional effect        ∫ f (x(s), s, t)ds
grouping variable g          functional random intercept     Bg (t)
g and scalar z               functional random slope         zBg (t)
curve indicator i            curve-specific smooth residual  Ei (t)

Plus interactions. No dependence on t for scalar responses.


257 / 398
Covariate effects: function-on-function

I concurrent effect x(t)β(t)
I linear effect of functional covariate ∫_S x(s)β(s, t) ds
I constrained effect of functional covariate ∫_{l(t)}^{u(t)} x(s)β(s, t) ds
  e.g. with limits [0, t] (historical effect) or [t − δ, t] (lag effect)

(Brockhaus, Melcher, et al. 2017; Brockhaus, Fuest, et al. 2018)

258 / 398
Generalized Functional Additive Mixed Models
Structured additive regression models of the general form

  yi (t)|Xi ∼ P(µi (t), ν)

  µi (t) = E (yi (t)|Xi ) = g ( Σ_{r=1}^R fr (Xri , t) )

I functional responses yi (t) over domain T ⊂ R, i = 1, . . . , n


observed on grids ti = (ti1 , . . . , tiTi ) ⊂ T
I P: exponential family, Beta, scaled t, Tweedie, . . .
I Xri a subset of covariate set X containing
I scalar covariates (metric or categorical)
I functional covariates
I grouping factors (random effects)
I known response function g (), additional nuisance parameters ν.

259 / 398
Application: Model
Model feeding rate
I for each day i = 1, . . . , 102 for a single pig
I as smooth function over time t
Response: binary feeding indicators ỹi (t) summed over 10min intervals

  yi (t) ∝ ∫_{t−10min}^{t} ỹi (s)ds

  yi (t)|Xi ∼ Bin(n = 60, p = µi (t))

  µi (t) = logit⁻¹ ( Σ_{r=1}^R fr (Xri , t) )
fr (Xri , t) could be:
I baseline rate (functional intercept)
I effect of humidity & temperature in pig sty
I aging effect over days
I day-specific effect (functional random effect)
I auto-regressive effect of earlier feeding behaviour

260 / 398
Introduction

Model

Covariate Effects
Motivation
Recap: Penalized Splines

Effect Representation

Estimation & Inference

Applications

261 / 398
Covariate Effects: Examples

Xr                    fr (Xr , t)
∅                     functional intercept β0 (t)
humidity hum(t)       linear functional effect ∫_{l(t)}^{u(t)} hum(s)β(s, t)ds;
                      smooth functional effect ∫_{l(t)}^{u(t)} f (hum(s), s, t)ds;
                      concurrent effects f (hum(t), t) or hum(t)βh (t)
yi (t)                (auto-regressive) functional effects ∫_{t−δ}^{t−1} yi (s)β(s, t)ds;
                      lagged effects f (yi (t − δ), t) or yi (t − δ)β(t)
hum(t), temp(t)       concurrent interaction effects,
                      e.g., f (hum(t), temp(t), t) or temp(t)β(hum(t))
scalar covariate i    aging effect iβ(t) or f (i, t)
(day indicator)       functional random intercepts bi (t)

261 / 398
Spline regression

Represent nonlinear effects as weighted sums of basis functions:

[Figure: "Unpenalized Splines" — B-spline bases and fits with K = 6, 12, 24 knots; more knots yield increasingly wiggly fits]
Penalized Splines
Use very flexible basis and add penalization for excessively wiggly fits:
=⇒ trade-off between goodness-of-fit and simplicity/generalizability

[Figure: penalized spline fits to the same scatterplot for λ = 1e−04, λ = 1, λ = 1000 and λ = λGCV]
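A minimal mgcv sketch of such a penalized spline fit, with the smoothing parameter chosen automatically by REML (simulated data):

library(mgcv)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
# rich P-spline basis; wiggliness controlled by the estimated smoothing parameter
fit <- gam(y ~ s(x, bs = "ps", k = 24), method = "REML")
fit$sp                     # estimated smoothing parameter lambda
plot(fit, shade = TRUE)    # fitted smooth with confidence band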
Tensor Product Splines

Source:http://www.web-spline.de/web-method/b-splines/tpspline.png
Introduction

Model

Covariate Effects

Effect Representation
Tensor Product Representation
Penalization
Spline Bases & Penalties
FPC Bases & Penalties

Estimation & Inference

Applications

266 / 398
Effect Representation

I model for nT -vector y = (y1 (t11 ), . . . , yn (tnT ))> (t i ≡ t here).


I rows of Xr contain Xri , i = 1, . . . , n
I Bxr and Btr contain evaluations of marginal bases over Xr and t,
respectively.
Linearize effect estimation via tensor product basis of basis functions in
Xr and t:

  fr (Xr )(t) ≈ ( Bxr ⊗ Btr ) θr = Br θr
   [nT × 1]    [n×Kx] [T×Kt]  [Kx Kt × 1]

266 / 398
Effect Representation

Slight complication for irregular data:


I t = (t 1 , . . . , t n )> with t i 6= t i 0
I subvectors yi of y with variable lengths Ti
I rows of Xr contain Xri , each repeated Ti times
=⇒

  fr (Xr )(t) ≈ ( ( Bxr ⊗ 1ᵀ_Kt ) · ( 1ᵀ_Kx ⊗ Btr ) ) θr = Br θr
   [nT × 1]     [nT×Kx] [1×Kt]    [1×Kx] [nT×Kt]        [Kx Kt × 1]

I also needed if Xri depends on t i (e.g. concurrent functional


covariates)

267 / 398
Tensor Product Representation and Penalization

Define a regularization term via the Kronecker sum of marginal penalty


matrices

pen(θr |λtr , λxr ) = θrT (λxr Pxr ⊗ IKt + λtr IKx ⊗ Ptr )θr
= θrT Pr (λtr , λxr )θr .

=⇒ very flexible:
I Combine any basis & penalty for Xr with any basis & penalty for t!
→ Huge variety available in pffr() via interface to mgcv.
I Penalization parameters λtr , λxr separately control the relative
complexity of effects over the functional domain and the covariate
space, respectively.
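A minimal R sketch of this Kronecker-sum penalty construction (basis dimensions and smoothing parameters are arbitrary illustration values):

Kx <- 5; Kt <- 8
Px <- crossprod(diff(diag(Kx), differences = 2))   # 2nd-order difference penalty over Xr-basis
Pt <- crossprod(diff(diag(Kt), differences = 2))   # 2nd-order difference penalty over t-basis
lambda_x <- 1; lambda_t <- 10
P <- lambda_x * kronecker(Px, diag(Kt)) +
     lambda_t * kronecker(diag(Kx), Pt)            # P_r(lambda_t, lambda_x)
dim(P)   # (Kx * Kt) x (Kx * Kt)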

268 / 398
Tensor Product Basis & Kronecker Sum Penalties
P ⊗ I, I ⊗ P: repeated penalties that apply to each subvector of θ
associated with a specific marginal basis function (Wood 2006a):
⊥
1 ·z1 ⊥1 ·F1 ∇1 ·z1 ∇1 ·F1 n1 ·z1 n1 ·F1

⊥ ·z ⊥1 ·F2 ∇1 ·z2 ∇1 ·F2 n1 ·z2 n1 ·F2
 ⊥11 ·z32 ⊥1 ·F3 ∇1 ·z3 ∇1 ·F3 n1 ·z3 n1 ·F3 
 ⊥2 ·z1 ⊥2 ·F1 ∇2 ·z1 ∇2 ·F1 n2 ·z1 n2 ·F1

 
 ⊥2 ·z2 ⊥2 ·F2 ∇2 ·z2 ∇2 ·F2 n2 ·z2 n2 ·F2
!
⊥1 ∇1 n1 
z1 F1
 
⊥2 ∇2  ⊥2 ·z3 ⊥2 ·F3 ∇2 ·z3 ∇2 ·F3 n2 ·z3 n2 ·F3
Bxr ⊗ Btr = n2
⊗ =
z2 F2

⊥3 ∇3 n3  ⊥3 ·z1 ⊥3 ·F1 ∇3 ·z1 ∇3 ·F1 n3 ·z1 n3 ·F1 
z3 F3
⊥4 ∇4 n4
 ⊥ ·z ⊥3 ·F2 ∇3 ·z2 ∇3 ·F2 n3 ·z2 n3 ·F2

 3 2 
 ⊥3 ·z3 ⊥3 ·F3 ∇3 ·z3 ∇3 ·F3 n3 ·z3 n3 ·F3 
 ⊥4 ·z1 ⊥4 ·F1 ∇4 ·z1 ∇4 ·F1 n4 ·z1 n4 ·F1 
⊥4 ·z2 ⊥4 ·F2 ∇4 ·z2 ∇4 ·F2 n4 ·z2 n4 ·F2
⊥ ·z ⊥4 ·F3 ∇4 ·z3 ∇4 ·F3 n4 ·z3 n4 ·F3
 P4 3 PFz

z
1    PFz PF
Pz PFz  Pz PFz 
IKx ⊗ Pt = 1 ⊗ PFz PF = PFz PF

1  
Pz PFz
PFz PF
P⊥ P∇⊥ Pn⊥
 
P⊥ P∇⊥ Pn⊥
P⊥ P∇⊥ Pn⊥
 
 P∇⊥ P∇ P∇n 
Px ⊗ IKt = P∇⊥ P∇ P∇n ⊗ (1 1) = P∇⊥ P∇ P∇n

Pn⊥ Pn∇ Pn  
Pn⊥ Pn∇ Pn
Pn⊥ Pn∇ Pn
Partial Effects fr (x)(t) as Latent GPs

Equivalent representation as (low-rank) Gaussian process prior:

  fr (Xr )(t) = Br θr ,   θr ∼ N ( 0, (Pr (λtr , λxr ))⁻ )

Latent low-rank Gaussian processes:

  fr (Xr )(t) | λtr , λxr ∼ GP ( 0, Br (Pr (λtr , λxr ))⁻ Brᵀ )

270 / 398
Choice of Bases and Penalties

Any suitable
I marginal basis Btr (e.g. B-splines)
I penalty Ptr (e.g. pth order difference matrix).
over support T .

Effects constant in t: Btr = 1nT and Ptr = 0.

271 / 398
Marginal Bases for Scalar & Concurrent Effects

Xr      effect                   Bxr                                   Pxr
∅       intercept β0 (t)         1_nT                                  0
z       linear: zβz (t)          (z1 , . . . , zn )ᵀ ⊗ 1_T             0
        nonlinear: f (z, t)      (spline basis over z) ⊗ 1_T           associated penalty
x(t)    linear: x(t)β(t)         vec( x ), x [n × T]                   0
        nonlinear: f (x(t), t)   spline basis over vec( x ), x [n × T] associated penalty

=⇒ Effects varying in t similar to varying coefficient terms in scalar


response models.

272 / 398
Marginal Basis for Functional Covariates
  ∫_S xi (s)β(s, t)ds ≈ Σ_{h=1}^H wh xi (sh )β(sh , t)
                      ≈ Σ_{h=1}^H wh xi (sh ) Σ_{ks=1}^{Kx} Σ_{kt=1}^{Kt} B(s)_ks (sh ) B(t)_kt (t) θ_{r,ks,kt}

  effect               Bxr                     Pxr
  ∫ x(s)β(s, t)ds      ((x ⊗ 1_T ) · W) Bs     penalty associated with Bs

Quadrature weights W = (w1 , . . . , wH )ᵀ ⊗ 1_nT ;
x = [xi (sh )]_{i;h} ;  Bs = [B(s)_ks (sh )]_{h;ks} .

I Modify W to get ∫_{li(t)}^{ui(t)} xi (s)β(s, t)ds:
  ⇒ historical model, lag-/lead-effects, auto-regressive terms.
I FPC-based effects of functional covariates available in pffr as well
(later).
(Ivanescu et al. 2015)
273 / 398
Marginal Basis for Functional Covariates

Fairly similar for nonlinear effects of functional covariates:


  ∫_S F (xi (s), s, t)ds ≈ Σ_{h=1}^H wh F (xi (sh ), sh , t)
                         ≈ Σ_{h=1}^H wh Σ_{kx=1}^{Kx} Σ_{kt=1}^{Kt} B(s)_kx (xi (sh ), sh ) B(t)_kt (t) θ_{r,kx,kt}

where the B(s)_kx (x(sh ), sh ) are radial basis functions or elements of another
tensor product basis.
(McLean et al. 2014, for scalar response)

274 / 398
Functional Random Effects

Xr effect Bxr Pxr

g random intercepts bgi (t) ∆ ⊗ 1T Pb

{g , z} random slopes zbgi (t) (diag(z)∆) ⊗ 1T Pb

i functional residuals ei (t) In ⊗ 1T In

I ∆ = [δgi m ]i;m : incidence matrix for levels m = 1, . . . , M = Kxr of g .


I Pb : precision matrix of any Gaussian (Markov) random field
modeling the dependency structure between levels of g
(e.g.: IM if bg (t) are i. i. d.).
I implies Gaussian process prior for bg (t) with low-rank covariance
  that is smooth in t and controlled by Pb⁻¹ between units.

275 / 398
Functional Random Effects
Issues:
I often fairly important model component: errors typically
autocorrelated and not homoscedastic along T
=⇒ need to capture somehow with ei (t) for valid conditional
inference
I typically require fairly large basis Btr to capture high frequency /
local behavior
=⇒ estimation scales very badly for g with many levels
I not (really) locally adaptive
I Pb must be fixed a priori
=⇒ no estimation of inter-level dependency structure, only of
relative variability between levels of g .

Partial solution: using a better basis → FPCs

276 / 398
Functional Random Effects: FPC representation

For (uncorrelated) functional random intercepts, can use Karhunen-Loève


expansion
  bgi (t) ≈ Σ_{k=1}^{Kt} ξgk φk (t),

I κk , φk (t): eigenvalues and -functions of covariance of bg (t)


I ξgk ∼ N (0, κk ): associated FPC loadings
I truncation lag Kt

effect Btr Ptr


functional random intercept bgi (t) 1n ⊗ (φ̂k (tl ))l;k diag(κ̂1 , . . . , κ̂Kt )

277 / 398
Functional Random Effects: FPC representation

How to get φ̂k etc?


I Simple case: functional random intercepts bgi (t), Gaussian model
I estimate model without bg (t) under working independence assumption
I compute grouped mean residual curves
I do FPCA based on covariance of grouped mean residual curves
I refit model with estimated FPC basis
I one iteration usually sufficient
I Cederbaum, Pouplier, et al. 2016: extensions for FPCA of
hierarchical, crossed, sparse/irregular Gaussian data

278 / 398
Functional Random Effects: FPC representation

Advantages over spline-based functional random effects:


I by definition, L2 -optimal low-rank representation for any given Kt
=⇒ can get away with much fewer basis functions (usually....)
I basis is learnt from data
=⇒ automatically locally adaptive
But: Choice of Kt ? Leave penalty fixed?

279 / 398
Functional Covariates: FPC representation
For xi (s) ≈ Σ_{k=1}^{Kx} ψk (s) ξik ,

  ∫_S xi (s)β(s, t)ds = Σ_k ξik ∫_S ψk (s)β(s, t)ds = Σ_k ξik β̃k (t)

⇒ sum of varying coefficient terms for FPC scores

                  Bxr                       Pxr
  linear effect   [ξ̂ik ] [n × Kx] ⊗ 1_T     0

I Extends to nonlinear FPC effects
  fr (xi (s), t) = Σ_{k=1}^{Kx} fk (ξ̂ik , t),   fr (xi (s), t) = Σ_{k; k′>k}^{Kx} fk,k′ (ξ̂ik , ξ̂ik′ , t)
  (sum of) nonlinear effects of synthetic covariates ξ̂k .
(H. Müller and Yao 2008)

280 / 398
Functional Covariates: FPC representation

+ always identifiable (more later)


+ β̃k (t) easier to interpret than β(s, t), at least for interpretable ψ̂k (s)
+ applicable to sparse/irregular functional covariates
− inference becomes conditional on estimated FPCs
− suitable truncation lag Kx ?

281 / 398
Introduction

Model

Covariate Effects

Effect Representation

Estimation & Inference


GFAMMs as GAMs

Applications

282 / 398
Penalized Estimation

We can then write the model for y = (y1 (t11 ), . . . , yn (tnTn )) as a


generalized linear model in the basis functions:

E(y|X ) = g (Bθ)

for a suitably concatenated design matrix B = [B1 | . . . |BR ] and


corresponding coefficient vector θ.

  −2 log (L(θ|y, B, ν)) + Σ_v λv θᵀ P̃v θ → min over θ     (1)

with B = [B1 | . . . |BR ], θ = [θ1ᵀ , . . . , θRᵀ ]ᵀ and P̃v the marginal penalties
suitably padded with zeros.

282 / 398
Penalized Likelihood & Mixed Effect Representation

I Criterion (1) equivalent to determining MLE in generalized linear


mixed model
mixed model

  y ∼ P ( g (Bθ) , ν ) ,   θ ∼ N ( 0, ( Σ_v λv P̃v )⁻ )

(Ruppert et al. 2003)


I Reparameterize to separate θ into fixed (unpenalized) and random
(penalized) coefficients with proper priors (finite variance)
(Wood 2006a; Wood, Scheipl, and Faraway 2012).
I Smoothing parameters via (approximate) REML: λv estimated as
variance components using restricted maximum likelihood or
Laplace-approximate marginal likelihood
(Wood 2011; Wood, Pya, et al. 2016).

283 / 398
This is also just a kind of varying coefficient model...

... effects vary over index of functional response:


 
I write as model for concatenated function values vec Y
n×T

I reformat covariate data accordingly & add index t as covariate


⇒ can be fitted like standard GAMM for scalar responses
I effects vary smoothly over t ⇒ smoothness of E (Y (t))
I sparse/irregular Y (t) possible

284 / 398
Inference is mostly a solved problem:

I use penalized splines for smooth effects, including linear functional


effects and functional random effects
I standard penalized likelihood inference: GCV or (RE)ML via
mixed model representation
I approximate CIs, tests, diagnostics, etc. immediately available
I refund’s pffr() as wrapper for mgcv and gamm4:
I optimized, robust, well-tested algorithms
I versatile library of spline bases ready to use:
cyclic, monotonicity, adaptive, P-splines, thin plate, etc.
I model effects with any given correlation structure via GMRFs

285 / 398
Advantages of Mixed Model Framework

Profit from decades of methodological development:


I modularity
I flexibility
I unified framework for smoothing parameter and random effect
estimation:
Smoothing parameters as variance components via (approximate)
REML (Wood 2011)

I approximate confidence bands


(Ruppert et al. 2003; Wood 2006a; Marra and Wood 2011)
I tests e.g. for constant or zero effects (Wood 2012)

I model selection (Saefken et al. 2014)

286 / 398
Introduction

Model

Covariate Effects

Effect Representation

Estimation & Inference

Applications
Application 1: Piggy Panopticon
Application 2: Nutrition & ICU survival

287 / 398
Application: The Piggy Panopticon
PIGWISE Project: RFID surveillance of pig behaviour (Maselyne et al. 2014)
I measure proximity to trough (yes-no) every 10 secs over 102 days for
100 pigs
I additionally: humidity, temperature over time
⇒ models of feeding behaviour potentially useful for ethology (porcine
sociology, clustering) & quality control (disease, quality of feed stock)
[Figure: trough-proximity indicator for pig 57 over time of day (02:00–22:00) and day (1–100)]

287 / 398
Example: Model

Auto-regressive model with random day effects for pig 57:


  logit (µi (t)) = β0 (t) + bi (t) + ∫_{t−3h}^{t−10min} yi (s)β(t, s)ds

I periodic P-spline for β0 (t)
I random day effects bi (t) ∼ i.i.d. GP(0, K (t, t′)) for day i.
I fit takes about 1min (5 smoothing parameters, nT = 14544, p = 850)

288 / 398
Example: Model Fit
µ̂(t) & y (t) for selected days (training data)
[Figure: fitted µ̂(t) and observed y (t) over the day (t in hours) for a grid of selected training days]

289 / 398
Example: Model Predictions
µ̂(t) & y (t) for selected days (test data)
[Figure: predicted µ̂(t) and true y (t) over the day (t in hours) for a grid of selected test days]

290 / 398
Example: Estimates
[Figure: estimated functional intercept β̂0 (t) and estimated random day effects b̂i (t) over the day]

291 / 398
Example: Estimates
β̂(t, s); max. lag= 3 h
[Figure: estimated historical effect surface β̂(t, s) with upper/lower pointwise CIs, plotted over t and the lag s − t]

292 / 398
Example: Alternative Model

I many alternative model specifications feasible/sensible


I slightly better in terms of predictive accuracy on test set (Brier
Score):
  logit (µi (t)) = β0 (t) + f (i, t) + ∫_{t−3h}^{t−10min} yi (s)β(t, s)ds

with smooth effect surface f (i, t) to capture aging effect.

293 / 398
Example: Alternative Model

[Figure: estimated β̂0 (t), smooth aging surface f̂ (i, t) over day i and time t, and the historical effect surface β̂(t, s) (max. lag 3 h), each with upper/lower pointwise CIs]

294 / 398
Exposure-Lag-Response Association (ELRA):
Nutrition & ICU Survival
I multi-center study of critical care patients from 457 ICUs
(≈ 10k patients)
I investigate acute mortality (first 30d)
I confounders z: age, gender, Apache II Score, year of admission, ICU
random effect, ...
I 12-day nutrition record xi (s)
I prescribed calories (determined at baseline)
I daily caloric intake
I daily caloric adequacy (CA)= caloric intake/prescribed calories
[Figure: caloric adequacy (%) over protocol days 1–11 for a sample of patients]

(Bender, Scheipl, et al. 2018; Bender, Groll, et al. 2018)

295 / 398
Clinical Questions & Parameterization
I Importance of “adequate” nutrition after trauma:
catabolic/anabolic states? metabolic stress?
I How much is “adequate”, “necessary”?
I Importance of timing & amount of nutrition during critical, acute &
recovery phases unclear

Model Assumptions:
delayed, time-limited, cumulative & time-varying effect of time-varying
exposure xi (s)

Idea:
(partial) effect of xi (s) on log-hazard at time t:
  ∫_{W (t)} xi (s)β(t, s)ds

I delay & time limit defined by W (t)

296 / 398


Piece-wise Exponential Model

I define cut-points on the time line: 0 = κ0 < . . . < κJ = tmax
I assume constant hazard rates in each interval [κj−1 , κj )
=⇒ likelihood for event indicators yij = 1 if event in [κj−1 , κj ), and yij = 0 else,
   is proportional to a Poisson likelihood (Friedman, 1982)
I for t ∈ [κj−1 , κj ), piecewise constant hazard rate for event process
λ(t; zi , xi ) is rate of pseudo-Poisson responses yij !
=⇒ ≈ a model for pseudo-Poisson functional responses yi (tj )

297 / 398
Piece-wise Exponential Model

Time-to-event model reparameterized as Poisson-GAMM:


  log (λ(t|zi , xi )) = α0 (t) + Σ_j hj (zi , t) + ∫_{W (t)} xi (s)β(t, s)ds

I estimated with mgcv (could use FDboost as well)


I flexible, semi-parametric modeling of baseline hazard rate α0 (t)
I use GFAMM framework for functional covariates to perform inference,
e.g. about nutrition effects
I flexible confounder adjustment: non-linear & time-varying effects;
time-varying covariates, random effects / frailties, . . .

298 / 398
ELRA: Example
Compare hazard of patient with given nutrition record to constant
undernutrition (e.g., ceteris paribus):
[Figure: the patient's caloric adequacy (%) over protocol days 1–11 and the resulting hazard ratio estimate over follow-up time t]

I’m leaving out a LOT of complications here...

299 / 398
Part VI

Functional Regression Models:


Implementation

300 / 398
GAMM Algorithm

refund

Componentwise Gradient Boosting

FDboost
GAMM Algorithm
GAMMS as GLMs as LMs
Estimation Algorithm

refund

Componentwise Gradient Boosting

FDboost

302 / 398
Introduction

I Generalized functional additive mixed models (GFAMM)


mathematically equivalent to varying coefficient models for scalar data
=⇒ can reuse, adapt, recycle scalar data methodology
I SOTA R implementation of GAMs: mgcv (N 2019)
I GFAMM implemented in package refund (Goldsmith, Scheipl, et al.
2018): wrapper around mgcv

302 / 398
MLE for GAMMs

  yi |xi ∼ EF(µi , ν);   g (µi ) = Bi θ

I B contains evaluated basis functions, θ the combined coefficient vector
I combined penalty Σ_v λv θvᵀ Pv θv with smoothing parameters λv
I MLE: θ̂ = arg max_θ ℓP (θ) = arg max_θ { ℓ(θ) − Σ_v λv θvᵀ Pv θv }
I equivalent to this penalized likelihood: priors θv ∼ N ( 0, (λv Pv )⁻ )
I ... this also covers most random effects ...

303 / 398
IWLS

I GLMs estimated by Fisher-Scoring:

  θ̂^(k+1) = θ̂^(k) − ( E [ ∂²ℓP (θ) / ∂θ∂θᵀ ] )⁻¹ ∂ℓP (θ)/∂θ , evaluated at θ̂^(k)

I reduces to iteratively solving a penalized working linear model (IWLS):

  ỹ^(k) = Bθ + ε,   Cov(ε) = (W^(k))⁻¹ ν,   E(ε) = 0

  for diagonal W, with
  I working observations ỹi^(k) = g′(µ̂i^(k)) (yi − µ̂i^(k)) + η̂i^(k)
  I (W^(k))ii⁻¹ = V(µ̂i^(k)) g′(µ̂i^(k))² ,   ηi = g (µi )
  I V determined by likelihood.

304 / 398
mgcv Algorithm

I estimation of θ given λ fairly trivial


I much harder: optimizing λ
I in mgcv: based on (restricted) marginal approximate likelihood

  ℓ̃(λ) ∝ ( ‖ỹ − Bθ̂λ ‖² + θ̂λᵀ Pλ θ̂λ ) / ν + log |Bᵀ WB + Pλ | − log |Pλ |

I inner loop:
run IWLS to convergence for θ̂λ for each evaluation of `(λ)
I outer loop:
optimize `(λ)

305 / 398
mgcv Algorithm: Outer Loop
  ℓ(λ) ∝ ( ‖ỹ − Bθ̂λ ‖² + θ̂λᵀ Pλ θ̂λ ) / ν + log |Bᵀ WB + Pλ | − log |Pλ |

Ioptimization/evaluation ugly: log-determinants numerically unstable


(log of zero!), expensive
=⇒ optimize `(λ) without evaluating it
=⇒ optimize `(λ) over ρ = log(λ) via Newton steps: only need 1st, 2nd
derivatives
I step length control via newton step-length only, no re-evaluation
I avoids log-determinant problem, e.g.:

  ∂ log |Bᵀ WB + Pλ | / ∂ρj = tr( ( Bᵀ WB + Σ_r λr Pr )⁻¹ Pj ) λj

I computation via pivoted, blockwise Cholesky: parallelizable in


OpenMP.
306 / 398
mgcv Algorithm: Discretisation

I computing Bᵀ WB is O(np²), for B [n × p]
I much more efficiently computable if B only has m ≪ n distinct rows
  (Lang et al. 2014)
I always given for Bj basis for a given covariate: much fewer distinct
  covariate values than observations, if not, binning.
I write entries Bj,il = B̃j,kj(i)l , where kj (i) maps observations i to
  distinct covariate values, so B̃j [m × p]

=⇒ much cheaper crossproducts:

  Bjᵀ w = B̃jᵀ w̃   with   w̃l = Σ_{kj(i)=l} wi

=⇒ O(n + mj pj ) instead of O(npj )

307 / 398
mgcv Algorithm: Discretisation

I extends to crossproducts and quadratic forms of compressed design


matrices
I extends to multiple smooth terms, tensor product smooths (Wood, Z. Li,
et al. 2017)
I computations for large p orders of magnitude faster (Z. Li and Wood 2019)
I in mgcv::bam(): option discrete
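A minimal sketch of the discretized fitting routine (the data set dat with response y, covariates x0, x1 and grouping factor g is hypothetical):

library(mgcv)
# fast fitting for large n: covariate discretization + fREML + multithreading
b <- bam(y ~ s(x0) + s(x1) + s(g, bs = "re"),
         data = dat, method = "fREML", discrete = TRUE, nthreads = 2)
summary(b)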

308 / 398
mgcv Alternative Inference Algorithms

mgcv implements a variety of alternatives to LAML maximization:


I alternative REML-type model fitter better suited for large, sparse
design matrices: gamm4::gamm4() (Wood and Scheipl 2017)
I penalized likelihood inference via extended Fellner-Schall iteration:
(Wood and Fasiolo 2017)
faster, simpler, potentially less accurate.
I approximate fully Bayesian inference via integrated nested Laplace
approximation (mgcv::ginla())
I fully Bayesian inference via JAGS (mgcv::jagam())

309 / 398
GAMM Algorithm

refund
Scalar responses: pfr
Functional responses: pffr

Componentwise Gradient Boosting

FDboost

310 / 398
refund

I fairly large collaborative software project, mainly Johns Hopkins,


Columbia University, LMU
I definitely mostly research quality software.... caveat emptor
I FPCA: fpca.sc, fpca.face, fpca.ssvd, fpca.2s
I functional regression:
I pffr(): penalized function-on-function regression for functional
responses and scalar and/or functional covariates.
I pfr(): penalized functional regression for scalar responses and scalar
and/or functional covariates.

310 / 398
refund::pfr

I fairly lightweight wrapper around mgcv’s model fitting functions


I defines some additional term types for functional covariates.
I formula-based model definition adapted from mgcv:
  e.g. E(y |x) = β0 + ∫ x1 (s)β1 (s)ds + f (x2 )
  becomes y ~ 1 + lf(x1) + s(x2)

  effect                                       syntax
  linear functional effect ∫ x(s)β(s)ds        lf(x)
  smooth functional effect ∫ F (x(s), s)ds     af(x)
  FPC-based effect ∫ x(s)β(s)ds                fpc(x)

Additional arguments control the basis representation of β(s), FPC parameters, etc.
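A minimal sketch along the lines of the refund package examples, using the DTI data shipped with refund:

library(refund)
data(DTI)
DTI1 <- DTI[DTI$visit == 1 & !is.na(DTI$pasat) & complete.cases(DTI$cca), ]
# scalar-on-function regression: PASAT score on the CCA tract profile
fit <- pfr(pasat ~ lf(cca, k = 30, bs = "ps"), data = DTI1)
plot(fit)   # estimated coefficient function beta(s)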

311 / 398
refund::pffr
I wrapper for GFAMMs around mgcv’s model fitting functions
I defines additional term types for functional covariates and formula
  specials for functional responses
I formula-based model definition adapted from mgcv:
  e.g. E(y (t)|x) = β0 (t) + ∫ x1 (s)β1 (s, t)ds + x2 β2 (t) + f (x3 )
  becomes y ~ 1 + ff(x1) + x2 + c(s(x3))
I by default, all effects vary over t
  → tensor product representation of effects
  =⇒ all effects available for scalar responses/covariates in mgcv usable for
  functional responses (... almost)
I constant effects wrapped in c()
I specification of basis over t in arguments bs.yindex, bs.int

I irregular responses possible


I modified identifiability constraints for better interpretability of
functional effects:
  Σ_i f̂j (Xi , t) = 0 ∀ t   instead of mgcv default   Σ_{i,t} f̂j (Xi , t) = 0
I location-scale models via family = "gaulss" for heteroskedastic
data (more later)
312 / 398
pffr terms syntax

effect                                                    syntax
linear functional effect ∫ x(s)β(s, t)ds                  ff(x)
smooth functional effect ∫ F (x(s), s, t)ds               sff(x)
FPC-based effect ∫ x(s)β(s, t)ds as Σ_k ξ̂k β̃k (t)         ffpc(x)
FPC-based random effects bgi (t)                          pcre(g, ...)
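A minimal pffr sketch on simulated function-on-function data (the data-generating process and all names below are illustration only):

library(refund)
set.seed(1)
n <- 60; S <- seq(0, 1, length = 40); Tg <- seq(0, 1, length = 30)
X1   <- matrix(rnorm(n * length(S)), n, length(S))              # functional covariate x_i(s)
beta <- outer(S, Tg, function(s, t) 2 * sin(pi * s) * cos(pi * t))
Y    <- X1 %*% beta * diff(S)[1] +                              # approx. of int x_i(s) beta(s,t) ds
        matrix(rnorm(n * length(Tg), sd = 0.1), n, length(Tg))
dat <- data.frame(id = 1:n); dat$Y <- Y; dat$X1 <- X1           # data frame with matrix columns
m <- pffr(Y ~ ff(X1, xind = S), yind = Tg, data = dat)
summary(m)
plot(m, pages = 1)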

313 / 398
GAMM Algorithm

refund

Componentwise Gradient Boosting


Componentwise Gradient Boosting
mboost

FDboost

314 / 398
Introduction

I Generalized functional additive mixed models (GFAMM)


mathematically equivalent to varying coefficient models for scalar data
=⇒ can reuse, adapt, recycle scalar data methodology
I SOTA R implementation for componentwise gradient boosting:
mboost (Hothorn, Buehlmann, et al. 2018)
I Boosted GFAMM implemented in package FDboost (Brockhaus and
Ruegamer 2018): wrapper around mboost and gamboostLSS
(Hofner, Mayr, Fenske, et al. 2018)

314 / 398
Component-wise gradient boosting
I Boosting is an ensemble method that aims at minimizing the
expectation of a loss criterion.
I The predictor is iteratively updated along the steepest gradient with
respect to the components of an additive predictor (functional
gradient descent).
I Model represented as a sum of simple (penalized) regression models,
the base-learners, fitted to the negative gradients by OLS in each
step.
I In each boosting iteration only the best fitting base-learner is updated
(component-wise) with step-length ν.
I For functional response regression, response and predictors are
functions over T .

(Hothorn, Bühlmann, et al. 2010; Brockhaus and Ruegamer 2018)

315 / 398
Some transformation functions ξ and loss functions ρ
Model: ξ(yi |Xi = xi ) = f (Xi ) = Σ_{r=1}^R fr (xri )

Choose loss function ρ corresponding to transformation function ξ.


For scalars e.g.:

ξ ρ(Y , h(x))
mean regression E L2 -loss
median regression q0.5 L1 -loss
quantile regression qτ check function
generalized regression g ◦E neg. log-likelihood
GAMLSS vector of Q par. neg. log-likelihood

Loss for functional responses: Integrate loss ρ(y , f (x))(t) over T .


Goal: Minimize the expected loss, the risk, w.r.t. f .

316 / 398
Algorithm: component-wise gradient boosting

I [Step 1:] initialize all parameters, set m = 0


I [Step 2:] (within each iteration m)
I compute the negative partial gradients ui for i = 1, . . . , N of the
empirical risk w.r.t. the predictor f (Xr , t) using the current
estimates of all distribution parameters
I fit each base-learner fr to ui , r = 1, . . . , R
I select the best fitting base-learner fr* and update it with a small
  step-length ν:  f̂^[m] = f̂^[m−1] + ν f̂r*^[m] .
I [Step 3:] unless m > mstop , set m = m + 1, go to Step 2.

→ The final fˆ is a linear combination of base-learner fits.

317 / 398
Algorithm: functional GAMLSS boosting
I [Step 1:] initialize all parameters, set m = 0
I [Step 2:] (within each iteration m)
I for q = 1, ..., Q
  I compute the negative partial gradients ui^(q) for i = 1, . . . , N of the
    empirical risk w.r.t. the predictor f^(q) using the current estimates of
    all distribution parameters
  I fit each base-learner fr^(q) to ui^(q) , r = 1, . . . , R
  I select the best fitting base-learner fr*^(q)
I select the parameter q* with the best fitting base-learner fr*^(q*) and
  update its coefficients with a small step-length ν
I [Step 3:] unless m > mstop set m = m + 1, go to Step 2.

→ Each final fˆ(q) is a linear combination of base-learner fits.

(Mayr et al., 2012; Brockhaus et al., 2017; non-cyclical: Thomas et al, 2017)

318 / 398
Tuning parameters of gradient boosting

I The number of boosting iterations determines the model complexity


for fixed step length and smoothing parameters (chosen for unbiased
selection of base learners; Hofner et al., 2012).
I Choose prediction-optimal stopping iteration mstop by resampling
methods (on the level of curves).
I Model selection by early stopping and stability selection (Shah &
Samworth, 2013).

319 / 398
Summary

Idea:
I iteratively boost the model performance
(≙ reduce expected loss)
I by fitting and evaluating the partial effects (base learners) fr (Xr , t)
component-wise
I using one partial effect at a time to update the model

=⇒ Results in component-wise gradient descent steps.

To account for within-function dependency:


I function specific smooth error functions
I function-wise cross-validation

320 / 398
Comparison

Major differences to penalized likelihood estimation:


− no uncertainty quantification (confidence intervals, ...), or at least not
that easy
+ very flexible in terms of fitting criteria
+ much faster for many complex covariate effects
+ inherent variable selection

321 / 398
mboost

I very efficient stepwise divide-and-conquer algorithm: any loss


function optimized via simple, small LS updates.
I uses compressed design matrices for baselearners (cf. mgcv)
I exploits sparsity of base learner design matrices, where possible
I for tensor product baselearners, uses very efficient array arithmetic
(I. D. Currie et al. 2006)

322 / 398
Generalized Linear Array Models

I Tensor product design matrices B = ⊗_{d=1}^D Bd become huge very
  quickly
I .. but they have lots of repeating structure, by construction
I Idea: use structure for more efficient computations of Bθ and Bᵀ diag(w)B
I never actually compute B, only marginal Bd , d = 1, . . . , D
I perform matrix operations by clever re-dimensioning and successive
operations on marginal Bd
I e.g. D = 2 : (B1 ⊗ B2 )θ = B2 ((B1 Θ)> )> with Θ = [θj k]j,k
I large gains for high D, parameter count
I at least 1 order of magnitude fewer ops for Bθ, 2-3 orders for
  Bᵀ diag(w)B
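A quick numerical check of the standard Kronecker identity (B1 ⊗ B2) vec(Θ) = vec(B2 Θ B1ᵀ) underlying these array computations (a sketch, not the mboost internals):

B1 <- matrix(rnorm(4 * 3), 4, 3)                    # marginal design 1 (n1 x K1)
B2 <- matrix(rnorm(5 * 2), 5, 2)                    # marginal design 2 (n2 x K2)
Theta <- matrix(rnorm(2 * 3), 2, 3)                 # coefficient array (K2 x K1)
all.equal(as.numeric(kronecker(B1, B2) %*% as.numeric(Theta)),
          as.numeric(B2 %*% Theta %*% t(B1)))       # TRUE: the full Kronecker product is never needed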

323 / 398
GAMM Algorithm

refund

Componentwise Gradient Boosting

FDboost
Example: Emotion components data

324 / 398
Implementation in FDboost
I Main fitting function:
FDboost(formula, timeformula, data, ...)
I timeformula
= NULL for scalar-on-function regression,
= ∼ bbs(t) for function-on-function regression

I Some of the base-learners for functional data:


  zβ(t)                              bolsc(z) %O% bbs(t)
  f (z, t)                           bbsc(z) %O% bbs(t)
  z1 z2 β(t)                         bols(z1) %Xc% bols(z2) %O% bbs(t)
  ∫_S x(s)β(s, t)ds                  bsignal(x, s = s) %O% bbs(t)
  x(t)β(t)                           bconcurrent(x, s = s, time = t)
  ∫_{l(t)}^{u(t)} x(s)β(s, t)ds      bhist(x, s = s, time = t, limits = ...)

324 / 398
Example: Emotion components data

Data set from Gentsch et al. 2014, also used in Rügamer et al. 2018
I Main goal: Understand how emotions evolve
I Participants played a gambling game with real money outcome
I Emotions “measured” via EMG (muscle activity in the face)
I Influencing factor appraisals measured via EEG (brain activity)
I Different game situations, a lot of trials

325 / 398
Example: Emotion components data
[Figure: EEG and EMG signals (value in microvolt) over time for a number of trials]
(Rügamer et al. 2018)


Goal: Try to explain
I facial expressions (measured with EMG)
I by brain activity (measured with EEG)
→ Function-on-function-regression
326 / 398
Example: Emotion components data

Model equation:

  yEMG (t) = β0 (t) + xEEG (t)β1 (t) + ε(t)
I One-to-one relation between EEG and EMG → Concurrent effect

  yEMG (t) = β0 (t) + ∫ xEEG (s)β1 (s, t)ds + ε(t)
I Cumulated effect of functional covariates → Linear functional effect

  yEMG (t) = β0 (t) + ∫_0^{t−δ} xEEG (s)β1 (s, t)ds + ε(t)
I EMG can only be influenced by EEG activities in the past
  → Historical effect

327 / 398
Results for more complex model

328 / 398
Example: FDboost call

FDboost(EMG ~ 1 +                                 # smooth functional intercept
          brandomc(id, df = 5) +                  # day-specific functional random intercepts
          bhist(EEG, df = 20),                    # historical effect of EEG on EMG
        timeformula = ~ bbs(t, df = 4),           # expand all effects smoothly over t
        control = boost_control(mstop = 5000,
                                trace = TRUE),
        data = data)

329 / 398
Part VII

Modeling Functional Data: Issues,


Outlook & Advanced Topics

330 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions

Functional Response Regression: Alternative Approach

Clustering Functional Data


Functional regression: Edge Cases, Problems, Pitfalls
Identifiability of Functional Covariate Effects
Phase Variation
Registration Approaches
Scaling up GFAMM Inference
Practical considerations

Functional Regression: Extensions

Functional Response Regression: Alternative Approach

Clustering Functional Data

332 / 398
Identifiability
Functional covariates x(s) often well approximated by truncated
Karhunen-Loève-expansions,

  xi (s) = Σ_l ξi,l φl^X (s) ≈ Σ_{l=1}^M ξi,l φl^X (s).

For a linear functional effect

  fj (x(s))(t) = ∫_S x(s)β(s, t)ds,

β is only identified up to additions of functions

  g : g (·, t) ∈ span({φl^X , l > M}) ∀ t.

For finite grid data, the design matrix is rank-deficient if M < Kj or if the
span of the basis for β in s-direction contains functions orthogonal to
{φl^X , 1 ≤ l ≤ M} using numerical integration.
332 / 398
Identifiability

For g : g (·, t) ∈ span({φl^X , l > M}) ∀ t:

  ∫_S x(s) (g (s, t) + β(s, t)) ds = ∫_S Σ_{l=1}^M ξl φl^X (s) (g (s, t) + β(s, t)) ds
    = Σ_{l=1}^M ξl ( ∫_S φl^X (s)g (s, t)ds + ∫_S φl^X (s)β(s, t)ds )
                       [first integral ≡ 0]
    = ∫_S Σ_{l=1}^M ξl φl^X (s)β(s, t)ds
    = ∫_S x(s)β(s, t)ds,

since φl^X ⊥ φl′^X for l ≠ l′ by definition and therefore g (·, t) ⊥ φl′^X , l′ ≤ M.

333 / 398
Identifiability

I Iff the kernels of penalty and design matrix do not overlap, there is a
unique minimum of the penalty on each hyperplane defined by
coefficient vectors θ representing β(s, t) that yield identically valued
effects fj (x(s))(t).

I The penalized optimization criterion then finds the “smoothest”
  solution (i.e., the argmin of pen(θ) among these solutions) with optimal fit
  to the data.

I If kernels of penalty and x(s) overlap, bad things can happen: e.g.
meaningless linear transformations or constant shifts of estimated
β(s, t)
→ c.f. collinearity in classical models.

=⇒ possibly no valid interpretation of slope, level, sign of β(s, t)

334 / 398
Identifiability: Synthetic Example

[Figure: true coefficient surface β(s, t) ("Truth") and estimates under Ridge, 1st-difference and 2nd-difference penalties, together with the corresponding E(Y (t)) and fitted Ŷ (t)]

335 / 398
Identifiability: Practical recommendations

1. for low-rank functional covariates, use FPC-based effect representations –
   identifiable (and possibly: interpretable) by construction.
2. avoid rank-reducing pre-processing of functional covariates –
   specifically: curve-wise centering, where possible.
3. check diagnostic measures (implemented in refund/FDboost)
4. use penalties with a small kernel: 1st-order differences/derivatives, or
   full-rank penalties (Marra and Wood, 2011), or suitable constraints
   (Scheipl and Greven 2016)

336 / 398
Phase & Amplitude Variation

[Figure: curves exhibiting phase and amplitude variation – curves over time and
warping functions mapping individual time to absolute time.]

(Happ, Scheipl, et al. 2019)

337 / 398
Phase Variation: Functional responses

I ignored phase variation can invalidate even simple summaries like
  functional means
I the models presented so far are well suited for explaining amplitude variation
I modeling phase variability typically requires very flexible effects
  (time-varying, non-linear)
I phase variability of responses left unaccounted for induces
  auto-correlated errors where features/landmarks don't align

338 / 398
Phase Variation: Functional responses

[Figure: DTI data – FA-CCA profiles over CCA tract location t (left) and
centered FA-RCST profiles over RCST location s (right).]

339 / 398
Phase Variation: Functional responses

[Figure: residual auto-correlation Cor(Yi(t) − Ŷi(t)) over CCA tract locations t,
mean FA-CCA by group (MS / control), residual curves Yi(t) − Ŷi(t) and smooth
residual estimates Êi(t).]

340 / 398
Phase Variation: Functional covariates

I phase variation important / relevant? =⇒ register the data, then use both
  phase & amplitude information
I for FPCA: data with phase variation typically show slow eigenvalue
  decay and FPC score distributions with lots of structure
  =⇒ (linear) dimension reduction less successful / informative
I non-linear dimension reduction methods (D. Chen, H.-G. Müller, et al.
2012, e.g.)
I joint phase-amplitude PCA (Tucker et al. 2013; Happ, Scheipl, et al.
2019, e.g.)

341 / 398
Phase & Amplitude Variation

[Figure (repeated): curves exhibiting phase and amplitude variation – curves over
time and warping functions mapping individual time to absolute time.]

(Happ, Scheipl, et al. 2019)

342 / 398
Phase & Amplitude Variation: Registration

Typical procedure:
Decompose xi (t) = (wi ◦ γi )(t) = wi (γi (t))
I warping functions γi : T → T
I map clock time of observed curve xi to common system time of
registered curves
(e.g. growth curves of kids: earlier/later puberty etc.)
I no time jumps: γi continuous

I no time reversals: ∂t γi (t) > 0
I γi (min T ) = min T , γi (max T ) = max T
I registered functions wi (t̃): landmarks like maxima/minima typically
aligned, easier to interpret
I decomposition into horizontal/phase variation γi and
vertical/amplitude variation wi
(Marron et al. 2015, e.g.)

343 / 398
Phase & Amplitude Variation: Registration

I Registration can be ill-posed: e.g., is one function accelerated, or is the
  other one slowed down?
I Registration is difficult: estimate n functions γi from n functions
  xi(t), under weird constraints.
=⇒ theoretically and numerically challenging
I yields very “interesting” math in weird function spaces – many
different approaches
I Registration in practice:
typically needs smooth, non-noisy data on dense grid or in basis
representation to work well

344 / 398
Registration Approaches

I Dynamic Time Warping (K. Wang, Gasser, et al. 1997)
    I constrained to piece-wise linear warping functions
    + fairly fast, powerful algorithms
    − needs an equidistant grid
    − need to choose penalization to avoid over-alignment
I Landmark Registration
    I idea: match recognizable features occurring in all xi(t) to fixed
      timepoints
    I warping functions interpolate between those somehow
    + simple
    − only applicable for simple, "one-shape-fits-all" data
    − landmark choice can be arbitrary (minima? maxima? zero-crossings?)
    − exact landmark localization difficult for noisy data

345 / 398
Registration Approaches
I L2-Distance Based:
    I Criterion L(γi; xi, x0) = ∫ (x0(t) − xi(γi⁻¹(t)))² dt → min over γi
    − pinching problem (if γi too flexible)
    − not really a suitable metric: L(γi; xi, x0) ≠ L(γi; xi ∘ γ0⁻¹, x0 ∘ γ0⁻¹)
    I ignore proportional (amplitude) variation:
      L(γi; xi, x0) = ∫ ( x0(t)/‖x0(t)‖ − xi(γi⁻¹(t))/‖xi(γi⁻¹(t))‖ )² dt
      (J. Ramsay and Silverman 2005)
    + implemented – partially – in fda
I Square Root Velocity Function (Srivastava and Klassen 2016):
    I idea: find a distance metric for equivalence classes of "amplitude
      functions"
    I metric can be computed simply as the L2 distance of the SRVFs:
      SRVF(x)(t) := sgn(x′(t)) √|x′(t)|
    + deep, elegant maths guarantees consistency, well-posedness
    + rigorous definitions of phase, amplitude & respective means,
      generalized FPCA etc.
    + implemented in fdasrvf (Tucker 2017) – see the sketch below
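A minimal sketch (R) of the SRVF transform for curves sampled on a common grid; it only illustrates the transform and the resulting L2 distance between two curves – the actual alignment over warping functions is what fdasrvf provides.

    # SRVF(x)(t) = sgn(x'(t)) * sqrt(|x'(t)|), approximated by finite differences
    srvf <- function(x, t) {
      dx <- diff(x) / diff(t)              # forward-difference derivative
      dx <- c(dx, dx[length(dx)])          # pad to full grid length
      sign(dx) * sqrt(abs(dx))
    }
    t  <- seq(0, 1, length.out = 101)
    x1 <- sin(2 * pi * t)
    x2 <- sin(2 * pi * t^1.2)              # a slightly warped version of x1
    q1 <- srvf(x1, t); q2 <- srvf(x2, t)
    d2 <- (q1 - q2)^2
    sqrt(sum(diff(t) * (d2[-1] + d2[-length(d2)]) / 2))   # L2 distance of the SRVFs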

346 / 398
Registration in practice

I almost all registration methods for multiple curves (not pairwise)


require a template curve as target.
choice of template? well-defined?
I in many cases: one template is not actually enough – functional data
has grouped structure, requires multiple templates.
how many? which ones?
I clear distinction between amplitude and phase variation or can
variation be represented either way?
I consider joint variation in phase and amplitude? or is phase variation
just nuisance?

347 / 398
What’s my n?

I fairly successful approach: rephrase functional response models as


models for scalar function evaluations by shifting all functional
structure into the predictor.
I issue: the likelihood is now for nT data points, but we actually only have n
  observational units
  =⇒ downsampling/upsampling could change inference: smoothing, CI
  widths, selection, etc.!
I so far: not encountered as a huge problem in practice, but must not be ignored

348 / 398
Autocorrelation & Variance Heterogeneity
I fairly successful approach: rephrase functional response models as
  models for scalar function evaluations by shifting all functional
  structure into the predictor.
I issue: intra-functional dependency needs to be modeled (see previous slide)
I issue: GFAMMs are conditional models with an independence
  assumption over yi(tj)|X, i = 1, . . . , n, j = 1, . . . , T
  =⇒ most models will require smooth residual terms Ei(t) to capture
  intra-functional dependencies and variance heterogeneity:
  scales rather terribly....
I marginal-type models computationally preferable, but:
    I not clear how to incorporate into mgcv's computational framework,
      generally
    I hard/impossible to generalize to the non-Gaussian case
    I doable: AR(1) residual structure
    I doable: explicit model for time- and covariate-dependent variance for
      Gaussian data (family = "gaulss", more tomorrow – see the sketch below)
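A minimal sketch (R) of the last point, with simulated long-format data (one row per function evaluation; all names made up): the first formula models the mean (smooth intercept, time-varying effect of a scalar covariate x, curve-specific random intercepts), the second formula models the standard deviation (on mgcv's transformed scale) as a smooth function of t.

    library(mgcv)
    set.seed(1)
    n <- 40; Tg <- 30
    dat <- data.frame(id = factor(rep(1:n, each = Tg)),
                      t  = rep(seq(0, 1, length.out = Tg), n),
                      x  = rep(rnorm(n), each = Tg))
    dat$y <- sin(2 * pi * dat$t) + dat$x * dat$t +
      rnorm(nrow(dat), sd = exp(-1 + dat$t))      # error variance increases in t
    fit <- gam(list(y ~ s(t) + s(t, by = x) + s(id, bs = "re"),  # mean structure
                    ~ s(t)),                                     # scale structure
               family = gaulss(), data = dat)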
349 / 398
What my thesis committee did not get to hear

I approach: rephrase functional response models as models for scalar
  function evaluations by shifting all functional structure into the
  predictor.
+ no presmoothing/preprocessing of functional responses =⇒ honest,
  "unconditional" inference
?? how relevant if data are truly smooth without relevant iid error?
+ reuse/recycle/adapt all the (scalar) things
− computation does not scale well for dense grids, large n
− if data are low-rank representable, yi ≈ Φ ξi (with yi: T × 1, Φ: T × K, ξi: K × 1),
  it is usually preferable to exploit this low-rank structure and do inference on
  basis coefficients instead of function evaluations: nK ≪ nT

350 / 398
Predictive modeling with functional covariates

For predictive accuracy, modern ML algorithms using scalar features derived
from functional data are usually as good as or better than dedicated "functional
data" methods.

I.e., something like dumping

I FPC scores
I FPC scores of (smoothed) derivatives
I locations of maxima / minima / zero-crossings, if relevant
I etc ...

as features into a RF or SVM tends to work really well without needing a
lot of "FDA" know-how or specialized software (see the sketch below).

Caveat: For benchmarking, include all preprocessing (FPC estimation etc.)
in cross-validation to avoid information leakage!
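A minimal sketch (R) of this feature-extraction strategy on simulated data; prcomp is used as a crude stand-in for FPCA on a dense, error-free common grid, and randomForest as the downstream learner.

    library(randomForest)
    set.seed(1)
    n <- 200; grid <- seq(0, 1, length.out = 50)
    curves <- outer(rnorm(n), sin(2 * pi * grid)) +
              outer(rnorm(n), cos(2 * pi * grid))     # n curves on a common grid
    y   <- curves[, 10] + rnorm(n, sd = 0.2)          # toy scalar outcome
    pca <- prcomp(curves, rank. = 5)                  # "FPC" scores as features
    fit <- randomForest(x = data.frame(pca$x), y = y)
    # Caveat from above: for honest benchmarking, re-estimate the PCA within
    # each cross-validation fold instead of once on the full data set.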

351 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions


Regression Models for Multi-Level Functional Data
Regression Models for Multivariate Functions
Regression Models for Probability Density Functions
MFPCA for Phase & Amplitude

Functional Response Regression: Alternative Approach

Clustering Functional Data

352 / 398
Multi-Level (Functional) Data

I data with hierarchical structure: observations in nested groups


students in classes in schools in school districts
I data with crossed structure: observations in (partially) overlapping
groups
product sales in categories and countries
I few levels, enough replications, group level effects interesting:
=⇒ just treat like standard nominal covariates
I many levels, (partially) few replications:
=⇒ (functional) random effects for group levels =⇒ Functional
LMM

352 / 398
Functional Random Effects

I in GFAMM context, simple tensor product basis representation


available:
spline-based random intercepts bg (i) (t)
I problem: typically requires large basis Btr → tensor product basis
becomes hyuuge.
I possible solution: use a more compact FPC basis for Btr:
  bg(i)(t) = Σ_{k=1}^{K} ξgk φk(t)
I problem: how to estimate FPCs φk (t)?
I easy for simple random intercepts: do FPCA of grouped mean
residual curves from estimate without random intercepts.
I general case: coming up now

353 / 398
Functional Linear Mixed Model

      yi(t) = µ(t, Xi) + B_g1(i)(t) + C_g2(i)(t) + Ei(t) + εi(t)

I µ(tij, xi): mean function (conditional on covariates)
I B_g1(i)(t), C_g2(i)(t): mutually independent functional random effects
  for grouping factors g1(i), g2(i); can be (partially) nested or crossed.
I Ei(t): smooth residuals; εi(t): unstructured measurement error with
  variance σε²

      Cov(yi(t), yi′(t′)) = δ^g1_ii′ K^B(t, t′) + δ^g2_ii′ K^C(t, t′) +
                            δ_{i=i′} ( K^E(t, t′) + δ_{t=t′} σε² )

with auto-covariance functions K^Z(t, t′) := Cov(Z(t), Z(t′)) and indicators
δ^g_ii′ := 1 if g(i) = g(i′), 0 else.
(Cederbaum, Pouplier, et al. 2016)
354 / 398
FLMM: Inference

1. estimate mean function µ(t, Xi) under working independence via
   FAMM (Scheipl, Staicu, et al. 2015)
2. estimate auto-covariances from centered data ỹi(t) = yi(t) − µ̂(t, Xi):
   use E(ỹi(t) ỹi′(t′)) = Cov(yi(t), yi′(t′))

   =⇒ ỹi(t) ỹi′(t′) ≈ δ^g1_ii′ K^B(t, t′) + δ^g2_ii′ K^C(t, t′) +
                        δ_{i=i′} ( K^E(t, t′) + δ_{t=t′} σε² )

   =⇒ estimate an additive model for the cross-products of the responses
3. do FPCA of the resulting K̂^Z(t, t′) to get FPCs φ̂^Z_k(t)
4. re-estimate the model using the φ̂^Z_k(t)

355 / 398
FLMM: Covariance Surface Estimation

I generalized additive model for the crossproduct surface ỹi(t) ỹi′(t′):
  K^Z(t, t′) estimated as isotropic tensor product splines over T × T
  (see the sketch below)
I working assumptions: crossproducts independent, homoskedastic (!)
I uses all crossproducts with at least one gj(i) = gj(i′)
  =⇒ quadratic in the no. of data points (!)
I some savings possible: exploit symmetry, discretization
I estimate K̂^Z(t, t′) evaluated on a dense grid; eigendecomposition
  yields φ̂^Z_k(t), ν̂^Z_k
(Cederbaum, Scheipl, et al. 2018)
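A rough sketch (R) of the covariance-smoothing idea for the simplest case of a single smooth residual process (no crossed random effects, no symmetry constraints – the actual approach of Cederbaum et al. is considerably more refined): smooth the off-diagonal crossproducts of centered curves with a tensor product spline and eigendecompose the fitted surface.

    library(mgcv)
    set.seed(1)
    n <- 60; Tg <- 25; tg <- seq(0, 1, length.out = Tg)
    Y  <- outer(rnorm(n), sin(pi * tg)) + outer(rnorm(n), cos(pi * tg)) +
          matrix(rnorm(n * Tg, sd = 0.2), n, Tg)
    Yc <- sweep(Y, 2, colMeans(Y))                         # centered curves ~ y~_i(t)
    pairs <- subset(expand.grid(j = 1:Tg, k = 1:Tg), j != k)  # drop diagonal (sigma^2!)
    idx <- expand.grid(i = 1:n, r = seq_len(nrow(pairs)))
    cp  <- data.frame(t1 = tg[pairs$j[idx$r]], t2 = tg[pairs$k[idx$r]],
                      z  = Yc[cbind(idx$i, pairs$j[idx$r])] *
                           Yc[cbind(idx$i, pairs$k[idx$r])])   # crossproducts
    fit  <- bam(z ~ te(t1, t2, k = c(8, 8)), data = cp)        # smooth K(t, t')
    Khat <- matrix(predict(fit, expand.grid(t1 = tg, t2 = tg)), Tg, Tg)
    eK   <- eigen((Khat + t(Khat)) / 2)                        # symmetrize
    phi_hat <- eK$vectors[, 1:2] / sqrt(tg[2] - tg[1])         # estimated FPCs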

356 / 398
FLMM: Final Model-Reestimation
Alternatives:
a) Simple LMM for
   I responses ỹ = (ỹ1(t11), . . . , ỹn(t_nT_n))ᵀ
   I random effect design matrices Φ^Z = [φ̂^Z_k(ti)] of dimension (Σi Ti) × K^Z
   I random effect covariances diag(ν̂^Z_1, . . . , ν̂^Z_{K^Z})
   Closed-form solutions for the ξ̂^Z_ik (see the sketch below).
b) Re-estimate µ(t, Xi) = Σr fr(Xri, t) simultaneously with the functional
   random effects using the FPC bases φ̂^Z_k(t).
I Option a) is (much) faster and worked well for recovering random
effects in synthetic data.
Option b) will yield more honest uncertainty quantification for
covariate effects in µ.
Both performed much better than direct estimation of spline-based
random effects in simulations.
I procedure extends to more grouping factors & random slopes,
irregular/sparse data
(Cederbaum, Pouplier, et al. 2016; Cederbaum, Scheipl, et al. 2018)
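A minimal sketch (R) of the closed-form score predictions in option a) for a single random-intercept curve: given FPCs evaluated on the curve's grid (columns of Phi), eigenvalues nu and error variance sigma2 from the previous steps (all inputs hypothetical here), the scores are the usual BLUP-type ridge projections.

    # xi_hat = D Phi' (Phi D Phi' + sigma2 I)^{-1} ytilde,  with D = diag(nu)
    blup_scores <- function(ytilde, Phi, nu, sigma2) {
      V <- Phi %*% (nu * t(Phi)) + diag(sigma2, nrow(Phi))   # marginal covariance
      as.vector((nu * t(Phi)) %*% solve(V, ytilde))
    }
    # toy example: two FPCs on a grid of 20 points
    tg  <- seq(0, 1, length.out = 20)
    Phi <- cbind(sqrt(2) * sin(pi * tg), sqrt(2) * cos(pi * tg))
    blup_scores(ytilde = 0.8 * Phi[, 1] - 0.3 * Phi[, 2] + rnorm(20, sd = 0.1),
                Phi = Phi, nu = c(1, 0.5), sigma2 = 0.1^2)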
357 / 398
Multivariate Functional Data

I functional data frequently multivariate: yi(t) = (y^(1)_i(t), . . . , y^(D)_i(t))
  e.g. spatial accelerations for accelerometry, trajectory data over a
  plane, ...
I challenge: model the dependence structure within (y^(1)(t), . . . , y^(D)(t))
I one possible approach based on GFAMM, FLMM, MFPCA
described in the following
I assumptions: identical grids and commensurable measurement scales
for all dimensions

358 / 398
FAMM for each dimension

I Univariate FAMM for each dimension d = 1, ..., D:

      y^(d) = B^(d) θ^(d) + Φ^(d) ξ + ε^(d),

  where y^(d) contains the function evaluations, B^(d) θ^(d) represents the fixed
  effects, Φ^(d) contains evaluations of the random effect FPCs, and the
  unstructured errors are ε^(d) ∼ N(0, σ²_d I)
I scores ξ for the FPCs are the same for all d (!)

359 / 398
Multivariate FAMM

I connect dimensions simply by stacking:

      [ y^(1) ]   [ B^(1)   0    ...    0    ] [ θ^(1) ]   [ Φ^(1) ]       [ ε^(1) ]
      [ y^(2) ]   [   0   B^(2)  ...    0    ] [ θ^(2) ]   [ Φ^(2) ]       [ ε^(2) ]
      [   ⋮   ] = [   ⋮           ⋱     ⋮    ] [   ⋮   ] + [   ⋮   ] ρ  +  [   ⋮   ],
      [ y^(D) ]   [   0     0    ...  B^(D)  ] [ θ^(D) ]   [ Φ^(D) ]       [ ε^(D) ]

  i.e. ȳ = B̄ θ̄ + Φ̄ ρ + ε̄,  with  ε̄ ∼ N( 0, diag(σ²_1, ..., σ²_D) ⊗ I )

I variance heterogeneity over dimensions simply uses weighted fits
  (... or use "gaulss")
I the model is conditional on the estimated multivariate FPCs in Φ̄

360 / 398
Multivariate Functional PCA
We want a multivariate FPC representation y(t) = Σ_{k=1}^{K} ξk φk(t),
where φk(t) = (φ^(1)_k(t), . . . , φ^(D)_k(t))ᵀ with eigenvalues νk.

Assume a finite univariate Karhunen-Loève representation for each y^(d)(t) exists, i.e.

      y^(d)(t) = Σ_{k=1}^{K_d} ρ^(d)_k ψ^(d)_k(t).

Then the MFPC eigenvalues νk correspond to the eigenvalues of

      Z = [ Z^(11) ... Z^(1D) ]
          [   ⋮      ⋱    ⋮   ]    with entries  Z^(dd′)_mn = Cov( ρ^(d)_m, ρ^(d′)_n ).
          [ Z^(D1) ... Z^(DD) ]

→ multivariate eigenvalues from (cross-)covariances of the univariate score vectors

The multivariate FPC components are given by

      φ^(d)_m(t) = Σ_{n=1}^{K_d} [cm]^(d)_n ψ^(d)_n(t),

where [cm]^(d) ∈ R^{K_d} denotes the d-th block of the m-th eigenvector cm of Z.
→ multivariate eigenfunctions from "re-distributing" the univariate eigenfunctions
based on the univariate score cross-covariances

For the MFPC scores:

      ξm = Σ_{d=1}^{D} Σ_{n=1}^{K_d} [cm]^(d)_n ρ^(d)_n.

(Happ and Greven 2018)
361 / 398
MFPCA: Interpretation

I yields a simple vector representation ξ of multivariate functions in a
  multivariate functional basis φ(t).
I Each MFPC φk(t) = (φ^(1)_k(t), . . . , φ^(D)_k(t))ᵀ represents a joint mode
  of variation across all dimensions of the function
=⇒ MFPCA captures patterns of (cross-)covariance between and within
   dimensions

362 / 398
MFPCA: Estimation Algorithm

1. Estimate a univariate functional PCA for each element y^(d) with
   existing algorithms
2. Perform standard (multivariate) PCA of the combined score vectors
   (ρ^(1)_1, . . . , ρ^(1)_{K_1}, . . . , ρ^(D)_1, . . . , ρ^(D)_{K_D})ᵀ
3. Plug the results into the formulae for φm, νm, ξm (see the sketch below)
Implemented in R-package MFPCA (Happ 2018)
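A minimal sketch (R) of steps 2-3, assuming step 1 has already produced, for each dimension d, a score matrix rho[[d]] (n × K_d) and an eigenfunction matrix psi[[d]] (grid points × K_d) from some univariate FPCA; all object names are made up, and the MFPCA package implements this (and far more general settings) properly.

    mfpca_combine <- function(rho, psi, M = 2) {
      R <- do.call(cbind, rho)              # combined score vectors, n x sum(K_d)
      e <- eigen(cov(R))                    # Z: (cross-)covariances of the scores
      blocks <- rep(seq_along(rho), times = sapply(rho, ncol))
      # multivariate eigenfunctions: redistribute the univariate eigenfunctions
      phis <- lapply(seq_along(psi), function(d)
        psi[[d]] %*% e$vectors[blocks == d, 1:M, drop = FALSE])
      scores <- scale(R, scale = FALSE) %*% e$vectors[, 1:M]   # MFPC scores xi_m
      list(values = e$values[1:M], functions = phis, scores = scores)
    }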

363 / 398
Multivariate FAMM

I flexibility for covariate effects equivalent to standard FAMM
I 3-step method:
  1. estimate univariate FLMMs or FAMMs for each component
  2. do MFPCA of the estimated univariate FPCs
  3. fit the multivariate model with MFPCs, e.g. via pffr
I non-parametric estimate of the cross-covariance between components of
  the response via the MFPC representation of the smooth residuals
  Ei(t) = (E^(1)_i(t), . . . , E^(D)_i(t))
I not published yet, Master thesis of Alexander Volkmann.

364 / 398
Density Functions as Outcomes

Problem:
Density functions are positive and must integrate to 1.
How to guarantee this in an additive model where y(t) ≈ Σ_r fr(X, t)?
Solution:
Similar problem as with non-Gaussian responses for LMs, similar solution:
=⇒ transform the response into a more friendly space without
restrictions.

365 / 398
Density Functions as Outcomes

Slightly more mathematically precise:


need to define a proper vector space of PDFs where addition,
multiplication, inner products make sense.

366 / 398
Vector space structure for densities

I Bayes spaces introduced by Egozcue et al. (2006), van den Boogaart et al. (2014)
I To define a Bayes Hilbert space, consider a measurable space (T, A)
  with a finite measure µ and the set B²(µ) of PDFs f with
    I f > 0 µ-almost everywhere
    I f bounded from above and away from 0
    I ∫ log(f)² dµ exists.
I Continuous distributions on T = [0, 1]:
  =⇒ µ = λ, the Lebesgue measure
  Discrete distributions on T = {1, ..., K}:
  =⇒ µ = Σ_{k=1}^{K} δk, the counting measure

367 / 398
Vector space structure for densities

Proposition (adapted from Egozcue et al. 2006, van den Boogaart et al. 2014)

B²(µ) is a vector space with addition ⊕ and scalar multiplication ⊙
defined as follows:
For fi: T → R, i = 1, 2 in B²(µ) and a ∈ R define
I Addition: f1 ⊕ f2 = (f1 · f2) / ∫ f1 · f2 dµ
I Scalar multiplication: a ⊙ f1 = f1^a / ∫ f1^a dµ
and with
I Additive neutral element: 0_B = µ(T)⁻¹ (const.)
I Additive inverse: ⊖f1 = f1⁻¹ / ∫ f1⁻¹ dµ (µ-a.e.)
I Multiplicative neutral element: 1 ∈ R
(see the sketch below)
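A minimal numerical sketch (R) of ⊕ and ⊙ for densities evaluated on a common grid, with trapezoidal integration standing in for the integral w.r.t. µ = λ on [0, 1]:

    trapz <- function(f, s) sum(diff(s) * (head(f, -1) + tail(f, -1)) / 2)
    oplus <- function(f1, f2, s) { g <- f1 * f2; g / trapz(g, s) }   # f1 (+) f2
    odot  <- function(a, f1, s)  { g <- f1^a;   g / trapz(g, s) }    # a  (*) f1
    s  <- seq(0, 1, length.out = 201)
    f1 <- dbeta(s, 2, 5); f2 <- dbeta(s, 5, 2)
    trapz(oplus(f1, f2, s), s)   # integrates to 1 again
    trapz(odot(2, f1, s), s)     # so does 2 (*) f1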

368 / 398
The clr-transformation
Introduced by Aitchison 1986, van den Boogaart et al. 2014

Let L²_0(µ) be the space of square-integrable functions with integral zero
(a sub-vector space of L²(µ)).
The centered log-ratio (clr) transformation B²(µ) → L²_0(µ), f ↦ f̃, is
defined as

      f̃ = log(f) − µ(T)⁻¹ ∫ log(f) dµ

with inverse

      f = exp(f̃) / ∫ exp(f̃) dµ.

It is well-defined, linear and injective!
This means: we can take elements of B²(µ), send them to L²_0(µ), analyse
them there, and map them back to B²(µ) – see the sketch below.
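A minimal sketch (R) of clr and its inverse on a grid (trapz as above; the grid avoids the boundary so that log(f) stays finite):

    trapz   <- function(f, s) sum(diff(s) * (head(f, -1) + tail(f, -1)) / 2)
    clr     <- function(f, s)  log(f) - trapz(log(f), s) / diff(range(s))
    clr_inv <- function(ft, s) exp(ft) / trapz(exp(ft), s)
    s  <- seq(0.001, 0.999, length.out = 201)
    f  <- dbeta(s, 2, 5)
    ft <- clr(f, s)
    trapz(ft, s)                                  # ~ 0: clr(f) lies in L2_0(mu)
    all.equal(clr_inv(ft, s), f / trapz(f, s))    # back-transform recovers f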

369 / 398
An inner product for densities

Let f1, f2 ∈ B²(µ). The B²(µ)-inner product ⟨·, ·⟩_B is defined as

      ⟨f1, f2⟩_B = ⟨f̃1, f̃2⟩ = ∫ f̃1 f̃2 dµ

I inner product properties inherited from L²_0(µ) due to the
  linearity/injectivity of the clr-transformation

370 / 398
Additive functional regression model for PDFs
Formulation in B2 (µ)

I Pairs (y(t), X) with response PDF y: T → R, t ↦ y(t) in B²(µ), and a
  vector of scalar/functional covariates X.
I Formulation of the generic Functional Additive Mixed Model (FAMM), as
  discussed, now in the B²(µ) setting:

⊕-additive model for the PDF

      y(t) = f(X)(t) = ⊕_r fr(X, t) ⊕ ε

I predictor f composed of partial effects fr mapping X into B²(µ)
I error ε ∈ B²(µ) with E[ε] = 0_B

371 / 398
Additive functional regression model for PDFs
Formulation in L20 (µ)

Additive model for the clr-transform

      clr(y)(t) = ỹ(t) = f̃(X)(t) = Σ_r f̃r(X)(t) + ε̃

I predictor f̃ composed of partial effects f̃r mapping X into L²_0(µ)
I error ε̃ ∈ L²_0(µ) with E[ε̃] = 0 (const.)
I the constraint f̃r ∈ L²_0(µ) is very easy to enforce by linear constraints on the
  design matrix: standard sum-to-zero constraints
I implemented in FDboost
(Maier et al. 2019)

372 / 398
Application: relative income within households
Distribution of the income share of the woman
I Economic consequences of gender identity (Bertrand 2014):
  distribution of s = wife income / (wife income + husband income) in the U.S.
I German Socio-Economic Panel (SOEP):
  [Figure: estimated densities f(s) of the income share s earned by the wife,
  East Germany 2016]
I Potentially relevant factors: region, year, children, ...
I Positive probability mass at 0 and 1

373 / 398
The SOEP-Data
German Socio-Economic Panel (SOEP) (Schupp et al. nd):
I wide-ranging representative longitudinal study of private households
  in Germany
I mixed discrete + continuous PDFs estimated per
    I year: from 1984 to 2016
    I geographical region: e.g., south = {Bavaria, Baden-Würt.}
    I child status: age 0−6 / age 7−18 / older or no child in household
Example PDFs:
[Figure: weighted densities of the income share earned by the wife per year
(1984−2016), shown for the south region, child groups 1 and 3.]

374 / 398
Model formulation
I Mixed reference measure µ = δ0 + λ + δ1
I Response PDFs y (s) ∈ B 2 (µ) of income share s ∈ [0, 1]

⊕-additive model for PDF

      y(s) = β0(s) ⊕ βnew(s) ⊕ βregion(s) ⊕ βchild(s) ⊕ g(year)(s) ⊕ gnew(year)(s) ⊕ ε

I joint fitting possible, but separate fitting easier:


unique orthogonal decomposition y = yc ⊕ yd into
I ’continuous’ part with y˜c |{0,1} = 0 and
I ’discrete’ part with y˜d |(0,1) constant.

⇒ corresponds to fitting yc in B 2 (λ) and yd in B 2 (δ0 + δ1 )


375 / 398
West vs East Germany
Effects under the clr-transform vs. effects in B²(µ):
[Figure: intercept density for the old federal states (β0) and for the new federal
states (β0 ⊕ βnew), shown on the clr scale (left) and as PDFs in B²(µ) (right),
over the income share s earned by the wife.]

376 / 398
Phase & Amplitude Variation

[Figure (repeated): curves exhibiting phase and amplitude variation – curves over
time and warping functions mapping individual time to absolute time.]

(Happ, Scheipl, et al. 2019)

377 / 398
3 ideas:
I warping functions are (generalized) cumulative distribution functions
I define FPCA for densities via clr-projection
I do MFPCA of (derivatives of) warping functions and registered
functions to get joint representation of phase and amplitude variation

378 / 398
Warping functions & densities

I warping functions on a dataset are (generalized) cumulative


distribution functions:
I (strictly) increasing
I always go from same minimum to same maximum over same domain.

=⇒ the derivatives ∂t γ(t) are functions that are very much like densities
=⇒ can transfer the B²(µ)-concepts defined above to ∂t γ(t)

379 / 398
FPCA for densities

I previous section defined notions of addition, multiplication, inner


product for densities
I that’s all you need to define (Fréchet) means and (co)variances
=⇒ can develop FPCA for densities in B 2 (µ)
=⇒ computable by clr-projection into L20 (µ)

380 / 398
MFPCA for warping functions and registered
functions

1. perform registration with your favorite method: xi(t) → (γ̂i(t), ŵi(t̃))
2. do FPCA of clr(∂t γ̂i(t)) and of ŵi(t̃)
3. combine the univariate FPCs into multivariate FPCs via MFPCA
=⇒ MFPC scores represent both phase and amplitude variation
(Happ, Scheipl, et al. 2019)
(conjectured: the not-well-definable distinction between phase and amplitude
variation becomes less relevant – just describe the data compactly...)

381 / 398
Phase-Amplitude-MFPCA: Example
[Figure: first two multivariate phase-amplitude FPCs, shown as +/− perturbations
of the mean over time – PC 1 (26.8% of Fréchet variance explained) and PC 2
(15.6%), each decomposed into full variation, phase variation and amplitude
variation.]

(Happ, Scheipl, et al. 2019)

382 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions

Functional Response Regression: Alternative Approach


Wavelet-Based Functional Mixed Models

Clustering Functional Data

383 / 398
Alternatives

Alternatives to the GFAMM approaches differ primarily in one crucial
aspect:
instead of modeling scalar function evaluations, they project the
functional responses into a coefficient space of some known basis and do
the modeling there.

383 / 398
Model

      yi(t) = Σ_{a=1}^{A} xia βa(t) + Σ_{h=1}^{H} Σ_m zihm bhm(t) + ei(t)

I for functional responses yi(t) on common, fine grids over very
  general domains T
I covariates x and grouping factors z associated with very general
  random effects bhm(t) ∼ GP(0, Qh(t, t′))
I i. i. d. functional errors ei(t) ∼ GP(0, S(t, t′)), with generalizations to
  autocorrelation, t-distributed errors, etc. (Zhu, P. Brown, et al. 2011)
I extensions for linear effects of functional covariates and nonlinear
  effects of scalar covariates possible.
(J. S. Morris and R. J. Carroll 2006; J. S. Morris, P. J. Brown, et al. 2008; Meyer et al. 2015;
Zhu, Versace, et al. 2018)

384 / 398
Basis Representation
Use a (near) lossless basis representation yi(t) = Φ(t) y*_i, i.e.,

      Y     =    Y*      Φ
    (N×T)      (N×K)   (K×T)

Project the data into coefficient space:

      Y* = Y Φᵀ (Φ Φᵀ)⁻¹

(use FFT, DWT!)

Model

      y*_ik = Σ_{a=1}^{A} xia β*_ak + Σ_{h=1}^{H} Σ_m zihm b*_hmk + e*_ik,

with βa(t) = Σ_{k=1}^{K} β*_ak φk(t) etc. (see the sketch below)
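A minimal sketch (R) of this projection step, with a B-spline basis standing in for the wavelet/Fourier bases used in the actual wavelet-based approach: project, then check that the back-projection is (near) lossless for smooth curves.

    library(splines)
    set.seed(1)
    n <- 30; Tg <- 200; tg <- seq(0, 1, length.out = Tg)
    Y   <- outer(rnorm(n), sin(2 * pi * tg)) + outer(rnorm(n), tg^2)
    Phi <- t(bs(tg, df = 25, intercept = TRUE))        # K x T basis evaluations
    Ystar <- Y %*% t(Phi) %*% solve(Phi %*% t(Phi))    # Y* = Y Phi' (Phi Phi')^{-1}
    max(abs(Y - Ystar %*% Phi))                        # near-lossless for smooth Y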

385 / 398
Inference

Fully Bayesian inference implemented via purpose-built MH-samplers in


coefficient space.
Model for each coefficient-space dimension y*_k fit separately (!)

Coefficient functions etc. then projected back on T for visualization,


diagnostics, etc.

=⇒ full posterior in function space accessible


=⇒ easy to get simultaneous CIs, local exceedance probabilities, etc....
=⇒ critical mutual independence assumption across K basis dimensions.

386 / 398
Distributional assumptions

Basis coefficients b*_hmk, e*_ik of the random effect and error functions are
assumed multivariate Gaussian, with variants for Laplace and t via scale mixtures
of Gaussians.

Very flexible, non-stationary covariance structures possible since the
covariance is effectively estimated separately for each coefficient-space
dimension.

Basis coefficients β*_ak with shrinkage priors like Laplace, spike-and-slab
(Bayesian LASSO variants).

387 / 398
Pros & Cons

+ vast scope: anything that can be represented in a basis can be


modeled: images, multivariate functions on complex domains, etc.
+ Bayesian flexibility: very complex covariance structures possible
+ scales to very large data:
I basis representation reduces T to K , effort independent of grid size.
I clever marginalization of random effects etc makes MCMC
computations linear in K & (almost) independent of n
I parallelized over k = 1, . . . , K
+ fully Bayesian inference

388 / 398
Pros & Cons

− only applicable to additive error models for continuous functional


data: no counts, binary, ordinal etc.
(workaround requires re-computation of yik? in each MCMC
iteration....)
− requires lossless basis representability of data
− requires common dense observation grid: no sparse or irregular
data
− requires all effects at all levels representable in same basis
− assumes independent basis dimensions: y?k ⊥ y?k 0 ∀ k 6= k 0
− no unified software implementation: undocumented MATLAB
scripts and a closed-source C tool (Windows only).

389 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions

Functional Response Regression: Alternative Approach

Clustering Functional Data

390 / 398
Clustering functions, the easy way

I do FPCA or some other basis representation
I use any standard multivariate clustering method on the basis coefficient
  vectors (see the sketch below)
  =⇒ can use all types of clustering methods: partitioning, hierarchical,
  model-based, ...
I alternative: define any distance measure between functions, use the distance
  matrix as input for hierarchical clustering.
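A minimal sketch (R) of the first route on simulated data, with an eigen-decomposition of the empirical covariance as a crude FPCA on a dense common grid and k-means on the leading scores:

    set.seed(1)
    n <- 100; tg <- seq(0, 1, length.out = 50)
    grp <- rep(1:2, each = n / 2)                       # two true groups
    Y <- outer(ifelse(grp == 1, 1, -1), sin(pi * tg)) +
         outer(rnorm(n, sd = 0.5), cos(pi * tg)) +
         matrix(rnorm(n * length(tg), sd = 0.2), n)
    Yc     <- sweep(Y, 2, colMeans(Y))                  # centered curves
    scores <- Yc %*% eigen(cov(Yc))$vectors[, 1:3]      # leading "FPC" scores
    cl     <- kmeans(scores, centers = 2, nstart = 20)
    table(cl$cluster, grp)                              # recovered grouping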

390 / 398
Clustering functions, more fancy

K-means FPCA (J.-M. Chiou and P.-L. Li 2008)


Initialize: do FPCA on entire data set, run k-means on coefficient vectors.
Iterate:
1. for each of the k clusters, redo FPCA using only data in that cluster
2. project every curve on each of the k sets of FPCs (LOCO estimates!)
3. re-assign each curve to the cluster whose FPCs “predict” it best

391 / 398
Clustering functions, more fancy yet

K-means Alignment (Sangalli et al. 2010)


Initialize: define k registered “template” functions, assign each
observation to one of them.
Iterate:
1. for each of the k clusters, update its “template” functions using only
data in that cluster
2. align all functions in each cluster to the respective templates and
re-assign to cluster based on their amplitude distances
3. re-align all functions in each cluster with the average cluster warping

392 / 398
References I
Bender, A., A. Groll, and F. Scheipl (2018). A generalized additive model approach to time-to-event analysis. In: Statistical
Modelling 18.3-4, pp. 299–321.
Bender, A., F. Scheipl, et al. (2018). Penalized estimation of complex, non-linear exposure-lag-response associations. In:
Biostatistics dkxy003.
Brockhaus, S., A. Fuest, A. Mayr, and S. Greven (2018). Signal regression models for location, scale and shape with an
application to stock returns. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 67.3, pp. 665–686.
Brockhaus, S., M. Melcher, F. Leisch, and S. Greven (2017). Boosting Flexible Functional Regression Models with a High
Number of Functional Historical Effects. In: Statistics and Computing 27.4, pp. 913–926.
Brockhaus, S. and D. Ruegamer (2018). FDboost: Boosting Functional Regression Models. R package version 0.3-2.
Brockhaus, S., F. Scheipl, T. Hothorn, and S. Greven (2015). The functional linear array model. In: Statistical Modelling 15.3,
pp. 279–300.
Brumback, B. and J. Rice (1998). Smoothing spline models for the analysis of nested and crossed samples of curves (with
discussion). In: Journal of the American Statistical Association 93, pp. 961–994.
Cardot, H., F. Ferraty, and P. Sarda (1999). Functional Linear Model. In: Statistics and Probability Letters 45.1, pp. 11–22.
Cardot, H., F. Ferraty, and P. Sarda (2003). Spline estimators for the functional linear model. In: Statistica Sinica 13.3,
pp. 571–592.
Cederbaum, J., M. Pouplier, P. Hoole, and S. Greven (2016). Functional linear mixed models for irregularly or sparsely sampled
data. In: Statistical Modelling 16.1, pp. 67–88.
Cederbaum, J., F. Scheipl, and S. Greven (2018). Fast symmetric additive covariance smoothing. In: Computational Statistics
& Data Analysis 120, pp. 25–41.
Chen, D., H.-G. Müller, et al. (2012). Nonlinear manifold representations for functional data. In: The Annals of Statistics 40.1,
pp. 1–29.
Chen, T. et al. (2019). xgboost: Extreme Gradient Boosting. R package version 0.81.0.1.
https://CRAN.R-project.org/package=xgboost.
Chiou, J. M. and P. L. Li (2007). Functional clustering and identifying substructures of longitudinal data. In: Journal of the
Royal Statistical Society. Series B: Statistical Methodology 69.4, pp. 679–699.
Chiou, J.-M. and P.-L. Li (2008). Correlation-based functional clustering via subspace projection. In: Journal of the American
Statistical Association 103.484, pp. 1684–1692.
Chiou, J., H. Müller, and J. Wang (2004). Functional response models. In: Statistica Sinica 14.3, pp. 675–694.
Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. In: Journal of Statistical Planning and
Inference 147, pp. 1–23.
References II
Currie, I. D., M. Durban, and P. H. Eilers (2006). Generalized linear array models with applications to multidimensional
smoothing. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.2, pp. 259–280.
Currie, I., M. Durban, and P. Eilers (2006). Generalized linear array models with applications to multidimensional smoothing.
In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.2, pp. 259–280.
Descary, M.-H. and V. M. Panaretos (2018). Recovering covariance from functional fragments. In: Biometrika 106.1,
pp. 145–160.
Di, C.-Z., C. Crainiceanu, B. Caffo, and N. Punjabi (2009). Multilevel functional principal component analysis. In: Annals of
Applied Statistics 3.1, pp. 458–488.
Egozcue, J. J., J. L. Díaz-Barrero, and V. Pawlowsky-Glahn (2006). Hilbert space of probability density functions based on
Aitchison geometry. In: Acta Mathematica Sinica 22.4, pp. 1175–1182.
Ferraty, F. and P. Vieu (2006). Nonparametric Functional Data Analysis. Springer Series in Statistics. Springer, New York.
Theory and practice.
Genest, M., J.-C. Masse, and J.-F. Plante. (2017). depth: Nonparametric Depth Functions for Multivariate Analysis.
R package version 2.1-1. https://CRAN.R-project.org/package=depth.
Gentsch, K., D. Grandjean, and K. R. Scherer (2014). Coherence explored between emotion components: Evidence from
event-related potentials and facial electromyography. In: Biological Psychology 98, pp. 70–81.
Goldsmith, J., J. Bobb, et al. (2011). Penalized Functional Regression. In: Journal of Computational and Graphical Statistics
20.4, pp. 830–851.
Goldsmith, J., C. Crainiceanu, B. Caffo, and D. Reich (2012). Longitudinal Penalized Functional Regression for Cognitive
Outcomes on Neuronal Tract Measurements. In: Journal of the Royal Statistical Society: Series C 61.3, pp. 453–469.
Goldsmith, J., M. Wand, and C. Crainiceanu (2011). Functional regression via variational Bayes. In: Electronic Journal of
Statistics 5, p. 572.
Goldsmith, J., F. Scheipl, et al. (2018). refund: Regression with Functional Data. R package version 0.1-17.
https://CRAN.R-project.org/package=refund.
Greenwell, B., B. Boehmke, J. Cunningham, and G. Developers (2019). gbm: Generalized Boosted Regression Models. R
package version 2.1.5. https://CRAN.R-project.org/package=gbm.
Greven, S., C. Crainiceanu, B. Caffo, and D. Reich (2010). Longitudinal Functional Principal Component Analysis. In:
Electronic Journal of Statistics 4, pp. 1022–1054.
Greven, S. and F. Scheipl (2017). A general framework for functional regression modelling. In: Statistical Modelling 17.1-2,
pp. 1–35.
References III
Groll, A., T. Kneib, A. Mayr, and G. Schauberger (2018). On the dependency of soccer scores–a sparse bivariate Poisson model
for the UEFA European football championship 2016. In: Journal of Quantitative Analysis in Sports 14.2, pp. 65–79.
Happ, C. (2018). MFPCA: Multivariate Functional Principal Component Analysis for Data Observed on Different
Dimensional Domains. R package version 1.3-1. https://github.com/ClaraHapp/MFPCA.
Happ, C. and S. Greven (2018). Multivariate functional principal component analysis for data observed on different
(dimensional) domains. In: Journal of the American Statistical Association 113.522, pp. 649–659.
Happ, C., F. Scheipl, A.-A. Gabriel, and S. Greven (2019). A general framework for multivariate functional principal component
analysis of amplitude and phase variation. In: Stat 8.1.
He, G., H. Müller, and J. Wang (2003). “Extending correlation and regression from multivariate to functional data”. In:
Asymptotics in Statistics and Probability. Ed. by M. Puri. VSP International Science Publishers, pp. 301–315.
Hofner, B., L. Boccuto, and M. Göker (2015). Controlling false discoveries in high-dimensional situations: boosting with stability
selection. In: BMC bioinformatics 16.1, p. 144.
Hofner, B., T. Hothorn, T. Kneib, and M. Schmid (2011). A framework for unbiased model selection based on boosting. In:
Journal of Computational and Graphical Statistics 20.4, pp. 956–971.
Hofner, B., A. Mayr, N. Fenske, and M. Schmid (2018). gamboostLSS: Boosting Methods for GAMLSS Models. R package
version 2.0-1. https://CRAN.R-project.org/package=gamboostLSS.
Hofner, B., A. Mayr, N. Robinzonov, and M. Schmid (2014). Model-based boosting in R: a hands-on tutorial using the R
package mboost. In: Computational statistics 29.1-2, pp. 3–35.
Hothorn, T., P. Buehlmann, et al. (2018). mboost: Model-Based Boosting. R package version 2.9-1.
https://CRAN.R-project.org/package=mboost.
Hothorn, T., P. Bühlmann, et al. (2010). Model-based boosting 2.0. In: Journal of Machine Learning Research 11,
pp. 2109–2113.
Hyndman, R. J. and H. L. Shang (2010). Rainbow Plots, Bagplots, and Boxplots for Functional Data. In: Journal of
Computational and Graphical Statistics 19.1, pp. 29–45.
Ivanescu, A., A.-M. Staicu, F. Scheipl, and S. Greven (2015). Penalized function-on-function regression. In: Computational
Statistics 30.2, pp. 539–568.
Jacques, J. and C. Preda (2013). Funclust: A curves clustering method using functional random variables density approximation.
In: Neurocomputing 112, pp. 164–171.
James, G. M. and C. A. Sugar (2003). Clustering for Sparsely Sampled Functional Data. In: Journal of the American Statistical
Association 98.462, pp. 397–408.
References IV
James, G. (2002). Generalized linear models with functional predictors. In: Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 64.3, pp. 411–432.
Lang, S. et al. (2014). Multilevel structured additive regression. In: Statistics and Computing 24.2, pp. 223–238.
Li, Z. and S. N. Wood (2019). Faster model matrix crossproducts for large generalized linear models with discretized covariates.
In: Statistics and Computing, pp. 1–7.
Loève, M. (1978). Probability theory II. Springer.
López-Pintado, S. and J. Romo (2009). On the Concept of Depth for Functional Data. In: Journal of the American Statistical
Association 104.486, pp. 718–734.
Maier, E., A. Stoecker, B. Fitzenberger, and S. Greven (2019). “Flexible Regression for Probability Densities in Bayes Spaces”.
in preparation.
Malfait, N. and J. Ramsay (2003). The historical functional linear model. In: Canadian Journal of Statistics 31.2, pp. 115–128.
Marra, G. and S. N. Wood (2011). Practical variable selection for generalized additive models. In: Computational Statistics &
Data Analysis 55.7, pp. 2372–2387.
Marron, J. S., J. O. Ramsay, L. M. Sangalli, and A. Srivastava (2015). Functional Data Analysis of Amplitude and Phase
Variation. In: Statistical Science 30.4, pp. 468–484.
Maselyne, J. et al. (2014). Validation of a High Frequency Radio Frequency Identification (HF RFID) system for registering
feeding patterns of growing-finishing pigs. In: Computers and Electronics in Agriculture 102, pp. 10–18.
Mayr, A. et al. (2012). Generalized additive models for location, scale and shape for high-dimensional data - a flexible approach
based on boosting. In: Journal of the Royal Statistical Society, Series C - Applied Statistics 61.3, pp. 403–427.
McLean, M. W. et al. (2014). Functional generalized additive models. In: Journal of Computational and Graphical Statistics
23.1, pp. 249–269.
Meinshausen, N. and P. Bühlmann (2010). Stability selection. In: Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 72.4, pp. 417–473.
Meyer, M. J. et al. (2015). Bayesian function-on-function regression for multilevel functional data. In: Biometrics 71.3,
pp. 563–574.
Morris, J. S. (2015). Functional Regression. In: Annual Review of Statistics and Its Application 2.1, pp. 321–359.
Morris, J. S., P. J. Brown, et al. (2008). Bayesian analysis of mass spectrometry proteomic data using wavelet-based functional
mixed models. In: Biometrics 64.2, pp. 479–489.
Morris, J. S. and R. J. Carroll (2006). Wavelet-based functional mixed models. In: Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 68.2, pp. 179–199.
References V
Morris, J. and R. Carroll (2006). Wavelet-based functional mixed models. In: Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 68.2, pp. 179–199.
Müller, H. and F. Yao (2008). Functional additive models. In: Journal of the American Statistical Association 103.484,
pp. 1534–1544.
Wood, S. N. (2019). mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. R package version
1.8-27. https://CRAN.R-project.org/package=mgcv.
Nychka, D. (1988). Confidence intervals for smoothing splines. In: Journal of the American Statistical Association 83,
pp. 1134–1143.
Prchal, L. and P. Sarda (2007). “Spline estimator for functional linear regression with functional response”. unpublished. url:
http://www.math.univ-toulouse.fr/staph/PAPERS/flm_prchal_sarda.pdf.
R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing. Vienna, Austria. http://www.R-project.org/.
Ramsay, J. O., H. Wickham, S. Graves, and G. Hooker (2018). fda: Functional Data Analysis. R package version 2.4.8.
https://CRAN.R-project.org/package=fda.
Ramsay, J. and G. Hooker (2017). Dynamic data analysis. New York: Springer.
Ramsay, J. and B. Silverman (2005). Functional Data Analysis. 2. ed. New York: Springer.
Reiss, P., L. Huang, and M. Mennes (2010). Fast Function-on-Scalar Regression with Penalized Basis Expansions. In: The
International Journal of Biostatistics 6.1, p. 28.
Reiss, P. and T. Ogden (2009). Smoothing parameter selection for a class of semiparametric linear models. In: Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 71.2, pp. 505–523. issn: 1467-9868. doi:
10.1111/j.1467-9868.2008.00695.x. url: http://dx.doi.org/10.1111/j.1467-9868.2008.00695.x.
Reiss, P. T. and M. Xu (2018). Tensor product splines and functional principal components.
https://works.bepress.com/phil_reiss/46.
Reiss, P. and R. Ogden (2007). Functional principal component regression and functional partial least squares. In: Journal of
the American Statistical Association 102.479, pp. 984–996.
Rügamer, D. et al. (2018). Boosting factor-specific functional historical models for the detection of synchronization in
bioelectrical signals. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 67.3, pp. 621–642.
Ruppert, D., R. Carroll, and M. Wand (2003). Semiparametric Regression. Cambridge, UK: Cambridge University Press.
Saefken, B., T. Kneib, C.-S. van Waveren, and S. Greven (2014). A unifying approach to the estimation of the conditional
Akaike information in generalized linear mixed models. In: Electronic Journal of Statistics 8.1, pp. 201–225.
References VI
Sangalli, L. M., P. Secchi, S. Vantini, and V. Vitelli (2010). K-mean alignment for curve clustering. In: Computational Statistics
& Data Analysis 54.5, pp. 1219–1233.
Scheipl, F., J. Gertheiss, and S. Greven (2016). Generalized functional additive mixed models. In: Electronic Journal of
Statistics 10.1, pp. 1455–1492.
Scheipl, F., A.-M. Staicu, and S. Greven (2015). Functional additive mixed models. In: Journal of Computational and Graphical
Statistics 24.2, pp. 477–501.
Scheipl, F. and S. Greven (2016). Identifiability in penalized function-on-function regression models. In: Electronic Journal of
Statistics 10.1, pp. 495–526. url: http://arxiv.org/abs/1506.03627.
Shi, J. Q. and T. Choi (2011). Gaussian process regression analysis for functional data. Chapman and Hall/CRC.
Sørensen, H., J. Goldsmith, and L. M. Sangalli (2013). An introduction with medical applications to functional data analysis.
In: Statistics in Medicine 32.30, pp. 5222–5240.
Srivastava, A. and E. P. Klassen (2016). Functional and shape data analysis. Springer.
Sun, Y. and M. G. Genton (2011). Functional Boxplots. In: Journal of Computational and Graphical Statistics 20.2,
pp. 316–334.
Tucker, J. D. (2017). fdasrvf: Elastic Functional Data Analysis. R package version 1.8.3.
https://CRAN.R-project.org/package=fdasrvf.
Tucker, J. D., W. Wu, and A. Srivastava (2013). Generative models for functional data using phase and amplitude separation.
In: Computational Statistics & Data Analysis 61, pp. 50–66.
Van den Boogaart, K. G., J. J. Egozcue, and V. Pawlowsky-Glahn (2014). Bayes hilbert spaces. In: Australian & New Zealand
Journal of Statistics 56.2, pp. 171–194.
Wang, J.-L., J.-M. Chiou, and H.-G. Müller (2016). Review of Functional Data Analysis. In: Annual Review of Statistics and Its
Application 3.1, pp. 1–41.
Wang, K., T. Gasser, et al. (1997). Alignment of curves by dynamic time warping. In: The annals of Statistics 25.3,
pp. 1251–1276.
Wood, S. N. (2006a). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.
Wood, S. N. (2006b). Low Rank Scale Invariant Tensor Product Smooths for Generalized Additive Mixed Models. In:
Biometrics 62.1.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized
linear models. In: Journal of the Royal Statistical Society (B) 73.1, pp. 3–36.
Wood, S. N. (2012). On p-values for smooth components of an extended generalized additive model. In: Biometrika 100.1,
pp. 221–228.
References VII

Wood, S. N., Y. Goude, and S. Shaw (2015). Generalized additive models for large data sets. In: Journal of the Royal
Statistical Society: Series C (Applied Statistics) 64.1, pp. 139–155.
Wood, S. N., F. Scheipl, and J. Faraway (2012). Straightforward intermediate rank tensor product smoothing in mixed models.
In: Statistics and Computing, pp. 1–20. url: http://dx.doi.org/10.1007/s11222-012-9314-z.
Wood, S. N. and M. Fasiolo (2017). A generalized Fellner-Schall method for smoothing parameter optimization with application
to Tweedie location, scale and shape models. In: Biometrics 73.4, pp. 1071–1081.
Wood, S. N., Z. Li, G. Shaddick, and N. H. Augustin (2017). Generalized additive models for gigadata: modeling the UK black
smoke network daily data. In: Journal of the American Statistical Association 112.519, pp. 1199–1210.
Wood, S. N., N. Pya, and B. Säfken (2016). Smoothing parameter and model selection for general smooth models. In: Journal
of the American Statistical Association 111.516, pp. 1548–1563.
Wood, S. N. and F. Scheipl (2017). gamm4: Generalized Additive Mixed Models using ’mgcv’ and ’lme4’. R package
version 0.2-5. https://CRAN.R-project.org/package=gamm4.
Xiao, L., V. Zipunnikov, D. Ruppert, and C. Crainiceanu (2016). Fast covariance estimation for high-dimensional functional
data. In: Statistics and computing 26.1-2, pp. 409–421.
Yao, F., B. Liu (Authors), H.-G. Mueller, and J.-L. Wang (Coordinators) (2012). PACE: Principal Analysis by Conditional
Expectation, Functional Data Analysis and Empirical Dynamics. MATLAB package version 2.15.
http://anson.ucdavis.edu/~ntyang/PACE/.
Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional Data Analysis for Sparse Longitudinal Data. In: Journal of the
American Statistical Association 100.470, pp. 577–590.
Zhu, H., P. Brown, and J. Morris (2011). Robust, Adaptive Functional Regression in Functional Mixed Model Framework. In:
Journal of the American Statistical Association 106.495, pp. 1167–1179.
Zhu, H., F. Versace, et al. (2018). Robust and Gaussian spatial functional regression models for analysis of event-related
potentials. In: NeuroImage 181, pp. 501–512.
