
Regression & other methods for Functional Data

Fabian Scheipl
Institut für Statistik
Ludwig-Maximilians-Universität München

adidas - March 2019


Who is he?

I lecturer at LMU Munich


I MCML working group “Functional Data Analysis”
I maintainer CRAN task view “Functional Data Analysis”
I contributor: refund, mboost, lme4, gamm4, ...
I Github: fabian-s

2 / 398
Who are you?

3 / 398
Credits

Slides and material in the following based, amongst others, on slides and
figures by:

Sarah Brockhaus, Jona Cederbaum, Sonja Greven, Clara Happ, Torsten


Hothorn, Thomas Kneib, Eva Maier, David Rügamer, Lisa Steyer, Almond
Stöcker, Alexander Volkmann

(All errors & omissions mine, obviously....)

4 / 398
Global Outline

I. Background: Functional Data


II. Background: Regression
III. Functional Principal Component Analysis
IV. Background: Boosting
V. Functional Regression Models: Theory
VI. Functional Regression Models: Implementation
VII. Modeling Functional Data: Issues, Outlook & Advanced Topics
VIII. Interlude: Exploratory FDA with tidyfun
IX. Case Studies: DTI & Multiple Sclerosis
X. Case Studies: Canadian Weather
XI. Case Studies: EEG Vigilance & Depression

5 / 398
Part I

Background: Functional Data

6 / 398
Introduction

Descriptive Statistics for Functional Data

Basis Representation of Functional Data

Summary
Introduction
Overview
From high-dimensional to functional data

Descriptive Statistics for Functional Data

Basis Representation of Functional Data

Summary

8 / 398
Introduction
Overview
Examples of functional data: Berkeley growth study

[Figure: Berkeley growth study — height (cm) vs. age (years) for a sample of growth curves.]
8 / 398
Introduction
Overview
Examples of functional data: Handwriting

[Figure: handwriting data — pen position y(t) plotted against x(t).]

9 / 398
Introduction
Overview
Examples of functional data: Brain scan images

10 / 398
Introduction
Overview
Characteristics of functional data:
[Figure: Berkeley growth curves (height vs. age) and handwriting trajectories (y(t) vs. x(t)), repeated from above.]

I Several measurements for the same statistical unit, often over time
I Sampling grid is not necessarily equally spaced, sparse data
I Smooth variation that could (in principle) be assessed as often as desired
I Noisy observations
I Many observations of the same data generating process
↔ time series analysis
J. Ramsay and Silverman 2005
11 / 398
Introduction
Overview

Aims of functional data analysis:


I Represent the data → interpolation, smoothing
I Display the data → registration, outlier detection
I Study sources of pattern and variation → functional principal
component analysis, canonical correlation analysis
I Explain variation in a dependent variable by using independent
variable information → functional regression models
I No forecasting / extrapolation ↔ time series analysis

Scalar-on-Function: yi = µ + ∫ xi(s) β(s) ds + ε

Function-on-Scalar: yi(t) = µ(t) + xi β(t) + ε(t)

Function-on-Function: yi(t) = µ(t) + ∫ xi(s) β(s, t) ds + ε(t)

J. Ramsay and Silverman 2005

12 / 398
Introduction
From high-dimensional to functional data

Standard setting in multivariate data analysis:

[Schematic: an n × p data matrix — n observations (rows), p variables (columns).]

I Observations xi = (xi1 , . . . , xip ) for i = 1, . . . , n


I Model complexity increases with p (Curse of Dimensionality )

13 / 398
Introduction
From high-dimensional to functional data
Data with natural ordering:

[Schematic: measurements xi1, xi2, xi3, ..., xip recorded at ordered time points t1 < t2 < t3 < · · · < tp.]

I Longitudinal data
I Ordering along time domain (one-dimensional)
Functional data:

[Schematic: the same measurements xi1, ..., xip viewed as evaluations of a function at t1, ..., tp on the domain T.]

I Basic idea: Model discretely observed data by functions on domain T


14 / 398
Introduction
From high-dimensional to functional data
Functional data:

I Observations xi (t), t ∈ T for i = 1, . . . , n


I Number of observable values xi (t1 ), . . . , xi (tp )
I in theory: p → ∞
I in practice: p < ∞
I Domain T
I Realizations x1 , . . . , xn of X are curves (d = 1), images (d = 2), 3D
arrays (d = 3), etc.

15 / 398
Introduction

Descriptive Statistics for Functional Data


Pointwise measures
Covariance and Correlation Functions

Basis Representation of Functional Data

Summary

16 / 398
Descriptive Statistics for Functional Data
Pointwise measures
Example: Growth curves of 54 girls

[Figure: growth curves of 54 girls — height (cm) vs. age (years).]

Summary Statistics:
I Based on observed functions x1 (t), . . . , xn (t)
I Characterize location, variability, dependence between time points, ...
16 / 398
Descriptive Statistics for Functional Data
Pointwise measures
Example: Growth curves of 54 girls
[Figure: raw growth curves (height in cm) and centered curves (deviation from mean height in cm) vs. age (years).]

Sample mean function:
µ̂X(t) = (1/n) Σ_{i=1}^n xi(t)

Centered curves:
xi(t) − µ̂X(t)

I Pointwise calculation for each value t ∈ T
I Analogous to multivariate case 17 / 398
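
A minimal sketch (not from the slides) of how these pointwise summaries can be computed in R for curves stored row-wise in a matrix; the simulated heights below are purely hypothetical stand-ins for the growth data.

set.seed(1)
ages <- seq(1, 18, by = 0.5)
# 54 hypothetical growth curves, one per row, evaluated on a common age grid
X <- t(replicate(54, 80 + 100 * plogis((ages - 11) / 2) + rnorm(length(ages), sd = 2)))

mean_fun   <- colMeans(X)             # pointwise sample mean function mu_hat_X(t)
X_centered <- sweep(X, 2, mean_fun)   # centered curves x_i(t) - mu_hat_X(t)

matplot(ages, t(X_centered), type = "l", lty = 1,
        xlab = "age (years)", ylab = "deviation from mean height (cm)")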
Descriptive Statistics for Functional Data
Pointwise measures
Example: Growth curves of 54 girls
[Figure: pointwise variance (cm²) and standard deviation (cm) of the growth curves vs. age (years).]

Sample variance function:
σ̂²X(t) = (1/(n − 1)) Σ_{i=1}^n (xi(t) − µ̂X(t))²

Standard deviation function:
σ̂X(t) = √(σ̂²X(t))

18 / 398
Descriptive Statistics for Functional Data
Covariance and Correlation Functions

Covariance / Correlation functions:


I Measure dependence between different (time) points s, t ∈ T
I Sample covariance function:

v̂X(s, t) = (1/(n − 1)) Σ_{i=1}^n (xi(s) − µ̂X(s)) · (xi(t) − µ̂X(t))

I Sample correlation function:

ĉX(s, t) = v̂X(s, t) / √( σ̂²X(s) σ̂²X(t) )

19 / 398
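
Continuing the hypothetical matrix X from the sketch above (curves in rows, grid points in columns), the sample covariance and correlation functions are simply cov() and cor() applied to the wide-format data — again a sketch, not code from the slides.

v_hat  <- cov(X)            # v_hat[j, k] = sample covariance of x(t_j) and x(t_k)
c_hat  <- cor(X)            # sample correlation function on the grid
sd_fun <- sqrt(diag(v_hat)) # pointwise standard deviation function
image(ages, ages, c_hat, xlab = "age (years)", ylab = "age (years)",
      main = "sample correlation function")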
Descriptive Statistics for Functional Data
Covariance and Correlation Functions

Example: Growth curves of 54 girls

Sample covariance function

[Figure: perspective and contour plots of the sample covariance function over age (years) × age (years).]

20 / 398
Descriptive Statistics for Functional Data
Covariance and Correlation Functions

Example: Growth curves of 54 girls

Sample correlation function

[Figure: perspective and contour plots of the sample (auto-)correlation function over age (years) × age (years); correlations range from about 0.55 to 1.]

21 / 398
Introduction

Descriptive Statistics for Functional Data

Basis Representation of Functional Data


Regularly and irregularly sampled functional data
Basis functions
Basis representations for functional data
Most popular choices of basis functions
Smoothness and regularization
Other representations of functional data

Summary

22 / 398
Basis Representation of Functional Data
Regularly and irregularly sampled functional data

Example bacterial growth curve:

[Figure: i-th bacterial growth curve xi(t) — y vs. t.]

Observed measurements:

  t1   xi(t1)
  ...  ...
  tp   xi(tp)


22 / 398
Basis Representation of Functional Data
Regularly and irregularly sampled functional data
Example bacterial growth curve:

[Figure: sample of bacterial growth curves — y vs. t.]

Sample of curves x1(t), ..., xN(t), observed measurements in 'wide format':

  t1   x1(t1)   x2(t1)   ...  xN(t1)
  ...  ...      ...           ...
  tp   x1(tp)   x2(tp)   ...  xN(tp)


⇒ Regular functional data:


I functions observed on common grid (often equi-distant)
I simpler case
I to some extent, methods of multivariate statistics can be directly applied
23 / 398
Basis Representation of Functional Data
Regularly and irregularly sampled functional data

Example bacterial growth curve:

[Figure: sample of bacterial growth curves — y vs. t.]

Sample of curves x1(t), ..., xN(t), observed measurements in 'long format':

  t1,1     x1(t1,1)
  ...      ...
  t1,p1    x1(t1,p1)
  ...      ...
  tN,1     xN(tN,1)
  ...      ...
  tN,pN    xN(tN,pN)

⇒ Irregular functional data:
I functions observed on different time points
I sometimes only sparsely sampled
I more difficult, but often given in practice
24 / 398
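
A rough sketch of how the two storage formats might look in R (object and column names are my own, not from the slides): regular data fit naturally into one wide matrix or data frame, irregular data into a long table or a list with one (t, x) block per curve.

# regular data: one column per curve, rows indexed by the common grid
wide <- data.frame(t = ages, t(X))

# irregular data in 'long format': different, curve-specific grids
long <- data.frame(
  id = rep(1:3, times = c(4, 6, 5)),                       # curve index i
  t  = c(sort(runif(4)), sort(runif(6)), sort(runif(5))),  # t_{i,1}, ..., t_{i,p_i}
  x  = rnorm(15)                                           # x_i(t_{i,j})
)
curves <- split(long[, c("t", "x")], long$id)              # one (t, x) block per curve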
Basis Representation of Functional Data
Basis functions

Basis representation: construct functions as a weighted sum of basis functions bk(t), k = 1, ..., K:

f(t) = Σ_{k=1}^K θk bk(t)

with basis coefficients θ1, ..., θK.

[Figure: scaled basis functions θk·bk(t) and their weighted sum f(t).]

25 / 398
Basis Representation of Functional Data
Basis functions

Basis representation: functional shape determined via basis coefficients:

  1    θ1
  2    θ2
  3    θ3
  ...  ...
  K    θK

Function given by f(t) = Σ_{k=1}^K θk bk(t).

[Figure: basis functions θk·bk(t) and the resulting curve f(t).]

26 / 398
Basis Representation of Functional Data
Basis functions

(The following slides repeat the figure above with different example coefficient vectors — all θk = 1, a single coefficient changed to 2, and θk = k — to illustrate how the basis coefficients determine the functional shape, with f(t) = Σ_{k=1}^K θk bk(t) as before.)

26 / 398
Basis Representation of Functional Data
Basis representations for functional data
Basis representation: approximate data with basis functions.

[Figure: observed curve and its basis-function approximation.]

⇒ seek to specify θ̂i,1, ..., θ̂i,K such that

xi(t) ≈ Σ_{k=1}^K θ̂i,k bk(t).

⇒ Popular criterion:
Specify θ̂i,1, ..., θ̂i,K such that the quadratic distance becomes minimal, i.e.

Σ_j ( xi(tj) − Σ_{k=1}^K θi,k bk(tj) )²  →  min over θi,k
27 / 398
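
A possible sketch of this least-squares criterion in R, using a cubic B-spline basis from the splines package; the example curve, the basis dimension K and the knots are arbitrary choices for illustration.

library(splines)
t_obs <- seq(0, 50, length.out = 60)
x_i   <- 1 + 4 * plogis((t_obs - 20) / 4) + rnorm(60, sd = 0.2)  # noisy growth-type curve

B         <- bs(t_obs, df = 10, intercept = TRUE)  # basis evaluations b_k(t_j), K = 10
theta_hat <- coef(lm(x_i ~ B - 1))                 # minimizes sum_j (x_i(t_j) - sum_k theta_k b_k(t_j))^2
x_smooth  <- B %*% theta_hat                       # fitted curve values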
Basis Representation of Functional Data
Basis representations for functional data

Basis representation: sample of curves x1(t), ..., xN(t).

Basis representations of the observed measurements:

  1    θ̂1,1   θ̂2,1   ...  θ̂N,1
  ...  ...     ...          ...
  K    θ̂1,K   θ̂2,K   ...  θ̂N,K

Functional observations represented as xi(t) ≈ Σ_{k=1}^K θ̂i,k bk(t).

28 / 398
Basis Representation of Functional Data
Most popular choices of basis functions

B-spline bases:

[Figure: B-spline fits of degree 1, 2, and 3 to the example growth curve.]

I piece-wise polynomials of degree d
I basis functions consist of (d − 1)-times differentiably connected polynomials
I connection at knots determining the number of basis functions
I cheap to compute & numerically stable
I local support: sparse matrix of basis function evaluations

29 / 398
Basis Representation of Functional Data
Most popular choices of basis functions

Other popular bases:


I Fourier basis: containing harmonics with different frequencies
⇒ periodic functions
I Wavelets:
⇒ for peaked, ragged functions.
I Thin-plate splines
⇒ better theory, also for surfaces.

30 / 398
Basis Representation of Functional Data
Smoothness and regularization

[Figure: basis fit with many knots, over-fitting the noisy observations.]

I how many knots for the basis?
I trade-off between over-fitting and under-fitting

31 / 398
Basis Representation of Functional Data
Smoothness and regularization

Penalization:
I minimize quadratic difference from data
+ a roughness penalty term
Specify θ̂i,1, ..., θ̂i,K to minimize

Σ_{j=1}^p ( xi(tj) − Σ_{k=1}^K θi,k bk(tj) )² + λ · pen(θi)  →  min over θi,k

I with, e.g., a quadratic penalty on second-order differences, i.e.
pen(θi) = Σ_{k=3}^K ((θi,k − θi,k−1) − (θi,k−1 − θi,k−2))²  and λ > 0 a smoothing parameter

32 / 398
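
For this quadratic difference penalty the minimizer has a closed form, θ̂i = (BᵀB + λDᵀD)⁻¹Bᵀxi with D the second-order difference matrix — a sketch continuing the objects B and x_i from the previous example, with λ fixed by hand.

K      <- ncol(B)
D      <- diff(diag(K), differences = 2)   # (K-2) x K matrix of second-order differences
lambda <- 10
theta_pen <- solve(crossprod(B) + lambda * crossprod(D), crossprod(B, x_i))
x_pen     <- B %*% theta_pen               # visibly smoother than the unpenalized fit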
Basis Representation of Functional Data
Smoothness and regularization
[Figure: penalized basis fits with λ = 0 (very wiggly), λ = 1, and λ = 1000 (very smooth).]

I λ is typically estimated from the data, e.g. using cross validation

33 / 398
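
In practice one rarely fixes λ by hand; a sketch using the P-spline smooths in mgcv, where the smoothing parameter is selected automatically (here via REML), again on the toy curve from above.

library(mgcv)
fit_ps <- gam(x_i ~ s(t_obs, bs = "ps", k = 10), method = "REML")
fit_ps$sp                 # estimated smoothing parameter
x_auto <- fitted(fit_ps)  # automatically smoothed curve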
Basis Representation of Functional Data
Other representations of functional data

I Functional principal components: (J.-L. Wang et al. 2016)


I basis representation learned from observed data
I “optimal” (low-dimensional) basis
I more on this later
I Gaussian processes: x(t) ∼ GP(µX(t), σX(t, t′)) (Shi and Choi 2011)
I Gaussianity assumption
I σX(t, t′) from some parametric family
I µX, σX estimated from data
I Differential equations / dynamics: (J. Ramsay and Hooker 2017)
I represent functional data in terms of differential equations describing
their behavior: d/dt x(t) = f(x(t))
I seems very useful for physical systems, motion data etc.
I (available literature uses spline representations internally)

34 / 398
Introduction

Descriptive Statistics for Functional Data

Basis Representation of Functional Data

Summary

35 / 398
Summary
Functional Data:
I Arises in many different contexts and in many applications (curves,
images,...)
I Observation unit represents the full curve, typically discretized, i.e.
observed on a grid
I Important analysis techniques:
I Smoothing and basis representation
I Functional principal component analysis
I Functional regression

Summary Statistics:
I Give insights into location, variability and time dependence in a
sample of curves
I Pointwise calculation, mostly analogous to multivariate case

35 / 398
Summary

Basis representation:
I Different types of raw functional data: regularly and irregularly
sampled
I (Approximate) representation via bases of functions
I ’true functional representation’
I smoothing / vector representation
I Represent a functional datum in terms of a global, fixed, known
dictionary of basis functions and an observation-specific coefficient
vector.
I Different types of basis functions for different purposes / applications
I Obtain desired ’smoothness’ via penalization

36 / 398
Part II

Background: Regression

37 / 398
Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization


Recap: Linear Models
Linear Model: Basics
Inference
Model Diagnostics
R-Implementation: LM

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization

39 / 398
Data & Model

Data:
I (yi, xi1, ..., xip); i = 1, ..., n
I metric target variable y
I metric or categorical covariates x1 , . . . , xp (categorical data in binary
coding)
Model:
I yi = β0 + β1 xi1 + · · · + βp xip + εi ; i = 1, . . . , n
⇒ y = Xβ + ε; X = [1, x1 , . . . , xp ]
I i.i.d. residuals/errors εi ∼ N(0, σ²); i = 1, . . . , n
I estimates ŷi = β̂0 + β̂1 xi1 + · · · + β̂p xip

39 / 398
Interpreting the coefficients

Intercept:
β̂0 : estimated expected value of y if all metric x = 0 and all categorical x are in their reference category.
metric covariates:
β̂m : estimated expected change in y if xm increases by 1 (ceteris
paribus).
categorical covariates: (dummy-/one-hot-encoding)
β̂mc : estimated expected difference in y between observations in
category c and the reference category of xm (ceteris paribus).

40 / 398
Linear Model Estimation

β̂ minimizes the sum of squared errors (OLS estimate):

Σ_{i=1}^n (yi − xiᵀβ)² → min over β,   or equivalently   (y − Xβ)ᵀ(y − Xβ) → min over β

⇒ β̂ = (XᵀX)⁻¹Xᵀy

Estimated error variance:

σ̂²ε = 1/(n − p) Σ_{i=1}^n (yi − xiᵀβ̂)² = 1/(n − p) ε̂ᵀε̂

41 / 398
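
A sketch of these formulas written out with matrix algebra and checked against lm(), on hypothetical simulated data (lm() itself uses a QR decomposition rather than the explicit inverse).

set.seed(1)
n  <- 100; x1 <- runif(n); x2 <- runif(n)
X  <- cbind(1, x1, x2)
y  <- drop(X %*% c(2, 1, -1)) + rnorm(n)

beta_hat   <- solve(crossprod(X), crossprod(X, y))   # (X'X)^{-1} X'y
res        <- y - drop(X %*% beta_hat)
sigma2_hat <- sum(res^2) / (n - ncol(X))             # 1/(n - p) * e'e
all.equal(drop(beta_hat), coef(lm(y ~ x1 + x2)), check.attributes = FALSE)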
Properties of β̂

I unbiased: E(β̂) = β
I Cov(β̂) = σ²(XᵀX)⁻¹
for Gaussian ε:
β̂ ∼ N(β, σ²(XᵀX)⁻¹)

42 / 398
Tests

Possible settings:
1. Testing for significance of a single coefficient:
H0: βj = 0 vs HA: βj ≠ 0
2. Testing for significance of a subvector βt = (βt1, ..., βtr)ᵀ:
H0: βt = 0 vs HA: βt ≠ 0
3. Testing for equality: H0: βj − βr = 0 vs HA: βj − βr ≠ 0
General:
Testing linear hypotheses H0: Cβ = d

43 / 398
Tests
F-Test:
Compare the sum of squared errors (SSE) of the full model with the SSE under the restriction H0:

F = (n − p)/r · (SSE_H0 − SSE)/SSE
  = (Cβ̂ − d)ᵀ [σ̂² C(XᵀX)⁻¹Cᵀ]⁻¹ (Cβ̂ − d) / r   ∼ F(r, n − p) under H0

t-Test:
Test significance of a single coefficient:

t = β̂j / √(Var̂(β̂j))   ∼ t(n − p) under H0

F = t² = β̂j² / Var̂(β̂j)   ∼ F(1, n − p) under H0

44 / 398
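
The general linear hypothesis H0: Cβ = d can be tested by plugging into the F-statistic above; a sketch on the toy data from the previous example, with C and d chosen purely for illustration (testing β1 = β2 = 0).

fit_lm <- lm(y ~ x1 + x2)
C <- rbind(c(0, 1, 0),
           c(0, 0, 1))
d <- c(0, 0)
r <- nrow(C)
V  <- vcov(fit_lm)                    # sigma_hat^2 (X'X)^{-1}
Cb <- C %*% coef(fit_lm) - d
F_stat <- drop(t(Cb) %*% solve(C %*% V %*% t(C)) %*% Cb) / r
pf(F_stat, r, df.residual(fit_lm), lower.tail = FALSE)   # p-value of the F-test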
Residuals in the linear model

Observed residuals ε̂ are typically neither uncorrelated nor of identical variance:

ŷ = Xβ̂ = X(XᵀX)⁻¹Xᵀy = Hy,   with hat matrix H = X(XᵀX)⁻¹Xᵀ

⇒ ε̂ = y − ŷ = (I − H)y
⇒ Cov(ε̂) = σ²(I − H)

45 / 398
Types of Residuals

I ordinary residuals: ε̂ (not independent, no constant variance)
I standardized residuals: ri = ε̂i / (σ̂ √(1 − hii)) (constant variance)
I studentized residuals: ri* = ε̂i / (σ̂(−i) √(1 − hii)):
use for anomaly / outlier detection.
I partial residuals: ε̂xj,i = ε̂i + β̂j xij:
check linearity, additivity.

46 / 398
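
Base R provides these residual types directly; a small sketch, continuing the toy fit fit_lm from above.

e_hat <- resid(fit_lm)       # ordinary residuals
r_std <- rstandard(fit_lm)   # standardized: e_i / (sigma_hat * sqrt(1 - h_ii))
r_stu <- rstudent(fit_lm)    # studentized: leave-one-out estimate of sigma
h     <- hatvalues(fit_lm)   # diagonal elements h_ii of the hat matrix
e_par <- resid(fit_lm) + coef(fit_lm)["x1"] * x1   # partial residuals for x1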
Graphical model checks:

I model structure: ri vs ŷi


I linearity: ε̂xj ,i vs xj
I variance homogeneity: ri vs ŷi , xj
I autocorrelation: ri , ε̂i vs i (i = time, e.g.)

47 / 398
Linear Model in R:

Linear Models in R: lm model specification:


I m <- lm(y ~ x1 + x2, data=XY)
interactions:
I lm(y ~ x1*x2) equivalent to lm(y ~ x1 + x2 + x1:x2)
methods for lm-objects:
I summary(),anova(),fitted(),predict(),resid()
I coef(), confint(), vcov(), influence()
I plot()
etc...

48 / 398
Example: Munich Rents 1999

I data: 3082 apartments


I target: net rent (DM/sqm)
I metric covariates: size, year of construction
I categorical covariates: area (normal/good/best), central heating
(yes/no), bathroom / kitchen fittings (normal/superior)

49 / 398
Model in R
no interaction:
y = β0 + β1 ∗ x1 + β2 ∗ x2.2 + β3 ∗ x2.3
miet1 <- lm(rentsqm ~ size + area)
(beta.miet1 <- coef(miet1))

## (Intercept) size areagood areabest


## 18.2429185 -0.0715132 0.9059416 3.4196824

with interaction:
y = β0 + β1 ∗ x1 + β2 ∗ x2.2 + β3 ∗ x2.3 + β4 x1 x2.2 + β5 x1 x2.3
miet2 <- lm(rentsqm ~ size * area)
(beta.miet2 <- coef(miet2))

## (Intercept) size areagood areabest


## 18.67890804 -0.07817872 0.11145940 0.87292650
## size:areagood size:areabest
## 0.01182596 0.03302475
Model Visualisation

[Figure: fitted regression lines for miet1 (no interaction, parallel lines) and miet2 (interaction) — net rent (DM/sqm) vs. size (sqm), by area (normal/good/best).]

Tests

anova(update(miet2, . ~ -.), miet2)

## Analysis of Variance Table


##
## Model 1: rentsqm ~ 1
## Model 2: rentsqm ~ size * area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3081 69521
## 2 3076 60064 5 9457.3 96.866 < 2.2e-16 ***
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Tests
round(summary(miet2)$coefficients, 3)

## Estimate Std. Error t value Pr(>|t|)


## (Intercept) 18.679 0.329 56.736 0.000
## size -0.078 0.005 -16.376 0.000
## areagood 0.111 0.494 0.226 0.821
## areabest 0.873 1.542 0.566 0.571
## size:areagood 0.012 0.007 1.716 0.086
## size:areabest 0.033 0.018 1.797 0.072

round(anova(miet2), 3)

## Analysis of Variance Table


##
## Response: rentsqm
## Df Sum Sq Mean Sq F value Pr(>F)
## size 1 8071 8071.3 413.346 <2e-16 ***
## area 2 1284 641.9 32.875 <2e-16 ***
## size:area 2 102 51.1 2.617 0.073 .
## Residuals 3076 60064 19.5
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Model Comparison

anova(miet1, miet2)

## Analysis of Variance Table


##
## Model 1: rentsqm ~ size + area
## Model 2: rentsqm ~ size * area
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3078 60166
## 2 3076 60064 2 102.19 2.6168 0.0732 .
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

complex hypotheses / multiple testing: package multcomp


Model diagnostics: plot.lm()
par(mfrow = c(2, 2))
plot(miet2)

[Figure: the four standard lm diagnostic plots for miet2 — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance contours).]
Model criticism: Linear effect of size?
[Figure: residuals vs. size for the rent model, suggesting a possibly non-linear effect of size.]
Model criticism: Linear effect of size?
plot(size, res_size)
points(sort(unique(size)), tapply(res_size, size, mean), col = "red")
[Figure: scatterplot of residuals vs. size with pointwise means (red), showing systematic deviations from a linear effect.]
Alternative Representation of linear models

I y = Xβ + ε
I Gaussian errors: ε ∼ N(0, σ 2 I)
⇒ y ∼ N(Xβ, σ 2 I)
⇒ E(y) = Xβ; Var(y) = σ 2 I

58 / 398
Recap: Linear Models

Recap: Generalized Linear Models


Motivation
GLMs: The General Approach
Inference

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization

59 / 398
Binary Target: Naive Approach

Data:
I binary target y (0 or 1)
I metric and/or categorical x1 , . . . , xp
naive estimates:
ŷi = β̂0 + β̂1 xi1 + · · · + β̂p xip
I ŷi not binary
I could try to interpret ŷi as P̂(yi = 1)
I no variance homogeneity
I ŷi < 0 ? ŷi > 1? ⇒ ŷi must be between 0 and 1
Idea:
P̂(yi = 1) = h(xiᵀβ̂) with h: (−∞, +∞) → [0, 1]

59 / 398
Binary Target: GLM Approach

I yi ∼ B(1, πi)
I model for E(yi) = P(yi = 1) = πi
I use response function h: π̂i = h(xiᵀβ̂)
or link function g: g(π̂i) = xiᵀβ̂, where g = h⁻¹

Logit model:

π̂i = h(xiᵀβ̂) = exp(xiᵀβ̂) / (1 + exp(xiᵀβ̂))

60 / 398
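
The logit response and link functions are available in base R as plogis() and qlogis(), and the same pair sits inside the binomial() family object — a small sketch, not from the slides.

eta    <- seq(-4, 4, by = 0.5)     # linear predictor values x'beta
pi_hat <- plogis(eta)              # h(eta) = exp(eta) / (1 + exp(eta))
all.equal(qlogis(pi_hat), eta)     # g = h^{-1}: the logit log(pi / (1 - pi))
binomial()$linkfun(0.5)            # link function used by glm(family = binomial())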
Binary Target: Coefficients of the Logitmodel

πi = exp(xiᵀβ) / (1 + exp(xiᵀβ))   ⇔   log( πi / (1 − πi) ) = xiᵀβ

⇔   πi / (1 − πi) = exp(β0) exp(β1 xi1) ... exp(βp xip)

I linear model for the log-odds (logits)

⇒ exp(β̂r) is the factor by which the odds π̂i / (1 − π̂i) change if xir increases by 1.

exp( (x − x̃)ᵀβ̂ ) = [P̂(y = 1|x)/P̂(y = 0|x)] / [P̂(y = 1|x̃)/P̂(y = 0|x̃)]

is the odds ratio between two observations with covariates x and x̃.

61 / 398
Binary Target: Probit- & cloglog-Models

Probit model:
use the standard Gaussian CDF Φ as response function:

π̂i = Φ(xiᵀβ̂)

cloglog model:
response function:

π̂i = 1 − exp(−exp(xiᵀβ̂))

62 / 398
Binary Targets: Expectation and Variance

I no direct connection between expectation (x> β) and variance (σ 2 ) in


linear model
I for binary y ∼ B(1, π):
E(y ) = π = P(y = 1) determines Var(y ) = π(1 − π)
Overdispersion:
observed variability greater than theory assumes:
I unobserved heterogeneity
I positively correlated observations
Solution: add dispersion φ : Var(y ) = φπ(1 − π)

63 / 398
Example: Patent Injunctions

I Data: 4832 European patents (European Patent Office)


I Target: patent injunctions (yes/no)
I covariates (metric):
I year of patent (0=1980)
I citations (azit)
I scope (no. of countries; aland)
I patent claims (ansp)
I covariates (categorical):
I sector (Biotech&Pharma, IT&Semiconductor) (branche)
I US patent (uszw)
I patent holder origin (US/D, CH, GB/others; (herkunft))

64 / 398
Binary Target: R-Implementation

## The following objects are masked from patent (pos = 4):


##
## aland, ansp, azit, branche, einspruch, herkunft,
## jahr, uszw

pat1 <- glm(einspruch ~ ., data = patent, family = binomial())


round(summary(pat1)$coefficients, 3)

## Estimate Std. Error z value Pr(>|z|)


## (Intercept) -0.771 0.134 -5.765 0.000
## uszwUSPatent -0.392 0.068 -5.795 0.000
## jahr -0.071 0.009 -8.194 0.000
## azit 0.118 0.014 8.297 0.000
## aland 0.084 0.011 7.915 0.000
## ansp 0.018 0.003 5.219 0.000
## brancheBioPharma 0.681 0.084 8.128 0.000
## herkunftD/CH/GB 0.323 0.083 3.897 0.000
## herkunftUS -0.152 0.076 -2.002 0.045
Binary Target: R-Implementation
round(exp(cbind(coef(pat1), confint(pat1))), 3)

## Waiting for profiling to be done...

## 2.5 % 97.5 %
## (Intercept) 0.462 0.355 0.601
## uszwUSPatent 0.676 0.592 0.772
## jahr 0.931 0.915 0.947
## azit 1.125 1.095 1.157
## aland 1.088 1.066 1.111
## ansp 1.018 1.011 1.025
## brancheBioPharma 1.975 1.676 2.328
## herkunftD/CH/GB 1.381 1.174 1.625
## herkunftUS 0.859 0.741 0.997

table(einspruch, estimated = round(fitted(pat1)))

## estimated
## einspruch 0 1
## nein 2223 624
## ja 925 1094
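
Returning to the overdispersion remark above: one common option in practice is the quasibinomial family, which estimates the dispersion φ instead of fixing it at 1 — a sketch reusing the patent model formula (pat_quasi is my own object name, not from the slides).

pat_quasi <- glm(einspruch ~ ., data = patent, family = quasibinomial())
summary(pat_quasi)$dispersion   # estimated phi in Var(y) = phi * pi * (1 - pi)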
Count Data as Targets

Data:
I positive, whole-number target y (counts, frequencies)
I metric and/or categorical x1 , . . . , xp
⇒ naive estimates Ê(yi) = xiᵀβ̂ could become negative
⇒ model log(Ê(yi)) instead, i.e.,

Ê(yi) = exp(xiᵀβ̂) = exp(β̂0) exp(β̂1 xi1) ... exp(β̂p xip)

⇒ exponential-multiplicative covariate effects on target

67 / 398
Count Data as Targets: log-linear Model

Distributional assumption:
I yi | xi ∼ Po(λi);  λi = exp(xiᵀβ)
⇒ E(yi) = Var(yi) = λi
Overdispersion:
I Frequently Var(yi) ≠ λi:
⇒ more flexible model with dispersion parameter φ:
Var(yi) = φλi
⇒ alternative distributions: Tweedie, Negative Binomial

68 / 398
Example: Patent Citations

pat2 <- glm(azit ~ ., family = poisson, data = patent)


pat3 <- MASS::glm.nb(azit ~ ., data = patent)
AIC(pat2, pat3)

## df AIC
## pat2 9 21021.23
## pat3 10 16341.48

round(cbind(
summary(pat2)$coefficients[2:5, -c(3, 4)],
summary(pat3)$coefficients[2:5, -c(3, 4)]
), 3)

## Estimate Std. Error Estimate Std. Error


## einspruchja 0.442 0.024 0.422 0.046
## uszwUSPatent -0.079 0.024 -0.047 0.046
## jahr -0.070 0.003 -0.079 0.006
## aland -0.026 0.004 -0.029 0.008

⇒ similar estimates, much bigger variability, better fit.


Definition: GLM

I Structural assumption: Connect conditional expectation and linear predictor Xβ via link/response function:

E(yi | xi) = µi = h(xiᵀβ)   ⇔   g(E(yi | xi)) = g(µi) = xiᵀβ

I logit regression: E(yi | xi) = P(yi = 1 | xi) = exp(xiᵀβ) / (1 + exp(xiᵀβ))
I log-linear model: E(yi | xi) = exp(xiᵀβ)
I Distributional assumption: Given independent (xi, yi) with exponential family density f(yi):

⇒ f(yi | θi) = exp( (yi θi − b(θi)) / φ · ωi − c(yi, φ, ωi) );   θi = θ(µi)

I E(yi | xi) = µi = b′(θi) = h(xiᵀβ)
I Var(yi | xi) = φ b″(θi) / ωi;   ωi = ni

⇒ Connect mean structure and variance structure (and higher moments)

70 / 398
Simple Exponential Families

Distribution                θ(µ)              b(θ)              φ

Normal N(µ, σ²)             µ                 θ²/2              σ²
Bernoulli B(1, µ)           log(µ/(1 − µ))    log(1 + exp(θ))   1
Poisson Po(µ)               log(µ)            exp(θ)            1
Gamma G(µ, ν)               −1/µ              −log(−θ)          1/ν
Inverse Gauss IG(µ, σ²)     1/µ²              −√(−2θ)           σ²

71 / 398
Simple Exponential Families

Distribution      E(y) = b′(θ)                 b″(θ)       Var(y) = b″(θ)φ/ω

Normal            µ = θ                        1           σ²/ω
Bernoulli         µ = exp(θ)/(1 + exp(θ))      µ(1 − µ)    µ(1 − µ)/ω
Poisson           µ = exp(θ)                   µ           µ/ω
Gamma             µ = −1/θ                     µ²          µ²/(νω)
Inverse Gauss     µ = 1/√(−2θ)                 µ³          µ³σ²/ω

72 / 398
R-Implementation: glm()

glm(formula, family , data, ...)


I formula: as in lm
I family: specify distribution (binomial, gamma, etc.)
and link function g (µ) = Xβ
(family=binomial(link=’probit’)).

73 / 398
Advantages of GLM-Formulation

I unified approach for a variety of data situations


⇒ unified methodology for
I estimation
I tests
I model choice and diagnostics
⇒ asymptotics
via Maximum Likelihood approach.

74 / 398
Recent Extensions:

GLM idea in combination with ML inference works similarly for many


other non-exponential family distributions, implemented in mgcv:
I t-distribution
I Tweedie
I Beta
I models for ordinal categorical responses
(Wood, Pya & Säfken, 2016)

75 / 398
ML Estimation: Idea

I OLS estimate in the linear model: Σ_{i=1}^n (yi − xiᵀβ)² → min
I density for y: ∏_{i=1}^n f(yi | β, xi) = (2πσ²)^(−n/2) exp( −Σ_{i=1}^n (yi − xiᵀβ)² / (2σ²) )

⇒ the OLS estimate maximizes the joint density of the observed data over the model parameters
⇒ Maximum Likelihood principle:
maximize the (log-)likelihood l(β) = Σ_{i=1}^n log(f(yi | β, xi))

76 / 398
ML Estimation: Procedure

I log-likelihood l(β) = Σ_{i=1}^n log(f(yi | β, xi))

I score function s(β) = ∂l(β)/∂β
I (iterative) solution of s(β) = 0
via Fisher scoring or IWLS

77 / 398
ML Estimation: Fisher-Scoring
I basically the Newton method:

β^(k+1) = β^(k) − ( ∂s(β)/∂βᵀ )⁻¹ s(β)

[Figure: Newton method — successive iterates β^(0), β^(1), β^(2), ... along the score function s(β).]

β
78 / 398
ML Estimation: Fisher-Scoring & IWLS

I basically the Newton method:

β̂^(k+1) = β̂^(k) − ( ∂s(β̂^(k))/∂βᵀ )⁻¹ s(β̂^(k))

I the observed information matrix H(β̂^(k)) = ∂s(β̂^(k))/∂βᵀ is expensive to compute
⇒ use the expected Fisher information F(β) = E(H(β)),
which is very efficiently computable:
represent the update in terms of iteratively re-weighted LS estimation (IWLS) with a
diagonal weight matrix W^(k) and working observations ỹi^(k).

79 / 398
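
A bare-bones sketch of the IWLS idea for logistic regression on toy simulated data; glm() is of course what one would use in practice, this is only meant to make the working observations and weights concrete.

set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1))))

beta <- c(0, 0)
for (k in 1:25) {
  eta  <- drop(X %*% beta)
  mu   <- plogis(eta)
  W    <- mu * (1 - mu)                 # diagonal weights
  z    <- eta + (y - mu) / W            # working observations
  beta <- solve(crossprod(X, W * X), crossprod(X, W * z))   # weighted LS update
}
cbind(IWLS = drop(beta), glm = coef(glm(y ~ X[, 2], family = binomial())))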
Properties of ML Estimators

β̂ML is consistent, efficient, and asymptotically Gaussian:

β̂ML ∼ N(β, F⁻¹(β))   (asymptotically)

80 / 398
Tests
Linear hypotheses H0: Cβ = d vs HA: Cβ ≠ d
Estimate β̃ for β under the restriction H0:
I LR test:
lq = −2( l(β̃) − l(β̂) )

I Wald test:

w = (Cβ̂ − d)ᵀ (C F⁻¹(β̂) Cᵀ)⁻¹ (Cβ̂ − d)

I Score test:
u = s(β̃)ᵀ F⁻¹(β̃) s(β̃)

Under H0, lq, w, u are asymptotically χ²_r distributed, r = rank(C) (no. of restrictions)
⇒ reject H0 if lq, w, u > χ²_r(1 − α).
81 / 398
Tests in R
summary.glm uses √w ∼ N(0, 1) (asymptotically) for H0: βj = 0:
round(summary(pat2)$coefficients[8:9, ], 3)

## Estimate Std. Error z value Pr(>|z|)


## herkunftD/CH/GB -0.236 0.031 -7.524 0.000
## herkunftUS 0.061 0.026 2.358 0.018

anova.glm(..., test=’Chisq’) for LR-Tests:


anova(update(pat2, . ~ . - herkunft), pat2, test = "Chisq")

## Analysis of Deviance Table


##
## Model 1: azit ~ einspruch + uszw + jahr + aland + ansp + branche
## Model 2: azit ~ einspruch + uszw + jahr + aland + ansp + branche + herkunft
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 4859 13954
## 2 4857 13859 2 95.155 < 2.2e-16 ***
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
Model Choice

Which probabilistic model offers best trade-off between fidelity to training


data (more complexity) and parsimony?
⇒ Information criteria:
I Akaike: AIC = −2l(β̂) + 2p → min (AIC())
I Bayes: BIC = −2l(β̂) + log(n)p → min

83 / 398
Model Diagnostics: Residuals

I Pearson residuals: (resid option: type='pearson')
I riP = (yi − µ̂i) / √(v(µ̂i))
I for grouped data approx. N(0, 1)
I deviance residuals (resid default)
I riD = sgn(yi − µ̂i) √( 2(li(yi) − li(µ̂i)) )
I for grouped data approx. N(0, 1) distributed
I partial residuals (type='partial')
I prediction errors yi − ŷi (type='response')

84 / 398
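
A sketch of how these residual types are extracted from the fitted Poisson model pat2 defined above.

r_pearson  <- resid(pat2, type = "pearson")
r_deviance <- resid(pat2)                     # type = "deviance" is the default
r_response <- resid(pat2, type = "response")  # prediction errors y_i - mu_hat_i
r_partial  <- resid(pat2, type = "partial")   # one column per model term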
Model validation: plot.glm()

par(mfrow = c(2, 2))


plot(pat2)

[Figure: glm diagnostic plots for pat2 — Residuals vs Fitted, Normal Q−Q, Scale−Location, and Residuals vs Leverage (with Cook's distance); observations 1871, 2796 and 4743 flagged as conspicuous.]


Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects


Transformation of Covariates
Polynomial Splines

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization

86 / 398
Motivation
[Figure: residuals vs. size for the rent model (repeated from above) — motivation for non-linear effects.]
Motivation
par(mfrow = c(1, 2))
plot(size, res_size)
points(sort(unique(size)), tapply(res_size, size, mean), col = "red")
plot(log(size, base = 2), res_size)
points(log(sort(unique(size)), base = 2), tapply(res_size, size, mean), col = "red")
[Figure: residuals vs. size and vs. log2(size), with pointwise means in red — the effect of size is clearly non-linear.]
Simple Transformation

I linearity is often too restrictive an assumption
I gain flexibility without complex models by using the log or polynomials of x
⇒ replace y = βx + ε
by y = βf(x) + ε;  f(x) = log(x), x³, √x, etc.
I Issues:
I interpretation of β
I choice/selection of f(x)

88 / 398
Polynomial Transformation

I Polynomial Model
I y = f(x) + ε = β0 + β1 x + β2 x² + · · · + βl x^l + ε
I In R: Use poly(x,degree) to avoid collinearity

89 / 398
Polynomial Transformation: Collinearity
x <- seq(0, 1, l = 200)
X <- outer(x, 1:5, "^")
X.c <- poly(x, 5)
round(cor(X), 2)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1.00 0.97 0.92 0.87 0.82
## [2,] 0.97 1.00 0.99 0.96 0.93
## [3,] 0.92 0.99 1.00 0.99 0.97
## [4,] 0.87 0.96 0.99 1.00 0.99
## [5,] 0.82 0.93 0.97 0.99 1.00

round(cor(X.c), 2)

## 1 2 3 4 5
## 1 1 0 0 0 0
## 2 0 1 0 0 0
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 0 0 1

⇒ use orthogonal polynomials


Polynomial Transformation: Synthetic example

x <- seq(0, 1, l = 300)


fx <- function(x) {
sin(2 * (4 * x - 2)) + 2 * exp(-16^2 * (x - 0.5)^2)
}
y <- fx(x) + rnorm(300, sd = .3)
X.c <- poly(x, 15)
m.poly3 <- lm(y ~ X.c[, 1:3])
m.poly7 <- lm(y ~ X.c[, 1:7])
m.poly11 <- lm(y ~ X.c[, 1:11])
m.poly15 <- lm(y ~ X.c)
plot(x, y, pch = 19, col = "grey")
lines(x, fx(x), col = 1, lwd = 2)

91 / 398
Polynomial Transformation: Synthetic example


[Figure: simulated data y (grey points) with the true function f(x) (black line) overlaid.]
92 / 398
Piecewise Polynomials

I polynomial transformations have problems:
I choice of degree (= flexibility)
I oscillations, boundary effects for higher degrees
⇒ piece-wise polynomials:
I decompose the range of x into sub-intervals
I approximate f(x) by a low-degree polynomial in each sub-interval
⇒ removes oscillations, boundary effects

93 / 398
Piecewise Polynomials
[Figure: simulated data with the true f, a global polynomial of degree 15, and 5 piecewise quadratic polynomial fits.]

⇒ f̂(x) for piecewise polynomials is not continuous


Definition: Polynomial Splines

I better piecewise polynomials:
I require continuous differentiability at subinterval boundaries
I formally:
f: [a, b] → R is a polynomial spline of degree l ≥ 0 with knots
a = κ1 < · · · < κm = b if
1. f(x) is (l − 1)-times continuously differentiable
2. f(x) is a polynomial of degree l on each [κj, κj+1)
⇒ the choice of degree l determines the smoothness of the function
⇒ the knot set κ defines the flexibility/complexity of f

95 / 398
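
A sketch of fitting such a cubic polynomial spline to the synthetic example above with lm() and a B-spline design from the splines package; the knot placement is chosen by hand for illustration.

library(splines)
knots_inner <- seq(0.2, 0.8, by = 0.2)        # interior knots on [0, 1]
m_spline <- lm(y ~ bs(x, knots = knots_inner, degree = 3))
plot(x, y, pch = 19, col = "grey")
lines(x, fitted(m_spline), lwd = 2)           # fitted cubic spline
lines(x, fx(x), col = 2, lwd = 2)             # true function for comparison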
Polynomial Splines: Example
[Figure: polynomial spline fits of degree 0, 1, 2, and 3 with 5 + 2 knots each — higher degree yields visually smoother fits.]

96 / 398


Polynomial Splines: Discussion

I Standard: cubic splines:


I visually smooth
I twice continuously differentiable (i.e., curvature well defined)
I knot set:
I size: trade-off between flexibility and overfitting
I positioning: equidistant? quantile-based? domain knowledge?
→ more on this in the context of penalization

97 / 398
Truncated Polynomials

I simplest polynomial splines
I basis representation for degree l and knots κ = (κ1 , . . . , κm ):
f (x) = γ1 + γ2 x + · · · + γl+1 x^l + γl+2 (x − κ2)_+^l + · · · + γl+m−1 (x − κm−1)_+^l
I first l + 1 coefficients determine a global polynomial of degree l
I coefficient of the highest power can change at each knot
⇒ f is of degree l everywhere and continuously differentiable
(see the sketch below)

98 / 398
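A minimal sketch (not from the original slides) of building such a truncated power basis by hand and fitting it as a linear model; the simulated data vectors x and y and the domain [0, 1] are assumed from the preceding figures.

# truncated power basis of degree l = 3 with interior knots kappa_2, ..., kappa_{m-1}
l <- 3
kappa <- seq(0, 1, length.out = 7)[-c(1, 7)]            # interior knots on the assumed domain [0, 1]
X_poly <- outer(x, 0:l, `^`)                            # global polynomial part: 1, x, ..., x^l
X_trunc <- sapply(kappa, function(k) pmax(x - k, 0)^l)  # truncated parts (x - kappa_j)_+^l
B_tp <- cbind(X_poly, X_trunc)                          # truncated power design matrix
m_tp <- lm(y ~ B_tp - 1)                                # fit as an ordinary linear model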
Truncated Polynomials: Example
[Figure: truncated polynomial basis functions, and the basis functions scaled by their estimated coefficients, overlaid on the simulated data]
99 / 398
Truncated Polynomials: Discussion

Numerical disadvantages
I Basis function values can become very large
I strong collinearity of basis functions
⇒ numerically preferable: B-spline basis functions

100 / 398
B-splines: Idea

I B-Spline-basis function is itself a piecewise polynomial, connecting


I (l + 1) polynomial fragments
I of degree l
I (l − 1)-times continuously differentiable at connection points.
⇒ a weighted sum of such basis functions is of degree l and (l − 1)-times continuously differentiable everywhere

101 / 398
B-Splines: Basis Functions
[Figure: B-spline basis functions B(x) of degree l = 0 over knots κ1 , . . . , κ11]
102 / 398


B-Splines: Properties

I local basis: basis functions ≠ 0 only between l + 2 knots


I bounded range
⇒ avoids problems of truncated polynomials
I overlap with 2l adjacent basis functions

103 / 398
(B-)Splines as Linear Models

Model: y = f (x) + ε
How to estimate f (x)?
I define basis functions bk (x), k = 1, . . . , K
I f (x) ≈ Σ_{k=1}^K θk bk (x)
⇒ ŷ = fˆ(x) = Σ_{k=1}^K θ̂k bk (x)
⇒ this is a linear model ŷ = Bθ̂ with design matrix
B = [ b1 (x1 ) . . . bK (x1 )
      ...            ...
      b1 (xn ) . . . bK (xn ) ]
I analogously applicable to GLMs: g (µ̂) = Bθ̂

104 / 398
B-Splines: R-Implementation
bs in splines package creates a B-spline Designmatrix B:
library("splines")
B <- bs(x, df = 12, intercept = T)
m_bspline <- lm(y ~ B - 1)
B_scaled <- t(t(B) * coef(m_bspline))
plot(x, y, pch = 19, cex = .5, col = "grey")
matlines(x, B, lty = 1, col = 1, lwd = 2)


[Figure: simulated data with the unscaled B-spline basis functions drawn by matlines()]
B-Splines: R-Implementation
library("splines")
B <- bs(x, df = 12, intercept = T)
m_bspline <- lm(y ~ B - 1)
B_scaled <- t(t(B) * coef(m_bspline))
plot(x, y, pch = 19, cex = .5, col = "grey")
matlines(x, B, lty = 1, col = scales::alpha(1, .7), lwd = .5)
matlines(x, B_scaled, lty = 1, col = 2, lwd = 2)


[Figure: simulated data with faint unscaled basis functions and the coefficient-scaled basis functions highlighted]
B-Splines: R-Implementation
library("splines")
B <- bs(x, df = 12, intercept = T)
m_bspline <- lm(y ~ B - 1)
B_scaled <- t(t(B) * coef(m_bspline))
plot(x, y, pch = 19, cex = .5, col = "grey")
matlines(x, B, lty = 1, col = scales::alpha(1, .7), lwd = .5)
matlines(x, B_scaled, lty = 1, col = scales::alpha(2, .7), lwd = 1)
lines(x, fitted(m_bspline), lty = 1, col = 3, lwd = 2)


[Figure: simulated data with basis functions, scaled basis functions, and the resulting fit fitted(m_bspline) overlaid]
Splines: Summary

I basis function representation linearizes problem of function


estimation
I dimension of basis controls maximal complexity
I basis type determines properties of function estimate: continuity,
differentiability, periodicity, . . .

108 / 398
Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects


Exemplary Longitudinal Study: Sleep Deprivation
Motivation: From LM to LMM
Advantages of a Mixed Models Representation
Linear Mixed Models
LMM Estimation
Generalized Linear Mixed Models
GLMM Estimation

Recap: Additive Models and Penalization

109 / 398
Example: Sleep Deprivation Data

I laboratory experiment to measure effect of sleep deprivation on


cognitive performance
I 18 subjects, restricted to 3 hours of sleep per night for 10 days
I operationalization of cognitive performance: reaction time

109 / 398
Example: Sleep Deprivation Data

data(sleepstudy, package = "lme4")


summary(sleepstudy)

## Reaction Days Subject


## Min. :194.3 Min. :0.0 308 : 10
## 1st Qu.:255.4 1st Qu.:2.0 309 : 10
## Median :288.7 Median :4.5 310 : 10
## Mean :298.5 Mean :4.5 330 : 10
## 3rd Qu.:336.8 3rd Qu.:7.0 331 : 10
## Max. :466.4 Max. :9.0 332 : 10
## (Other):120

110 / 398
Example: Sleep Deprivation Data

[Figure: average reaction time [ms] vs. days of sleep deprivation, one panel per subject]
Example: Sleep Deprivation Data
Model global trend: Reactionij ≈ β0 + β1 Daysij

m_sleep_global <- lm(Reaction ~ Days, data = sleepstudy)


summary(m_sleep_global)

##
## Call:
## lm(formula = Reaction ~ Days, data = sleepstudy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -110.848 -27.483 1.546 26.142 139.953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 251.405 6.610 38.033 < 2e-16 ***
## Days 10.467 1.238 8.454 9.89e-15 ***
## ---
## Signif. codes:
## 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 47.71 on 178 degrees of freedom
## Multiple R-squared: 0.2865, Adjusted R-squared: 0.2825
## F-statistic: 71.46 on 1 and 178 DF, p-value: 9.894e-15
Example: Sleep Deprivation Data
With estimated global level and trend added:
[Figure: per-subject reaction times with the estimated global level and trend overlaid]
⇒ obviously inappropriate model
Example: Sleep Deprivation Data

I subjects obviously differ in level and trend for reaction time


I idea: model subject-specific levels and trends
Reactionij ≈ β0i + β1i Daysij
# similar: m_sleep_indiv <- lm(Reaction ~ 0 + Subject + Subject:Days)
m_sleep_indiv <- lmList(Reaction ~ Days | Subject, data = sleepstudy)
head(coef(m_sleep_indiv))
## (Intercept) Days
## 308 244.1927 21.764702
## 309 205.0549 2.261785
## 310 203.4842 6.114899
## 330 289.6851 3.008073
## 331 285.7390 5.266019
## 332 264.2516 9.566768
Example: Sleep Deprivation Data
With estimated individual level and trend added:
[Figure: per-subject reaction times with the estimated subject-specific levels and trends overlaid]
⇒ better fit
Motivation: From LM to LMM

I global model yij = β0 + β1 x1ij + εij :


I ignores within-subject correlation
⇒ variability of coefficients underestimated since correlated data
contain less information than independent data
⇒ invalid inference (tests, CIs)
⇒ complete pooling
I subject-specific models yij = β0i + β1i x1ij + εij :
I can be interpreted only with regard to the data in the sample
⇒ no generalization to “typical” subjects / population
I many many parameters to estimate
⇒ estimates may be unstable, imprecise
⇒ no pooling

116 / 398
Motivation: From LM to LMM

alternative representation of subject-specific models:
yij = β̄0 + (β0i − β̄0 ) + β̄1 x1ij + (β1i − β̄1 )x1ij + εij
with means of subject-specific parameters
β̄0 = (1/18) Σ_{i=1}^{18} β0i ;   β̄1 = (1/18) Σ_{i=1}^{18} β1i

117 / 398
Motivation: From LM to LMM
I idea of a random effect model
I β̄ is the population level effect β.
I express subject-specific deviations βi − β̄ as Gaussian random variables bi ∼ N(0, σb²).
I this yields
yij = β0 + β1 x1ij + b0i + b1i x1ij + εij
with εij ∼ N(0, σ²), (b0i , b1i )> ∼ N2 (0, Σ).
I or alternatively:
yij = b0i + b1i x1ij + εij
with εij ∼ N(0, σ²), (b0i , b1i )> ∼ N2 ((β0 , β1 )>, Σ).
(see the lme4 sketch below)

118 / 398
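A minimal sketch (not part of the original slides) fitting exactly this random-intercept-and-slope model to the sleepstudy data with lme4:

library("lme4")
# Reaction_ij = beta_0 + beta_1 Days_ij + b_0i + b_1i Days_ij + eps_ij
m_sleep_lmm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fixef(m_sleep_lmm)     # population-level beta_0, beta_1
VarCorr(m_sleep_lmm)   # estimated Sigma for (b_0i, b_1i) and residual sd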
Partial Pooling

I regression coefficients β0 , β1 , . . . in random effects models retain


their interpretation as population level parameters.
I subject-specific deviations from the population mean are modeled by
random effects – the implicit assumption is that subjects are a
random sample from the population of interest
⇒ partial pooling, with strength of pooling determined by random effect
variance.

119 / 398
Advantages of the Random Effects Approach
I decomposition of random variability in data into
I subject-specific deviations from population mean
I deviation of observations from subject means
⇒ more precise estimates of population trends
I some degree of protection against bias caused by drop-out
I random effects serve as surrogates for effects of unobserved
subject-level covariates
⇒ control for unobserved heterogeneity
I distributional assumption bi ∼ N stabilizes estimates b̂i (shrinkage
effect) compared to fixed subject-specific estimates β̂i without
distributional assumption
I intuition: estimates are stabilized by including prior knowledge in the
model, i.e., assuming that subjects from the population are mostly
similar to each other

120 / 398
Advantages of the Random Effects Approach
I random effects model the correlation structure between observations:
yij = β0 + bi + εij with bi i.i.d. ∼ N(0, σb²); εij i.i.d. ∼ N(0, σε²)
=⇒ Corr(yij , yij' ) = Cov(bi , bi ) / sqrt( Var(yij ) Var(yij' ) ) = σb² / (σb² + σε²).
yij = β0 + b0i + b1i tj + εij with (b0i , b1i )> i.i.d. ∼ N2 (0, Σ); εij i.i.d. ∼ N(0, σε²)
=⇒ Var(yij ) = σb0² + 2σb01 tj + σb1² tj² + σε²
=⇒ Cov(yij , yij' ) = σb0² + σb01 (tj + tj' ) + σb1² tj tj'
I independence between observations on different subjects is retained (for the kind of correlation structure we discuss here).
121 / 398
General Form of Linear Mixed Models

Linear Mixed Model:

y = Xβ + Ub + ε
b ∼ N(0, G)
ε ∼ N(0, R)

I U: design matrix for random effects


I independence between ε and b.
I entries in G, R determined by (co-)variance parameters ϑ
I we’ll focus on independent errors with R = σ 2 I

122 / 398
Conditional and Marginal Perspective

Conditional perspective:

y|b ∼ N(Xβ + Ub, R); b ∼ N(0, G)

Interpretation:
random effects b are subject-specific effects that vary across the
population.
Hierarchical formulation:
expected response is a function of population-level effects (fixed effects)
and subject-level effects (random effects).

123 / 398
Conditional and Marginal Perspective

Marginal perspective:

y ∼ N(Xβ, V) V = Cov(y) = UGU> + R

Interpretation:
random effects b induce a correlation structure in y defined by U and G,
and thereby allow valid analyses of correlated data.
Marginal formulation:
model is concerned with the marginal expectation of y averaged over the
population as a function of population-level effects.

The marginal model is more general than the hierarchical model.

generalized estimating equations: geepack

124 / 398
Linear Mixed Model for Longitudinal Data
For subjects i = 1, . . . , m, each with observations j = 1, . . . , ni :
yij = xij β + uij bi + εij ,   bi ∼ Nq (0, Σ)
⇔ y = Xβ + Ub + ε
with
I y = (y1>, y2>, . . . , ym>)> (n = Σ_{i=1}^m ni entries)
I ε = (ε1>, ε2>, . . . , εm>)> (n entries)
I β = (β0 , β1 , . . . , βp )>
I X = [1 x1 . . . xp ]
I b = (b1 , b2 , . . . , bm ) of length mq, with b ∼ Nmq (0, G)
I G = diag(Σ, . . . , Σ)
I U = diag(U1 , . . . , Um ) with dimension n × mq
I Ui = [1 u1i . . . u(q−1)i ] with dimension ni × q. Variables in Ui are typically a subset of those in X.
125 / 398
Other Types of Mixed Models

I hierarchical/multi-level model:
e.g., test score yijk of a pupil i in class j in school k:
yijk = β0 + xijk>β + b1j + b2k + εijk
with random intercepts for class (b1j ∼ N(0, σ1²)) and school (b2k ∼ N(0, σ2²))
I crossed designs:
e.g., score yij of a subject i on an item j:
yij = β0 + xij>β + b1i + b2j + εij
with random intercepts for subject (b1i ∼ N(0, σ1²), subject ability) and item (b2j ∼ N(0, σ2²), item difficulty)
(lme4 formulas for both designs are sketched below)

126 / 398
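A sketch (not from the original slides) of how these two designs would be specified in lme4; all variable names and the simulated data are hypothetical and only there to make the calls runnable (with pure-noise responses, expect boundary/singular-fit messages).

library("lme4")
# hierarchical: 200 pupils in 40 classes nested in 10 schools
pupils <- data.frame(
  score = rnorm(200), x = rnorm(200),
  school = factor(rep(1:10, each = 20)),
  class = factor(rep(1:40, each = 5))     # class labels unique across schools
)
m_nested <- lmer(score ~ x + (1 | school) + (1 | class), data = pupils)
# crossed: every subject answers every item
items <- expand.grid(subject = factor(1:30), item = factor(1:20))
items$score <- rnorm(nrow(items))
m_crossed <- lmer(score ~ 1 + (1 | subject) + (1 | item), data = items)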
Likelihood-Based Estimation of Linear Mixed Models
ML-Estimation
I determine ϑ̂ML so that the profile likelihood in V of the marginal model is maximal:
y ∼ N(Xβ, V(ϑ))
l(β, ϑ) = −(1/2){ log |V(ϑ)| + (y − Xβ)>V(ϑ)−1 (y − Xβ) }
β̂(ϑ) = arg max_β l(β, ϑ) = (X>V(ϑ)−1 X)−1 X>V(ϑ)−1 y
lP (ϑ) = −(1/2){ log |V(ϑ)| + (y − Xβ̂(ϑ))>V(ϑ)−1 (y − Xβ̂(ϑ)) } → max_ϑ
I for given ϑ, closed form solutions for β̂ and b̂.
I simple generalized least squares: b̂(ϑ̂) = GZ>V(ϑ̂)−1 (y − Xβ̂(ϑ̂)).
I Ĉov(β̂) and Ĉov(b̂) computable for tests, CIs.
127 / 398
Likelihood-Based Estimation of Linear Mixed Models
REML estimation:
I ML-estimates ϑ̂ are biased; unbiased variance component estimates from the marginal (“restricted”) likelihood of ϑ:
lR (ϑ) = log( ∫ L(β, ϑ) dβ ) ∝ lP (ϑ) − (1/2) log |X>V−1 X| → max_ϑ
I closed form solutions for β̂ and b̂ and their covariances given ϑ still apply
I both are tricky optimization problems:
I positivity constraints for most entries in ϑ
I computationally expensive, numerically unstable log-determinants
I SOTA implementation for large sub-class: mgcv
(REML vs. ML in lme4 sketched below)

128 / 398
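A small sketch (not from the original slides) of the ML/REML switch in lme4, reusing m_sleep_lmm from the sketch a few slides back; lmer() uses REML by default.

m_reml <- m_sleep_lmm                          # REML fit (lmer default)
m_ml <- update(m_sleep_lmm, REML = FALSE)      # ML fit: variance components tend to be too small
VarCorr(m_reml)
VarCorr(m_ml)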
Generalized Linear Mixed Models

I GLM generalizes LM via addition of a link function


I mapping the linear predictor to a range appropriate for the response
distribution,
I and linking the variance to the expected value in a way appropriate for
the response distribution.
I carries over directly for a generalized linear mixed model (GLMM):
E(y|b) = h(Xβ + Ub)
with known response function h()
I BUT: estimation much harder problem than for LMMs or GLMs,
especially for binary responses (more later).
I BUT: GLMMs can only be interpreted in the conditional/hierarchical
perspective. Use GEEs for marginal models.

129 / 398
Generalized Linear Mixed Models

Model:

y|b : yi |b ∼ Expo.fam.(E(yi |b) = h(Xβ + Zb), φ)


b|ϑ : b|ϑ ∼ N(0, G(ϑ))

130 / 398
Caveat: Effect Attenuation in GLMMs
[Figure: conditional subject-specific response curves h(xi β + b0i ) and the attenuated marginal curve, for an LMM (left) and a logit-GLMM (right)]
For random intercept logit-models: βmar ≈ βcond / sqrt(1 + 0.346 σb²)
131 / 398
GLMM Estimation
LMM estimation exploits the analytically accessible marginal likelihood:
L(β, ϑ, φ) = ∫ L(b, β, φ, ϑ) db
is the density of
y|β, φ, ϑ ∼ N(Xβ, ZG(ϑ)Z> + R(φ, ϑ)).
For GLMMs:
L(β, ϑ, φ) = ∫ ( Π_{i=1}^n f (yi |β, φ, b, ϑ) ) f (b|ϑ) db
(... sucks: no closed form, a high-dimensional integral over b)

132 / 398
GLMM Estimation Algorithms

I Laplace approximation based: iterate
1. Compute b̂ = arg max_b L(β, φ, ϑ, b) for given β, φ, ϑ via a penalized IWLS algorithm (P-IRLS).
2. Maximize a Laplace approximation L̃(β, φ, ϑ) of L(β, φ, ϑ) in b̂ (numerically, typically gradient based)
(mgcv, with lots of tricksy tricks; lme4 for large b)
I (Gaussian) quadrature based methods: more accurate, much slower (lme4, gamm4)
I penalized quasi likelihood: replace GLMM by LMM with IWLS working responses and weights. Biased, not guaranteed to converge, fairly fast. (nlme, mgcv::gamm)
I do (full) Bayes: flexible choice of effect distributions, hyperpriors, likelihoods; very slow (Stan: rstanarm, brms)
(see the glmer sketch below)

133 / 398
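A minimal sketch (not from the original slides) of the corresponding calls for a binary-response random-intercept model; the data frame dat is simulated here just so the calls run.

library("lme4")
set.seed(1)
dat <- data.frame(subject = factor(rep(1:20, each = 10)), x = rnorm(200))
dat$y <- rbinom(200, 1, plogis(0.5 * dat$x + rnorm(20)[dat$subject]))
# Laplace approximation (glmer default, nAGQ = 1):
m_laplace <- glmer(y ~ x + (1 | subject), family = binomial, data = dat)
# adaptive Gauss-Hermite quadrature with more nodes (single scalar random effect only):
m_agq <- glmer(y ~ x + (1 | subject), family = binomial, data = dat, nAGQ = 10)
# penalized quasi-likelihood would be e.g.:
# MASS::glmmPQL(y ~ x, random = ~ 1 | subject, family = binomial, data = dat)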
Mixed Models in a Nutshell
I standard regression models can model only the structure of the
expected values of the response
I mixed models are regression models in which a subset of coefficients
are assumed to be random unknown quantities from a known
distribution instead of fixed unknowns, and this means we can
I model the covariance structure of the data (marginal perspective)
I estimate (a large number of) subject-level coefficients without too
much trouble (conditional perspective)
I random intercepts can be used to model subject-specific differences in
the level of the response
→ grouping variable as a special kind of nominal covariate
I a random slope for a covariate is like an interaction between the
grouping variable and that covariate
→ grouping variable as a special kind of effect modifier for that
covariate
I hard estimation problems: variance components difficult to optimize,
often very high-dim. b
134 / 398
Recap: Linear Models

Recap: Generalized Linear Models

Recap: Non-Linear Effects

Recap: Mixed Models and Random Effects

Recap: Additive Models and Penalization


Penalization: Controlling smoothness
Smoothing Parameter Optimization
Generalized Additive Models
Surface Estimation
Varying coefficients

135 / 398
Splines

I Splines
I piecewise polynomials with smoothness properties at knot locations
I can be embedded into (generalized) linear models (e.g. ML estimates)
I Problem: choice of optimal knot setting.
I Two-fold problem:
I how many knots?
I where to put them?
I two possible solutions, one of them good:
I adaptive knot choice: make no. of knots and their positioning part of
optimization procedure
I penalization: use large number of knots to guarantee sufficient model
capacity, but add a cost (penalty) for wiggliness / complexity to
optimization procedure

135 / 398
Function estimation example: climate reconstruction

[Figure: climate reconstruction data (temperature anomaly vs. year, 0-2000), four panels with different function estimates]

136 / 398
Sensitivity to number of basis functions
[Figure: spline fits to the climate reconstruction data with 5, 10, 20 and 40 basis functions]

137 / 398
Penalized ML-Estimation

I Main idea: Add a penalty term to the likelihood for regularization:
lpen (f ) = l(f ) − λ pen(f ) → max_f .
I l(f ) measures fit of estimated function f (x) = B(x)θ to data
I penalty term pen(f ) ∈ R+_0 measures wiggliness of estimated function
I f smooth ⇒ pen(f ) small.
I f rough ⇒ pen(f ) large
I smoothing parameter λ ∈ R+_0 controls influence of penalty
I λ → 0: unpenalized estimate.
I λ → ∞: maximally smooth estimate (regardless of data)

138 / 398
Penalties
I Frequently used:
pen(f ) = ∫ (f''(z))² dz.
I since second derivative is curvature: “total wiggliness squared”, in some sense.
I minimal for linear functions.
I Less frequently used:
pen(f ) = ∫ (f'(z))² dz.
I since first derivative is slope: “total rate of change squared”, in some sense.
I minimal for constant functions.
I So:
pen(f ) = ∫ (f''(z))² dz or ∫ (f'(z))² dz
I order of derivatives expresses which kind of complexity to measure
139 / 398


Penalties

I penalty term can be simplified to quadratic form
pen(f ) = θ>Pθ,
with P = D>D and differencing matrix D, i.e. for first differences
D = [ −1  1
           −1  1
                ...
                  −1  1 ]  ∈ R^(K−1)×K

140 / 398
Penalized LS-Estimation

I penalized ML-criterion for Gaussian errors equivalent to
(y − Bθ)>(y − Bθ) + λθ>Pθ → min_θ .
I analytic solution: penalized LS-estimator
θ̂ = (B>B + λP)−1 B>y.
I penalized ML-estimate for non-Gaussian errors/targets via penalized IWLS:
θ̂(k+1) = (B>W(k) B + λP)−1 B>W(k) ỹ(k) .
(see the sketch below)

141 / 398
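A minimal sketch (not from the original slides) of this penalized LS estimator, assuming the B-spline design matrix B and the data x, y from the spline slides earlier; the value of lambda is chosen ad hoc.

D <- diff(diag(ncol(B)), differences = 2)       # 2nd-order difference matrix
P <- crossprod(D)                               # penalty matrix P = D'D
lambda <- 10
theta_pen <- solve(crossprod(B) + lambda * P, crossprod(B, y))
f_pen <- B %*% theta_pen                        # penalized fit at the observed x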
Penalization as Prior

I penalty term motivated pragmatically / heuristically


I we may want a probabilistic interpretation
⇒ use a prior distribution concentrated on simple, smooth functions
I pen(f ) = θ>Pθ is the (negative log-)kernel of that prior
I prior: θ ∼ N (0, λ2 P−1 )
I prior density p(θ; λ, P) = c(λ, P) exp(−λθ>Pθ)
I since posterior ∝ likelihood · prior: adding −λ pen(f ) to the log-likelihood is equivalent to using the prior above
I λ is an inverse scale parameter: more diffuse prior as λ → 0 (no effect of prior, no penalization)

142 / 398
Penalization as Prior

Another perspective:
I penalty/prior encodes assumption about likely differences between
coefficients of adjacent basis functions
I random walk prior: e.g., (θj+1 − θj ) ∼ N(0, λ2 )
I obvious question: what about the first d coefficients, for d-th
differences?

143 / 398
Penalization as Prior

I Since f (x) = Bθ and Gaussian distribution is stable w.r.t. linear


functions, we impose a (low-rank) Gaussian process prior on the
function estimate:

f (x) ∼ GP(0, λ2 B(x)P−1 B(x 0 )> )

I not really: P is not full-rank =⇒ P−1 does not exist, use


Moore-Penrose.
I improper prior: infinite variance for functions in null(P)
I interpretation: no regularization of maximally smooth functions
I intuition: polynomial functions up to order (difference order − 1) are estimated unregularized.

144 / 398
Penalization as Prior

I powerful idea: connects probabilistic models and function


approximation heuristics
I very useful for bringing probabilistic “machinery” to bear: penalized
inference equivalent to estimating coefficients that are random
quantities.
I very general idea: can be applied to any penalty that can be written
as quadratic form in coefficients.
I very general idea: can be applied to any type of effect estimation that
can be linearized in terms of a weighted sum of basis functions (over
arbitrary domains, for arbitrary outcomes)
I e.g.: any type of random effect with fixed correlation structure,
Markov Random Fields, and arbitrary combinations of them (cf.
tensor products, later)

145 / 398
Mixed Model Decomposition of Penalized Terms
Decompose a regularized effect into its penalized (“random”) and
unpenalized (“fixed”) components for better numerical stability & direct
applicability of mixed model algorithms:
Let h = rank(P) and p = dim(θ). Decompose
θ = X̃β + Z̃b,   with X̃ : p × (p − h), β : (p − h) × 1, Z̃ : p × h, b : h × 1.
Choose X̃ and Z̃ so that
I PX̃ = 0: β is not penalized by P,
I Z̃>PZ̃ = Ih , so that pen(θ) = b>b:
θ>Pθ = (X̃β + Z̃b)>P(X̃β + Z̃b) = b>Z̃>PZ̃ b = b>b.
=⇒ simple i. i. d. random effects b associated with design matrix BZ̃; fixed effects β with BX̃:
Bθ = (BX̃)β + (BZ̃)b   (unpenalized part + penalized part)
146 / 398
Mixed Model Decomposition of Penalized Terms

I X̃, Z̃ not unique
I for example: X̃ from a basis of null(P): eigenvectors of P with 0 eigenvalues.
I for example: Z̃ = L(L>L)−1 with P = ΓΩ+Γ> and L = ΓΩ+^{1/2}, where Γ is p × h and Ω+ is the h × h diagonal matrix of positive eigenvalues.
I (can use any matrix root of P for Z̃)
(numerical sketch below)

147 / 398
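A small numerical sketch (not from the original slides) of this decomposition for a 2nd-order difference penalty, using the eigen-decomposition route described above.

K <- 10
D <- diff(diag(K), differences = 2)
P <- crossprod(D)                               # penalty matrix, rank h = K - 2
eP <- eigen(P, symmetric = TRUE)
h <- sum(eP$values > 1e-8)
Xtilde <- eP$vectors[, (h + 1):K]               # basis of null(P): unpenalized part
L <- eP$vectors[, 1:h] %*% diag(sqrt(eP$values[1:h]))   # L = Gamma Omega_+^{1/2}
Ztilde <- L %*% solve(crossprod(L))             # Z~ = L (L'L)^{-1}
range(P %*% Xtilde)                             # ~ 0: the X~-part is not penalized
range(t(Ztilde) %*% P %*% Ztilde - diag(h))     # ~ 0: penalty reduces to b'b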
Influence of λ (1st differences)
[Figure: penalized spline fits to the climate reconstruction data with λ = 0.001, 1, 1000 (1st-order difference penalty)]

148 / 398
Influence of λ (2nd differences)
[Figure: penalized spline fits to the climate reconstruction data with λ = 0.001, 1, 1000 (2nd-order difference penalty)]

149 / 398
Optimizing smoothing parameters

I optimizing smoothing parameter instead of number, locations of knots


I much easier: positive scalar instead of variable length vector of
locations
I typical strategies:
I optimizing (pseudo-)predictive criteria: AIC, GCV, test set error, ...
I empirical Bayes: estimation via mixed models
I fully Bayes: hyperparameter λ with a suitable hyperprior.

150 / 398
Implementation in R

library("mgcv")
formula <- temp ~ s(year, bs = "ps", m = c(2, 2), k = 20)

gam1 <- gam(formula,


method = "REML", data = tempdata,
family = gaussian
)
gam2 <- gam(formula,
method = "GCV.Cp", data = tempdata,
family = gaussian
)

plot(gam1, residuals = TRUE)


plot(gam2, residuals = TRUE)

151 / 398
Implementation in R

[Figure: estimated smooth effects of year with partial residuals, for the REML fit and the GCV fit (edf 16.5 and 18.78)]
152 / 398
I gam similar syntax to glm:
I s(z) requests a smooth effect of z
I bs="ps" specifies a P-spline basis
I m=c(2,2) controls order of spline (spline order+1, m[1]) and
difference order of penalty (m[2]).
I k=20 controls number of basis functions
I method="REML" to control optimization criteria

153 / 398
I equivalent degrees of freedom (edf) measure complexity of a function estimate
I for the linear model,
tr( X(X>X)−1 X> ) = dim(β),
where H := X(X>X)−1 X> is the hat matrix with Hy = ŷ.
I equivalently, for the penalized fit:
edf(λ) = tr( B(B>B + λP)−1 B> ).
I for P-splines with d-th difference penalty:
d ≤ edf(λ) ≤ dim(θ).
(computation sketched below)

154 / 398
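A one-line sketch of this trace computation, with B, P and lambda assumed as in the penalized-LS sketch above:

edf <- sum(diag(B %*% solve(crossprod(B) + lambda * P, t(B))))
edf    # lies between d = 2 and dim(theta) = ncol(B)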
Generalized Additive Models
I Generalized Additive Models (GAM) extend generalized linear models as:
E(y|η) = h(η),   η = x>β + f1 (z1 ) + . . . + fq (zq ).
I f1 , . . . , fq are smooth effect functions modeled via penalized splines
I not identifiable as such:
f̃1 (z1 ) = f1 (z1 ) + c, f̃2 (z2 ) = f2 (z2 ) − c.
I so center effect functions:
∫ fj (zj ) dzj = 0, i.e., Σ_{i=1}^n fj (zji ) = 0.

155 / 398
GAM estimation

I estimate via maximization of penalized likelihood:
l(β, θ1 , . . . , θq ) − Σ_{j=1}^q λj θj>Pj θj .
I uses penalized IWLS
I multiple smoothing parameters to optimize: much harder optimization problem

156 / 398
I additivity possibly too strong assumption: ignores interactions
I More general:
η = x0 β + f (z1 , . . . , zq )

I ... infeasible even for fairly small q.


I but: can try to estimate at least lower-order interactions like f (z1 , z2 )
I interpret and visualize as effect surfaces. Direct analogy to and useful
for spatial effects.

157 / 398
Surface Estimation

I Two flavors:
I tensor product splines
I radial basis functions: basis functions over (subsets of) R2

158 / 398
Tensor Products

I linear models represent interaction effects via elementwise products of


the respective design matrix column vectors.
I analogous idea for surface estimation: create spline bases for z1 and
z2 and compute all pairwise products of their basis functions.
⇒ tensor product basis functions

159 / 398
Tensor Products

I define univariate bases b1^(1) (z1 ), . . . , bK^(1) (z1 ) and b1^(2) (z2 ), . . . , bK'^(2) (z2 ) for z1 and z2 .
I define tensor product basis for surface f (z1 , z2 ) by
Bjk (z1 , z2 ) = bj^(1) (z1 ) bk^(2) (z2 ), j = 1, . . . , K , k = 1, . . . , K'.
I represent surface as
f (z1 , z2 ) = Σ_{j=1}^K Σ_{k=1}^{K'} θjk Bjk (z1 , z2 ).

160 / 398
[Figure-only slides: 161 / 398 - 163 / 398]
Penalty terms for Tensor Product Splines

I need sensible penalties for tensor product P-splines
I consider coefficient vector arranged as a K × K array:
θ = (θ11 , . . . , θK1 , . . . , θ1K , . . . , θKK )>
I every row θr = (θ1r , . . . , θKr ) represents a univariate spline in z1 -direction
I every column θc = (θc1 , . . . , θcK )> represents a univariate spline in z2 -direction

164 / 398
I let P1 be the penalty matrix for a spline in z1 . Then
θr P1 θr>
quantifies wiggliness of the r -th row and “total row-wiggliness” is
Σ_{r=1}^K θr P1 θr> .
I more compact with Kronecker product:
Σ_{r=1}^K θr P1 θr> = θ>(I ⊗ P1 )θ.

165 / 398
I let P2 be the penalty matrix for a spline in z2 . Then
θc>P2 θc
quantifies wiggliness of the c-th column and “total column-wiggliness” is
Σ_{c=1}^K θc>P2 θc .
I more compact with Kronecker product:
Σ_{c=1}^K θc>P2 θc = θ>(P2 ⊗ I)θ.

166 / 398
I in combination we get a penalty term
θ>( λ1 I ⊗ P1 + λ2 P2 ⊗ I )θ =: θ>P(λ1 , λ2 )θ.
I and, as usual, we can optimize the penalized likelihood
l(θ) − θ>P(λ1 , λ2 )θ.
(mgcv sketch below)

167 / 398
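In mgcv, this anisotropic penalty (one λ per direction) is what te() constructs; a sketch with hypothetical simulated data, not taken from the original slides.

library("mgcv")
set.seed(1)
dat2 <- data.frame(z1 = runif(300), z2 = runif(300))
dat2$y <- sin(2 * pi * dat2$z1) * dat2$z2 + rnorm(300, sd = 0.2)
m_te <- gam(y ~ te(z1, z2, bs = "ps", k = c(8, 8)), data = dat2, method = "REML")
m_te$sp                                  # two smoothing parameters: lambda_1, lambda_2
vis.gam(m_te, plot.type = "contour")     # estimated surface f(z1, z2)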
Radial Basis Functions
I For a given knot κ = (κ1 , κ2 ), a radial basis function is defined as
Bκ (z) = B(||z − κ||).
I contour lines of such basis functions are circles ⇒ radial functions
I examples (with d = ||z − κ||):
B(d) = d² log(d)   (thin plate spline),
B(d) = d^k , k even,
B(d) = sqrt(d² + c²) with c > 0.
I penalty terms typically based on integrals of (squared) bivariate derivatives

168 / 398
Tensor Product Splines vs. Radial Basis Functions

I tensor products:
I invariant against linear covariable transformations
I allow for combination of covariables on different domains, scales, units.
I allow for anisotropy: different roughness over different axes.
I radial basis functions:
I invariant against rotations of covariate space
I useful for spatial/isotropic effects
I only have a single smoothing parameter

169 / 398
I surface estimates can represent interactions of metric covariates
I how to represent interactions between categorical and metric
covariates?

170 / 398
I model equation

η = . . . + u1 f1 (z1 ) + u2 f2 (z2 ) + . . .

where u1 , u2 etc are dummy variables for the different levels of a


categorical variable u
I separate effect functions for each level of u

171 / 398
I more generally, varying coefficient effects are written as

η = . . . + uf (z) + . . .

I effect of u varies over the domain of z.


I “effect modifier” z
I useful also, e.g. for time-varying effects etc:

η = . . . + uf (t) + . . .

172 / 398
I model representation: actually a simpler special case of tensor product bases:
I “basis” matrix for a categorical covariate u is a matrix of dummy variables
I “basis” matrix for the linear effect of a metric covariate u is simply u
I in both cases, the design matrix for the varying coefficient term is created by the tensor product of the effect modifier’s spline basis and the covariate’s design matrix.
I for f = (f (z1 ), . . . , f (zn ))> = Bθ, multiplication with u simply means
u · f = diag(u1 , . . . , un )Bθ :
a rescaled basis matrix.
I for categorical u, we do the same thing for each dummy variable, to end up with “copies” of the original spline basis that are set to zero in rows which don’t belong to the respective level of u
(mgcv sketch below)
173 / 398
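A sketch (not from the original slides) of how both variants map onto the by argument of mgcv::s(); the data are hypothetical and simulated only so the calls run.

library("mgcv")
set.seed(1)
n <- 400
dat3 <- data.frame(z = runif(n), u_num = rnorm(n),
                   u_fac = factor(sample(c("A", "B"), n, replace = TRUE)))
dat3$y <- with(dat3, u_num * sin(2 * pi * z) + rnorm(n, sd = 0.3))
# metric effect modifier: eta = ... + u_num * f(z)
m_vc1 <- gam(y ~ s(z, by = u_num), data = dat3, method = "REML")
# categorical u: one smooth f_level(z) per level of u_fac (plus the factor main effect)
m_vc2 <- gam(y ~ u_fac + s(z, by = u_fac), data = dat3, method = "REML")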
Part III

Functional Principal Component


Analysis

174 / 398
Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis

Functional PCA for Sparse Functional Data


Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis

Functional PCA for Sparse Functional Data

176 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data
Idea:
I Find normalized weight vectors φk ∈ Rp that maximize the sample variance of ξik = φk>xi for x1 , . . . , xn ∈ Rp .
I Identify most important modes of variation in the data

Definition
Given observations x1 , . . . , xn ∈ Rp with zero mean, the first principal component (PC) is defined by
φ1 = arg max_{||φ||2 =1} (1/n) Σ_{i=1}^n (φ>xi )².
The subsequent principal components φk can be found analogously subject to the additional orthogonality constraint
φj>φk = 0 for all j < k.

176 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data

Remarks:
I Principal components φk ∈ Rp have same length/structure as the
data
I Normalizing restriction kφk k2 = 1 makes sure that PCs are well
defined, but no unique specification (for any solution φ the vector
−φ is a solution, too)
I Orthogonality constraint φ>j φk = 0 ensures that φk indicates a new
mode of variation that is not explained by the preceding components
φ1 , . . . , φk−1
I PCA can be seen as a dimension reduction tool: use ξi = (ξi1 , . . . , ξiK ) with K ≪ p instead of the observed values xi = (xi1 , . . . , xip ).

177 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data
Alternative characterization:
Theorem (Multivariate PCA as eigenanalysis problem)
Let V = (1/n) X>X be the sample covariance matrix of (mean-centered) x1 , . . . , xn with eigenvalues ν1 ≥ ν2 ≥ . . . ≥ νm > 0. Then
φ1 = arg max_{||φ||2 =1} (1/n) Σ_{i=1}^n (φ>xi )² = arg max_{||φ||2 =1} φ>Vφ
is equivalent to the solution of
Vφ1 = ν1 φ1 , ||φ1 || = 1.
The principal components are hence orthonormal solutions of
Vφk = νk φk , k = 1, . . . , m.

178 / 398
Multivariate Principal Component Analysis (Review)
Other representations of functional data

Remarks:
I Eigenanalysis of V is equivalent to singular value decomposition
(SVD) of X
I computationally vastly more efficient
I usually only need to compute first few leading singular vectors and
values
I very efficient algos using e.g. random projections, for sparse matrices
I variants for partially observed data (→ simple recommender systems)
I The eigenvalues ν1 , . . . , νm describe the amount of variability
explained by their principal components.
I Proportion of variability explained by the k-th principal component:
0 < νk / Σ_{j=1}^m νj ≤ 1.
(see the sketch below)

179 / 398
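A small base-R sketch (not from the original slides) illustrating the equivalence between the eigen-decomposition of V and the SVD of X, on simulated data.

set.seed(1)
X <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)  # centered data
V <- crossprod(X) / nrow(X)                      # sample covariance matrix
eig <- eigen(V, symmetric = TRUE)
sv <- svd(X)
max(abs(abs(eig$vectors) - abs(sv$v)))           # ~ 0: PCs = right singular vectors (up to sign)
eig$values / sum(eig$values)                     # proportion of variability per PC
scores <- X %*% eig$vectors[, 1:2]               # first two PC scores xi_i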
Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis


Definition
Estimation in basis representation:
Example: FPCA
Regularized Functional Principal Component Analysis

Functional PCA for Sparse Functional Data

180 / 398
Functional Principal Component Analysis
Definition

Basic setting:
I Data generating process: smooth random function X (t) with
I unknown mean function µX (t) = E[X (t)]
I unknown covariance function vX (s, t) = Cov(X (s), X (t))
I Observed functions: x1 (t), . . . , xn (t)
I often given on a set of sampling points t1 , . . . , tp
I preprocessing, smoothing
I For simplicity, assume that the functions are centered, i.e.
µ̂X (t) = (1/n) Σ_{i=1}^n xi (t) ≡ 0

180 / 398
Functional Principal Component Analysis
Definition

Example: Berkeley growth study


[Figure: Berkeley growth curves (height vs. age) for girls and boys]
I 39 boys, 54 girls
I p = 31 measurements between 1 and 18 years (same timepoints for each child)
I measurements not equally spaced

181 / 398
Functional Principal Component Analysis
Definition

Functional PCA: Idea


Extend multivariate PCA to the functional case.

I Functional data xi (t) require functional weights φ(t)
I Scalar product for functional data
⟨φ, xi ⟩ = ∫ φ(t)xi (t)dt
induces the norm ||φ|| = ⟨φ, φ⟩^{1/2} = ( ∫ φ²(t)dt )^{1/2}

182 / 398
Functional Principal Component Analysis
Definition

Definition (Functional Principal Components)
The first functional principal component φ1 (t) of X is defined by
φ1 (t) = arg max_{||φ||2 =1} (1/n) Σ_{i=1}^n ( ∫ φ(t)xi (t)dt )².
The k-th functional principal component φk (t) is found analogously, subject to the additional constraint
∫ φj (t)φk (t)dt = 0 for all j < k.
The values ξik = ∫ φk (t)xi (t)dt are called functional principal component scores.

183 / 398
Functional Principal Component Analysis
Definition
Remarks:
I The definition of the functional principal components is equivalent to the multivariate case:
I vectors xi → functions xi (t)
I scalar product ⟨xi , φ⟩ = Σ_{j=1}^p xij φj → ⟨xi , φ⟩ = ∫ xi (t)φ(t)dt
I Functional principal components φk (t) are again only defined up to a sign change
I One can show that φk is the k-th eigenfunction of the sample covariance operator
(Vf )(s) = ∫ (1/n) Σ_{i=1}^n xi (s)xi (t) · f (t)dt = ∫ v̂X (s, t) f (t)dt.
The eigenvalue νk quantifies the amount of variability represented by the k-th functional principal component φk (t).
184 / 398
I Expand xi (t) in known basis functions bk (t):
xi (t) ≈ Σ_{k=1}^K θik bk (t)
I Estimation results in a standard matrix eigenanalysis problem involving the coefficients Θ = [θik ], i = 1, . . . , n, k = 1, . . . , K , and the basis functions b(t) = (b1 (t), . . . , bK (t)):
vX (s, t) = (1/n) b(s)Θ>Θ b(t)>
(V b(·)θ̃)(s) = ∫ vX (s, t) b(t)θ̃ dt = (1/n) b(s)Θ>ΘWθ̃   with W = [ ∫ bk (t)bk' (t)dt ]_{k,k'=1,...,K}
I eigenvectors uj of (1/n) W^{1/2}Θ>ΘW^{1/2}
I coefficient vectors for eigenfunctions φj (t) = b(t)θj with θj = W^{−1/2}uj
I results depend e.g. on the choice of the basis functions bk (t) and their number (numerical sketch below)
185 / 398
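A numerical sketch of this eigenanalysis (not from the original slides); the coefficient matrix Theta is simulated here just to make the code self-contained, and the integrals in W are approximated on a fine grid.

library("splines")
set.seed(1)
tgrid <- seq(0, 1, length.out = 501)
Bgrid <- bs(tgrid, df = 10, intercept = TRUE)                  # basis functions evaluated on a fine grid
W <- crossprod(Bgrid) * (tgrid[2] - tgrid[1])                  # W[k, k'] ~ int b_k(t) b_k'(t) dt (Riemann sum)
Theta <- scale(matrix(rnorm(50 * 10), 50, 10), scale = FALSE)  # n = 50 centered coefficient vectors (simulated)
eW <- eigen(W, symmetric = TRUE)
Wsqrt <- eW$vectors %*% diag(sqrt(eW$values)) %*% t(eW$vectors)   # symmetric square root W^{1/2}
M <- Wsqrt %*% crossprod(Theta) %*% Wsqrt / nrow(Theta)
eig <- eigen(M, symmetric = TRUE)                              # eig$values estimate the eigenvalues nu_j
theta_pc <- solve(Wsqrt, eig$vectors)                          # theta_j = W^{-1/2} u_j
phi_grid <- Bgrid %*% theta_pc                                 # eigenfunctions phi_j(t) = b(t) theta_j on the grid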
Functional Principal Component Analysis
Example: FPCA

Berkeley growth study: R-Code


library(fda)
data(growth)
age <- growth$age
height <- cbind(growth$hgtm, growth$hgtf)

# choose basis (cubic B-spline functions)


bb <- create.bspline.basis(age, norder = 4)

# create functional data object


growth.fd <- Data2fd(y = height, argvals = age, basis = bb)

# functional PCA
growth.pca <- pca.fd(growth.fd, nharm = 2, centerfns = T)

186 / 398
Functional Principal Component Analysis
Example: FPCA
Berkeley growth study: Principal components

0.4
0.2
0.0
−0.2
−0.4

PC 1
PC 2

5 10 15

Age

I Proportion of variance explained: 80.9% (PC 1), 13.6% (PC 2)


I Rather hard to interpret

187 / 398
Functional Principal Component Analysis
Example: FPCA
Interpretation:
I Karhunen-Loève representation:
xi (t) = µX (t) + Σ_{k=1}^∞ ξik φk (t)
The score ξik describes the weight of the k-th principal component for the i-th observation.
I Effect of the k-th principal component:
Assume new scores ξ̃j = ck for j = k and ξ̃j = 0 otherwise.
Then x̃(t) = µX (t) + Σ_{j=1}^∞ ξ̃j φj (t) = µX (t) + ck φk (t)
I Typical choices for ck : ±√νk , ±2√νk , as eigenvalues νk reflect the variability that is explained by φk
(see the sketch below)
188 / 398
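A sketch (not from the original slides) of how such “effect of PC k” curves can be computed from the pca.fd output; growth.pca and age are assumed from the R code two slides back.

k <- 1
mu_hat <- eval.fd(age, growth.pca$meanfd)                 # estimated mean function
phi_hat <- eval.fd(age, growth.pca$harmonics[k])          # k-th eigenfunction
c_k <- 2 * sqrt(growth.pca$values[k])                     # scaling by 2 * sqrt(nu_k)
matplot(age, cbind(mu_hat, mu_hat + c_k * phi_hat, mu_hat - c_k * phi_hat),
        type = "l", lty = c(1, 2, 2), xlab = "Age", ylab = "Height")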
Functional Principal Component Analysis
Example: FPCA
Berkeley growth study: Principal components
[Figure: effect of PC 1 and PC 2 on the growth curves]
I Effect shows µ̂X ± 2√νk φk .
I PC 1 as individual growth effect
189 / 398
Functional Principal Component Analysis
Example: FPCA

Dimension reduction:
I Use K -dimensional individual score vectors

ξi = (ξi1 , . . . , ξiK )

instead of the (infinite dimensional!) functions xi (t)


I If we choose K large enough, we lose only little information

190 / 398
Functional Principal Component Analysis
Example: FPCA
Berkeley growth study: Principal component scores


[Figure: scatterplot of the estimated scores on PC 1 vs. PC 2, girls and boys]
I PC 2 as gender-specific effect
191 / 398
Idea
Aim at smooth functional principal components
for better interpretation.

I smooth functions before doing FPCA:


e.g. via (penalized / low-rank) basis representation.
Simple, ad-hoc, can work well for regular data, scales well.
I penalize roughness of eigenfunctions:
for data in basis representation / on equidistant grids.
refund::fpca2s, refund::fpca.ssvd, fda::pca.fd
I smooth estimated covariance function before doing eigenanalysis:
computationally more challenging, applicable to sparse or irregular
data without pre-smoothing / basis representation.
refund::fpca.face, refund::fpca.sc
Enforcing smoothness of eigenfunctions acts as a low-pass filter:
leading eigenfunctions will be smoothed to represent only low-frequency
variation.
Truncated basis representation using only first few eigenfunctions can then
be used for smoothing.
192 / 398
Penalization of Eigenfunctions

I Penalization of second derivatives (cf. regularized regression)
PEN(φ) = ∫ ( φ''(t) )² dt
I Maximize penalized sample variance
PSV(φ) = 1 / ( ||φ||² + α PEN(φ) ) · (1/n) Σ_{i=1}^n ( ∫ φ(t)xi (t)dt )²
I Smoothing parameter α controls influence of the penalty.
(fda sketch below)

193 / 398
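In the fda package, such a roughness penalty on the eigenfunctions can be requested via the harmfdPar argument of pca.fd(); a sketch reusing growth.fd and the basis bb from the earlier R code. The penalty parameter (lambda below, playing the role of α) is chosen ad hoc.

harm_par <- fdPar(bb, Lfdobj = int2Lfd(2), lambda = 10)   # penalize squared 2nd derivative
growth.pca.smooth <- pca.fd(growth.fd, nharm = 2, harmfdPar = harm_par)
plot(growth.pca.smooth$harmonics)                         # smoothed eigenfunctions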
Influence of α:
[Figure: effect of PC 1 (top row) and PC 2 (bottom row) on the growth curves for α = 0.1, 10, 1000]
I An appropriate value of α can be found by cross-validation (one-curve-leave-out)
194 / 398
Smoothing the eigenfunctions

SSVD: Smooth SVD (Xiao et al. 2016, e.g.)


Simpler implementation for raw data on an equidistant grid:
Center the function evaluations X = [xi (tj )], i = 1, . . . , n, j = 1, . . . , T , then for k = 1, . . . , K do:
1. compute first right singular vector of X (first eigenfunction of X>X)
2. smooth with a simple difference penalty and cross-validated smoothing parameter λk to get φk = (φk (t1 ), . . . , φk (tT ))
3. compute loadings ξk
4. update X ← X − ξk φk

195 / 398
Smoothing the covariance surface

FACE: Fast Covariance Estimation (Xiao et al. 2016)


I uses lots of computational shortcuts to efficiently smooth covariance
surface via penalized tensor product splines: symmetry, array
arithmetic, clever optimization of smoothing parameter
I never actually computes covariance or tensor product
⇒ scales to large datasets with high resolution
I only easily applicable for regular data

196 / 398
R-Code: FACE & SSVD

library(refund)
growthmat <- rbind(t(growth$hgtm), t(growth$hgtf))
growth_face <- fpca.face(growthmat, argvals = growth$age, knots = 25, npc = 2)
growth_ssvd <- fpca.ssvd(growthmat, argvals = growth$age, npc = 2)

[Figure: first two estimated principal components from FACE (left) and SSVD (right) over age]

197 / 398
Multivariate Principal Component Analysis (Review)

Functional Principal Component Analysis

Functional PCA for Sparse Functional Data


Sparse Functional Data
Borrowing Strength across Observations
Functional Principal Component Scores as Conditional Expectations
Example

198 / 398
Functional PCA for Sparse Functional Data
Sparse Functional Data

Sparse functional data:


I Many functional data sets are sparse (or irregular)
I e.g. most longitudinal data sets
I Number and location of observation points ti1 , . . . , tiTi for each curve
may vary
I Can result in bad approximation for smoothed functions and scores ξik = ∫ (xi (t) − µX (t)) φk (t)dt, in particular if Ti is small (numerical integration fails!)

198 / 398
Functional PCA for Sparse Functional Data
Sparse Functional Data

Example: Sparsified Berkeley growth data


[Figure: four randomly selected growth curves (height vs. age) with their sparsified observations]
Artificial sparsification:
I Observations per child: Ti ∈ {2, . . . , 6}, median of 4
I Time points: tij ∈ {t1 , . . . , tTi }
I Figure shows sparse versions for 4 random children
I In total 370 observations (full dataset: 2883)

199 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations

Principal component Analysis through Conditional Expectation:


(Yao, H.-G. Müller, et al. 2005, PACE)
Idea
Develop a new FPCA method that is applicable to sparse functional data.

Basic setting:
I Use observation points directly without previous smoothing
I Account for additional measurement errors εij ∼ N (0, σ 2 )
I Model:
X∞
yij = xi (tij ) + εij = µX (tij ) + ξik φk (tij ) +εij
| {zk=1 }
Karhunen-Loève

200 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations
Estimation:
I Estimate mean and covariance functions using the pooled data
(”borrowing strength”)
I µ̂(t): by local linear smoother (or splines)
I v̂ (s, t) and σ̂²:
Cov(yij , yil ) = Cov(xi (tij ), xi (til )) + σ²δjl = vX (tij , til ) + σ²δjl
→ Smooth ”raw” covariances
v̂i (tij , til ) = (yij − µ̂(tij ))(yil − µ̂(til )); tij ≠ til
→ Consider diagonal values separately
I Estimate φk (t) and νk using the smoothed covariance estimate v̂X (tij , til )
201 / 398
Smoothing the crossproduct surface
“Interesting” problem:
I scales very badly: quadratic in n, Ti
I but: symmetric, so only “need” upper/lower triangle.

[Figure: covariance function estimates Ĉsq (left) and Ĉtr (right) for the cortical thickness data, with the locations (tij1 , tij2 ) of the raw products that are smoothed; below, scaled eigenfunction estimates λ̂j φ̂j , j = 1, 2, 3, 4, of the resulting estimated covariances (P. T. Reiss and Xu 2018; Cederbaum, Scheipl, et al. 2018)]
I surface estimate under positive-definiteness constraint!
202 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations
Berkeley Growth Study:
Mean estimation based on pooled data

[Figure: pooled sparsified observations (height vs. age) with the estimated mean function]

203 / 398
Functional PCA for Sparse Functional Data
Borrowing Strength across Observations
Berkeley Growth Study: Sparsified Data
Covariance estimation based on pooled data (raw & smoothed):

[Figure: raw (left) and smoothed (right) covariance surface estimates over age × age]

I Diagonal values removed in raw covariance


204 / 398
Functional PCA for Sparse Functional Data
Functional Principal Component Scores as Conditional Expectations
Theorem (FPCA through conditional expectation)
Assuming ξik and εij to be independent and jointly Gaussian, the best prediction for ξik is given by
ξ̃ik = E(ξik | yi , T ) = νk φik> Σyi^{−1} (yi − µi ),
with Σyi = Cov(yi , yi | T ).
I Notation:
yi = (yi1 , . . . , yiTi )>, µi = (µX (ti1 ), . . . , µX (tiTi ))>
φik = (φk (ti1 ), . . . , φk (tiTi ))>, T = {tij , i = 1, . . . , n, j = 1, . . . , Ti }
I Intuition: ξ̃ik is the best prediction of the true score ξik given the observed values yi and the pooled information from all observation points (T ).
205 / 398
Functional PCA for Sparse Functional Data
Example

Berkeley growth study: R-Code


library("refund")
# calculate the first two principal components
sparsePCA <- fpca.sc(argvals = age, Y = height_sparse, npc = 2)

I The matrix height sparse contains the artificially sparsified heights


(row-wise).
I Data is fed directly into the function fpca.sc, no pre-smoothing
with basis functions as in fda package.

206 / 398
Functional PCA for Sparse Functional Data
Example
Berkeley growth study: Estimated principal components
Effect of PC 1 Effect of PC 2
200

200
−−
−−
180

180
−−−
−−− +
+−
−− ++−
+ + −−
+ −−

+− +−
+−
−− −
160

160
− −
+ + +++++++ + +−
− +
−− ++ +−+−
− + + −
− + +−
140

140

+ ++−
Height

Height
− + −
+ ++
− −

+ ++ +

120

120
− ++ +

− + −
+
+ −
+
100

100
− +
+ −
+
− +
−−+ +−
−−
+
−− ++
80

80
++ +−

++
60

60
5 10 15 5 10 15

Age Age

I Proportion of variance explained: 93.2% (PC 1), 6.7% (PC 2)


I Reminder: Principal components are only defined up to a sign change
I Results for first PC very similar to full data analysis, even using only
13% of the original data!
207 / 398
Functional PCA for Sparse Functional Data
Example
Berkeley growth study: Estimated scores


[Figure: estimated FPC scores (PC 1 vs. PC 2) for the sparsified data, girls and boys]

I PC 2: Separation of girls/boys less clear


208 / 398
Summary

PACE approach by Yao, H.-G. Müller, et al. (2005):


I Suitable for sparse functional data
I Estimation of mean and variance function based on all observations
(”borrowing strength”)
I Deals with (white-noise) measurement errors
I BLUP estimates of functional principal component scores via
conditional expectation

209 / 398
Summary
Functional PCA:
I Directly extends multivariate PCA to functional data
I Dimension reduction technique
I ”optimal” low-rank basis representation for given data:
most variance explained with smallest K
I eigenvalue decay gives indication of inherent complexity of the data
I ”low-pass” filter via truncated FPC basis representations
I clustering
I Key tool for exploring functional data and further analyses
I clustering, anomaly detection, ...
I supervised learning with functional features
⇒ simply use FPC scores ξi as scalar feature vectors (see the sketch below)
I Penalized, smoothed versions often show clearer effects and facilitate
interpretation
I Phase variation can be a problem: slow eigenvalue decay

210 / 398
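A sketch (not from the original slides) of that last point, reusing the sparse-FPCA fit sparsePCA from the example; the gender indicator is_girl is hypothetical and assumes the rows are ordered boys first, then girls, as in the earlier growthmat code.

is_girl <- rep(c(0, 1), times = c(39, 54))      # assumed row order: 39 boys, then 54 girls
scores <- sparsePCA$scores                      # n x npc matrix of estimated FPC scores
m_class <- glm(is_girl ~ scores, family = binomial)
summary(m_class)                                # FPC scores used as ordinary scalar features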
Extensions:
I “functional fragment” data: via (low-rank) matrix completion methods (Descary and Panaretos 2018, e.g.)
[Figure: electricity spot price curves over 24 hours: a subsample of the original dataset, a fragmented subsample (d = 0.5), and a fragmented, discretized subsample, together with the associated covariance surfaces]
211 / 398
Part IV

Background: Boosting

212 / 398
Introduction

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation
Introduction
Motivation
Why boosting?

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation

214 / 398
Aims and scope

I Consider a sample containing the values of a response variable y and the


values of some predictor variables x = (x1 , . . . , xp )>
I Aim: Find the “optimal” function f ∗ (x) to predict y
I f ∗ (x) should have a “nice” structure, for example,

f ∗ (x) = β0 + β1 x1 + · · · + βp xp (GLM) or
f ∗ (x) = β0 + f1 (x1 ) + · · · + fp (xp ) (GAM)

⇒ f ∗ should be interpretable

214 / 398
Example 1 - Birth weight data

I Prediction of birth weight by means of ultrasound measures


(Schild et al. 2008)
I Outcome: birth weight (BW) in g
I Predictor variables:
I abdominal volume (volABDO)
I biparietal diameter (BPD, ”cranium width”)
I head circumference (HC)
I other predictors (measured one week before delivery)
I Data from n = 150 children with birth weight ≤ 1600g
⇒ Find f ∗ to predict BW

215 / 398
Birth weight data (2)

I Idea: Use 3D ultrasound measurements (left) in addition to conventional 2D


ultrasound measurements (right)

www.yourultrasound.com, www.fetalultrasoundutah.com

⇒ Improve established formulas for weight prediction

216 / 398
Example 2 - Breast cancer data

I Data collected by the Netherlands Cancer Institute


(van de Vijver et al. 2002)
I 295 female patients younger than 53 years
I Outcome: time to death after surgery (in years)
I Predictor variables: microarray data (4919 genes) + 9 clinical variables
(age, tumor diameter, ...)
⇒ Select a small set of marker genes (“sparse model”)
⇒ Use clinical variables and marker genes to predict survival

217 / 398
Classical modeling approaches
I Classical approach to obtain predictions from birth weight data and breast
cancer data:
Fit additive regression models (Gaussian regression, Cox regression) using
maximum likelihood (ML) estimation
I Example: Additive Gaussian model with smooth effects (represented by
P-splines) for birth weight data
⇒ f ∗ (x) = β0 + f1 (x1 ) + · · · + fp (xp )
abdominal volume biparietal diameter
200

60
100

40
f(volabdo)

f(bpd)

20
0
−100

0
−200

−20

50 100 150 200 250 5 6 7 8 9

volabdo bpd

218 / 398
Problems with ML estimation

I Predictor variables are highly correlated


⇒ Variable selection is of interest because of multicollinearity
(“Do we really need 9 highly correlated predictor variables?”)
I In case of the breast cancer data: Maximum (partial) likelihood estimates
for Cox regression do not exist (there are 4928 predictor variables but only
295 observations, p ≫ n)
⇒ Variable selection because of extreme multicollinearity
⇒ We want to have a sparse (interpretable) model including the relevant
predictor variables only
I Classical methods for variable selection (univariate, forward, backward, etc.)
are known to be unstable and/or require the model to be fitted multiple
times: post-selection inference?

219 / 398
Boosting - General properties

I Gradient boosting (boosting for short) is a fitting method to minimize


general types of risk functions w.r.t. a prediction function f
I Examples of risk functions: Squared error loss in Gaussian regression,
negative log likelihood loss, quantile/pinball loss, ...
I Boosting generally results in an additive prediction function, i.e.,
f ∗ (x) = β0 + f1 (x1 ) + · · · + fp (xp )
⇒ Prediction function is interpretable
⇒ If run until convergence, boosting can be regarded as a more generally
applicable alternative to conventional fitting methods (Fisher scoring,
backfitting) for generalized additive (mixed) models.

220 / 398
Why boosting?

In contrast to conventional fitting methods, ...


... boosting is applicable to many different risk functions (absolute loss,
quantile regression)
... boosting can be used to carry out variable selection during the fitting process
⇒ No separation of model fitting and variable selection
... boosting is applicable even if p ≫ n
... boosting addresses multicollinearity problems (by shrinking effect estimates
towards zero)
... boosting directly optimizes prediction accuracy (w.r.t. the risk function)

221 / 398
Introduction

Definition and Properties of Gradient Boosting


Problem Statement
Functional Gradient Descent
Componentwise Gradient Boosting

Regularization & Selection

Implementation

222 / 398
Gradient boosting - estimation problem

I Consider a one-dimensional response variable y and a p-dimensional set of


predictors x = (x1 , . . . , xp )>
I Aim: Estimation of

  f ∗ := arg min_{f (·)} EXY [ρ(y , f (x))] ,

where ρ is a loss function that is assumed to be differentiable (almost


everywhere) with respect to a prediction function f (x)
I Examples of loss functions:
I ρ := (y − f (x))2 → squared error loss in Gaussian regression
I Negative log likelihood function of a statistical model
I ρ := (1 − τ )(f (x) − y ) if y < f (x), and τ (y − f (x)) if y ≥ f (x)
→ pinball loss for τ -quantile regression
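A minimal R sketch of these loss functions (illustration only):

rho_l2      <- function(y, f) (y - f)^2                 # squared error loss
rho_l1      <- function(y, f) abs(y - f)                # absolute loss (median regression)
rho_pinball <- function(y, f, tau) {                    # pinball/check loss for the tau-quantile
  ifelse(y >= f, tau * (y - f), (1 - tau) * (f - y))
}
risk <- function(y, f, rho, ...) mean(rho(y, f, ...))   # empirical risk: mean loss over the sample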

222 / 398
Gradient boosting - estimation problem (2)

I In practice, we usually have a set of realizations


X = (x1 , . . . , xn ), y = (y1 , . . . , yn ) of x and y , respectively
⇒ Minimization of the empirical risk

  R(f ) = (1/n) Σ_{i=1}^n ρ(yi , f (xi )) → min over f

I Example: R(f ) = (1/n) Σ_{i=1}^n (yi − f (xi ))² corresponds to minimizing the
  expected squared error loss
I optimization over a function space =⇒ we’re in trouble...

223 / 398
Naive functional gradient descent (FGD)

I Idea: use gradient descent methods to minimize


R(f ) = R(f(1) , . . . , f(n) ) w.r.t. f(1) = f (x1 ), . . . , f(n) = f (xn )
=⇒ optimization over standard vector space!
I Start with offset values fˆ_(1)^[0], . . . , fˆ_(n)^[0]
I In iteration m:

  fˆ_(i)^[m] = fˆ_(i)^[m−1] + ν · ( −∂R/∂f_(i) (fˆ_(i)^[m−1]) ),   i = 1, . . . , n,

  where ν is a step length factor

⇒ Principle of steepest descent

224 / 398
Naive functional gradient descent (2)
(Very) simple example: n = 2, y1 = y2 = 0, ρ = squared error loss

  =⇒ R(f ) = 1/2 [ (f_(1) − 0)² + (f_(2) − 0)² ]
  =⇒ ∂R/∂f_(i) (fˆ_(i)^[m−1]) = fˆ_(i)^[m−1]

[Figure: contour plot of z = f1² + f2² over (f1, f2) — steepest descent moves toward the minimum at (0, 0)]
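A minimal R sketch of this naive steepest descent for the toy example above:

y  <- c(0, 0)
f  <- c(3, -2)          # arbitrary starting values f^[0]
nu <- 0.1               # step length factor
for (m in 1:100) {
  grad <- f - y         # dR/df_(i) for the squared error loss (up to a constant)
  f <- f - nu * grad    # steepest descent update
}
f                       # converges towards y = (0, 0)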

225 / 398
Naive functional gradient descent (3)

I Increase m until the algorithm converges to some values
  fˆ_(1)^[mstop], . . . , fˆ_(n)^[mstop]

I Problem with naive gradient descent:

  I No predictor variables involved
  I Structural relationship between fˆ_(1)^[mstop], . . . , fˆ_(n)^[mstop] is ignored
    (fˆ_(1)^[m] → y1 , . . . , fˆ_(n)^[m] → yn)
  I “Predictions” only for observed values y1 , . . . , yn

226 / 398
Componentwise Gradient Boosting

I Solution: Estimate the negative gradient in each iteration


I Estimation is performed by some base-learning procedure regressing the
negative gradient on the predictor variables
[m ] [m ]
=⇒ base-learning procedure ensures that fˆ(1) stop , . . . , fˆ(n) stop are
predictions from a statistical model depending on the predictor variables.
=⇒ fˆ is a learnable function of x
I To do this, we specify a set of regression models (base-learners) with the
negative gradient as the dependent variable
I In many applications, the set of base-learners will consist of p̃ = p simple
regression models
(e.g. one univariate (linear) model for each of the p predictor variables)

227 / 398
Componentwise Gradient Boosting (2)
Functional gradient descent (FGD) boosting algorithm:
[0]
1. Initialize the n-dimensional vector f̂ with some offset values (e.g., ȳ ).
Set m = 0 and specify the set of base-learners.
Denote the number of base-learners by p̃.
2. Increase m by 1.

Compute the negative gradient −∂/∂f ρ(y , f ) and evaluate it at
fˆ^[m−1](xi ), i = 1, . . . , n.
This yields the negative gradient vector

  u^[m−1] = ( −∂/∂f ρ(y , f ) |_{y = yi , f = fˆ^[m−1](xi )} )_{i=1,...,n}

..
.

228 / 398
Componentwise Gradient Boosting (3)

..
.

3. Approximate the negative gradient u[m−1] by each base-learner specified in


Step 1 by a simple LS(!) fit.

This yields p̃ vectors, where each vector is an estimate of the negative


gradient vector u[m−1] in terms of (parts of) x.

Select the base-learner that fits u[m−1] best (→ min. SSE).


Set û[m−1] equal to the fitted values from the corresponding best model.

..
.

229 / 398
Componentwise Gradient Boosting (4)

..
.
[m] [m−1]
4. Update f̂ = f̂ + ν û[m−1] , where 0 < ν ≤ 1 is a real-valued step
length factor.
5. Iterate Steps 2 - 4 until m = mstop .

Hothorn, Bühlmann, et al. 2010, e.g.
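A minimal, self-contained R sketch of these steps for the squared error loss with componentwise simple linear base-learners (simulated data; illustration only, not the mboost implementation):

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- 2 * X[, 1] - 1 * X[, 3] + rnorm(n, sd = 0.5)

nu <- 0.1; mstop <- 200
f    <- rep(mean(y), n)        # Step 1: offset
beta <- rep(0, p)              # aggregated coefficient per base-learner
for (m in 1:mstop) {
  u <- y - f                                     # Step 2: negative gradient (= residuals for L2 loss)
  b_hat <- sapply(1:p, function(j)               # Step 3: fit each base-learner by LS
    sum(X[, j] * u) / sum(X[, j]^2))
  sse <- sapply(1:p, function(j) sum((u - X[, j] * b_hat[j])^2))
  j_star <- which.min(sse)                       # best fitting base-learner
  f <- f + nu * X[, j_star] * b_hat[j_star]      # Step 4: update with step length nu
  beta[j_star] <- beta[j_star] + nu * b_hat[j_star]
}
round(beta, 2)   # informative variables dominate; uninformative ones are rarely selected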

230 / 398
Simple example
I In case of Gaussian regression, gradient boosting is equivalent to iteratively
re-fitting the residuals of the model.
I use a B-spline basis with 20 basis functions and ridge penalty as base-learner:

Data simulated from y = (0.5 − 0.9 e^(−50 x²)) x + 0.02 ε

[Figure: data with current model fit (left) and corresponding residuals (right) for boosting iterations m = 0 and m = 1]

231 / 398
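A hedged mboost sketch of this example (data simulated from the formula above; B-spline base-learner with 20 knots and fixed degrees of freedom):

library(mboost)
set.seed(1)
x <- runif(150, -0.25, 0.25)
y <- (0.5 - 0.9 * exp(-50 * x^2)) * x + 0.02 * rnorm(150)
dat <- data.frame(x = x, y = y)
fit <- gamboost(y ~ bbs(x, knots = 20, df = 4), data = dat,
                control = boost_control(mstop = 100, nu = 0.1))
plot(x, y)
lines(sort(x), predict(fit)[order(x)], col = 2)   # current fit after 100 iterations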

Properties of gradient boosting

I It is clear from Step 4 that the predictions of y1 , . . . , yn in iteration mstop


take the form of an additive function:
  f̂^[mstop] = f̂^[0] + ν û^[0] + · · · + ν û^[mstop −1]

I The structure of the prediction function depends on the choice of the


base-learners
I For example, linear base-learners result in linear prediction functions
I Smooth base-learners result in additive prediction functions with
smooth components
⇒ final fˆ[mstop ] (x) has a meaningful interpretation

232 / 398
Properties of gradient boosting

I gradient boosting can optimize any loss function via a series of simple LS
steps
=⇒ huge flexibility, scales well
I local linear approximation to loss surface good enough for ν ≪ 1

I The step length factor ν could be chosen adaptively. Legend has it that
adaptive strategies do not improve the estimates of f ∗ and lead to an
increase in running time
=⇒ set ν small (ν = 0.1) but fixed.
Fixed ν also required for (unbiased) variable selection, easy tuning.

233 / 398
Introduction

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation

234 / 398
Gradient boosting with early stopping

I Gradient boosting has a "built-in" mechanism for base-learner selection in
  each iteration.
⇒ This mechanism will carry out variable selection.
I Gradient boosting is applicable even if p > n.
I In case p > n, it is usually desirable to select a small number of informative
predictor variables (“sparse solution”).
I If m → ∞, the algorithm will select non-informative predictor variables.
=⇒ Overfitting can be avoided if the algorithm is stopped early, i.e., if
mstop is considered as a tuning parameter of the algorithm

234 / 398
Illustration of variable selection and early stopping
I Very simple example: 3 predictor variables x1 , x2 , x3 ,
[m]
3 linear base-learners with coefficient estimates β̂j , j = 1, 2, 3
I Assume that mstop = 5
I Assume that x1 was selected in iteration 1, 2, 5
I Assume that x3 was selected in iteration 3 & 4
  f̂^[mstop] = f̂^[0] + ν û^[0] + ν û^[1] + ν û^[2] + ν û^[3] + ν û^[4]
            = β̂0^[0] + ν (β̂0^[0] + β̂1^[0] x1 ) + ν (β̂0^[1] + β̂1^[1] x1 ) +
              ν (β̂0^[2] + β̂3^[2] x3 ) + ν (β̂0^[3] + β̂3^[3] x3 ) + ν (β̂0^[4] + β̂1^[4] x1 )
            = β̂0* + ν (β̂1^[0] + β̂1^[1] + β̂1^[4]) x1 + ν (β̂3^[2] + β̂3^[3]) x3
            = β̂0* + β̂1* x1 + β̂3* x3

=⇒ Linear prediction function


I x2 is not included in the model, since its base-learner was never selected
=⇒ variable selection
235 / 398
How should the stopping iteration be chosen?

I Use cross-validation techniques to determine mstop

(Hofner, Mayr, Robinzonov, et al. 2014)

⇒ The stopping iteration is chosen such that it maximizes prediction accuracy.
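A minimal sketch with mboost (assuming a fitted boosting model fit, e.g. from the gamboost sketch above):

# 25-fold bootstrap cross-validation of the empirical risk over boosting iterations
cvr <- cvrisk(fit, folds = cv(model.weights(fit), type = "bootstrap", B = 25))
plot(cvr)                 # out-of-bag risk as a function of m
mstop(cvr)                # prediction-optimal stopping iteration
fit <- fit[mstop(cvr)]    # set the model to the chosen stopping iteration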

236 / 398
Shrinkage

I Early stopping will not only result in sparse solutions but will also lead to
shrunken effect estimates (→ only a small fraction of û is added to the
estimates in each iteration).
I Shrinkage leads to a downward bias (in absolute value) but to a smaller
variance of the effect estimates (similar to Lasso or Ridge regression).
⇒ Multicollinearity problems are addressed.

237 / 398
Variable selection: complications & improvements

I selection is biased towards more flexible base learners:


better able to fit gradient in each iteration, so get picked as winners
more often
=⇒ need to handicap flexible base learners accordingly:
e.g., adding penalty with fixed number of edf for spline base learners
(Hofner, Hothorn, et al. 2011)
I theoretical guarantees on FWER using stability selection:
(Hofner, Boccuto, et al. 2015; Meinshausen and Bühlmann 2010)
Idea: Use inclusion frequencies on resampled datasets as a function of
m to determine a set of stable, relevant baselearners.

238 / 398
Introduction

Definition and Properties of Gradient Boosting

Regularization & Selection

Implementation

239 / 398
mboost

Package mboost:
I baselearners: linear, (tensor product) splines, trees, radial basis
functions, random effects, ...
I wide variety of loss functions
I parallelized cross validation, stability selection
I computationally fairly efficient: sparse matrix algebra, index
compression, array arithmetic for tensor product designs (I. D. Currie
et al. 2006)
I ... but creates huge model objects ...
(Hothorn, Buehlmann, et al. 2018)
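A sketch of how different base-learner types can be mixed in one model formula (the data set dat and the variables x1–x4, id are hypothetical):

library(mboost)
mod <- gamboost(y ~ bols(x1) +              # linear effect
                    bbs(x2, df = 4) +       # P-spline smooth effect
                    brandom(id) +           # ridge-penalized random intercept (id a factor)
                    btree(x3, x4),          # tree base-learner for an interaction
                data = dat, family = Gaussian(),
                control = boost_control(mstop = 500))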

239 / 398
gamboostLSS

Package gamboostLSS :
I extensions to models with multiple additive predictors
I loss is a negative log-likelihood, additive predictors for different
distribution parameters
I e.g.: model conditional variances and means
I e.g.: bivariate Poisson for modeling soccer scores (Groll et al. 2018):
model rates (attacking strengths) and association (tactic effects).
I mboost as computational engine
(Mayr et al. 2012)
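A minimal gamboostLSS sketch for a Gaussian location-scale model (the data set dat and its variables are hypothetical):

library(gamboostLSS)
# separate additive predictors for the mean (mu) and standard deviation (sigma)
mod <- gamboostLSS(list(mu    = y ~ bols(x1) + bbs(x2),
                        sigma = y ~ bbs(x2)),
                   data = dat, families = GaussianLSS(),
                   control = boost_control(mstop = 500, nu = 0.1))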

240 / 398
Summary

I very general, performant method for additive regression, classification


I yields regularized point estimates with feature selection
I uncertainty quantification and tuning require resampling
I good R implementation: mboost and its extensions
I many other variants (e.g. tree-based: gbm (Greenwell et al. 2019),
XGboost (T. Chen et al. 2019))

241 / 398
Part V

Functional Regression Models: Theory

242 / 398
Introduction

Model

Covariate Effects

Effect Representation

Estimation & Inference

Applications
Introduction
Motivation
Framework

Model

Covariate Effects

Effect Representation

Estimation & Inference

Applications

244 / 398
Functional Data

I Size and complexity of data on the rise. Increasingly, data collected


for which observations are curves.
I may be sampled on a (dense) regular grid or sparsely/irregularly.
I Examples: Spectroscopy, medical imaging, accelerometers,
longitudinal blood marker profiles, . . . .

244 / 398
Functional Data
[Figure: Diffuse Reflectance [%] over Wavelength [nm], colored by Tissue type (Corticalis, Nerve, S.Gland, Spongiosa)]

Average spectra for 12 animals and four tissue types.


245 / 398
Functional Data

[Figure: Total CD4 Cell Count over Months since seroconversion]

CD4 cell count trajectories in 366 HIV infected individuals.

246 / 398
Structured Functional Data

I More and more often, functional data exhibit additional structures


known from scalar data, e.g. longitudinal studies, spatial,
hierarchical or crossed designs.
I Examples: Longitudinal neuroimaging study on MS, precipitation
curves at spatial locations in Canada, . . . .

247 / 398
Structured Functional Data: Longitudinal

[Figure: Fractional Anisotropy profiles along the tract for Patient B, colored by visit 1–6]

248 / 398
Structured Functional Data: Spatial

[Figure: log(Precipitation) curves over Months at Canadian weather stations, colored by climate zone (Arctic, Atlantic, Continental, Pacific), with station locations on a map]

249 / 398
Non-Gaussian Functional Data

Data with latent functional structure:


I Time-series of counts, absence/presence, states
I Examples: Sleep studies (REM/Non-REM etc), feeding behavior of
pigs, . . . .

250 / 398
Application: The Piggy Panopticon
PIGWISE Project: RFID surveillance of pig behaviour (Maselyne et al. 2014)
I measure proximity to trough (yes-no) every 10 secs over 102 days for
100 pigs
I additionally: humidity, temperature over time
⇒ models of feeding behaviour potentially useful for ethology (porcine
sociology, clustering) & quality control (disease, quality of feed stock)
[Figure: trough-proximity indicator for pig 57 over time of day (02:00–22:00) and day (1–100)]

251 / 398
Functional Data Analysis

Questions of interest similar to those for scalar data:


I Regression with functional responses and/or functional covariates
I Quantification of uncertainty, structured variability
I Prediction
I (Not our focus right now: classification, clustering, description, . . . )
⇒ Functional data analysis

252 / 398
Aims and Means

I Functional data analogon of Generalized Additive Mixed Models:


Flexible, modular regression models with
I non-Gaussian functional or scalar responses
I multiple scalar and functional covariates
I linear and non-linear covariate effects and interactions
I correlated data (longitudinal/spatial/hierarchical)
I valid inference tests, CIs
I feature selection and model choice
I dense or sparse/irregular functional responses
I FPC- and spline-based techniques
I implementation in R packages: refund’s pffr(), FDboost
I Key idea:
Represent functional regression in terms of solved problems for
scalar data.

253 / 398
Key Idea

I Flexible, modular regression models for functional responses and/or


covariates
I Represent functional regression in terms of solved problems for
scalar data.
I Key idea for functional responses:
model observations within curves,
shift all functional structure into additive predictor → penalized scalar
regression
I model raw functional data, not basis representations
I use point-wise loss functions , integrated over functional domain
I recycle/adapt existing methodology and algorithms for scalar data
models
(Scheipl, Staicu, et al. 2015; Scheipl, Gertheiss, et al. 2016; Brockhaus, Scheipl, et al. 2015;
Greven and Scheipl 2017)

254 / 398
Introduction

Model
Generic Framework
Generalized Functional Additive Mixed Models

Covariate Effects

Effect Representation

Estimation & Inference

Applications

255 / 398
Generic Additive Regression Model
Observations (Yi , Xi ), i = 1, . . . , N, with
I Yi a functional (scalar) response over interval T = [a, b], [t, t]
I Xi a set of scalar and/or functional covariates.

Generic additive regression model


  ξ(Yi |Xi = xi ) = f (xi ) = Σ_{r=1}^R fr (xi ),

I ξ the modeled feature of the conditional response distribution, e.g.


expectation (with link function), median, a quantile, . . . .
I Partial effects fr (xi ) are real valued functions over T depending on
one or more covariates.

(Greven and Scheipl 2017)


255 / 398
Some transformation functions ξ and loss functions ρ
Model: ξ(Yi |Xi = xi ) = f (xi ) = Σ_{r=1}^R fr (xi )

Choose loss function ρ corresponding to transformation function ξ.


For scalars e.g.:
                          ξ                ρ(Y , h(x))
  mean regression         E                L2 -loss
  median regression       q0.5             L1 -loss
  quantile regression     qτ               check function
  generalized regression  g ◦ E            neg. log-likelihood
  GAMLSS                  (E, Var), e.g.   neg. log-likelihood

Loss for functional responses: Integrate loss ρ(Y , h(x))(t) over T .


Goal: Minimize the expected loss, the risk, w.r.t. f :
  ∫ ρ(Y , f (x)) dµ → min over f

Typically, dµ(t) = v (t)dt for some weight function v (t) ≥ 0.


256 / 398
Partial effects fr (x)
P
Model: ξ(Y |X = x) = f (x) = r fr (x)

covariate(s)                 type of effect                  fr (x)(t)
(none)                       smooth intercept                α(t)
scalar covariate z           linear effect                   zβ(t)
                             smooth effect                   γ(z, t)
functional covariate x(t)    linear concurrent effect        x(t)β(t)
functional covariate x(s)    linear functional effect        ∫_S x(s)β(s, t)ds
                             historical effect               ∫_{l(t)}^{u(t)} x(s)β(s, t)ds
                             smooth functional effect        ∫ f (x(s), s, t)ds
grouping variable g          functional random intercept     Bg (t)
g and scalar z               functional random slope         zBg (t)
curve indicator i            curve-specific smooth residual  Ei (t)

Plus interactions. No dependence on t for scalar responses.


257 / 398
Covariate effects: function-on-function

I concurrent effect x(t)β(t)
I linear effect of functional covariate ∫_S x(s)β(s, t) ds
I constrained effect of functional covariate ∫_{l(t)}^{u(t)} x(s)β(s, t) ds
  e.g. with limits [0, t] (historical effect) or [t − δ, t] (lag effect)

(Brockhaus, Melcher, et al. 2017; Brockhaus, Fuest, et al. 2018)

258 / 398
Generalized Functional Additive Mixed Models
Structured additive regression models of the general form

  yi (t)|Xi ∼ P(µi (t), ν)

  µi (t) = E (yi (t)|Xi ) = g ( Σ_{r=1}^R fr (Xri , t) )

I functional responses yi (t) over domain T ⊂ R, i = 1, . . . , n


observed on grids ti = (ti1 , . . . , tiTi ) ⊂ T
I P: exponential family, Beta, scaled t, Tweedie, . . .
I Xri a subset of covariate set X containing
I scalar covariates (metric or categorical)
I functional covariates
I grouping factors (random effects)
I known response function g (), additional nuisance parameters ν.

259 / 398
Application: Model
Model feeding rate
I for each day i = 1, . . . , 102 for a single pig
I as smooth function over time t
Response: binary feeding indicators ỹi (t) summed over 10min intervals

  yi (t) ∝ ∫_{t−10min}^{t} ỹi (s)ds

  yi (t)|Xi ∼ Bin(n = 60, p = µi (t))

  µi (t) = logit⁻¹ ( Σ_{r=1}^R fr (Xri , t) )
fr (Xri , t) could be:
I baseline rate (functional intercept)
I effect of humidity & temperature in pig sty
I aging effect over days
I day-specific effect (functional random effect)
I auto-regressive effect of earlier feeding behaviour

260 / 398
Introduction

Model

Covariate Effects
Motivation
Recap: Penalized Splines

Effect Representation

Estimation & Inference

Applications

261 / 398
Covariate Effects: Examples

Xr                    fr (Xr , t)
∅                     functional intercept β0 (t)
humidity hum(t)       linear functional effect ∫_{l(t)}^{u(t)} hum(s)β(s, t)ds;
                      smooth functional effect ∫_{l(t)}^{u(t)} f (hum(s), s, t)ds;
                      concurrent effects f (hum(t), t) or hum(t)βh (t)
yi (t)                (auto-regressive) functional effects ∫_{t−δ}^{t−1} yi (s)β(s, t)ds;
                      lagged effects f (yi (t − δ), t) or yi (t − δ)β(t)
hum(t), temp(t)       concurrent interaction effects,
                      e.g., f (hum(t), temp(t), t) or temp(t)β(hum(t))
scalar covariate i    aging effect iβ(t) or f (i, t)
(day indicator)       functional random intercepts bi (t)

261 / 398
Spline regression

Represent nonlinear effects as weighted sums of basis functions:

[Figure: "Unpenalized Splines" — B-spline bases and fits with K = 6, 12, 24 knots; more knots yield increasingly wiggly fits]
Penalized Splines
Use very flexible basis and add penalization for excessively wiggly fits:
=⇒ trade-off between goodness-of-fit and simplicity/generalizability

[Figure: penalized spline fits to the same scatterplot for λ = 1e−04, λ = 1, λ = 1000 and λ = λGCV]
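A minimal mgcv sketch of such a penalized spline fit, with the smoothing parameter chosen automatically by REML (simulated data):

library(mgcv)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)
# rich P-spline basis; wiggliness controlled by the estimated smoothing parameter
fit <- gam(y ~ s(x, bs = "ps", k = 24), method = "REML")
fit$sp                     # estimated smoothing parameter lambda
plot(fit, shade = TRUE)    # fitted smooth with confidence band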
Tensor Product Splines

Source:http://www.web-spline.de/web-method/b-splines/tpspline.png
Introduction

Model

Covariate Effects

Effect Representation
Tensor Product Representation
Penalization
Spline Bases & Penalties
FPC Bases & Penalties

Estimation & Inference

Applications

266 / 398
Effect Representation

I model for nT -vector y = (y1 (t11 ), . . . , yn (tnT ))> (t i ≡ t here).


I rows of Xr contain Xri , i = 1, . . . , n
I Bxr and Btr contain evaluations of marginal bases over Xr and t,
respectively.
Linearize effect estimation via tensor product basis of basis functions in
Xr and t:

  fr (Xr )(t) ≈ ( Bxr ⊗ Btr ) θr = Br θr
   [nT × 1]    [n×Kx] [T×Kt]  [Kx Kt × 1]

266 / 398
Effect Representation

Slight complication for irregular data:


I t = (t 1 , . . . , t n )> with t i 6= t i 0
I subvectors yi of y with variable lengths Ti
I rows of Xr contain Xri , each repeated Ti times
=⇒

  fr (Xr )(t) ≈ ( ( Bxr ⊗ 1ᵀ_Kt ) · ( 1ᵀ_Kx ⊗ Btr ) ) θr = Br θr
   [nT × 1]     [nT×Kx] [1×Kt]    [1×Kx] [nT×Kt]        [Kx Kt × 1]

I also needed if Xri depends on t i (e.g. concurrent functional


covariates)

267 / 398
Tensor Product Representation and Penalization

Define a regularization term via the Kronecker sum of marginal penalty


matrices

pen(θr |λtr , λxr ) = θrT (λxr Pxr ⊗ IKt + λtr IKx ⊗ Ptr )θr
= θrT Pr (λtr , λxr )θr .

=⇒ very flexible:
I Combine any basis & penalty for Xr with any basis & penalty for t!
→ Huge variety available in pffr() via interface to mgcv.
I Penalization parameters λtr , λxr separately control the relative
complexity of effects over the functional domain and the covariate
space, respectively.
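A minimal R sketch of this Kronecker-sum penalty construction (basis dimensions and smoothing parameters are arbitrary illustration values):

Kx <- 5; Kt <- 8
Px <- crossprod(diff(diag(Kx), differences = 2))   # 2nd-order difference penalty over Xr-basis
Pt <- crossprod(diff(diag(Kt), differences = 2))   # 2nd-order difference penalty over t-basis
lambda_x <- 1; lambda_t <- 10
P <- lambda_x * kronecker(Px, diag(Kt)) +
     lambda_t * kronecker(diag(Kx), Pt)            # P_r(lambda_t, lambda_x)
dim(P)   # (Kx * Kt) x (Kx * Kt)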

268 / 398
Tensor Product Basis & Kronecker Sum Penalties
P ⊗ I, I ⊗ P: repeated penalties that apply to each subvector of θ
associated with a specific marginal basis function (Wood 2006a):
⊥
1 ·z1 ⊥1 ·F1 ∇1 ·z1 ∇1 ·F1 n1 ·z1 n1 ·F1

⊥ ·z ⊥1 ·F2 ∇1 ·z2 ∇1 ·F2 n1 ·z2 n1 ·F2
 ⊥11 ·z32 ⊥1 ·F3 ∇1 ·z3 ∇1 ·F3 n1 ·z3 n1 ·F3 
 ⊥2 ·z1 ⊥2 ·F1 ∇2 ·z1 ∇2 ·F1 n2 ·z1 n2 ·F1

 
 ⊥2 ·z2 ⊥2 ·F2 ∇2 ·z2 ∇2 ·F2 n2 ·z2 n2 ·F2
!
⊥1 ∇1 n1 
z1 F1
 
⊥2 ∇2  ⊥2 ·z3 ⊥2 ·F3 ∇2 ·z3 ∇2 ·F3 n2 ·z3 n2 ·F3
Bxr ⊗ Btr = n2
⊗ =
z2 F2

⊥3 ∇3 n3  ⊥3 ·z1 ⊥3 ·F1 ∇3 ·z1 ∇3 ·F1 n3 ·z1 n3 ·F1 
z3 F3
⊥4 ∇4 n4
 ⊥ ·z ⊥3 ·F2 ∇3 ·z2 ∇3 ·F2 n3 ·z2 n3 ·F2

 3 2 
 ⊥3 ·z3 ⊥3 ·F3 ∇3 ·z3 ∇3 ·F3 n3 ·z3 n3 ·F3 
 ⊥4 ·z1 ⊥4 ·F1 ∇4 ·z1 ∇4 ·F1 n4 ·z1 n4 ·F1 
⊥4 ·z2 ⊥4 ·F2 ∇4 ·z2 ∇4 ·F2 n4 ·z2 n4 ·F2
⊥ ·z ⊥4 ·F3 ∇4 ·z3 ∇4 ·F3 n4 ·z3 n4 ·F3
 P4 3 PFz

z
1    PFz PF
Pz PFz  Pz PFz 
IKx ⊗ Pt = 1 ⊗ PFz PF = PFz PF

1  
Pz PFz
PFz PF
P⊥ P∇⊥ Pn⊥
 
P⊥ P∇⊥ Pn⊥
P⊥ P∇⊥ Pn⊥
 
 P∇⊥ P∇ P∇n 
Px ⊗ IKt = P∇⊥ P∇ P∇n ⊗ (1 1) = P∇⊥ P∇ P∇n

Pn⊥ Pn∇ Pn  
Pn⊥ Pn∇ Pn
Pn⊥ Pn∇ Pn
Partial Effects fr (x)(t) as Latent GPs

Equivalent representation as (low-rank) Gaussian process prior:

  fr (Xr )(t) = Br θr ,   θr ∼ N ( 0, (Pr (λtr , λxr ))⁻ )

Latent low-rank Gaussian processes:

  fr (Xr )(t) | λtr , λxr ∼ GP ( 0, Br (Pr (λtr , λxr ))⁻ Brᵀ )

270 / 398
Choice of Bases and Penalties

Any suitable
I marginal basis Btr (e.g. B-splines)
I penalty Ptr (e.g. pth order difference matrix).
over support T .

Effects constant in t: Btr = 1nT and Ptr = 0.

271 / 398
Marginal Bases for Scalar & Concurrent Effects

Xr      effect                   Bxr                                   Pxr
∅       intercept β0 (t)         1_nT                                  0
z       linear: zβz (t)          (z1 , . . . , zn )ᵀ ⊗ 1_T             0
        nonlinear: f (z, t)      (spline basis over z) ⊗ 1_T           associated penalty
x(t)    linear: x(t)β(t)         vec( x ), x [n × T]                   0
        nonlinear: f (x(t), t)   spline basis over vec( x ), x [n × T] associated penalty

=⇒ Effects varying in t similar to varying coefficient terms in scalar


response models.

272 / 398
Marginal Basis for Functional Covariates
  ∫_S xi (s)β(s, t)ds ≈ Σ_{h=1}^H wh xi (sh )β(sh , t)
                      ≈ Σ_{h=1}^H wh xi (sh ) Σ_{ks=1}^{Kx} Σ_{kt=1}^{Kt} B(s)_ks (sh ) B(t)_kt (t) θ_{r,ks,kt}

  effect               Bxr                     Pxr
  ∫ x(s)β(s, t)ds      ((x ⊗ 1_T ) · W) Bs     penalty associated with Bs

Quadrature weights W = (w1 , . . . , wH )ᵀ ⊗ 1_nT ;
x = [xi (sh )]_{i;h} ;  Bs = [B(s)_ks (sh )]_{h;ks} .

I Modify W to get ∫_{li(t)}^{ui(t)} xi (s)β(s, t)ds:
  ⇒ historical model, lag-/lead-effects, auto-regressive terms.
I FPC-based effects of functional covariates available in pffr as well
(later).
(Ivanescu et al. 2015)
273 / 398
Marginal Basis for Functional Covariates

Fairly similar for nonlinear effects of functional covariates:


  ∫_S F (xi (s), s, t)ds ≈ Σ_{h=1}^H wh F (xi (sh ), sh , t)
                         ≈ Σ_{h=1}^H wh Σ_{kx=1}^{Kx} Σ_{kt=1}^{Kt} B(s)_kx (xi (sh ), sh ) B(t)_kt (t) θ_{r,kx,kt}

where the B(s)_kx (x(sh ), sh ) are radial basis functions or elements of another
tensor product basis.
(McLean et al. 2014, for scalar response)

274 / 398
Functional Random Effects

Xr effect Bxr Pxr

g random intercepts bgi (t) ∆ ⊗ 1T Pb

{g , z} random slopes zbgi (t) (diag(z)∆) ⊗ 1T Pb

i functional residuals ei (t) In ⊗ 1T In

I ∆ = [δgi m ]i;m : incidence matrix for levels m = 1, . . . , M = Kxr of g .


I Pb : precision matrix of any Gaussian (Markov) random field
modeling the dependency structure between levels of g
(e.g.: IM if bg (t) are i. i. d.).
I implies Gaussian process prior for bg (t) with low-rank covariance
  that is smooth in t and controlled by Pb⁻¹ between units.

275 / 398
Functional Random Effects
Issues:
I often fairly important model component: errors typically
autocorrelated and not homoscedastic along T
=⇒ need to capture somehow with ei (t) for valid conditional
inference
I typically require fairly large basis Btr to capture high frequency /
local behavior
=⇒ estimation scales very badly for g with many levels
I not (really) locally adaptive
I Pb must be fixed a priori
=⇒ no estimation of inter-level dependency structure, only of
relative variability between levels of g .

Partial solution: using a better basis → FPCs

276 / 398
Functional Random Effects: FPC representation

For (uncorrelated) functional random intercepts, can use Karhunen-Loève


expansion
  bgi (t) ≈ Σ_{k=1}^{Kt} ξgk φk (t),

I κk , φk (t): eigenvalues and -functions of covariance of bg (t)


I ξgk ∼ N (0, κk ): associated FPC loadings
I truncation lag Kt

effect Btr Ptr


functional random intercept bgi (t) 1n ⊗ (φ̂k (tl ))l;k diag(κ̂1 , . . . , κ̂Kt )

277 / 398
Functional Random Effects: FPC representation

How to get φ̂k etc?


I Simple case: functional random intercepts bgi (t), Gaussian model
I estimate model without bg (t) under working independence assumption
I compute grouped mean residual curves
I do FPCA based on covariance of grouped mean residual curves
I refit model with estimated FPC basis
I one iteration usually sufficient
I Cederbaum, Pouplier, et al. 2016: extensions for FPCA of
hierarchical, crossed, sparse/irregular Gaussian data

278 / 398
Functional Random Effects: FPC representation

Advantages over spline-based functional random effects:


I by definition, L2 -optimal low-rank representation for any given Kt
=⇒ can get away with much fewer basis functions (usually....)
I basis is learnt from data
=⇒ automatically locally adaptive
But: Choice of Kt ? Leave penalty fixed?

279 / 398
Functional Covariates: FPC representation
For xi (s) ≈ Σ_{k=1}^{Kx} ψk (s) ξik ,

  ∫_S xi (s)β(s, t)ds = Σ_k ξik ∫_S ψk (s)β(s, t)ds = Σ_k ξik β̃k (t)

⇒ sum of varying coefficient terms for FPC scores

                  Bxr                       Pxr
  linear effect   [ξ̂ik ] [n × Kx] ⊗ 1_T     0

I Extends to nonlinear FPC effects
  fr (xi (s), t) = Σ_{k=1}^{Kx} fk (ξ̂ik , t),   fr (xi (s), t) = Σ_{k; k′>k}^{Kx} fk,k′ (ξ̂ik , ξ̂ik′ , t)
  (sum of) nonlinear effects of synthetic covariates ξ̂k .
(H. Müller and Yao 2008)

280 / 398
Functional Covariates: FPC representation

+ always identifiable (more later)


+ β̃k (t) easier to interpret than β(s, t), at least for interpretable ψ̂k (s)
+ applicable to sparse/irregular functional covariates
− inference becomes conditional on estimated FPCs
− suitable truncation lag Kx ?

281 / 398
Introduction

Model

Covariate Effects

Effect Representation

Estimation & Inference


GFAMMs as GAMs

Applications

282 / 398
Penalized Estimation

We can then write the model for y = (y1 (t11 ), . . . , yn (tnTn )) as a


generalized linear model in the basis functions:

E(y|X ) = g (Bθ)

for a suitably concatenated design matrix B = [B1 | . . . |BR ] and


corresponding coefficient vector θ.

  −2 log (L(θ|y, B, ν)) + Σ_v λv θᵀ P̃v θ → min over θ     (1)

with B = [B1 | . . . |BR ], θ = [θ1ᵀ , . . . , θRᵀ ]ᵀ and P̃v the marginal penalties
suitably padded with zeros.

282 / 398
Penalized Likelihood & Mixed Effect Representation

I Criterion (1) equivalent to determining MLE in generalized linear


mixed model
mixed model

  y ∼ P ( g (Bθ) , ν ) ,   θ ∼ N ( 0, ( Σ_v λv P̃v )⁻ )

(Ruppert et al. 2003)


I Reparameterize to separate θ into fixed (unpenalized) and random
(penalized) coefficients with proper priors (finite variance)
(Wood 2006a; Wood, Scheipl, and Faraway 2012).
I Smoothing parameters via (approximate) REML: λv estimated as
variance components using restricted maximum likelihood or
Laplace-approximate marginal likelihood
(Wood 2011; Wood, Pya, et al. 2016).

283 / 398
This is also just a kind of varying coefficient model...

... effects vary over index of functional response:


 
I write as model for concatenated function values vec Y
n×T

I reformat covariate data accordingly & add index t as covariate


⇒ can be fitted like standard GAMM for scalar responses
I effects vary smoothly over t ⇒ smoothness of E (Y (t))
I sparse/irregular Y (t) possible

284 / 398
Inference is mostly a solved problem:

I use penalized splines for smooth effects, including linear functional


effects and functional random effects
I standard penalized likelihood inference: GCV or (RE)ML via
mixed model representation
I approximate CIs, tests, diagnostics, etc. immediately available
I refund’s pffr() as wrapper for mgcv and gamm4:
I optimized, robust, well-tested algorithms
I versatile library of spline bases ready to use:
cyclic, monotonicity, adaptive, P-splines, thin plate, etc.
I model effects with any given correlation structure via GMRFs

285 / 398
Advantages of Mixed Model Framework

Profit from decades of methodological development:


I modularity
I flexibility
I unified framework for smoothing parameter and random effect
estimation:
Smoothing parameters as variance components via (approximate)
REML (Wood 2011)

I approximate confidence bands


(Ruppert et al. 2003; Wood 2006a; Marra and Wood 2011)
I tests e.g. for constant or zero effects (Wood 2012)

I model selection (Saefken et al. 2014)

286 / 398
Introduction

Model

Covariate Effects

Effect Representation

Estimation & Inference

Applications
Application 1: Piggy Panopticon
Application 2: Nutrition & ICU survival

287 / 398
Application: The Piggy Panopticon
PIGWISE Project: RFID surveillance of pig behaviour (Maselyne et al. 2014)
I measure proximity to trough (yes-no) every 10 secs over 102 days for
100 pigs
I additionally: humidity, temperature over time
⇒ models of feeding behaviour potentially useful for ethology (porcine
sociology, clustering) & quality control (disease, quality of feed stock)
[Figure: trough-proximity indicator for pig 57 over time of day (02:00–22:00) and day (1–100)]

287 / 398
Example: Model

Auto-regressive model with random day effects for pig 57:


  logit (µi (t)) = β0 (t) + bi (t) + ∫_{t−3h}^{t−10min} yi (s)β(t, s)ds

I periodic P-spline for β0 (t)
I random day effects bi (t) ∼ i.i.d. GP(0, K (t, t′)) for day i.
I fit takes about 1min (5 smoothing parameters, nT = 14544, p = 850)

288 / 398
Example: Model Fit
µ̂(t) & y (t) for selected days (training data)
[Figure: fitted µ̂(t) and observed y (t) over the day (t in hours) for a grid of selected training days]

289 / 398
Example: Model Predictions
µ̂(t) & y (t) for selected days (test data)
[Figure: predicted µ̂(t) and true y (t) over the day (t in hours) for a grid of selected test days]

290 / 398
Example: Estimates
[Figure: estimated functional intercept β̂0 (t) and estimated random day effects b̂i (t) over the day]

291 / 398
Example: Estimates
β̂(t, s); max. lag= 3 h
[Figure: estimated historical effect surface β̂(t, s) with upper/lower pointwise CIs, plotted over t and the lag s − t]

292 / 398
Example: Alternative Model

I many alternative model specifications feasible/sensible


I slightly better in terms of predictive accuracy on test set (Brier
Score):
  logit (µi (t)) = β0 (t) + f (i, t) + ∫_{t−3h}^{t−10min} yi (s)β(t, s)ds

with smooth effect surface f (i, t) to capture aging effect.

293 / 398
Example: Alternative Model

[Figure: estimated β̂0 (t), smooth aging surface f̂ (i, t) over day i and time t, and the historical effect surface β̂(t, s) (max. lag 3 h), each with upper/lower pointwise CIs]

294 / 398
Exposure-Lag-Response Association (ELRA):
Nutrition & ICU Survival
I multi-center study of critical care patients from 457 ICUs
(≈ 10k patients)
I investigate acute mortality (first 30d)
I confounders z: age, gender, Apache II Score, year of admission, ICU
random effect, ...
I 12-day nutrition record xi (s)
I prescribed calories (determined at baseline)
I daily caloric intake
I daily caloric adequacy (CA)= caloric intake/prescribed calories
[Figure: caloric adequacy (%) over protocol days 1–11 for a sample of patients]

(Bender, Scheipl, et al. 2018; Bender, Groll, et al. 2018)

295 / 398
Clinical Questions & Parameterization
I Importance of “adequate” nutrition after trauma:
catabolic/anabolic states? metabolic stress?
I How much is “adequate”, “necessary”?
I Importance of timing & amount of nutrition during critical, acute &
recovery phases unclear

Model Assumptions:
delayed, time-limited, cumulative & time-varying effect of time-varying
exposure xi (s)

Idea:
(partial) effect of xi (s) on log-hazard at time t:
  ∫_{W (t)} xi (s)β(t, s)ds

I delay & time limit defined by W (t)

296 / 398


Piece-wise Exponential Model

I define cut-points on the time line: 0 = κ0 < . . . < κJ = tmax
I assume constant hazard rates in each interval [κj−1 , κj )
=⇒ likelihood for event indicators yij = 1 if event in [κj−1 , κj ), and yij = 0 else,
   is proportional to a Poisson likelihood (Friedman, 1982)
I for t ∈ [κj−1 , κj ), piecewise constant hazard rate for event process
λ(t; zi , xi ) is rate of pseudo-Poisson responses yij !
=⇒ ≈ a model for pseudo-Poisson functional responses yi (tj )

297 / 398
Piece-wise Exponential Model

Time-to-event model reparameterized as Poisson-GAMM:


  log (λ(t|zi , xi )) = α0 (t) + Σ_j hj (zi , t) + ∫_{W (t)} xi (s)β(t, s)ds

I estimated with mgcv (could use FDboost as well)


I flexible, semi-parametric modeling of baseline hazard rate α0 (t)
I use GFAMM framework for functional covariates to perform inference,
e.g. about nutrition effects
I flexible confounder adjustment: non-linear & time-varying effects;
time-varying covariates, random effects / frailties, . . .

298 / 398
ELRA: Example
Compare hazard of patient with given nutrition record to constant
undernutrition (e.g., ceteris paribus):
[Figure: the patient's caloric adequacy (%) over protocol days 1–11 and the resulting hazard ratio estimate over follow-up time t]

I’m leaving out a LOT of complications here...

299 / 398
Part VI

Functional Regression Models:


Implementation

300 / 398
GAMM Algorithm

refund

Componentwise Gradient Boosting

FDboost
GAMM Algorithm
GAMMS as GLMs as LMs
Estimation Algorithm

refund

Componentwise Gradient Boosting

FDboost

302 / 398
Introduction

I Generalized functional additive mixed models (GFAMM)


mathematically equivalent to varying coefficient models for scalar data
=⇒ can reuse, adapt, recycle scalar data methodology
I SOTA R implementation of GAMs: mgcv (N 2019)
I GFAMM implemented in package refund (Goldsmith, Scheipl, et al.
2018): wrapper around mgcv

302 / 398
MLE for GAMMs

  yi |xi ∼ EF(µi , ν);   g (µi ) = Bi θ

I B contains evaluated basis functions, θ the combined coefficient vector
I combined penalty Σ_v λv θvᵀ Pv θv with smoothing parameters λv
I MLE: θ̂ = arg max_θ ℓP (θ) = arg max_θ { ℓ(θ) − Σ_v λv θvᵀ Pv θv }
I equivalent to this penalized likelihood: priors θv ∼ N ( 0, (λv Pv )⁻ )
I ... this also covers most random effects ...

303 / 398
IWLS

I GLMs estimated by Fisher-Scoring:

  θ̂^(k+1) = θ̂^(k) − ( E [ ∂²ℓP (θ) / ∂θ∂θᵀ ] )⁻¹ ∂ℓP (θ)/∂θ , evaluated at θ̂^(k)

I reduces to iteratively solving a penalized working linear model (IWLS):

  ỹ^(k) = Bθ + ε,   Cov(ε) = (W^(k))⁻¹ ν,   E(ε) = 0

  for diagonal W, with
  I working observations ỹi^(k) = g′(µ̂i^(k)) (yi − µ̂i^(k)) + η̂i^(k)
  I (W^(k))ii⁻¹ = V(µ̂i^(k)) g′(µ̂i^(k))² ,   ηi = g (µi )
  I V determined by likelihood.

304 / 398
mgcv Algorithm

I estimation of θ given λ fairly trivial


I much harder: optimizing λ
I in mgcv: based on (restricted) marginal approximate likelihood

  ℓ̃(λ) ∝ ( ‖ỹ − Bθ̂λ ‖² + θ̂λᵀ Pλ θ̂λ ) / ν + log |Bᵀ WB + Pλ | − log |Pλ |

I inner loop:
run IWLS to convergence for θ̂λ for each evaluation of `(λ)
I outer loop:
optimize `(λ)

305 / 398
mgcv Algorithm: Outer Loop
  ℓ(λ) ∝ ( ‖ỹ − Bθ̂λ ‖² + θ̂λᵀ Pλ θ̂λ ) / ν + log |Bᵀ WB + Pλ | − log |Pλ |

Ioptimization/evaluation ugly: log-determinants numerically unstable


(log of zero!), expensive
=⇒ optimize `(λ) without evaluating it
=⇒ optimize `(λ) over ρ = log(λ) via Newton steps: only need 1st, 2nd
derivatives
I step length control via newton step-length only, no re-evaluation
I avoids log-determinant problem, e.g.:

  ∂ log |Bᵀ WB + Pλ | / ∂ρj = tr( ( Bᵀ WB + Σ_r λr Pr )⁻¹ Pj ) λj

I computation via pivoted, blockwise Cholesky: parallelizable in


OpenMP.
306 / 398
mgcv Algorithm: Discretisation

I computing Bᵀ WB is O(np²), for B [n × p]
I much more efficiently computable if B only has m ≪ n distinct rows
  (Lang et al. 2014)
I always given for Bj basis for a given covariate: much fewer distinct
  covariate values than observations, if not, binning.
I write entries Bj,il = B̃j,kj(i)l , where kj (i) maps observations i to
  distinct covariate values, so B̃j [m × p]

=⇒ much cheaper crossproducts:

  Bjᵀ w = B̃jᵀ w̃   with   w̃l = Σ_{kj(i)=l} wi

=⇒ O(n + mj pj ) instead of O(npj )

307 / 398
mgcv Algorithm: Discretisation

I extends to crossproducts and quadratic forms of compressed design


matrices
I extends to multiple smooth terms, tensor product smooths (Wood, Z. Li,
et al. 2017)
I computations for large p orders of magnitude faster (Z. Li and Wood 2019)
I in mgcv::bam(): option discrete
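A minimal sketch of the discretized fitting routine (the data set dat with response y, covariates x0, x1 and grouping factor g is hypothetical):

library(mgcv)
# fast fitting for large n: covariate discretization + fREML + multithreading
b <- bam(y ~ s(x0) + s(x1) + s(g, bs = "re"),
         data = dat, method = "fREML", discrete = TRUE, nthreads = 2)
summary(b)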

308 / 398
mgcv Alternative Inference Algorithms

mgcv implements a variety of alternatives to LAML maximization:


I alternative REML-type model fitter better suited for large, sparse
design matrices: gamm4::gamm4() (Wood and Scheipl 2017)
I penalized likelihood inference via extended Fellner-Schall iteration:
(Wood and Fasiolo 2017)
faster, simpler, potentially less accurate.
I approximate fully Bayesian inference via integrated nested Laplace
approximation (mgcv::ginla())
I fully Bayesian inference via JAGS (mgcv::jagam())

309 / 398
GAMM Algorithm

refund
Scalar responses: pfr
Functional responses: pffr

Componentwise Gradient Boosting

FDboost

310 / 398
refund

I fairly large collaborative software project, mainly Johns Hopkins,


Columbia University, LMU
I definitely mostly research quality software.... caveat emptor
I FPCA: fpca.sc, fpca.face, fpca.ssvd, fpca.2s
I functional regression:
I pffr(): penalized function-on-function regression for functional
responses and scalar and/or functional covariates.
I pfr(): penalized functional regression for scalar responses and scalar
and/or functional covariates.

310 / 398
refund::pfr

I fairly lightweight wrapper around mgcv’s model fitting functions


I defines some additional term types for functional covariates.
I formula-based model definition adapted from mgcv:
  e.g. E(y |x) = β0 + ∫ x1 (s)β1 (s)ds + f (x2 )
  becomes y ~ 1 + lf(x1) + s(x2)

  effect                                       syntax
  linear functional effect ∫ x(s)β(s)ds        lf(x)
  smooth functional effect ∫ F (x(s), s)ds     af(x)
  FPC-based effect ∫ x(s)β(s)ds                fpc(x)

Additional arguments control the basis representation of β(s), FPC parameters, etc.
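A minimal sketch along the lines of the refund package examples, using the DTI data shipped with refund:

library(refund)
data(DTI)
DTI1 <- DTI[DTI$visit == 1 & !is.na(DTI$pasat) & complete.cases(DTI$cca), ]
# scalar-on-function regression: PASAT score on the CCA tract profile
fit <- pfr(pasat ~ lf(cca, k = 30, bs = "ps"), data = DTI1)
plot(fit)   # estimated coefficient function beta(s)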

311 / 398
refund::pffr
I wrapper for GFAMMs around mgcv’s model fitting functions
I defines additional term types for functional covariates and formula
  specials for functional responses
I formula-based model definition adapted from mgcv:
  e.g. E(y (t)|x) = β0 (t) + ∫ x1 (s)β1 (s, t)ds + x2 β2 (t) + f (x3 )
  becomes y ~ 1 + ff(x1) + x2 + c(s(x3))
I by default, all effects vary over t
  → tensor product representation of effects
  =⇒ all effects available for scalar responses/covariates in mgcv usable for
  functional responses (... almost)
I constant effects wrapped in c()
I specification of basis over t in arguments bs.yindex, bs.int

I irregular responses possible


I modified identifiability constraints for better interpretability of
functional effects:
  Σ_i f̂j (Xi , t) = 0 ∀ t   instead of mgcv default   Σ_{i,t} f̂j (Xi , t) = 0
I location-scale models via family = "gaulss" for heteroskedastic
data (more later)
312 / 398
pffr terms syntax

effect                                                    syntax
linear functional effect ∫ x(s)β(s, t)ds                  ff(x)
smooth functional effect ∫ F (x(s), s, t)ds               sff(x)
FPC-based effect ∫ x(s)β(s, t)ds as Σ_k ξ̂k β̃k (t)         ffpc(x)
FPC-based random effects bgi (t)                          pcre(g, ...)
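A minimal pffr sketch on simulated function-on-function data (the data-generating process and all names below are illustration only):

library(refund)
set.seed(1)
n <- 60; S <- seq(0, 1, length = 40); Tg <- seq(0, 1, length = 30)
X1   <- matrix(rnorm(n * length(S)), n, length(S))              # functional covariate x_i(s)
beta <- outer(S, Tg, function(s, t) 2 * sin(pi * s) * cos(pi * t))
Y    <- X1 %*% beta * diff(S)[1] +                              # approx. of int x_i(s) beta(s,t) ds
        matrix(rnorm(n * length(Tg), sd = 0.1), n, length(Tg))
dat <- data.frame(id = 1:n); dat$Y <- Y; dat$X1 <- X1           # data frame with matrix columns
m <- pffr(Y ~ ff(X1, xind = S), yind = Tg, data = dat)
summary(m)
plot(m, pages = 1)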

313 / 398
GAMM Algorithm

refund

Componentwise Gradient Boosting


Componentwise Gradient Boosting
mboost

FDboost

314 / 398
Introduction

I Generalized functional additive mixed models (GFAMM)


mathematically equivalent to varying coefficient models for scalar data
=⇒ can reuse, adapt, recycle scalar data methodology
I SOTA R implementation for componentwise gradient boosting:
mboost (Hothorn, Buehlmann, et al. 2018)
I Boosted GFAMM implemented in package FDboost (Brockhaus and
Ruegamer 2018): wrapper around mboost and gamboostLSS
(Hofner, Mayr, Fenske, et al. 2018)

314 / 398
Component-wise gradient boosting
I Boosting is an ensemble method that aims at minimizing the
expectation of a loss criterion.
I The predictor is iteratively updated along the steepest gradient with
respect to the components of an additive predictor (functional
gradient descent).
I Model represented as a sum of simple (penalized) regression models,
the base-learners, fitted to the negative gradients by OLS in each
step.
I In each boosting iteration only the best fitting base-learner is updated
(component-wise) with step-length ν.
I For functional response regression, response and predictors are
functions over T .

(Hothorn, Bühlmann, et al. 2010; Brockhaus and Ruegamer 2018)

315 / 398
Some transformation functions ξ and loss functions ρ
Model: ξ(yi |Xi = xi ) = f (Xi ) = Σ_{r=1}^R fr (xri )

Choose loss function ρ corresponding to transformation function ξ.


For scalars e.g.:

ξ ρ(Y , h(x))
mean regression E L2 -loss
median regression q0.5 L1 -loss
quantile regression qτ check function
generalized regression g ◦E neg. log-likelihood
GAMLSS vector of Q par. neg. log-likelihood

Loss for functional responses: Integrate loss ρ(y , f (x))(t) over T .


Goal: Minimize the expected loss, the risk, w.r.t. f .

316 / 398
Algorithm: component-wise gradient boosting

I [Step 1:] initialize all parameters, set m = 0


I [Step 2:] (within each iteration m)
I compute the negative partial gradients ui for i = 1, . . . , N of the
empirical risk w.r.t. the predictor f (Xr , t) using the current
estimates of all distribution parameters
I fit each base-learner fr to ui , r = 1, . . . , R
I select the best fitting base-learner fr* and update it with a small
  step-length ν:  f̂^[m] = f̂^[m−1] + ν f̂r*^[m] .
I [Step 3:] unless m > mstop , set m = m + 1, go to Step 2.

→ The final fˆ is a linear combination of base-learner fits.

317 / 398
Algorithm: functional GAMLSS boosting
I [Step 1:] initialize all parameters, set m = 0
I [Step 2:] (within each iteration m)
I for q = 1, ..., Q
  I compute the negative partial gradients ui^(q) for i = 1, . . . , N of the
    empirical risk w.r.t. the predictor f^(q) using the current estimates of
    all distribution parameters
  I fit each base-learner fr^(q) to ui^(q) , r = 1, . . . , R
  I select the best fitting base-learner fr*^(q)
I select the parameter q* with the best fitting base-learner fr*^(q*) and
  update its coefficients with a small step-length ν
I [Step 3:] unless m > mstop set m = m + 1, go to Step 2.

→ Each final fˆ(q) is a linear combination of base-learner fits.

(Mayr et al., 2012; Brockhaus et al., 2017; non-cyclical: Thomas et al, 2017)

318 / 398
Tuning parameters of gradient boosting

I The number of boosting iterations determines the model complexity


for fixed step length and smoothing parameters (chosen for unbiased
selection of base learners; Hofner et al., 2012).
I Choose prediction-optimal stopping iteration mstop by resampling
methods (on the level of curves).
I Model selection by early stopping and stability selection (Shah &
Samworth, 2013).

319 / 398
Summary

Idea:
I iteratively boost the model performance
(≙ reduce expected loss)
I by fitting and evaluating the partial effects (base learners) fr (Xr , t)
component-wise
I using one partial effect at a time to update the model

=⇒ Results in component-wise gradient descent steps.

To account for within-function dependency:


I function specific smooth error functions
I function-wise cross-validation

320 / 398
Comparison

Major differences to penalized likelihood estimation:


− no uncertainty quantification (confidence intervals, ...), or at least not
that easy
+ very flexible in terms of fitting criteria
+ much faster for many complex covariate effects
+ inherent variable selection

321 / 398
mboost

I very efficient stepwise divide-and-conquer algorithm: any loss


function optimized via simple, small LS updates.
I uses compressed design matrices for baselearners (cf. mgcv)
I exploits sparsity of base learner design matrices, where possible
I for tensor product baselearners, uses very efficient array arithmetic
(I. D. Currie et al. 2006)

322 / 398
Generalized Linear Array Models

I Tensor product design matrices B = ⊗_{d=1}^D Bd become huge very
  quickly
I .. but they have lots of repeating structure, by construction
I Idea: use structure for more efficient computations of Bθ and Bᵀ diag(w)B
I never actually compute B, only marginal Bd , d = 1, . . . , D
I perform matrix operations by clever re-dimensioning and successive
operations on marginal Bd
I e.g. D = 2 : (B1 ⊗ B2 )θ = B2 ((B1 Θ)> )> with Θ = [θj k]j,k
I large gains for high D, parameter count
I at least 1 order of magnitude fewer ops for Bθ, 2-3 orders for
  Bᵀ diag(w)B
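A quick numerical check of the standard Kronecker identity (B1 ⊗ B2) vec(Θ) = vec(B2 Θ B1ᵀ) underlying these array computations (a sketch, not the mboost internals):

B1 <- matrix(rnorm(4 * 3), 4, 3)                    # marginal design 1 (n1 x K1)
B2 <- matrix(rnorm(5 * 2), 5, 2)                    # marginal design 2 (n2 x K2)
Theta <- matrix(rnorm(2 * 3), 2, 3)                 # coefficient array (K2 x K1)
all.equal(as.numeric(kronecker(B1, B2) %*% as.numeric(Theta)),
          as.numeric(B2 %*% Theta %*% t(B1)))       # TRUE: the full Kronecker product is never needed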

323 / 398
GAMM Algorithm

refund

Componentwise Gradient Boosting

FDboost
Example: Emotion components data

324 / 398
Implementation in FDboost
I Main fitting function:
FDboost(formula, timeformula, data, ...)
I timeformula
= NULL for scalar-on-function regression,
= ∼ bbs(t) for function-on-function regression

I Some of the base-learners for functional data:


  zβ(t)                              bolsc(z) %O% bbs(t)
  f (z, t)                           bbsc(z) %O% bbs(t)
  z1 z2 β(t)                         bols(z1) %Xc% bols(z2) %O% bbs(t)
  ∫_S x(s)β(s, t)ds                  bsignal(x, s = s) %O% bbs(t)
  x(t)β(t)                           bconcurrent(x, s = s, time = t)
  ∫_{l(t)}^{u(t)} x(s)β(s, t)ds      bhist(x, s = s, time = t, limits = ...)

324 / 398
Example: Emotion components data

Data set from Gentsch et al. 2014, also used in Rügamer et al. 2018
I Main goal: Understand how emotions evolve
I Participants played a gambling game with real money outcome
I Emotions “measured” via EMG (muscle activity in the face)
I Influencing factor appraisals measured via EEG (brain activity)
I Different game situations, a lot of trials

325 / 398
Example: Emotion components data
[Figure: EEG and EMG signals (value in microvolt) over time for a number of trials]
(Rügamer et al. 2018)


Goal: Try to explain
I facial expressions (measured with EMG)
I by brain activity (measured with EEG)
→ Function-on-function-regression
326 / 398
Example: Emotion components data

Model equation:

  yEMG (t) = β0 (t) + xEEG (t)β1 (t) + ε(t)
I One-to-one relation between EEG and EMG → Concurrent effect

  yEMG (t) = β0 (t) + ∫ xEEG (s)β1 (s, t)ds + ε(t)
I Cumulated effect of functional covariates → Linear functional effect

  yEMG (t) = β0 (t) + ∫_0^{t−δ} xEEG (s)β1 (s, t)ds + ε(t)
I EMG can only be influenced by EEG activities in the past
  → Historical effect

327 / 398
Results for more complex model

328 / 398
Example: FDboost call

FDboost(EMG ~ 1 +                                 # smooth functional intercept
          brandomc(id, df = 5) +                  # day-specific functional random intercepts
          bhist(EEG, df = 20),                    # historical effect of EEG on EMG
        timeformula = ~ bbs(t, df = 4),           # expand all effects smoothly over t
        control = boost_control(mstop = 5000,
                                trace = TRUE),
        data = data)

329 / 398
Part VII

Modeling Functional Data: Issues,


Outlook & Advanced Topics

330 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions

Functional Response Regression: Alternative Approach

Clustering Functional Data


Functional regression: Edge Cases, Problems, Pitfalls
Identifiability of Functional Covariate Effects
Phase Variation
Registration Approaches
Scaling up GFAMM Inference
Practical considerations

Functional Regression: Extensions

Functional Response Regression: Alternative Approach

Clustering Functional Data

332 / 398
Identifiability
Functional covariates x(s) often well approximated by truncated
Karhunen-Loève-expansions,

  xi (s) = Σ_l ξi,l φl^X (s) ≈ Σ_{l=1}^M ξi,l φl^X (s).

For a linear functional effect

  fj (x(s))(t) = ∫_S x(s)β(s, t)ds,

β is only identified up to additions of functions

  g : g (·, t) ∈ span({φl^X , l > M}) ∀ t.

For finite grid data, the design matrix is rank-deficient if M < Kj or if the
span of the basis for β in s-direction contains functions orthogonal to
{φl^X , 1 ≤ l ≤ M} using numerical integration.
332 / 398
Identifiability

For g : g (·, t) ∈ span({φl^X , l > M}) ∀ t:

  ∫_S x(s) (g (s, t) + β(s, t)) ds = ∫_S Σ_{l=1}^M ξl φl^X (s) (g (s, t) + β(s, t)) ds
    = Σ_{l=1}^M ξl ( ∫_S φl^X (s)g (s, t)ds + ∫_S φl^X (s)β(s, t)ds )
                       [first integral ≡ 0]
    = ∫_S Σ_{l=1}^M ξl φl^X (s)β(s, t)ds
    = ∫_S x(s)β(s, t)ds,

since φl^X ⊥ φl′^X for l ≠ l′ by definition and therefore g (·, t) ⊥ φl′^X , l′ ≤ M.

333 / 398
Identifiability

I Iff the kernels of penalty and design matrix do not overlap, there is a
unique minimum of the penalty on each hyperplane defined by
coefficient vectors θ representing β(s, t) that yield identically valued
effects fj (x(s))(t).

I The penalized optimization criterion then finds the “smoothest”
  solution (i.e., the argmin of pen(θ) among these solutions) with optimal fit
  to the data.

I If kernels of penalty and x(s) overlap, bad things can happen: e.g.
meaningless linear transformations or constant shifts of estimated
β(s, t)
→ c.f. collinearity in classical models.

=⇒ possibly no valid interpretation of slope, level, sign of β(s, t)

334 / 398
Identifiability: Synthetic Example

[Figure: true coefficient surface β(s, t) ("Truth") and estimates under Ridge, 1st-difference and 2nd-difference penalties, together with the corresponding E(Y (t)) and fitted Ŷ (t)]

335 / 398
Identifiability: Practical recommendations

1. for low-rank functional covariates, use FPC-based effect representations –
   identifiable (and possibly: interpretable) by construction.
2. avoid rank-reducing pre-processing of functional covariates –
   specifically: curve-wise centering, where possible.
3. check diagnostic measures (implemented in refund/FDboost)
4. use penalties with a small kernel: 1st-order differences/derivatives, or
   full-rank penalties (Marra and Wood, 2011), or suitable constraints
   (Scheipl and Greven 2016)

336 / 398
Phase & Amplitude Variation

[Figure: curves exhibiting phase and amplitude variation – curves over time and
warping functions mapping individual time to absolute time.]

(Happ, Scheipl, et al. 2019)

337 / 398
Phase Variation: Functional responses

I ignored phase variation can invalidate even simple summaries like
  functional means
I the models presented so far are well suited for explaining amplitude variation
I modeling phase variability typically requires very flexible effects
  (time-varying, non-linear)
I phase variability of responses left unaccounted for induces
  auto-correlated errors where features/landmarks don't align

338 / 398
Phase Variation: Functional responses

[Figure: DTI data – FA-CCA profiles over CCA tract location t (left) and
centered FA-RCST profiles over RCST location s (right).]

339 / 398
Phase Variation: Functional responses

[Figure: residual auto-correlation Cor(Yi(t) − Ŷi(t)) over CCA tract locations t,
mean FA-CCA by group (MS / control), residual curves Yi(t) − Ŷi(t) and smooth
residual estimates Êi(t).]

340 / 398
Phase Variation: Functional covariates

I phase variation important / relevant? =⇒ register the data, then use both
  phase & amplitude information
I for FPCA: data with phase variation typically show slow eigenvalue
  decay and FPC score distributions with lots of structure
  =⇒ (linear) dimension reduction less successful / informative
I non-linear dimension reduction methods (D. Chen, H.-G. Müller, et al.
2012, e.g.)
I joint phase-amplitude PCA (Tucker et al. 2013; Happ, Scheipl, et al.
2019, e.g.)

341 / 398
Phase & Amplitude Variation

[Figure (repeated): curves exhibiting phase and amplitude variation – curves over
time and warping functions mapping individual time to absolute time.]

(Happ, Scheipl, et al. 2019)

342 / 398
Phase & Amplitude Variation: Registration

Typical procedure:
Decompose xi (t) = (wi ◦ γi )(t) = wi (γi (t))
I warping functions γi : T → T
I map clock time of observed curve xi to common system time of
registered curves
(e.g. growth curves of kids: earlier/later puberty etc.)
I no time jumps: γi continuous

I no time reversals: ∂t γi (t) > 0
I γi (min T ) = min T , γi (max T ) = max T
I registered functions wi (t̃): landmarks like maxima/minima typically
aligned, easier to interpret
I decomposition into horizontal/phase variation γi and
vertical/amplitude variation wi
(Marron et al. 2015, e.g.)

343 / 398
Phase & Amplitude Variation: Registration

I Registration can be ill-posed: e.g., is one function accelerated, or is the
  other one slowed down?
I Registration is difficult: estimate n functions γi from n functions
  xi(t), under weird constraints.
=⇒ theoretically and numerically challenging
I yields very “interesting” math in weird function spaces – many
different approaches
I Registration in practice:
typically needs smooth, non-noisy data on dense grid or in basis
representation to work well

344 / 398
Registration Approaches

I Dynamic Time Warping (K. Wang, Gasser, et al. 1997)
    I constrained to piece-wise linear warping functions
    + fairly fast, powerful algorithms
    − needs an equidistant grid
    − need to choose penalization to avoid over-alignment
I Landmark Registration
    I idea: match recognizable features occurring in all xi(t) to fixed
      timepoints
    I warping functions interpolate between those somehow
    + simple
    − only applicable for simple, "one-shape-fits-all" data
    − landmark choice can be arbitrary (minima? maxima? zero-crossings?)
    − exact landmark localization difficult for noisy data

345 / 398
Registration Approaches
I L2-Distance Based:
    I Criterion L(γi; xi, x0) = ∫ (x0(t) − xi(γi⁻¹(t)))² dt → min over γi
    − pinching problem (if γi too flexible)
    − not really a suitable metric: L(γi; xi, x0) ≠ L(γi; xi ∘ γ0⁻¹, x0 ∘ γ0⁻¹)
    I ignore proportional (amplitude) variation:
      L(γi; xi, x0) = ∫ ( x0(t)/‖x0(t)‖ − xi(γi⁻¹(t))/‖xi(γi⁻¹(t))‖ )² dt
      (J. Ramsay and Silverman 2005)
    + implemented – partially – in fda
I Square Root Velocity Function (Srivastava and Klassen 2016):
    I idea: find a distance metric for equivalence classes of "amplitude
      functions"
    I metric can be computed simply as the L2 distance of the SRVFs:
      SRVF(x)(t) := sgn(x′(t)) √|x′(t)|
    + deep, elegant maths guarantees consistency, well-posedness
    + rigorous definitions of phase, amplitude & respective means,
      generalized FPCA etc.
    + implemented in fdasrvf (Tucker 2017) – see the sketch below
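A minimal sketch (R) of the SRVF transform for curves sampled on a common grid; it only illustrates the transform and the resulting L2 distance between two curves – the actual alignment over warping functions is what fdasrvf provides.

    # SRVF(x)(t) = sgn(x'(t)) * sqrt(|x'(t)|), approximated by finite differences
    srvf <- function(x, t) {
      dx <- diff(x) / diff(t)              # forward-difference derivative
      dx <- c(dx, dx[length(dx)])          # pad to full grid length
      sign(dx) * sqrt(abs(dx))
    }
    t  <- seq(0, 1, length.out = 101)
    x1 <- sin(2 * pi * t)
    x2 <- sin(2 * pi * t^1.2)              # a slightly warped version of x1
    q1 <- srvf(x1, t); q2 <- srvf(x2, t)
    d2 <- (q1 - q2)^2
    sqrt(sum(diff(t) * (d2[-1] + d2[-length(d2)]) / 2))   # L2 distance of the SRVFs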

346 / 398
Registration in practice

I almost all registration methods for multiple curves (not pairwise)


require a template curve as target.
choice of template? well-defined?
I in many cases: one template is not actually enough – functional data
has grouped structure, requires multiple templates.
how many? which ones?
I clear distinction between amplitude and phase variation or can
variation be represented either way?
I consider joint variation in phase and amplitude? or is phase variation
just nuisance?

347 / 398
What’s my n?

I fairly successful approach: rephrase functional response models as


models for scalar function evaluations by shifting all functional
structure into the predictor.
I issue: the likelihood is now for nT data points, but we actually only have n
  observational units
  =⇒ downsampling/upsampling could change inference: smoothing, CI
  widths, selection, etc.!
I so far: not encountered as a huge problem in practice, but must not be ignored

348 / 398
Autocorrelation & Variance Heterogeneity
I fairly successful approach: rephrase functional response models as
  models for scalar function evaluations by shifting all functional
  structure into the predictor.
I issue: intra-functional dependency needs to be modeled (see previous slide)
I issue: GFAMMs are conditional models with an independence
  assumption over yi(tj)|X, i = 1, . . . , n, j = 1, . . . , T
  =⇒ most models will require smooth residual terms Ei(t) to capture
  intra-functional dependencies and variance heterogeneity:
  scales rather terribly....
I marginal-type models computationally preferable, but:
    I not clear how to incorporate into mgcv's computational framework,
      generally
    I hard/impossible to generalize to the non-Gaussian case
    I doable: AR(1) residual structure
    I doable: explicit model for time- and covariate-dependent variance for
      Gaussian data (family = "gaulss", more tomorrow – see the sketch below)
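A minimal sketch (R) of the last point, with simulated long-format data (one row per function evaluation; all names made up): the first formula models the mean (smooth intercept, time-varying effect of a scalar covariate x, curve-specific random intercepts), the second formula models the standard deviation (on mgcv's transformed scale) as a smooth function of t.

    library(mgcv)
    set.seed(1)
    n <- 40; Tg <- 30
    dat <- data.frame(id = factor(rep(1:n, each = Tg)),
                      t  = rep(seq(0, 1, length.out = Tg), n),
                      x  = rep(rnorm(n), each = Tg))
    dat$y <- sin(2 * pi * dat$t) + dat$x * dat$t +
      rnorm(nrow(dat), sd = exp(-1 + dat$t))      # error variance increases in t
    fit <- gam(list(y ~ s(t) + s(t, by = x) + s(id, bs = "re"),  # mean structure
                    ~ s(t)),                                     # scale structure
               family = gaulss(), data = dat)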
349 / 398
What my thesis committee did not get to hear

I approach: rephrase functional response models as models for scalar
  function evaluations by shifting all functional structure into the
  predictor.
+ no presmoothing/preprocessing of functional responses =⇒ honest,
  "unconditional" inference
?? how relevant if data are truly smooth without relevant iid error?
+ reuse/recycle/adapt all the (scalar) things
− computation does not scale well for dense grids, large n
− if data are low-rank representable, yi ≈ Φ ξi (with yi: T × 1, Φ: T × K, ξi: K × 1),
  it is usually preferable to exploit this low-rank structure and do inference on
  basis coefficients instead of function evaluations: nK ≪ nT

350 / 398
Predictive modeling with functional covariates

For predictive accuracy, modern ML algorithms using scalar features derived
from functional data are usually as good as or better than dedicated "functional
data" methods.

I.e., something like dumping

I FPC scores
I FPC scores of (smoothed) derivatives
I locations of maxima / minima / zero-crossings, if relevant
I etc ...

as features into a RF or SVM tends to work really well without needing a
lot of "FDA" know-how or specialized software (see the sketch below).

Caveat: For benchmarking, include all preprocessing (FPC estimation etc.)
in cross-validation to avoid information leakage!
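A minimal sketch (R) of this feature-extraction strategy on simulated data; prcomp is used as a crude stand-in for FPCA on a dense, error-free common grid, and randomForest as the downstream learner.

    library(randomForest)
    set.seed(1)
    n <- 200; grid <- seq(0, 1, length.out = 50)
    curves <- outer(rnorm(n), sin(2 * pi * grid)) +
              outer(rnorm(n), cos(2 * pi * grid))     # n curves on a common grid
    y   <- curves[, 10] + rnorm(n, sd = 0.2)          # toy scalar outcome
    pca <- prcomp(curves, rank. = 5)                  # "FPC" scores as features
    fit <- randomForest(x = data.frame(pca$x), y = y)
    # Caveat from above: for honest benchmarking, re-estimate the PCA within
    # each cross-validation fold instead of once on the full data set.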

351 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions


Regression Models for Multi-Level Functional Data
Regression Models for Multivariate Functions
Regression Models for Probability Density Functions
MFPCA for Phase & Amplitude

Functional Response Regression: Alternative Approach

Clustering Functional Data

352 / 398
Multi-Level (Functional) Data

I data with hierarchical structure: observations in nested groups


students in classes in schools in school districts
I data with crossed structure: observations in (partially) overlapping
groups
product sales in categories and countries
I few levels, enough replications, group level effects interesting:
=⇒ just treat like standard nominal covariates
I many levels, (partially) few replications:
=⇒ (functional) random effects for group levels =⇒ Functional
LMM

352 / 398
Functional Random Effects

I in GFAMM context, simple tensor product basis representation


available:
spline-based random intercepts bg (i) (t)
I problem: typically requires large basis Btr → tensor product basis
becomes hyuuge.
I possible solution: use a more compact FPC basis for Btr:
  bg(i)(t) = Σ_{k=1}^{K} ξgk φk(t)
I problem: how to estimate FPCs φk (t)?
I easy for simple random intercepts: do FPCA of grouped mean
residual curves from estimate without random intercepts.
I general case: coming up now

353 / 398
Functional Linear Mixed Model

      yi(t) = µ(t, Xi) + B_g1(i)(t) + C_g2(i)(t) + Ei(t) + εi(t)

I µ(tij, xi): mean function (conditional on covariates)
I B_g1(i)(t), C_g2(i)(t): mutually independent functional random effects
  for grouping factors g1(i), g2(i); can be (partially) nested or crossed.
I Ei(t): smooth residuals; εi(t): unstructured measurement error with
  variance σε²

      Cov(yi(t), yi′(t′)) = δ^g1_ii′ K^B(t, t′) + δ^g2_ii′ K^C(t, t′) +
                            δ_{i=i′} ( K^E(t, t′) + δ_{t=t′} σε² )

with auto-covariance functions K^Z(t, t′) := Cov(Z(t), Z(t′)) and indicators
δ^g_ii′ := 1 if g(i) = g(i′), 0 else.
(Cederbaum, Pouplier, et al. 2016)
354 / 398
FLMM: Inference

1. estimate mean function µ(t, Xi) under working independence via
   FAMM (Scheipl, Staicu, et al. 2015)
2. estimate auto-covariances from centered data ỹi(t) = yi(t) − µ̂(t, Xi):
   use E(ỹi(t) ỹi′(t′)) = Cov(yi(t), yi′(t′))

   =⇒ ỹi(t) ỹi′(t′) ≈ δ^g1_ii′ K^B(t, t′) + δ^g2_ii′ K^C(t, t′) +
                        δ_{i=i′} ( K^E(t, t′) + δ_{t=t′} σε² )

   =⇒ estimate an additive model for the cross-products of the responses
3. do FPCA of the resulting K̂^Z(t, t′) to get FPCs φ̂^Z_k(t)
4. re-estimate the model using the φ̂^Z_k(t)

355 / 398
FLMM: Covariance Surface Estimation

I generalized additive model for the crossproduct surface ỹi(t) ỹi′(t′):
  K^Z(t, t′) estimated as isotropic tensor product splines over T × T
  (see the sketch below)
I working assumptions: crossproducts independent, homoskedastic (!)
I uses all crossproducts with at least one gj(i) = gj(i′)
  =⇒ quadratic in the no. of data points (!)
I some savings possible: exploit symmetry, discretization
I estimate K̂^Z(t, t′) evaluated on a dense grid; eigendecomposition
  yields φ̂^Z_k(t), ν̂^Z_k
(Cederbaum, Scheipl, et al. 2018)
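A rough sketch (R) of the covariance-smoothing idea for the simplest case of a single smooth residual process (no crossed random effects, no symmetry constraints – the actual approach of Cederbaum et al. is considerably more refined): smooth the off-diagonal crossproducts of centered curves with a tensor product spline and eigendecompose the fitted surface.

    library(mgcv)
    set.seed(1)
    n <- 60; Tg <- 25; tg <- seq(0, 1, length.out = Tg)
    Y  <- outer(rnorm(n), sin(pi * tg)) + outer(rnorm(n), cos(pi * tg)) +
          matrix(rnorm(n * Tg, sd = 0.2), n, Tg)
    Yc <- sweep(Y, 2, colMeans(Y))                         # centered curves ~ y~_i(t)
    pairs <- subset(expand.grid(j = 1:Tg, k = 1:Tg), j != k)  # drop diagonal (sigma^2!)
    idx <- expand.grid(i = 1:n, r = seq_len(nrow(pairs)))
    cp  <- data.frame(t1 = tg[pairs$j[idx$r]], t2 = tg[pairs$k[idx$r]],
                      z  = Yc[cbind(idx$i, pairs$j[idx$r])] *
                           Yc[cbind(idx$i, pairs$k[idx$r])])   # crossproducts
    fit  <- bam(z ~ te(t1, t2, k = c(8, 8)), data = cp)        # smooth K(t, t')
    Khat <- matrix(predict(fit, expand.grid(t1 = tg, t2 = tg)), Tg, Tg)
    eK   <- eigen((Khat + t(Khat)) / 2)                        # symmetrize
    phi_hat <- eK$vectors[, 1:2] / sqrt(tg[2] - tg[1])         # estimated FPCs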

356 / 398
FLMM: Final Model-Reestimation
Alternatives:
a) Simple LMM for
   I responses ỹ = (ỹ1(t11), . . . , ỹn(t_nT_n))ᵀ
   I random effect design matrices Φ^Z = [φ̂^Z_k(ti)] of dimension (Σi Ti) × K^Z
   I random effect covariances diag(ν̂^Z_1, . . . , ν̂^Z_{K^Z})
   Closed-form solutions for the ξ̂^Z_ik (see the sketch below).
b) Re-estimate µ(t, Xi) = Σr fr(Xri, t) simultaneously with the functional
   random effects using the FPC bases φ̂^Z_k(t).
I Option a) is (much) faster and worked well for recovering random
effects in synthetic data.
Option b) will yield more honest uncertainty quantification for
covariate effects in µ.
Both performed much better than direct estimation of spline-based
random effects in simulations.
I procedure extends to more grouping factors & random slopes,
irregular/sparse data
(Cederbaum, Pouplier, et al. 2016; Cederbaum, Scheipl, et al. 2018)
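A minimal sketch (R) of the closed-form score predictions in option a) for a single random-intercept curve: given FPCs evaluated on the curve's grid (columns of Phi), eigenvalues nu and error variance sigma2 from the previous steps (all inputs hypothetical here), the scores are the usual BLUP-type ridge projections.

    # xi_hat = D Phi' (Phi D Phi' + sigma2 I)^{-1} ytilde,  with D = diag(nu)
    blup_scores <- function(ytilde, Phi, nu, sigma2) {
      V <- Phi %*% (nu * t(Phi)) + diag(sigma2, nrow(Phi))   # marginal covariance
      as.vector((nu * t(Phi)) %*% solve(V, ytilde))
    }
    # toy example: two FPCs on a grid of 20 points
    tg  <- seq(0, 1, length.out = 20)
    Phi <- cbind(sqrt(2) * sin(pi * tg), sqrt(2) * cos(pi * tg))
    blup_scores(ytilde = 0.8 * Phi[, 1] - 0.3 * Phi[, 2] + rnorm(20, sd = 0.1),
                Phi = Phi, nu = c(1, 0.5), sigma2 = 0.1^2)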
357 / 398
Multivariate Functional Data

I functional data frequently multivariate: yi(t) = (y^(1)_i(t), . . . , y^(D)_i(t))
  e.g. spatial accelerations for accelerometry, trajectory data over a
  plane, ...
I challenge: model the dependence structure within (y^(1)(t), . . . , y^(D)(t))
I one possible approach based on GFAMM, FLMM, MFPCA
described in the following
I assumptions: identical grids and commensurable measurement scales
for all dimensions

358 / 398
FAMM for each dimension

I Univariate FAMM for each dimension d = 1, ..., D:

      y^(d) = B^(d) θ^(d) + Φ^(d) ξ + ε^(d),

  where y^(d) contains the function evaluations, B^(d) θ^(d) represents the fixed
  effects, Φ^(d) contains evaluations of the random effect FPCs, and the
  unstructured errors are ε^(d) ∼ N(0, σ²_d I)
I scores ξ for the FPCs are the same for all d (!)

359 / 398
Multivariate FAMM

I connect dimensions simply by stacking:

      [ y^(1) ]   [ B^(1)   0    ...    0    ] [ θ^(1) ]   [ Φ^(1) ]       [ ε^(1) ]
      [ y^(2) ]   [   0   B^(2)  ...    0    ] [ θ^(2) ]   [ Φ^(2) ]       [ ε^(2) ]
      [   ⋮   ] = [   ⋮           ⋱     ⋮    ] [   ⋮   ] + [   ⋮   ] ρ  +  [   ⋮   ],
      [ y^(D) ]   [   0     0    ...  B^(D)  ] [ θ^(D) ]   [ Φ^(D) ]       [ ε^(D) ]

  i.e. ȳ = B̄ θ̄ + Φ̄ ρ + ε̄,  with  ε̄ ∼ N( 0, diag(σ²_1, ..., σ²_D) ⊗ I )

I variance heterogeneity over dimensions simply uses weighted fits
  (... or use "gaulss")
I the model is conditional on the estimated multivariate FPCs in Φ̄

360 / 398
Multivariate Functional PCA
We want a multivariate FPC representation y(t) = Σ_{k=1}^{K} ξk φk(t),
where φk(t) = (φ^(1)_k(t), . . . , φ^(D)_k(t))ᵀ with eigenvalues νk.

Assume a finite univariate Karhunen-Loève representation for each y^(d)(t) exists, i.e.

      y^(d)(t) = Σ_{k=1}^{K_d} ρ^(d)_k ψ^(d)_k(t).

Then the MFPC eigenvalues νk correspond to the eigenvalues of

      Z = [ Z^(11) ... Z^(1D) ]
          [   ⋮      ⋱    ⋮   ]    with entries  Z^(dd′)_mn = Cov( ρ^(d)_m, ρ^(d′)_n ).
          [ Z^(D1) ... Z^(DD) ]

→ multivariate eigenvalues from (cross-)covariances of the univariate score vectors

The multivariate FPC components are given by

      φ^(d)_m(t) = Σ_{n=1}^{K_d} [cm]^(d)_n ψ^(d)_n(t),

where [cm]^(d) ∈ R^{K_d} denotes the d-th block of the m-th eigenvector cm of Z.
→ multivariate eigenfunctions from "re-distributing" the univariate eigenfunctions
based on the univariate score cross-covariances

For the MFPC scores:

      ξm = Σ_{d=1}^{D} Σ_{n=1}^{K_d} [cm]^(d)_n ρ^(d)_n.

(Happ and Greven 2018)
361 / 398
MFPCA: Interpretation

I yields a simple vector representation ξ of multivariate functions in a
  multivariate functional basis φ(t).
I Each MFPC φk(t) = (φ^(1)_k(t), . . . , φ^(D)_k(t))ᵀ represents a joint mode
  of variation across all dimensions of the function
=⇒ MFPCA captures patterns of (cross-)covariance between and within
   dimensions

362 / 398
MFPCA: Estimation Algorithm

1. Estimate a univariate functional PCA for each element y^(d) with
   existing algorithms
2. Perform standard (multivariate) PCA of the combined score vectors
   (ρ^(1)_1, . . . , ρ^(1)_{K_1}, . . . , ρ^(D)_1, . . . , ρ^(D)_{K_D})ᵀ
3. Plug the results into the formulae for φm, νm, ξm (see the sketch below)
Implemented in R-package MFPCA (Happ 2018)
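A minimal sketch (R) of steps 2-3, assuming step 1 has already produced, for each dimension d, a score matrix rho[[d]] (n × K_d) and an eigenfunction matrix psi[[d]] (grid points × K_d) from some univariate FPCA; all object names are made up, and the MFPCA package implements this (and far more general settings) properly.

    mfpca_combine <- function(rho, psi, M = 2) {
      R <- do.call(cbind, rho)              # combined score vectors, n x sum(K_d)
      e <- eigen(cov(R))                    # Z: (cross-)covariances of the scores
      blocks <- rep(seq_along(rho), times = sapply(rho, ncol))
      # multivariate eigenfunctions: redistribute the univariate eigenfunctions
      phis <- lapply(seq_along(psi), function(d)
        psi[[d]] %*% e$vectors[blocks == d, 1:M, drop = FALSE])
      scores <- scale(R, scale = FALSE) %*% e$vectors[, 1:M]   # MFPC scores xi_m
      list(values = e$values[1:M], functions = phis, scores = scores)
    }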

363 / 398
Multivariate FAMM

I flexibility for covariate effects equivalent to standard FAMM
I 3-step method:
  1. estimate univariate FLMMs or FAMMs for each component
  2. do MFPCA of the estimated univariate FPCs
  3. fit the multivariate model with MFPCs, e.g. via pffr
I non-parametric estimate of the cross-covariance between components of
  the response via the MFPC representation of the smooth residuals
  Ei(t) = (E^(1)_i(t), . . . , E^(D)_i(t))
I not published yet, Master thesis of Alexander Volkmann.

364 / 398
Density Functions as Outcomes

Problem:
Density functions are positive and must integrate to 1.
How to guarantee this in an additive model where y(t) ≈ Σ_r fr(X, t)?
Solution:
Similar problem as with non-Gaussian responses for LMs, similar solution:
=⇒ transform the response into a more friendly space without
restrictions.

365 / 398
Density Functions as Outcomes

Slightly more mathematically precise:


need to define a proper vector space of PDFs where addition,
multiplication, inner products make sense.

366 / 398
Vector space structure for densities

I Bayes spaces introduced by Egozcue et al. (2006), van den Boogaart et al. (2014)
I To define a Bayes Hilbert space, consider a measurable space (T, A)
  with a finite measure µ and the set B²(µ) of PDFs f with
    I f > 0 µ-almost everywhere
    I f bounded from above and away from 0
    I ∫ log(f)² dµ exists.
I Continuous distributions on T = [0, 1]:
  =⇒ µ = λ, the Lebesgue measure
  Discrete distributions on T = {1, ..., K}:
  =⇒ µ = Σ_{k=1}^{K} δk, the counting measure

367 / 398
Vector space structure for densities

Proposition (adapted from Egozcue et al. 2006, van den Boogaart et al. 2014)

B²(µ) is a vector space with addition ⊕ and scalar multiplication ⊙
defined as follows:
For fi: T → R, i = 1, 2 in B²(µ) and a ∈ R define
I Addition: f1 ⊕ f2 = (f1 · f2) / ∫ f1 · f2 dµ
I Scalar multiplication: a ⊙ f1 = f1^a / ∫ f1^a dµ
and with
I Additive neutral element: 0_B = µ(T)⁻¹ (const.)
I Additive inverse: ⊖f1 = f1⁻¹ / ∫ f1⁻¹ dµ (µ-a.e.)
I Multiplicative neutral element: 1 ∈ R
(see the sketch below)
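A minimal numerical sketch (R) of ⊕ and ⊙ for densities evaluated on a common grid, with trapezoidal integration standing in for the integral w.r.t. µ = λ on [0, 1]:

    trapz <- function(f, s) sum(diff(s) * (head(f, -1) + tail(f, -1)) / 2)
    oplus <- function(f1, f2, s) { g <- f1 * f2; g / trapz(g, s) }   # f1 (+) f2
    odot  <- function(a, f1, s)  { g <- f1^a;   g / trapz(g, s) }    # a  (*) f1
    s  <- seq(0, 1, length.out = 201)
    f1 <- dbeta(s, 2, 5); f2 <- dbeta(s, 5, 2)
    trapz(oplus(f1, f2, s), s)   # integrates to 1 again
    trapz(odot(2, f1, s), s)     # so does 2 (*) f1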

368 / 398
The clr-transformation
Introduced by Aitchison 1986, van den Boogaart et al. 2014

Let L²_0(µ) be the space of square-integrable functions with integral zero
(a sub-vector space of L²(µ)).
The centered log-ratio (clr) transformation B²(µ) → L²_0(µ), f ↦ f̃, is
defined as

      f̃ = log(f) − µ(T)⁻¹ ∫ log(f) dµ

with inverse

      f = exp(f̃) / ∫ exp(f̃) dµ.

It is well-defined, linear and injective!
This means: we can take elements of B²(µ), send them to L²_0(µ), analyse
them there, and map them back to B²(µ) – see the sketch below.
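A minimal sketch (R) of clr and its inverse on a grid (trapz as above; the grid avoids the boundary so that log(f) stays finite):

    trapz   <- function(f, s) sum(diff(s) * (head(f, -1) + tail(f, -1)) / 2)
    clr     <- function(f, s)  log(f) - trapz(log(f), s) / diff(range(s))
    clr_inv <- function(ft, s) exp(ft) / trapz(exp(ft), s)
    s  <- seq(0.001, 0.999, length.out = 201)
    f  <- dbeta(s, 2, 5)
    ft <- clr(f, s)
    trapz(ft, s)                                  # ~ 0: clr(f) lies in L2_0(mu)
    all.equal(clr_inv(ft, s), f / trapz(f, s))    # back-transform recovers f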

369 / 398
An inner product for densities

Let f1, f2 ∈ B²(µ). The B²(µ)-inner product ⟨·, ·⟩_B is defined as

      ⟨f1, f2⟩_B = ⟨f̃1, f̃2⟩ = ∫ f̃1 f̃2 dµ

I inner product properties inherited from L²_0(µ) due to the
  linearity/injectivity of the clr-transformation

370 / 398
Additive functional regression model for PDFs
Formulation in B2 (µ)

I Pairs (y(t), X) with response PDF y: T → R, t ↦ y(t) in B²(µ), and a
  vector of scalar/functional covariates X.
I Formulation of the generic Functional Additive Mixed Model (FAMM), as
  discussed, now in the B²(µ) setting:

⊕-additive model for the PDF

      y(t) = f(X)(t) = ⊕_r fr(X, t) ⊕ ε

I predictor f composed of partial effects fr mapping X into B²(µ)
I error ε ∈ B²(µ) with E[ε] = 0_B

371 / 398
Additive functional regression model for PDFs
Formulation in L20 (µ)

Additive model for the clr-transform

      clr(y)(t) = ỹ(t) = f̃(X)(t) = Σ_r f̃r(X)(t) + ε̃

I predictor f̃ composed of partial effects f̃r mapping X into L²_0(µ)
I error ε̃ ∈ L²_0(µ) with E[ε̃] = 0 (const.)
I the constraint f̃r ∈ L²_0(µ) is very easy to enforce by linear constraints on the
  design matrix: standard sum-to-zero constraints
I implemented in FDboost
(Maier et al. 2019)

372 / 398
Application: relative income within households
Distribution of the income share of the woman
I Economic consequences of gender identity (Bertrand 2014):
  distribution of s = wife income / (wife income + husband income) in the U.S.
I German Socio-Economic Panel (SOEP):
  [Figure: estimated densities f(s) of the income share s earned by the wife,
  East Germany 2016]
I Potentially relevant factors: region, year, children, ...
I Positive probability mass at 0 and 1

373 / 398
The SOEP-Data
German Socio-Economic Panel (SOEP) (Schupp et al. nd):
I wide-ranging representative longitudinal study of private households
  in Germany
I mixed discrete + continuous PDFs estimated per
    I year: from 1984 to 2016
    I geographical region: e.g., south = {Bavaria, Baden-Würt.}
    I child status: age 0−6 / age 7−18 / older or no child in household
Example PDFs:
[Figure: weighted densities of the income share earned by the wife per year
(1984−2016), shown for the south region, child groups 1 and 3.]

374 / 398
Model formulation
I Mixed reference measure µ = δ0 + λ + δ1
I Response PDFs y (s) ∈ B 2 (µ) of income share s ∈ [0, 1]

⊕-additive model for PDF

      y(s) = β0(s) ⊕ βnew(s) ⊕ βregion(s) ⊕ βchild(s) ⊕ g(year)(s) ⊕ gnew(year)(s) ⊕ ε

I joint fitting possible, but separate fitting easier:


unique orthogonal decomposition y = yc ⊕ yd into
I ’continuous’ part with y˜c |{0,1} = 0 and
I ’discrete’ part with y˜d |(0,1) constant.

⇒ corresponds to fitting yc in B 2 (λ) and yd in B 2 (δ0 + δ1 )


375 / 398
West vs East Germany
Effects under the clr-transform vs. effects in B²(µ):
[Figure: intercept density for the old federal states (β0) and for the new federal
states (β0 ⊕ βnew), shown on the clr scale (left) and as PDFs in B²(µ) (right),
over the income share s earned by the wife.]

376 / 398
Phase & Amplitude Variation

[Figure (repeated): curves exhibiting phase and amplitude variation – curves over
time and warping functions mapping individual time to absolute time.]

(Happ, Scheipl, et al. 2019)

377 / 398
3 ideas:
I warping functions are (generalized) cumulative distribution functions
I define FPCA for densities via clr-projection
I do MFPCA of (derivatives of) warping functions and registered
functions to get joint representation of phase and amplitude variation

378 / 398
Warping functions & densities

I warping functions on a dataset are (generalized) cumulative


distribution functions:
I (strictly) increasing
I always go from same minimum to same maximum over same domain.

=⇒ the derivatives ∂t γ(t) are functions that are very much like densities
=⇒ can transfer the B²(µ)-concepts defined above to ∂t γ(t)

379 / 398
FPCA for densities

I previous section defined notions of addition, multiplication, inner


product for densities
I that’s all you need to define (Fréchet) means and (co)variances
=⇒ can develop FPCA for densities in B 2 (µ)
=⇒ computable by clr-projection into L20 (µ)

380 / 398
MFPCA for warping functions and registered
functions

1. perform registration with your favorite method: xi(t) → (γ̂i(t), ŵi(t̃))
2. do FPCA of clr(∂t γ̂i(t)) and of ŵi(t̃)
3. combine the univariate FPCs into multivariate FPCs via MFPCA
=⇒ MFPC scores represent both phase and amplitude variation
(Happ, Scheipl, et al. 2019)
(conjectured: the not-well-definable distinction between phase and amplitude
variation becomes less relevant – just describe the data compactly...)

381 / 398
Phase-Amplitude-MFPCA: Example
[Figure: first two multivariate phase-amplitude FPCs, shown as +/− perturbations
of the mean over time – PC 1 (26.8% of Fréchet variance explained) and PC 2
(15.6%), each decomposed into full variation, phase variation and amplitude
variation.]

(Happ, Scheipl, et al. 2019)

382 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions

Functional Response Regression: Alternative Approach


Wavelet-Based Functional Mixed Models

Clustering Functional Data

383 / 398
Alternatives

Alternatives to the GFAMM approaches differ primarily in one crucial
aspect:
instead of modeling scalar function evaluations, they project the
functional responses into a coefficient space of some known basis and do
the modeling there.

383 / 398
Model

      yi(t) = Σ_{a=1}^{A} xia βa(t) + Σ_{h=1}^{H} Σ_m zihm bhm(t) + ei(t)

I for functional responses yi(t) on common, fine grids over very
  general domains T
I covariates x and grouping factors z associated with very general
  random effects bhm(t) ∼ GP(0, Qh(t, t′))
I i. i. d. functional errors ei(t) ∼ GP(0, S(t, t′)), with generalizations to
  autocorrelation, t-distributed errors, etc. (Zhu, P. Brown, et al. 2011)
I extensions for linear effects of functional covariates and nonlinear
  effects of scalar covariates possible.
(J. S. Morris and R. J. Carroll 2006; J. S. Morris, P. J. Brown, et al. 2008; Meyer et al. 2015;
Zhu, Versace, et al. 2018)

384 / 398
Basis Representation
Use a (near) lossless basis representation yi(t) = Φ(t) y*_i, i.e.,

      Y     =    Y*      Φ
    (N×T)      (N×K)   (K×T)

Project the data into coefficient space:

      Y* = Y Φᵀ (Φ Φᵀ)⁻¹

(use FFT, DWT!)

Model

      y*_ik = Σ_{a=1}^{A} xia β*_ak + Σ_{h=1}^{H} Σ_m zihm b*_hmk + e*_ik,

with βa(t) = Σ_{k=1}^{K} β*_ak φk(t) etc. (see the sketch below)
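A minimal sketch (R) of this projection step, with a B-spline basis standing in for the wavelet/Fourier bases used in the actual wavelet-based approach: project, then check that the back-projection is (near) lossless for smooth curves.

    library(splines)
    set.seed(1)
    n <- 30; Tg <- 200; tg <- seq(0, 1, length.out = Tg)
    Y   <- outer(rnorm(n), sin(2 * pi * tg)) + outer(rnorm(n), tg^2)
    Phi <- t(bs(tg, df = 25, intercept = TRUE))        # K x T basis evaluations
    Ystar <- Y %*% t(Phi) %*% solve(Phi %*% t(Phi))    # Y* = Y Phi' (Phi Phi')^{-1}
    max(abs(Y - Ystar %*% Phi))                        # near-lossless for smooth Y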

385 / 398
Inference

Fully Bayesian inference implemented via purpose-built MH-samplers in


coefficient space.
Model for each coefficient-space dimension y*_k fit separately (!)

Coefficient functions etc. then projected back on T for visualization,


diagnostics, etc.

=⇒ full posterior in function space accessible


=⇒ easy to get simultaneous CIs, local exceedance probabilities, etc....
=⇒ critical mutual independence assumption across K basis dimensions.

386 / 398
Distributional assumptions

Basis coefficients b*_hmk, e*_ik of the random effect and error functions are
assumed multivariate Gaussian, with variants for Laplace and t via scale mixtures
of Gaussians.

Very flexible, non-stationary covariance structures possible since the
covariance is effectively estimated separately for each coefficient-space
dimension.

Basis coefficients β*_ak with shrinkage priors like Laplace, spike-and-slab
(Bayesian LASSO variants).

387 / 398
Pros & Cons

+ vast scope: anything that can be represented in a basis can be


modeled: images, multivariate functions on complex domains, etc.
+ Bayesian flexibility: very complex covariance structures possible
+ scales to very large data:
I basis representation reduces T to K , effort independent of grid size.
I clever marginalization of random effects etc makes MCMC
computations linear in K & (almost) independent of n
I parallelized over k = 1, . . . , K
+ fully Bayesian inference

388 / 398
Pros & Cons

− only applicable to additive error models for continuous functional


data: no counts, binary, ordinal etc.
(workaround requires re-computation of yik? in each MCMC
iteration....)
− requires lossless basis representability of data
− requires common dense observation grid: no sparse or irregular
data
− requires all effects at all levels representable in same basis
− assumes independent basis dimensions: y?k ⊥ y?k 0 ∀ k 6= k 0
− no unified software implementation: undocumented MATLAB
scripts and a closed-source C tool (Windows only).

389 / 398
Functional regression: Edge Cases, Problems, Pitfalls

Functional Regression: Extensions

Functional Response Regression: Alternative Approach

Clustering Functional Data

390 / 398
Clustering functions, the easy way

I do FPCA or some other basis representation
I use any standard multivariate clustering method on the basis coefficient
  vectors (see the sketch below)
  =⇒ can use all types of clustering methods: partitioning, hierarchical,
  model-based, ...
I alternative: define any distance measure between functions, use the distance
  matrix as input for hierarchical clustering.
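A minimal sketch (R) of the first route on simulated data, with an eigen-decomposition of the empirical covariance as a crude FPCA on a dense common grid and k-means on the leading scores:

    set.seed(1)
    n <- 100; tg <- seq(0, 1, length.out = 50)
    grp <- rep(1:2, each = n / 2)                       # two true groups
    Y <- outer(ifelse(grp == 1, 1, -1), sin(pi * tg)) +
         outer(rnorm(n, sd = 0.5), cos(pi * tg)) +
         matrix(rnorm(n * length(tg), sd = 0.2), n)
    Yc     <- sweep(Y, 2, colMeans(Y))                  # centered curves
    scores <- Yc %*% eigen(cov(Yc))$vectors[, 1:3]      # leading "FPC" scores
    cl     <- kmeans(scores, centers = 2, nstart = 20)
    table(cl$cluster, grp)                              # recovered grouping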

390 / 398
Clustering functions, more fancy

K-means FPCA (J.-M. Chiou and P.-L. Li 2008)


Initialize: do FPCA on entire data set, run k-means on coefficient vectors.
Iterate:
1. for each of the k clusters, redo FPCA using only data in that cluster
2. project every curve on each of the k sets of FPCs (LOCO estimates!)
3. re-assign each curve to the cluster whose FPCs “predict” it best

391 / 398
Clustering functions, more fancy yet

K-means Alignment (Sangalli et al. 2010)


Initialize: define k registered “template” functions, assign each
observation to one of them.
Iterate:
1. for each of the k clusters, update its “template” functions using only
data in that cluster
2. align all functions in each cluster to the respective templates and
re-assign to cluster based on their amplitude distances
3. re-align all functions in each cluster with the average cluster warping

392 / 398
References I
Bender, A., A. Groll, and F. Scheipl (2018). A generalized additive model approach to time-to-event analysis. In: Statistical
Modelling 18.3-4, pp. 299–321.
Bender, A., F. Scheipl, et al. (2018). Penalized estimation of complex, non-linear exposure-lag-response associations. In:
Biostatistics dkxy003.
Brockhaus, S., A. Fuest, A. Mayr, and S. Greven (2018). Signal regression models for location, scale and shape with an
application to stock returns. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 67.3, pp. 665–686.
Brockhaus, S., M. Melcher, F. Leisch, and S. Greven (2017). Boosting Flexible Functional Regression Models with a High
Number of Functional Historical Effects. In: Statistics and Computing 27.4, pp. 913–926.
Brockhaus, S. and D. Ruegamer (2018). FDboost: Boosting Functional Regression Models. R package version 0.3-2.
Brockhaus, S., F. Scheipl, T. Hothorn, and S. Greven (2015). The functional linear array model. In: Statistical Modelling 15.3,
pp. 279–300.
Brumback, B. and J. Rice (1998). Smoothing spline models for the analysis of nested and crossed samples of curves (with
discussion). In: Journal of the American Statistical Association 93, pp. 961–994.
Cardot, H., F. Ferraty, and P. Sarda (1999). Functional Linear Model. In: Statistics and Probability Letters 45.1, pp. 11–22.
Cardot, H., F. Ferraty, and P. Sarda (2003). Spline estimators for the functional linear model. In: Statistica Sinica 13.3,
pp. 571–592.
Cederbaum, J., M. Pouplier, P. Hoole, and S. Greven (2016). Functional linear mixed models for irregularly or sparsely sampled
data. In: Statistical Modelling 16.1, pp. 67–88.
Cederbaum, J., F. Scheipl, and S. Greven (2018). Fast symmetric additive covariance smoothing. In: Computational Statistics
& Data Analysis 120, pp. 25–41.
Chen, D., H.-G. Müller, et al. (2012). Nonlinear manifold representations for functional data. In: The Annals of Statistics 40.1,
pp. 1–29.
Chen, T. et al. (2019). xgboost: Extreme Gradient Boosting. R package version 0.81.0.1.
https://CRAN.R-project.org/package=xgboost.
Chiou, J. M. and P. L. Li (2007). Functional clustering and identifying substructures of longitudinal data. In: Journal of the
Royal Statistical Society. Series B: Statistical Methodology 69.4, pp. 679–699.
Chiou, J.-M. and P.-L. Li (2008). Correlation-based functional clustering via subspace projection. In: Journal of the American
Statistical Association 103.484, pp. 1684–1692.
Chiou, J., H. Müller, and J. Wang (2004). Functional response models. In: Statistica Sinica 14.3, pp. 675–694.
Cuevas, A. (2014). A partial overview of the theory of statistics with functional data. In: Journal of Statistical Planning and
Inference 147, pp. 1–23.
References II
Currie, I. D., M. Durban, and P. H. Eilers (2006). Generalized linear array models with applications to multidimensional
smoothing. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.2, pp. 259–280.
Currie, I., M. Durban, and P. Eilers (2006). Generalized linear array models with applications to multidimensional smoothing.
In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.2, pp. 259–280.
Descary, M.-H. and V. M. Panaretos (2018). Recovering covariance from functional fragments. In: Biometrika 106.1,
pp. 145–160.
Di, C.-Z., C. Crainiceanu, B. Caffo, and N. Punjabi (2009). Multilevel functional principal component analysis. In: Annals of
Applied Statistics 3.1, pp. 458–488.
Egozcue, J. J., J. L. Díaz-Barrero, and V. Pawlowsky-Glahn (2006). Hilbert space of probability density functions based on
Aitchison geometry. In: Acta Mathematica Sinica 22.4, pp. 1175–1182.
Ferraty, F. and P. Vieu (2006). Nonparametric Functional Data Analysis. Springer Series in Statistics. Springer, New York.
Theory and practice.
Genest, M., J.-C. Masse, and J.-F. Plante. (2017). depth: Nonparametric Depth Functions for Multivariate Analysis.
R package version 2.1-1. https://CRAN.R-project.org/package=depth.
Gentsch, K., D. Grandjean, and K. R. Scherer (2014). Coherence explored between emotion components: Evidence from
event-related potentials and facial electromyography. In: Biological Psychology 98, pp. 70–81.
Goldsmith, J., J. Bobb, et al. (2011). Penalized Functional Regression. In: Journal of Computational and Graphical Statistics
20.4, pp. 830–851.
Goldsmith, J., C. Crainiceanu, B. Caffo, and D. Reich (2012). Longitudinal Penalized Functional Regression for Cognitive
Outcomes on Neuronal Tract Measurements. In: Journal of the Royal Statistical Society: Series C 61.3, pp. 453–469.
Goldsmith, J., M. Wand, and C. Crainiceanu (2011). Functional regression via variational Bayes. In: Electronic Journal of
Statistics 5, p. 572.
Goldsmith, J., F. Scheipl, et al. (2018). refund: Regression with Functional Data. R package version 0.1-17.
https://CRAN.R-project.org/package=refund.
Greenwell, B., B. Boehmke, J. Cunningham, and G. Developers (2019). gbm: Generalized Boosted Regression Models. R
package version 2.1.5. https://CRAN.R-project.org/package=gbm.
Greven, S., C. Crainiceanu, B. Caffo, and D. Reich (2010). Longitudinal Functional Principal Component Analysis. In:
Electronic Journal of Statistics 4, pp. 1022–1054.
Greven, S. and F. Scheipl (2017). A general framework for functional regression modelling. In: Statistical Modelling 17.1-2,
pp. 1–35.
References III
Groll, A., T. Kneib, A. Mayr, and G. Schauberger (2018). On the dependency of soccer scores–a sparse bivariate Poisson model
for the UEFA European football championship 2016. In: Journal of Quantitative Analysis in Sports 14.2, pp. 65–79.
Happ, C. (2018). MFPCA: Multivariate Functional Principal Component Analysis for Data Observed on Different
Dimensional Domains. R package version 1.3-1. https://github.com/ClaraHapp/MFPCA.
Happ, C. and S. Greven (2018). Multivariate functional principal component analysis for data observed on different
(dimensional) domains. In: Journal of the American Statistical Association 113.522, pp. 649–659.
Happ, C., F. Scheipl, A.-A. Gabriel, and S. Greven (2019). A general framework for multivariate functional principal component
analysis of amplitude and phase variation. In: Stat 8.1.
He, G., H. Müller, and J. Wang (2003). “Extending correlation and regression from multivariate to functional data”. In:
Asymptotics in Statistics and Probability. Ed. by M. Puri. VSP International Science Publishers, pp. 301–315.
Hofner, B., L. Boccuto, and M. Göker (2015). Controlling false discoveries in high-dimensional situations: boosting with stability
selection. In: BMC bioinformatics 16.1, p. 144.
Hofner, B., T. Hothorn, T. Kneib, and M. Schmid (2011). A framework for unbiased model selection based on boosting. In:
Journal of Computational and Graphical Statistics 20.4, pp. 956–971.
Hofner, B., A. Mayr, N. Fenske, and M. Schmid (2018). gamboostLSS: Boosting Methods for GAMLSS Models. R package
version 2.0-1. https://CRAN.R-project.org/package=gamboostLSS.
Hofner, B., A. Mayr, N. Robinzonov, and M. Schmid (2014). Model-based boosting in R: a hands-on tutorial using the R
package mboost. In: Computational statistics 29.1-2, pp. 3–35.
Hothorn, T., P. Buehlmann, et al. (2018). mboost: Model-Based Boosting. R package version 2.9-1.
https://CRAN.R-project.org/package=mboost.
Hothorn, T., P. Bühlmann, et al. (2010). Model-based boosting 2.0. In: Journal of Machine Learning Research 11,
pp. 2109–2113.
Hyndman, R. J. and H. L. Shang (2010). Rainbow Plots, Bagplots, and Boxplots for Functional Data. In: Journal of
Computational and Graphical Statistics 19.1, pp. 29–45.
Ivanescu, A., A.-M. Staicu, F. Scheipl, and S. Greven (2015). Penalized function-on-function regression. In: Computational
Statistics 30.2, pp. 539–568.
Jacques, J. and C. Preda (2013). Funclust: A curves clustering method using functional random variables density approximation.
In: Neurocomputing 112, pp. 164–171.
James, G. M. and C. A. Sugar (2003). Clustering for Sparsely Sampled Functional Data. In: Journal of the American Statistical
Association 98.462, pp. 397–408.
References IV
James, G. (2002). Generalized linear models with functional predictors. In: Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 64.3, pp. 411–432.
Lang, S. et al. (2014). Multilevel structured additive regression. In: Statistics and Computing 24.2, pp. 223–238.
Li, Z. and S. N. Wood (2019). Faster model matrix crossproducts for large generalized linear models with discretized covariates.
In: Statistics and Computing, pp. 1–7.
Loève, M. (1978). Probability theory II. Springer.
López-Pintado, S. and J. Romo (2009). On the Concept of Depth for Functional Data. In: Journal of the American Statistical
Association 104.486, pp. 718–734.
Maier, E., A. Stoecker, B. Fitzenberger, and S. Greven (2019). “Flexible Regression for Probability Densities in Bayes Spaces”.
in preparation.
Malfait, N. and J. Ramsay (2003). The historical functional linear model. In: Canadian Journal of Statistics 31.2, pp. 115–128.
Marra, G. and S. N. Wood (2011). Practical variable selection for generalized additive models. In: Computational Statistics &
Data Analysis 55.7, pp. 2372–2387.
Marron, J. S., J. O. Ramsay, L. M. Sangalli, and A. Srivastava (2015). Functional Data Analysis of Amplitude and Phase
Variation. In: Statistical Science 30.4, pp. 468–484.
Maselyne, J. et al. (2014). Validation of a High Frequency Radio Frequency Identification (HF RFID) system for registering
feeding patterns of growing-finishing pigs. In: Computers and Electronics in Agriculture 102, pp. 10–18.
Mayr, A. et al. (2012). Generalized additive models for location, scale and shape for high-dimensional data - a flexible approach
based on boosting. In: Journal of the Royal Statistical Society, Series C - Applied Statistics 61.3, pp. 403–427.
McLean, M. W. et al. (2014). Functional generalized additive models. In: Journal of Computational and Graphical Statistics
23.1, pp. 249–269.
Meinshausen, N. and P. Bühlmann (2010). Stability selection. In: Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 72.4, pp. 417–473.
Meyer, M. J. et al. (2015). Bayesian function-on-function regression for multilevel functional data. In: Biometrics 71.3,
pp. 563–574.
Morris, J. S. (2015). Functional Regression. In: Annual Review of Statistics and Its Application 2.1, pp. 321–359.
Morris, J. S., P. J. Brown, et al. (2008). Bayesian analysis of mass spectrometry proteomic data using wavelet-based functional
mixed models. In: Biometrics 64.2, pp. 479–489.
Morris, J. S. and R. J. Carroll (2006). Wavelet-based functional mixed models. In: Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 68.2, pp. 179–199.
References V
Morris, J. and R. Carroll (2006). Wavelet-based functional mixed models. In: Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 68.2, pp. 179–199.
Müller, H. and F. Yao (2008). Functional additive models. In: Journal of the American Statistical Association 103.484,
pp. 1534–1544.
Wood, S. N. (2019). mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. R package version
1.8-27. https://CRAN.R-project.org/package=mgcv.
Nychka, D. (1988). Confidence intervals for smoothing splines. In: Journal of the American Statistical Association 83,
pp. 1134–1143.
Prchal, L. and P. Sarda (2007). “Spline estimator for functional linear regression with functional response”. unpublished. url:
http://www.math.univ-toulouse.fr/staph/PAPERS/flm_prchal_sarda.pdf.
R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing. Vienna, Austria. http://www.R-project.org/.
Ramsay, J. O., H. Wickham, S. Graves, and G. Hooker (2018). fda: Functional Data Analysis. R package version 2.4.8.
https://CRAN.R-project.org/package=fda.
Ramsay, J. and G. Hooker (2017). Dynamic data analysis. New York: Springer.
Ramsay, J. and B. Silverman (2005). Functional Data Analysis. 2. ed. New York: Springer.
Reiss, P., L. Huang, and M. Mennes (2010). Fast Function-on-Scalar Regression with Penalized Basis Expansions. In: The
International Journal of Biostatistics 6.1, p. 28.
Reiss, P. and T. Ogden (2009). Smoothing parameter selection for a class of semiparametric linear models. In: Journal of the
Royal Statistical Society: Series B (Statistical Methodology) 71.2, pp. 505–523. issn: 1467-9868. doi:
10.1111/j.1467-9868.2008.00695.x. url: http://dx.doi.org/10.1111/j.1467-9868.2008.00695.x.
Reiss, P. T. and M. Xu (2018). Tensor product splines and functional principal components.
https://works.bepress.com/phil_reiss/46.
Reiss, P. and R. Ogden (2007). Functional principal component regression and functional partial least squares. In: Journal of
the American Statistical Association 102.479, pp. 984–996.
Rügamer, D. et al. (2018). Boosting factor-specific functional historical models for the detection of synchronization in
bioelectrical signals. In: Journal of the Royal Statistical Society: Series C (Applied Statistics) 67.3, pp. 621–642.
Ruppert, D., R. Carroll, and M. Wand (2003). Semiparametric Regression. Cambridge, UK: Cambridge University Press.
Saefken, B., T. Kneib, C.-S. van Waveren, and S. Greven (2014). A unifying approach to the estimation of the conditional
Akaike information in generalized linear mixed models. In: Electronic Journal of Statistics 8.1, pp. 201–225.
References VI
Sangalli, L. M., P. Secchi, S. Vantini, and V. Vitelli (2010). K-mean alignment for curve clustering. In: Computational Statistics
& Data Analysis 54.5, pp. 1219–1233.
Scheipl, F., J. Gertheiss, and S. Greven (2016). Generalized functional additive mixed models. In: Electronic Journal of
Statistics 10.1, pp. 1455–1492.
Scheipl, F., A.-M. Staicu, and S. Greven (2015). Functional additive mixed models. In: Journal of Computational and Graphical
Statistics 24.2, pp. 477–501.
Scheipl, F. and S. Greven (2016). Identifiability in penalized function-on-function regression models. In: Electronic Journal of
Statistics 10.1, pp. 495–526. url: http://arxiv.org/abs/1506.03627.
Shi, J. Q. and T. Choi (2011). Gaussian process regression analysis for functional data. Chapman and Hall/CRC.
Sørensen, H., J. Goldsmith, and L. M. Sangalli (2013). An introduction with medical applications to functional data analysis.
In: Statistics in Medicine 32.30, pp. 5222–5240.
Srivastava, A. and E. P. Klassen (2016). Functional and shape data analysis. Springer.
Sun, Y. and M. G. Genton (2011). Functional Boxplots. In: Journal of Computational and Graphical Statistics 20.2,
pp. 316–334.
Tucker, J. D. (2017). fdasrvf: Elastic Functional Data Analysis. R package version 1.8.3.
https://CRAN.R-project.org/package=fdasrvf.
Tucker, J. D., W. Wu, and A. Srivastava (2013). Generative models for functional data using phase and amplitude separation.
In: Computational Statistics & Data Analysis 61, pp. 50–66.
Van den Boogaart, K. G., J. J. Egozcue, and V. Pawlowsky-Glahn (2014). Bayes hilbert spaces. In: Australian & New Zealand
Journal of Statistics 56.2, pp. 171–194.
Wang, J.-L., J.-M. Chiou, and H.-G. Müller (2016). Review of Functional Data Analysis. In: Annual Review of Statistics and Its
Application 3.1, pp. 1–41.
Wang, K., T. Gasser, et al. (1997). Alignment of curves by dynamic time warping. In: The annals of Statistics 25.3,
pp. 1251–1276.
Wood, S. N. (2006a). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.
Wood, S. N. (2006b). Low Rank Scale Invariant Tensor Product Smooths for Generalized Additive Mixed Models. In:
Biometrics 62.1.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized
linear models. In: Journal of the Royal Statistical Society (B) 73.1, pp. 3–36.
Wood, S. N. (2012). On p-values for smooth components of an extended generalized additive model. In: Biometrika 100.1,
pp. 221–228.
References VII

Wood, S. N., Y. Goude, and S. Shaw (2015). Generalized additive models for large data sets. In: Journal of the Royal
Statistical Society: Series C (Applied Statistics) 64.1, pp. 139–155.
Wood, S. N., F. Scheipl, and J. Faraway (2012). Straightforward intermediate rank tensor product smoothing in mixed models.
In: Statistics and Computing, pp. 1–20. url: http://dx.doi.org/10.1007/s11222-012-9314-z.
Wood, S. N. and M. Fasiolo (2017). A generalized Fellner-Schall method for smoothing parameter optimization with application
to Tweedie location, scale and shape models. In: Biometrics 73.4, pp. 1071–1081.
Wood, S. N., Z. Li, G. Shaddick, and N. H. Augustin (2017). Generalized additive models for gigadata: modeling the UK black
smoke network daily data. In: Journal of the American Statistical Association 112.519, pp. 1199–1210.
Wood, S. N., N. Pya, and B. Säfken (2016). Smoothing parameter and model selection for general smooth models. In: Journal
of the American Statistical Association 111.516, pp. 1548–1563.
Wood, S. N. and F. Scheipl (2017). gamm4: Generalized Additive Mixed Models using ’mgcv’ and ’lme4’. R package
version 0.2-5. https://CRAN.R-project.org/package=gamm4.
Xiao, L., V. Zipunnikov, D. Ruppert, and C. Crainiceanu (2016). Fast covariance estimation for high-dimensional functional
data. In: Statistics and computing 26.1-2, pp. 409–421.
Yao, F., B. Liu (Authors), H.-G. Mueller, and J.-L. Wang (Coordinators) (2012). PACE: Principal Analysis by Conditional
Expectation, Functional Data Analysis and Empirical Dynamics. MATLAB package version 2.15.
http://anson.ucdavis.edu/~ntyang/PACE/.
Yao, F., H.-G. Müller, and J.-L. Wang (2005). Functional Data Analysis for Sparse Longitudinal Data. In: Journal of the
American Statistical Association 100.470, pp. 577–590.
Zhu, H., P. Brown, and J. Morris (2011). Robust, Adaptive Functional Regression in Functional Mixed Model Framework. In:
Journal of the American Statistical Association 106.495, pp. 1167–1179.
Zhu, H., F. Versace, et al. (2018). Robust and Gaussian spatial functional regression models for analysis of event-related
potentials. In: NeuroImage 181, pp. 501–512.
