
Journal Club:
Smoothing parameter and model selection for general smooth models

1
Summary of the paper

• extends mgcv to non-exponential families and multiple additive predictors
  → GAMLSS, multinomial/multivariate responses, Cox PH
• stable, computationally efficient inference based on a very general REML-like construction (“Laplace approximate marginal likelihood”)
• approximate accounting for smoothing parameter uncertainty:
  • improved CIs
  • improved AIC
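
A minimal R sketch of what these extensions look like from the user side, assuming a current mgcv version (the simulated data come from mgcv's gamSim(); the Cox example is commented out because the data frame surv_dat and its columns time, status, age are hypothetical):

```r
library(mgcv)

## GAMLSS-type Gaussian location-scale model: one additive predictor for the
## mean and one for the log standard deviation, supplied as a list of formulas.
set.seed(1)
dat <- gamSim(1, n = 400)                      # toy data shipped with mgcv
b <- gam(list(y ~ s(x0) + s(x1),               # predictor for the mean
              ~ s(x2)),                        # predictor for log(sd)
         family = gaulss(), data = dat, method = "REML")
summary(b)

## Cox proportional hazards model with a smooth covariate effect
## (hypothetical data; the censoring indicator goes into `weights`):
## b2 <- gam(time ~ s(age), family = cox.ph(), weights = status,
##           data = surv_dat, method = "REML")
```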

2
Framework

• model log-likelihood l(β | y, x)
  (possibly additional scale parameter φ, latent θ)
• function of (multiple) additive predictors
• each additive predictor is a sum of basis-expanded effects:
  g_j(x_j) = Σ_{i=1}^{k_j} β_{ji} b_{ji}(x_j)
• each effect has a quadratic penalty λ_j β^T S_j β

⇒ Maximize the penalized log-likelihood
  L(β) = l(β) − (1/2) Σ_j λ_j β^T S_j β
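
To see the “basis expansion plus quadratic penalty” structure concretely, one can ask mgcv for the design and penalty matrix of a single smooth term; a small sketch with a made-up covariate x:

```r
library(mgcv)

x  <- runif(200)
sm <- smoothCon(s(x, k = 10, bs = "cr"), data = data.frame(x = x))[[1]]

dim(sm$X)        # n x k_j design matrix: columns are the basis functions b_ji(x)
dim(sm$S[[1]])   # k_j x k_j penalty matrix S_j in the quadratic penalty
```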

3
Proposed algorithm in a nutshell

Ideas:

1. Empirical Bayesian viewpoint:
   penalties ≡ MVN priors f_λ(β_j): β_j ∼ N(0, (λ_j S_j)⁻)
   (with (·)⁻ a generalized inverse, since the S_j may be rank-deficient)
2. ⇒ find modal smoothing parameters by maximizing the log marginal likelihood
   V_r(λ) = log ∫ exp(l(β | y, x)) f_λ(β) dβ
   via Newton iteration
   → Gaussian case: equivalent to REML estimation
3. coefficients β given smoothing parameters λ via Newton iteration (≈ IWLS-like step)
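
In mgcv this corresponds to selecting the smoothing parameters by (restricted) marginal likelihood instead of GCV; a short sketch on simulated data:

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400)

## smoothing parameters chosen by maximizing the Laplace approximate
## restricted marginal likelihood
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

b$sp          # estimated smoothing parameters lambda_j
b$gcv.ubre    # minimized selection criterion (here the negative REML score)
```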

4
Proposed algorithm in a nutshell

Maximize Laplace-approximation of marginal likelihood at β̂:

V(λ) ∝ L(β̂) + (1/2) log|S(λ)|_+ − (1/2) log|H|

with H = −∂²L(β)/(∂β ∂β^T) |_{β=β̂} the negative Hessian of the penalized log-likelihood,
and |·|_+ the product of the non-zero eigenvalues (the S_j may be rank-deficient).
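
The step from the integral V_r(λ) on the previous slide to this expression is a standard Laplace approximation of the integrand around its mode β̂; a sketch, with additive constants (powers of 2π, penalty rank) omitted:

```latex
\log \int \exp\{l(\beta \mid y, x)\}\, f_\lambda(\beta)\, d\beta
  \;\approx\;
  L(\hat\beta) \;+\; \tfrac{1}{2}\log|S(\lambda)|_+ \;-\; \tfrac{1}{2}\log|H| \;+\; \text{const}
```

The ½ log|S(λ)|_+ term is the normalizing constant of the MVN prior f_λ, and the −½ log|H| term comes from expanding L(β) to second order around β̂.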

5
Proposed algorithm in a nutshell: “Outer algorithm”

Newton optimization of V over ρ = log λ:

• requires ∂V/∂ρ and ∂²V/(∂ρ ∂ρ^T), as well as β̂, for every candidate ρ
• lots of numerical tricks to keep computation stable despite
  • numerically unstable log-determinants of rank-deficient penalty matrices
  • flat likelihoods for very large ρ (strong regularization)

6
Proposed algorithm in a nutshell: “Inner algorithm”

Get β̂ via Newton iteration

• only requires gradient & Hessian of penalized likelihood for β|ρ

Numerical tricks:

• step halving and perturbation of (non-positive-definite) Hessians to ensure an increase in each step (see the sketch below)
• accounting for non-identifiable parameters
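
A deliberately schematic version of such a Newton update with step halving and Hessian perturbation, for a generic penalized log-likelihood — an illustration of the idea only, not mgcv's actual inner loop:

```r
## Maximize a penalized log-likelihood `pll(beta)` with gradient `grad(beta)`
## and Hessian `hess(beta)` (all user-supplied functions). Purely illustrative;
## mgcv's implementation contains many more safeguards.
newton_step_halving <- function(beta, pll, grad, hess, max_it = 100, tol = 1e-8) {
  for (it in seq_len(max_it)) {
    H  <- -hess(beta)                           # negative Hessian, should be pos. def.
    ev <- eigen(H, symmetric = TRUE, only.values = TRUE)$values
    if (min(ev) <= 0)                           # perturb if not positive definite
      H <- H + diag(abs(min(ev)) + 1e-6, nrow(H))
    delta <- solve(H, grad(beta))               # Newton direction
    step  <- 1
    while (pll(beta + step * delta) < pll(beta) && step > 1e-8)
      step <- step / 2                          # halve until the objective increases
    beta_new <- beta + step * delta
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}
```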

7
Proposed algorithm in a nutshell: “Inner algorithm”

Compute ∂V/∂ρ_i and ∂²V/(∂ρ_i ∂ρ_j) for the “outer algorithm”

• via implicit differentiation for the first and second derivatives of β̂ w.r.t. ρ (sketched below for the first derivatives)
• tricky, model-specific implementation needed to keep the computational effort manageable for the second derivatives of log|H|
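
For the first derivatives the implicit differentiation step can be written out explicitly; a sketch in the notation of the framework slide, with S_λ = Σ_j λ_j S_j and λ_j = e^{ρ_j}:

```latex
% \hat\beta(\rho) solves the first-order condition  \nabla_\beta l(\hat\beta) - S_\lambda \hat\beta = 0.
% Differentiating this condition w.r.t. \rho_j and solving gives
\frac{\partial \hat\beta}{\partial \rho_j}
  \;=\; -\bigl(S_\lambda - \nabla^2_\beta\, l(\hat\beta)\bigr)^{-1} \lambda_j S_j\, \hat\beta .
```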

8
What about φ, θ?

Same general idea, slightly different LAML criterion, yet more derivatives …

9
Smoothing parameter uncertainty

• conventional GAM CIs condition on the smoothing parameters
• more honest CIs if uncertainty about ρ is taken into account as well

Approximate idea here:

• compute the “sensitivity” of the approximate posterior of β̂ to changes in ρ via a Taylor expansion of β̂ as a function of ρ
• use the covariance of the Taylor-expanded expression as the corrected covariance
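
Written out to first order this amounts to the following (a sketch; the paper's correction contains further terms). With J the matrix of derivatives ∂β̂/∂ρ obtained by implicit differentiation as above, V_ρ the approximate posterior covariance of ρ and V_β the conditional covariance from the next slide:

```latex
\hat\beta(\rho) \;\approx\; \hat\beta(\hat\rho) + J\,(\rho - \hat\rho)
\qquad\Rightarrow\qquad
\operatorname{Cov}(\beta \mid y) \;\approx\; V_\beta + J\, V_\rho\, J^{\top}
```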

10
Smoothing parameter uncertainty

• use the pseudo-Bayesian conditional posterior
  β | y, ρ ∼ N(β̂(ρ̂), (−H + S_λ̂)⁻¹)
  (here H is the Hessian of the unpenalized log-likelihood, so −H + S_λ̂ is the negative penalized Hessian)
• use a Gaussian approximation of the posterior of ρ
  ρ | y ∼ N(ρ̂, (−∂²V/(∂ρ ∂ρ^T))⁻¹)
• yields the “corrected” approximate marginal posterior
  β | y ∼ N(β̂(ρ̂), (−H + S_λ̂)⁻¹ + correction terms)
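
In mgcv the corrected covariance is available via the unconditional argument of vcov.gam and predict.gam; a small sketch on simulated data (the model must be fitted by REML or ML for the correction to be defined):

```r
library(mgcv)

set.seed(3)
dat <- gamSim(1, n = 400)
b <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat, method = "REML")

Vc <- vcov(b)                           # conditional on the estimated smoothing parameters
Vu <- vcov(b, unconditional = TRUE)     # with the smoothing parameter uncertainty correction
range(diag(Vu) - diag(Vc))              # corrected variances are typically (weakly) larger

## corrected CIs for fitted values:
pr <- predict(b, se.fit = TRUE, unconditional = TRUE)
```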

11
Smoothing parameter uncertainty: Discussion

• seems to work mostly well even though the Bayesian motivation is rather sketchy:
  • implicit improper prior f(ρ) ∝ 1
  • Gaussian approximation quite heroic in many cases
• in most applications we have seen: little difference for moderately regularized terms, CIs only very slightly wider

12
Smoothing parameter uncertainty: Discussion

• the implementation actually ignores smoothing parameter uncertainty where it is most critical:
  at the “phase change” boundary between parametric/linear and smooth fits
• terms are not defined for λ → ∞; the correction terms break down and are set to zero
• the Gaussian approximation for the posterior is very bad there – marginals for β are often bimodal, representing the linear and the nonlinear model

13
IC for smooth model selection

• uses the same correction terms as the β-CIs to give improved estimates of the total model d.f.
• very general approximate solution; previous solutions were often fairly model-specific
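
For models estimated by REML/ML, mgcv's AIC method uses the corrected effective degrees of freedom; a short sketch comparing two nested smooth models on simulated data (in gamSim's test function the effect of x3 is zero, so the extra smooth is pure noise):

```r
library(mgcv)

set.seed(4)
dat <- gamSim(1, n = 400)

b1 <- gam(y ~ s(x0) + s(x1) + s(x2),         data = dat, method = "REML")
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

AIC(b1, b2)      # uses the corrected effective degrees of freedom
sum(b1$edf)      # conventional total e.d.f., for comparison
```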

14
Interesting other recent developments:

Next-generation estimation algorithm is already written up & implemented:
EM-like “Fellner-Schall” algorithm (Wood & Fasiolo (2016); arXiv:1606.04802) does not require derivatives of the Hessian

• no accounting for smoothing parameter uncertainty
• much simpler to implement, similar computational cost
• practical convergence properties not yet clear
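
In recent mgcv versions the Fellner-Schall update is available through the optimizer argument of gam(); a sketch (optimizer = "efs" is assumed to be supported by the installed version):

```r
library(mgcv)

set.seed(5)
dat <- gamSim(1, n = 400)

b_newton <- gam(y ~ s(x0) + s(x1), data = dat, method = "REML")      # default full Newton
b_efs    <- gam(y ~ s(x0) + s(x1), data = dat, method = "REML",
                optimizer = "efs")                                    # extended Fellner-Schall

rbind(newton = b_newton$sp, efs = b_efs$sp)   # smoothing parameters typically agree closely
```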

15
Interesting other recent developments:

“Big Data” GAMMs (Wood et al. (2016), “Generalized additive models for gigadata [. . . ]”, JASA)

• uses binned covariate values
• the fitting algorithm uses only the unique bin values & bin weights instead of the observed values
• method pioneered in BayesX years ago, now generalized to multivariate smooths as well
• huge savings in memory and computation time:
  n ≈ 10⁸, p ≈ 10⁴ feasible on a laptop
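
In mgcv this is what bam() with discrete = TRUE does; a minimal sketch on a modest simulated data set (the real gains only show up at the sample sizes quoted above):

```r
library(mgcv)

set.seed(6)
dat <- gamSim(1, n = 100000)    # still tiny compared to n ~ 1e8

## covariates are discretized into bins; fitting works on unique bin values + weights
bf <- bam(y ~ s(x0) + s(x1) + s(x2) + te(x1, x2), data = dat,
          discrete = TRUE, method = "fREML")
summary(bf)
```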

16
Interesting other recent developments:

JAGS interface mgcv::jagam() (Wood (2016); arXiv:1602.02539)

• translates (almost) any mgcv model into equivalent JAGS model code
• uses Martyn Plummer’s rjags interface to JAGS to perform fully automated Bayesian MCMC inference
• the automatically generated JAGS code can be modified by hand, e.g. to place different priors on λ
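
A sketch of the jagam() workflow (the file name gam.jags is arbitrary; JAGS and the rjags package must be installed, and the monitored nodes "b" and "rho" follow the jagam documentation):

```r
library(mgcv)
library(rjags)

set.seed(7)
dat <- gamSim(1, n = 400)

## 1. write a JAGS model file and set up data + initial values
jd <- jagam(y ~ s(x0) + s(x1), data = dat, family = gaussian, file = "gam.jags")

## 2. "gam.jags" can now be edited by hand, e.g. to change the priors on lambda

## 3. run MCMC via rjags and convert the samples back into a gam-like object
jm  <- jags.model("gam.jags", data = jd$jags.data, inits = jd$jags.ini, n.chains = 1)
sam <- jags.samples(jm, c("b", "rho"), n.iter = 10000, thin = 10)
bj  <- sim2jam(sam, jd$pregam)
plot(bj)
```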

17
General points for discussion

• Practical importance for applied work?
• Novelty compared to gamlss, VGAM?
• Pseudo-/pragmatic Bayesian justification for the corrected CIs?
• ...?

18
