Journal Club:

Smoothing parameter and model selection for general smooth models

Summary of the paper

• extends mgcv to non-exponential families, multiple additive predictors
→ GAMLSS, multinomial/multivariate responses, Cox PH
• stable, computationally efficient inference based on very
general REML-like construction (“Laplace approximate
marginal likelihood”)
• approximate accounting for smoothing parameter uncertainty:
• improved CIs
• improved AIC


• model log-likelihood $l(\beta \mid y, x)$ (possible additional scale $\phi$, latent $\theta$)
• function of (multiple) additive predictors
• additive predictor is sum of basis expanded effects, each with $K_j$ basis functions:
$g_j(x_j) = \sum_{i=1}^{K_j} \beta_{ij} b_{ji}(x_j)$
• each effect with quadratic penalty $\lambda_j \beta^T S_j \beta$

⇒ Maximize penalized $L(\beta) = l(\beta) - \frac{1}{2} \sum_j \lambda_j \beta^T S_j \beta$
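As a minimal sketch of the penalized objective above (an illustration, not mgcv's implementation), the following Python/NumPy code builds a second-order difference penalty, one common choice for a penalty matrix $S_j$, and evaluates $L(\beta)$ for a user-supplied log-likelihood. The function names `difference_penalty` and `penalized_loglik` are hypothetical and chosen only for this example.

```python
import numpy as np

def difference_penalty(k, order=2):
    """Return S = D^T D, where D takes order-th differences of k coefficients."""
    D = np.diff(np.eye(k), n=order, axis=0)
    return D.T @ D

def penalized_loglik(beta, loglik, penalties, lambdas):
    """L(beta) = l(beta) - 1/2 * sum_j lambda_j * beta^T S_j beta."""
    penalty = sum(lam * (beta @ S @ beta) for lam, S in zip(lambdas, penalties))
    return loglik(beta) - 0.5 * penalty
```

Coefficient vectors that are linear in their index lie in the null space of the second-order difference penalty, so they are not penalized at all. This is the rank deficiency of the penalty that the later slides refer to.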

Proposed algorithm in a nutshell


1. Empirical Bayesian viewpoint:
penalties ≡ MVN-priors $f_\lambda(\beta_j)$: $\beta_j \sim N(0, (\lambda_j S_j)^{-})$
2. ⇒ find modal smoothing parameters by maximizing
log-marginal likelihood
$V_r(\lambda) = \log \int \exp(l(\beta \mid y, x)) f_\lambda(\beta)\, d\beta$ via Newton iteration.
→ Gaussian case: equivalent to REML-estimation
3. coefficients β given smoothing parameters λ via Newton
iteration (≈ IWLS-like step)

Proposed algorithm in a nutshell

Maximize Laplace-approximation of marginal likelihood at β̂:

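$V(\lambda) \propto L(\hat\beta) + \frac{1}{2} \log |S(\lambda)|_+ - \frac{1}{2} \log |H|$

with negative Hessian $H = -\left.\frac{\partial^2 L(\beta)}{\partial\beta\,\partial\beta^T}\right|_{\hat\beta}$, total penalty matrix $S(\lambda) = \sum_j \lambda_j S_j$, and $|\cdot|_+$ the product of the non-zero eigenvalues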
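To make the criterion concrete, here is a small sketch for the simplest case, assuming a Gaussian response with known unit variance and a single penalty. In that case the Laplace approximation is exact, so the result can be checked against the closed-form marginal likelihood. The function name `laml_gaussian` is hypothetical, and the additive constant for the penalty null space is included so that the value is a proper log marginal likelihood.

```python
import numpy as np

def laml_gaussian(y, X, S, lam):
    """Laplace approximate marginal likelihood for y ~ N(X beta, I), penalty lam * S."""
    n, p = X.shape
    H = X.T @ X + lam * S                      # negative Hessian of penalized log-lik
    beta_hat = np.linalg.solve(H, X.T @ y)     # penalized estimate for this lambda
    resid = y - X @ beta_hat
    loglik = -0.5 * resid @ resid - 0.5 * n * np.log(2 * np.pi)
    pen_loglik = loglik - 0.5 * lam * (beta_hat @ S @ beta_hat)

    # log pseudo-determinant: sum of logs of the non-zero eigenvalues of lam * S
    eig = np.linalg.eigvalsh(lam * S)
    pos = eig[eig > 1e-10 * eig.max()]
    log_det_S = np.sum(np.log(pos))
    _, log_det_H = np.linalg.slogdet(H)

    null_dim = p - pos.size                    # dimension of the penalty null space
    return (pen_loglik + 0.5 * log_det_S - 0.5 * log_det_H
            + 0.5 * null_dim * np.log(2 * np.pi))
```

For non-Gaussian likelihoods the same expression is only an approximation, and $\hat\beta$ has to be found iteratively rather than by a single linear solve.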

Proposed algorithm in a nutshell: “Outer algorithm”

Newton optimization of V over ρ = log λ:

• requires $\partial V / \partial\rho$ and $\partial^2 V / \partial\rho\,\partial\rho^T$ as well as β̂ for every candidate ρ

• lots of numerical tricks to keep computation stable despite

• numerically unstable log-determinants of rank-deficient penalty
• flat likelihoods for very large ρ (strong regularization)
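The outer iteration can be sketched generically as a safeguarded Newton ascent on a criterion `V(rho)`. Note the swapped-in technique: this sketch uses central finite differences for the gradient and Hessian, whereas the paper computes exact derivatives. The names `numeric_grad_hess` and `newton_maximize` are hypothetical, and the eigenvalue flip is just one simple way to handle a Hessian that is not negative definite.

```python
import numpy as np

def numeric_grad_hess(V, rho, h=1e-4):
    """Central finite-difference gradient and Hessian of V at rho."""
    m = rho.size
    E = np.eye(m) * h
    grad = np.zeros(m)
    hess = np.zeros((m, m))
    for i in range(m):
        grad[i] = (V(rho + E[i]) - V(rho - E[i])) / (2 * h)
        for j in range(i, m):
            hess[i, j] = hess[j, i] = (
                V(rho + E[i] + E[j]) - V(rho + E[i] - E[j])
                - V(rho - E[i] + E[j]) + V(rho - E[i] - E[j])
            ) / (4 * h * h)
    return grad, hess

def newton_maximize(V, rho0, max_iter=50, tol=1e-6):
    """Newton ascent over rho = log(lambda) with step-halving."""
    rho = np.asarray(rho0, dtype=float)
    for _ in range(max_iter):
        grad, hess = numeric_grad_hess(V, rho)
        if np.max(np.abs(grad)) < tol:
            break
        # Force a negative definite Hessian so the step is an ascent direction
        eigval, eigvec = np.linalg.eigh(hess)
        eigval = -np.maximum(np.abs(eigval), 1e-6)
        step = -eigvec @ ((eigvec.T @ grad) / eigval)
        # Halve the step until the criterion no longer decreases
        t = 1.0
        while V(rho + t * step) < V(rho) and t > 1e-8:
            t *= 0.5
        rho = rho + t * step
    return rho
```

In the actual setting every evaluation of `V` hides a full inner fit of β̂, which is why exact derivatives and the numerical safeguards listed above matter so much for cost and stability.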

Proposed algorithm in a nutshell: “Inner algorithm”

Get β̂ via Newton iteration

• only requires gradient & Hessian of penalized likelihood for β|ρ

Numerical tricks:

• step-halving and perturbation of (non-pos. def.) Hessian to ensure increase of the penalized likelihood in each step
• accounting for non-identifiable parameters
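A minimal sketch of such an inner Newton iteration, assuming a Poisson response with log link and a fixed total penalty matrix (passed in as `S_lambda`). The function name `fit_penalized_poisson` is hypothetical. For this particular likelihood the negative Hessian is always positive semi-definite, so the eigenvalue perturbation is only a safeguard here; it becomes essential for likelihoods without that property.

```python
import numpy as np

def fit_penalized_poisson(X, y, S_lambda, max_iter=100, tol=1e-8):
    """Maximise sum(y*eta - exp(eta)) - 0.5 * beta^T S_lambda beta, eta = X beta."""
    def pen_loglik(beta):
        eta = X @ beta
        return np.sum(y * eta - np.exp(eta)) - 0.5 * (beta @ S_lambda @ beta)

    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu) - S_lambda @ beta
        if np.max(np.abs(grad)) < tol:
            break
        neg_hess = X.T @ (mu[:, None] * X) + S_lambda
        # Perturbation: lift tiny or negative eigenvalues to get an ascent direction
        eigval, eigvec = np.linalg.eigh(neg_hess)
        eigval = np.maximum(eigval, 1e-8 * max(eigval.max(), 1.0))
        step = eigvec @ ((eigvec.T @ grad) / eigval)
        # Step-halving: shrink the step until the penalized likelihood does not drop
        t = 1.0
        current = pen_loglik(beta)
        while pen_loglik(beta + t * step) < current and t > 1e-10:
            t *= 0.5
        beta = beta + t * step
    return beta
```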

Proposed algorithm in a nutshell: “Inner algorithm”

Compute $\partial V / \partial\rho_i$ and $\partial^2 V / \partial\rho_i\,\partial\rho_j$ for “outer algorithm”

• via implicit differentiation for first and second derivatives of β̂ w.r.t. ρ
• tricksy model-specific implementation to keep computational
effort manageable for second derivatives of log |H|
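The first-derivative part of the implicit differentiation can be illustrated as follows. Differentiating the stationarity condition of the penalized likelihood with respect to $\rho_j$ gives $\partial\hat\beta / \partial\rho_j = -\lambda_j H^{-1} S_j \hat\beta$, with $H$ the negative Hessian of the penalized log-likelihood. The sketch assumes a Gaussian model with unit variance so that $\hat\beta$ has a closed form; the function names `beta_hat` and `dbeta_drho` are hypothetical.

```python
import numpy as np

def beta_hat(X, y, penalties, rho):
    """Penalized least-squares estimate and H = X^T X + sum_j exp(rho_j) S_j."""
    S_lambda = sum(np.exp(r) * S for r, S in zip(rho, penalties))
    H = X.T @ X + S_lambda
    return np.linalg.solve(H, X.T @ y), H

def dbeta_drho(X, y, penalties, rho):
    """Jacobian whose j-th column is -lambda_j H^{-1} S_j beta_hat."""
    b, H = beta_hat(X, y, penalties, rho)
    rhs = np.column_stack([np.exp(r) * (S @ b) for r, S in zip(rho, penalties)])
    return -np.linalg.solve(H, rhs)
```

Comparing the columns of this Jacobian with finite differences of `beta_hat` is a cheap sanity check. The second derivatives and the derivatives of $\log|H|$ follow the same principle but need higher-order derivatives of the likelihood, which is where the model-specific work comes in.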

What about φ, θ?

Same general idea, slightly different LAML-criterion, yet more derivatives …

Smoothing parameter uncertainty

• conventional GAM-CIs condition on smoothing parameters

• more honest CIs if uncertainty about ρ is considered as well

Approximate idea here:

• compute “sensitivity” of approximate posterior of β̂ to changes in ρ via Taylor expansion of β̂ as function of ρ
• use covariance of Taylor expanded expression as corrected posterior covariance of β

Smoothing parameter uncertainty

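• use Pseudo-Bayes conditional posterior
$\beta \mid y, \rho \sim N\left(\hat\beta(\hat\rho),\ (-H + S(\hat\lambda))^{-1}\right)$
(with $H$ here the Hessian of the unpenalized log-likelihood $l$ at $\hat\beta$, so that $-H + S(\hat\lambda)$ is the negative Hessian of the penalized $L$)
• use Gaussian approximation of posterior
$\rho \mid y \sim N\left(\hat\rho,\ \left(-\frac{\partial^2 V}{\partial\rho_i\,\partial\rho_j}\right)^{-1}\right)$
• yields “corrected” approximate marginal posterior
$\beta \mid y \sim N\left(\hat\beta(\hat\rho),\ (-H + S(\hat\lambda))^{-1} + \text{correction terms}\right)$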
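A rough sketch of the leading correction term, assuming the linearization $\hat\beta(\rho) \approx \hat\beta(\hat\rho) + J(\rho - \hat\rho)$ with Jacobian $J = \partial\hat\beta / \partial\rho$. Propagating the Gaussian approximation for ρ through this expansion adds $J V_\rho J^T$ to the conditional covariance. This covers only the first-order term; any further correction terms used in the paper are deliberately left out, and the function name `corrected_covariance` is hypothetical.

```python
import numpy as np

def corrected_covariance(V_beta, J, V_rho):
    """Conditional covariance of beta plus first-order smoothing-parameter term.

    V_beta : (p, p) covariance of beta given rho
    J      : (p, m) Jacobian of beta_hat with respect to rho
    V_rho  : (m, m) approximate posterior covariance of rho
    """
    return V_beta + J @ V_rho @ J.T
```

Because $J V_\rho J^T$ is positive semi-definite, the corrected marginal variances can never be smaller than the conditional ones, which matches the observation that corrected intervals are at least as wide.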

Smoothing parameter uncertainty: Discussion

• seems to work mostly well even though Bayesian motivation rather sketchy:
• implicit improper prior f (ρ) ∝ 1
• Gaussian approx. quite heroic in many cases

• most applications we’ve seen: little difference for moderately regularized terms, CIs very slightly wider

Smoothing parameter uncertainty: Discussion

• implementation actually ignores smoothing parameter uncertainty where it’s most critical:
at the “phase change” boundary between parametric/linear and smooth fits
• terms not defined for λ → ∞, correction terms break down &
are set to zero
• Gaussian approximation for posterior very bad – often bimodal
marginals for β representing linear and nonlinear models

IC for smooth model selection

• uses same correction terms as β-CI’s to give improved estimates of total model d.f.
• very general approximate solution, previous solutions often
fairly model-specific

Interesting other recent developments:

Next-gen. estimation algorithm is already written up &

EM-like “Fellner-Schall” algorithm (Wood/Fasiolo (2016);
arxiv:1606.04802) does not require derivatives of Hessian

• no smoothing parameter uncertainty

• much simpler to implement, similar computational costs
• practical convergence properties not clear yet
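A sketch of a Fellner-Schall-type update, assuming a Gaussian response with known unit variance. Each smoothing parameter is rescaled by $[\mathrm{tr}(S(\lambda)^{-} S_j) - \mathrm{tr}(H^{-1} S_j)] / (\hat\beta^T S_j \hat\beta)$ with $H = X^T X + S(\lambda)$, whose fixed point is a stationary point of the REML-type criterion. Only first-order quantities are needed, with no derivatives of the Hessian. The function name `fellner_schall` and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def fellner_schall(X, y, penalties, lambdas, n_iter=200):
    """Iterate the multiplicative smoothing-parameter update (unit variance assumed)."""
    lambdas = np.array(lambdas, dtype=float)
    for _ in range(n_iter):
        S_lambda = sum(l * S for l, S in zip(lambdas, penalties))
        H = X.T @ X + S_lambda
        beta = np.linalg.solve(H, X.T @ y)      # coefficients for current lambdas
        H_inv = np.linalg.inv(H)
        S_pinv = np.linalg.pinv(S_lambda)       # generalized inverse of total penalty
        for j, S in enumerate(penalties):
            num = np.trace(S_pinv @ S) - np.trace(H_inv @ S)
            lambdas[j] *= num / (beta @ S @ beta)
    return lambdas
```

With an identity model matrix and identity penalty, the update reduces to $\lambda \leftarrow p(1 + \lambda) / y^T y$, which converges to $p / (y^T y - p)$ whenever $y^T y > p$. That gives a simple closed-form check of the iteration.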

Interesting other recent developments:

“Big Data”-GAMMs (Wood et al. (2016), Generalized additive models for gigadata [...], JASA)

• use binned covariate values

• fitting algorithm uses only unique bin values & bin weights
instead of observed values
• method pioneered in BayesX years ago, now generalized to
multivariate smooths as well
• huge savings in memory, computation times:
$n \approx 10^8$, $p \approx 10^4$ feasible on laptop
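The core of the binning idea can be sketched for a single covariate: after discretization, the cross-products needed for fitting depend only on the unique bin values, the bin counts, and the per-bin response sums. This is a simplified illustration under assumed equal-width bins, not the algorithm of the paper; `binned_crossproducts` and its `basis` argument (a function mapping covariate values to basis rows) are hypothetical names.

```python
import numpy as np

def binned_crossproducts(x, y, basis, n_bins=100):
    """Compute X^T X and X^T y from bin midpoints, counts and response sums only."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    # Bin index of each observation; the maximum is clipped into the last bin
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    y_sums = np.bincount(idx, weights=y, minlength=n_bins)
    Xb = basis(mids)                            # only n_bins rows, not n
    XtX = Xb.T @ (counts[:, None] * Xb)
    Xty = Xb.T @ y_sums
    return XtX, Xty, mids[idx]
```

The basis is evaluated at `n_bins` points instead of at all observations, so memory and the cost of forming the cross-products scale with the number of bins rather than with n for this part of the computation.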

Interesting other recent developments:

JAGS interface mgcv::jagam() (Wood (2016); arxiv:1602.02539)

• translates (almost) any mgcv-model into Bayesian pseudo-code

• use Martyn Plummer’s rjags-interface to JAGS to perform
fully Bayesian automated MCMC inference
• modify automatically generated JAGS code for different priors
on λ,

General points for discussion

• Practical importance for applied work?

• Novelty compared to gamlss, VGAM?
• Pseudo-/ Pragmatic Bayesian justification for corrected CIs
• ...?


