
Journal Club:
Smoothing parameter and model selection for general smooth models

1
Summary of the paper

• extends mgcv to non-exponential families and multiple additive predictors
  → GAMLSS, multinomial/multivariate responses, Cox PH
• stable, computationally efficient inference based on a very general REML-like construction (“Laplace approximate marginal likelihood”)
• approximate accounting for smoothing parameter uncertainty:
  • improved CIs
  • improved AIC
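
A minimal R sketch of what these extensions look like from the user side, assuming a current mgcv version (the simulated data come from mgcv's gamSim(); the Cox example is commented out because the data frame surv_dat and its columns time, status, age are hypothetical):

```r
library(mgcv)

## GAMLSS-type Gaussian location-scale model: one additive predictor for the
## mean and one for the log standard deviation, supplied as a list of formulas.
set.seed(1)
dat <- gamSim(1, n = 400)                      # toy data shipped with mgcv
b <- gam(list(y ~ s(x0) + s(x1),               # predictor for the mean
              ~ s(x2)),                        # predictor for log(sd)
         family = gaulss(), data = dat, method = "REML")
summary(b)

## Cox proportional hazards model with a smooth covariate effect
## (hypothetical data; the censoring indicator goes into `weights`):
## b2 <- gam(time ~ s(age), family = cox.ph(), weights = status,
##           data = surv_dat, method = "REML")
```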

2
Framework

• model log-likelihood l(β | y, x)
  (possibly additional scale parameter φ, latent θ)
• function of (multiple) additive predictors
• each additive predictor is a sum of basis-expanded effects:
  g_j(x_j) = Σ_{i=1}^{k_j} β_{ji} b_{ji}(x_j)
• each effect has a quadratic penalty λ_j β^T S_j β

⇒ Maximize the penalized log-likelihood
  L(β) = l(β) − (1/2) Σ_j λ_j β^T S_j β
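
To see the “basis expansion plus quadratic penalty” structure concretely, one can ask mgcv for the design and penalty matrix of a single smooth term; a small sketch with a made-up covariate x:

```r
library(mgcv)

x  <- runif(200)
sm <- smoothCon(s(x, k = 10, bs = "cr"), data = data.frame(x = x))[[1]]

dim(sm$X)        # n x k_j design matrix: columns are the basis functions b_ji(x)
dim(sm$S[[1]])   # k_j x k_j penalty matrix S_j in the quadratic penalty
```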

3
Proposed algorithm in a nutshell

Ideas:

1. Empirical Bayesian viewpoint:
   penalties ≡ MVN priors f_λ(β_j): β_j ∼ N(0, (λ_j S_j)⁻)
   (with (·)⁻ a generalized inverse, since the S_j may be rank-deficient)
2. ⇒ find modal smoothing parameters by maximizing the log marginal likelihood
   V_r(λ) = log ∫ exp(l(β | y, x)) f_λ(β) dβ
   via Newton iteration
   → Gaussian case: equivalent to REML estimation
3. coefficients β given smoothing parameters λ via Newton iteration (≈ IWLS-like step)
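
In mgcv this corresponds to selecting the smoothing parameters by (restricted) marginal likelihood instead of GCV; a short sketch on simulated data:

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400)

## smoothing parameters chosen by maximizing the Laplace approximate
## restricted marginal likelihood
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

b$sp          # estimated smoothing parameters lambda_j
b$gcv.ubre    # minimized selection criterion (here the negative REML score)
```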

4
Proposed algorithm in a nutshell

Maximize Laplace-approximation of marginal likelihood at β̂:

V(λ) ∝ L(β̂) + (1/2) log|S(λ)|_+ − (1/2) log|H|

with H = −∂²L(β)/(∂β ∂β^T) |_{β=β̂} the negative Hessian of the penalized log-likelihood,
and |·|_+ the product of the non-zero eigenvalues (the S_j may be rank-deficient).
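
The step from the integral V_r(λ) on the previous slide to this expression is a standard Laplace approximation of the integrand around its mode β̂; a sketch, with additive constants (powers of 2π, penalty rank) omitted:

```latex
\log \int \exp\{l(\beta \mid y, x)\}\, f_\lambda(\beta)\, d\beta
  \;\approx\;
  L(\hat\beta) \;+\; \tfrac{1}{2}\log|S(\lambda)|_+ \;-\; \tfrac{1}{2}\log|H| \;+\; \text{const}
```

The ½ log|S(λ)|_+ term is the normalizing constant of the MVN prior f_λ, and the −½ log|H| term comes from expanding L(β) to second order around β̂.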

5
Proposed algorithm in a nutshell: “Outer algorithm”

Newton optimization of V over ρ = log λ:

• requires ∂V/∂ρ and ∂²V/(∂ρ ∂ρ^T), as well as β̂, for every candidate ρ
• lots of numerical tricks to keep computation stable despite
  • numerically unstable log-determinants of rank-deficient penalty matrices
  • flat likelihoods for very large ρ (strong regularization)

6
Proposed algorithm in a nutshell: “Inner algorithm”

Get β̂ via Newton iteration

• only requires gradient & Hessian of penalized likelihood for β|ρ

Numerical tricks:

• step halving and perturbation of (non-positive-definite) Hessians to ensure an increase in each step (see the sketch below)
• accounting for non-identifiable parameters
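
A deliberately schematic version of such a Newton update with step halving and Hessian perturbation, for a generic penalized log-likelihood — an illustration of the idea only, not mgcv's actual inner loop:

```r
## Maximize a penalized log-likelihood `pll(beta)` with gradient `grad(beta)`
## and Hessian `hess(beta)` (all user-supplied functions). Purely illustrative;
## mgcv's implementation contains many more safeguards.
newton_step_halving <- function(beta, pll, grad, hess, max_it = 100, tol = 1e-8) {
  for (it in seq_len(max_it)) {
    H  <- -hess(beta)                           # negative Hessian, should be pos. def.
    ev <- eigen(H, symmetric = TRUE, only.values = TRUE)$values
    if (min(ev) <= 0)                           # perturb if not positive definite
      H <- H + diag(abs(min(ev)) + 1e-6, nrow(H))
    delta <- solve(H, grad(beta))               # Newton direction
    step  <- 1
    while (pll(beta + step * delta) < pll(beta) && step > 1e-8)
      step <- step / 2                          # halve until the objective increases
    beta_new <- beta + step * delta
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}
```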

7
Proposed algorithm in a nutshell: “Inner algorithm”

Compute ∂V/∂ρ_i and ∂²V/(∂ρ_i ∂ρ_j) for the “outer algorithm”

• via implicit differentiation for the first and second derivatives of β̂ w.r.t. ρ (sketched below for the first derivatives)
• tricky, model-specific implementation needed to keep the computational effort manageable for the second derivatives of log|H|
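
For the first derivatives the implicit differentiation step can be written out explicitly; a sketch in the notation of the framework slide, with S_λ = Σ_j λ_j S_j and λ_j = e^{ρ_j}:

```latex
% \hat\beta(\rho) solves the first-order condition  \nabla_\beta l(\hat\beta) - S_\lambda \hat\beta = 0.
% Differentiating this condition w.r.t. \rho_j and solving gives
\frac{\partial \hat\beta}{\partial \rho_j}
  \;=\; -\bigl(S_\lambda - \nabla^2_\beta\, l(\hat\beta)\bigr)^{-1} \lambda_j S_j\, \hat\beta .
```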

8
What about φ, θ?

Same general idea, slightly different LAML criterion, yet more derivatives …

9
Smoothing parameter uncertainty

• conventional GAM CIs condition on the smoothing parameters
• more honest CIs if uncertainty about ρ is taken into account as well

Approximate idea here:

• compute the “sensitivity” of the approximate posterior of β̂ to changes in ρ via a Taylor expansion of β̂ as a function of ρ
• use the covariance of the Taylor-expanded expression as the corrected covariance
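
Written out to first order this amounts to the following (a sketch; the paper's correction contains further terms). With J the matrix of derivatives ∂β̂/∂ρ obtained by implicit differentiation as above, V_ρ the approximate posterior covariance of ρ and V_β the conditional covariance from the next slide:

```latex
\hat\beta(\rho) \;\approx\; \hat\beta(\hat\rho) + J\,(\rho - \hat\rho)
\qquad\Rightarrow\qquad
\operatorname{Cov}(\beta \mid y) \;\approx\; V_\beta + J\, V_\rho\, J^{\top}
```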

10
Smoothing parameter uncertainty

• use the pseudo-Bayesian conditional posterior
  β | y, ρ ∼ N(β̂(ρ̂), (−H + S_λ̂)⁻¹)
  (here H is the Hessian of the unpenalized log-likelihood, so −H + S_λ̂ is the negative penalized Hessian)
• use a Gaussian approximation of the posterior of ρ
  ρ | y ∼ N(ρ̂, (−∂²V/(∂ρ ∂ρ^T))⁻¹)
• yields the “corrected” approximate marginal posterior
  β | y ∼ N(β̂(ρ̂), (−H + S_λ̂)⁻¹ + correction terms)
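
In mgcv the corrected covariance is available via the unconditional argument of vcov.gam and predict.gam; a small sketch on simulated data (the model must be fitted by REML or ML for the correction to be defined):

```r
library(mgcv)

set.seed(3)
dat <- gamSim(1, n = 400)
b <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat, method = "REML")

Vc <- vcov(b)                           # conditional on the estimated smoothing parameters
Vu <- vcov(b, unconditional = TRUE)     # with the smoothing parameter uncertainty correction
range(diag(Vu) - diag(Vc))              # corrected variances are typically (weakly) larger

## corrected CIs for fitted values:
pr <- predict(b, se.fit = TRUE, unconditional = TRUE)
```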

11
Smoothing parameter uncertainty: Discussion

• seems to work mostly well even though the Bayesian motivation is rather sketchy:
  • implicit improper prior f(ρ) ∝ 1
  • Gaussian approximation quite heroic in many cases
• in most applications we have seen: little difference for moderately regularized terms, CIs only very slightly wider

12
Smoothing parameter uncertainty: Discussion

• the implementation actually ignores smoothing parameter uncertainty where it is most critical:
  at the “phase change” boundary between parametric/linear and smooth fits
• terms are not defined for λ → ∞; the correction terms break down and are set to zero
• the Gaussian approximation for the posterior is very bad there – marginals for β are often bimodal, representing the linear and the nonlinear model

13
IC for smooth model selection

• uses the same correction terms as the β-CIs to give improved estimates of the total model d.f.
• very general approximate solution; previous solutions were often fairly model-specific
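
For models estimated by REML/ML, mgcv's AIC method uses the corrected effective degrees of freedom; a short sketch comparing two nested smooth models on simulated data (in gamSim's test function the effect of x3 is zero, so the extra smooth is pure noise):

```r
library(mgcv)

set.seed(4)
dat <- gamSim(1, n = 400)

b1 <- gam(y ~ s(x0) + s(x1) + s(x2),         data = dat, method = "REML")
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, method = "REML")

AIC(b1, b2)      # uses the corrected effective degrees of freedom
sum(b1$edf)      # conventional total e.d.f., for comparison
```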

14
Interesting other recent developments:

Next-generation estimation algorithm is already written up & implemented:
EM-like “Fellner-Schall” algorithm (Wood & Fasiolo (2016); arXiv:1606.04802) does not require derivatives of the Hessian

• no accounting for smoothing parameter uncertainty
• much simpler to implement, similar computational cost
• practical convergence properties not yet clear
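
In recent mgcv versions the Fellner-Schall update is available through the optimizer argument of gam(); a sketch (optimizer = "efs" is assumed to be supported by the installed version):

```r
library(mgcv)

set.seed(5)
dat <- gamSim(1, n = 400)

b_newton <- gam(y ~ s(x0) + s(x1), data = dat, method = "REML")      # default full Newton
b_efs    <- gam(y ~ s(x0) + s(x1), data = dat, method = "REML",
                optimizer = "efs")                                    # extended Fellner-Schall

rbind(newton = b_newton$sp, efs = b_efs$sp)   # smoothing parameters typically agree closely
```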

15
Interesting other recent developments:

“Big Data” GAMMs (Wood et al. (2016), “Generalized additive models for gigadata [. . . ]”, JASA)

• uses binned covariate values
• the fitting algorithm uses only the unique bin values & bin weights instead of the observed values
• method pioneered in BayesX years ago, now generalized to multivariate smooths as well
• huge savings in memory and computation time:
  n ≈ 10⁸, p ≈ 10⁴ feasible on a laptop
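
In mgcv this is what bam() with discrete = TRUE does; a minimal sketch on a modest simulated data set (the real gains only show up at the sample sizes quoted above):

```r
library(mgcv)

set.seed(6)
dat <- gamSim(1, n = 100000)    # still tiny compared to n ~ 1e8

## covariates are discretized into bins; fitting works on unique bin values + weights
bf <- bam(y ~ s(x0) + s(x1) + s(x2) + te(x1, x2), data = dat,
          discrete = TRUE, method = "fREML")
summary(bf)
```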

16
Interesting other recent developments:

JAGS interface mgcv::jagam() (Wood (2016); arXiv:1602.02539)

• translates (almost) any mgcv model into equivalent JAGS model code
• uses Martyn Plummer’s rjags interface to JAGS to perform fully automated Bayesian MCMC inference
• the automatically generated JAGS code can be modified by hand, e.g. to place different priors on λ
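
A sketch of the jagam() workflow (the file name gam.jags is arbitrary; JAGS and the rjags package must be installed, and the monitored nodes "b" and "rho" follow the jagam documentation):

```r
library(mgcv)
library(rjags)

set.seed(7)
dat <- gamSim(1, n = 400)

## 1. write a JAGS model file and set up data + initial values
jd <- jagam(y ~ s(x0) + s(x1), data = dat, family = gaussian, file = "gam.jags")

## 2. "gam.jags" can now be edited by hand, e.g. to change the priors on lambda

## 3. run MCMC via rjags and convert the samples back into a gam-like object
jm  <- jags.model("gam.jags", data = jd$jags.data, inits = jd$jags.ini, n.chains = 1)
sam <- jags.samples(jm, c("b", "rho"), n.iter = 10000, thin = 10)
bj  <- sim2jam(sam, jd$pregam)
plot(bj)
```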

17
General points for discussion

• Practical importance for applied work?
• Novelty compared to gamlss, VGAM?
• Pseudo-/pragmatic Bayesian justification for the corrected CIs?
• ...?

18
