Stat Comput (2008) 18: 87–99

DOI 10.1007/s11222-007-9040-0

A comparison of methods for the fitting of generalized additive models

Harald Binder · Gerhard Tutz

Received: 23 November 2006 / Accepted: 14 September 2007 / Published online: 20 October 2007
© Springer Science+Business Media, LLC 2007

H. Binder (corresponding author)
Institut für Medizinische Biometrie und Medizinische Informatik, Universitätsklinikum Freiburg, Stefan-Meier-Str. 26, 79104 Freiburg, Germany
e-mail: binderh@fdm.uni-freiburg.de

G. Tutz
Institut für Statistik, Ludwig-Maximilians-Universität München, Munich, Germany

Abstract  There are several procedures for fitting generalized additive models, i.e. regression models for an exponential family response where the influence of each single covariate is assumed to have unknown, potentially non-linear shape. Simulated data are used to compare a smoothing parameter optimization approach for selection of smoothness and of covariates, a stepwise approach, a mixed model approach, and a procedure based on boosting techniques. In particular it is investigated how the performance of procedures is linked to amount of information, type of response, total number of covariates, number of influential covariates, and extent of non-linearity. Measures for comparison are prediction performance, identification of influential covariates, and smoothness of fitted functions. One result is that the mixed model approach returns sparse fits with frequently over-smoothed functions, while the functions are less smooth for the boosting approach and variable selection is less strict. The other approaches are in between with respect to these measures. The boosting procedure is seen to perform very well when little information is available and/or when a large number of covariates is to be investigated. It is somewhat surprising that in scenarios with low information the fitting of a linear model, even with stepwise variable selection, has not much advantage over the fitting of an additive model when the true underlying structure is linear. In cases with more information the prediction performance of all procedures is very similar. So, in difficult data situations the boosting approach can be recommended, in others the procedures can be chosen conditional on the aim of the analysis.

Keywords  Generalized additive models · Selection of smoothness · Variable selection · Boosting · Mixed model approach

1 Introduction

Generalized additive models, as introduced by Hastie and Tibshirani (1986), present a flexible extension of generalized linear models (e.g. McCullagh and Nelder 1989), allowing for arbitrary functions for modeling the influence of each covariate on an exponential family response in a multivariate regression setting. Various techniques can be employed for actually fitting such models, some of them documented in two recent monographs (Ruppert et al. 2003; Wood 2006), and there is ongoing research on new ones. While most of the available approaches have been developed with a specific application in mind, there is quite some overlap of application settings where several procedures may be feasible. There are relatively few studies that compare the different methods available. Usually comparisons are done when a new procedure is introduced and results are limited to only a few procedures (e.g. Wood 2004). Also, often the focus is on univariate fitting of splines (e.g. Lindstrom 1999; Wand 2000; Ruppert 2002; Lee 2003), i.e. the model selection problem arising in multivariate settings is not dealt with. A precursor to the present study is contained in Tutz and Binder (2006), where only a limited amount of simulation scenarios for comparison with other methods is considered.

The focus of the present paper is on comparison of procedures for fitting generalized additive models. We therefore use an extended set of examples with simulated data and additional procedures for comparison. It cannot be expected that there is a "best procedure". The advantage of one approach over the other will depend on the underlying structure and the sampling scheme. We will explore several structures and sampling schemes and compare the performance of the various procedures. In the following we briefly sketch the characteristics, naturally subjective and limited, of the underlying structure that can be expected to have an effect on the performance of fitting procedures.

Amount of information  One feature that will have an effect on the performance is the amount of structure underlying the data. This can be quantified by the signal-to-noise ratio (definition given in Sect. 3). Generalized additive models are frequently used for the exploratory analysis of data. In contrast, in an experimental setting one often has some knowledge about the functional form of the response. Since in exploratory analyses it is often not even clear whether the predictors contain any information at all, we will focus on rather small signal-to-noise ratios.

Type of response  When a binary (or Poisson) response is used instead of a continuous response with Gaussian error this can be seen as a coarsening of the response, i.e. the signal-to-noise ratio decreases. Especially for binary response settings the question is whether the relative performance of the various approaches for fitting generalized additive models changes just as would be expected for a decrease in signal-to-noise ratio. If this is not the case the type of response has to be considered as a distinct factor.

Number of non-influential covariates  If subject matter knowledge about the potential influence of covariates is available, it is advisable to include only those covariates that are truly influential. Nevertheless, sometimes there is a large number of covariates and not enough subject matter knowledge to decide on a small set of covariates to be used. Even classical linear model procedures run into problems in the high-dimensional case, and therefore regularization techniques have to be used (e.g. Hoerl and Kennard 1970; Tibshirani 1996). For generalized additive models high-dimensional settings are even more difficult to handle, because for each covariate a function has to be estimated instead of just one slope parameter. As each approach for fitting generalized additive models has specific technique(s) for regularization of complexity, it will be instructive to compare the decrease of performance when adding more and more non-influential covariates (i.e. not changing the underlying structure and signal-to-noise ratio).

Number of influential predictors/distribution of information  Given the same signal-to-noise ratio, information can be contained in a small number of variables (constituting a sparse problem) or distributed over a large number of covariates (dense problem). It has been recognized that in the latter case usually no procedure can do really well and therefore one should "bet on sparsity" (see e.g. Friedman et al. 2004). Nevertheless we will evaluate how the performance of the various approaches changes when information is distributed over a larger number of covariates.

Amount of non-linearity in the data  It is also worthwhile to explore the performance of procedures when the structure underlying the data is simple. As noted for example recently by Hand (2006), seemingly simple methods still offer surprisingly good performance in applications with real data when compared to newer methods. Therefore a procedure offering more complexity should offer means for automatic regulation of complexity, thus being able to fit a complex model only when necessary and a simple one when that is sufficient. It will be evaluated whether these mechanisms work even in the extreme case, where the underlying structure is linear. This is important, because it touches the question whether one can rely on fitting additive models, and still expect results to be reliable, even if the underlying model is linear.

One issue that will not be dealt with in the present study is that of influential covariates that differ in the size of their influence. In preliminary studies when using such non-uniform covariate influence profiles we did not find large differences compared to examples with equal covariate influence, after controlling for signal-to-noise ratio. Another issue is that of correlation of covariates. The latter may result in more difficult estimation problems, but might also be beneficial for prediction performance, e.g. because it can be easier to identify covariates that contain at least some information, when there is correlation. Indeed, in preliminary investigations we could not find a consistent effect of correlation and there was no differential pattern of some procedures being affected more than others. Therefore we do not report these results here.

Most examples will be structured such that it is difficult, but not impossible, to reasonably fit a generalized additive model to the data. This is for example ensured by using the intercept model, which does not use any covariate information, as a performance reference that should always be outperformed at least by some of the approaches under consideration. We will avoid undemanding examples, e.g. featuring a large number of observations with only a few covariates, because in our experience all approaches perform well in this case and not much can be learned.

The results in this paper will be structured such that for each data property/issue listed above recommendations can be derived.

Section 2 reviews the generalized additive model framework and the theoretical foundations of the procedures for fitting such models that will be used for comparison. Section 3 presents the design of the simulation study, i.e. the types of example data used, the specific details and implementations of the procedures, and the measures used for comparison. The results are discussed in Sect. 4, with a focus on the issues highlighted above. Finally, in Sect. 5 we summarize the results and give recommendations as to which procedure can or should be used for what kind of objective.

2 Procedures for fitting generalized additive models

Generalized additive models (GAMs) assume that data (y_i, x_i), i = 1, ..., n, with covariate vectors x_i^T = (x_i1, ..., x_ip) follow the model

μ_i = h(η_i),   η_i = f_(1)(x_i1) + · · · + f_(p)(x_ip),

where μ_i = E(y_i | x_i), h is a specified response function, and f_(j), j = 1, ..., p, are unspecified functions of the covariates. As in generalized linear models (GLMs) (McCullagh and Nelder 1989) it is assumed that y|x follows a simple exponential family, including among others normally distributed, binary, or Poisson distributed responses.

One has to distinguish between how a model is fitted and how covariates and smoothness of the fitted functions are selected. Table 1 gives an overview of the approaches used in the present study, which will be sketched in the following subsections. Naturally, not all available approaches for fitting of generalized additive models could be considered. For example, marginal integration approaches (see e.g. Linton and Härdle 1996) are not included. With respect to fitting of the model, i.e. fitting of (smooth) functions f_(j), the different approaches in Table 1 allow for different classes of techniques (column 2), from which typical ones (to be used later for the simulation study) are shown in column 3 of Table 1. For example, all approaches except bfGAM require the use of penalized regression splines and only the latter allows for a more general class of smoothers. Also with respect to selection of the smoothness of the fitted functions and the selection of covariates the approaches shown in Table 1 allow only for a certain amount of customization. Columns 4 and 5 show typical choices. We will only investigate the combinations of techniques for fitting smooth functions and selecting tuning parameters shown in Table 1. For example, we do not consider a potential variant of GAMM which uses a stepwise technique for selection of covariates. The analysis is performed in the statistical environment R (R Development Core Team 2006), version 2.3.1, where implementations for the presented approaches for fitting generalized additive models are available as packages.
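To fix ideas before the individual procedures are discussed (Table 1 below), the following minimal sketch fits such an additive predictor in R with the mgcv package, which underlies the wGAM procedure used later. It is an illustration only: the data frame, variable names, and settings are made up and are not those of the simulation study; for binary or Poisson responses, family = binomial or family = poisson would be used instead.

## Illustration only: a GAM with one smooth term per covariate, fitted with
## mgcv; data and variable names are invented for this example.
library(mgcv)

set.seed(1)
d <- data.frame(x1 = runif(100, -2, 2),
                x2 = runif(100, -2, 2),
                x3 = runif(100, -2, 2))
d$y <- d$x1 + sin(pi * d$x2 / 2) + rnorm(100)    # additive truth plus noise

## eta_i = f(1)(x_i1) + f(2)(x_i2) + f(3)(x_i3), cubic regression spline bases
fit <- gam(y ~ s(x1, bs = "cr") + s(x2, bs = "cr") + s(x3, bs = "cr"),
           family = gaussian, data = d)
summary(fit)
plot(fit, pages = 1)    # the fitted functions f_(j)

The packages used by the other procedures (gam, GAMBoost) provide their own fitting functions with analogous model specifications.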
Table 1  Higher level view of approaches for fitting generalized additive models

• bfGAM — fitting of (smooth) functions: any scatterplot smoother (here: cubic smoothing splines); selection of smoothness and covariates: stepwise, evaluating a fixed set of degrees of freedom for each component.
• wGAM — fitting: penalized regression splines (cubic regression splines); selection of smoothness and covariates: optimization in smoothing parameter space.
• wGAMstep — fitting: penalized regression splines (cubic regression splines); selection of smoothness: optimization in smoothing parameter space; selection of covariates: stepwise.
• GAMM — fitting: penalized regression splines (cubic regression splines); selection of smoothness: implicitly, as variance estimates in a mixed model.
• GAMBoost — fitting: penalized regression splines (2nd degree B-splines with 1st order difference penalty); selection of smoothness and covariates: implicitly, via the number of updates received in boosting steps.

2.1 Backfitting and stepwise selection of degrees of freedom (bfGAM)

The most traditional algorithm for fitting additive models is the backfitting algorithm (Friedman and Stuetzle 1981), which has been propagated for additive models in Hastie and Tibshirani (1990). The algorithm is based on univariate scatterplot smoothers which are applied iteratively. The backfitting algorithm cycles through the individual terms in the additive model and updates each using a unidimensional smoother. In principle also higher-order terms can be incorporated, e.g. using bivariate smoothers, but this will not be considered in the following.

Thus if f̂_j^(0)(·), j = 1, ..., p, are estimates, updates for additive models are computed by

f̂_j^(1) = S_j ( y − Σ_{s<j} f_s^(1) − Σ_{s>j} f_s^(0) ),

where S_j is a smoother matrix, y^T = (y_1, ..., y_n) and f_s = (f_(s)(x_1s), ..., f_(s)(x_ns))^T denotes the vector of evaluations of component s (superscripts (0) and (1) indicating the current and the updated estimate). The second term on the right hand side represents partial residuals that are smoothed in order to obtain an update for the left out component f_j. The algorithm is also known as the Gauss-Seidel algorithm.

For generalized additive models the local scoring algorithm is used. In this algorithm, for each Fisher scoring step there is an inner loop that fits the additive structure of the linear predictor in the form of a weighted backfitting algorithm. For details, see Hastie and Tibshirani (1990).

The procedure that is used in the simulation study is called bfGAM, for traditional backfitting combined with a stepwise procedure for selecting the degrees of freedom for each component (package gam, version 0.97). The program uses cubic smoothing splines as smoother and selects the smoothing parameters by stepwise selection of the degrees of freedom for each component (procedure step.gam). The possible levels of degrees of freedom we use are 0, 1, 4, 6, or 12, where 0 means exclusion of the covariate and 1 inclusion with linear influence. The procedure starts with a model where all terms enter linearly and in a stepwise manner seeks improvement of AIC by upgrading or downgrading the degrees of freedom for one component by one level (see Chambers and Hastie 1992).
2.2 Simultaneous estimation and optimization in smoothing parameter space (wGAM and wGAMstep)

Regression splines offer a way to approximate the underlying functions f_(j)(·) by using an expansion in basis functions. One uses the approximation

f_(j)(x_ij) = Σ_{s=1}^m β_s^(j) φ_s(x_ij),

where the φ_s(·) are known basis functions. A frequently used set of basis functions are cubic regression splines, which assume that for a given sequence of knots τ_1 < · · · < τ_m (from the domain of the covariate under investigation) the function may be represented by a cubic polynomial within each interval [τ_s, τ_s+1] and has continuous first and second derivatives at the knots. Marx and Eilers (1998) proposed the use of a large number of evenly spaced knots and the B-spline basis in order to obtain a flexible fit, and to use a penalized log-likelihood criterion in order to obtain stable estimates. Then one maximizes the penalized likelihood

l_p = l + λ Σ_i (β_i+1 − β_i)^2,   (1)

where l denotes the original likelihood and λ is a tuning parameter which steers the difference penalty.

Wood (2004) extensively discusses the implementation of such an approach for penalized estimation of the functions together with a technique for selection of the smoothing parameters (see also Wood 2000, 2006). The procedure is referred to as wGAM (for woodGAM). It performs simultaneous estimation of all components with optimization in smoothing parameter space with respect to the generalized cross-validation criterion (for a continuous response) or the unbiased risk estimate (for a binary and Poisson response) (gam in package mgcv, version 1.3-17). The package offers several choices for the basis functions and we use a cubic regression spline basis (with default number and spacing of the knots). As wGAM can no longer be used when the number of covariates gets too large, in parallel we also use wGAMstep, a stepwise procedure, which, similar to bfGAM, evaluates the levels "exclusion", "linear", and "smooth" for each component.
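The penalized likelihood (1) can be illustrated for a single smooth function and a Gaussian response, where it reduces to penalized least squares. The following toy P-spline sketch (evenly spaced knots, B-spline basis, first order difference penalty) is not the code of any of the packages compared here; the number of basis functions and the value of λ are arbitrary.

## Toy P-spline fit of one smooth function: penalized least squares with a
## B-spline basis and a first order difference penalty as in (1); illustration only.
library(splines)

set.seed(3)
n <- 100
x <- sort(runif(n, -2, 2))
y <- sin(pi * x / 2) + rnorm(n, sd = 0.3)

m  <- 20                                   # number of B-spline basis functions
dx <- (max(x) - min(x)) / (m - 3)
knots <- min(x) + (-3:m) * dx              # m + 4 evenly spaced knots beyond the data
B <- splineDesign(knots, x, ord = 4, outer.ok = TRUE)   # cubic B-spline design, n x m

D <- diff(diag(m), differences = 1)        # first differences of adjacent coefficients
lambda <- 10
beta <- solve(t(B) %*% B + lambda * t(D) %*% D, t(B) %*% y)
fhat <- B %*% beta                         # penalized fit

plot(x, y); lines(x, fhat, lwd = 2)

Larger λ pulls adjacent coefficients together and hence smooths the fit; the procedures below differ mainly in how such tuning parameters are chosen.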

2.3 Mixed model approach (GAMM)

It has been pointed out already by Speed (1991) that fitting of a smoothing spline can be formulated as the estimation of a random effects model, but only recently this has been popularized (see e.g. Wang 1998; Ruppert et al. 2003) and implementations have been made readily available (Wood 2004). Let the additive model be represented in the matrix form

η = Φ_1 β_1 + · · · + Φ_p β_p,   (2)

where η^T = (η_1, ..., η_n) and Φ_1, ..., Φ_p denote matrices composed from the basis functions. By assuming that the parameters β_1, ..., β_p are random effects with a block-diagonal covariance matrix, one may estimate the parameters by using best linear unbiased prediction (BLUP) as used in mixed models. Smoothing parameters are obtained by maximum likelihood or restricted maximum likelihood within the mixed models framework. For details see Ruppert et al. (2003), Wood (2004).

The mixed model approach is denoted as GAMM (gamm in package mgcv).

2.4 Boosting approach (GAMBoost)

Boosting originates in the machine learning community, where it has been proposed as a technique to improve classification procedures by combining estimates with reweighted observations. Since it has been shown that reweighting corresponds to minimizing a loss function iteratively (Breiman 1999; Friedman 2001), boosting has been extended to regression problems in an L2-estimation framework by Bühlmann and Yu (2003). The extension to generalized additive models, where estimates are obtained by likelihood based boosting, is outlined in Tutz and Binder (2006). Likelihood based boosting is an iterative procedure in which estimates are obtained by applying a "weak learner" successively on residuals of components of the additive structure. By iterative fitting of the residual and selection of components the procedure adapts automatically to the possibly varying smoothness of components. Estimation in one boosting step is based on (1), with λ being chosen very large in order to obtain a weak learner. The number of boosting steps is determined by a stopping criterion, e.g. cross-validation or an information criterion.

We denote the procedure by GAMBoost (Tutz and Binder 2006), which is a boosting procedure with the number of boosting steps selected by AIC_C (Hurvich et al. 1998) in the Gaussian response examples and by AIC otherwise (package GAMBoost, version 0.9-3). Since the smoothness penalty λ is not as important as the number of boosting steps (Tutz and Binder 2006), it is determined by a coarse line search such that the corresponding number of boosting steps (selected by AIC_C/AIC or cross-validation) is in the range [50, 200] (procedure optimGAMBoostPenalty).
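For intuition, the componentwise principle behind this kind of boosting can be sketched for a Gaussian response: each step offers every covariate a deliberately weak (heavily smoothed, damped) univariate update, only the component that reduces the residual sum of squares most is updated, and the number of steps plays the role of the main tuning parameter. This is a simplified stand-in for the likelihood-based scheme of GAMBoost, not its implementation; all settings below are arbitrary.

## Componentwise boosting sketch (Gaussian case); illustration only.
set.seed(4)
n <- 100; p <- 6
x <- matrix(runif(n * p, -2, 2), n, p)
y <- x[, 1] + sin(pi * x[, 2] / 2) + rnorm(n)

f <- matrix(0, n, p)                 # current component fits
offset <- mean(y)
nu <- 0.1                            # damping factor: keeps single updates weak
steps <- 100                         # number of boosting steps (the tuning parameter)
for (m in 1:steps) {
  r <- y - offset - rowSums(f)       # current residuals
  rss <- numeric(p); upd <- vector("list", p)
  for (j in 1:p) {
    sm <- smooth.spline(x[, j], r, df = 2.5)   # weak (very smooth) univariate learner
    upd[[j]] <- predict(sm, x[, j])$y
    rss[j] <- sum((r - upd[[j]])^2)
  }
  best <- which.min(rss)             # update only the best-fitting component
  f[, best] <- f[, best] + nu * upd[[best]]
}
## components that rarely receive an update stay near zero, which is the
## implicit variable selection mentioned in the text

In the actual procedure the stopping step would be chosen by AIC_C/AIC or cross-validation rather than fixed in advance.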

3 Design of the simulation study

3.1 Simulated data

For each example in the simulation study there are 20 repetitions where a fixed number of n = 100 observations is generated. In preliminary experiments we found that the major performance differences are due to the ratio between the number of observations and the number of covariates and therefore we manipulate only the number of covariates. We generate p = 6, 10, 20, or 50 normally distributed covariates (N(0, 1), truncated to the range [−2, 2]). As already stated in the introduction we use no correlation between covariates, because this did not have a differential effect on procedures in preliminary investigations. Either the first three or the first six of the covariates are informative. The covariate influence functions f_(j)(·) for these are centered and scaled such that the mean of the single covariate effects η_ij = f_(j)(x_ij) generated for x_ij ∈ [−2, 2] is 0 and the standard deviation is 1 (determined empirically for a large sample). We consider two types of structures, the semiparametric structure, for which in each simulation the function for each informative covariate is randomly sampled to be a centered and standardized version of one of the functions f_linear(x) = x, f_sinus(x) = sin(π(2 · (x + 2)/4 − 1)), f_quadratic(x) = (2 · (x + 2)/4 − 1)^2, or f_invlogit(x) = 1/(1 + exp(−10 · (2 · (x + 2)/4 − 1))). The second type is parametric, where examples with just linear functions are considered. The effect of all informative covariates is then added up and scaled by a constant c_e, which controls the signal-to-noise ratio, to arrive at an n-vector of predictors η = (η_1, ..., η_n). Most examples will feature a continuous response with Gaussian error and true mean vector μ = η, but also examples with binary and Poisson response will be given, and for these the elements of μ = (μ_1, ..., μ_n) are obtained by μ_i = exp(η_i)/(1 + exp(η_i)) or μ_i = exp(η_i) respectively. The elements of the response vector y = (y_1, ..., y_n) are drawn from normal N(μ_i, 1), binomial B(μ_i, 1) or Poisson(μ_i) distributions. For most examples the value of c_e is chosen such that for few covariates (say six, with three being informative) all procedures for fitting generalized additive models can improve over the fit of a generalized linear model. For a reference example, with six covariates of which three are informative, we choose c_e = 0.5 for the continuous and the Poisson response case and c_e = 1.5 for the binary response case. When examples with a larger number of informative covariates are used, the value of c_e is decreased to maintain a similar level of information. The latter is quantified by the (generalized) signal-to-noise ratio, which is estimated for each example on new test data of size n_new = 1000 by

signal-to-noise ratio = Σ_i (μ_i − μ̄)^2 / Σ_i (μ_i − y_i)^2,   (3)

with μ̄ = (Σ_i μ_i)/n_new.

3.2 Procedures

As already outlined in Sect. 2 we use the procedures

1. bfGAM: traditional backfitting combined with a stepwise procedure for selecting the degrees of freedom for each component (package gam, version 0.97).
2. wGAM: simultaneous estimation of all components with optimization in smoothing parameter space (gam in package mgcv, version 1.3-17), and wGAMstep, a stepwise procedure based on wGAM.
3. GAMM: mixed model approach (gamm in package mgcv).
4. GAMBoost: GAMBoost with the number of boosting steps selected by AIC_C (Hurvich et al. 1998) in the Gaussian response examples and by AIC otherwise (package GAMBoost, version 0.9-3).

The following procedures are used as a performance reference:

• base: generalized linear model using only an intercept term, to check whether the performance of procedures can be worse than a model that uses no covariate information.
• GLMse: generalized linear model using covariate information, to check whether more flexible procedures might perform worse than this conservative procedure when the true structure is linear. Because all other procedures perform some sort of variable selection, we employ stepwise variable selection based on AIC (stepAIC from package MASS, version 7.2-27.1) starting from the full model. This also provides a performance reference for the identification of informative covariates. In initial investigations we also considered fitting the full model, i.e. without variable selection, but this did not result in big differences with respect to prediction performance.

For GLMse, bfGAM, wGAM, and wGAMstep the degrees of freedom used for the model selection criteria are given more weight by multiplying them by 1.4 (wGAMr is a variant of wGAM without this modification). In an earlier simulation study this consistently increased performance (but not for GAMBoost). Some reasoning for this is given by Kim and Gu (2004), who suggest a small range from 1.2 to 1.4. For the procedures used in the present study one explanation might be that they have to search for optimal smoothness parameters in a high-dimensional space (the dimension being the number of smoothing parameters to be chosen, typically equal to the number of covariates, p), guided by a model selection criterion. When close-to-minimum values of the criterion stretch over a large area in this space and the criterion is subject to random variation, there is a danger of moving accidentally towards an amount of smoothing that is too small. Increasing the weight of the degrees of freedom in the criterion reduces this potential danger (this reasoning is based on personal communication with Simon Wood). In contrast, for GAMBoost there is only a single parameter, the number of boosting steps (ignoring the penalty parameter). Therefore finding the minimum is not so problematic and increasing the weight of the degrees of freedom does not result in an improvement.

While for bfGAM (by default) smoothing splines are used, the other procedures employ regression splines with penalized estimation and therefore there is some flexibility in specifying the penalty structure. For GAMBoost and its B-spline basis we use a first order difference penalty of the form (1). For wGAM and wGAMr we use the "shrinkage basis" provided by the package, which leads to a constant function (instead of a linear function) when the penalty goes towards infinity. Therefore these procedures allow for complete exclusion of covariates from the final model instead of having to feature at least a linear term for each covariate. For wGAMstep no shrinkage basis is used, because covariates can be excluded in the stepwise course of the procedure.

3.3 Measures for comparison

We will use several measures to evaluate the performance. For the continuous response examples prediction performance is evaluated by the mean squared error on new data of size n_new = 1000. For a side by side comparison of prediction performance for various response types (e.g. to be able to judge whether a change in response type is similar to a change in signal-to-noise ratio) we consider the relative (predictive) squared error

RSE = Σ_i (y_i − μ̂_i)^2 / Σ_i (y_i − μ_i)^2,   (4)

calculated on the new data. It will be equal to one when the true model is recovered and larger otherwise.

While good prediction performance on new data is an important property, one usually hopes that the fitted functions allow some insight into the effect of covariates. Thus correct identification of informative covariates is important. As a measure for this we use the hit rate (i.e. the proportion of influential covariates identified) and the false alarm rate (i.e. the proportion of non-influential covariates wrongly deemed influential). These are rather crude measures and one could attempt a more detailed investigation in a testing framework, e.g. inspecting the power of procedures (while fixing the α-level) with respect to covariate identification. As such properties directly depend on the specific techniques employed for selecting degrees of freedom or penalty parameters, it is very difficult to make the present approaches comparable in this respect. Therefore we refrain from attempting a more in-depth evaluation.

In addition to the identification of influential covariates, also the shape of the fitted functions should be useful. While good approximation of the true functions may already be indicated by good prediction performance, the "wiggliness" of fitted functions strongly affects visual appearance and interpretation. One wants neither a function that is too smooth, not revealing local structure, nor an estimate that is too wiggly, with local features that reflect just noise. The former might be more severe, because information gets lost, while the overinterpretation of noise features in the latter case can be avoided by additionally considering confidence bands. A measure that was initially employed to control the fitting of smooth functions is the integrated squared curvature

∫ (f̂_j''(x))^2 dx   (5)

(see e.g. Green and Silverman 1994). We will use the integrated squared curvature of the single fitted functions evaluated over the interval [−2, 2] as a measure to judge their visual quality over a large number of fits.
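To make the recipe of Sect. 3.1 concrete, one simulated Gaussian-response data set (six covariates, three informative) together with the estimated signal-to-noise ratio (3) could be generated roughly as follows. This is only a sketch of the scheme described above: the empirical standardization of the component functions is simplified and all names are chosen for illustration.

## Sketch of one simulated data set following Sect. 3.1 (simplified).
set.seed(5)
p <- 6; n <- 100; n_new <- 1000; c_e <- 0.5

f_linear    <- function(x) x
f_sinus     <- function(x) sin(pi * (2 * (x + 2) / 4 - 1))
f_quadratic <- function(x) (2 * (x + 2) / 4 - 1)^2
f_invlogit  <- function(x) 1 / (1 + exp(-10 * (2 * (x + 2) / 4 - 1)))
cand <- list(f_linear, f_sinus, f_quadratic, f_invlogit)

## center and scale a sampled function to mean 0, sd 1 (empirically, large sample)
standardize <- function(f) {
  z <- f(qnorm(runif(1e5, pnorm(-2), pnorm(2))))
  function(x) (f(x) - mean(z)) / sd(z)
}

## N(0,1) covariates truncated to [-2, 2]
gen_x <- function(m) matrix(qnorm(runif(m * p, pnorm(-2), pnorm(2))), m, p)

idx <- sample(length(cand), 3, replace = TRUE)       # functions for 3 informative covariates
fs  <- lapply(cand[idx], standardize)
eta_of <- function(x) c_e * (fs[[1]](x[, 1]) + fs[[2]](x[, 2]) + fs[[3]](x[, 3]))

x     <- gen_x(n);     y     <- rnorm(n,     mean = eta_of(x))       # training data
x_new <- gen_x(n_new); y_new <- rnorm(n_new, mean = eta_of(x_new))   # test data

## (generalized) signal-to-noise ratio (3) estimated on the test data
mu_new <- eta_of(x_new)
stn <- sum((mu_new - mean(mu_new))^2) / sum((mu_new - y_new)^2)
stn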
Stat Comput (2008) 18: 87–99 93
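The comparison measures of Sect. 3.3 can then be computed from the output of any of the procedures. The snippet below shows the relative squared error (4), the hit and false alarm rates, and a numerical approximation to the integrated squared curvature (5); all inputs here are made-up placeholders rather than an actual fit.

## Sketch of the comparison measures of Sect. 3.3 (placeholder inputs).
set.seed(6)
mu    <- rnorm(1000)                  # true means on the test data
y     <- rnorm(1000, mean = mu)       # test responses
muhat <- mu + rnorm(1000, sd = 0.3)   # predictions of some fitted procedure

rse <- sum((y - muhat)^2) / sum((y - mu)^2)          # relative squared error (4)

truly_influential <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
selected          <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)  # as reported by a fit
hit_rate         <- mean(selected[truly_influential])
false_alarm_rate <- mean(selected[!truly_influential])

## integrated squared curvature (5) of a fitted function over [-2, 2],
## approximated from a numerical second derivative on a grid
fhat <- function(x) sin(pi * x / 2)        # stand-in for a fitted f_j
grid <- seq(-2, 2, length.out = 401)
h <- diff(grid)[1]
f2 <- diff(fhat(grid), differences = 2) / h^2
curv <- sum(f2^2) * h

c(RSE = rse, hit = hit_rate, false_alarm = false_alarm_rate, curvature = curv)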
4 Results

4.1 Reference example

Rather than presenting results in general, we first discuss a reference example and then evaluate the dependence on the underlying structure as outlined in the introduction. For additional material, results tables including deviances, hit rates and false positives as well as fitted curvature for all scenarios, see Binder and Tutz (2006). As a reference example we use a rather simple estimation problem with only a few covariates and a signal-to-noise ratio of medium size (c_e = 0.5). The response is continuous with Gaussian error, the shape of covariate influence is potentially non-linear for three covariates and there is no influence for the remaining three covariates. So there are six covariates in total, which are uncorrelated and observed together with the response for n = 100 observations. The mean estimated signal-to-noise ratio is 0.645, which corresponds to a mean R^2 of 0.391 for the true model. This is what might be expected for non-experimental, observational studies. For this example we did not expect problems when fitting a generalized additive model with any of the procedures. Nevertheless, for two out of the 20 repetitions no fit could be obtained for GAMM due to numerical problems.

Fig. 1  Mean squared error on new data from the reference example for all procedures used in the simulation study for 20 repetitions (only 18 for GAMM)

Figure 1 shows boxplots for the mean squared errors for all procedures. The procedure that fits a generalized linear model, GLMse, despite being not adequate for the underlying structure, is seen to improve over the baseline model (base), which does not use covariate information. Nevertheless, all procedures that fit a generalized additive model can improve over the linear models. This indicates that there is enough information in the data for the procedures to benefit from fitting non-linear functions. In this rather simple example they all have similar prediction performance, with GAMBoost showing a slight advantage in terms of mean performance and variability, and GAMM being close to the latter. The performance of the variant of wGAM that does not employ an increased weight for the degrees of freedom, wGAMr, is distinctly worse than that of wGAM.

Besides prediction performance, identification of informative covariates is also an important performance measure. The mean hit rates/false alarm rates for the procedures in this example are shown in Fig. 2. The stepwise procedure for fitting a generalized linear model (GLMse) is distinctly worse in terms of identification of influential covariates, probably because it discards covariates with e.g. influence of quadratic shape. For comparison, wGAM has the same false alarm rate but a much better hit rate at the same time. The unmodified version of the latter, wGAMr, employs a more lenient (implicit) criterion for calling covariates influential, reflected by an increased hit and false alarm rate (that is similar to that of bfGAM). For that specific approach to fitting of generalized additive models this adversely affects prediction performance (as seen from Fig. 1). In contrast, GAMBoost has an even more lenient (implicit) criterion for covariate selection, while having a very competitive prediction performance. The mixed model approach, GAMM, achieves a similar prediction performance with an implicit criterion for variable selection that is even more strict than that for wGAM. So there is the option of choosing a procedure with the desired behavior while maintaining good prediction performance.

For judgment of how wiggly or potentially over-smoothed the fitted functions are, Fig. 3 shows the integrated squared curvature (5) (from left to right) for fits to linear functions, f_quadratic, f_invlogit, and f_sinus for some of the procedures. The order is according to the integrated squared curvature of the true functions (given together with the function labels in Fig. 3), so one could expect that the fitted squared curvature would increase for each procedure from left to right.
Fig. 2  Mean hit rates/false alarm rates from the reference example for GLMse (2), bfGAM (3), wGAM (4), wGAMr (4*), wGAMstep (5), GAMM (6), and GAMBoost (7)

Fig. 3  Integrated squared curvature of fits (with extreme values cut off) to (from left to right) linear, quadratic, invlogit, and sinus functions (true integrated squared curvature given together with the function labels)

This is not the case. Especially for f_invlogit often no curvature, i.e. a linear function, is fitted. For the linear functions (leftmost four boxes) all procedures except GAMBoost fit linear functions in almost all instances. Except for the linear functions, the integrated squared curvature of the fits is always smaller than that of the true functions. This might indicate that there is not enough information in the data and therefore the bias-variance tradeoff implicitly performed by the procedures leads to a large bias. Overall, GAMM fits the least amount of curvature of all procedures, i.e. exhibits the largest bias. The curvature found for GAMBoost is rather large and therefore closer to the true curvature. In addition, it is very similar for all kinds of functions, which might be due to the specific penalization scheme used. Another explanation might be that fitted curvature increases with the number of boosting steps where a covariate receives an update. As more important covariates are targeted in a larger number of boosting steps, the integrated squared curvature increases for these. While this impedes adequate fits for linear functions, it provides the basis for less bias in fits to non-linear functions.

Having investigated the behavior of the procedures in this reference example, we now turn to the characteristics of the data highlighted in Sect. 1 and their effect on performance.
4.2 Dependence on the underlying structure

4.2.1 Amount of structure in the data

In the following examples the amount of information in the data is varied by scaling the predictor, i.e. using different values for the constant c_e. In addition, also binary and Poisson responses are considered, because these might be viewed as a reduction of information by coarsening of the response. All other parameters are the same as for the reference example. The signal-to-noise ratio (3) is used as a characteristic that makes examples comparable even for different types of response. Table 2 shows the mean RSE (4) for GLMse, bfGAM, wGAM, GAMM, and GAMBoost for three levels of predictor scaling for continuous (c_e ∈ {0.33, 0.5, 0.75}), binary (c_e ∈ {1, 1.5, 2}), and Poisson (c_e ∈ {0.33, 0.5, 0.75}) response examples. Note that especially for binary response examples frequently no fit could be obtained for GAMM and so the mean RSEs for this procedure may be misleading.

Table 2  Mean relative squared error (RSE) for examples with continuous, binary and Poisson response for three levels of signal-to-noise ratio (stn). For each example the number of repetitions where no fit could be obtained for a procedure is given in parentheses

stn       GLMse   bfGAM   wGAM    GAMM        GAMBoost
Gaussian
0.282     1.187   1.192   1.181   1.125 (4)   1.112
0.641     1.308   1.159   1.178   1.158 (2)   1.143
1.441     1.662   1.153   1.151   1.154 (1)   1.196
binary
0.465     1.285   1.214   1.343   1.161 (9)   1.202
0.871     1.538   1.318   1.241   1.219 (14)  1.240
1.330     1.754   1.327   1.318   1.313 (14)  1.277
Poisson
0.366     1.346   1.265   1.223   1.206 (8)   1.183
1.158     1.994   1.462   1.444   1.409 (4)   1.405
5.628     5.694   1.821   1.951   1.773 (4)   2.498

If binary and Poisson responses were essentially a coarsening, i.e. discarding of information, the RSEs should be similar for different response types when the signal-to-noise ratio is similar. This is seen not to be the case. The mean RSEs for binary and Poisson response examples are larger than those for examples with continuous response. Also, for the change of prediction performance following a change in signal-to-noise ratio there is a different pattern for different response types. For continuous response examples there is only little change in mean RSE as the signal-to-noise ratio increases. In tendency it decreases for bfGAM and wGAM, indicating better performance for larger signal-to-noise ratios, while for GAMBoost the mean RSE increases. The latter method performs best for small signal-to-noise ratios. For binary and Poisson response examples the mean RSEs increase for all methods as the signal-to-noise ratio increases. This different pattern might be explained by the fact that most of the procedures use different algorithms for the generalized response case compared to continuous response examples. For binary and Poisson response examples GAMBoost outperforms the other procedures, except for the Poisson example with a very large signal-to-noise ratio. GAMM, when it works, delivers competitive prediction performance for larger signal-to-noise ratios.

4.2.2 Number of non-influential covariates

Figure 4 shows the effect of the number of covariates on the mean squared error. All parameters of the data are the same as in the reference example, only the number of covariates is increased, i.e. more and more non-informative covariates are added. So the leftmost block of boxes is a subset of Fig. 1. For up to ten covariates all procedures for fitting generalized additive models show similar performance that is well ahead of GLMse. The largest variability in performance is found for bfGAM, the least for GAMM and GAMBoost, but note that for GAMM for two repetitions with six covariates and for five repetitions with ten covariates no fit could be obtained. For p = 20 and p = 50, GAMBoost and wGAMstep are the only procedures for which fits can still be obtained. While for p = 20 wGAMstep still has reasonable performance, for p = 50 it performs distinctly worse compared to GAMBoost.

For p = 10 covariates the mean hit rates/false alarm rates are as follows: bfGAM: 0.97/0.18; wGAM: 0.95/0.16; wGAMstep: 0.93/0.14; GAMM: 0.82/0.06; GAMBoost: 0.98/0.24. So the most severe drop in performance with respect to identification of influential covariates compared to the reference example is seen for GAMM (for which in addition for five repetitions no fit could be obtained). For all other procedures the performance basically stays the same. For p = 20 the mean hit rates/false alarm rates are 0.95/0.14 for wGAMstep and 0.97/0.19 for GAMBoost, and for p = 50 they are 0.92/0.18 and 0.97/0.17 respectively. It is surprising that while prediction performance drastically decreases for wGAMstep as the number of covariates increases, there is only a slight worsening of the hit rates and the false alarm rates. For GAMBoost even for p = 50 there is hardly any change compared to p = 10. So it is seen to perform very well in terms of identification of influential covariates as well as in terms of prediction performance for a large number of covariates.
Fig. 4  Effect of the total number of covariates p on mean squared error

Fig. 5  Effect of the number of covariates over which the information in the data is distributed ("info"), given a similar signal-to-noise ratio, on mean squared error

4.2.3 Number of influential covariates/distribution of information

The effect of changing the number of covariates over which the information in the data is distributed is illustrated in Fig. 5. The leftmost block of boxes shows the mean squared error from the reference example. In the block next to it there are six instead of three informative covariates, but the predictor is scaled (c_e = 0.33) such that the signal-to-noise ratio is approximately equal to that in the reference example. Note that now all covariates are informative. In the right two blocks there are ten covariates in total and a larger signal-to-noise ratio is used. Again the left of the two blocks has three and the right one six informative covariates, and the signal-to-noise ratio is fixed at a similar level for both (with c_e = 0.5 and c_e = 0.75). GAMBoost, compared to the other procedures, seems to be the least affected by switching from sparse scenarios to non-sparse scenarios. This may be explained by the algorithm, which distributes small updates over a number of covariates, where it does not matter much whether a smaller number of covariates or a larger number of covariates receives the updates.
Fig. 6  Relative squared error for examples with non-linear (first and third block from the left) vs. linear (second and fourth block) true covariate influence shape with similar signal-to-noise ratio

While the other procedures perform similarly to GAMBoost or even better when the information in the data is distributed over a small number of covariates, i.e. there is enough information per covariate to accurately estimate the single functions, their performance gets worse when functions for a larger number of covariates have to be estimated.

4.2.4 Amount of non-linearity in the data

Figure 6 compares prediction performance (by RSE) for examples with non-linear underlying structure (first and third block of boxes) to that with linear underlying structure (second and fourth block of boxes) with similar signal-to-noise ratio. The data in the leftmost block are from the reference example. In the corresponding example with linear structure (second block from the left) all procedures have improved performance, indicating that fitting linear functions is easier even for procedures that can fit generalized additive models. There is only a slight performance advantage from using the adequate procedure, GLMse, given that it is known that the true structure is linear (which will rarely be the case in applications). The results in the right two blocks of Fig. 6 are from binary response examples with ten covariates, of which six are informative, and a rather small signal-to-noise ratio (c_e = 1). When the true structure is non-linear (third block from the left) only GAMBoost can improve over GLMse (GAMM is not shown because for most repetitions no fit could be obtained). The good performance of GAMBoost seems to transfer to the example with linear underlying structure (rightmost block of boxes). In this example it is very competitive with GLMse. It might be argued that for such difficult situations there are optimized procedures for fitting linear models, e.g. the Lasso (Tibshirani 1996), but consideration of these would lead too far from here. So, at least when compared to standard procedures for fitting a (generalized) linear model, using GAMBoost does not seem to result in a loss of prediction performance in difficult data situations.

5 Summary and concluding remarks

There are several properties by which a data situation, in which a generalized additive model may be fitted, can be characterized. The present paper focused on the signal-to-noise ratio, the type of response, the number of (uninformative) covariates, the number of covariates over which information with respect to the response is distributed, and on whether there is actually non-linear structure in the data. Naturally, only a limited number of examples could be considered and we focused on problems where generalized additive models are difficult to fit, but still information can be extracted. Therefore our conclusions are mainly valid for such difficult settings.

Several techniques for fitting generalized additive models were reviewed and we evaluated how their performance changes when the characteristics of the data situation changed. We had to restrict to a limited number of approaches with a specific combination of the building blocks available for fitting of smooth functions, for selection of smoothness, and for selection of covariates. We specifically focused on typical implementations found in statistical packages, such as the R environment used here. Therefore we hope that the results are useful for guiding everyday data analysis.
Table 3  Summary of the results from the simulation study

Properties with respect to ...
• Covariate inclusion (parsimony): bfGAM fair; wGAM good; GAMM excellent; GAMBoost low.
• Covariate inclusion (hit rate): bfGAM good; wGAM fair; GAMM problematic; GAMBoost excellent.
• Appropriate smoothness: bfGAM good, with large variability; wGAM fair; GAMM problematic; GAMBoost good, sometimes undersmoothing.
• Number of covariates: bfGAM small; wGAM medium; GAMM small; GAMBoost large.
• Prediction performance: bfGAM good; wGAM good; GAMM very good; GAMBoost very good.

Effects of ...
• Signal-to-noise ratio (stn): bfGAM, wGAM and GAMM best for larger signal-to-noise ratio; GAMBoost best for small stn.
• Response type: GAMM problematic for binary; GAMBoost outperforms for binary.
• Large number of covariates: bfGAM large variability; wGAM: wGAMstep required; GAMM numerical problems; GAMBoost stable.
• Information distribution: bfGAM, wGAM and GAMM best when information is concentrated on few covariates; GAMBoost least affected.
• Linear true structure: reasonable results for continuous response; GAMBoost reasonable also for binary.

Table 3 summarizes the results from the simulation study. Criteria on which this overall characterization of approaches is based are prediction performance, identification of (influential) covariates (via hit rates/false alarm rates), and smoothness of the fitted functions (via integrated squared curvature). It is seen that none of the procedures performs best in all situations and with respect to all criteria.

We found that the properties of the approaches are somewhat detached from each other and this allows for separate consideration in the model fitting context at hand. For example, with respect to smoothness of fits and identification of influential covariates, the procedures have different properties, while at the same time they have very similar prediction performance in less difficult settings such as for the reference example. In such a situation approaches can be chosen according to the specific objectives of data analysis, i.e. the model fitting context has to be considered.

When there is only a relatively small number of covariates and very smooth (often linear) fits and a very strict criterion for calling covariates influential are wanted, then the mixed model approach (GAMM) is preferable. The parsimony of the fitted models comes at the cost of identifying a smaller proportion of influential variables. This was seen to be especially problematic with a larger number of covariates. One further disadvantage of GAMM is that it may not always be feasible due to numerical problems (especially for binary response data). With respect to prediction performance it has the advantage over the backfitting approach (bfGAM) and the smoothing parameter optimization approach (wGAM) that it performs well also for a smaller signal-to-noise ratio. When prediction performance is the main objective, and the signal-to-noise ratio is expected to be small, GAMBoost is also a very good option, but results in complex fits. This shows again that the single properties of the approaches are somewhat detached, as good prediction performance can be obtained either using very parsimonious fits (via GAMM) or using very complex fits (via GAMBoost). GAMBoost also seems to be well suited for binary response data, which is the type of application for which boosting approaches initially have been developed.

When the signal-to-noise ratio is expected to be larger (and there is a relatively small number of covariates), bfGAM and wGAM will be competitive with GAMM in terms of prediction performance, while at the same time resulting in a larger level of complexity for the fitted models, potentially corresponding to less oversmoothing. The fits obtained by bfGAM seem to be somewhat more complex compared to the fits from wGAM, and the former correspondingly exhibits larger variability when there is a larger number of covariates.

The extension of wGAM towards stepwise variable selection (wGAMstep) allows for obtaining reasonable fits up to a medium number of covariates, but when the number of covariates is relatively large compared to the number of observations, GAMBoost seems to be the only option. When there is a larger number of covariates there is also the danger that information is distributed over a large number of covariates.

In such situations it is generally very difficult to build a reasonable model from the data, but GAMBoost still seems to succeed. When information is concentrated on a small number of covariates, all approaches perform reasonably well, and again their other properties can be taken into consideration for choosing which one to use.

Finally, even when the true underlying structure is hardly non-linear, in the continuous response examples all approaches at least perform similarly to the linear model fits. So one can safely use such methods for fitting generalized additive models when the nature of the underlying structure is unclear. For the more difficult binary response examples only GAMBoost was still seen to be competitive compared to the (generalized) linear model fits when the true structure was linear.

As is the case for any simulation study that covers such a wide range of situations, the guidelines presented here can only be very rough. For more substantiated recommendations a separate, in-depth simulation study for every aspect would be required. In addition, when a generalized additive model is to be built from one specific data set, much more time will be devoted to careful data analysis. For instance, results such as non-convergence of the mixed model approach would be investigated more closely, e.g. identifying covariates which cause non-convergence when added to the model and thereby gaining insight into the modeling problem at hand. Nevertheless, the results from the present simulation study might be very useful in guiding the data analysis process.

Acknowledgements  We gratefully acknowledge support from Deutsche Forschungsgemeinschaft (Project C4, SFB 386 Statistical Analysis of Discrete Structures). We also want to thank the Associate Editor and the two anonymous reviewers for their valuable remarks which helped to considerably improve the paper.

References

Binder, H., Tutz, G.: Fitting generalized additive models: a comparison of methods. FDM-Preprint 93, University of Freiburg (2006)
Breiman, L.: Prediction games and arcing algorithms. Neural Comput. 11, 1493–1517 (1999)
Bühlmann, P., Yu, B.: Boosting with the L2 loss: regression and classification. J. Am. Stat. Assoc. 98, 324–339 (2003)
Chambers, J.M., Hastie, T.J.: Statistical Models in S. Wadsworth, Pacific Grove (1992)
Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2001)
Friedman, J.H., Stuetzle, W.: Projection pursuit regression. J. Am. Stat. Assoc. 76, 817–823 (1981)
Friedman, J., Hastie, T., Rosset, S., Tibshirani, R., Zhu, J.: Statistical behavior and consistency of classification methods based on convex risk minimization: discussion of the paper by T. Zhang. Ann. Stat. 32(1), 102–107 (2004)
Green, P.J., Silverman, B.W.: Nonparametric Regression and Generalized Linear Models. Chapman & Hall, London (1994)
Hand, D.J.: Classifier technology and the illusion of progress. Stat. Sci. 21(1), 1–14 (2006)
Hastie, T., Tibshirani, R.: Generalized additive models. Stat. Sci. 1, 295–318 (1986)
Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Chapman & Hall, London (1990)
Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12(1), 55–67 (1970)
Hurvich, C.M., Simonoff, J.S., Tsai, C.: Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. R. Stat. Soc. B 60(2), 271–293 (1998)
Kim, Y.-J., Gu, C.: Smoothing spline Gaussian regression: more scalable computation via efficient approximation. J. R. Stat. Soc. B 66(2), 337–356 (2004)
Lee, T.C.M.: Smoothing parameter selection for smoothing splines: a simulation study. Comput. Stat. Data Anal. 42, 139–148 (2003)
Lindstrom, M.J.: Penalized estimation of free-knot splines. J. Comput. Graph. Stat. 8(2), 333–352 (1999)
Linton, O.B., Härdle, W.: Estimation of additive regression models with known links. Biometrika 83, 529–540 (1996)
Marx, B.D., Eilers, P.H.C.: Direct generalized additive modelling with penalized likelihood. Comput. Stat. Data Anal. 28, 193–209 (1998)
McCullagh, P., Nelder, J.A.: Generalized Linear Models, 2nd edn. Chapman & Hall, London (1989)
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2006). ISBN 3-900051-07-0
Ruppert, D.: Selecting the number of knots for penalized splines. J. Comput. Graph. Stat. 11, 735–757 (2002)
Ruppert, D., Wand, M.P., Carroll, R.J.: Semiparametric Regression. Cambridge University Press, Cambridge (2003)
Speed, T.: Comment on "That BLUP is a good thing: the estimation of random effects" by G.K. Robinson. Stat. Sci. 6(1), 42–44 (1991)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58(1), 267–288 (1996)
Tutz, G., Binder, H.: Generalized additive modelling with implicit variable selection by likelihood based boosting. Biometrics 62, 961–971 (2006)
Wand, M.P.: A comparison of regression spline smoothing procedures. Comput. Stat. 15, 443–462 (2000)
Wang, Y.: Mixed effects smoothing spline analysis of variance. J. R. Stat. Soc. B 60(1), 159–174 (1998)
Wood, S.N.: Modelling and smoothing parameter estimation with multiple quadratic penalties. J. R. Stat. Soc. B 62(2), 413–428 (2000)
Wood, S.N.: Stable and efficient multiple smoothing parameter estimation for generalized additive models. J. Am. Stat. Assoc. 99(467), 673–686 (2004)
Wood, S.N.: Generalized Additive Models. An Introduction with R. Chapman & Hall/CRC, Boca Raton (2006)
