
Smoothing techniques for visualisation

Adrian W. Bowman

Department of Statistics
The University of Glasgow
Glasgow G12 8QQ, U.K.
adrian@stats.gla.ac.uk

1 Introduction

Graphical displays are often constructed to place principal focus on the indi-
vidual observations in a dataset and this is particularly helpful in identifying
both the typical positions of datapoints and unusual or influential cases. How-
ever, in many investigations principal interest lies in identifying the nature of
underlying trends and relationships between variables and so it is often help-
ful to enhance graphical displays in ways which give deeper insight into these
features. This can be very beneficial both for small datasets, where variation
can obscure underlying patterns, and large datasets, where the volume of data
is so large that effective representation inevitably involves suitable summaries.
These issues are particularly prominent in a regression setting, where it is
the nature of the relationships between explanatory variables and the mean
value of a response which is the focus of attention. Nonparametric smoothing
techniques are extremely useful in this context as they provide an estimate of
the underlying relationship without placing restrictions on the shape of the
regression function, apart from an assumption of smoothness.
This is illustrated in Figure 1 where the left hand panel displays a scatter-
plot of data collected by the Scottish Environment Protection Agency on the
level of dissolved oxygen close to the start of the Clyde estuary. Data from a
substantial section of the River Clyde are analysed in detail by McMullan et
al. (2005), who give the background details. Water samples have been taken
at irregular intervals over a long period. The top left panel plots the data
against time in years. The large amount of variation in the plot against year
makes it difficult to identify whether any underlying trend is present. The
top right hand panel adds a smooth curve to the plot, estimating the mean
value of the response as a function of year. Some indication of improvement in
DO emerges, with the additional suggestion that this improvement is largely
restricted to the earlier years. The smooth curve therefore provides a signif-
icant enhancement of the display by drawing attention to features of some
potential importance which are not immediately obvious from a plot of the raw data. However, these features require further investigation to separate real evidence of change from the effects of sampling variation.


Fig. 1. The left panels show data on dissolved oxygen (DO) in the Clyde estuary,
plotted against year and day within the year. The right hand panels add smooth
curves as estimates of the underlying regression functions.

In exploring the effect of an individual variable it is also necessary to consider the simultaneous effects of others. The lower left panel shows the
data plotted against day of the year. Water samples are not taken every day
but, when the samples are plotted by day of the year across the entire time
period, a very clear relationship is evident. This seasonal effect is a periodic
one and so this should be reflected in an appropriate estimate. The smooth
curve added to the lower right panel has this periodic property. It also suggests
that a simple trigonometric shape may well be adequate in describing the
seasonal effect. Once a suitable model for this variable has been constructed,
it will be advisable to re-examine the relationship of DO and year, adjusted for the seasonal effect.
The aim of this chapter is to discuss the potential benefits of enhancing
graphical displays in this manner and to illustrate the insights which this can
bring to a variety of types of regression data. From this perspective, graph-
ics are oriented towards exploration of appropriate models for data, as well
as towards the display of the observed data themselves. In Section 2, simple
methods of constructing smooth estimates are described and illustrated. The
ideas are developed in the context of response data on a continuous measure-
ment scale but the more general applicability of the concept is indicated by
an extension to binary response data. The graphical methods employed are
also extended beyond simple displays of the underlying regression estimate to
include indications of variability and of the suitability of simple parametric
models. These concepts are extended further in Section 3 where displays of
nonparametric regression surfaces relating a response variable to two explana-
tory variables are discussed. The addition of information on variability and
the suitability of parametric models are revisited in this setting. Situations
involving several covariates are discussed in Section 4, where additive models
are used to provide descriptions of each separate regression component. Some
final discussion is given in Section 5.

2 Smoothing in one dimension


There are many ways in which a nonparametric regression curve can be
constructed. These include orthogonal basis functions, a wide variety of approaches based on splines and, more recently, methods based on wavelets.
While there are important differences between these approaches from a tech-
nical perspective, the particular choice of technique for the construction of a
nonparametric regression curve is less important in a graphical setting. The
principal issue is how an estimate can be used to best effect, rather than the details of its construction.
For convenience, this chapter will make use of local linear methods of non-
parametric regression. These have the advantages of being simple to explain
and easy to implement, as well as having theoretical properties which are
amenable to relatively straightforward analysis. A further advantage lies in
the link with the chapter on smoothing by Loader (2004) in an earlier Com-
putational Statistics Handbook where many of the technical details can be
found. The basic ideas of the method are described below but the emphasis
thereafter is on the use of the technique to enhance graphical displays.
With regression data of the form {(xi , yi ) : i = 1, . . . , n}, where y denotes
a response variable and x a covariate, a general prescription of a model is
provided by
yi = m(xi ) + εi ,
where m denotes a regression function and the εi denote independent errors.
A parametric form for m can easily be fitted by the method of least squares. A
nonparametric estimate of m can be constructed simply by fitting a parametric
model locally. For example, an estimate of m at the covariate value x arises
from minimising the weighted least squares
Σ_{i=1}^{n} {yi − α − β(xi − x)}² w(xi − x; h)        (1)

over α and β. The estimate m̂(x) is the fitted value of the regression at x,
namely α̂. By choosing the weight function w(xi − x; h) to be decreasing in
|xi −x|, the linear regression is fitted locally as substantial weight is placed only
on those observations near x. Unless otherwise noted, this chapter will adopt
a weight function w which is a normal density centred on 0 with standard
deviation h. The parameter h controls the width of the weight function and
therefore the extent of its local influence. This in turn dictates the degree of
smoothness of the estimate. For this reason h is usually referred to as the
smoothing parameter or bandwidth.
Computationally, the solution of the weighted least squares (1) is straight-
forward, leading to an estimate of the form m̂(x) = v ⊤ y, where the vector v
is a simple function of x, the covariate values xi and the weights w(xi − x; h).
Specifically, the ith element of v is
vi = (1/n) {s2(x; h) − s1(x; h)(xi − x)} w(xi − x; h) / {s2(x; h) s0(x; h) − s1(x; h)²},

where sr(x; h) = {Σ_{i=1}^{n} (xi − x)^r w(xi − x; h)}/n. The estimate can therefore
be computed at a set of covariate values through the expression Sy, where
S denotes a smoothing matrix whose rows contain the vectors v required to
construct the estimate at the points of interest.
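These formulae are short to code. The following sketch is an illustrative Python/numpy translation (the chapter's own computations were carried out in R, and the function name here is hypothetical): it builds the weight vector v and returns m̂(x) = v⊤y.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate m_hat(x0) = v'y with normal weights of sd h."""
    n = len(x)
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # normal weight, centred at x0
    s0, s1, s2 = (np.mean((x - x0) ** r * w) for r in (0, 1, 2))
    v = (s2 - s1 * (x - x0)) * w / (n * (s2 * s0 - s1 ** 2))
    return v @ y

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)
print(local_linear(5.0, x, y, h=0.5))                # roughly sin(5)
```

A useful check on such code is that a local linear fit reproduces exactly linear data, whatever the bandwidth.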
This representation emphasises the important fact that the estimation
process is linear in the response data y. It also suggests useful analogies with
standard linear modelling techniques. In particular, the degrees of freedom
associated with a linear model can be identified as the trace of the projection
matrix P which creates the fitted values as ŷ = P y. It is therefore convenient
to define the approximate degrees of freedom associated with a nonparamet-
ric estimate as ν = tr{S}, where S is the smoothing matrix which creates
the fitted values at the observed covariate values {xi ; i = 1, . . . , n}. As the
smoothing parameter h is increased, the influence of the weight function ex-
tends across a greater range of the covariate axis and the flexibility of the
estimate is reduced. This corresponds to a reduction in the approximate de-
grees of freedom associated with the estimate.
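The smoothing matrix S and the approximate degrees of freedom ν = tr{S} can be sketched in the same illustrative Python setting (hypothetical names, not the chapter's own R code):

```python
import numpy as np

def smoothing_matrix(x, h):
    """Local linear smoothing matrix S: row j holds the weight vector v at x_j."""
    n = len(x)
    S = np.empty((n, n))
    for j, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        s0, s1, s2 = (np.mean((x - x0) ** r * w) for r in (0, 1, 2))
        S[j] = (s2 - s1 * (x - x0)) * w / (n * (s2 * s0 - s1 ** 2))
    return S

x = np.linspace(0, 10, 100)
for h in (0.2, 1.0, 5.0):
    nu = np.trace(smoothing_matrix(x, h))            # approximate degrees of freedom
    print(h, round(nu, 1))                           # nu falls as h grows
```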
The approximate degrees of freedom therefore provide a helpful alternative
scale on which degree of smoothness can be expressed. The estimate for year
shown in Figure 1 was produced with a smoothing parameter corresponding
to 4 degrees of freedom, namely h = 3.16. This allows a moderate degree
of flexibility in the curve beyond the 2 degrees of freedom associated with a simple linear shape.
The choice of smoothing parameter h, or equivalently of the approximate
degrees of freedom ν, is therefore of some importance. From a graphical and
exploratory perspective it is helpful to plot the estimates over a wide range of
smoothing parameters, to view the effects of applying different degrees of local
fitting. This is particularly effective in the form of an interactive animation.
However, it is also worthwhile to consider ways of automatically identifying
suitable choices of smoothing parameter.
Some very effective methods of doing this have been developed for particu-
lar types of regression problem but other proposals have the advantage of very
wide applicability. One of the most popular of these has been cross-validation,
where h is chosen to minimise
Σ_{i=1}^{n} {yi − m̂_{−i}(xi)}².

The subscript on m̂−i indicates that the estimate is constructed from the
dataset with the ith observation omitted and so the criterion being minimised
represents the prediction error of the estimate. There is some evidence that
this approach produces substantial variation in its selected smoothing param-
eters. This chapter will therefore use an alternative criterion, proposed by
Hurvich et al. (1998), based on Akaike’s information criterion (AIC). This
chooses h to minimise
log(RSS/n) + 1 + 2(ν + 1)/(n − ν − 2),        (2)

where RSS denotes the residual sum-of-squares Σ_{i=1}^{n} {yi − m̂(xi)}² and, as
described above, ν = tr{S}. In general, this method offers a very useful, and
usually very effective, means of selecting an appropriate degree of smoothness.
However, any method of automatic selection must be used carefully.
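A grid search over h using the Hurvich et al. (1998) criterion, in the form log(RSS/n) + 1 + 2(ν + 1)/(n − ν − 2), can be sketched as follows (illustrative Python with hypothetical helper names):

```python
import numpy as np

def smooth_matrix(x, h):
    # local linear smoothing matrix (normal weights, sd h), as described above
    n = len(x)
    S = np.empty((n, n))
    for j, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        s0, s1, s2 = (np.mean((x - x0) ** r * w) for r in (0, 1, 2))
        S[j] = (s2 - s1 * (x - x0)) * w / (n * (s2 * s0 - s1 ** 2))
    return S

def aicc(x, y, h):
    S = smooth_matrix(x, h)
    n = len(x)
    nu = np.trace(S)                                 # approximate degrees of freedom
    rss = np.sum((y - S @ y) ** 2)
    return np.log(rss / n) + 1 + 2 * (nu + 1) / (n - nu - 2)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 150))
y = np.sin(x) + rng.normal(0, 0.3, 150)
grid = np.linspace(0.1, 3.0, 30)
h_opt = grid[np.argmin([aicc(x, y, h) for h in grid])]
print(h_opt)
```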
In the initial plots of the DO data in Figure 1 it was noted that the day
effect is a periodic one. This can easily be accommodated in the construction
of a smooth estimate by employing a periodic weight function. Since a linear
model is not appropriate for periodic data, a locally weighted mean offers a
simple solution. A smooth estimate is then available as the value of α which
minimises the weighted least squares
Σ_{i=1}^{n} {yi − α}² exp{ (1/h) cos(2π(xi − x)/366) }.

This uses an unscaled von Mises density as a weight function, with period 366
days to allow for leap years. In order to allow the estimate to express shapes
beyond a standard trigonometric pattern, the degrees of freedom were set to
the slightly higher value of 6 in constructing the estimate of the seasonal effect
in Figure 1.
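Since the minimising α is simply a weighted mean, the periodic estimate is short to code (an illustrative sketch with hypothetical names):

```python
import numpy as np

def periodic_smooth(x0, x, y, h):
    """Locally weighted mean with an unscaled von Mises weight, period 366."""
    w = np.exp(np.cos(2 * np.pi * (x - x0) / 366) / h)
    return np.sum(w * y) / np.sum(w)

days = np.arange(366.0)
y = 7 + 2 * np.cos(2 * np.pi * (days - 30) / 366)    # idealised seasonal shape
print(periodic_smooth(30.0, days, y, h=0.3))         # attenuated toward the mean
```

By construction the estimate is exactly periodic: day 0 and day 366 give identical values.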
Nonparametric curve estimates are very useful as a means of highlighting the potential shapes of underlying regression relationships. However, like any
estimate based on limited information, they are subject to the effects of vari-
ability. Indeed, the flexibility which is the very motivation for a nonparametric
approach will also increase the sensitivity of the estimate to sampling varia-
tion in the data. It is therefore important not only to examine curve estimates
but also to examine their associated variability.
The linear representation of the estimate as m̂(x) = v⊤y, where v is a known vector as discussed above, means that its variance is readily available as var{m̂(x)} = (Σ_{i=1}^{n} vi²) σ², where σ² is the common variance of the errors εi. The calculation of standard errors then requires an estimate of σ². Pursuing
the analogy with linear models mentioned above leads to proposals such as
σ̂ 2 = RSS/df , where df is an appropriate value for the degrees of freedom for
error. Other approaches are based on local differencing. A particularly effective
proposal of Gasser et al. (1986) is based on the deviations of each observation
from the linear interpolation between its neighbours, as ε̃i = yi − ai yi−1 −
(1 − ai )yi+1 , where ai = (xi+1 − xi )/(xi+1 − xi−1 ), under the assumption that
the data have been ordered by increasing x-value. This leads to the estimate
σ̂² = {1/(n − 2)} Σ_{i=2}^{n−1} ε̃i² / {1 + ai² + (1 − ai)²}.

The standard error of m̂(x) is then available as √(Σ_{i=1}^{n} vi²) σ̂.
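The Gasser et al. (1986) estimate can be sketched as follows (illustrative Python; x is assumed sorted, as in the text, and the function name is hypothetical):

```python
import numpy as np

def sigma2_hat(x, y):
    """Difference-based estimate of sigma^2 (Gasser et al., 1986); x sorted."""
    a = (x[2:] - x[1:-1]) / (x[2:] - x[:-2])         # a_i for i = 2, ..., n-1
    eps = y[1:-1] - a * y[:-2] - (1 - a) * y[2:]     # deviation from the interpolant
    return np.sum(eps ** 2 / (1 + a ** 2 + (1 - a) ** 2)) / (len(x) - 2)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 500))
y = np.sin(x) + rng.normal(0, 0.5, 500)
print(sigma2_hat(x, y))                              # roughly 0.25
```

Exactly linear data interpolate without error, so the estimate is then zero; this is a convenient check on the indexing.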
The left hand plot of Figure 2 shows the estimate of the seasonal effect
in the Clyde data with curves indicating a distance of two standard errors
from the estimated regression function. Some care has to be exercised in the
interpretation of this band. The smoothing inherent in the construction of
a nonparametric regression curve inevitably leads to bias, as discussed by
Loader (2004). The band cannot therefore be given a strict interpretation in
terms of confidence. However, it does give a good indication of the degree of
variability in the estimate and so it is usually referred to as a variability band.
A natural model for the seasonal effect is a shifted and scaled cosine curve,
of the form

yi = α + β cos{2π(xi − θ)/366} + εi,
where the smaller effect across years, if present at all, is ignored at the moment.
McMullan et al. (2003) describe this approach. A simple expansion of the
cosine term allows this model to be written in simple linear form, which can
then be fitted to the observed data very easily.
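Expanding the cosine gives β cos{2π(x − θ)/366} = γ1 cos(2πx/366) + γ2 sin(2πx/366), with β = √(γ1² + γ2²) and θ recovered from (γ1, γ2). A sketch of the resulting least squares fit (illustrative Python with hypothetical names):

```python
import numpy as np

def fit_cosine(x, y):
    """Fit y = alpha + beta*cos(2*pi*(x - theta)/366) by expanding the cosine
    into cos and sin terms and using ordinary least squares."""
    c = np.cos(2 * np.pi * x / 366)
    s = np.sin(2 * np.pi * x / 366)
    X = np.column_stack([np.ones_like(x), c, s])
    alpha, g1, g2 = np.linalg.lstsq(X, y, rcond=None)[0]
    beta = np.hypot(g1, g2)                          # amplitude
    theta = 366 * np.arctan2(g2, g1) / (2 * np.pi)   # phase shift in days
    return alpha, beta, theta

days = np.arange(366.0)
y = 7 + 2 * np.cos(2 * np.pi * (days - 40) / 366)    # noiseless test shape
print(fit_cosine(days, y))
```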
However, some thought is required in comparing a parametric form with
a nonparametric estimate. As noted above, bias is an inevitable consequence
of nonparametric smoothing. We should therefore compare our nonparamet-
ric estimate with what we expect to see when a nonparametric estimate is
constructed from data which were generated by the cosine model. This can
easily be done by considering E{m̂(x)}, where the expectation is calculated


Fig. 2. The left panel shows a smooth curve as an estimate of the underlying
regression function for the seasonal effect in the Clyde data, with variability bands
to indicate the precision of estimation. The thick line denotes a smoothed version
of a shifted and scaled cosine model. The right panel shows an estimate of the year
effect after adjustment for the seasonal effect. A reference band has been added to
indicate where a smooth curve is likely to lie if the underlying relationship is linear.

under the cosine model. The simple fact that E{Sy} = SE{y} suggests that
we should compare the nonparametric estimate Sy with a smoothed version
of the vector of fitted values ŷ from the cosine model, namely S ŷ. This curve
has been added to the left hand plot of Figure 2 and it agrees very closely
with the nonparametric estimate. The cosine model can therefore be adopted
as a good description of the seasonal effect.
Since the seasonal effect in the Clyde data is so strong, it is advisable to
re-examine the year effect after adjustment for this. Nonparametric models
involving more than one covariate will be discussed later in the chapter. For
the moment, a simple expedient is to plot the residuals from the cosine model
against year, as shown in the right hand panel of Figure 2. The reduction in
variation over the marginal plot of DO against year is immediately apparent.
It is now also natural to consider whether a simple linear model might be
adequate to describe the underlying relationship, with the curvature exhibited
by the nonparametric estimate attributable to sampling variation. Variability
bands provide one way of approaching this. However, a more direct way of
assessing the evidence is through a reference band which indicates where a
nonparametric estimate is likely to lie if the underlying regression is indeed
linear. Since bias in the nonparametric estimate depends on the curvature of
the underlying regression function, it follows that a nonparametric estimate,
fitted by the local linear method, is unbiased in the special case of data from a
linear model. If the fitted linear model at the covariate value x is represented as Σ_{i=1}^{n} li yi, then it is straightforward to see that the variance of the difference between the linear and nonparametric models is simply Σ_{i=1}^{n} (vi − li)² σ². On
substituting an estimate of σ², a reference band extending for a distance of two standard errors above and below the fitted linear model can then easily be constructed.
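A sketch of such a reference band (illustrative Python; σ is taken as known here for simplicity, whereas in practice the estimate σ̂ of the previous section would be substituted, and all names are hypothetical):

```python
import numpy as np

def reference_band(x, y, h, sigma):
    """Reference band for linearity: least squares line +/- 2 s.e. of the
    difference between the local linear and linear fits."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    L = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix of the line
    S = np.empty((n, n))                             # local linear smoothing matrix
    for j, x0 in enumerate(x):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        s0, s1, s2 = (np.mean((x - x0) ** r * w) for r in (0, 1, 2))
        S[j] = (s2 - s1 * (x - x0)) * w / (n * (s2 * s0 - s1 ** 2))
    se = sigma * np.sqrt(np.sum((S - L) ** 2, axis=1))   # sqrt(sum (v_i - l_i)^2) sigma
    line = L @ y
    return line - 2 * se, line + 2 * se, S @ y       # band and nonparametric fit

x = np.linspace(0, 10, 80)
rng = np.random.default_rng(7)
y = np.sin(x) + rng.normal(0, 0.3, 80)
low, high, fit = reference_band(x, y, 0.7, 0.3)
print(np.mean((fit < low) | (fit > high)))           # share of points outside the band
```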
This is illustrated in the right hand panel of Figure 2, where the evidence
against a simple linear model is confirmed in this graphical manner. This
addition to the plot has therefore identified an important feature which is not
easily spotted from plots of the raw data.
A formal, global test can also be carried out, as described by Bowman &
Azzalini (1997), but the discussion here will be restricted to graphical aspects.
It should also be noted that the calculations for the reference band have been
adjusted to account for the correlations in the residuals from the fitted cosine
model. However, this effect is a very small one.
The idea of local fitting of a relevant parametric model is a very pow-
erful one which can be extended to a wide variety of settings and types of
data. For example, nonparametric versions of generalised linear models can
be constructed simply by adding suitable weights to the relevant log-likelihood
function. Under an assumption of independent observations, the log-likelihood for a generalised linear model can be represented as Σ_{i=1}^{n} li(α, β). A local likelihood for nonparametric estimation at the covariate value x can then be constructed as

Σ_{i=1}^{n} li(α, β) w(xi − x; h)

and the fitted value of the model at x extracted to provide the nonparametric
estimate m̂(x).
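For the logistic case, the weighted log-likelihood can be maximised by weighted Fisher scoring. The following sketch is illustrative (hypothetical names and simulated data, not the chapter's code; Fan et al. (1995) give the proper technical treatment):

```python
import numpy as np

def local_logistic(x0, x, y, h, n_iter=25):
    """Local linear logistic estimate of p(x0) by weighted Fisher scoring."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)           # normal kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - p))                   # weighted score
        info = (X * (w * p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, grad)
    return 1 / (1 + np.exp(-beta[0]))                # p_hat(x0) = logistic(alpha_hat)

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(10, 20, 300))
p_true = 1 / (1 + np.exp(-(x - 15)))
y = rng.binomial(1, p_true).astype(float)
print(local_logistic(15.0, x, y, h=2.0))             # roughly 0.5
```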
An example is provided by the data displayed in the top left panel of
Figure 3 which indicate whether the dissolved oxygen in water samples, taken
from two well-separated monitoring points on the river, was below (1) or above
(0) a threshold of 5 mg/l, which is the level required for healthy resident and
migratory fish populations. In order to standardise for the year and seasonal
effects which were noted in the earlier analysis, the measurements considered
here are restricted to the years from 1982 onwards and to the summer months
of May and June. The level of dissolved oxygen is likely to be related to
temperature, which has therefore been used as a covariate. It is particularly
difficult to assess the nature of any underlying relationship when the response
variable is binary, even when some random variation has been added to the
response values to allow the density of points to be identified more clearly.
A natural model in this setting is a simple logistic regression where the
probability p(x) of observing a measurement below the threshold is assumed
to be described by exp(α+βx)/(1+exp(α+βx)), where α and β are unknown
parameters. The log-likelihood contribution for an observation (xi , yi ) can be
written as yi log{p(xi )} + (1 − yi ) log{1 − p(xi )}. Maximising the weighted
log-likelihood, as described above and using a smoothing parameter h = 2,
produces the nonparametric estimate shown in the top right panel of Figure 3.
The technical details of this process are described by Fan et al. (1995).

Fig. 3. The top left panel shows data on the occurrence of very low (< 5 mg/l) levels of dissolved oxygen in the River Clyde, related to the temperature of the water. The top
right panel shows a smooth nonparametric regression curve to display the pattern of
change in the probability of very low dissolved oxygen as a function of temperature.
The bottom left panel displays standard error bands to indicate the variability in the
nonparametric estimate. The bottom right panel shows nonparametric regression
curves simulated from the fitted logistic model, together with the nonparametric
curve from the observed data.

The generally increasing nature of the relationship between low measurements and temperature is apparent. However, the downturn in the estimate of
the probability for the highest temperature is a surprise. Again, it is helpful
to add information on variability to the plot. The bottom left hand panel of
Figure 3 shows variability bands for the estimate, constructed by carrying the
weights through the usual process of deriving standard errors in generalised
linear models. These indicate a high degree of variability at high temperatures.
The suitability of a linear logistic model can be assessed more directly by
constructing a reference band. A simple way of doing this here is to simulate data from the fitted linear logistic model and construct a nonparametric estimate from each set of simulated data. The results of repeating this 50 times are displayed in the bottom right panel of Figure 3. The appearance of
some other estimates with curvature similar to that exhibited in the original
estimate offers reassurance that the data are indeed consistent with the linear
logistic model and prevents an inappropriate interpretation of a feature in
the nonparametric estimate which can reasonably be attributed to sampling
variation.
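This simulation scheme is simple to sketch (illustrative Python with hypothetical names; for brevity a locally weighted mean of the binary responses stands in for the local likelihood estimate used in the text):

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Ordinary (global) linear logistic fit by Fisher scoring."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        info = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    return beta

def local_mean(x0, x, y, h):
    """Simple kernel-weighted mean of the binary responses."""
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(10, 20, 300))
y = rng.binomial(1, 1 / (1 + np.exp(-(x - 15)))).astype(float)

b = fit_logistic(x, y)
p_fit = 1 / (1 + np.exp(-(b[0] + b[1] * x)))         # fitted linear logistic model
grid = np.linspace(11, 19, 20)
sims = []
for _ in range(50):
    y_sim = rng.binomial(1, p_fit).astype(float)     # one simulated data set
    sims.append([local_mean(g, x, y_sim, 2.0) for g in grid])
sims = np.array(sims)                                # 50 simulated smooth curves
print(sims.shape)
```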

3 Smoothing in two dimensions


The implementation of smoothing techniques with two covariates is a particu-
larly important application because of the wide variety of types of data where
it is helpful to explore the combined effects of two variables. Spatial data
provide an immediate example, where the local characteristics of particular
regions lead to measurement patterns which are often not well described by
simple parametric shapes. As in the univariate case, a variety of different ap-
proaches to the construction of a smooth estimate of an underlying regression
function is available. In particular, the extension of the local linear approach
is very straightforward. From a set of data {(x1i , x2i , yi ) : i = 1, . . . , n}, where
y denotes a response variable and x1 , x2 are covariates, an estimate of m at
the covariate value x arises from minimising the weighted least squares
Σ_{i=1}^{n} {yi − α − β1(x1i − x1) − β2(x2i − x2)}² w(x1i − x1; h1) w(x2i − x2; h2)

over α, β1 and β2 . The estimate m̂(x) is the fitted value of the regression
at x, namely α̂. More complex forms of weighting are possible and Härdle et
al. (2004) give a more general formulation. However, the product form shown
above is particularly attractive in the simplicity of its construction.
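The two-covariate estimate is again a small weighted least squares problem (an illustrative Python sketch with hypothetical names):

```python
import numpy as np

def local_linear_2d(x0, z0, x, z, y, h1, h2):
    """Local linear estimate with a product of normal weights in two covariates."""
    w = (np.exp(-0.5 * ((x - x0) / h1) ** 2) *
         np.exp(-0.5 * ((z - z0) / h2) ** 2))
    X = np.column_stack([np.ones_like(x), x - x0, z - z0])
    WX = X * w[:, None]
    coef = np.linalg.solve(X.T @ WX, WX.T @ y)       # weighted normal equations
    return coef[0]                                   # alpha_hat = m_hat(x0, z0)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 400)
z = rng.uniform(0, 1, 400)
y = np.sin(2 * np.pi * x) * z + rng.normal(0, 0.1, 400)
print(local_linear_2d(0.5, 0.5, x, z, y, 0.1, 0.1))
```

As in one dimension, the estimate reproduces a plane exactly, which gives a simple correctness check.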
From the form of this weighted sum-of-squares, it is immediately obvious
that the estimator m̂(x) again has linear form v ⊤ y, for a vector of known
constants v. The concept of approximate degrees of freedom then transfers
immediately, along with automatic methods of smoothing parameter selection
such as the AIC method described in (2). Estimation of the underlying error
variance needs more specific thought, although the same principles of local
differencing apply, as described by Munk et al. (2005). For the particular case
of two covariates a method based on a very small degree of local smoothing
is also available, as described by Bock et al. (2005), and this is used in the
illustrations of this chapter.
Figure 4 displays a smooth estimate of a regression surface derived from
data on a catch score, representing the abundance of marine life on the sea
bed at various sampling points in a region near the Great Barrier Reef, as

Fig. 4. A nonparametric estimate of a regression surface relating the mean level of a catch score to latitude and longitude, using data from the Great Barrier Reef.

a function of latitude and longitude. Poiner et al. (1997) describe the back-
ground to these data and Bowman & Azzalini (1997) illustrate the application
of smoothing techniques on various subsets. Here, the data for two successive
years are examined to investigate the relationship between the catch score and
the covariates latitude and longitude.
Several types of graphical display are available for the resulting estimate.
Figure 4 uses both colour shading and contour levels to indicate the height of
the estimated regression surface. The simultaneous use of both is helpful in
enhancing the interpretation in a form familiar to users of geographical maps.
However, three-dimensional projections are also easy to construct with many
software packages and the ability to render surfaces to a high degree of visual
quality is now commonly available. Display is further enhanced by animated
rotation, providing a very realistic perception of a real three-dimensional ob-
ject. Figure 5 displays this kind of representation in static form.
Figures 4 and 5 were produced with the smoothing parameters (h1 , h2 ) =
(0.18, 0.09) selected by AIC and equivalent to 12 degrees of freedom. This
choice was based on a single underlying parameter h which was then scaled
by the sample standard deviations, s1 , s2 , of the covariates to provide a pair
of smoothing parameters as (hs1 , hs2 ). A common smoothing parameter for
each dimension, or unrestricted choice of h1 and h2 , could also be allowed.

Fig. 5. A smooth surface representing an estimate of the regression function of catch score on latitude and longitude simultaneously.

A clear drop in mean catch score as longitude increases is indicated. However, more detailed insight is available from the estimated surface, with clear
indication of a non-linear pattern with increasing longitude. In fact, due to the
orientation of the coast in this region, longitude broadly corresponds to dis-
tance offshore, leading to a natural biological interpretation, with relatively
constant levels of marine life abundance near the coast followed by rapid
decline in deeper water. The smooth surface therefore provides a significant
enhancement of the display by drawing attention to features of some potential
importance which are not immediately obvious from plots of the raw data.
It is natural to revisit in the setting of two covariates the discussion of the
benefits of adding further information, particularly on variability and reference
models, to displays of nonparametric estimates. The graphical issues are now
rather different, given the very different ways in which the estimate itself must
be represented. One possibility is to mix the different types of representation
(colour, contours, three-dimensional surfaces) with an estimate represented in
one way and information on variability in another. For example, the colour and
contour displays in Figure 4 might be used for these two different purposes,
although this particular combination can be rather difficult to interpret.

Fig. 6. An estimate of the regression function of catch score on latitude and lon-
gitude, with colour coding to add information on the relative size of the standard
error of estimation across the surface.

One attractive option is to combine surface and colour information and Figure 6 illustrates this by painting the estimated regression surface to indicate
the variations in standard error in different locations. Discrete levels of colour
have been used for simplicity, with a separate colour associated with each
individual polygonal surface panel. This highlights the areas where precision
in the estimate is low. It is no surprise to find these at the edges of the surface,
where information is less plentiful. However, it is particularly helpful to be able
to identify the relatively high standard errors at low latitude.
This idea extends to the assessment of reference models, such as a linear
trend across latitude and longitude. This is shown in the right hand panel
of Figure 7, where the surface panels are painted according to the size of the
standardised difference (m̂ − p̂)/s.e.(m̂ − p̂), to assess graphically the plausibility of a linear model p̂ in two covariates. The red panels indicate a difference
of more than 2, and the blue panels of less than −2, standard errors (s.e.).
This gives an immediate graphical indication that the curvature in the surface
is not consistent with a linear shape.

Fig. 7. Estimates of the regression function of catch score on latitude and longitude. The left hand panel displays a reference band for linearity while the right hand panel
uses colour coding to indicate the size of the difference between the estimate and a
linear regression function, in units of standard error.

Surfaces are three-dimensional objects and software to display such objects in high quality is now widely available. The OpenGL system is a good example
of this and access to these powerful facilities is now possible from statistical
computing environments such as R, through the rgl package described by
Adler (2005). The left hand plot of Figure 7 gives one example of this, showing
a regression surface for the Reef data with additional wire mesh surfaces to
define a reference region for a linear model. The protrusion of the estimated
surface through this reference region indicates the substantial lack-of-fit of the
linear model. This higher quality of three-dimensional representation, together
with the ability to rotate the angle of view interactively, provides a very
attractive and useful display.
The two plots of Figure 8 give an example of a further extension of this
type of display to the comparison of two different surfaces, referring in this
case to two different years of sampling. The surface panels are now painted by
the values of the standardised distance (m̂1 − m̂2 )/s.e.(m̂1 − m̂2 ) and so the
added colour assesses the evidence for differences between the two underlying
surfaces m1 and m2 . The very small regions where the estimates are more
than two standard errors apart indicate that there are only relatively small
differences between the catch score patterns in the two years.
A more detailed description of the construction of displays of this type,
together with techniques for more global assessment of the evidence of dif-
ferences between surfaces and methods for incorporating correlated data, is
provided in Bowman (2005).

Fig. 8. Estimates of regression functions of catch score on latitude and longitude for two different years of data collection. Colour coding has been used to indicate the standardised differences between the two surfaces.

4 Additive models

In order to be useful statistical tools, regression models need to be able to
incorporate arbitrary numbers of covariates. In principle, the local fitting ap-
proach described above could be extended to any number of covariates. How-
ever, in practice the performance of such simultaneous estimation deteriorates
rapidly as the dimensionality of the problem increases. A more parsimonious
and powerful approach is offered by additive models, developed by Friedman
& Stuetzle (1981), Hastie & Tibshirani (1990) and many other authors. These
allow each covariate to contribute to the model in a nonparametric manner
but assume that the effects of these are additive, so that a model for data
{(x1i , . . . , xpi , yi ); i = 1, . . . , n} is given by
yi = α + m1 (x1i ) + . . . + mp (xpi ) + εi .
This extends the usual linear regression model by allowing the effects of the
covariates to be nonparametric in shape. To ensure that the model is identifi-
able, the constraint that each component function mj averages to zero across
the observed covariate values can be adopted.
Additive models can be fitted to observed data through the backfitting
algorithm, where the vectors of estimates m̂j = (m̂j (xj1 ), . . . , m̂j (xjn ))⊤ are
updated from iteration r to r + 1 as

m̂_j^(r+1) = S_j ( y − α̂1 − Σ_{k<j} m̂_k^(r+1) − Σ_{k>j} m̂_k^(r) ).    (3)

This applies a smoothing operation, expressed in the smoothing matrix Sj for the jth covariate, to the partial residuals constructed by subtracting the
current estimates of all the other model components from the data vector y.
The estimate of the intercept term α can be held fixed at the sample mean
ȳ throughout. The identifiability constraint on each component function can
be incorporated by adjusting the vectors m̂j to have mean zero after each
iteration.
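As a concrete illustration, the backfitting scheme (3) can be sketched in a few lines. The fragment below is a toy Python version, not the sm or mgcv implementation: it builds each smoothing matrix Sj from Gaussian kernel weights with a fixed, user-supplied bandwidth, holds the intercept at ȳ, recentres each component to mean zero after every update, and runs a fixed number of sweeps instead of testing for convergence.

```python
import math

def kernel_smoother_matrix(x, h):
    """n x n matrix of row-normalised Gaussian kernel weights, so that
    (S y)_i is a local mean of y around x_i with bandwidth h."""
    S = []
    for xi in x:
        w = [math.exp(-0.5 * ((xi - xk) / h) ** 2) for xk in x]
        total = sum(w)
        S.append([wk / total for wk in w])
    return S

def backfit(y, covariates, bandwidths, sweeps=20):
    """Toy backfitting for y_i = alpha + m_1(x_1i) + ... + m_p(x_pi) + eps_i.

    covariates is a list of p covariate vectors, each of length n.
    Returns (alpha_hat, [m_hat_1, ..., m_hat_p]) with each component
    evaluated at the observed covariate values and centred to mean zero.
    """
    n = len(y)
    p = len(covariates)
    alpha = sum(y) / n                        # intercept held at the sample mean
    S = [kernel_smoother_matrix(x, h) for x, h in zip(covariates, bandwidths)]
    m = [[0.0] * n for _ in range(p)]         # start all components at zero
    for _ in range(sweeps):
        for j in range(p):
            # partial residuals: remove alpha and all other current components
            r = [y[i] - alpha - sum(m[k][i] for k in range(p) if k != j)
                 for i in range(n)]
            # smooth the partial residuals with S_j, as in (3)
            m[j] = [sum(S[j][i][k] * r[k] for k in range(n)) for i in range(n)]
            centre = sum(m[j]) / n            # identifiability: mean-zero component
            m[j] = [v - centre for v in m[j]]
    return alpha, m
```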
The backfitting algorithm described above is not the only way in which
an additive model can be fitted to observed data. In particular, Mammen et
al. (1999) proposed a smooth backfitting algorithm which has a number of
attractive properties. Nielsen & Sperlich (2005) give a clear exposition of this
approach, with practical proposals for bandwidth selection.
Hastie & Tibshirani (1990) discuss how the standard errors of the estimates
may also be constructed. Computationally, the end result of the iterative
scheme (3) can be expressed in matrix form as ŷ = P y = (P0 + Σ_{j=1}^p Pj) y,
where P0 is filled with the values 1/n to estimate α and the remaining matrices
construct the component estimates as m̂j = Pj y. The standard errors of m̂j
at each observed covariate value are then available as the square roots of the
diagonal entries of Pj Pj⊤ σ̂ 2 , where the estimate of error variance σ̂ 2 can be
constructed as RSS/df, where df denotes the degrees of freedom for error. Hastie
& Tibshirani (1990) give details of how these degrees of freedom can be constructed,
again by analogy with their characterisation in linear models.
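As an illustration of this calculation, assuming the projection matrices Pj and the variance estimate σ̂² have already been formed, the standard errors follow directly. A Python sketch (in practice the Pj would be accumulated during the backfitting iterations):

```python
import math

def component_se(Pj, sigma2):
    """Pointwise standard errors of m_hat_j = Pj y.

    Returns the square roots of the diagonal entries of Pj Pj^T sigma2,
    where the n x n matrix Pj is supplied as a list of rows.
    """
    return [math.sqrt(sigma2 * sum(p * p for p in row)) for row in Pj]
```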
The left hand panels of Figure 9 show the results of fitting an additive
model in latitude and longitude to the Reef data. The top two panels show
the estimated functions for each covariate, together with the partial residuals,
while the bottom panel shows the resulting surface. The additive nature of
the surface is apparent as slices across latitude always show the same shape of
longitude effect and vice versa. (Notice that colour has been used here simply
to emphasise the heights of the surface at different locations.)
The level of smoothing was determined by setting the numbers of approx-
imate degrees of freedom to 4 for each covariate. An alternative approach,
advocated by Wood (2000), applies cross-validation as an automatic method
of smoothing parameter selection at each iteration of the estimation process
defined by (3). The effects of this strategy on the Reef data are displayed in
the right hand panels of Figure 9. The estimate of the longitude effect is very
similar but a very large smoothing parameter has been selected for latitude,
leading to a linear estimate. On the basis of the earlier estimate for the latitude
effect, using 4 degrees of freedom, a linear model for this term is a reasonable
strategy to adopt. This leads to a semiparametric model, where one compo-
nent is linear and the other is nonparametric. This hybrid approach takes
advantage of the strength of parametric estimation where model components
of this type are justified.
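The cross-validation criterion behind this kind of automatic selection can be computed cheaply for any linear smoother ŷ = S y, through the standard leave-one-out shortcut based on the diagonal of S. The fragment below is a generic Python sketch of that criterion, not the particular scheme used by Wood (2000); selecting the bandwidth then amounts to minimising this score over h.

```python
def loo_cv_score(y, S):
    """Leave-one-out cross-validation score for a linear smoother y_hat = S y.

    Uses the identity CV = (1/n) sum_i ((y_i - yhat_i) / (1 - S_ii))^2,
    which avoids refitting n times; S is an n x n matrix given as rows.
    """
    n = len(y)
    yhat = [sum(S[i][k] * y[k] for k in range(n)) for i in range(n)]
    return sum(((y[i] - yhat[i]) / (1.0 - S[i][i])) ** 2
               for i in range(n)) / n
```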
For a further example of additive models, the Clyde data are revisited.
When water samples are collected, a variety of measurements are made on
these. This includes temperature and salinity and it is interesting to explore
the extent to which the DO level in the samples can be explained by these
physical parameters. Clearly, temperature has a strong relationship with the
day of the year.

Fig. 9. The left hand panels show the components and fitted surface of an additive model for the Reef data, using 4 degrees of freedom for each covariate. The right hand panels show the results for an additive model when cross-validation is used to select the degree of smoothing at each step of the backfitting algorithm.

In fact, salinity also has a strong relationship with this variable, as it measures the extent to which fresh water from the river and salt
water from the sea mix together and this has a strong seasonal component
related to the volume of river flow. It is therefore inappropriate to use all three
of these variables in a model for DO. As Hastie & Tibshirani (1990) observe,
the effect of concurvity, where explanatory variables have strong curved rela-
tionships, creates difficulties analogous to those associated with collinearity in
a linear model. The three explanatory variables to be considered are therefore
year, temperature and salinity, with the latter variable on a log(salinity + 1)
scale to reduce substantial skewness.
The top two panels of Figure 10 show nonparametric curve estimates based
on regression of DO on temperature and salinity separately. The lower three
panels of the Figure show the effects of fitting an additive model which ex-
presses the DO values as a sum of year, temperature and salinity components
simultaneously. One striking feature expressed in the partial residuals is the
substantial reduction in the variability of the data, compared to that dis-
played in the marginal scatterplots, as each component focusses only on the
variation not explained by the others. Some interesting features are displayed
in the curve estimates, with DO declining in a linear manner as tempera-
ture increases while DO is elevated at low salinity but constant elsewhere.
Reassuringly, the trend across years remains very similar to the patterns dis-
played in earlier analysis, when adjustment involved only the day of the year.
This collection of graphs therefore provides a very powerful summary of the
data across all the covariates involved and brings considerable insight into the
factors which influence the observed values of DO.

5 Discussion

The material of this chapter has aimed to introduce the concepts and aims of
nonparametric regression as a means of adding significant value to graphical
displays of data. Technical details have been limited only to those required to
give a general explanation of the methods. However, a great deal of technical
work has been carried out on this topic, which is well represented in the statis-
tical research literature. There are several books in this area and these provide
good starting points for further information. Hastie & Tibshirani (1990) give
a good general overview of smoothing techniques as well as a detailed treat-
ment of additive models. Green & Silverman (1994) give a very readable and
integrated view of the penalty function approach to smoothing models. Fan
& Gijbels (1996) gives considerable theoretical insight into the local linear
approach to smoothing while Simonoff (1996) is particularly strong in provid-
ing extensive references to the literature on nonparametric regression and is
therefore a very good starting point for further reading.
Bowman & Azzalini (1997) give a treatment which aligns most closely
with the style of exposition in this chapter and focusses particular attention
on smoothing over one and two covariates and on graphical methods.

Fig. 10. The top two panels show plots of dissolved oxygen against temperature and log salinity with the Clyde data. The middle two panels show the fitted functions for temperature and log salinity, and the lower panel for year, from an additive model.

Schimek (2000) provides a collection of contributions from a wide variety of authors
on different aspects of the topic. Härdle et al. (2004) gives a further general
overview of nonparametric modelling while Ruppert et al. (2003) give an au-
thoritative treatment of semiparametric regression in particular. Wood (2006)
provides an excellent introduction to, and overview of, additive models, fo-
cussing in particular on the penalized regression splines framework and with a
great deal of helpful practical discussion. An alternative wavelet view of mod-
elling is provided by Percival & Walden (2000) in the context of time series
analysis. On specifically graphical issues, Cleveland (1993) makes excellent
use of smoothing techniques in the general context of visualising data. Mate-
rial provided by Loader (2004) on local regression techniques and Horowitz
(2004) on semiparametric models in an earlier Handbook of Computational
Statistics are also highly relevant to the material of this chapter.
The role of smoothing techniques in visualisation has been indicated by
specific regression examples in this chapter. However, the principles behind
this approach allow it to be applied to a very wide range of data structures and
application areas. For example, Cole & Green (1992) discuss the estimation
of quantile curves while Kim & Truong (1998) describe how nonparametric
regression can accommodate censored data. Diblasi & Bowman (2001) use
smoothing to explore the shape of an empirical variogram constructed from
spatial data, examining in particular the evidence for the presence of spatial
correlation. Simonoff (1996) discusses the smoothing of ordered categorical
data while Härdle et al. (2004) describe single index models which aim to con-
dense the information in several potential explanatory variables into an index
which can then be related to the response variable, possibly in a nonparamet-
ric manner. Cook & Weisberg (1994) address the general issue of identifying
and exploring the structure of regression data, with particular emphasis on
the very helpful roles of smoothing and graphics in doing so. These references
indicate the very wide variety of ways in which smoothing techniques can be
used to great effect to highlight the patterns in a wide variety of data types,
with appropriate visualisation forming a central part of the process.
Software to implement smoothing techniques is widely available and many
standard statistical packages offer facilities for nonparametric regression in
some form. The examples and illustrations in this chapter have all been im-
plemented in the R statistical computing environment (R Development Core
Team, 2004) which offers a very extensive set of tools for nonparametric mod-
elling of all types. The one and two covariate models of this chapter were fitted
with the sm (Bowman & Azzalini, 2005) package associated with the mono-
graph of Bowman & Azzalini (1997). The mgcv (Wood, 2005) and gam (Hastie,
2005) packages provide tools for generalised additive models which can deal
with a much wider distributional family beyond the simple illustrations of this
chapter.
The web site associated with this handbook provides R software which
will allow the reader to reproduce the examples of the chapter and, by doing
so, offers encouragement for the reader to investigate the potential benefits
of nonparametric regression modelling as a tool in the exploration of other
regression datasets.

Acknowledgement
The assistance of Dr. Brian Miller of the Scottish Environment Protection
Agency in gaining access to, and advising on, the data from the River Clyde
is gratefully acknowledged.

References
1. Adler, D (2005) The R package rgl: 3D visualization device system (OpenGL)
Version 0.65 available from cran.r-project.org
2. Bock, M, Bowman, A W and Ismail, B (2005) Estimation and inference for error
variance in bivariate nonparametric regression Technical report, Department of
Statistics, The University of Glasgow.
3. Bowman A W & Azzalini A (1997) Applied Smoothing Techniques for Data
Analysis Oxford University Press, Oxford
4. Bowman A W & Azzalini A (2005) The R package sm: Smoothing methods for
nonparametric regression and density estimation. Version 2.1-0 available from
cran.r-project.org
5. Bowman, A (2005) Comparing nonparametric surfaces Technical report,
Department of Statistics, The University of Glasgow. (Available from
www.stats.gla.ac.uk/~adrian)
6. Cleveland, W S (1993) Visualising Data Hobart Press, New Jersey
7. Cole, T J & Green, P J (1992) Smoothing reference centile curves: the LMS
method and penalised likelihood. Statistics in Medicine 11, 1305–1319.
8. Cook, R D & Weisberg, S (1994). An Introduction to Regression Graphics Wiley,
New York
9. Diblasi, A & Bowman, A W (2001) On the use of the variogram for checking
independence in a Gaussian spatial process Biometrics, 57, 211–218.
10. Fan, J and Gijbels, I (1996). Local polynomial modelling and its applications.
Chapman & Hall, London.
11. Fan, J, Heckmann, N E and Wand, M P (1995). Local polynomial kernel re-
gression for generalized linear models and quasi likelihood functions. J. Amer.
Statist. Assoc, 90, 141–50
12. Friedman, J H and Stuetzle, W (1981) Projection pursuit regression J. Amer.
Statist. Assoc., 76, 817–23
13. Gasser, T, Sroka, L and Jennen-Steinmetz, C (1986) Residual variance and
residual pattern in nonlinear regression. Biometrika, 73, 625–33
14. Green, P J and Silverman, B W (1994) Nonparametric Regression and General-
ized Linear Models: A Roughness Penalty Approach. Chapman & Hall, London
15. Härdle, W, Müller, M, Sperlich, S, Werwatz, A (2004) Nonparametric and Semi-
parametric Models Springer-Verlag, Berlin
16. Hastie, T (2005) The R package gam: Generalized Additive Models Version 0.94,
available from cran.r-project.org
17. Hastie, T and Tibshirani, R (1990) Generalized Additive Models. Chapman &
Hall, London
18. Horowitz, J L (2004) Semiparametric models. In: Gentle, J E, Härdle, W and
Mori, Y (eds) Handbook of Computational Statistics: concepts and methods.
Springer, Berlin
19. Hurvich, C M, Simonoff, J S and Tsai, C-L (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J. Roy. Statist. Soc., Series B, 60, 271–293
20. Kim H T & Truong Y K (1998) Nonparametric regression estimates with cen-
sored data: local linear smoothers and their applications Biometrics 54, 1434–
1444
21. Loader, C (2004) Smoothing: local regression techniques. In: Gentle, J E,
Härdle, W and Mori, Y (eds) Handbook of Computational Statistics: concepts
and methods. Springer, Berlin
22. McMullan, A, Bowman, A W and Scott, E M (2003). Non-linear and nonpara-
metric modelling of seasonal environmental data. Computational Statistics, 18,
167–183
23. McMullan, A, Bowman, A W and Scott, E M (2005). Additive models with
correlated data: an application to the analysis of water quality data. Technical
report, Department of Statistics, The University of Glasgow. (Available from
www.stats.gla.ac.uk/~adrian)
24. Mammen, E, Linton, O B and Nielsen, T J (1999) The existence and asymptotic
properties of a backfitting projection algorithm under weak conditions Annals
of Statistics 27, 1443–1490
25. Munk, A, Bissantz, N, Wagner, T and Freitag, G (2005) On difference-based variance estimation in nonparametric regression when the covariate is high-dimensional. J. Roy. Statist. Soc., Series B, 67, 19–41
26. Nielsen, T J and Sperlich, S (2005) Smooth backfitting in practice. J. Roy. Statist. Soc., Series B, 67, 43–61
27. Percival, D B and Walden, A T (2000) Wavelet Methods for Time Series Anal-
ysis Cambridge University Press, Cambridge
28. Poiner, I R, Blaber, S J M, Brewer, D T, Burridge, C Y, Caesar, D, Connell, M,
Dennis, D, Dews, G D, Ellis, A N, Farmer, M, Fry, G J, Glaister, J, Gribble,
N, Hill, B J, Long, B G, Milton, D A, Pitcher, C R, Proh, D, Salini, J P,
Thomas, M R, Toscas, P, Veronise, S, Wang, Y G, Wassenberg, T J (1997) The
effects of prawn trawling in the far northern section of the Great Barrier Reef.
Final report to GBRMPA and FRDC on 1991–96 research. CSIRO Division of
Marine Research, Queensland Dept. of Primary Industries
29. R Development Core Team (2004) R: A language and environment for statistical
computing. R Foundation for Statistical Computing. Vienna, Austria. ISBN 3-
900051-07-0, URL http://www.R-project.org
30. Ruppert, D, Wand, M P & Carroll, R J (2003) Semiparametric Regression Cambridge University Press, Cambridge
31. Schimek, M (2000) Smoothing and Regression: Approaches, Computation and
Application Wiley, New York
32. Simonoff, J S (1996) Smoothing Methods in Statistics. Springer-Verlag, New
York
33. Wood, S N (2000) Modelling and smoothing parameter estimation with multiple quadratic penalties. J. Roy. Statist. Soc., Series B, 62, 413–428
34. Wood, S N (2005) The R package mgcv: GAMs with GCV smoothness
estimation and GAMMs by REML/PQL. Version, 1.3-9 available from
cran.r-project.org
35. Wood, S N (2006) Generalized Additive Models: An Introduction with R CRC Press, London

Keywords for indexing: additive models; Akaike’s information criterion; approximate degrees of freedom; backfitting algorithm; bandwidth; cross-validation; curve estimation; local linear; nonparametric regression; reference band; smoothing; surface estimation; variability band
