
FUNCTIONAL DATA ANALYSIS

Functional data analysis (FDA) combines a variety of older methods and problems with some new perspectives, challenges and techniques for the analysis of data or models that involve functions. The main reference is [5], and [8] is a supplementary set of case studies.

Functional Data

The first panel of Figure 1 illustrates a number of aspects of functional data, as well as particular challenges to be taken up later. Functional data such as these ten records of the heights of children involve n discrete observations h_j, possibly subject to measurement error, that are assumed to reflect some underlying and usually smooth process. The argument values t_j associated with these values may not be equally spaced, and may not be the same from one record to another. We wish to consider any record, consisting in this case of the height measurements available for a specific child, as a single functional datum. In the figure the arguments are values of time, but functional data may also be distributed over one or more spatial dimensions when the observation is an image, over both space and time, or may be taken from any other continuum. We usually assume that argument values are sampled from a continuum, even though the actual values are discrete.

Although the number n of discrete observations determines what may be achieved with functional data, the data resolution is more critical. This is defined as the smallest feature in the function that can be adequately defined by the data, where a curve feature is a peak or a valley, a crossing of a threshold, or some other localized shape characteristic of interest. The number of values of t_j per feature in the curve that are required to identify that feature depends on the nature of the feature and the amount of noise in the data. For example, with errorless data, at least three observations are required to identify the location, width, and amplitude of a peak; but if there is noise in the data, many more may be needed.

Functional Parameters

A functional data analysis may also involve functional parameters. Perhaps the example best known to statisticians is the probability density function p(x) describing the distribution of a random variable x. If the density function is determined by a small fixed number of parameters, as with the normal density, then the model is parametric. But if we view the density simply as a function to be estimated from data without the imposition of any shape constraints except for positivity and possibly smoothness, then the density is a functional parameter, and the model for the data can be called functional, as opposed to parametric.

The smooth curves h(t) fitting the height data in the left panel of Figure 1 are functional parameters for the discrete data if we estimate these ten curves in a way that is sufficiently open-ended to capture any detail in a curve that we need. In this growth model we have an additional consideration: a curve h_i should logically be strictly increasing as well as smooth, and we will see below how to achieve this.

Functional models such as density estimates and smoothing curves are often called nonparametric, an unfortunate term because parameters are usually involved in the process, because there is a much older and mostly inconsistent use of this term in statistics, and because it is seldom helpful to describe something in terms of what it isn't.

We can also have functional parameters for nonfunctional data. Examples are intensity functions for point processes, item response functions used to model psychometric data, and hazard functions for survival data, as well as probability density functions.

Some Roles for Derivatives

Functional data and parameters are usually assumed to be, at least implicitly, smooth or regular, and we consider why this should be so below. This implies in practice that a certain number of derivatives exist, and that these derivatives are sufficiently smooth that we can hope to use them in some way. We use the notation D^m x(t) to indicate the value of the mth derivative of the function x at argument value t.

[Figure 1 appears here. Left panel vertical axis: centimeters; right panel vertical axis: centimeters/year^2; both horizontal axes: years (ages 5 to 15).]

Figure 1. The left panel shows irregularly spaced observations of the heights of ten girls, along
with the smooth increasing curves fit to each set of data. The right panel shows the second
derivative or height acceleration functions computed from the smooth curves in the left panel. The
heavy dashed line shows the mean of these acceleration curves.

The right panel of Figure 1 shows the estimated second derivative D^2 h(t), or acceleration of height, for each girl. Height acceleration displays more directly than height itself the dynamics of growth in terms of the input of hormonal stimulation and other factors into these biological systems, just as acceleration in mechanics reflects exogenous forces acting on a moving body.

We also use derivatives to impose smoothness or regularity on estimated functions by controlling a measure of the size of a derivative. For example, it is common practice to control the size of the second derivative or curvature of a function using the total curvature measure ∫ [D^2 x(t)]^2 dt. The closer to zero this measure is, the more like a straight line the curve will be. More sophisticated measures are both possible and of genuine practical value.

A differential equation is an equation involving one or more derivatives as well as possibly the function value. For example, the differential equation defining harmonic or sinusoidal behavior is ωx + D^2 x = 0 for some positive constant ω. Differential equations can define a much wider variety of function behaviors than conventional parameter-based approaches, and this is especially so for nonlinear differential equations. Hence differential equations offer potentially powerful modelling strategies.

For example, returning to the need for a height function h to be strictly increasing, or monotonic, it can be shown that any such function satisfies the differential equation

    w(t)Dh(t) + D^2 h(t) = 0,   (1)

where w(t) is some function that is itself unconstrained in any way. This equation transforms the difficult problem of estimating h(t) into the much easier one of estimating w(t), and also plays a role in curve registration, discussed below. See [6] for other examples of differential equation models for functional parameters in statistics.

Finally, given actual functional data, can we estimate a differential equation whose solution will approximate the data closely? The technique of principal differential analysis (PDA) for doing this can be an important part of a functional data analyst's toolbox. See the entry for this topic.
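To make equation (1) concrete, the following sketch, which is an illustration added here rather than code from the article, constructs a strictly increasing h(t) from an arbitrary weight function w(t). The particular w(t), the grid, and the two constants h(0) and Dh(0) are all invented for the example; the point is only that Dh(t) = Dh(0) exp(−∫ w(u) du) solves (1) and is positive whenever Dh(0) is.

```python
import numpy as np

def cumtrapz0(y, t):
    """Cumulative trapezoidal integral of y over t, starting at zero."""
    out = np.zeros_like(y)
    out[1:] = np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(t))
    return out

t = np.linspace(1.0, 18.0, 400)          # ages in years (illustrative grid)

# A hypothetical, completely unconstrained weight function w(t); any choice
# whatsoever still yields a strictly increasing h(t) below.
w = 0.5 * np.sin(1.3 * t) - 0.1 * t

# Solve w(t) Dh(t) + D^2 h(t) = 0:  Dh(t) = Dh(0) * exp(-int w(u) du),
# then h(t) = h(0) + int Dh(u) du.  h(0) and Dh(0) are invented constants.
Dh = 2.0 * np.exp(-cumtrapz0(w, t))      # Dh(0) = 2.0 > 0 guarantees monotonicity
h = 80.0 + cumtrapz0(Dh, t)              # h(0) = 80.0, e.g. height in cm at age 1

assert np.all(np.diff(h) > 0)            # strictly increasing, whatever w was
```

Monotone smoothing of growth data then amounts to estimating w(t), together with the two constants, rather than estimating h(t) directly.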

Historical Perspectives

Proposing a specific beginning to almost any field is controversial, but surely an early FDA landmark was Fourier's Théorie de la Chaleur (1822), demonstrating a powerful and computationally convenient basis function expansion system whose range of applications exceeds even today those of any other basis system for functional approximation. Techniques for analyzing certain special types of functional data have also been around for a long time. Time series analysis focusses on functions of time, where in practice time values are typically discrete and equally spaced. Assumptions of stationarity play a large role in this field, but tend to be avoided in FDA. Longitudinal data analysis is typically concerned with large numbers of replications of rather short time series, where the data in each record typically have little resolving power.

The term "functional data analysis" is rather recent, perhaps first used in [7]. At this early stage, the objective tended to be the adaptation of familiar multivariate techniques such as principal components analysis to functional contexts, but more recent work has extended the scope to include methods such as principal differential analysis that are purely functional in nature.

SOME PERSPECTIVES ON FUNCTIONS

A function is a mapping from a set of objects called the domain of the function to another set of objects called its range, with the special property that any domain object is mapped to at most one range object. The natures of both the domain and range can be far wider than vectors of real or complex numbers. Indeed, mappings from a space of functions to another space of functions are especially important in this as in many fields, and we can extend concepts like derivatives and integrals to these situations.

Whatever the domain and range, an essential property of a function is its dimensionality. In the majority of situations, functions can be viewed as points in an infinite dimensional space of possible functions, especially when the domain and/or the range is a continuum of some sort. Consequently, functions can be large in two quite independent ways: first, by how large their values are in the range space, and secondly, in terms of their dimensionality. White noise, for example, is an infinitely large object even if the noise is everywhere small. This presents a serious conceptual problem for the statistician: how can we hope to accurately estimate an infinite dimensional object from a finite amount of data?

But dimensionality, even if infinite, is not the same thing as complexity. Complexity has to do with the amount of detailed variation that a function exhibits over a small circumscribed region in the domain space. White noise, for example, is infinitely complex, but any reasonable account of a single child's growth curve should be fairly simple over a short time period. Complexity requires energy to achieve; and energy is everywhere in nature available in limited amounts, and can only be moved from one place to another at finite rates. The complexity of a stock market index, for example, although impressive, is still limited by the amount of capital available and the capacity of the financial community to move the capital around.

Mathematical tools for manipulating complexity and dimensionality independently are rather new. Multi-resolution analysis and wavelets offer basis systems in which the variation of a function can be split into layers of increasing complexity, but with the possibility of controlling the total energy allocated to each level. These concepts will surely have a large impact on the future of functional data analysis.

What model for ignorable variation, often called "noise" or "error", should we use? For finite dimensional data we tend to model noise as a set of N independently and identically distributed realizations of some random variable E. But the concept does not scale up well in dimensionality; white noise is infinite dimensional, infinitely complex, and infinitely large, and thus will ultimately overwhelm any amount of data. Moreover, putting the mathematical analysis of white noise and its close cousin, Brownian motion, in good order has challenged mathematicians for most of the last century. Ignorable functional variation, on the other hand, while admittedly more complex than we want to
deal with, is still subject to limits on the energy required to produce it. These issues are of practical importance when we consider hypothesis testing and other inferential problems.

FIRST STEPS IN FUNCTIONAL DATA ANALYSES

Techniques for Representing Functions

The first task in a functional data analysis is to capture the essential features of the data in a function. This we can always do by interpolating the data with a piecewise linear curve; this not only contains the original data within it, but also provides interpolating values between data points. However, some smoothing can be desirable as a means of reducing observational error, and if one or more derivatives are desired, then the fitted curve should also be differentiable at least up to the level desired, and perhaps a few derivatives beyond. Additional constraints may also be required, such as positivity and monotonicity. The smoothing or regularization aspect of fitting can often be deferred, however, by applying it to the functional parameters to be estimated from the data rather than to the data-fitting curve itself.

Functions are traditionally defined by a small number of parameters that seem to capture the main shape features in the data, and ideally are also motivated by substantive knowledge about the process. For the height data, for example, several models have been proposed, and the most satisfactory require eight or more parameters. But when the curve has a lot of detail or high accuracy is required, parametric methods usually break down. In this case two strategies dominate most of the applied work.

One approach is to use a system of basis functions φ_k(t) in the linear combination

    x(t) = Σ_{k=1}^{K} c_k φ_k(t).   (2)

The basis system must have enough resolving power so that, with K large enough, any interesting feature in the data can be captured by x. Polynomials, where φ_k(t) = t^{k−1}, have been more or less replaced by spline functions (see the relevant article for details) for non-periodic data, but Fourier series remains the basis of choice for periodic problems. Both splines and Fourier series offer excellent control over derivative estimation. Wavelet bases are especially important for rendering sharp localized features.

In general, basis function expansions are easy to fit to data and easy to regularize. In the typical application the discrete observations y_j are fit by minimizing a penalized least squares criterion such as

    Σ_{j=1}^{n} [y_j − x(t_j)]^2 + λ ∫ [D^2 x(t)]^2 dt.   (3)

When x(t) is defined by (2), solving for the coefficients c_k that minimize the criterion implies solving a matrix linear equation. The larger the penalty parameter λ is, the more x(t) will be like a straight line, and the closer λ is to zero, the more nearly x(t) will fit the data. There are a variety of data-dependent methods for finding an appropriate value of λ; see the relevant entry for more details. It is also possible to substitute other more sophisticated differential operators for D^2, and consequently smooth toward targets other than straight lines; see [3] for examples and details.

For multi-dimensional arguments, the choice of bases becomes more limited. Radial and tensor-product spline bases are often used, but are problematical over regions with complex boundaries or regions where large areas do not contain observations. On the other hand, the triangular mesh and local linear bases associated with finite element methods for solving partial differential equations are extremely flexible, and software tools for manipulating them are widely available. Moreover, many data-fitting problems can be expressed directly as partial differential equation solutions; see [7] and [9] for examples.

The other function-fitting strategy is to use various types of kernel methods (see [4]), which are essentially convolutions of the data sequence with a specified kernel K(s − t). The convolution of an image with a Gaussian kernel is a standard procedure in image analysis, and also corresponds to a solution of the heat or diffusion partial differential equation.
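Returning to the expansion (2) and the penalized criterion (3), here is a minimal sketch of the computation; it is not code from the article, and the simulated observations, the Fourier basis, the value of λ, and the finite-difference approximation to the curvature penalty are all choices made purely for illustration.

```python
import numpy as np

# Hypothetical observation times and noisy measurements y_j (simulated here;
# in practice these would be the recorded data for one functional observation).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(1, 18, 40))
y = 80 + 100 * (1 - np.exp(-0.15 * t)) + rng.normal(0, 1, t.size)

# A Fourier basis phi_k(t) on [1, 18], as in expansion (2); K chosen generously.
K, T0, T1 = 15, 1.0, 18.0
def basis(tt):
    cols = [np.ones_like(tt)]
    for k in range(1, (K - 1) // 2 + 1):
        freq = 2 * np.pi * k / (T1 - T0)
        cols += [np.sin(freq * (tt - T0)), np.cos(freq * (tt - T0))]
    return np.column_stack(cols)[:, :K]

Phi = basis(t)                          # n x K design matrix of basis values

# Roughness penalty matrix: approximate int [D^2 phi_k][D^2 phi_l] dt on a fine
# grid by second differences of the basis columns (the penalty in (3)).
grid = np.linspace(T0, T1, 401)
dt = grid[1] - grid[0]
B = basis(grid)
D2B = np.diff(B, n=2, axis=0) / dt**2
R = D2B.T @ D2B * dt

lam = 1.0                               # roughness penalty parameter lambda
c = np.linalg.solve(Phi.T @ Phi + lam * R, Phi.T @ y)   # the matrix linear equation
x_hat = basis(grid) @ c                 # the smoothed curve x(t) on the grid
```

As the text notes, increasing lam pulls x_hat toward a straight line, while lam near zero reproduces an unpenalized least squares fit to the basis.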
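The kernel strategy can be sketched just as briefly. The normalized Gaussian-kernel convolution below is one common variant (a Nadaraya–Watson style smoother); the bandwidth and the simulated data are again assumptions made for the illustration rather than anything prescribed by the article.

```python
import numpy as np

def gaussian_kernel_smooth(t_obs, y_obs, t_grid, bandwidth):
    """Normalized convolution of the data with a Gaussian kernel K(s - t);
    the bandwidth plays the role of the smoothing parameter."""
    K = np.exp(-0.5 * ((t_grid[:, None] - t_obs[None, :]) / bandwidth) ** 2)
    return (K @ y_obs) / K.sum(axis=1)

# Hypothetical noisy observations, invented for the example.
rng = np.random.default_rng(2)
t_obs = np.sort(rng.uniform(1, 18, 40))
y_obs = 80 + 100 * (1 - np.exp(-0.15 * t_obs)) + rng.normal(0, 1, t_obs.size)

t_grid = np.linspace(1, 18, 200)
x_hat = gaussian_kernel_smooth(t_obs, y_obs, t_grid, bandwidth=1.0)
```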

Basis function and convolution methods share many aspects, and each can be made to look like the other. They do, however, impose limits on what kind of function behavior can be accommodated, in the sense that they cannot do more than is possible with the bases or kernels that are used. For example, there are situations where sines and cosines are, in a sense, too smooth, being infinitely differentiable, and for this reason are being replaced by wavelet bases in some applications, and especially in image compression, where the image combines local sharp features with large expanses of slow variation.

Expressing both functional data and functional parameters as differential equations greatly expands the horizons of function representation. Even rather simple differential equations can define behavior that would be difficult to mimic in any basis system, and references on chaos theory and nonlinear dynamics offer plenty of examples; see [1]. Linear differential equations are already used in fields such as control theory to represent functional data, typically with constant coefficients. Linear equations with nonconstant coefficients are even more flexible tools, as we saw in equation (1) for growth curves, and nonlinear equations are even more so. More generally, functional equations of all sorts will probably emerge as models for functional data.

Registration and Descriptive Statistics

Once the data are represented by samples of functions, the next step might be to display some descriptive statistics. However, if we inspect a sample of curves such as the acceleration estimates in the right panel of Figure 1, we see that the timing of the pubertal growth spurt varies from child to child. The point-wise mean, indicated by the heavy dashed line, is a poor summary of the data; it has less amplitude variation and a longer pubertal growth period than those of any curve. The reason is that the average is taken over children doing different things at any point in time; some are pre-pubertal, some are in the middle, and some are terminating growth.

Registration is the one-to-one transformation of the argument domain in order to align salient curve or image features. That is, clock time is transformed for each child to biological time, so that all children are in the same growth phase with respect to the biological time scale. This transformation of time h(t) is often called a time warping function, and it must, like a growth curve, be both smooth and strictly increasing. As a consequence, the differential equation (1) also defines h(t), and, through the estimation of the weight function w(t), can be used to register a set of curves.

The mean function x̄(t) summarizes the location of a set of curves, possibly after registration. Variation among functions of a single argument is described by covariance and correlation surfaces, v(s, t) and r(s, t), respectively. These are bivariate functions that are the analogs of the corresponding matrices V and R in multivariate statistics. Of course, the simpler point-wise standard deviation curve s(t) may also be of interest.

FUNCTIONAL DATA ANALYSES

FDA involves functional counterparts of multivariate methods such as principal components analysis, canonical correlation analysis, and various linear models. In principal components analysis, for example, the functional counterparts of the eigenvectors of a correlation matrix R are the eigenfunctions of a correlation function r(s, t). Functional linear models use regression coefficient functions β(s) or β(s, t) rather than β_1, ..., β_p. Naive application of standard estimation methods, however, can often result in parameters such as regression coefficient functions and eigenfunctions which are unacceptably rough or complex, even when the data are fairly smooth. Consequently, it can be essential to impose smoothness on the estimate, and this can be achieved through regularization methods, typically involving attaching a term to the fitting criterion that penalizes excessive roughness. Reference [5] offers a number of examples realized in the context of basis function representations of the parameters.
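Before turning to the individual methods, here is a minimal sketch of the descriptive statistics and of the correlation surface just described, together with a leading eigenfunction that anticipates the principal components analysis below. The simulated sample of curves, the common evaluation grid, and the assumption that registration has already been carried out are all inventions for this illustration, not part of the article.

```python
import numpy as np

# Hypothetical sample: N registered curves evaluated on a common grid of T points
# (in practice these values would come from the smoothing step described above).
rng = np.random.default_rng(3)
T, N = 101, 10
t = np.linspace(1, 18, T)
X = np.stack([80 + 100 * (1 - np.exp(-0.15 * t))
              + rng.normal(0, 2) * np.sin(0.4 * t)      # curve-to-curve variation
              + rng.normal(0, 1, T)                      # observational noise
              for _ in range(N)])                        # shape (N, T)

xbar = X.mean(axis=0)                      # mean function on the grid
Xc = X - xbar                              # centred curves
v = Xc.T @ Xc / (N - 1)                    # covariance surface v(s, t) on the grid
s = np.sqrt(np.diag(v))                    # point-wise standard deviation s(t)
r = v / np.outer(s, s)                     # correlation surface r(s, t)

# Functional counterpart of an eigenvector of R: the leading eigenfunction of
# v(s, t), discretized on the grid and scaled so that the Riemann sum
# approximating the integral of xi^2 equals one.
dt = t[1] - t[0]
eigvals, eigvecs = np.linalg.eigh(v * dt)  # weight by dt to approximate the integral
xi = eigvecs[:, -1] / np.sqrt(dt)          # leading eigenfunction: sum(xi**2) * dt = 1
```

In a real analysis the eigenfunction would usually be regularized, for the reasons given in the preceding paragraph.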

Few methods for hypothesis testing, interval estimation and other inferential problems in the context of FDA have been developed. Part of the problem is conceptual: what does it mean to test a hypothesis about an infinite dimensional state of nature on the basis of finite information? On a technical level, although the theory of random functions or stochastic processes is well developed, testing procedures based on a white noise model for ignorable information are both difficult and seem unrealistic.

Much progress has been made in [10], however, in adapting t, F, and other tests to locating regions in multidimensional domains where the data indicate significant departures from a zero mean Gaussian random field. This work points the way to defining inference as a local decision problem rather than a global one.

Principal Components Analysis

Suppose that we have a sample of N curves x_i(t), where the curves may have been registered and the mean function has been subtracted from each curve. Let v(s, t) be their covariance function. The goal of principal components analysis (PCA) is to identify an important mode of variation among the curves, and we can formalize this as the search for a weighting function ξ(t), called the eigenfunction, such that the amount of this variation, expressed as

    μ = ∫∫ ξ(s) v(s, t) ξ(t) ds dt,   (4)

is maximized subject to the normalizing condition ∫ ξ^2(t) dt = 1. This formulation of the PCA problem replaces the summations in the multivariate version of the problem, μ = v′Vv, by integrals. The eigenequation ∫ v(s, t) ξ(t) dt = μ ξ(s) defines the optimal solution, and subsequent principal components are defined as in the classic version; each eigenfunction ξ_k(t) maximizes the sum of squared principal component scores subject to the normalizing condition and to the orthogonality condition ∫ ξ_j(t) ξ_k(t) dt = 0, j < k.

However, if the curves are not sufficiently smooth, the functional principal components defined by the ξ_j(t) will tend to be unacceptably rough, and especially so for higher indices j. Smoothing the functions x_i(t) will certainly help, but an alternate approach that may be applied to virtually all functional data analyses is to apply a roughness penalty directly to the estimated weight functions or eigenfunctions ξ(t). This is achieved by changing the normalizing condition to ∫ {ξ^2(t) + λ[D^2 ξ(t)]^2} dt = 1. In this way the smoothing or regularization process, controlled by the roughness penalty parameter λ, may be deferred to the step at which we estimate the functional parameter that interests us.

In multivariate work it is common practice to apply a rotation matrix T to principal components once a suitable number have been selected, in order to aid interpretability. Matrix T is often required to maximize the VARIMAX criterion. The strategy of rotating functional principal components also frequently assists interpretation.

Linear Models

The possibilities for functional linear models are much wider than for the multivariate case. The simplest case arises when the dependent variable y is functional, the covariates are multivariate, and their values are contained in an N by p design matrix Z. Then the linear model is

    E[y_i(t)] = Σ_{j=1}^{p} z_{ij} β_j(t).   (5)

A functional version of the usual error sum of squares criterion is Σ_i ∫ [y_i(t) − ŷ_i(t)]^2 dt, and the least squares solution for the regression coefficient functions β_j(t) is simply β̂(t) = (Z′Z)^{−1} Z′Y(t), where Y(t) is the column vector of N dependent variable functions.

It will often happen that the covariate is functional, however. For example, we can have as the model for a scalar univariate observation y_i

    E[y_i] = β_0 + ∫ x_i(s) β(s) ds.   (6)

A functional linear model in which both the dependent and covariate variables are functional, and which has been studied in some depth [2], is the varying coefficient or point-wise linear model involving p covariate functions,

    E[y_i(t)] = Σ_{j=1}^{p} x_{ij}(t) β_j(t).   (7)

In this model, y_i depends on the x_ij's only in terms of their simultaneous behavior.
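As a small illustration of the least squares solution for model (5), here is a sketch added for this presentation; the design matrix Z, the simulated response curves, and the grid are invented, and no roughness penalty is applied.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, T = 30, 3, 101                     # N subjects, p scalar covariates, T grid points
t = np.linspace(0, 1, T)

Z = np.column_stack([np.ones(N), rng.normal(size=(N, p - 1))])   # N x p design matrix

# "True" coefficient functions beta_j(t), used only to simulate the data.
beta_true = np.stack([np.sin(2 * np.pi * t), t, 1 - t])           # p x T
Y = Z @ beta_true + rng.normal(0, 0.1, size=(N, T))               # N response curves on the grid

# Least squares fit of model (5): beta_hat(t) = (Z'Z)^{-1} Z' Y(t),
# computed simultaneously for every grid value of t.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)                      # p x T coefficient curves
Y_hat = Z @ beta_hat                                              # fitted dependent variable functions
```

A pointwise version of the same computation, with Z replaced at each t by the matrix of covariate function values x_ij(t), gives the varying coefficient model (7); in practice a roughness penalty would usually be added, as discussed below.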

But much more generally, a dependent variable function y_i can be fit by a single bivariate covariate x_i with the linear model

    E[y_i(t)] = α(t) + ∫_{Ω_t} x_i(s, t) β(s, t) ds,   (8)

where α(t) is the intercept function, β(s, t) is the bivariate regression function, and Ω_t is a domain of integration that can depend on t. The notation and theorems of functional analysis can be used to derive the functional counterpart of a least squares fit of this model. In practice, moreover, we can expect that there will be multiple covariates, some multivariate and some functional.

No matter what the linear model, we can attach one or more roughness penalties to the error sum of squares criterion being minimized to control the roughness of the regression functions. Indeed, we must exercise this option when the dependent variable is either scalar or finite dimensional and the covariate is functional, because the dimensionality of the covariate is then potentially infinite, and we can almost always find a function β(s) that will provide a perfect fit. In this situation, controlling the roughness of the functional parameter also ensures that the solution is meaningful as well as unique.

Canonical Correlation Analysis

Canonical correlation analysis (CCA) in multivariate statistics is a technique for exploring covariation between two or more groups of variables. Although multivariate CCA is much less often used than PCA, its functional counterpart seems likely to see many applications, because we often want to see what kind of variation is shared by two curves measured on the same unit, x_i(t) and y_i(t). Let v_XX(s, t), v_YY(s, t) and v_XY(s, t) be the variance functions for the X variable and the Y variable, and the covariance function for the pair of variables, respectively. Here we seek two canonical weight functions, ξ(t) and η(t), such that the canonical correlation criterion

    ρ = ∫∫ ξ(s) v_XY(s, t) η(t) ds dt   (9)

is maximized subject to the two normalizing constraints

    ∫∫ ξ(s) v_XX(s, t) ξ(t) ds dt = 1,
    ∫∫ η(s) v_YY(s, t) η(t) ds dt = 1.

Further matching pairs of canonical weight functions can also be computed by requiring that they be orthogonal to previously computed weight functions, as in PCA.

However, CCA is especially susceptible to picking up high frequency uninteresting variation in curves, and by augmenting the normalization conditions with a roughness penalty in the same way as for PCA, the canonical weight functions and the modes of covariation that they define can be kept interpretable. As in PCA, once a set of interesting modes of covariation has been selected, rotation can aid interpretation.

Principal Differential Analysis

Principal differential analysis (PDA) is a technique for estimating a linear variable-coefficient differential equation from one or more functional observations. That is, functional data are used to estimate an equation of the form

    w_0(t)x(t) + w_1(t)Dx(t) + ... + w_{m−1}(t)D^{m−1}x(t) + D^m x(t) = f(t).   (10)

The order of the equation is m, the highest order derivative used. The m weight coefficient functions w_j(t), along with the forcing function f(t), are what define the equation. Some of these functional parameters may be constant, and some may be zero. For example, the order two differential equation (1), defining growth and the time warping function h(t) used in registration, has w_0(t) and f(t) equal to zero, and is therefore defined by the single weight function w_1(t).

The main advantage of a differential equation model is that it captures the dynamics of the process, in the sense that it also models velocity, acceleration and higher derivatives as well as the function itself. Moreover, differential equations of this sort, as well as nonlinear functions of derivatives, can define functional behaviors that are difficult to capture in a low-dimensional basis function expansion.
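To convey the flavor of the estimation problem, the sketch below is a deliberately crude pointwise least squares version of the idea for a special case of (10) with m = 2, w_1 = 0 and f = 0. The simulated curves, the finite-difference derivatives, and the absence of any smoothing of the weight function are simplifications made for this illustration, not the procedure described in the PDA entry.

```python
import numpy as np

rng = np.random.default_rng(5)
T, N, omega = 401, 20, 2.0
t = np.linspace(0, 10, T)
dt = t[1] - t[0]

# Simulated functional observations satisfying D^2 x + omega^2 x = 0,
# with random amplitudes and a very small amount of observational noise.
X = np.stack([a * np.sin(omega * t) + b * np.cos(omega * t)
              for a, b in rng.normal(0, 1, size=(N, 2))])
X += rng.normal(0, 1e-4, size=X.shape)

# Central-difference second derivatives on the interior of the grid; with real,
# noisier data the derivatives would be taken from smoothed curves instead.
D2X = (X[:, 2:] - 2 * X[:, 1:-1] + X[:, :-2]) / dt**2
X_in = X[:, 1:-1]

# Pointwise least squares estimate of the weight function w_0(t): at each t,
# minimize sum_i [w_0(t) x_i(t) + D^2 x_i(t)]^2 over w_0(t).
w0_hat = -np.sum(X_in * D2X, axis=0) / np.sum(X_in**2, axis=0)

# w0_hat should be close to omega**2 = 4 at every interior grid point.
```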

See the entry for this topic for further details.

Acknowledgments
The author's work was prepared under a grant from the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

1. Alligood, K. T., Sauer, T. D., and Yorke, J. A. (1996). Chaos: An Introduction to Dynamical Systems. Springer, New York.
2. Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B, 55, 757–796.
3. Heckman, N. and Ramsay, J. O. (2000). Penalized regression with model-based penalties. Canadian Journal of Statistics, 28, 241–258.
4. Fan, J. and Gijbels, I. (1996). Local Polynomial Modeling and Its Application—Theory and Methodologies. Chapman and Hall, New York.
5. Ramsay, J. O. and Silverman, B. W. (1997). Functional Data Analysis. Springer, New York.
6. Ramsay, J. O. (2000). Differential equation models for statistical functions. Canadian Journal of Statistics, 28, 225–240.
7. Ramsay, J. O. and Dalzell, C. (1991). Some tools for functional data analysis (with discussion). Journal of the Royal Statistical Society, Series B, 53, 539–572.
8. Ramsay, J. O. and Silverman, B. W. (2002). Applied Functional Data Analysis. Springer, New York.
9. Ramsay, T. (2002). Spline smoothing over difficult regions. Journal of the Royal Statistical Society, Series B, 64, 307–319.
10. Worsley, K. J. (1994). Local maxima and the expected Euler characteristic of excursion sets of χ², F and t fields. Advances in Applied Probability, 26, 13–42.

J. O. RAMSAY
