
Journal of Statistical Computation and Simulation

ISSN: 0094-9655 (Print) 1563-5163 (Online) Journal homepage: http://www.tandfonline.com/loi/gscs20

Functional logistic regression: a comparison of three methods

Seyed Nourollah Mousavi & Helle Sørensen

To cite this article: Seyed Nourollah Mousavi & Helle Sørensen (2018) Functional logistic
regression: a comparison of three methods, Journal of Statistical Computation and Simulation,
88:2, 250-268, DOI: 10.1080/00949655.2017.1386664

To link to this article: https://doi.org/10.1080/00949655.2017.1386664

Published online: 24 Oct 2017.

JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2018
VOL. 88, NO. 2, 250–268
https://doi.org/10.1080/00949655.2017.1386664

Functional logistic regression: a comparison of three methods


Seyed Nourollah Mousavi^a and Helle Sørensen^b

^a Department of Mathematics, Arak University, Arak, Iran; ^b Department of Mathematical Sciences, University of Copenhagen, Copenhagen, Denmark

CONTACT: Seyed Nourollah Mousavi, n-mousavi@araku.ac.ir

ABSTRACT
Functional logistic regression is becoming more popular, as there are many situations where we are interested in the relation between functional covariates (as input) and a binary response (as output). Several approaches have been advocated, and this paper goes into detail about three of them: dimension reduction via functional principal component analysis, penalized functional regression, and wavelet expansions in combination with Least Absolute Shrinkage and Selection Operator (LASSO) penalization. We discuss the performance of the three methods on simulated data and also apply the methods to data regarding lameness detection for horses. Emphasis is on classification performance, but we also discuss estimation of the unknown parameter function.

ARTICLE HISTORY
Received 9 May 2017; Accepted 27 September 2017

KEYWORDS
Discrete wavelet transform; LASSO penalization; functional principal component analysis; supervised classification; penalized functional regression

2010 MSC SUBJECT CLASSIFICATIONS
62-07; 62J12; 62J07; 62P10

1. Introduction
Functional data consist of curves, surfaces (images), or more general objects in higher
dimensions, and occur more and more often and in many different scientific fields, for
example, medicine, economics, biology, and chemistry. Functional data are observed dis-
cretely but are generated from underlying random functions. Hence, the data points within
a curve (or image) are highly dependent, and it is expedient to use approaches that take this
dependency into account in the analysis of the data. This has led to the sub-field of statistics
called functional data analysis (FDA). A common strategy in FDA is to handle the high-
dimensionality with basis expansions, either using fixed or data-driven bases, and this will
also be the case in this paper. Textbooks on FDA include [1–3].
An increasing number of studies in FDA are concerned with the relationship between one
or more functional covariates and an outcome variable, which can be scalar, binary, cate-
gorical, or functional. For scalar outcomes there is a plethora of literature; see [4–9], among
others. The primary aims of these studies were to explain the variation in the data and
predict future values of the outcome using the information from the functional covariates.
There are fewer studies related to binary outcomes, and the methods have not been com-
pared thoroughly in the literature. Ratcliffe et al. [10] suggested a logistic regression analysis
for foetal heart rate data. They used Fourier expansions for the functional covariates and
the parameter function and a modified Fisher scoring algorithm for computation of the

maximum likelihood estimate. Several papers used functional principal component analy-
sis (FPCA) followed by standard logistic regression with the principal component scores as
entries in the design matrix [11–13]. The high-dimensional multicollinearity in the func-
tional covariates was thus overcome by the dimension reduction obtained by FPCA, and
the main difference lies in the approach for computing the principal component scores in
the FPCA. In Section 2.2, we will pay special attention to logistic regression in combination with the so-called PACE (principal analysis via conditional expectation) technique [6], where estimation of the scores is based on conditional expectations. This method was also used by Wei et al. [13] for gene detection in a case–control study of pancreatic cancer.
Goldsmith et al. [8] used penalized functional regression (PFR) for analysis of white-matter
tract profiles in multiple sclerosis (the separation of participants into patients and controls
constitutes the binary response). They combined FPCA for the functional covariate with
a B-spline expansion for the parameter function, incorporating penalization for regular-
ization. We will go into detail about this method in Section 2.3. Recently, Mousavi and
Sørensen [14] proposed to use a combination of wavelets and LASSO penalization for mul-
tivariate outcomes; this method will be presented in Section 2.1 for the special case with
binary outcomes.
In this paper we describe and compare three of the above-mentioned methods. They
are tested on simulated data and used to analyse a dataset regarding detection of lameness
of horses. Our aim is to compare the three approaches with primary focus on the methods’
ability to correctly classify new observations. In addition, we study the accuracy of the esti-
mates of the parameter function and comment on computational efficiency. The remainder
of the paper is organized as follows. In Section 2 we describe the three methods of interest.
Two simulation studies are discussed in Section 3, and Section 4 contains an analysis of the
lameness data. Final remarks are provided in Section 5.

2. Functional logistic regression


There are many situations where the relation between input variables and a binary output is
of interest. For classification, for example, we need a rule that takes the input variables and
delivers a guess of the output. Random variation implies that there is no ‘perfect rule’, and
it is thus expedient to seek a stochastic model which is able to account for noise. Indeed, we
are interested in finding the conditional distribution of the binary response given the input
variables. In other words, if Y is the binary outcome and X is the collection of covariates,
we wish to model the conditional probability π(x) = P(Y = 1 | X = x) as a function of x,
describing the effect of X on the outcome distribution.
In the functional logistic setting, the predictor $X : T \to \mathbb{R}$ is assumed to be a square-integrable random function on a compact support $T \subset \mathbb{R}$, and the conditional success probability is assumed to be of the form

$$\pi(x) = \frac{\exp\{\alpha + \int_T \beta(t)x(t)\,dt\}}{1 + \exp\{\alpha + \int_T \beta(t)x(t)\,dt\}}, \qquad (1)$$

where $\alpha$ is an intercept parameter, and $\beta : T \to \mathbb{R}$ is a coefficient function assumed to be square-integrable on $T$. With the logit transformation the model can be written in terms of a linear predictor, $\eta$:

$$\eta(x) = \operatorname{logit}(\pi(x)) = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \alpha + \int_T \beta(t)x(t)\,dt. \qquad (2)$$

The coefficient function $\beta$ describes the relationship between $Y$ and $X$. Large values of $|\beta(t)|$ on certain parts of the domain indicate that differences in $X$ in those regions have larger predictive power for $Y$. The model is a straightforward generalization of the functional linear model for continuous response, where the expected value is modelled in the same way as the linear predictor above.
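As a small illustration of (1) and (2), the integral can be approximated by a Riemann sum when a curve is observed on a dense, equally spaced grid. The following R sketch is not from the paper; the curve, coefficient function, and intercept are hypothetical choices made for illustration:

```r
## Minimal sketch: evaluating the functional logistic model on a grid.
## All numerical values here are hypothetical.
tgrid <- seq(0, 10, length.out = 256)        # sampling points on T = [0, 10]
dt    <- tgrid[2] - tgrid[1]                 # grid spacing
x     <- sin(tgrid) + rnorm(256, sd = 0.1)   # an example trajectory x(t)
beta  <- sin(tgrid * pi / 3)                 # an example coefficient function
alpha <- 0                                   # intercept

eta   <- alpha + sum(beta * x) * dt   # Riemann approximation of eq. (2)
pihat <- plogis(eta)                  # success probability, eq. (1)
```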
Data consist of pairs $(x_i, y_i)$ from $n$ subjects. The pairs are considered outcomes of independent random variables $(X_i, Y_i)$, $i = 1, \ldots, n$, with conditional distribution of $Y_i$ given $X_i = x$ given by (1) or (2). The conditional success probability for subject $i$ is $\pi_i = \pi(x_i)$, and the corresponding linear predictor is denoted $\eta_i$. We make no assumptions on the marginal distribution of $X_i$. In practice each $x_i$ is observed at discrete sampling points, yet we consider $x_i$ as a function.
Central goals in functional logistic regression are estimation of $\beta$, including assessment of the variability of the estimate, and prediction of new responses. As for the functional linear model, we need further restrictions on $x_i$ and $\beta$ in order to solve the estimation problem [1]. A standard approach that maintains the functional nature of $x_i$ and $\beta$ is to represent them with basis expansions. The basis can either be a fixed basis (not constructed from the data), for example a Fourier basis, a B-spline basis, a wavelet basis, or a ramp basis, or it can be a basis constructed from the eigenfunctions of the covariance operator.
Assume that we have selected $\theta$ and $\omega$ as bases for expansion of the sampling trajectories and the coefficient function, respectively. The trajectories and the parameter function are thus expressed as

$$x_i(t) = \sum_{k=1}^{K_x} c_{ik}\,\theta_k(t) = \mathbf{c}_i^T\boldsymbol\theta(t), \qquad \beta(t) = \sum_{\ell=1}^{K_\beta} b_\ell\,\omega_\ell(t) = \boldsymbol\omega^T(t)\mathbf{b}, \qquad (3)$$

where $\theta_k(t)$ and $\omega_\ell(t)$ denote the $k$th and $\ell$th basis functions evaluated at time $t$, and $\boldsymbol\theta(t)$ and $\boldsymbol\omega(t)$ denote vectors with those values (of dimensions $K_x$ and $K_\beta$). Coefficients in the expansions are denoted $c_{ik}$ and $b_\ell$, and are collected in vectors $\mathbf{c}_i$ and $\mathbf{b}$.
We need some further notation: all coefficients corresponding to covariate functions are collected in the $n \times K_x$ matrix $C$, and the $K_x \times K_\beta$ matrix $J_{\theta\omega}$ is defined by $J_{\theta\omega} = \int_T \boldsymbol\theta(t)\boldsymbol\omega^T(t)\,dt$. Furthermore, let $\boldsymbol\eta = (\eta_1, \eta_2, \ldots, \eta_n)^T$ be the vector of linear predictors, and let $\mathbf{1} = (1, 1, \ldots, 1)^T$ be the $n$-vector of ones. Substituting for $\beta(t)$ and $x_i(t)$, the matrix form of the regression model (2) becomes

$$\boldsymbol\eta = \alpha\mathbf{1} + C\int_T \boldsymbol\theta(t)\boldsymbol\omega^T(t)\,dt\,\mathbf{b} = \alpha\mathbf{1} + CJ_{\theta\omega}\mathbf{b}, \qquad (4)$$

corresponding to a standard multiple logistic regression model.


Under the independence assumption, the conditional likelihood function, expressing the probability of the outcome pattern $(y_1, \ldots, y_n)$ conditional on the covariate functions, is given by

$$L(\alpha, \mathbf{b}) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i} = \prod_{i=1}^{n} \frac{e^{y_i(\alpha + \int_T x_i(t)\beta(t)\,dt)}}{1 + e^{\alpha + \int_T x_i(t)\beta(t)\,dt}} = \prod_{i=1}^{n} \frac{e^{y_i(\alpha + \mathbf{c}_i^T J_{\theta\omega}\mathbf{b})}}{1 + e^{\alpha + \mathbf{c}_i^T J_{\theta\omega}\mathbf{b}}},$$

and the maximum likelihood estimator can be found by numerical optimization methods.
For instance, Ratcliffe et al. [10] applied the Fisher scoring algorithm.
It is well known that parameter estimates in multiple logistic regression models are not reliable when there is a high degree of dependence between the covariates (multicollinearity), that is, when the columns of the design matrix are highly correlated [12,15]. In functional logistic regression, because of the nature of the functional data, the correlation between columns of the matrix $CJ_{\theta\omega}$ in Equation (4) will inherently be high. The rest of this section discusses different ways to overcome this problem. After estimation of $\alpha$ and $\mathbf{b}$, the estimated coefficient function is computed as $\hat\beta(t) = \boldsymbol\omega^T(t)\hat{\mathbf{b}}$, and predictive probabilities $\hat\pi(x)$ can be computed and used for classification of new curves $x$.

2.1. Functional logistic regression using wavelet basis and LASSO penalization
Zhao et al. [9] proposed to combine wavelet bases with LASSO penalization for a functional
regression with continuous response. Mousavi and Sørensen [14] modified the approach
to deal with classification of functional data, using a multinomial functional regression
model and converting the infinite-dimensional problem to a finite-dimensional problem
with a sparse matrix of wavelet coefficients. The method was applied to the lameness data
(see Section 4) and phoneme data from the literature. In the present paper we will use the
method for binary responses, and accordingly refer to the method as a functional logistic
regression with wavelets and LASSO (FLRWLASSO).
Let $\varphi(t)$ and $\psi(t)$ be a father and mother wavelet, respectively, linked by the relationship $\psi(t) = \sum_{k\in\mathbb{Z}} g_k\sqrt{2}\,\varphi(2t - k)$. The set of coefficients $G = \{g_k\}_{k\in\mathbb{Z}}$ are high-pass filter coefficients. For a given mother wavelet, the wavelet basis $\{\psi_{j,k}(t)\}_{j,k\in\mathbb{Z}}$ can be constructed by dilation and translation as

$$\psi_{j,k}(t) = 2^{j/2}\,\psi(2^j t - k),$$

and similarly for the father wavelet. The indices $j$ and $k$ represent dilation and translation, respectively. For a given function $f$ and a fixed detail level $j_0$, the basis expansion of $f$ in terms of the wavelet basis is given by

$$f_{j_0}(t) = \sum_{k=0}^{2^{j_0}-1} c_{j_0,k}\,\varphi_{j_0,k}(t) + \sum_{j=j_0}^{J-1}\sum_{k=0}^{2^j-1} d_{j,k}\,\psi_{j,k}(t), \qquad (5)$$

where $N = 2^J$ is the number of observations of the function $f$; note that $N$ must be a power of 2. A wavelet has $p$ vanishing moments if the associated scaling function can generate polynomials up to degree $p - 1$. It is well known that complex functions can often be represented with a sparser set of wavelet coefficients when the wavelet has more vanishing moments.

If we assume that the curves $x_i$ and $\beta$ belong to the finite-dimensional space generated by the wavelet basis, then $x_i$ and $\beta$ can be expressed in terms of the father and mother wavelets at detail level $j_0$ as follows:

$$x_{i,j_0}(t) = \sum_{k=0}^{2^{j_0}-1} c_{i,j_0,k}\,\varphi_{j_0,k}(t) + \sum_{j=j_0}^{J-1}\sum_{k=0}^{2^j-1} d_{i,j,k}\,\psi_{j,k}(t), \quad i = 1, \ldots, n, \qquad (6)$$

$$\beta_{j_0}(t) = \sum_{k=0}^{2^{j_0}-1} c^*_{j_0,k}\,\varphi_{j_0,k}(t) + \sum_{j=j_0}^{J-1}\sum_{k=0}^{2^j-1} d^*_{j,k}\,\psi_{j,k}(t).$$

With these basis expansions for $x_i$ and $\beta$, and due to orthonormality of the wavelet basis functions, the linear predictor from Equation (2) becomes

$$\eta_i = \alpha + \int_T x_i(t)\beta(t)\,dt = \alpha + \sum_{k=0}^{2^{j_0}-1} c_{i,j_0,k}\,c^*_{j_0,k} + \sum_{j=j_0}^{J-1}\sum_{k=0}^{2^j-1} d_{i,j,k}\,d^*_{j,k} = \alpha + B_i\gamma,$$

where $B_i$ is a row vector of length $N$ involving the father and mother wavelet coefficients of signal $x_i$, and $\gamma$ is the vector of $c^*$ and $d^*$ coefficients for the parameter function $\beta$. In matrix form, with $B$ being the $n \times N$ matrix with rows equal to $B_i$, we may write $\boldsymbol\eta = \alpha\mathbf{1} + B\gamma$. The likelihood function is

$$L(\alpha, \gamma) = \prod_{i=1}^{n} \frac{e^{y_i(\alpha + B_i\gamma)}}{1 + e^{\alpha + B_i\gamma}}. \qquad (7)$$

We use the least asymmetric Daubechies wavelets for our analyses, as these orthonormal wavelets are compactly supported, flexible, and smoother than Haar wavelets. Notice that we use the same basis for the data functions and the coefficient function, and that no regularization or smoothing has been imposed so far. The number of unknown parameters, including the intercept, is $N + 1$. With functional data the number of observations for each subject is often much larger than the number of subjects ($N \gg n$). In these situations unpenalized maximum likelihood estimation is not possible, as the likelihood equation has several solutions. Moreover, overfitting makes the interpretation of estimates difficult. Therefore we need to apply regularization. It is common to use either a ridge penalty or a LASSO penalty, adding an $L_2$ or $L_1$ penalty term on the coefficients to the log-likelihood.
It is well known that ridge regression reduces the variability and improves the accuracy of the estimates, as the coefficients are shrunk towards zero. This is of particular value in the presence of multicollinearity. However, ridge regression does not perform variable selection and does not provide a parsimonious model with few parameters. On the other hand, the Least Absolute Shrinkage and Selection Operator (LASSO) coined by Tibshirani [16] shrinks some of the coefficients all the way to zero, thereby delivering a sparse solution with just a few non-zero coefficients. In other words, variable selection is effectively performed, a feature that combines well with wavelets as they are known to offer sparse, yet precise, representations of many types of functions.

We therefore use the LASSO, adding an $L_1$ penalty on the $\gamma$ coefficients to the minus log-likelihood, so the objective function becomes

$$Q(\alpha, \gamma) = -\log L(\alpha, \gamma) + \lambda\sum_{r=1}^{N}|\gamma_r| = -\sum_{i=1}^{n}\left(y_i(\alpha + B_i\gamma) - \log(1 + e^{\alpha + B_i\gamma})\right) + \lambda\sum_{r=1}^{N}|\gamma_r|,$$

which must be minimized with respect to $\alpha$ and $\gamma$. The parameter $\lambda$ is a tuning parameter that controls the amount of shrinkage and is selected through cross-validation on training data using a deviance criterion. Keep in mind that the objective function $Q$ also depends on the specific wavelet basis, that is, on the number of vanishing moments and the detail level parameter $j_0$. Both can be chosen via cross-validation.

There is no closed-form solution to the minimization problem, so we need to employ an optimization algorithm. Various optimization algorithms have been suggested, such as quadratic programming [16], the LARS algorithm [17], and the coordinate descent algorithm [18]. The coordinate descent algorithm has advantages, since it can be performed in $O(nN)$ operations and handles a number of variables as large as 1,000,000 in practice. We have used the cv.glmnet function in the R package glmnet [19] for our numerical studies. It performs cross validation for selection of the tuning parameter $\lambda$ and is repeated for various vanishing moments and detail levels.
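To make the workflow concrete, here is a minimal sketch of an FLRWLASSO-type fit, assuming the curves are stored in the rows of an n × N matrix X (N a power of 2) with binary responses y. The waveslim package and the 'la8' least asymmetric Daubechies filter are our illustrative choices, not necessarily the exact configuration used in the paper:

```r
## Sketch: wavelet coefficients + LASSO-penalized logistic regression.
library(waveslim)   # discrete wavelet transform (assumed choice)
library(glmnet)     # LASSO-penalized GLMs

J  <- log2(ncol(X))  # X and y are assumed given, see the lead-in
j0 <- 2              # detail level, chosen by cross validation in the paper
B  <- t(apply(X, 1, function(x)
        unlist(dwt(x, wf = "la8", n.levels = J - j0,
                   boundary = "periodic"))))      # n x N coefficient matrix

## Cross-validated LASSO with the deviance criterion.
cvfit <- cv.glmnet(B, y, family = "binomial", type.measure = "deviance")
phat  <- predict(cvfit, newx = B, s = "lambda.min", type = "response")
```

The estimated coefficient function would then be recovered (up to a normalization constant) by applying the inverse DWT to the fitted γ coefficients.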

2.2. Functional principal component approach


We now turn to the principal component approach, where the basis functions are eigenfunctions estimated from the data. First, consider the $n$ individual trajectories as independent realizations from a random process $\{X(t), t \in T\}$ with mean function $\mu(t)$ and covariance function $\Gamma_X(s, t) = \operatorname{Cov}(X(s), X(t))$. Mercer's theorem gives us the eigen-decomposition of the covariance function, $\Gamma_X(s, t) = \sum_{k=1}^{\infty}\lambda_k\theta_k(s)\theta_k(t)$, where $\theta_k$ and $\lambda_1 \ge \lambda_2 \ge \cdots$ are the orthogonal eigenfunctions and ordered eigenvalues, respectively. The Karhunen–Loève expansion of the random function $X_i$ can be written as

$$X_i(t) = \mu(t) + \sum_{k=1}^{\infty}\xi_{ik}\,\theta_k(t), \qquad (8)$$

where $\xi_{ik}$ is the $k$th FPC score for the $i$th individual and is given by $\xi_{ik} = \int_T (X_i(t) - \mu(t))\,\theta_k(t)\,dt$. The FPC score thus measures similarity between the deviation of $X_i$ from the mean function and the $k$th eigenfunction. The FPC scores are uncorrelated random variables with mean 0 and $\operatorname{var}(\xi_{ik}) = \lambda_k$.
Next, express the coefficient function $\beta$ and the covariate function $X_i$ as truncated expansions in the Karhunen–Loève basis:

$$\beta(t) = \sum_{k=1}^{K} b_k\,\theta_k(t), \qquad X_i(t) = \mu(t) + \sum_{k=1}^{K}\xi_{ik}\,\theta_k(t).$$

In particular, the same basis is used for $\beta$ and the $X_i$'s, and regularization is imposed through the number of eigenfunctions, $K$, used in the expansions. The cumulative percent variance method or cross validation can be used to choose $K$. Thanks to the orthogonality of the eigenfunctions $\theta_k(t)$, Equation (2) becomes

$$\eta_i = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \tilde\alpha + \sum_{k=1}^{K} b_k\,\xi_{ik}. \qquad (9)$$

Here $\tilde\alpha = \alpha + \int_T \beta(t)\bar X(t)\,dt$; that is, the mean function $\bar X(t) = (1/n)\sum_{i=1}^{n} X_i(t)$ is absorbed into the intercept. This model can be written in matrix form as $\boldsymbol\eta = \tilde\alpha\mathbf{1} + \xi\mathbf{b}$, where $\mathbf{b} = (b_1, \ldots, b_K)^T$, and $\xi = (\xi_{ik})_{n\times K}$ is the matrix of FPC scores.
Once the FPC scores have been estimated, the unknown coefficients in b can be esti-
mated with a multiple logistic regression. From now on we refer to this approach as
functional logistic regression based on functional principal component analysis (FLRF-
PCA).
It remains to be explained how the FPC scores are computed. In principle, with estimates of the eigenfunctions $\{\hat\theta_k(t)\}_{k=1}^{K}$ and the mean function $\bar x(t)$, the FPC scores can be obtained by numerical integration as follows:

$$\hat\xi_{ik} = \int_T (X_i(t) - \bar x(t))\,\hat\theta_k(t)\,dt \approx \frac{1}{N}\sum_{j=1}^{N}(x_i(t_j) - \bar x(t_j))\,\hat\theta_k(t_j). \qquad (10)$$

These estimates are precise when the observations are dense, but for sparse data numerical integration is no longer appropriate. As an alternative, Yao et al. [6] suggested principal analysis via conditional expectation (PACE) to handle sparse and irregular longitudinal data. The best prediction of the FPC scores for the $i$th subject is the conditional expectation, which under Gaussian assumptions is given by [20]

$$\tilde\xi_{ik} = E[\xi_{ik} \mid X_i] = \lambda_k\,\theta_{ik}^T\,\Sigma_{X_i}^{-1}(X_i - \mu).$$

PACE gives better results for sparse and/or irregular functional data, and we use this approach in our simulations and application in Sections 3 and 4.
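The following is a minimal sketch of the FLRFPCA pipeline using the fdapace package, which implements conditional-expectation (PACE) scores. The paper does not tie itself to this implementation, so the interface details below are assumptions:

```r
## Sketch: PACE scores followed by ordinary logistic regression.
library(fdapace)

## Curves in the rows of the n x N matrix X, common grid 'tgrid', binary y.
Ly <- lapply(seq_len(nrow(X)), function(i) X[i, ])
Lt <- replicate(nrow(X), tgrid, simplify = FALSE)

fpc <- FPCA(Ly, Lt,
            optns = list(methodXi = "CE",       # conditional expectation
                         FVEthreshold = 0.99))  # fraction of variance explained

scores <- fpc$xiEst                            # n x K matrix of FPC scores
fit    <- glm(y ~ scores, family = binomial)   # model (9)
phat   <- fitted(fit)                          # predicted success probabilities
```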

2.3. Penalized functional logistic regression


Penalized functional logistic regression (PFLR) was discussed by Goldsmith et al. [21] for generalized functional linear models. The procedure consists of three steps. First, the random functions $X_i$ are approximated by the finite series expansion $X_i(t) = \sum_{k=1}^{K_x} c_{ik}\,\theta_k(t)$, where $\boldsymbol\theta(t) = \{\theta_1(t), \ldots, \theta_{K_x}(t)\}$ is the set of the first $K_x$ eigenfunctions of the smoothed covariance function $\Gamma_X(s, t) = \operatorname{Cov}[X_i(s), X_i(t)]$, cf. Section 2.2. Second, a truncated power series basis or a B-spline basis is used to represent the coefficient function, hence $\beta(t) = \sum_{k=1}^{K_\beta} b_k\,\omega_k(t) = \boldsymbol\omega^T(t)\mathbf{b}$ for the selected basis $\boldsymbol\omega(t) = \{\omega_1(t), \ldots, \omega_{K_\beta}(t)\}$. Third, a penalized log-likelihood based on a linear mixed model is minimized. Notice that the first step in PFLR is identical to the first step in FLRFPCA, so the difference lies in the expansion of $\beta$ and the penalization.

With these expansions the linear predictors are

$$l_i = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \alpha + \int_T X_i(t)\beta(t)\,dt = \alpha + \mathbf{c}_i^T J_{\theta\omega}\mathbf{b}, \qquad (11)$$

where $\mathbf{c}_i = (c_{i1}, \ldots, c_{iK_x})^T$, $\mathbf{b} = (b_1, \ldots, b_{K_\beta})^T$, and $J_{\theta\omega} = \int_T \boldsymbol\theta(t)\boldsymbol\omega^T(t)\,dt$ have dimensions $K_x$, $K_\beta$, and $K_x \times K_\beta$, respectively. Following the smoothing literature, $K_\beta$ is chosen large, with $K_x \ge K_\beta$ due to identifiability constraints [1, Chapter 15].
Two different approaches have been discussed in the literature on penalized splines: (1) P-splines, that is, B-splines with equally spaced knots and difference penalties [22], and (2) a truncated power series basis with unequally spaced knots, usually based on the quantiles of the observation time points, combined with a ridge penalty [23]. Eilers and Marx [24] showed that B-splines and difference penalties are easily adapted to smoothing of periodic data. This can be done by wrapping the basis functions around from the 'end' to the 'beginning' and changing the difference penalty in a similar way. They also mention that there is no evidence of any advantage of penalized truncated power series bases over penalized B-splines.
The original PFR paper [21] used a truncated power series basis for $\beta$, whereas PFR with B-splines and difference penalties was implemented in a modified version of the R refund package [25]. We use the refund implementation for our numerical studies. In principle, a basis other than the FPC basis could be used for the covariate functions; however, the FPC approach is fast, works for sparse and irregular data, and alternative expansions have not been implemented in the refund package.
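A minimal sketch of a PFLR-type fit with the current refund interface is given below. Note that the paper used a modified, earlier version of refund, so this is an approximation of the workflow rather than the authors' exact code: lf() sets up a penalized B-spline expansion of β(t) (here with K_β = 30 basis functions), and estimation proceeds via a mixed-model representation:

```r
## Sketch: penalized functional logistic regression with refund::pfr.
library(refund)

## Curves in the rows of X, observed on the grid 'tgrid'; binary y.
fit <- pfr(y ~ lf(X, argvals = tgrid, bs = "ps", k = 30),
           family = binomial())

phat    <- predict(fit, type = "response")  # fitted probabilities
betahat <- coef(fit)                        # estimated beta(t) on a grid
```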

3. Simulation studies
In this section, two simulation studies are described which evaluate and compare the afore-
mentioned approaches. In the first study, binary outcomes were simulated according to a
functional regression model conditional on the functional covariates, whereas in the sec-
ond study the functions were simulated conditional on the group membership. Both data
generating processes are inspired by the literature on functional regression.

3.1. Simulation from a functional logistic regression model


Our first simulation approach consists of two steps: we first generated iid functional covariates $X_i$ and then generated the response $Y_i$ from a logistic regression model with a fixed $\beta$. In other words, the simulation model is in accordance with the model used for classification. More specifically, for the first step, we considered $T = [0, 10]$ and $2^8 = 256$ equally spaced time points $\{t_j \in [0, 10], j = 1, 2, \ldots, 256\}$. We generated 150 functional predictors using basis expansions of the form

$$X_i(t_j) = \sum_{k=1}^{13} c_{ik}\,\phi_k(t_j), \quad i = 1, 2, \ldots, 150, \; j = 1, 2, \ldots, 256. \qquad (12)$$

The basis functions $\{\phi_k(t)\}_{k=1}^{13}$ are cubic B-splines corresponding to nine equally spaced interior knots over the interval [0, 10], and the $c_{ik}$ are random basis coefficients simulated
as follows: the $150 \times 13$ matrix $C$ is a product $ZU$, where $Z$ is a $150 \times 13$ matrix of iid standard normal variables, and $U$ is a $13 \times 13$ matrix of iid random values with uniform distribution on [0, 1]. This method of generating the functional covariates is adapted from the work of Escabias et al. [26]. We allowed for measurement errors on the functional data and thus considered the curves contaminated with noise, that is, $W_i(t_j) = X_i(t_j) + \delta_i(t_j)$ where $\delta_i(t_j) \sim N(0, \sigma_X^2)$. We used $\sigma_X = 0$ (no noise) and $\sigma_X = 0.5$ as standard deviation in our study. The left panel of Figure 1 shows a sample of 10 random functional covariates $X_i$. For two curves the $W$ process is also shown (dashed curves).
In the second step the binary responses $Y_i$ were generated by the following model:

$$\operatorname{logit}\Pr\{Y_i = 1 \mid X_i\} = \int_0^{10} X_i(t)\beta(t)\,dt, \quad i = 1, 2, \ldots, 150, \qquad (13)$$

where $\beta$ is a known function defined on [0, 10]. Notice that the intercept $\alpha$ was set to zero. We considered two different choices of true parameter functions, namely $\beta_1(t) = \sin(t\pi/3)$ and $\beta_2(t) = -p(t \mid 2, 0.3) + 3p(t \mid 5, 0.4) + p(t \mid 7.5, 0.5)$, where $p(\cdot \mid \mu, \sigma)$ represents the normal density with mean $\mu$ and standard deviation $\sigma$. These functions are adapted from the work of Goldsmith et al. [8], with minor changes in order to generate datasets compatible with the binary model, and they are displayed in the right panel of Figure 1. The curves in the left part of the figure are coloured according to the value of $Y_i$ as obtained from (13) with $\beta_1$ as true coefficient function.
For each of $\beta_1$ and $\beta_2$ we simulated 100 datasets $\{Y_i, X_i(t_j), W_i(t_j),\; j = 1, \ldots, 256,\; i = 1, \ldots, 150\}$ of the above type. For each dataset, 100 observations were used as training data to fit the logistic model, and the fitted parameter function was extracted. The remaining 50 observations were used as test data; that is, the response was considered unknown and the fitted model was used to predict it. We used the cut-off point 0.5 for prediction, that is, assigned the value 1 to $Y$ if $\hat\pi(x) \ge 0.5$ and 0 otherwise.
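The data-generating mechanism of this section can be sketched in a few lines of R. The B-spline construction via splines::bs (nine interior knots plus intercept giving 13 cubic basis functions), the seed, and the Riemann-sum integration are our illustrative choices:

```r
## Sketch of the simulation in Section 3.1 with beta_1 as true coefficient.
library(splines)

set.seed(1)                                # hypothetical seed
n <- 150; N <- 256
tgrid <- seq(0, 10, length.out = N)
knots <- seq(0, 10, length.out = 11)[2:10]                       # 9 interior knots
Phi   <- bs(tgrid, knots = knots, degree = 3, intercept = TRUE)  # N x 13 basis

Z <- matrix(rnorm(n * 13), n, 13)
U <- matrix(runif(13 * 13), 13, 13)
C <- Z %*% U                     # random basis coefficients
X <- C %*% t(Phi)                # n x N matrix of covariate curves, eq. (12)

beta1 <- sin(tgrid * pi / 3)                      # true coefficient function
eta   <- X %*% beta1 * (10 / N)                   # Riemann sum for eq. (13)
y     <- rbinom(n, 1, plogis(eta))                # binary responses
W     <- X + matrix(rnorm(n * N, sd = 0.5), n, N) # noise-contaminated curves
```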

Figure 1. Left: Sample of 10 random functional covariates Xi generated from (12). Curves are coloured
according to the outcome of Y (black and red for 0 and 1, respectively), generated from (13) with param-
eter function β1 . For the two highlighted curves the W process is also shown (dashed curves). Right:
Parameter functions used for simulation of the response. (Colour online. Black and White in print.)

All three approaches from Section 2 (FLRFPCA, PFLR, and FLRWLASSO) were tested for both $\beta_1$ and $\beta_2$, with and without measurement noise. In the FLRFPCA approach the number of eigenfunctions was selected via cross validation, which in practice gave around 5–7 (most often six) eigenfunctions. In the FLRWLASSO approach the detail level $j_0$ and the number of vanishing moments were selected by cross validation as well, resulting in 2 and 6, respectively. For the PFLR method, we chose $K_x$ such that 99% of the variance was explained and used $K_\beta = 30$, that is, 30 basis functions to represent the parameter function. This is deemed large enough to capture relevant features (and penalization ensures smoothness).

Boxplots of misclassification rates (MCRs) for test data are shown in Figure 2, and the average MCRs are listed in Table 1. As expected, MCRs are larger for test data compared to training data (not shown). The MCRs are smaller for datasets generated by $\beta_2$ compared to $\beta_1$, and the MCRs increase only slightly when data are contaminated with noise. The MCRs are similar for the three methods, but FLRFPCA has the largest MCR in all four scenarios.


Figure 2. Boxplots of the MCRs for test data in the different scenarios and for each of the three methods.
Each boxplot is based on 100 simulated datasets, each consisting of 50 test curves. (Colour online. Black
and White in print.)

Table 1. Average misclassification rate (over 100 simulated datasets) in per cent for each method and each simulation scenario.

            FLRFPCA               PFLR                  FLRWLASSO
        σX = 0   σX = 0.5     σX = 0   σX = 0.5     σX = 0   σX = 0.5
β1       19.74      20.20      18.60      18.94      19.02      19.10
β2       12.14      13.10      11.36      11.84      10.56      11.06

Note: Data were simulated as described in Section 3.1.

Figure 3. Estimated coefficient functions for 100 simulated datasets with true coefficient function β1 (top) and β2 (bottom). Mean functions over the 100 estimated coefficient functions are shown with bold lines. Two datasets generated with β1 were excluded for FLRFPCA and PFLR, and two datasets generated with β2 were excluded for FLRFPCA; these datasets gave estimated values up to ±1500 and made the plots less illustrative. No datasets were removed for FLRWLASSO.

Recall that the simulation model and the regression model are of the same type for this simulation scenario (the regression model is true), and another aspect is therefore the ability of the different methods to reproduce the parameter functions $\beta$. Unfortunately, it turns out that none of the methods does a good job in that respect; see Figure 3 and compare to the right part of Figure 1. PFLR and FLRWLASSO estimate the shape of $\beta_1$ reasonably well, but the sharpness of the positive peak in $\beta_2$ is not reconstructed by the methods (FLRWLASSO performs slightly better in this respect). Moreover, the scale is incorrect for both methods. FLRFPCA produces irregular estimates and does not even catch the shape of the coefficient functions. We conclude that the estimates are not reliable as estimates of the true data generating mechanism. However, the estimates can still be useful as they define the prediction model; in particular they indicate which parts of the domain are most important for the covariate-response association.

3.2. Stratified simulation


In the second simulation study we simulated curves from two different distributions corresponding to $Y = 0$ and $Y = 1$, respectively. This simulation set-up is adopted from the work of Aguilera et al. [27]. Each simulated dataset consisted of a total of 250 curves: 125 random curves were generated as $x(t) = u h_1(t) + (1 - u)h_2(t) + \epsilon(t)$, and 125 random curves were generated as $x(t) = u h_1(t) + (1 - u)h_3(t) + \epsilon(t)$. Here $u$ and $\epsilon(t)$ are iid uniform and standard normal random variables, respectively, and $h_1(t) = \max\{6 - |t - 11|, 0\}$, $h_2(t) = h_1(t - 3)$, and $h_3(t) = h_1(t + 3)$. The sample curves were generated at 101 equally spaced timepoints on the interval [1, 21], and the binary response $Y$ was set to 0 for curves belonging to the first class and 1 for curves from the second class. Notice that there is no true parameter function in this set-up, as the curves were simulated conditionally on the response (not the opposite). Figure 4 shows a sample of 30 random functional covariates $X_i$.

Figure 4. A sample of 30 random signals generated from two different distributions. Curves are coloured according to the classes. (Colour online. Black and White in print.)
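A minimal sketch of this data-generating process follows; the per-curve uniform u and the pointwise standard normal noise are as described above:

```r
## Sketch of the stratified simulation in Section 3.2.
tgrid <- seq(1, 21, length.out = 101)   # 101 equally spaced time points
h1 <- function(t) pmax(6 - abs(t - 11), 0)
h2 <- function(t) h1(t - 3)
h3 <- function(t) h1(t + 3)

gen_curve <- function(h) {
  u <- runif(1)                                       # one u per curve
  u * h1(tgrid) + (1 - u) * h(tgrid) + rnorm(length(tgrid))
}

X0 <- t(replicate(125, gen_curve(h2)))   # class Y = 0
X1 <- t(replicate(125, gen_curve(h3)))   # class Y = 1
X  <- rbind(X0, X1)
y  <- rep(0:1, each = 125)
```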
We divided each dataset into training data (150 curves) and test data (100 curves), and
proceeded as in Section 3.1. A total of 200 such datasets were simulated, and the average
MCRs were 4.6% for FLRFPCA, 4.2% for PFLR, and 2.0% for FLRWLASSO. Hence, from
a classification point of view FLRWLASSO performs better than FLRFPCA and PFLR in
this set-up. Recall that there is no true β in this simulation scenario. Nevertheless, each
procedure produces an estimate of β, and it turns out that FLRWLASSO leads to stable
estimates whereas the estimates from FLRFPCA and PFLR are extremely variable.

3.3. Summary of simulation studies


Our conclusions based on the simulation studies are the following: (1) Functional logis-
tic regression can be useful for classification – even if the true data generating model is
not a functional logistic regression model; (2) the classification results do not differ much
between the three considered approaches but FLRWLASSO performs better at least for the
second simulation study; (3) the estimated regression model is not necessarily reliable as an
estimate of the true conditional distribution of the response given the curve, but can be use-
ful for identification of intervals with strong association between curve and response, and
FLRWLASSO is the most reliable method in that respect. Finally, computation time is an important aspect, in particular if one wants to use bootstrap or other resampling methods for statistical inference. As will be illustrated in Section 4, FLRWLASSO is roughly 25 times faster than the other methods, which gives it a large advantage.

4. Lameness detection for horses


Lameness is a common problem for sport horses. Detection of lameness at an early stage
could prevent chronic lameness [28,29], but low-degree lameness is difficult to detect with
clinical inspection, so supplementary methods for lameness detection and identification
of the lame limb would be welcome.
Walk, trot, and canter are the most common gaits, and the first two are symmetric. It
is well known that lameness disturbs the symmetry, so continuous monitoring of activi-
ties from these gaits would be expected to be informative about the lameness status of the
horse. Horses can be monitored with accelerometers, which measure activity through electrical signals that can be converted to proxy measurements for acceleration. Thomsen et al. [29] recorded acceleration data from trotting horses with this technology, with and without induced lameness. In the following we give a brief description of these data and then apply FLRFPCA, PFLR, and FLRWLASSO for supervised classification in three different scenarios.

4.1. Data collection and lameness groups


A 10G, three-axis accelerometer was used to record the acceleration signal in three directions (vertical, transversal, and longitudinal). The accelerometer was placed on the lowest point of the horse's back, which is the surface location closest to the body's centre of mass. More details on the data collection process can be found in [29,30].
Eight horses with no indication of lameness were used in two sub-experiments to generate a total of 85 acceleration signals in five lameness groups. Lameness was induced mechanically by equipping the horse with a modified horseshoe with a screw eliciting pressure on the sole of the hoof. The shoe was attached to one of the four hooves, and horses were also tested without the shoe, amounting to five groups.

Experience shows that acceleration is similar for lameness on limbs from the same diagonal pair. Therefore, and in order to have larger groups, we only consider three groups in the following: normal (NO), consisting of 23 signals from horses with no shoe attached; left diagonal (LD), consisting of 30 signals from horses with the shoe attached to the left-fore or right-hind limb; and right diagonal (RD), consisting of 32 signals from horses with the shoe attached to the right-fore or left-hind limb. We will study three scenarios:

(1) Left diagonal vs. right diagonal (LD/RD).


(2) Normal vs. left diagonal (NO/LD).
(3) Normal vs. right diagonal (NO/RD).

We will only use the acceleration in the vertical direction in the current study. The raw
acceleration signals consist of data from eight complete gait cycles (between 1121 and 1440
observations). Prior to the classification analysis we carried out several pre-processing steps
in order to reduce variation between gait cycles in each signal and variation between signals
due to different timing at the beginning. These steps of pre-processing were explained in
[31, Paper I, Appendix A].
After pre-processing the data consist of 85 signals on (0, 1), and each signal represents
one gait cycle. Importantly, and for all signals, the first half corresponds to stance on the

Figure 5. Vertical acceleration signals (grey lines) from the normal (NO), left diagonal (LD), and right
diagonal (RD) groups. The bold lines are the average curves.

right diagonal whereas the second half corresponds to stance on the left diagonal. The
signals are shown in Figure 5, divided into groups. A close look reveals that for the healthy
horses (NO), the first and second halves are similar, whereas this symmetry is disturbed
in groups LD and RD. More specifically, signals from the RD group generally have smaller
amplitude on (0, 0.5) compared to (0.5, 1), and vice versa for the LD group. This is because
horses tend to put less pressure, and thus generate less upward acceleration, when they
stand on the lame compared to the healthy diagonal.
Notice that the independence assumption is not completely appropriate since there are
several signals per horse. However, simulations in [14] indicated that correlation among
signals from the same horse does not influence results much, so we proceed without
incorporating dependence in the analyses.

4.2. Analysis
The primary aim of the analysis is classification, but we will also comment on estimates of the parameter function $\beta$ and on execution times. The tuning parameters for FLRWLASSO were selected via cross validation, which gave seven vanishing moments for LD/RD and NO/LD, nine vanishing moments for NO/RD, and $j_0 = 0$ for all scenarios. The number of eigenfunctions for FLRFPCA was selected via cross validation, which led to 12, 14, and 16 eigenfunctions in scenarios 1, 2, and 3, respectively. For the PFLR approach, 30 B-spline functions were used to expand the parameter function $\beta$, and the number of eigenfunctions for the covariate functions was chosen such that 99% of the variance was explained.
First, as is common for evaluation in supervised classification, we ran a leave-one-curve-
out analysis. That is, all data (signals and groups) except for the ith observation were used
as training data to fit the model, whereafter signal i was used as test data, and the prediction
was compared to the true group. This was repeated for all signals (i = 1, 2, . . . , n where n
is the number of curves in the scenario under consideration). The results are displayed in Tables 2 and 3.

Table 2. Results from the leave-one-curve-out analysis. Rows give the true group; columns give the predicted group.

Scenario LD vs RD
                FLRFPCA        PFLR           FLRWLASSO
True group      LD    RD       LD    RD       LD    RD
LD              28     2       29     1       30     0
RD               0    32        3    29        2    30

Scenario NO vs LD
                FLRFPCA        PFLR           FLRWLASSO
True group      NO    LD       NO    LD       NO    LD
NO              23     0       20     3       22     1
LD               7    23        3    27        2    28

Scenario NO vs RD
                FLRFPCA        PFLR           FLRWLASSO
True group      NO    RD       NO    RD       NO    RD
NO              23     0       21     2       21     2
RD               6    26        3    29        1    31

Table 3. MCRs in per cent based on leave-one-curve-out classification.

Scenario    FLRFPCA    PFLR    FLRWLASSO
LD vs RD       3.23    6.40         3.23
NO vs LD      13.21   11.32         5.66
NO vs RD      10.90    9.09         5.45

Table 2 shows the results for each scenario, method, and group, whereas Table 3 summarizes the MCRs in per cent. From a classification point of view, we see that FLRWLASSO performs best in all scenarios.
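The leave-one-curve-out scheme itself is method-agnostic and can be sketched as below; fit_model and predict_prob are hypothetical wrappers around any of the three approaches:

```r
## Sketch: leave-one-curve-out misclassification rate.
loocv_mcr <- function(X, y, fit_model, predict_prob) {
  n <- nrow(X)
  pred <- integer(n)
  for (i in seq_len(n)) {
    fit     <- fit_model(X[-i, , drop = FALSE], y[-i])  # train without curve i
    p       <- predict_prob(fit, X[i, , drop = FALSE])  # predict left-out curve
    pred[i] <- as.integer(p >= 0.5)                     # cut-off 0.5
  }
  100 * mean(pred != y)   # MCR in per cent
}
```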
Second, let us examine the estimated parameter functions $\hat\beta$, where for each scenario all available observations have been used for estimation. The estimated functions are shown in Figure 6. The shapes of $\hat\beta$ are roughly the same for PFLR and FLRWLASSO, but the PFLR estimates are smoother, and both are much smoother than the FLRFPCA estimates. The scales of $\hat\beta$ differ a lot between estimation methods and scenarios, with FLRWLASSO being the most stable across scenarios.

For FLRWLASSO and PFLR, $\hat\beta(t)$ and $\hat\beta(t + 0.5)$ tend to have opposite signs ($0 < t < 0.5$), suggesting that the difference between the two halves of a signal is associated with the lameness status. This makes good sense. Furthermore, due to the symmetry of trot, and because all signals start with stance on the right diagonal, we would expect the behaviour of $\hat\beta$ for NO/LD on the interval (0, 0.5) to be similar to the behaviour of $\hat\beta$ for NO/RD on the interval (0.5, 1), and vice versa. This is indeed the case for FLRWLASSO when it comes to shape, and to some extent for PFLR, but not for FLRFPCA.
In order to examine the stability of the estimates we performed a bootstrap study for each scenario; that is, we constructed new datasets by sampling with replacement from the original curves and estimated $\beta$ for each bootstrap sample. Figure 7 shows the bootstrap estimates for FLRWLASSO for 285 bootstrap samples, with the estimate from the original data highlighted. More specifically, $\beta$ was estimated for 300 bootstrap datasets, and the 5% most extreme estimates were considered outliers, using the $L_2$-norm as a measure of the proximity between the estimated parameter functions from the bootstrap samples and the estimated parameter function of the original data. The estimates are fairly stable in each scenario; in particular, the bootstrap samples agree on the sign of $\beta$ in large parts of the domain. Similar computations for FLRFPCA and PFLR often resulted in huge values of $|\hat\beta(t)|$, which is in line with Figure 6, and huge variation between bootstrap samples (not shown).
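The bootstrap scheme can be sketched as follows, again with hypothetical wrappers (coef_fun returns the estimated β(t) on a grid):

```r
## Sketch: bootstrap estimates of the coefficient function.
bootstrap_betas <- function(X, y, fit_model, coef_fun, nboot = 300) {
  replicate(nboot, {
    idx <- sample(nrow(X), replace = TRUE)   # resample curves with replacement
    coef_fun(fit_model(X[idx, , drop = FALSE], y[idx]))
  })                                          # one column per bootstrap sample
}
## As in Figure 7, the 5% of columns farthest from the original-data
## estimate in L2-norm would then be discarded before plotting.
```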

Figure 6. Estimated parameter functions β̂(t) for the three different approaches and three scenarios.
All the available data have been used in each scenario.


Figure 7. Estimates of the parameter function β for the FLRWLASSO method. The highlighted curves
represent the estimate of β for the original data, and the grey curves are the estimates from 300 bootstrap
samples with the 5% most extreme estimates removed.
Table 4. Execution time in seconds for the different approaches and scenarios.

Scenario    FLRFPCA    PFLR    FLRWLASSO
LD vs RD       6.62    8.75         0.26
NO vs LD       6.92    7.66         0.26
NO vs RD       6.48    7.16         0.24

Altogether, the study of the estimated parameter functions confirms our impression from the simulation studies: the parameter estimates are unstable and not all that reliable, at least considering the scale. However, FLRWLASSO does a far better job than FLRFPCA and PFLR, and the estimated function suggests which parts of the covariate curves are associated with the binary outcome.

Finally, some comments on execution time. The computation time in seconds for fitting the model with all available data for each approach is reported in Table 4. The R implementation was executed on an i3-core Pentium processor with 4 GB of RAM. FLRWLASSO is at least 25 times faster than the other two approaches. Evidently, this is important when performing many model fits, such as in leave-one-curve-out studies or bootstrap computations.

5. Discussion
In this paper we have reviewed functional logistic regression and compared three
approaches: FLRWLASSO, FLRFPCA, and PFLR. The performance was assessed in three
directions: classification, estimation of the parameter function, and computational effi-
ciency. Based on our simulation study and the application to the lameness data, we
conclude the following:

• MCRs were similar for all three methods in the simulation scenarios, but FLRWLASSO
performed the best in the application.
• None of the three methods was convincing in its ability to reconstruct the parameter function. In particular, FLRFPCA and PFLR gave unreliable estimates, mainly due to scale problems, which also caused numerical problems in bootstrap computations. FLRWLASSO provided reasonably stable estimates that indicate which parts of the domain are most important for the covariate-response association.
• For the current R-implementations FLRWLASSO is at least 25 times faster than the
other two methods.

In summary, FLRWLASSO seems preferable to FLRFPCA and PFLR from a classification point of view, and it is also much faster and more stable regarding estimation of the parameter function. For these reasons, and in particular if bootstrap or other resampling methods are used as part of the analysis, we recommend using FLRWLASSO for functional logistic regression.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
The authors advise no direct funding is associated with the research reported in this article.

ORCID
Seyed Nourollah Mousavi http://orcid.org/0000-0002-9208-1308

References
[1] Ramsay JO, Silverman BW. Functional data analysis. 2nd ed. New York (NY): Springer; 2005.
[2] Ferraty F, Vieu P. Nonparametric functional data analysis: theory and practice. New York (NY): Springer; 2006.
[3] Horváth L, Kokoszka P. Inference for functional data with applications. New York (NY): Springer; 2012.
[4] Ramsay JO, Dalzell C. Some tools for functional data analysis. J R Stat Soc Ser B. 1991;53:539–572.
[5] Cardot H, Ferraty F, Sarda P. Functional linear model. Stat Probab Lett. 1999;45(1):11–22.
[6] Yao F, Müller H-G, Wang J-L. Functional data analysis for sparse longitudinal data. J Am Stat Assoc. 2005;100(470):577–590.
[7] James GM, Wang J, Zhu J. Functional linear regression that's interpretable. Ann Stat. 2009;37(5A):2083–2108.
[8] Goldsmith J, Crainiceanu CM, Caffo BS, et al. Penalized functional regression analysis of white-matter tract profiles in multiple sclerosis. Neuroimage. 2011;57(2):431–439.
[9] Zhao Y, Ogden RT, Reiss PT. Wavelet-based LASSO in functional linear regression. J Comput Graph Stat. 2012;21(3):600–617.
[10] Ratcliffe SJ, Heller GZ, Leader LR. Functional data analysis with application to periodically stimulated foetal heart rate data. II: functional logistic regression. Stat Med. 2002;21(8):1115–1127.
[11] Aguilera AM, Escabias M, Valderrama MJ. Using principal components for estimating logistic regression with high-dimensional multicollinear data. Comput Stat Data Anal. 2006;50(8):1905–1924.
[12] Aguilera AM, Escabias M, Valderrama MJ. Discussion of different logistic models with functional data. Application to systemic lupus erythematosus. Comput Stat Data Anal. 2008;53(1):151–163.
[13] Wei P, Tang H, Li D. Functional logistic regression approach to detecting gene by longitudinal environmental exposure interaction in a case–control study. Genet Epidemiol. 2014;38(7):638–651.
[14] Mousavi SN, Sørensen H. Multinomial functional regression with wavelets and lasso penalization. Econometrics Stat. 2017;1:150–166.
[15] Ryan TP. Modern regression methods. New York (NY): John Wiley & Sons; 2008.
[16] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267–288.
[17] Efron B, Hastie T, Johnstone I, et al. Least angle regression. Ann Stat. 2004;32(2):407–499.
[18] Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008;2:224–244.
[19] Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
[20] Bibby J, Kent J, Mardia K. Multivariate analysis. London: Harcourt Brace & Company; 1979.
[21] Goldsmith J, Bobb J, Crainiceanu CM, et al. Penalized functional regression. J Comput Graph Stat. 2011;20(4):830–851.
[22] Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Stat Sci. 1996;11:89–102.
[23] Ruppert D, Wand MP, Carroll RJ. Semiparametric regression. New York (NY): Cambridge University Press; 2003.
[24] Eilers PHC, Marx BD. Splines, knots, and penalties. Comput Stat. 2010;2(6):637–653.
[25] Goldsmith J, Scheipl F, Huang L, et al. refund: regression with functional data. R package version 0.1-16; 2016.
[26] Escabias M, Aguilera AM, Valderrama MJ. Principal component estimation of functional logistic regression: discussion of two different approaches. J Nonparametr Stat. 2004;16(3–4):365–384.
[27] Aguilera A, Aguilera-Morillo M, Escabias M, et al. Penalized spline approaches for functional principal component logit regression. In: Ferraty F, editor. Recent advances in functional data analysis and related topics. Berlin: Springer; 2011. p. 1–7.
[28] Stashak T. Examination for lameness. In: Stashak T, editor. Adams' lameness in horses. Vol. 5. Baltimore: Lippincott Williams & Wilkins; 2002. p. 113–183.
[29] Thomsen MH, Jensen AT, Sørensen H, et al. Symmetry indices based on accelerometric data in trotting horses. J Biomech. 2010;43(13):2608–2612.
[30] Sørensen H, Tolver A, Thomsen MH, et al. Quantification of symmetry for functional data with application to equine lameness classification. J Appl Stat. 2012;39(2):337–360.
[31] Mousavi SN. Analysis of functional data with focus on multinomial regression and multilevel data [PhD thesis]. University of Copenhagen; 2015.
