FuSSO: Functional Shrinkage and Selection Operator
Figure 1: (a) Model where the mapping takes in a function f and produces a real Y. (b) Model where the response Y is dependent on multiple input functions f1, ..., fp. (c) Sparse model where the response Y is dependent on a sparse subset of the input functions f1, ..., fp.
covariates and a real-valued response, the FuSSO: Functional Shrinkage and Selection Operator. No parametric assumptions are made on the nature of the input functions. We shall assume that the response is the result of a sparse set of linear combinations of the input functions and other non-parametric functions $\{g_j\}$: $Y = \sum_j \langle f_j, g_j\rangle$. The resulting method is a LASSO-like [10] estimator that effectively zeros out entire functions from consideration in regressing the response.

Our contributions are as follows. We introduce the FuSSO, an estimator for performing regression with many functional covariates and a real-valued response. Furthermore, we provide a theoretical backing of the FuSSO estimator via a proof of asymptotic sparsistency under certain conditions. We also illustrate the estimator with applications on synthetic data, as well as in regressing the age of a subject given orientation distribution function (dODF) [14] data for the subject's white matter.

2 Related Work

As previously mentioned, recently [8] explored regression with a mapping that takes in a probability density function and outputs a real value. Furthermore, [7] studies the case when both the inputs and outputs are distributions. In addition, functional analysis relates to the study of functional data [2]. In all of these works, the mappings studied take in only one functional covariate. Based on them, it is not immediately evident how to expand on these ideas to develop an estimator that simultaneously performs regression and feature selection with multiple functional covariates. To the best of our knowledge, there has been no prior work studying sparse mappings that take multiple functional inputs and produce a real-valued output.

LASSO-like regression estimators that work with functional data include the following. In [6], one has a functional output and several real-valued covariates. Here, the estimator finds a sparse set of functions to scale by the real-valued covariates to produce the functional response. Also, [17, 3] study the case when one has one functional covariate $f$ and one real-valued response that is linearly dependent on $f$ and some function $g$: $Y = \langle f, g\rangle = \int fg$. In [17] the estimator searches for sparsity across wavelet basis projection coefficients. In [3], sparsity is achieved in the time (input) domain of the $d$th derivative of $g$; i.e. $[D^d g](t) = 0$ for many values of $t$, where $D^d$ is the differential operator. Hence, roughly speaking, [17, 3] look for sparsity across the frequency and time domains, respectively, for the regressing function $g$. However, these methods do not consider the case where one has many input functional covariates $\{f_1, \ldots, f_p\}$ and needs to choose among them. That is, [17, 3] do not provide a method to select among functional covariates in a fashion analogous to how the LASSO selects among real-valued covariates.

Lastly, it is worth noting that in our estimator we will have an additive linear model, $\sum_j \langle f_j, g_j\rangle$, where we search for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function. Such a task is similar in nature to the SpAM estimator [9], in which one also has an additive model $\sum_j g_j(X_j)$ (in the dimensions of a real vector $X$) and searches for $\{g_j\}$ in a broad, non-parametric family such that many $g_j$ are the zero function. Note, though, that in the SpAM model the $\{g_j\}$ functions are applied to real covariates via function evaluation. In the FuSSO model, the $\{g_j\}$ are applied to functional covariates via an inner product; that is, FuSSO works over functional, not real-valued, covariates, unlike SpAM.
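The contrast above can be made concrete: in the model FuSSO targets, each $g_j$ enters through an inner product $\langle f_j, g_j\rangle$ rather than a pointwise evaluation, and most $g_j$ are identically zero. A minimal numerical sketch of such a sparse additive functional response (the specific functions, grid size, and noise level are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Sketch: Y = sum_j <f_j, g_j>, where all but a few g_j are the zero function.
# Inner products on [0, 1] are approximated by Riemann sums on an n-point grid.
rng = np.random.default_rng(0)
n, p = 200, 10                      # grid size, number of functional covariates
t = np.arange(1, n + 1) / n         # grid 1/n, 2/n, ..., 1

# p input functions f_1, ..., f_p evaluated on the grid (with a little noise).
f = np.array([np.sin(2 * np.pi * (j + 1) * t) + rng.normal(0, 0.1, n)
              for j in range(p)])

# Sparse regressors: only g_1 and g_2 are non-zero.
g = np.zeros((p, n))
g[0] = np.cos(2 * np.pi * t)
g[1] = t ** 2

# Y = sum_j <f_j, g_j>, with <f_j, g_j> approximated by (1/n) f_j . g_j.
Y = sum(np.dot(f[j], g[j]) / n for j in range(p))
print(Y)
```

Only the two covariates paired with non-zero $g_j$ contribute to $Y$; FuSSO's goal is to recover exactly that subset from data.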
Oliva, Poczos , Verstynen, Singh, Schneider, Yeh, Tseng
3 Model

To better understand the FuSSO's model, we draw several analogies to real-valued linear regression and the Group-LASSO [16]. Note that although for simplicity we focus on functions over a one-dimensional domain, it is straightforward to extend the estimator and results to the multidimensional case. Consider a model for typical real-valued linear regression with a data set of input/output pairs $\{(X_i, Y_i)\}_{i=1}^N$:

$$Y_i = \langle X_i, w\rangle + \epsilon_i,$$

where $Y_i \in \mathbb{R}$, $X_i \in \mathbb{R}^d$, $w \in \mathbb{R}^d$, $\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)$, and $\langle X_i, w\rangle = \sum_{j=1}^d X_{ij} w_j$. If instead one were working with functional data $\{(f^{(i)}, Y_i)\}_{i=1}^N$, where $f^{(i)} : [0,1] \mapsto \mathbb{R}$ and $f^{(i)} \in L_2[0,1]$, one may similarly consider a linear model:

$$Y_i = \langle f^{(i)}, g\rangle + \epsilon_i,$$

where $g : [0,1] \mapsto \mathbb{R}$ and $\langle f^{(i)}, g\rangle = \int_0^1 f^{(i)}(t)g(t)\,dt$. If $\Phi = \{\varphi_m\}_{m=1}^\infty$ is an orthonormal basis for $L_2[0,1]$ [11], then we have that

$$f^{(i)}(x) = \sum_{m=1}^\infty \alpha_m^{(i)} \varphi_m(x), \quad (1)$$

where $\alpha_m^{(i)} = \int_0^1 f^{(i)}(t)\varphi_m(t)\,dt$. Similarly, $g(x) = \sum_{m=1}^\infty \beta_m\varphi_m(x)$, where $\beta_m = \int_0^1 g(t)\varphi_m(t)\,dt$. Thus,

$$Y_i = \langle f^{(i)}, g\rangle + \epsilon_i = \Big\langle \sum_{m=1}^\infty \alpha_m^{(i)}\varphi_m,\ \sum_{k=1}^\infty \beta_k\varphi_k\Big\rangle + \epsilon_i = \sum_{m=1}^\infty\sum_{k=1}^\infty \alpha_m^{(i)}\beta_k\langle\varphi_m, \varphi_k\rangle + \epsilon_i = \sum_{m=1}^\infty \alpha_m^{(i)}\beta_m + \epsilon_i,$$

where the last step follows from the orthonormality of $\Phi$.

Going back to the real-valued covariate case, if instead of having one feature vector per data instance, $X_i \in \mathbb{R}^d$, one had $p$ feature vectors associated with each data instance, $\{X_{ij} \mid 1 \le j \le p,\ X_{ij} \in \mathbb{R}^d\}$, an additive linear model may be used for regression:

$$Y_i = \sum_{j=1}^p \langle X_{ij}, w_j\rangle + \epsilon_i, \quad \text{where } w_1, \ldots, w_p \in \mathbb{R}^d.$$

Similarly, in the functional case one may have $p$ functions associated with data instance $i$: $\{f_j^{(i)} \mid 1 \le j \le p,\ f_j^{(i)} \in L_2[0,1]\}$. Then, an additive linear model would be:

$$Y_i = \sum_{j=1}^p \langle f_j^{(i)}, g_j\rangle + \epsilon_i = \sum_{j=1}^p \sum_{m=1}^\infty \alpha_{jm}^{(i)}\beta_{jm} + \epsilon_i, \quad (2)$$

where $g_1, \ldots, g_p \in L_2[0,1]$, and $\alpha_{jm}^{(i)}$ and $\beta_{jm}$ are projection coefficients for $f_j^{(i)}$ and $g_j$, respectively.

Suppose that one has few observations relative to the number of features ($N \ll p$). In the real-valued case, in order to effectively find a solution for $w = (w_1^T, \ldots, w_p^T)^T$, one may search for a group-sparse solution where many $w_j = 0$. To do so, one may consider the following Group-LASSO regression:

$$w^\star = \underset{w}{\operatorname{argmin}}\ \frac{1}{2N}\Big\|Y - \sum_{j=1}^p \mathbf{X}_j w_j\Big\|^2 + \lambda_N\sum_{j=1}^p \|w_j\|, \quad (3)$$

where $\mathbf{X}_j$ is the $N \times d$ matrix $\mathbf{X}_j = [X_{1j}\ \ldots\ X_{Nj}]^T$, $Y = (Y_1, \ldots, Y_N)^T$, and $\|\cdot\|$ is the Euclidean norm.

If in the functional case (2) one also has $N \ll p$, one may set up a similar optimization to (3), whose direct analogue is:

$$g^\star = \underset{g}{\operatorname{argmin}}\ \frac{1}{2N}\sum_{i=1}^N\Big(Y_i - \sum_{j=1}^p\langle f_j^{(i)}, g_j\rangle\Big)^2 \quad (4)$$
$$\qquad\qquad +\ \lambda_N\sum_{j=1}^p \|g_j\|; \quad (5)$$

equivalently,

$$\beta^\star = \underset{\beta}{\operatorname{argmin}}\ \frac{1}{2N}\sum_{i=1}^N\Big(Y_i - \sum_{j=1}^p\sum_{m=1}^\infty \alpha_{jm}^{(i)}\beta_{jm}\Big)^2 \quad (6)$$
$$\qquad\qquad +\ \lambda_N\sum_{j=1}^p\sqrt{\sum_{m=1}^\infty \beta_{jm}^2}, \quad (7)$$

where $g = \{g_j\}_{j=1}^p = \{\sum_{m=1}^\infty \beta_{jm}\varphi_m\}_{j=1}^p$.

However, it is unfeasible to directly observe the functional inputs $\{f_j^{(i)} \mid 1 \le i \le N,\ 1 \le j \le p\}$. Thus, we shall instead assume that one observes $\{\vec{y}_j^{(i)} \mid 1 \le i \le N,\ 1 \le j \le p\}$, where

$$\vec{y}_j^{(i)} = \vec{f}_j^{(i)} + \vec{\xi}_j^{(i)}, \quad (8)$$
$$\vec{f}_j^{(i)} = \big(f_j^{(i)}(1/n),\ f_j^{(i)}(2/n),\ \ldots,\ f_j^{(i)}(1)\big)^T, \quad (9)$$
$$\vec{\xi}_j^{(i)} \overset{iid}{\sim} \mathcal{N}(0, \sigma_\xi^2 I_n). \quad (10)$$
That is, we observe a grid of $n$ noisy values for each functional input. Then, one may estimate $\alpha_{jm}^{(i)}$ as:

$$\hat{\alpha}_{jm}^{(i)} = \frac{1}{n}\vec{\varphi}_m^T\vec{y}_j^{(i)} = \frac{1}{n}\vec{\varphi}_m^T\big(\vec{f}_j^{(i)} + \vec{\xi}_j^{(i)}\big) = \bar{\alpha}_{jm}^{(i)} + \eta_{jm}^{(i)}, \quad (11)$$

where $\vec{\varphi}_m = (\varphi_m(1/n), \varphi_m(2/n), \ldots, \varphi_m(1))^T$. Furthermore, we may truncate the number of basis functions used to express $f_j^{(i)}$ to $M_n$, estimating it as:

$$\hat{f}_j^{(i)}(x) = \sum_{m=1}^{M_n}\hat{\alpha}_{jm}^{(i)}\varphi_m(x). \quad (12)$$

Using the truncated estimate (12), one has:

$$\langle \hat{f}_j^{(i)}, g_j\rangle = \sum_{m=1}^{M_n}\hat{\alpha}_{jm}^{(i)}\beta_{jm}, \quad\text{and}\quad \|\hat{f}_j^{(i)}\| = \sqrt{\sum_{m=1}^{M_n}\big(\hat{\alpha}_{jm}^{(i)}\big)^2}.$$

Hence, using the approximations (12), (7) becomes:

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}}\ \frac{1}{2N}\sum_{i=1}^N\Big(Y_i - \sum_{j=1}^p\sum_{m=1}^{M_n}\hat{\alpha}_{jm}^{(i)}\beta_{jm}\Big)^2 \quad (13)$$
$$\qquad\qquad +\ \lambda_N\sum_{j=1}^p\sqrt{\sum_{m=1}^{M_n}\beta_{jm}^2} \quad (14)$$
$$= \underset{\beta}{\operatorname{argmin}}\ \frac{1}{2N}\Big\|Y - \sum_{j=1}^p\hat{A}_j\beta_j\Big\|^2 + \lambda_N\sum_{j=1}^p\|\beta_j\|, \quad (15)$$

where $\hat{A}_j$ is the $N \times M_n$ matrix with values $\hat{A}_j(i,m) = \hat{\alpha}_{jm}^{(i)}$ and $\beta_j = (\beta_{j1}, \ldots, \beta_{jM_n})^T$. Note that one need not consider projection coefficients $\beta_{jm}$ for $m > M_n$, since such projection coefficients will not decrease the MSE term in (13) (because $\hat{\alpha}_{jm}^{(i)} = 0$ for $m > M_n$), while $\beta_{jm} \ne 0$ for $m > M_n$ increases the norm penalty term in (14). Hence, we see that our sparse functional estimates are a Group-LASSO problem on the projection coefficients.

Extensions: It is useful to note that there are several straightforward extensions to the FuSSO as presented. First, it may be possible to estimate the inner product of a function $f_j^{(i)}$ and $g_j$ as $\int f_j^{(i)}g_j \approx \langle \vec{y}_j^{(i)}, \frac{1}{n}\vec{g}_j\rangle$, where $\vec{g}_j = (g_j(1/n), \ldots, g_j(1))^T$. This effectively allows one to use a naive approach of simply running the Group-LASSO on the $\vec{y}_j^{(i)}$ feature vectors directly (we'll refer to this method as Y-GL). It is important to note, however, that Y-GL will be less robust to noise, and less adaptive (and efficient) with respect to smoothness, than the FuSSO. Furthermore, we note that it is not necessary for the FuSSO to have observations of the input functions on a grid, since one may estimate projection coefficients in the case of an irregular design [11]. Moreover, we may also estimate projection coefficients for density functions with samples drawn from the pdf. Note that Y-GL would fail to estimate our model in the irregular-design case, and would not be possible in the case where the functions are pdfs. Also, a two-stage estimator as described in [5], where one first uses the regularization penalty with a large $\lambda$ to find the support, and then solves the optimization problem with a smaller $\lambda$ on just the estimated support, may be more efficient at estimating the response. Furthermore, a problem analogous to (15) may be framed to perform logistic regression and classification.

4 Theory

Next, we show that the FuSSO is able to recover the correct sparsity pattern asymptotically; i.e., that the FuSSO estimate is sparsistent. In order to do so, we shall show that with high probability there is an optimal solution to our optimization problem (15) with the correct sparsity pattern. We follow an argument similar to [12, 9]. We shall use a witness technique to show that there is a coefficient/subgradient pair $(\hat\beta, \hat u)$ such that $\operatorname{supp}(\hat\beta) = \operatorname{supp}(\beta^*)$, for the true response-generating $\beta^*$. Let $\Omega(\beta) = \sum_{j=1}^p\|\beta_j\|_2$ be our penalty term (14). Let $S$ denote the true set of non-zero functions; i.e. $S = \{j \mid \beta_j^* \ne 0\}$, with $s = |S|$. First, we fix $\hat\beta_{S^c} = 0$, and set $\hat u_S = [\partial\Omega(\hat\beta)]_S$. Note that for a vector $\beta'$, $\partial\Omega(\beta') = \{u\}$ where $u_j = \beta_j'/\|\beta_j'\|_2$ if $\beta_j' \ne 0$, and $\|u_j\|_2 \le 1$ if $\beta_j' = 0$. We shall show that, with high probability, $\forall j \in S,\ \hat\beta_j \ne 0$ and $\forall j \in S^c,\ \|\hat u_j\|_2 < 1$, thus showing that there is an optimal solution to our optimization problem (15) that has the true sparsity pattern with high probability.

First, we elaborate on our assumptions.

4.1 Assumptions

Let $\Phi$ be the trigonometric basis: $\varphi_1(x) \equiv 1$ and, for $k \ge 2$, $\varphi_{2k}(x) \equiv \sqrt{2}\cos(2\pi k x)$, $\varphi_{2k+1}(x) \equiv \sqrt{2}\sin(2\pi k x)$.

Let $D = \{(\{\vec{y}_j^{(i)}\}_{j=1}^p, Y_i)\}_{i=1}^N$, where $\vec{y}_j^{(i)}$ is as in (8), and $Y_i = \sum_{j=1}^p\sum_{m=1}^\infty\alpha_{jm}^{(i)}\beta_{jm}^* + \epsilon_i$ as in (2). Assume that $\forall\, 1 \le i \le N,\ 1 \le j \le p$: $\alpha_j^{(i)} \in \Theta(\gamma, Q)$, where

$$\Theta(\gamma, Q) = \Big\{\theta : \sum_{k=1}^\infty c_k^2\theta_k^2 \le Q\Big\}, \qquad c_k = k^\gamma \text{ if } k \text{ is even or one},\ (k-1)^\gamma \text{ otherwise},$$
$$\alpha_j^{(i)} = \Big\{\alpha_{jm}^{(i)} \in \mathbb{R}\ \Big|\ \alpha_{jm}^{(i)} = \int_0^1 f_j^{(i)}\varphi_m,\ m \in \mathbb{N}_+\Big\}$$

for $0 < Q < \infty$ and $\frac{1}{2} < \gamma < \infty$. Furthermore, assume that for the true $\beta^*$ generating the observed responses $Y_i$, $\forall\, 1 \le j \le p$: $\beta_j^* \in \Theta(\gamma, Q)$.

Let $\hat{A}_j$ be the $N \times M_n$ matrix with entries $\hat{A}_j(i,m) = \hat{\alpha}_{jm}^{(i)}$. Let $\hat{A}_S$ denote the matrix made up of the horizontally concatenated $\hat{A}_j$ matrices with $j \in S$; i.e. $\hat{A}_S = [\hat{A}_{j_1}\ \ldots\ \hat{A}_{j_s}]$, where $\{j_1, \ldots, j_s\} = S$ and $j_i < j_k$ for $i < k$. Suppose the following:

$$\Lambda_{\max}\big(\tfrac{1}{N}A_S^T A_S\big) \le C_{\max} < \infty \quad (16)$$
$$\Lambda_{\min}\big(\tfrac{1}{N}A_S^T A_S\big) \ge C_{\min} > 0. \quad (17)$$

Also, suppose $\exists\,\epsilon \in (0,1]$ s.t. $\forall j \in S^c$:

$$\Lambda_{\max}\big(\tfrac{1}{N}A_j^T A_j\big) \le C_{\max} < \infty \quad (18)$$
$$\Big\|\big(\tfrac{1}{N}A_j^T A_S\big)\big(\tfrac{1}{N}A_S^T A_S\big)^{-1}\Big\| \le \frac{1-\epsilon}{\sqrt{s}}. \quad (19)$$

Let $\bar{A}_j$ be the $N \times M_n$ matrix with entries $\bar{\alpha}_{jm}^{(i)} = \frac{1}{n}\vec{\varphi}_m^T\vec{f}_j^{(i)}$. Let $H_j$ be the $N \times M_n$ matrix with entries $H_j(i,m) = \eta_{jm}^{(i)} = \frac{1}{n}\vec{\varphi}_m^T\vec{\xi}_j^{(i)}$. Thus, $\hat{A}_j = \bar{A}_j + H_j$. Furthermore, let $E_j = \bar{A}_j - A_j$. Then, $\hat{A}_j = A_j + E_j + H_j$.

In addition to the aforementioned assumptions, we shall further assume the following:

$$\exists\, a < 1/2 \ \text{ s.t. }\ \frac{1}{\lambda_N}\,pM_n n^{a-1/2}e^{-n^{1-2a}/2} \to 0 \quad (20)$$
$$\rho_N \equiv \min_{j\in S}\|\beta_j^*\| > 0 \quad (21)$$
$$\frac{1}{\lambda_N}\sqrt{sM_n}\big(n^{-\gamma+1/2} + n^{-a}\big) \to 0 \quad (22)$$
$$\frac{1}{\lambda_N}\Big(s^{3/2}M_n^{1/2-2\gamma} + \sqrt{\log(sM_n)/N}\Big) \to 0 \quad (23)$$
$$\lambda_N\sqrt{sM_n}\,/\,\rho_N \to 0 \quad (24)$$
$$\frac{1}{\rho_N}\Big(\sqrt{s}\,M_n n^{-\gamma+1/2} + \sqrt{s\log(N)/n}\Big) \to 0 \quad (25)$$
$$\frac{1}{\rho_N}\bigg(\frac{\sqrt{sM_n}}{n^{\gamma+a-1/2}} + \frac{\sqrt{sM_n\log(N)}}{n^{a+1/2}}\bigg) \to 0 \quad (26)$$
$$\frac{1}{\lambda_N}\sqrt{M_n\log((p-s)M_n)/N} \to 0 \quad (27)$$
$$\sqrt{s}\,\big/\big(\lambda_N N M_n^{2\gamma-1/2}\big) \to 0, \quad (28)$$

and we assume $\gamma \ge 1$ for the sake of simplification. We may further simplify our assumptions if we take $n = N^{1/2}$ and choose $M_n$ optimally for function estimation: $M_n \asymp n^{1/(2\gamma+1)} = N^{1/(4\gamma+2)}$. Furthermore, take $s = O(1)$, $\rho_N \asymp 1$, and $\gamma = 2$. Under these conditions, our assumptions reduce to $1/10 < a$ and taking the following to go to zero:

$$\frac{p}{\lambda_N}N^{\frac{10a-3}{20}}e^{-\frac{1}{2}N^{\frac{1-2a}{2}}},\quad \frac{N^{-7/20}}{\lambda_N},\quad \lambda_N N^{1/20},\quad N^{-1/10}\log(N),\quad \frac{1}{\lambda_N}\sqrt{N^{-9/10}\log(pN)}.$$

4.2 Sparsistency

Theorem 1 $\mathbb{P}\big(\hat{S}_N = S\big) \to 1$.

First, we state some lemmas, whose proofs may be found in the supplementary materials.

4.2.1 Lemmata

Lemma 1 Let $X$ be a non-negative r.v. and $C$ a measurable event; then $\mathbb{E}[X \mid C]\,\mathbb{P}(C) \le \mathbb{E}[X]$.

Lemma 2 $\frac{1}{n}\sum_{k=1}^n \varphi_m(k/n)\varphi_l(k/n) = I\{l = m\}$, for $1 \le l, m \le n-1$.

Lemma 3 Let $H_j^{(i)}$ be the rows of $H_j$; then $H_j^{(i)} \overset{iid}{\sim} \mathcal{N}\big(0, \frac{\sigma_\xi^2}{n}I\big)$, and $H_S^{(i)} \overset{iid}{\sim} \mathcal{N}\big(0, \frac{\sigma_\xi^2}{n}I\big)$.

Lemma 4 $\mathbb{P}\big(\|H\|_{\max} \ge n^{-a}\big) \le 2\,pM_n\,n^{a-1/2}\,e^{-n^{1-2a}/(2\sigma_\xi^2)}$.

Lemma 5 $\|E_j\|_{\max} \le C_Q\,n^{-\gamma+1/2}$, where $C_Q \in (0,\infty)$ is a constant depending on $Q$.

Lemma 6 $\|\beta_S^*\|_2^2 \le Qs$.

Lemma 7 $\exists\, N_0, n_0, C_{\min}', C_{\max}'$ with $0 < C_{\min}' \le C_{\max}' < \infty$, and $0 < \epsilon' \le 1$, s.t. if $\|H\|_{\max} < n^{-a}$, $N > N_0$, and $n > n_0$, then

$$\Lambda_{\max}\big(\tfrac{1}{N}\hat{A}_S^T\hat{A}_S\big) \le C_{\max}' < \infty \quad (29)$$
$$\Lambda_{\min}\big(\tfrac{1}{N}\hat{A}_S^T\hat{A}_S\big) \ge C_{\min}' > 0 \quad (30)$$
$$\forall j \in S^c,\quad \Big\|\big(\tfrac{1}{N}\hat{A}_j^T\hat{A}_S\big)\big(\tfrac{1}{N}\hat{A}_S^T\hat{A}_S\big)^{-1}\Big\| \le \frac{1-\epsilon'/2}{\sqrt{s}}. \quad (31)$$

4.2.2 Proof of Theorem 1

Proposition 1 $\mathbb{P}\big(\forall j \in S,\ \hat\beta_j \ne 0\big) \to 1$.

Proof. Recall that by (21), $\rho_N = \min_{j\in S}\|\beta_j^*\| > 0$. Thus, to prove that $\forall j \in S,\ \hat\beta_j \ne 0$, it suffices to show that $\|\hat\beta_S - \beta_S^*\| \le \frac{\rho_N}{2}$. To do so, we show that $\mathbb{P}\big(\|\hat\beta_S - \beta_S^*\| > \frac{\rho_N}{2}\big) \to 0$. Let $B$ be the event that $\|H\|_{\max} < n^{-a}$. Note that:

$$\mathbb{P}\Big(\|\hat\beta_S - \beta_S^*\| > \frac{\rho_N}{2}\Big) \le \mathbb{P}\Big(\|\hat\beta_S - \beta_S^*\| > \frac{\rho_N}{2}\ \Big|\ B\Big)\mathbb{P}(B) + \mathbb{P}(B^c).$$

Furthermore,

$$\mathbb{P}\Big(\|\hat\beta_S - \beta_S^*\| > \frac{\rho_N}{2}\ \Big|\ B\Big) \le \frac{2}{\rho_N}\,\mathbb{E}\big[\|\hat\beta_S - \beta_S^*\|\ \big|\ B\big].$$

Then, looking at the stationarity condition for the support $S$:

$$\frac{1}{N}\hat{A}_S^T\big(\hat{A}_S\hat\beta_S - Y\big) + \lambda_N\hat{u}_S = 0. \quad (32)$$
Let $V$ be the $N \times 1$ vector with entries $V_i = \sum_{j\in S}\sum_{m=M_n+1}^\infty \alpha_{jm}^{(i)}\beta_{jm}^*$; i.e. the error from truncation. Then, using (32) and $Y = A_S\beta_S^* + V + \epsilon$:

$$\frac{1}{N}\hat{A}_S^T\big(\hat{A}_S\hat\beta_S - A_S\beta_S^* - V - \epsilon\big) + \lambda_N\hat{u}_S = 0 \;\Longrightarrow\; \frac{\hat{A}_S^T}{N}\hat{A}_S(\hat\beta_S - \beta_S^*) - \frac{\hat{A}_S^T}{N}\big((A_S - \hat{A}_S)\beta_S^* + V + \epsilon\big) = -\lambda_N\hat{u}_S.$$

Thus,

$$\frac{1}{N}\hat{A}_S^T\hat{A}_S(\hat\beta_S - \beta_S^*) = -\frac{1}{N}\hat{A}_S^T(E_S + H_S)\beta_S^* + \frac{1}{N}\hat{A}_S^T V + \frac{1}{N}\hat{A}_S^T\epsilon - \lambda_N\hat{u}_S. \quad (33)$$

Let $\hat\Sigma_{SS} = \big(\frac{1}{N}\hat{A}_S^T\hat{A}_S\big)$; we see that

$$\|\hat\beta_S - \beta_S^*\| \le \big\|\hat\Sigma_{SS}^{-1}\big(\tfrac{1}{N}\hat{A}_S^T\big)(E_S + H_S)\beta_S^*\big\| + \big\|\hat\Sigma_{SS}^{-1}\big(\tfrac{1}{N}\hat{A}_S^T\big)V\big\| + \big\|\hat\Sigma_{SS}^{-1}\big(\tfrac{1}{N}\hat{A}_S^T\big)\epsilon\big\| + \lambda_N\big\|\hat\Sigma_{SS}^{-1}\hat{u}_S\big\|.$$

Thus, we proceed to bound each of these terms in expectation. First, note that

$$\big\|\hat\Sigma_{SS}^{-1}\big(\tfrac{1}{N}\hat{A}_S^T\big)(E_S+H_S)\beta_S^*\big\| \le \|\hat\Sigma_{SS}^{-1}\|\,\big\|\tfrac{1}{N}(A_S^T + E_S^T + H_S^T)(E_S+H_S)\beta_S^*\big\| \le \|\hat\Sigma_{SS}^{-1}\|\,\tfrac{1}{N}\Big(\|A_S^T(E_S+H_S)\beta_S^*\| + \|(E_S+H_S)^T(E_S+H_S)\beta_S^*\|\Big).$$

Hence,

$$\|\hat\Sigma_{SS}^{-1}\| \le \sqrt{sM_n}\,\|\hat\Sigma_{SS}^{-1}\|_2 \le \frac{\sqrt{sM_n}}{C_{\min}'}.$$

Thus,

$$\mathbb{E}\Big[\tfrac{\|\hat\Sigma_{SS}^{-1}\|}{N}\|A_S^T(E_S+H_S)\beta_S^*\|\ \Big|\ B\Big]\mathbb{P}(B) \le \frac{\sqrt{sM_n}}{C_{\min}'}\frac{\|A_S^T\|}{N}\,\mathbb{E}\big[\|(E_S+H_S)\beta_S^*\|\ \big|\ B\big]\mathbb{P}(B) \le \frac{\sqrt{Q}\sqrt{sM_n}}{C_{\min}'\sqrt{N}}\Big(\|E_S\beta_S^*\| + \mathbb{E}\big[\|H_S\beta_S^*\|\ \big|\ B\big]\mathbb{P}(B)\Big),$$

noting that $\|A_S^T\| \le \sqrt{NQ}$. Moreover, by Lemma 1: $\mathbb{E}[\|H_S\beta_S^*\| \mid B]\,\mathbb{P}(B) \le \mathbb{E}[\|H_S\beta_S^*\|]$. Also, $H_S\beta_S^*$ is normally distributed, and

$$\operatorname{Var}\big[H_S^{(i)T}\beta_S^*\big] = \sum_{j=1}^{sM_n}\operatorname{Var}\big[H_{Sj}^{(i)}\beta_{Sj}^*\big] = \frac{\sigma_\xi^2}{n}\|\beta_S^*\|_2^2 \le \frac{\sigma_\xi^2\,Qs}{n}.$$

Hence, by a Gaussian inequality (e.g. [13]) we have:

$$\mathbb{E}\big[\|H_S\beta_S^*\|_\infty\big] \le \sqrt{2\sigma_\xi^2\,Qs\log(N)/n}.$$

(Unless otherwise specified, let $X^{(i)}$ be the $i$th row of a matrix $X$ and $X_j$ its $j$th column.) Also,

$$\|E_S\beta_S^*\|_\infty = \max_{1\le i\le N}\big|E_S^{(i)T}\beta_S^*\big| \le \|\beta_S^*\|_2\max_{1\le i\le N}\big\|E_S^{(i)}\big\|_2 \le \sqrt{Qs}\;C_Q\sqrt{sM_n}\,n^{-\gamma+1/2} = \sqrt{Q}\,C_Q\,s\sqrt{M_n}\,n^{-\gamma+1/2}.$$

Thus,

$$\mathbb{E}\Big[\tfrac{1}{N}\|A_S^T(E_S+H_S)\beta_S^*\|\ \Big|\ B\Big] = O\bigg(\sqrt{sM_n}\Big(s\sqrt{M_n}\,n^{-\gamma+1/2} + \sqrt{\tfrac{s\log(N)}{n}}\Big)\bigg).$$

Furthermore,

$$\mathbb{E}\big[\|(E_S+H_S)^T(E_S+H_S)\beta_S^*\|_\infty\ \big|\ B\big]\mathbb{P}(B) = \mathbb{E}\Big[\max_{j\le sM_n}\big|(E_{Sj}+H_{Sj})^T(E_S+H_S)\beta_S^*\big|\ \Big|\ B\Big]\mathbb{P}(B) \le \mathbb{E}\Big[\max_{j\le sM_n}\|E_{Sj}+H_{Sj}\|_1\,\|(E_S+H_S)\beta_S^*\|_\infty\ \Big|\ B\Big]\mathbb{P}(B).$$

Given that $B$ occurs, $\|E_{Sj}\|_1 + \|H_{Sj}\|_1 \le N\big(C_Q n^{-\gamma+1/2} + n^{-a}\big)$ and, as before,

$$\mathbb{E}\big[\|(E_S+H_S)\beta_S^*\|_\infty\ \big|\ B\big] \le C_2\,s\sqrt{M_n}\,n^{-\gamma+1/2} + C_3\sqrt{\tfrac{s\log(N)}{n}}.$$

Moreover, given that $B$ occurs:

$$\mathbb{E}\big[\|\hat\Sigma_{SS}^{-1}\big(\tfrac{1}{N}\hat{A}_S^T\big)(E_S+H_S)\beta_S^*\|\ \big|\ B\big]\mathbb{P}(B) = O\bigg(\sqrt{sM_n}\Big(s\sqrt{M_n}\,n^{-\gamma+1/2} + \sqrt{s\log(N)/n}\Big)\big(1 + n^{-\gamma+1/2} + n^{-a}\big)\bigg) = O\bigg(\sqrt{sM_n}\Big(s\sqrt{M_n}\,n^{-\gamma+1/2} + \sqrt{s\log(N)/n}\Big)\bigg).$$

The next terms are bounded as follows (see the Supplemental Materials for proofs).

Lemma 8 $\mathbb{E}\big[\|\hat\Sigma_{SS}^{-1}\big(\frac{1}{N}\hat{A}_S^T\big)V\|\ \big|\ B\big]\mathbb{P}(B) = O\Big(\frac{s^{3/2}}{M_n^{2\gamma-1/2}}\Big)$.

Lemma 9 $\mathbb{E}\big[\|\hat\Sigma_{SS}^{-1}\big(\frac{1}{N}\hat{A}_S^T\big)\epsilon\|\ \big|\ B\big]\mathbb{P}(B) = O\big(\sqrt{\log(sM_n)/N}\big)$.

Lastly,

$$\|\hat{u}_S\|_\infty = \max_{j\in S}\|\hat{u}_j\|_\infty \le \max_{j\in S}\|\hat{u}_j\|_2 \le 1 \;\Longrightarrow\; \lambda_N\|\hat\Sigma_{SS}^{-1}\hat{u}_S\| \le \lambda_N\|\hat\Sigma_{SS}^{-1}\|\,\|\hat{u}_S\| \le \frac{\lambda_N\sqrt{sM_n}}{C_{\min}'}.$$

Keeping only leading terms,

$$\mathbb{E}\big[\|\hat\beta_S - \beta_S^*\|\ \big|\ B\big]\mathbb{P}(B) = O\Big(s^{3/2}M_n\,n^{-\gamma+1/2} + s\sqrt{M_n\log(N)/n}\Big)$$
$$+\ O\Big(s^{3/2}/M_n^{2\gamma-1/2} + \sqrt{\log(sM_n)/N} + \lambda_N\sqrt{sM_n}\Big).$$

Hence, by assumptions (20)-(24) we have

$$\mathbb{P}\Big(\|\hat\beta_S - \beta_S^*\| > \frac{\rho_N}{2}\Big) \to 0.$$

One may similarly look at the stationarity condition for $j \in S^c$ to analyze $\hat{u}_j$:

$$0 = \frac{1}{N}\hat{A}_j^T\big(\hat{A}_S\hat\beta_S - Y\big) + \lambda_N\hat{u}_j = \frac{\hat{A}_j^T}{N}\big(\hat{A}_S(\hat\beta_S - \beta_S^*) - (A_S - \hat{A}_S)\beta_S^* - V - \epsilon\big) + \lambda_N\hat{u}_j.$$

Thus,

$$\hat{u}_j = -\frac{1}{\lambda_N N}\hat{A}_j^T\hat{A}_S(\hat\beta_S - \beta_S^*) + \frac{1}{\lambda_N N}\hat{A}_j^T\big((A_S - \hat{A}_S)\beta_S^* + V + \epsilon\big)$$
$$= \frac{1}{\lambda_N}\hat\Sigma_{jS}\hat\Sigma_{SS}^{-1}\Big(\frac{1}{N}\hat{A}_S^T(E_S+H_S)\beta_S^* - \frac{1}{N}\hat{A}_S^T V - \frac{1}{N}\hat{A}_S^T\epsilon + \lambda_N\hat{u}_S\Big) - \frac{1}{\lambda_N N}\hat{A}_j^T(E_S+H_S)\beta_S^* + \frac{1}{\lambda_N N}\hat{A}_j^T(V + \epsilon),$$

where $\hat\Sigma_{jS} = \frac{1}{N}\hat{A}_j^T\hat{A}_S$, and using (33). We wish to show that $\forall j \in S^c$, $\hat{u}_j$ satisfies the KKT conditions, that is:

Proposition 2 $\mathbb{P}\big(\max_{j\in S^c}\|\hat{u}_j\|_2 < 1\big) \to 1$.

Proof. Let $\mu_j^H \equiv \mathbb{E}[\hat{u}_j \mid H]$. We proceed as follows:

$$\mathbb{P}\Big(\max_{j\in S^c}\|\hat{u}_j\|_2 < 1\Big) \ge \mathbb{P}\Big(\max_{j\in S^c}\|\mu_j^H\|_2 + \|\hat{u}_j - \mu_j^H\|_2 < 1\Big) \ge \mathbb{P}\Big(\max_{j\in S^c}\|\mu_j^H\|_2 + \sqrt{M_n}\,\|\hat{u}_j - \mu_j^H\|_\infty < 1\Big)$$
$$\ge \mathbb{P}\Big(\max_{j\in S^c}\|\mu_j^H\|_2 < 1 - \frac{\epsilon'}{2},\ \max_{j\in S^c}\|\hat{u}_j - \mu_j^H\|_\infty < \frac{\epsilon'}{2\sqrt{M_n}}\Big)$$
$$\ge 1 - \mathbb{P}\Big(\max_{j\in S^c}\|\mu_j^H\|_2 \ge 1 - \frac{\epsilon'}{2}\Big) - \mathbb{P}\Big(\max_{j\in S^c}\|\hat{u}_j - \mu_j^H\|_\infty \ge \frac{\epsilon'}{2\sqrt{M_n}}\Big).$$

We obtain the following results:

Lemma 10 $\mathbb{P}\big(\max_{j\in S^c}\|\mu_j^H\|_2 \ge 1 - \frac{\epsilon'}{2}\big) \to 0$

Lemma 11 $\mathbb{P}\big(\max_{j\in S^c}\|\hat{u}_j - \mu_j^H\|_\infty \ge \frac{\epsilon'}{2\sqrt{M_n}}\big) \to 0$

Hence, we have that $\mathbb{P}\big(\max_{j\in S^c}\|\hat{u}_j\|_2 < 1\big) \to 1$.

5 Experiments

5.1 Synthetic Data

We tested the FuSSO on synthetic data sets $D = \{(\{\vec{y}_j^{(i)}\}_{j=1}^p, Y_i)\}_{i=1}^N$ (with $\vec{y}_j^{(i)}$ as in (8)). The experiments performed were as follows. First, we fix $N$, $n$, $p$, and $s$. For $i = 1, \ldots, N$, $j = 1, \ldots, p$ we create random functions using a maximum of $M$ projection coefficients as follows: 1) set $a_{jm} \overset{iid}{\sim} \mathrm{Unif}[-1, 1]$ for $m = 1, \ldots, M$; 2) set $a_{jm} = a_{jm}/c_m^2$, where $c_m = m$ if $m = 1$ or is even, $c_m = m - 1$ if $m$ is odd; 3) set $a_{jm} = a_{jm}/\|a_j\|$; 4) set $\alpha_j^{(i)} = a_j$. (See Figures 2(b), 2(c) for typical functions.) Similarly, we generate $\beta_j$ for $j = 1, \ldots, s$; for $j = s+1, \ldots, p$, we set $\beta_j = 0$. Then, we generate $Y_i$ as $Y_i = \sum_{j=1}^p\langle \alpha_j^{(i)}, \beta_j\rangle + \epsilon_i = \sum_{j=1}^s\langle \alpha_j^{(i)}, \beta_j\rangle + \epsilon_i$, where $\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, .1)$. Also, a grid of $n$ noisy function evaluations was generated to make $\vec{y}_j^{(i)}$ as in (8), with $\sigma_\xi = .1$. These were then used to compute $\hat{\alpha}_{jm}^{(i)}$ for $m = 1, \ldots, M_n$ as in (11); $M_n$ was chosen by cross-validation. (See Figures 2(b), 2(c) for typical noisy observations and function estimates for $n = 5$ and $n = 25$, respectively.)

We fixed $s = 5$ and chose the following configurations for the other parameters: $(p, N, n) \in \{(100, 50, 5),\ (1000, 500, 25),\ (20000, 500, 25)\}$. For each $(p, N, n)$ configuration, 100 random trials were performed. We recorded $r$, the fraction of the trials in which some $\lambda$ value was able to recover the correct sparsity pattern (i.e., that only the first 5 functions are in the support). We also recorded the mean length of the range of $\lambda$ values that recovered the correct support, $\bar\lambda_r = \frac{1}{100}\sum_{t=1}^{100}\lambda_r^{(t)}$, where $\lambda_r^{(t)} = (\lambda_f^{(t)} - \lambda_l^{(t)})/\lambda_{\max}^{(t)}$, $\lambda_f^{(t)}$ is the largest $\lambda$ value found to recover the correct support in the $t$th trial, $\lambda_l^{(t)}$ the smallest such $\lambda$, and $\lambda_{\max}^{(t)}$ is the smallest $\lambda$ to produce $\hat\beta = 0$ ($\lambda_r^{(t)}$ is taken to be zero if no $\lambda$ recovered the correct support). The results were as follows:

    (p, N, n)          r      lambda_r (mean)
    (100, 50, 5)      .68     .2125
    (1000, 500, 25)    1      .4771
    (20000, 500, 25)   1      .4729

Hence we see that even when the number of observations per function is small (5 or 25) and the total number of input functional covariates is large (we were able to test up to 20000), the FuSSO can recover the correct support. Also, to illustrate the point that running Group-LASSO on the $\vec{y}_j^{(i)}$ features (Y-GL) is less robust to noise and less adaptive to smoothness, we ran noisier trials using the configuration $(p, N, n) = (1000, 500, 25)$. We increased the standard deviations of the noise on the grid function observations and on the response to 5 and 1, respectively. Under these conditions the FuSSO was able to recover the support in 49% of the trials, whereas Y-GL recovered the support in 32% of the trials. Furthermore, the FuSSO had
a $\bar\lambda_r = .0743$, compared to $\bar\lambda_r = .0254$ for Y-GL.

Figure 2: (a) Function at $n = 5$. (b) Regularization path at $p = 100$, $N = 50$, $n = 5$. (c) Function at $n = 25$. (d) Regularization path at $p = 1000$, $N = 500$, $n = 25$. Panels (a), (c) show two typical functions, noisy observations, and estimates; panels (b), (d) show regularization paths with the norms of $\hat\beta_j$ (in red for $j$ in the support, blue otherwise) for a range of $\lambda$; rightmost vertical line [...].

5.2 Neurological Data

[...] estimator gave a cross-validated MSE of 70.855, where the variance for age was 156.4265; selected voxels in the support may be seen in Figure 3(c). The LASSO estimate using QA values gave a cross-validated MSE of 77.1302. Thus, one may see that considering the entire functional data gave us better results for age regression. We note that we were unable to use the naive approach of Y-GL in this case because of memory constraints and because the function evaluation points did not lie on a 2d square grid.

Figure 3: (a) Example ODF; (b) histogram of ages; histogram of absolute errors.
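The synthetic pipeline of Section 5.1 — smooth random functions from decaying projection coefficients, noisy grid observations as in (8), coefficient estimates as in (11), and the Group-LASSO problem (15) — can be sketched end to end at toy scale. The proximal-gradient solver, problem sizes, noise levels, and $\lambda$ below are illustrative assumptions (the paper does not describe its optimizer):

```python
import numpy as np

# Toy-scale sketch: generate smooth random functions in the trigonometric
# basis, observe them noisily on a grid, re-estimate projection coefficients
# as in (11), and solve the Group-LASSO problem (15) by proximal gradient.
rng = np.random.default_rng(1)
N, n, p, s, M = 60, 50, 20, 3, 7    # samples, grid, covariates, support, basis size

t = np.arange(1, n + 1) / n
Phi = np.ones((M, n))               # trigonometric basis evaluated on the grid
for k in range(1, (M + 1) // 2 + 1):
    if 2 * k - 1 < M:  Phi[2 * k - 1] = np.sqrt(2) * np.cos(2 * np.pi * k * t)
    if 2 * k < M:      Phi[2 * k]     = np.sqrt(2) * np.sin(2 * np.pi * k * t)

def smooth_coeffs(size):
    # Steps 1)-3) of the recipe: uniform draws, damped by c_m^2, normalized.
    a = rng.uniform(-1, 1, size=size + (M,))
    c = np.array([m if (m == 1 or m % 2 == 0) else m - 1 for m in range(1, M + 1)])
    a /= c ** 2
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

alpha = smooth_coeffs((N, p))                    # true alpha_{jm}^{(i)}
beta = np.zeros((p, M))
beta[:s] = smooth_coeffs((s,))                   # only the first s groups non-zero
Y = np.einsum('ijm,jm->i', alpha, beta) + rng.normal(0, 0.1, N)

y_obs = alpha @ Phi + rng.normal(0, 0.1, (N, p, n))   # noisy grids, as in (8)
alpha_hat = y_obs @ Phi.T / n                          # estimate (11)

A = alpha_hat.reshape(N, p * M)
lam, step = 0.02, N / np.linalg.norm(A, 2) ** 2  # step = 1/Lipschitz constant
b = np.zeros(p * M)
for _ in range(2000):                            # proximal gradient on (15)
    grad = A.T @ (A @ b - Y) / N
    z = (b - step * grad).reshape(p, M)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z *= np.maximum(0, 1 - step * lam / np.maximum(norms, 1e-12))  # group soft-threshold
    b = z.ravel()

support = np.flatnonzero(np.linalg.norm(b.reshape(p, M), axis=1) > 1e-6)
print(support)                                   # ideally the first s indices
```

The group soft-thresholding step is the proximal operator of the penalty $\lambda\sum_j\|\beta_j\|$, which zeroes entire coefficient blocks at once — exactly the mechanism by which whole input functions are removed from the regression.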
[2] F. Ferraty and P. Vieu. Nonparametric functional data analysis: theory and practice. Springer, 2006.

[3] Gareth M. James, Jing Wang, and Ji Zhu. Functional linear regression that's interpretable. The Annals of Statistics, pages 2083-2108, 2009.

[4] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer, 1991.

[16] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.

[17] Yihong Zhao, R. Todd Ogden, and Philip T. Reiss. Wavelet-based LASSO in functional linear regression. Journal of Computational and Graphical Statistics, 21(3):600-617, 2012.