High-dimensional graphs and variable selection with the Lasso
Nicolai Meinshausen and Peter Bühlmann
The Annals of Statistics
2006, Vol. 34, No. 3, 1436–1462
DOI: 10.1214/009053606000000281
© Institute of Mathematical Statistics, 2006
$$X = (X_1, \dots, X_p) \sim \mathcal{N}(\mu, \Sigma).$$

This includes Gaussian linear models where, for example, $X_1$ is the response variable and $\{X_k;\ 2 \le k \le p\}$ are the predictor variables. Assuming that the covariance matrix $\Sigma$ is nonsingular, the conditional independence structure of the distribution can be conveniently represented by a graphical model $\mathcal{G} = (\Gamma, E)$, where $\Gamma = \{1, \dots, p\}$ is the set of nodes and $E$ the set of edges in $\Gamma \times \Gamma$. A pair $(a, b)$ is contained in the edge set $E$ if and only if $X_a$ is conditionally dependent on $X_b$, given all remaining variables $X_{\Gamma \setminus \{a,b\}} = \{X_k;\ k \in \Gamma \setminus \{a, b\}\}$. Every pair of variables not contained in the edge set is conditionally independent, given all remaining variables, and corresponds to a zero entry in the inverse covariance matrix [12].
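This correspondence is easy to check numerically. The following short Python sketch (our illustration; the article itself contains no code) constructs a sparse inverse covariance matrix for a chain graph and verifies that the partial correlations, $\pi_{ab} = -\Sigma^{-1}_{ab}/(\Sigma^{-1}_{aa}\Sigma^{-1}_{bb})^{1/2}$, vanish exactly for nonadjacent pairs while the corresponding marginal covariances do not.

    import numpy as np

    # Precision (inverse covariance) matrix of a chain graph 1 - 2 - 3 - 4:
    # only adjacent nodes carry a nonzero off-diagonal entry.
    K = np.array([[1.0, 0.3, 0.0, 0.0],
                  [0.3, 1.0, 0.3, 0.0],
                  [0.0, 0.3, 1.0, 0.3],
                  [0.0, 0.0, 0.3, 1.0]])
    Sigma = np.linalg.inv(K)            # covariance of X ~ N(0, Sigma)

    # Partial correlations computed from the precision matrix.
    d = np.sqrt(np.diag(K))
    partial = -K / np.outer(d, d)

    print(partial[0, 2])   # 0.0: nodes 1 and 3 are conditionally independent
    print(Sigma[0, 2])     # nonzero: they are nevertheless marginally dependent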
Covariance selection was introduced by Dempster [3] and aims at discovering the conditional independence restrictions (the graph) from a set of i.i.d. observations. Covariance selection traditionally relies on the discrete optimization of an objective function [5, 12]. Exhaustive search is computationally infeasible for all but very low-dimensional models.
$$X_a \perp \{X_k;\ k \in \Gamma(n) \setminus \mathrm{cl}_a\} \mid X_{\mathrm{ne}_a}.$$
For details see [12]. The neighborhood depends in general on n as well. However,
this dependence is notationally suppressed in the following.
It is instructive to give a slightly different definition of a neighborhood. For each node $a \in \Gamma(n)$, consider optimal prediction of $X_a$, given all remaining variables. Let $\theta^a \in \mathbb{R}^{p(n)}$ be the vector of coefficients for optimal prediction,

(1) $\theta^a = \operatorname*{arg\,min}_{\theta:\ \theta_a = 0} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2,$

and, for a set $A \subseteq \Gamma(n)$, let $\theta^{a,A}$ be the optimal coefficients when only variables in $A$ may enter the prediction,

(2) $\theta^{a,A} = \operatorname*{arg\,min}_{\theta:\ \theta_k = 0\ \forall k \notin A} E\Big(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Big)^2.$

The set of neighbors of a node $a \in \Gamma(n)$ is then given by the nonzero coefficients of $\theta^a$,

$$\mathrm{ne}_a = \{b \in \Gamma(n) : \theta^a_b \neq 0\}.$$
This set corresponds to the set of effective predictor variables in regression with response variable $X_a$ and predictor variables $\{X_k;\ k \in \Gamma(n) \setminus \{a\}\}$. Given $n$ independent observations of $X \sim \mathcal{N}(0, \Sigma(n))$, neighborhood selection tries to estimate the set of neighbors of a node $a \in \Gamma(n)$. As the optimal linear prediction of $X_a$ has nonzero coefficients precisely for variables in the set of neighbors of the node $a$, it seems reasonable to try to exploit this relation.
2.1. Neighborhood selection with the Lasso. It is well known that the Lasso, introduced by Tibshirani [16], and known as Basis Pursuit in the context of wavelet regression [2], has a parsimonious property [11]. When predicting a variable $X_a$ with all remaining variables $\{X_k;\ k \in \Gamma(n) \setminus \{a\}\}$, the vanishing Lasso coefficient estimates identify asymptotically the neighborhood of node $a$ in the graph, as shown in the following. Let the $n \times p(n)$-dimensional matrix $\mathbf{X}$ contain $n$ independent observations of $X$, so that the columns $\mathbf{X}_a$ correspond for all $a \in \Gamma(n)$ to the vector of $n$ independent observations of $X_a$. Let $\langle \cdot, \cdot \rangle$ be the usual inner product on $\mathbb{R}^n$ and $\|\cdot\|_2$ the corresponding norm.
The Lasso estimate $\hat\theta^{a,\lambda}$ of $\theta^a$ is given by

(3) $\hat\theta^{a,\lambda} = \operatorname*{arg\,min}_{\theta:\ \theta_a = 0} \big(n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1\big),$

where $\|\theta\|_1 = \sum_{b \in \Gamma(n)} |\theta_b|$ is the $\ell_1$-norm of the coefficient vector. Normalization of all variables to a common empirical variance is recommended for the estimator in (3). The solution to (3) is not necessarily unique. However, if uniqueness fails, the set of solutions is still convex and all our results about neighborhoods (in particular Theorems 1 and 2) hold for any solution of (3).
Other regression estimates have been proposed, which are based on the $\ell_p$-norm, where $p$ is typically in the range $[0, 2]$ (see [7]). A value of $p = 2$ leads to the ridge estimate, while $p = 0$ corresponds to traditional model selection. It is well known that the estimates have a parsimonious property (with some components being exactly zero) for $p \le 1$ only, while the optimization problem in (3) is only convex for $p \ge 1$. Hence $\ell_1$-constrained empirical risk minimization occupies a unique position, as $p = 1$ is the only value of $p$ for which variable selection takes place while the optimization problem is still convex and hence feasible for high-dimensional problems.
The neighborhood estimate (parameterized by $\lambda$) is defined by the nonzero coefficient estimates of the $\ell_1$-penalized regression,

$$\widehat{\mathrm{ne}}{}^{\lambda}_a = \{b \in \Gamma(n) : \hat\theta^{a,\lambda}_b \neq 0\}.$$

Each choice of a penalty parameter $\lambda$ thus specifies an estimate of the neighborhood $\mathrm{ne}_a$ of node $a \in \Gamma(n)$, and one is left with the choice of a suitable penalty parameter. Larger values of the penalty tend to shrink the size of the estimated set, while more variables are in general included into $\widehat{\mathrm{ne}}{}^{\lambda}_a$ if the value of $\lambda$ is diminished.
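For illustration, the whole procedure fits in a few lines of code. The sketch below is our own minimal reconstruction, not the authors' implementation; it uses scikit-learn's `Lasso`, whose objective $(2n)^{-1}\|y - \mathbf{X}\theta\|_2^2 + \alpha\|\theta\|_1$ coincides with (3) up to a factor 2 when $\alpha = \lambda/2$. The edge estimates (7) and (8) of the paper combine the per-node neighborhoods with an AND and an OR rule (which number corresponds to which rule is not recoverable from the fragment above, so the sketch simply returns both).

    import numpy as np
    from sklearn.linear_model import Lasso

    def neighborhood_selection(X, lam):
        """One Lasso regression per node; the nonzero coefficients
        estimate the neighborhood ne_a, as in (3)."""
        n, p = X.shape
        X = (X - X.mean(axis=0)) / X.std(axis=0)   # common empirical variance
        ne = []
        for a in range(p):
            others = [k for k in range(p) if k != a]
            # alpha = lam / 2 matches the normalization of the objective in (3)
            fit = Lasso(alpha=lam / 2, fit_intercept=False).fit(X[:, others], X[:, a])
            ne.append({others[j] for j, c in enumerate(fit.coef_) if c != 0.0})
        # "and" rule: both neighborhood estimates must contain the edge;
        # "or" rule: one of the two suffices.
        e_and = {(a, b) for a in range(p) for b in ne[a] if a < b and a in ne[b]}
        e_or = {(min(a, b), max(a, b)) for a in range(p) for b in ne[a]}
        return e_and, e_or

Since the $p$ regressions do not depend on each other, they parallelize trivially, which is what makes the method practical for the thousand-node graphs considered below.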
$$\operatorname{Var}(X_a \mid X_{\Gamma(n) \setminus \{a\}}) \ge v^2.$$
Sparsity. The main assumption is the sparsity of the graph. This entails a restriction on the size of the neighborhoods of variables.

Assumption 3. There exists some $0 \le \kappa < 1$ so that $\max_{a \in \Gamma(n)} |\mathrm{ne}_a| = O(n^{\kappa})$ for $n \to \infty$.

This assumption limits the maximal possible rate of growth for the size of neighborhoods.
For the next sparsity condition, consider again the definition in (2) of the optimal coefficient $\theta^{b,A}$ for prediction of $X_b$, given variables in the set $A \subseteq \Gamma(n)$.

Assumption 4. There exists some $\vartheta < \infty$ so that for all neighboring nodes $a, b \in \Gamma(n)$ and all $n \in \mathbb{N}$,

$$\|\theta^{a,\, \mathrm{ne}_b \setminus \{a\}}\|_1 \le \vartheta.$$
This assumption is, for example, satisfied if Assumption 2 holds and the size of the overlap of neighborhoods is bounded by an arbitrarily large number from above for neighboring nodes. That is, if there exists some $m < \infty$ so that for all $n \in \mathbb{N}$,

(4) $\max_{a, b \in \Gamma(n),\ b \in \mathrm{ne}_a} |\mathrm{ne}_a \cap \mathrm{ne}_b| \le m \qquad \text{for } n \to \infty,$

then Assumption 4 is satisfied. To see this, note that Assumption 2 gives a finite bound for the $\ell_2$-norm of $\theta^{a,\, \mathrm{ne}_b \setminus \{a\}}$, while (4) gives a finite bound for the $\ell_0$-norm. Taken together, using $\|\theta\|_1 \le \|\theta\|_0^{1/2}\, \|\theta\|_2$, Assumption 4 is implied.
Assumption 6. There exists some $\delta < 1$ so that for all $a, b \in \Gamma(n)$ with $b \notin \mathrm{ne}_a$,

$$|S_a(b)| < \delta,$$

where $S_a(b) = \sum_{k \in \mathrm{ne}_a} \operatorname{sign}(\theta^a_k)\, \theta^{b,\mathrm{ne}_a}_k$.
The neighborhood $\mathrm{ne}_a$ of node $a$ was defined as the set of nonzero coefficients of $\theta^a$, $\mathrm{ne}_a = \{k \in \Gamma(n) : \theta^a_k \neq 0\}$. Define the disturbed neighborhood $\mathrm{ne}_a(\eta)$ as

$$\mathrm{ne}_a(\eta) := \{k \in \Gamma(n) : \theta^a_k(\eta) \neq 0\}.$$

It clearly holds that $\mathrm{ne}_a = \mathrm{ne}_a(0)$. The assumption of neighborhood stability is satisfied if there exists some infinitesimally small perturbation $\eta$, which may depend on $n$, so that the disturbed neighborhood $\mathrm{ne}_a(\eta)$ is identical to the undisturbed neighborhood $\mathrm{ne}_a(0)$.
THEOREM 1. Let Assumptions 1–6 hold. Let the penalty parameter satisfy $\lambda_n \sim d n^{-(1-\varepsilon)/2}$ with some $\kappa < \varepsilon < \xi$ and $d > 0$. There exists some $c > 0$ so that, for all $a \in \Gamma(n)$,

$$P(\widehat{\mathrm{ne}}{}^{\lambda}_a \subseteq \mathrm{ne}_a) = 1 - O(\exp(-c n^{\varepsilon})).$$
2.5. Controlling type II errors. So far it has been shown that the probability of falsely including variables into the neighborhood can be controlled by the Lasso. The question arises whether the probability of including all neighboring variables into the neighborhood estimate converges to 1 for $n \to \infty$.
Theorem 2 and Proposition 4 say that edges between nodes for which the partial correlation vanishes at a rate $n^{-(1-\xi)/2}$ are, with probability converging to 1 for $n \to \infty$, detected if $\xi > \varepsilon$ and are undetected if $\xi < \varepsilon$. The results do not cover the case $\xi = \varepsilon$, which remains a challenging question for further research.

All results so far have treated the distinction between zero and nonzero partial correlations only. The signs of partial correlations of neighboring nodes can be estimated consistently under the same assumptions and with the same rates, as can be seen in the proofs.
(9) $\lambda(\alpha) = \frac{2\hat\sigma_a}{\sqrt{n}}\, \tilde\Phi^{-1}\!\Big(\frac{\alpha}{2p(n)^2}\Big),$

where $\tilde\Phi = 1 - \Phi$ [$\Phi$ is the c.d.f. of $\mathcal{N}(0, 1)$] and $\hat\sigma_a^2 = n^{-1}\langle \mathbf{X}_a, \mathbf{X}_a \rangle$. The probability of falsely joining two distinct connectivity components with the estimate of the edge set is bounded by the level $\alpha$ under the choice $\lambda = \lambda(\alpha)$ of the penalty parameter, as shown in the following theorem.
$$P(\exists a \in \Gamma(n) : \hat C^{\lambda}_a \not\subseteq C_a) \le \alpha.$$

A proof is given in the Appendix. This implies that if the edge set is empty ($E = \varnothing$), it is estimated by an empty set with high probability,

$$P(\hat E^{\lambda} = \varnothing) \ge 1 - \alpha.$$
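Computationally, the choice (9) is a one-line function of the data. In the sketch below (ours, not the authors' code), scipy's `norm.isf`, the inverse survival function of the standard normal, plays the role of $\tilde\Phi^{-1}$.

    import numpy as np
    from scipy.stats import norm

    def lambda_alpha(X_a, p, alpha=0.05):
        """Penalty choice (9); X_a is the length-n data column of node a."""
        n = len(X_a)
        sigma_a = np.sqrt(np.mean(X_a ** 2))   # sigma_a^2 = n^{-1} <X_a, X_a>
        return 2 * sigma_a / np.sqrt(n) * norm.isf(alpha / (2 * p ** 2))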
Table 1
The average number of correctly identified edges as a function of the number k of falsely included edges, for n = 40 observations and p = 10, 20, 30 nodes, for forward selection MLE (FS), $\hat E^{\lambda,\vee}$, $\hat E^{\lambda,\wedge}$ and random guessing

                     p = 10            p = 20            p = 30
    k              0    5    10      0    5    10      0    5    10
    Random        0.2  1.9   3.7    0.1  0.7   1.4    0.1  0.5   0.9
    FS            7.6 14.1  17.1    8.9 16.6  21.6    0.6  1.8   3.2
    E^{λ,∨}       8.2 15.0  17.6    9.3 18.5  23.9   11.4 21.4  26.3
    E^{λ,∧}       8.5 14.7  17.6    9.5 19.1  34.0   14.1 21.4  27.4
pair of variables and $\varphi$ is the density of the standard normal distribution. The maximum number of edges connecting to each node is limited to four to achieve the desired sparsity of the graph. Edges which connect to nodes which do not satisfy this constraint are removed randomly until the constraint is satisfied for all edges. Initially all variables have identical conditional variance and the partial correlation between neighbors is set to 0.245 (absolute values less than 0.25 guarantee positive definiteness of the inverse covariance matrix); that is, $\Sigma^{-1}_{aa} = 1$ for all nodes $a \in \Gamma$, $\Sigma^{-1}_{ab} = 0.245$ if there is an edge connecting $a$ and $b$, and $\Sigma^{-1}_{ab} = 0$ otherwise. The diagonal elements of the corresponding covariance matrix are in general larger than 1. To achieve constant variance, all variables are finally rescaled so that the diagonal elements of $\Sigma$ are all unity. Using the Cholesky transformation of the covariance matrix, $n$ independent samples are drawn from the corresponding Gaussian distribution.
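This sampling scheme is compactly reproduced by the following sketch (ours; the random placement of nodes that generates the edge set falls on a page not reproduced above, so the `edges` list is taken as given, with the bound of four edges per node assumed to hold).

    import numpy as np

    def sample_gaussian_graph(p, n, edges, rho=0.245, seed=0):
        """Draw n samples from N(0, Sigma) whose precision matrix has
        unit diagonal and entries rho on the given edges (at most 4 per
        node, so |rho| < 0.25 keeps the matrix positive definite)."""
        K = np.eye(p)
        for a, b in edges:
            K[a, b] = K[b, a] = rho
        Sigma = np.linalg.inv(K)
        d = np.sqrt(np.diag(Sigma))
        Sigma = Sigma / np.outer(d, d)      # rescale to unit variances
        L = np.linalg.cholesky(Sigma)       # Cholesky transformation
        rng = np.random.default_rng(seed)
        return rng.standard_normal((n, p)) @ L.T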
The average number of edges which are correctly included into the estimate
of the edge set is shown in Table 1 as a function of the number of edges which
are falsely included. The accuracy of the forward selection MLE is comparable
to the proposed Lasso neighborhood selection if the number of nodes is much
smaller than the number of observations. The accuracy of the forward selection
MLE breaks down, however, if the number of nodes is approximately equal to the
number of observations. Forward selection MLE is only marginally better than
random guessing in this case. Computation of the forward selection MLE (using
MIM, [5]) on the same desktop took up to several hundred times longer than the
Lasso neighborhood selection for the full graph. For more than 30 nodes, the differences are even more pronounced.
The Lasso neighborhood selection can be applied to hundred- or thousand-dimensional graphs, a realistic size for biological networks, for example. A graph
with 1000 nodes (following the same model as described above) and its estimates (7) and (8), using 600 observations, are shown in Figure 1. A level $\alpha = 0.05$ is used for the estimate $\hat E^{\lambda,\vee}$. For better comparison, the level $\alpha$ was adjusted to $\alpha = 0.064$ for the estimate $\hat E^{\lambda,\wedge}$, so that both estimates lead to the same number of included edges. There are two erroneous edge inclusions, while 1109 out of the 1747 edges have been correctly identified by either estimate. Of these 1109 edges, 907 are common to both estimates, while 202 are present in just one of (7) or (8).
To examine if results are critically dependent on the assumption of Gaussianity, long-tailed noise is added to the observations. Instead of $n$ i.i.d. observations of $X \sim \mathcal{N}(0, \Sigma)$, $n$ i.i.d. observations of $X + 0.1 Z$ are made, where the components of $Z$ are independent and follow a $t$-distribution. For 10 simulations (with 500 observations each), the proportion of false rejections among all rejections increases only slightly from 0.8% (without long-tailed noise) to 1.4% (with long-tailed noise) for $\hat E^{\lambda,\vee}$ and from 4.8% to 5.2% for $\hat E^{\lambda,\wedge}$. Our limited numerical experience suggests that the properties of the graph estimator do not seem to be critically affected by deviations from Gaussianity.
APPENDIX: PROOFS
A.1. Notation and useful lemmas. As a generalization of (3), the Lasso estimate $\hat\theta^{a,A,\lambda}$ of $\theta^{a,A}$, defined in (2), is given by

(A.1) $\hat\theta^{a,A,\lambda} = \operatorname*{arg\,min}_{\theta:\ \theta_k = 0\ \forall k \notin A} \big(n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1\big).$

LEMMA A.1. Given $\theta \in \mathbb{R}^{p(n)}$, let $G(\theta)$ be the vector with components

$$G_b(\theta) = -2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\theta, \mathbf{X}_b \rangle.$$

A vector $\hat\theta$ with $\hat\theta_k = 0$, $\forall k \in \Gamma(n) \setminus A$, is a solution to (A.1) iff for all $b \in A$, $G_b(\hat\theta) = -\operatorname{sign}(\hat\theta_b)\lambda$ in case $\hat\theta_b \neq 0$ and $|G_b(\hat\theta)| \le \lambda$ in case $\hat\theta_b = 0$. Moreover, if the solution is not unique and $|G_b(\hat\theta)| < \lambda$ for some solution $\hat\theta$, then $\hat\theta_b = 0$ for all solutions of (A.1).
PROOF. Denote the subdifferential of

$$n^{-1}\|\mathbf{X}_a - \mathbf{X}\theta\|_2^2 + \lambda\|\theta\|_1$$

with respect to $\theta$ by $D(\theta)$. The vector $\hat\theta$ is a solution to (A.1) iff there exists an element $d \in D(\hat\theta)$ so that $d_b = 0$, $\forall b \in A$. $D(\theta)$ is given by $\{G(\theta) + \lambda e,\ e \in S\}$, where $S \subseteq \mathbb{R}^{p(n)}$ is given by $S := \{e \in \mathbb{R}^{p(n)} : e_b = \operatorname{sign}(\theta_b) \text{ if } \theta_b \neq 0 \text{ and } e_b \in [-1, 1] \text{ if } \theta_b = 0\}$. The first part of the claim follows. The second part follows from the proof of Theorem 3.1 in [13].
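Lemma A.1 can be checked directly on a numerical example. In the sketch below (our illustration), a Lasso solution is computed with scikit-learn, where $\alpha = \lambda/2$ again matches the criterion in (A.1), and the two gradient conditions are verified up to solver tolerance.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p, lam = 500, 10, 0.3
    X = rng.standard_normal((n, p))
    y = X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(n)

    theta = Lasso(alpha=lam / 2, fit_intercept=False, tol=1e-8).fit(X, y).coef_
    G = -2.0 / n * X.T @ (y - X @ theta)    # G_b(theta) of Lemma A.1

    active = theta != 0.0
    # Active coordinates: G_b = -sign(theta_b) * lambda (up to tolerance).
    assert np.allclose(G[active], -np.sign(theta[active]) * lam, atol=1e-4)
    # Inactive coordinates: |G_b| <= lambda.
    assert np.all(np.abs(G[~active]) <= lam + 1e-4)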
LEMMA A.2. Let $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ be defined for every $a \in \Gamma(n)$ as in (A.1). Under the assumptions of Theorem 1, for some $c > 0$, for all $a \in \Gamma(n)$,

PROOF. Using Bonferroni's inequality, and $|\mathrm{ne}_a| = o(n)$ for $n \to \infty$, it suffices to show that there exists some $c > 0$ so that for every $a, b \in \Gamma(n)$ with $b \in \mathrm{ne}_a$,

where

(A.4) $P\Big(\sup_{\beta < 0}\big\{G_b\big(\hat\theta^{a,b,\lambda}(\beta)\big)\big\} < -\lambda\Big) = 1 - O(\exp(-c n^{\varepsilon})).$
(A.5) $R^{\lambda}(\beta) := \mathbf{X}_a - \mathbf{X}\hat\theta^{a,b,\lambda}(\beta).$
We can write $\mathbf{X}_b$ as

(A.9) $\mathbf{W}_b = \mathbf{W}_b^{\parallel} + \mathbf{W}_b^{\perp},$

where $\mathbf{W}_b^{\parallel}$ is contained in the space $\mathcal{W}^{\parallel}$, while the remaining part $\mathbf{W}_b^{\perp}$ is chosen orthogonal to this space (in the orthogonal complement $\mathcal{W}^{\perp}$ of $\mathcal{W}^{\parallel}$). The inner product in (A.8) can be written as

(A.10) $2n^{-1}\langle R^{\lambda}(\beta), \mathbf{W}_b \rangle = 2n^{-1}\langle R^{\lambda}(\beta), \mathbf{W}_b^{\parallel} \rangle + 2n^{-1}\langle R^{\lambda}(\beta), \mathbf{W}_b^{\perp} \rangle.$
By Lemma A.3 (see below), there exists for every $g > 0$ some $c > 0$ so that, for $n \to \infty$,
Note that $V_a$ and $W_b$ are independent normally distributed random variables with variances $\sigma_a^2$ and $\sigma_b^2$, respectively. By Assumption 2, $0 < v^2 \le \sigma_b^2, \sigma_a^2 \le 1$. Note furthermore that $W_b$ and $\mathbf{X}_{\mathrm{ne}_a \setminus \{b\}}$ are independent. Using $\theta^a = \theta^{a,\mathrm{ne}_a}$ and (A.6),
Using (A.12), the definition of the residuals in (A.5) and the orthogonality property of $\mathbf{W}_b^{\perp}$,

$$2n^{-1}\langle R^{\lambda}(\beta), \mathbf{W}_b^{\perp} \rangle = 2n^{-1}(\theta^a_b - \beta)\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp} \rangle + 2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp} \rangle$$
$$\ge 2n^{-1}(\theta^a_b - \beta)\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp} \rangle - |2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp} \rangle|.$$
The second term, $|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp} \rangle|$, is stochastically smaller than $|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b \rangle|$ (this can be derived by conditioning on $\{\mathbf{X}_k;\ k \in \mathrm{ne}_a\}$). Due to independence of $V_a$ and $W_b$, $E(V_a W_b) = 0$. Using Bernstein's inequality (Lemma 2.2.11 in [17]), and $\lambda \sim d n^{-(1-\varepsilon)/2}$ with $\varepsilon > 0$, there exists for every $g > 0$ some $c > 0$ so that

(A.14) $P\big(|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b^{\perp} \rangle| > g\lambda\big) \le P\big(|2n^{-1}\langle \mathbf{V}_a, \mathbf{W}_b \rangle| > g\lambda\big) = O(\exp(-c n^{\varepsilon})).$
Instead of (A.11), it is sufficient by (A.13) and (A.14) to show that there exists for every $g > 0$ a $c > 0$ so that, for $n \to \infty$,

(A.15) $P\Big(\inf_{\beta < 0}\big\{2n^{-1}(\theta^a_b - \beta)\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp} \rangle - g(1 + |\beta|)\lambda\big\} > 2g\lambda\Big) = 1 - O(\exp(-c n^{\varepsilon})).$
Note that $\sigma_b^{-2}\langle \mathbf{W}_b^{\perp}, \mathbf{W}_b^{\perp} \rangle$ follows a $\chi^2_{n - |\mathrm{ne}_a|}$ distribution. As $|\mathrm{ne}_a| = o(n)$ and $\sigma_b^2 \ge v^2$ (by Assumption 2), it follows that there exists some $k > 0$ so that, for $n > n_0$ with some $n_0(k) \in \mathbb{N}$, and any $c > 0$,

Using the definition of the partial correlation,

$$\pi_{ab} = \operatorname{Cov}(X_a, X_b \mid X_{\Gamma(n)\setminus\{a,b\}}) \big/ \big(\operatorname{Var}(X_a \mid X_{\Gamma(n)\setminus\{a,b\}})\, \operatorname{Var}(X_b \mid X_{\Gamma(n)\setminus\{a,b\}})\big)^{1/2},$$
and Assumption 2, this implies that there exists some $q > 0$ so that $\theta^a_b \ge q n^{-(1-\xi)/2}$. As $\lambda \sim d n^{-(1-\varepsilon)/2}$ and, by the assumptions in Theorem 1, $\xi > \varepsilon$, it follows that for every $k, \ell > 0$ and large enough values of $n$,

$$\theta^a_b k - \ell\lambda > 0.$$

It remains to show that for any $k, \ell > 0$ there exists some $n_0(\ell, k)$ so that for all $n > n_0$,

$$\inf_{\beta < 0}\{-\beta k - \ell|\beta|\lambda\} > 0.$$

This follows as $\lambda \to 0$ for $n \to \infty$, which completes the proof.
LEMMA A.3. Assume the conditions of Theorem 1 hold. Let $R^{\lambda}(\beta)$ be defined as in (A.5) and $\mathbf{W}_b^{\perp}$ as in (A.9). For any $g > 0$ there exists $c > 0$ so that for all $a, b \in \Gamma(n)$, for $n \to \infty$,
$$\|R^{\lambda}(\beta)\|_2^2 = \|\mathbf{X}_a - \beta\mathbf{X}_b\|_2^2,$$

and hence

$$P\bigg(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|R^{\lambda}(\beta)\|_2}{1 + |\beta|} > q\bigg) \le P\big(n^{-1/2}\max\{\|\mathbf{X}_a\|_2, \|\mathbf{X}_b\|_2\} > q\big).$$
Note that both $\|\mathbf{X}_a\|_2^2$ and $\|\mathbf{X}_b\|_2^2$ are $\chi^2$-distributed. Thus there exist $q > 1$ and $c > 0$ so that

(A.18) $P\bigg(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|R^{\lambda}(\beta)\|_2}{1 + |\beta|} > q\bigg) = O(\exp(-cn)) \qquad \text{for } n \to \infty.$
It remains to show that for every g > 0 there exists some c > 0 so that
Let $U_2, U_3, \dots, U_{p(n)}$ be the random variables defined by $U_k = \langle \mathbf{W}_1, \mathbf{X}_k \rangle$. Note that the random variables $U_k$, $k = 2, \dots, p(n)$, are exchangeable. Let furthermore

$$D = \langle \mathbf{X}_2, \mathbf{X}_2 \rangle - \max_{k \in \Gamma(n),\ k > 2} |\langle \mathbf{X}_2, \mathbf{X}_k \rangle|.$$
$$P(D < 0) \to 0 \qquad \text{for } n \to \infty.$$

Furthermore, as $U_2, \dots, U_{p(n)}$ are exchangeable,
Hence, for sufficiently small $\delta\lambda > 0$, a Lasso solution for the penalty $\lambda - \delta\lambda$ is given by

$$(0,\ \hat w + \delta\theta_2,\ \delta\theta_3,\ 0, \dots).$$
Let $H_n$ be the empirical covariance matrix of $(X_2, X_3)$. Assume w.l.o.g. that $n^{-1}\langle \mathbf{X}_1 - r_{\max}\mathbf{X}_2, \mathbf{X}_3 \rangle > 0$ and $n^{-1}\langle \mathbf{X}_2, \mathbf{X}_2 \rangle = n^{-1}\langle \mathbf{X}_3, \mathbf{X}_3 \rangle = 1$. Following, for example, Efron et al. ([6], page 417), the components $(\delta\theta_2, \delta\theta_3)$ are then given by $H_n^{-1}(1, 1)^T$, from which it follows that $\delta\theta_2 = \delta\theta_3$, which we abbreviate by $\delta\theta$ in the following (one can accommodate a negative sign for $n^{-1}\langle \mathbf{X}_1 - r_{\max}\mathbf{X}_2, \mathbf{X}_3 \rangle$ by reversing the sign of $\delta\theta_3$). Denote by $L_{\delta}$ the squared error loss for this solution. Then, for sufficiently small $\delta\theta$,
$$E\Big(X_a - \sum_{m \in \Gamma(n)} \theta^a_m X_m\Big)^2 + \eta\|\theta^a\|_1,$$

with respect to $\theta^a_k$, $k \in \Gamma(n) \setminus \{a\}$, is given by

$$-2E\Big[\Big(X_a - \sum_{m \in \Gamma(n)} \theta^a_m X_m\Big)X_k\Big] + \eta e_k,$$

where $e_k \in [-1, 1]$ if $\theta^a_k = 0$, and $e_k = \operatorname{sign}(\theta^a_k)$ if $\theta^a_k \neq 0$. Using the fact that $\mathrm{ne}_a(\eta) = \mathrm{ne}_a$, it follows as in Lemma A.1 that for all $k \in \mathrm{ne}_a$,

(A.24) $\Big|2E\Big[\Big(X_a - \sum_{m \in \Gamma(n)} \theta^a_m(\eta) X_m\Big)X_k\Big]\Big| \le \eta.$
A variable $X_b$ with $b \notin \mathrm{ne}_a$ can be written as

$$\sum_{k \in \mathrm{ne}_a}\Big|2E\Big[\Big(X_a - \sum_{m \in \Gamma(n)} \theta^a_m(\eta) X_m\Big)X_k\Big]\Big| \le \eta.$$
Using (A.23) and $\theta^a = \theta^{a,\mathrm{ne}_a}$, it follows that

$$\Big|\sum_{k \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_k \operatorname{sign}(\theta^{a,\mathrm{ne}_a}_k)\Big| \le 1,$$
PROOF OF THEOREM 1. The event $\widehat{\mathrm{ne}}{}^{\lambda}_a \not\subseteq \mathrm{ne}_a$ is equivalent to the event that there exists some node $b \in \Gamma(n) \setminus \mathrm{cl}_a$ in the set of nonneighbors of node $a$ such that the estimated coefficient $\hat\theta^{a,\lambda}_b$ is not zero. Thus

Conditional on the event $\mathcal{E}$, it follows from the first part of Lemma A.1 that $\hat\theta^{a,\mathrm{ne}_a,\lambda}$ is not only a solution of (A.1), with $A = \mathrm{ne}_a$, but as well a solution of (3), where $A = \Gamma(n) \setminus \{a\}$. As $\hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0$ for all $b \in \Gamma(n) \setminus \mathrm{cl}_a$, it follows from the second part of Lemma A.1 that $\hat\theta^{a,\lambda}_b = 0$, $\forall b \in \Gamma(n) \setminus \mathrm{cl}_a$. Hence

(A.26) $G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{X}_b \rangle.$
$$2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_m \langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{X}_m \rangle = \Big(\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_m)\, \theta^{b,\mathrm{ne}_a}_m\Big)\lambda.$$

(A.30) $G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -\Big(\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_m)\, \theta^{b,\mathrm{ne}_a}_m\Big)\lambda - 2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{W}_b \rangle.$
Using Assumption 6 and Proposition 2, there exists some $\delta < 1$ so that

$$\Big|\sum_{m \in \mathrm{ne}_a} \operatorname{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_m)\, \theta^{b,\mathrm{ne}_a}_m\Big| < \delta.$$
The absolute value of the coefficient $G_b$ of the gradient in (A.26) is hence bounded, with probability $1 - O(\exp(-c n^{\varepsilon}))$, by

(A.31) $|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \le \delta\lambda + |2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{W}_b \rangle|.$

Conditional on $\mathbf{X}_{\mathrm{cl}_a} = \{\mathbf{X}_k;\ k \in \mathrm{cl}_a\}$, the random variable

$$\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{W}_b \rangle$$

is normally distributed with mean zero and variance $\sigma_b^2 \|\mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2^2$. On the one hand, $\sigma_b^2 \le 1$. On the other hand, by definition of $\hat\theta^{a,\mathrm{ne}_a,\lambda}$,

$$\|\mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2 \le \|\mathbf{X}_a\|_2.$$

Thus

$$|2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,\mathrm{ne}_a,\lambda}, \mathbf{W}_b \rangle|$$

is stochastically smaller than or equal to $|2n^{-1}\langle \mathbf{X}_a, \mathbf{W}_b \rangle|$. Using (A.31), it remains to be shown that for some $c > 0$ and $\delta < 1$,
which together with $\delta < 1$ in (A.33) shows that (A.32) holds. This completes the proof.
It thus suffices to show that for all $b \in \mathrm{ne}_a$ with $|\pi_{ab}| = O(n^{-(1-\xi)/2})$ and $\xi < \varepsilon$,
$$P(\exists a \in \Gamma(n) : \hat C^{\lambda}_a \not\subseteq C_a) \le p(n) \max_{a \in \Gamma(n)} P(\exists b \in \Gamma(n) \setminus C_a : b \in \widehat{\mathrm{ne}}{}^{\lambda}_a).$$
$$P(\exists b \in \Gamma(n) \setminus C_a : b \in \widehat{\mathrm{ne}}{}^{\lambda}_a) \le P\Big(\max_{b \in \Gamma(n) \setminus C_a} |G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda\Big),$$

where $\hat\theta^{a,C_a,\lambda}$, according to (A.1), has nonzero components only for variables in the connectivity component $C_a$ of node $a$. Hence it is sufficient to show that

$$P\Big(\max_{b \in \Gamma(n) \setminus C_a} |G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda(\alpha)\Big) \le \alpha/p(n).$$

The gradient is given by $G_b(\hat\theta^{a,C_a,\lambda}) = -2n^{-1}\langle \mathbf{X}_a - \mathbf{X}\hat\theta^{a,C_a,\lambda}, \mathbf{X}_b \rangle$. For all $k \in C_a$, the variables $X_b$ and $X_k$ are independent, as they are in different connectivity components. Hence, conditional on $\mathbf{X}_{C_a} = \{\mathbf{X}_k;\ k \in C_a\}$,

$$G_b(\hat\theta^{a,C_a,\lambda}) \sim \mathcal{N}(0, R^2/n),$$

where $R^2 = 4n^{-1}\|\mathbf{X}_a - \mathbf{X}\hat\theta^{a,C_a,\lambda}\|_2^2$, which is smaller than or equal to $4n^{-1}\|\mathbf{X}_a\|_2^2 = 4\hat\sigma_a^2$ by definition of $\hat\theta^{a,C_a,\lambda}$. Hence for all $a \in \Gamma(n)$ and $b \in \Gamma(n) \setminus C_a$,
REFERENCES
[1] Buhl, S. (1993). On the existence of maximum-likelihood estimators for graphical Gaussian models. Scand. J. Statist. 20 263–270. MR1241392
[2] Chen, S., Donoho, D. and Saunders, M. (2001). Atomic decomposition by basis pursuit. SIAM Rev. 43 129–159. MR1854649
[3] Dempster, A. (1972). Covariance selection. Biometrics 28 157–175.
[4] Drton, M. and Perlman, M. (2004). Model selection for Gaussian concentration graphs. Biometrika 91 591–602. MR2090624
[5] Edwards, D. (2000). Introduction to Graphical Modelling, 2nd ed. Springer, New York. MR1880319
[6] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499. MR2060166
[7] Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109–148.
[8] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of over-parametrization. Bernoulli 10 971–988. MR2108039
[9] Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R. and Kadie, C. (2000). Dependency networks for inference, collaborative filtering and data visualization. J. Machine Learning Research 1 49–75.
[10] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681–712. MR1792783
[11] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356–1378. MR1805787
[12] Lauritzen, S. (1996). Graphical Models. Oxford Univ. Press.
[13] Osborne, M., Presnell, B. and Turlach, B. (2000). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319–337.
[16] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
[17] van der Vaart, A. W. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York.