
High-Dimensional Graphs and Variable Selection with the Lasso

Author(s): Nicolai Meinshausen and Peter Bühlmann


Source: The Annals of Statistics, Vol. 34, No. 3 (Jun., 2006), pp. 1436-1462
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/25463463
Accessed: 16-08-2017 12:51 UTC

The Annals of Statistics
2006, Vol. 34, No. 3, 1436-1462
DOI: 10.1214/009053606000000281
© Institute of Mathematical Statistics, 2006

HIGH-DIMENSIONAL GRAPHS AND VARIABLE SELECTION WITH THE LASSO

By Nicolai Meinshausen and Peter Bühlmann

ETH Zürich
The pattern of zero entries in the inverse covariance matrix of a multivariate normal distribution corresponds to conditional independence restrictions between variables. Covariance selection aims at estimating those structural zeros from data. We show that neighborhood selection with the Lasso is a computationally attractive alternative to standard covariance selection for sparse high-dimensional graphs. Neighborhood selection estimates the conditional independence restrictions separately for each node in the graph and is hence equivalent to variable selection for Gaussian linear models. We show that the proposed neighborhood selection scheme is consistent for sparse high-dimensional graphs. Consistency hinges on the choice of the penalty parameter. The oracle value for optimal prediction does not lead to a consistent neighborhood estimate. Controlling instead the probability of falsely joining some distinct connectivity components of the graph, consistent estimation for sparse graphs is achieved (with exponential rates), even when the number of variables grows as the number of observations raised to an arbitrary power.

1. Introduction. Consider the p-dimensional multivariate normal distributed random variable

    X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, \Sigma).

This includes Gaussian linear models where, for example, X_1 is the response variable and {X_k; 2 ≤ k ≤ p} are the predictor variables. Assuming that the covariance matrix Σ is nonsingular, the conditional independence structure of the distribution can be conveniently represented by a graphical model G = (Γ, E), where Γ = {1, ..., p} is the set of nodes and E the set of edges in Γ × Γ. A pair (a, b) is contained in the edge set E if and only if X_a is conditionally dependent on X_b, given all remaining variables X_{Γ\{a,b}} = {X_k; k ∈ Γ \ {a, b}}. Every pair of variables not contained in the edge set is conditionally independent, given all remaining variables, and corresponds to a zero entry in the inverse covariance matrix [12].
Received May 2004; revised August 2005.
AMS 2000 subject classifications. Primary 62J07; secondary 62H20, 62F12.
Key words and phrases. Linear regression, covariance selection, Gaussian graphical models, penalized regression.

Covariance selection was introduced by Dempster [3] and aims at discovering the conditional independence restrictions (the graph) from a set of i.i.d. observations. Covariance selection traditionally relies on the discrete optimization of an objective function [5, 12]. Exhaustive search is computationally infeasible for all but very low-dimensional models. Usually, greedy forward or backward search is employed. In forward search, the initial estimate of the edge set is the empty set and edges are then added iteratively until a suitable stopping criterion is satisfied. The selection (deletion) of a single edge in this search strategy requires an MLE fit [15] for O(p²) different models. The procedure is not well suited for high-dimensional graphs. The existence of the MLE is not guaranteed in general if the number of observations is smaller than the number of nodes [1]. More disturbingly, the complexity of the procedure renders even greedy search strategies impractical for modestly sized graphs. In contrast, neighborhood selection with the Lasso, proposed in the following, relies on optimization of a convex function, applied consecutively to each node in the graph. The method is computationally very efficient and is consistent even for the high-dimensional setting, as will be shown.
Neighborhood selection is a subproblem of covariance selection. The neighborhood ne_a of a node a ∈ Γ is the smallest subset of Γ \ {a} so that, given all variables X_{ne_a} in the neighborhood, X_a is conditionally independent of all remaining variables. The neighborhood of a node a ∈ Γ consists of all nodes b ∈ Γ \ {a} so that (a, b) ∈ E. Given n i.i.d. observations of X, neighborhood selection aims at estimating (individually) the neighborhood of any given variable (or node). The neighborhood selection can be cast as a standard regression problem and can be solved efficiently with the Lasso [16], as will be shown in this paper.
The consistency of the proposed neighborhood selection will be shown for sparse high-dimensional graphs, where the number of variables is potentially growing as any power of the number of observations (high-dimensionality), whereas the number of neighbors of any variable is growing at most slightly slower than the number of observations (sparsity).
A number of studies have examined the case of regression with a growing number of parameters as sample size increases. The closest to our setting is the recent work of Greenshtein and Ritov [8], who study consistent prediction in a triangular setup very similar to ours (see also [10]). However, the problem of consistent estimation of the model structure, which is the relevant concept for graphical models, is very different and not treated in these studies.
We study in Section 2 under which conditions, and at which rate, the neighborhood estimate with the Lasso converges to the true neighborhood. The choice of the penalty is crucial in the high-dimensional setting. The oracle penalty for optimal prediction turns out to be inconsistent for estimation of the true model. This solution might include an unbounded number of noise variables in the model. We motivate a different choice of the penalty such that the probability of falsely connecting two or more distinct connectivity components of the graph is controlled at very low levels. Asymptotically, the probability of estimating the correct neighborhood converges exponentially to 1, even when the number of nodes in the graph is growing rapidly as any power of the number of observations. As a consequence, consistent estimation of the full edge set in a sparse high-dimensional graph is possible (Section 3).
Encouraging numerical results are provided in Section 4. The proposed estimate is shown to be both more accurate than the traditional forward selection MLE strategy and computationally much more efficient. The accuracy of the forward selection MLE fit is in particular poor if the number of nodes in the graph is comparable to the number of observations. In contrast, neighborhood selection with the Lasso is shown to be reasonably accurate for estimating graphs with several thousand nodes, using only a few hundred observations.

2. Neighborhood selection. Instead of assuming a fixed true underlying model, we adopt a more flexible approach similar to the triangular setup in [8]. Both the number of nodes in the graph (number of variables), denoted by p(n) = |Γ(n)|, and the distribution (the covariance matrix) depend in general on the number of observations, so that Γ = Γ(n) and E = E(n). The neighborhood ne_a of a node a ∈ Γ(n) is the smallest subset of Γ(n) \ {a} so that X_a is conditionally independent of all remaining variables. Denote the closure of node a ∈ Γ(n) by cl_a := ne_a ∪ {a}. Then

    X_a \perp \{X_k;\ k \in \Gamma(n) \setminus \mathrm{cl}_a\} \mid X_{\mathrm{ne}_a}.

For details see [12]. The neighborhood depends in general on n as well. However, this dependence is notationally suppressed in the following.
It is instructive to give a slightly different definition of a neighborhood. For each node a ∈ Γ(n), consider optimal prediction of X_a, given all remaining variables. Let θ^a ∈ R^{p(n)} be the vector of coefficients for optimal prediction,

(1)    \theta^a = \arg\min_{\theta:\ \theta_a = 0} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Bigr)^2.

As a generalization of (1), which will be of use later, consider optimal prediction of X_a, given only a subset of variables {X_k; k ∈ A}, where A ⊆ Γ(n) \ {a}. The optimal prediction is characterized by the vector θ^{a,A},

(2)    \theta^{a,A} = \arg\min_{\theta:\ \theta_k = 0\ \forall k \notin A} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Bigr)^2.

The elements of θ^a are determined by the inverse covariance matrix [12]. For b ∈ Γ(n) \ {a} and K(n) = Σ^{-1}(n), it holds that θ^a_b = −K_{ab}(n)/K_{aa}(n). The set of nonzero coefficients of θ^a is identical to the set {b ∈ Γ(n) \ {a}: K_{ab}(n) ≠ 0} of nonzero entries in the corresponding row vector of the inverse covariance matrix and defines precisely the set of neighbors of node a. The best predictor for X_a is thus a linear function of variables in the set of neighbors of the node a only. The set of neighbors of a node a ∈ Γ(n) can hence be written as

    \mathrm{ne}_a = \{b \in \Gamma(n):\ \theta^a_b \neq 0\}.


This set corresponds to the set of effective predictor variables in regression with response variable X_a and predictor variables {X_k; k ∈ Γ(n) \ {a}}. Given n independent observations of X ~ N(0, Σ(n)), neighborhood selection tries to estimate the set of neighbors of a node a ∈ Γ(n). As the optimal linear prediction of X_a has nonzero coefficients precisely for variables in the set of neighbors of the node a, it seems reasonable to try to exploit this relation.
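As a quick numerical illustration of this identity (our own sketch, not part of the paper; all names are ours), the following Python snippet builds a small precision matrix K = Σ^{-1} and checks that the population regression coefficients of the first variable on the rest equal −K_{1b}/K_{11}, so the nonzero entries of that row of K mark its neighbors.

    import numpy as np

    # Our own check of the identity theta^a_b = -K_ab / K_aa for a Gaussian vector
    # with (positive definite) precision matrix K = Sigma^{-1}.
    K = np.array([[ 2.0, -0.8,  0.0],
                  [-0.8,  2.0, -0.8],
                  [ 0.0, -0.8,  2.0]])
    Sigma = np.linalg.inv(K)

    a = 0                                     # predict the first variable from the rest
    rest = [i for i in range(K.shape[0]) if i != a]

    # population regression coefficients of X_a on X_rest: Sigma_rr^{-1} Sigma_ra
    theta = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, a])
    print(theta)                              # [0.4, 0.0]
    print(-K[a, rest] / K[a, a])              # identical; only the second node is a neighbor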

2.1. Neighborhood selection with the Lasso. It is well known that the Lasso, introduced by Tibshirani [16], and known as Basis Pursuit in the context of wavelet regression [2], has a parsimonious property [11]. When predicting a variable X_a with all remaining variables {X_k; k ∈ Γ(n) \ {a}}, the vanishing Lasso coefficient estimates identify asymptotically the neighborhood of node a in the graph, as shown in the following. Let the n × p(n)-dimensional matrix X contain n independent observations of X, so that the columns X_a correspond for all a ∈ Γ(n) to the vector of n independent observations of X_a. Let ⟨·, ·⟩ be the usual inner product on R^n and ‖·‖_2 the corresponding norm.
The Lasso estimate θ^{a,λ} of θ^a is given by

(3)    \hat\theta^{a,\lambda} = \arg\min_{\theta:\ \theta_a = 0} \bigl(n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1\bigr),

where ‖θ‖_1 = Σ_{b∈Γ(n)} |θ_b| is the ℓ_1-norm of the coefficient vector. Normalization of all variables to a common empirical variance is recommended for the estimator in (3). The solution to (3) is not necessarily unique. However, if uniqueness fails, the set of solutions is still convex and all our results about neighborhoods (in particular Theorems 1 and 2) hold for any solution of (3).
Other regression estimates have been proposed, which are based on the ℓ_p-norm, where p is typically in the range [0, 2] (see [7]). A value of p = 2 leads to the ridge estimate, while p = 0 corresponds to traditional model selection. It is well known that the estimates have a parsimonious property (with some components being exactly zero) for p ≤ 1 only, while the optimization problem in (3) is only convex for p ≥ 1. Hence ℓ_1-constrained empirical risk minimization occupies a unique position, as p = 1 is the only value of p for which variable selection takes place while the optimization problem is still convex and hence feasible for high-dimensional problems.
The neighborhood estimate (parameterized by λ) is defined by the nonzero coefficient estimates of the ℓ_1-penalized regression,

    \hat{\mathrm{ne}}^\lambda_a = \{b \in \Gamma(n):\ \hat\theta^{a,\lambda}_b \neq 0\}.

Each choice of a penalty parameter λ thus specifies an estimate of the neighborhood ne_a of node a ∈ Γ(n), and one is left with the choice of a suitable penalty parameter. Larger values of the penalty tend to shrink the size of the estimated set, while more variables are in general included into ne^λ_a if the value of λ is diminished.
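In practice, (3) can be computed with any Lasso solver. The sketch below is our own illustration (the paper does not reference scikit-learn, and the function name is hypothetical); it estimates ne^λ_a for a single node a. Note that scikit-learn's Lasso minimizes (2n)^{-1}‖y − Xw‖²_2 + α‖w‖_1, so α = λ/2 corresponds to the objective in (3).

    import numpy as np
    from sklearn.linear_model import Lasso

    def neighborhood_estimate(X, a, lam):
        """Return the estimated neighborhood of node a for penalty lam, as in (3)."""
        n, p = X.shape
        # scale all variables to a common empirical variance, as recommended for (3)
        Xs = (X - X.mean(axis=0)) / X.std(axis=0)
        others = [k for k in range(p) if k != a]
        fit = Lasso(alpha=lam / 2.0, fit_intercept=False).fit(Xs[:, others], Xs[:, a])
        return {others[j] for j, coef in enumerate(fit.coef_) if coef != 0.0}

Applying this to every node a ∈ Γ(n) and combining the resulting sets as in Section 3 gives an estimate of the full graph.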


2.2. The prediction-oracle solution. A seemingly useful choice of the penalty parameter is the (unavailable) prediction-oracle value,

    \lambda_{\mathrm{oracle}} = \arg\min_{\lambda} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \hat\theta^{a,\lambda}_k X_k\Bigr)^2.

The expectation is understood to be with respect to a new X, which is independent of the sample on which θ^{a,λ} is estimated. The prediction-oracle penalty minimizes the predictive risk among all Lasso estimates. An estimate of λ_oracle is obtained by the cross-validated choice λ_cv.
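For illustration only (our own sketch, not the authors' code), λ_cv can be obtained by K-fold cross-validation; as before, scikit-learn's α corresponds to λ/2 in the notation of (3).

    from sklearn.linear_model import LassoCV

    def lambda_cv(X, a, folds=10):
        """Cross-validated penalty for the regression of node a on all other nodes."""
        others = [k for k in range(X.shape[1]) if k != a]
        fit = LassoCV(cv=folds, fit_intercept=False).fit(X[:, others], X[:, a])
        return 2.0 * fit.alpha_      # convert scikit-learn's alpha back to lambda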
For ℓ_0-penalized regression it was shown by Shao [14] that the cross-validated choice of the penalty parameter is consistent for model selection under certain conditions on the size of the validation set. The prediction-oracle solution does not lead to consistent model selection for the Lasso, as shown in the following for a simple example.

PROPOSITION 1. Let the number of variables grow to infinity, p(n) → ∞ for n → ∞, with p(n) = o(n^γ) for some γ > 0. Assume that the covariance matrices Σ(n) are identical to the identity matrix except for some pair (a, b) ∈ Γ(n) × Γ(n), for which Σ_{ab}(n) = Σ_{ba}(n) = s, for some 0 < s < 1 and all n ∈ N. The probability of selecting the wrong neighborhood for node a converges to 1 under the prediction-oracle penalty,

    P\bigl(\hat{\mathrm{ne}}^{\lambda_{\mathrm{oracle}}}_a \neq \mathrm{ne}_a\bigr) \to 1 \quad \text{for } n \to \infty.


A proof is given in the Appendix. It follows from the proof of Proposition 1 that many noise variables are included in the neighborhood estimate with the prediction-oracle solution. In fact, the probability of including noise variables with the prediction-oracle solution does not even vanish asymptotically for a fixed number of variables. If the penalty is chosen larger than the prediction-optimal value, consistent neighborhood selection is possible with the Lasso, as demonstrated in the following.

2.3. Assumptions. We make a few assumptions to prove consistency of neighborhood selection with the Lasso. We always assume availability of n independent observations from X ~ N(0, Σ).

High-dimensionality. The number of variables is allowed to grow as the number of observations n raised to an arbitrarily high power.

ASSUMPTION 1. There exists γ > 0, so that

    p(n) = O(n^\gamma) \quad \text{for } n \to \infty.

In particular, it is allowed for the following analysis that the number of variables is very much larger than the number of observations, p(n) ≫ n.


Nonsingularity. We make two regularity assumptions for the covariance matrices.

ASSUMPTION 2. For all a ∈ Γ(n) and n ∈ N, Var(X_a) = 1. There exists v² > 0, so that for all n ∈ N and a ∈ Γ(n),

    \mathrm{Var}(X_a \mid X_{\Gamma(n)\setminus\{a\}}) \ge v^2.

Common variance can always be achieved by appropriate scaling of the variables. A scaling to a common (empirical) variance of all variables is desirable, as the solutions would otherwise depend on the chosen units or dimensions in which they are represented. The second part of the assumption explicitly excludes singular or nearly singular covariance matrices. For singular covariance matrices, edges are not uniquely defined by the distribution and it is hence not surprising that nearly singular covariance matrices are not suitable for consistent variable selection. Note, however, that the empirical covariance matrix is a.s. singular if p(n) > n, which is allowed in our analysis.

Sparsity. The main assumption is the sparsity of the graph. This entails a restriction on the size of the neighborhoods of variables.

ASSUMPTION 3. There exists some 0 < κ < 1 so that

    \max_{a \in \Gamma(n)} |\mathrm{ne}_a| = O(n^\kappa) \quad \text{for } n \to \infty.

This assumption limits the maximal possible rate of growth for the size of neighborhoods.
For the next sparsity condition, consider again the definition in (2) of the optimal coefficient θ^{b,A} for prediction of X_b, given variables in the set A ⊆ Γ(n).

ASSUMPTION 4. There exists some ϑ < ∞ so that for all neighboring nodes a, b ∈ Γ(n) and all n ∈ N,

    \|\theta^{a,\,\mathrm{ne}_b\setminus\{a\}}\|_1 \le \vartheta.

This assumption is, for example, satisfied if Assumption 2 holds and the size of the overlap of neighborhoods is bounded by an arbitrarily large number from above for neighboring nodes. That is, if there exists some m < ∞ so that for all n ∈ N,

(4)    \max_{a,b \in \Gamma(n),\ b \in \mathrm{ne}_a} |\mathrm{ne}_a \cap \mathrm{ne}_b| \le m \quad \text{for } n \to \infty,

then Assumption 4 is satisfied. To see this, note that Assumption 2 gives a finite bound for the ℓ_2-norm of θ^{a, ne_b\{a}}, while (4) gives a finite bound for the ℓ_0-norm. Taken together, Assumption 4 is implied.


Magnitude of partial correlations. The next assumption bounds the magnitude of partial correlations from below. The partial correlation π_ab between variables X_a and X_b is the correlation after having eliminated the linear effects from all remaining variables {X_k; k ∈ Γ(n) \ {a, b}}; for details see [12].

ASSUMPTION 5. There exist a constant δ > 0 and some ξ > κ, with κ as in Assumption 3, so that for every (a, b) ∈ E,

    |\pi_{ab}| \ge \delta\, n^{-(1-\xi)/2}.

It will be shown below that Assumption 5 cannot be relaxed in general. Note that neighborhood selection for node a ∈ Γ(n) is equivalent to simultaneously testing the null hypothesis of zero partial correlation between variable X_a and all remaining variables X_b, b ∈ Γ(n) \ {a}. The null hypothesis of zero partial correlation between two variables can be tested by using the corresponding entry in the normalized inverse empirical covariance matrix. A graph estimate based on such tests has been proposed by Drton and Perlman [4]. Such a test can only be applied, however, if the number of variables is smaller than the number of observations, p(n) < n, as the empirical covariance matrix is singular otherwise. Even if p(n) = n − c for some constant c > 0, Assumption 5 would have to hold with ξ = 1 to have a positive power of rejecting false null hypotheses for such an estimate; that is, partial correlations would have to be bounded by a positive value from below.

Neighborhood stability. The last assumption is referred to as neighborhood stability. Using the definition of θ^{a,A} in (2), define for all a, b ∈ Γ(n),

(5)    S_a(b) := \sum_{k \in \mathrm{ne}_a} \mathrm{sign}(\theta^{a,\mathrm{ne}_a}_k)\, \theta^{b,\mathrm{ne}_a}_k.

The assumption of neighborhood stability restricts the magnitude of the quantities S_a(b) for nonneighboring nodes a, b ∈ Γ(n).

ASSUMPTION 6. There exists some δ < 1 so that for all a, b ∈ Γ(n) with b ∉ ne_a,

    |S_a(b)| < \delta.

It is shown in Proposition 3 that this assumption cannot be relaxed.
We give in the following a more intuitive condition which essentially implies Assumption 6. This will justify the term neighborhood stability. Consider the definition in (1) of the optimal coefficients θ^a for prediction of X_a. For η > 0, define θ^a(η) as the optimal set of coefficients under an additional ℓ_1-penalty,

(6)    \theta^a(\eta) := \arg\min_{\theta:\ \theta_a = 0} E\Bigl(X_a - \sum_{k \in \Gamma(n)} \theta_k X_k\Bigr)^2 + \eta\|\theta\|_1.


The neighborhood ne_a of node a was defined as the set of nonzero coefficients of θ^a, ne_a = {k ∈ Γ(n): θ^a_k ≠ 0}. Define the disturbed neighborhood ne_a(η) as

    \mathrm{ne}_a(\eta) := \{k \in \Gamma(n):\ \theta^a_k(\eta) \neq 0\}.

It clearly holds that ne_a = ne_a(0). The assumption of neighborhood stability is satisfied if there exists some infinitesimally small perturbation η, which may depend on n, so that the disturbed neighborhood ne_a(η) is identical to the undisturbed neighborhood ne_a(0).

PROPOSITION 2. If there exists some η > 0 so that ne_a(η) = ne_a(0), then |S_a(b)| ≤ 1 for all b ∈ Γ(n) \ ne_a.

A proof is given in the Appendix.
In light of Proposition 2 it seems that Assumption 6 is a very weak condition. To give one example, Assumption 6 is automatically satisfied under the much stronger assumption that the graph does not contain cycles. We give a brief reasoning for this. Consider two nonneighboring nodes a and b. If the nodes are in different connectivity components, there is nothing left to show, as S_a(b) = 0. If they are in the same connectivity component, then there exists one node k ∈ ne_a that separates b from ne_a \ {k}, as there is just one unique path between any two variables in the same connectivity component if the graph does not contain cycles. Using the global Markov property, the random variable X_b is independent of X_{ne_a\{k}}, given X_k. The random variable E(X_b | X_{ne_a}) is thus a function of X_k only. As the distribution is Gaussian, E(X_b | X_{ne_a}) = θ^{b,ne_a}_k X_k. By Assumption 2, Var(X_b | X_{ne_a}) = ṽ² for some ṽ² > 0. It follows that Var(X_b) = ṽ² + (θ^{b,ne_a}_k)² = 1 and hence θ^{b,ne_a}_k = √(1 − ṽ²) < 1, which implies that Assumption 6 is indeed satisfied if the graph does not contain cycles.
We mention that Assumption 6 is likewise satisfied if the inverse covariance matrices Σ^{-1}(n) are for each n ∈ N diagonally dominant. A matrix is said to be diagonally dominant if and only if, for each row, the sum of the absolute values of the nondiagonal elements is smaller than the absolute value of the diagonal element. The proof of this is straightforward but tedious and hence is omitted.
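The diagonal dominance condition is easy to verify numerically; a minimal helper (our own, not from the paper):

    import numpy as np

    def is_diagonally_dominant(K):
        """True iff, in every row, the sum of absolute off-diagonal entries is
        smaller than the absolute value of the diagonal entry."""
        K = np.asarray(K)
        diag = np.abs(np.diag(K))
        off = np.abs(K).sum(axis=1) - diag
        return bool(np.all(off < diag))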

2.4. Controlling type I errors. The asymptotic properties of Lasso-type estimates in regression have been studied in detail by Knight and Fu [11] for a fixed number of variables. Their results say that the penalty parameter λ should decay for an increasing number of observations at least as fast as n^{-1/2} to obtain an n^{1/2}-consistent estimate. It turns out that a slower rate is needed for consistent model selection in the high-dimensional case where p(n) ≫ n. However, a rate n^{-(1-ε)/2} with any κ < ε < ξ (where κ, ξ are defined as in Assumptions 3 and 5) is sufficient for consistent neighborhood selection, even when the number of variables is growing rapidly with the number of observations.


THEOREM 1. Let Assumptions 1-6 hold. Let the penalty parameter satisfy λ_n ~ d n^{-(1-ε)/2} with some κ < ε < ξ and d > 0. There exists some c > 0 so that for all a ∈ Γ(n),

    P\bigl(\hat{\mathrm{ne}}^\lambda_a \subseteq \mathrm{ne}_a\bigr) = 1 - O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.

A proof is given in the Appendix.
Theorem 1 states that the probability of (falsely) including any of the nonneighboring variables of the node a ∈ Γ(n) into the neighborhood estimate vanishes exponentially fast, even though the number of nonneighboring variables may grow very rapidly with the number of observations. It is shown in the following that Assumption 6 cannot be relaxed.

PROPOSITION 3. If there exist some a, b ∈ Γ(n) with b ∉ ne_a and |S_a(b)| > 1, then, for λ = λ_n as in Theorem 1,

    P\bigl(\hat{\mathrm{ne}}^\lambda_a \subseteq \mathrm{ne}_a\bigr) \to 0 \quad \text{for } n \to \infty.

A proof is given in the Appendix. Assumption 6 of neighborhood stability is hence critical for the success of Lasso neighborhood selection.

2.5. Controlling type II errors. So far it has been shown that the probability of falsely including variables into the neighborhood can be controlled by the Lasso. The question arises whether the probability of including all neighboring variables into the neighborhood estimate converges to 1 for n → ∞.

THEOREM 2. Let the assumptions of Theorem 1 be satisfied. For λ = λ_n as in Theorem 1, for some c > 0,

    P\bigl(\mathrm{ne}_a \subseteq \hat{\mathrm{ne}}^\lambda_a\bigr) = 1 - O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.

A proof is given in the Appendix.


It may be of interest whether Assumption 5 could be relaxed, so that edges a
detected even if the partial correlation is vanishing at a rate w_(1~^/2 for som
? < k. The following proposition says that ? > s (and thus ?>/cas?>/c
necessary condition if a stronger version of Assumption 4 holds, which is satisf
for forests and trees, for example.

PROPOSITION 4. Let the assumptions of Theorem 1 hold with $ < 1 in


sumption 4, except that for a eF(n), let there be some b e F(n) \ {a} with nab
and \nab\ = 0(n~^l~^^2) forn -> oo for some ? < e. Then

P(b rie^) -? 0 for n -> oo.


Theorem 2 and Proposition 4 say that edges between nodes for which the partial correlation vanishes at a rate n^{-(1-ξ)/2} are, with probability converging to 1 for n → ∞, detected if ξ > ε and undetected if ξ < ε. The results do not cover the case ξ = ε, which remains a challenging question for further research.
All results so far have treated the distinction between zero and nonzero partial correlations only. The signs of partial correlations of neighboring nodes can be estimated consistently under the same assumptions and with the same rates, as can be seen in the proofs.

3. Covariance selection. It follows from Section 2 that it is possible under certain conditions to estimate the neighborhood of each node in the graph consistently, for example,

    P\bigl(\hat{\mathrm{ne}}^\lambda_a = \mathrm{ne}_a\bigr) \to 1 \quad \text{for } n \to \infty.

The full graph is given by the set Γ(n) of nodes and the edge set E = E(n). The edge set contains those pairs (a, b) ∈ Γ(n) × Γ(n) for which the partial correlation between X_a and X_b is not zero. As the partial correlations are precisely nonzero for neighbors, the edge set E ⊆ Γ(n) × Γ(n) is given by

    E = \{(a, b):\ a \in \mathrm{ne}_b \wedge b \in \mathrm{ne}_a\}.

The first condition, a ∈ ne_b, implies in fact the second, b ∈ ne_a, and vice versa, so that the edge set is as well identical to {(a, b): a ∈ ne_b ∨ b ∈ ne_a}. For an estimate of the edge set of a graph, we can apply neighborhood selection to each node in the graph. A natural estimate of the edge set is then given by E^{λ,∧} ⊆ Γ(n) × Γ(n), where

(7)    \hat E^{\lambda,\wedge} = \{(a, b):\ a \in \hat{\mathrm{ne}}^\lambda_b \wedge b \in \hat{\mathrm{ne}}^\lambda_a\}.


Note that a ∈ ne^λ_b does not necessarily imply b ∈ ne^λ_a and vice versa. We can hence also define a second, less conservative, estimate of the edge set by

(8)    \hat E^{\lambda,\vee} = \{(a, b):\ a \in \hat{\mathrm{ne}}^\lambda_b \vee b \in \hat{\mathrm{ne}}^\lambda_a\}.

The discrepancies between the estimates (7) and (8) are quite small in our experience. Asymptotically the difference between both estimates vanishes, as seen in the following corollary. We refer to both edge set estimates collectively with the generic notation E^λ, as the following result holds for both of them.
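To make the two combination rules concrete, the helper below (our own sketch; the names are hypothetical) turns per-node neighborhood estimates, such as those produced by the sketch in Section 2.1, into the edge set estimates (7) and (8).

    def edge_set(neighborhoods, rule="and"):
        """Combine estimated neighborhoods into an undirected edge set.

        neighborhoods maps each node a to its estimated set ne^lambda_a;
        rule="and" implements (7), rule="or" the less conservative (8).
        """
        edges = set()
        for a, ne_a in neighborhoods.items():
            for b in ne_a:
                mutual = a in neighborhoods.get(b, set())
                if rule == "or" or (rule == "and" and mutual):
                    edges.add((min(a, b), max(a, b)))   # store each edge once
        return edges

    # e.g. edge_set({0: {1}, 1: {0, 2}, 2: set()}, rule="and") -> {(0, 1)}
    #      edge_set({0: {1}, 1: {0, 2}, 2: set()}, rule="or")  -> {(0, 1), (1, 2)}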

COROLLARY 1. Under the conditions of Theorem 2, for some c > 0,

    P\bigl(\hat E^\lambda = E\bigr) = 1 - O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.

The claim follows since |Γ(n)|² = p(n)² = O(n^{2γ}) by Assumption 1 and neighborhood selection has an exponentially fast convergence rate as described by Theorem 2. Corollary 1 says that the conditional independence structure of a multivariate normal distribution can be estimated consistently by combining the neighborhood estimates for all variables.
Note that there are in total 2^{p(p-1)/2} distinct graphs for a p-dimensional variable. However, for each of the p nodes there are only 2^{p-1} distinct potential neighborhoods. By breaking the graph selection problem into a consecutive series of neighborhood selection problems, the complexity of the search is thus reduced substantially at the price of potential inconsistencies between neighborhood estimates. Graph estimates that apply this strategy for complexity reduction are sometimes called dependency networks [9]. The complexity of the proposed neighborhood selection for one node with the Lasso is reduced further to O(np min{n, p}), as the Lars procedure of Efron, Hastie, Johnstone and Tibshirani [6] requires O(min{n, p}) steps, each of complexity O(np). For high-dimensional problems as in Theorems 1 and 2, where the number of variables grows as p(n) ~ cn^γ for some c > 0 and γ > 1, this is equivalent to O(p^{2+2/γ}) computations for the whole graph (with n ~ p^{1/γ} ≤ p, each of the p nodes costs O(n²p) = O(p^{1+2/γ})). The complexity of the proposed method thus scales approximately quadratically with the number of nodes for large values of γ.
Before providing some numerical results, we discuss in the following the choice of the penalty parameter.

Finite-sample results and significance. It was shown above that consistent neighborhood and covariance selection is possible with the Lasso in a high-dimensional setting. However, the asymptotic considerations give little advice on how to choose a specific penalty parameter for a given problem. Ideally, one would like to guarantee that pairs of variables which are not contained in the edge set enter the estimate of the edge set only with very low (prespecified) probability. Unfortunately, it seems very difficult to obtain such a result, as the probability of falsely including a pair of variables into the estimate of the edge set depends on the exact covariance matrix, which is in general unknown. It is possible, however, to constrain the probability of (falsely) connecting two distinct connectivity components of the true graph. The connectivity component C_a ⊆ Γ(n) of a node a ∈ Γ(n) is the set of nodes which are connected to node a by a chain of edges. The neighborhood ne_a is clearly part of the connectivity component C_a.
Let C^λ_a be the connectivity component of a in the estimated graph (Γ, E^λ). For any level 0 < α < 1, consider the choice of the penalty

(9)    \lambda(\alpha) = \frac{2\hat\sigma_a}{\sqrt{n}}\, \tilde\Phi^{-1}\!\Bigl(\frac{\alpha}{2\,p(n)^2}\Bigr),

where Φ̃ = 1 − Φ (Φ is the c.d.f. of N(0, 1)) and σ̂²_a = n^{-1}⟨X_a, X_a⟩. The probability of falsely joining two distinct connectivity components with the estimate of the edge set is bounded by the level α under the choice λ = λ(α) of the penalty parameter, as shown in the following theorem.
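The choice (9) is simple to evaluate. The sketch below (our own reading of (9); the helper name is hypothetical) computes λ(α) for one node, using SciPy's upper-tail normal quantile for Φ̃^{-1}.

    import numpy as np
    from scipy.stats import norm

    def penalty_lambda(x_a, p, alpha=0.05):
        """lambda(alpha) from (9): x_a is the n-vector of observations of node a,
        p the total number of nodes, alpha the prespecified level."""
        n = len(x_a)
        sigma_hat = np.sqrt(np.dot(x_a, x_a) / n)           # sigma_hat_a
        return 2.0 * sigma_hat / np.sqrt(n) * norm.isf(alpha / (2.0 * p ** 2))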


THEOREM 3. Let Assumptions 1-6 be satisfied. Using the penalty parameter λ(α), we have for all n ∈ N that

    P\bigl(\exists a \in \Gamma(n):\ \hat C^\lambda_a \not\subseteq C_a\bigr) \le \alpha.

A proof is given in the Appendix. This implies that if the edge set is empty (E = ∅), it is estimated by an empty set with high probability,

    P\bigl(\hat E^\lambda = \varnothing\bigr) \ge 1 - \alpha.

Theorem 3 is a finite-sample result. The previous asymptotic results in Theorems 1 and 2 hold if the level α vanishes exponentially to zero for an increasing number of observations, leading to consistent edge set estimation.

4. Numerical examples. We use both the Lasso estimate from Section 3 and forward selection MLE [5, 12] to estimate sparse graphs. We found it difficult to compare numerically neighborhood selection with forward selection MLE for more than 30 nodes in the graph. The high computational complexity of the forward selection MLE made the computations for such relatively low-dimensional problems very costly already. The Lasso scheme in contrast handled with ease graphs with more than 1000 nodes, using the recent algorithm developed in [6]. Where comparison was feasible, the performance of the neighborhood selection scheme was better. The difference was particularly pronounced if the ratio of observations to variables was low, as can be seen in Table 1, which will be described in more detail below.
First we give an account of the generation of the underlying graphs which we are trying to estimate. A realization of an underlying (random) graph is given in the left panel of Figure 1. The nodes of the graph are associated with spatial locations and the location of each node is distributed identically and uniformly in the two-dimensional square [0, 1]². Every pair of nodes is included initially in the edge set with probability φ(d/√p), where d is the Euclidean distance between the

Table 1
The average number of correctly identified edges as a function of the number k of falsely included edges, for n = 40 observations and p = 10, 20, 30 nodes, for forward selection MLE (FS), E^{λ,∨}, E^{λ,∧} and random guessing

                  p = 10             p = 20             p = 30
  k            0    5    10       0    5    10       0    5    10
  Random     0.2  1.9   3.7     0.1  0.7   1.4     0.1  0.5   0.9
  FS         7.6 14.1  17.1     8.9 16.6  21.6     0.6  1.8   3.2
  E^{λ,∨}    8.2 15.0  17.6     9.3 18.5  23.9    11.4 21.4  26.3
  E^{λ,∧}    8.5 14.7  17.6     9.5 19.1  34.0    14.1 21.4  27.4


FIG. 1. A realization of a graph is shown on the left, generated as described in the text. The graph consists of 1000 nodes and 1747 edges out of 449,500 distinct pairs of variables. The estimated edge set, using estimate (7) at level α = 0.05 [see (9)], is shown in the middle. There are two erroneously included edges, marked by an arrow, while 1109 edges are correctly detected. For estimate (8) and an adjusted level as described in the text, the result is shown on the right. Again two edges are erroneously included. Not a single pair of disjoint connectivity components of the true graph has been (falsely) joined by either estimate.

pair of variables and φ is the density of the standard normal distribution. The maximum number of edges connecting to each node is limited to four to achieve the desired sparsity of the graph. Edges which connect to nodes which do not satisfy this constraint are removed randomly until the constraint is satisfied for all edges. Initially all variables have identical conditional variance and the partial correlation between neighbors is set to 0.245 (absolute values less than 0.25 guarantee positive definiteness of the inverse covariance matrix); that is, Σ^{-1}_{aa} = 1 for all nodes a ∈ Γ, Σ^{-1}_{ab} = 0.245 if there is an edge connecting a and b, and Σ^{-1}_{ab} = 0 otherwise. The diagonal elements of the corresponding covariance matrix are in general larger than 1. To achieve constant variance, all variables are finally rescaled so that the diagonal elements of Σ are all unity. Using the Cholesky transformation of the covariance matrix, n independent samples are drawn from the corresponding Gaussian distribution.
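The construction above can be mirrored in a few lines. The sketch below is our own simplified illustration (the edge-generation rule and all names are ours): it places nodes in the unit square, connects nearby pairs up to a maximal degree of four, sets the off-diagonal precision entries on edges to 0.245, rescales to unit variances and samples from the resulting Gaussian distribution.

    import numpy as np

    def simulate_graph_data(p=50, n=200, rho=0.245, max_degree=4, seed=0):
        rng = np.random.default_rng(seed)
        loc = rng.uniform(size=(p, 2))                        # node locations in [0,1]^2
        dist = np.linalg.norm(loc[:, None, :] - loc[None, :, :], axis=-1)
        K = np.eye(p)                                         # precision matrix Sigma^{-1}
        degree = np.zeros(p, dtype=int)
        # visit pairs from nearest to farthest; connect with a distance-decaying probability;
        # max_degree * rho < 1 keeps K diagonally dominant, hence positive definite
        for d, a, b in sorted((dist[a, b], a, b) for a in range(p) for b in range(a + 1, p)):
            if degree[a] < max_degree and degree[b] < max_degree and rng.random() < np.exp(-d * p):
                K[a, b] = K[b, a] = rho
                degree[a] += 1
                degree[b] += 1
        Sigma = np.linalg.inv(K)
        scale = np.sqrt(np.diag(Sigma))
        Sigma = Sigma / np.outer(scale, scale)                # unit variances on the diagonal
        X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
        edges = {(a, b) for a in range(p) for b in range(a + 1, p) if K[a, b] != 0.0}
        return X, edges

Together with the neighborhood and edge-set sketches above, this gives a minimal end-to-end experiment in the spirit of this section.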
The average number of edges which are correctly included into the estimate
of the edge set is shown in Table 1 as a function of the number of edges which
are falsely included. The accuracy of the forward selection MLE is comparable
to the proposed Lasso neighborhood selection if the number of nodes is much
smaller than the number of observations. The accuracy of the forward selection
MLE breaks down, however, if the number of nodes is approximately equal to the
number of observations. Forward selection MLE is only marginally better than
random guessing in this case. Computation of the forward selection MLE (using
MIM, [5]) on the same desktop took up to several hundred times longer than the
Lasso neighborhood selection for the full graph. For more than 30 nodes, the dif
ferences are even more pronounced.
The Lasso neighborhood selection can be applied to hundred- or thousand-dimensional graphs, a realistic size for, for example, biological networks. A graph with 1000 nodes (following the same model as described above) and its estimates (7) and (8), using 600 observations, are shown in Figure 1. A level α = 0.05 is used for the estimate E^{λ,∨}. For better comparison, the level α was adjusted to α = 0.064 for the estimate E^{λ,∧}, so that both estimates lead to the same number of included edges. There are two erroneous edge inclusions, while 1109 out of the 1747 edges have been correctly identified by either estimate. Of these 1109 edges, 907 are common to both estimates while 202 are just present in either (7) or (8).
To examine if results are critically dependent on the assumption of Gaussianity, long-tailed noise is added to the observations. Instead of n i.i.d. observations of X ~ N(0, Σ), n i.i.d. observations of X + 0.1 Z are made, where the components of Z are independent and follow a t-distribution. For 10 simulations (with 500 observations each), the proportion of false rejections among all rejections increases only slightly from 0.8% (without long-tailed noise) to 1.4% (with long-tailed noise) for E^{λ,∨} and from 4.8% to 5.2% for E^{λ,∧}. Our limited numerical experience suggests that the properties of the graph estimator do not seem to be critically affected by deviations from Gaussianity.

APPENDIX: PROOFS

A.1. Notation and useful lemmas. As a generalization of (3), the Lasso estimate θ^{a,A,λ} of θ^{a,A}, defined in (2), is given by

(A.1)    \hat\theta^{a,A,\lambda} = \arg\min_{\theta:\ \theta_k = 0\ \forall k \notin A} \bigl(n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1\bigr).

The notation θ^{a,λ} is thus just a shorthand notation for θ^{a,Γ(n)\{a},λ}.

LEMMA A.1. Given θ ∈ R^{p(n)}, let G(θ) be a p(n)-dimensional vector with elements

    G_b(\theta) = -2 n^{-1}\langle X_a - X\theta,\ X_b\rangle.

A vector θ with θ_k = 0 for all k ∈ Γ(n) \ A is a solution to (A.1) iff for all b ∈ A, G_b(θ) = −sign(θ_b)λ in case θ_b ≠ 0 and |G_b(θ)| ≤ λ in case θ_b = 0. Moreover, if the solution is not unique and |G_b(θ)| < λ for some solution θ, then θ_b = 0 for all solutions of (A.1).

PROOF. Denote the subdifferential of

    n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1

with respect to θ by D(θ). The vector θ is a solution to (A.1) iff there exists an element d ∈ D(θ) so that d_b = 0 for all b ∈ A. D(θ) is given by {G(θ) + λe, e ∈ S}, where S ⊆ R^{p(n)} is given by S := {e ∈ R^{p(n)}: e_b = sign(θ_b) if θ_b ≠ 0 and e_b ∈ [−1, 1] if θ_b = 0}. The first part of the claim follows. The second part follows from the proof of Theorem 3.1 in [13].


LEMMA A.2. Let θ^{a,ne_a,λ} be defined for every a ∈ Γ(n) as in (A.1). Under the assumptions of Theorem 1, for some c > 0, for all a ∈ Γ(n),

    P\bigl(\mathrm{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b) = \mathrm{sign}(\theta^a_b),\ \forall b \in \mathrm{ne}_a\bigr) = 1 - O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.

For the sign-function, it is understood that sign(0) = 0. The lemma says, in other words, that if one could restrict the Lasso estimate to have zero coefficients for all nodes which are not in the neighborhood of node a, then the signs of the partial correlations in the neighborhood of node a are estimated consistently under the given assumptions.

PROOF. Using Bonferroni's inequality, and |ne_a| = o(n) for n → ∞, it suffices to show that there exists some c > 0 so that for every a, b ∈ Γ(n) with b ∈ ne_a,

    P\bigl(\mathrm{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b) = \mathrm{sign}(\theta^a_b)\bigr) = 1 - O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.


Consider the definition of θ^{a,ne_a,λ} in (A.1),

(A.2)    \hat\theta^{a,\mathrm{ne}_a,\lambda} = \arg\min_{\theta:\ \theta_k = 0\ \forall k \notin \mathrm{ne}_a} \bigl(n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1\bigr).

Assume now that component b of this estimate is fixed at a constant value β. Denote this new estimate by θ^{a,b,λ}(β),

(A.3)    \hat\theta^{a,b,\lambda}(\beta) = \arg\min \bigl(n^{-1}\|X_a - X\theta\|_2^2 + \lambda\|\theta\|_1\bigr),

where the minimum is taken over the set {θ ∈ R^{p(n)}: θ_b = β; θ_k = 0, ∀k ∉ ne_a}.
There always exists a value β (namely β = θ^{a,ne_a,λ}_b) so that θ^{a,b,λ}(β) is identical to θ^{a,ne_a,λ}. Thus, if sign(θ^{a,ne_a,λ}_b) ≠ sign(θ^a_b), there would exist some β with sign(β) sign(θ^a_b) ≤ 0 so that θ^{a,b,λ}(β) would be a solution to (A.2). Using sign(θ^a_b) ≠ 0 for all b ∈ ne_a, it is thus sufficient to show that for every β with sign(β) sign(θ^a_b) ≤ 0, θ^{a,b,λ}(β) cannot be a solution to (A.2) with high probability.
We focus in the following on the case where θ^a_b > 0 for notational simplicity. The case θ^a_b < 0 follows analogously. If θ^a_b > 0, it follows by Lemma A.1 that θ^{a,b,λ}(β) with θ^{a,b,λ}_b(β) = β ≤ 0 can be a solution to (A.2) only if G_b(θ^{a,b,λ}(β)) ≥ −λ. Hence it suffices to show that for some c > 0 and all b ∈ ne_a with θ^a_b > 0, for n → ∞,

(A.4)    P\Bigl(\sup_{\beta \le 0}\bigl\{G_b(\hat\theta^{a,b,\lambda}(\beta))\bigr\} < -\lambda\Bigr) = 1 - O(\exp(-c n^{\varepsilon})).


Let in the following R^λ(β) be the n-dimensional vector of residuals,

(A.5)    R^\lambda(\beta) := X_a - X\hat\theta^{a,b,\lambda}(\beta).

We can write X_b as

(A.6)    X_b = \sum_{k \in \mathrm{ne}_a \setminus \{b\}} \theta^{b,\mathrm{ne}_a\setminus\{b\}}_k X_k + W_b,

where W_b is independent of {X_k; k ∈ ne_a \ {b}}. By straightforward calculation, using (A.6),

    G_b(\hat\theta^{a,b,\lambda}(\beta)) = -2n^{-1}\langle R^\lambda(\beta), W_b\rangle - \sum_{k \in \mathrm{ne}_a\setminus\{b\}} \theta^{b,\mathrm{ne}_a\setminus\{b\}}_k \bigl(2n^{-1}\langle R^\lambda(\beta), X_k\rangle\bigr).

By Lemma A.1, for all k ∈ ne_a \ {b}, |G_k(θ^{a,b,λ}(β))| = |2n^{-1}⟨R^λ(β), X_k⟩| ≤ λ. This together with the equation above yields

(A.7)    G_b(\hat\theta^{a,b,\lambda}(\beta)) \le -2n^{-1}\langle R^\lambda(\beta), W_b\rangle + \lambda\,\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1.

Using Assumption 4, there exists some ϑ < ∞ so that ‖θ^{b,ne_a\{b}}‖_1 ≤ ϑ. For proving (A.4) it is therefore sufficient to show that there exists for every g > 0 some c > 0 so that for all b ∈ ne_a with θ^a_b > 0, for n → ∞,

(A.8)    P\Bigl(\inf_{\beta \le 0}\bigl\{2n^{-1}\langle R^\lambda(\beta), W_b\rangle\bigr\} > g\lambda\Bigr) = 1 - O(\exp(-c n^{\varepsilon})).

With a little abuse of notation, let W^∥ ⊆ R^n be the at most (|ne_a| − 1)-dimensional space which is spanned by the vectors {X_k, k ∈ ne_a \ {b}} and let W^⊥ be the orthogonal complement of W^∥ in R^n. Split the n-dimensional vector W_b of observations of W_b into the sum of two vectors

(A.9)    W_b = W_b^{\parallel} + W_b^{\perp},

where W_b^∥ is contained in the space W^∥, while the remaining part W_b^⊥ is chosen orthogonal to this space (in the orthogonal complement W^⊥ of W^∥). The inner product in (A.8) can be written as

(A.10)    2n^{-1}\langle R^\lambda(\beta), W_b\rangle = 2n^{-1}\langle R^\lambda(\beta), W_b^{\parallel}\rangle + 2n^{-1}\langle R^\lambda(\beta), W_b^{\perp}\rangle.

By Lemma A.3 (see below), there exists for every g > 0 some c > 0 so that, for n → ∞,

    P\Bigl(\inf_{\beta}\bigl\{2n^{-1}\langle R^\lambda(\beta), W_b^{\parallel}\rangle/(1 + |\beta|)\bigr\} \ge -g\lambda\Bigr) = 1 - O(\exp(-c n^{\varepsilon})).

To show (A.8), it is sufficient to prove that there exists for every g > 0 some c > 0 so that, for n → ∞,

(A.11)    P\Bigl(\inf_{\beta \le 0}\bigl\{2n^{-1}\langle R^\lambda(\beta), W_b^{\perp}\rangle - g(1 + |\beta|)\lambda\bigr\} > g\lambda\Bigr) = 1 - O(\exp(-c n^{\varepsilon})).



For some random variable V_a, independent of X_{ne_a}, we have

    X_a = \sum_{k \in \mathrm{ne}_a} \theta^a_k X_k + V_a.

Note that V_a and W_b are independent normally distributed random variables with variances σ_a² and σ_b², respectively. By Assumption 2, 0 < v² ≤ σ_a², σ_b² ≤ 1. Note furthermore that W_b and X_{ne_a\{b}} are independent. Using θ^a = θ^{a,ne_a} and (A.6),

(A.12)    X_a = \sum_{k \in \mathrm{ne}_a \setminus \{b\}} \bigl(\theta^a_k + \theta^a_b\,\theta^{b,\mathrm{ne}_a\setminus\{b\}}_k\bigr) X_k + \theta^a_b W_b + V_a.

Using (A.12), the definition of the residuals in (A.5) and the orthogonality property of W_b^⊥,

(A.13)    2n^{-1}\langle R^\lambda(\beta), W_b^{\perp}\rangle = 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle + 2n^{-1}\langle V_a, W_b^{\perp}\rangle
           \ge 2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - |2n^{-1}\langle V_a, W_b^{\perp}\rangle|.

The second term, |2n^{-1}⟨V_a, W_b^⊥⟩|, is stochastically smaller than |2n^{-1}⟨V_a, W_b⟩| (this can be derived by conditioning on {X_k; k ∈ ne_a}). Due to independence of V_a and W_b, E(V_a W_b) = 0. Using Bernstein's inequality (Lemma 2.2.11 in [17]), and λ ~ d n^{-(1-ε)/2} with ε > 0, there exists for every g > 0 some c > 0 so that

(A.14)    P\bigl(|2n^{-1}\langle V_a, W_b^{\perp}\rangle| > g\lambda\bigr) \le P\bigl(|2n^{-1}\langle V_a, W_b\rangle| > g\lambda\bigr) = O(\exp(-c n^{\varepsilon})).

Instead of (A.11), it is sufficient by (A.13) and (A.14) to show that there exists for every g > 0 a c > 0 so that, for n → ∞,

(A.15)    P\Bigl(\inf_{\beta \le 0}\bigl\{2n^{-1}(\theta^a_b - \beta)\langle W_b^{\perp}, W_b^{\perp}\rangle - g(1 + |\beta|)\lambda\bigr\} > 2g\lambda\Bigr) = 1 - O(\exp(-c n^{\varepsilon})).

Note that σ_b^{-2}⟨W_b^⊥, W_b^⊥⟩ follows a χ²_{n−|ne_a|} distribution. As |ne_a| = o(n) and σ_b² ≥ v² (by Assumption 2), it follows that there exists some k > 0 so that for n ≥ n_0 with some n_0(k) ∈ N, and any c > 0,

    P\bigl(2n^{-1}\langle W_b^{\perp}, W_b^{\perp}\rangle > k\bigr) = 1 - O(\exp(-c n^{\varepsilon})).

To show (A.15), it hence suffices to prove that for every k, ℓ > 0 there exists some n_0(k, ℓ) ∈ N so that, for all n > n_0,

(A.16)    \inf_{\beta \le 0}\bigl\{(\theta^a_b - \beta)k - \ell(1 + |\beta|)\lambda\bigr\} > 0.

By Assumption 5, |π_ab| is of order at least n^{-(1-ξ)/2}. Using

    \theta^a_b = \pi_{ab}\,\bigl(\mathrm{Var}(X_a \mid X_{\Gamma(n)\setminus\{a\}})/\mathrm{Var}(X_b \mid X_{\Gamma(n)\setminus\{b\}})\bigr)^{1/2}


and Assumption 2, this implies that there exists some q > 0 so that θ^a_b ≥ q n^{-(1-ξ)/2}. As λ ~ d n^{-(1-ε)/2} and, by the assumptions in Theorem 1, ξ > ε, it follows that for every k, ℓ > 0 and large enough values of n,

    \theta^a_b\,k - \ell\lambda > 0.

It remains to show that for any k, ℓ > 0 there exists some n_0(ℓ, k) so that for all n > n_0,

    \inf_{\beta \le 0}\{-\beta k - \ell|\beta|\lambda\} \ge 0.

This follows as λ → 0 for n → ∞, which completes the proof.

LEMMA A.3. Assume the conditions of Theorem 1 hold. Let R^λ(β) be defined as in (A.5) and W_b^∥ as in (A.9). For any g > 0 there exists c > 0 so that for all a, b ∈ Γ(n), for n → ∞,

    P\Bigl(\sup_{\beta}\ |2n^{-1}\langle R^\lambda(\beta), W_b^{\parallel}\rangle|/(1 + |\beta|) \le g\lambda\Bigr) = 1 - O(\exp(-c n^{\varepsilon})).

PROOF. By Schwarz's inequality,

(A.17)    |2n^{-1}\langle R^\lambda(\beta), W_b^{\parallel}\rangle|/(1 + |\beta|) \le 2 n^{-1/2}\|W_b^{\parallel}\|_2\ \frac{n^{-1/2}\|R^\lambda(\beta)\|_2}{1 + |\beta|}.

The sum of squares of the residuals is increasing with increasing value of λ. Thus ‖R^λ(β)‖²_2 ≤ ‖R^∞(β)‖²_2. By definition of R^λ in (A.5), and using (A.3),

    \|R^{\infty}(\beta)\|_2^2 = \|X_a - \beta X_b\|_2^2,

and hence

    \|R^\lambda(\beta)\|_2^2 \le (1 + |\beta|)^2 \max\{\|X_a\|_2^2,\ \|X_b\|_2^2\}.

Therefore, for any q > 0,

    P\Bigl(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|R^\lambda(\beta)\|_2}{1 + |\beta|} > q\Bigr) \le P\bigl(n^{-1/2}\max\{\|X_a\|_2,\ \|X_b\|_2\} > q\bigr).

Note that both ‖X_a‖²_2 and ‖X_b‖²_2 are χ²_n-distributed. Thus there exist q > 1 and c > 0 so that

(A.18)    P\Bigl(\sup_{\beta \in \mathbb{R}} \frac{n^{-1/2}\|R^\lambda(\beta)\|_2}{1 + |\beta|} > q\Bigr) = O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.

It remains to show that for every g > 0 there exists some c > 0 so that

(A.19)    P\bigl(n^{-1/2}\|W_b^{\parallel}\|_2 > g\lambda\bigr) = O(\exp(-c n^{\varepsilon})) \quad \text{for } n \to \infty.


The expression σ_b^{-2}⟨W_b^∥, W_b^∥⟩ is χ²_{|ne_a|−1}-distributed. As σ_b² ≤ 1 and |ne_a| = O(n^κ), it follows that n^{-1/2}‖W_b^∥‖_2 is for some t > 0 stochastically smaller than

    t\, n^{-1/2} Z^{1/2},

where Z is χ²_{n^κ}-distributed. Thus, for every g > 0,

    P\bigl(n^{-1/2}\|W_b^{\parallel}\|_2 > g\lambda\bigr) \le P\bigl((Z/n^{\kappa}) > (g/t)^2 n^{1-\kappa}\lambda^2\bigr).

As λ^{-1} = O(n^{(1-ε)/2}), it follows that n^{1-κ}λ² ≥ h n^{ε-κ} for some h > 0 and sufficiently large n. By the properties of the χ² distribution and ε > κ, by assumption in Theorem 1, claim (A.19) follows. This completes the proof.

PROOF OF PROPOSITION 1. All diagonal elements of the covariance matrices Σ(n) are equal to 1, while all off-diagonal elements vanish for all pairs except for a, b ∈ Γ(n), where Σ_{ab}(n) = s with 0 < s < 1. Assume w.l.o.g. that a corresponds to the first and b to the second variable. The best vector of coefficients θ^a for linear prediction of X_a is given by θ^a = (0, −K_{ab}/K_{aa}, 0, 0, ...) = (0, s, 0, 0, ...), where K = Σ^{-1}(n). A necessary condition for ne^λ_a = ne_a is that θ^{a,λ} = (0, τ, 0, 0, ...) is the oracle Lasso solution for some τ ≠ 0. In the following, we show first that

(A.20)    P\bigl(\exists \lambda,\ \tau \ge s:\ \hat\theta^{a,\lambda} = (0, \tau, 0, 0, \ldots)\bigr) \to 0, \quad n \to \infty.

The proof is then completed by showing in addition that (0, τ, 0, 0, ...) cannot be the oracle Lasso solution as long as τ < s.
We begin by showing (A.20). If θ = (0, τ, 0, 0, ...) is a Lasso solution for some value of the penalty, it follows, using Lemma A.1 and positivity of τ, that

(A.21)    \langle X_1 - \tau X_2, X_2\rangle \ge |\langle X_1 - \tau X_2, X_k\rangle| \qquad \forall k \in \Gamma(n),\ k > 2.

Under the given assumptions, X_2, X_3, ... can be understood to be independently and identically distributed, while X_1 = sX_2 + W_1, with W_1 independent of (X_2, X_3, ...). Substituting X_1 = sX_2 + W_1 in (A.21) yields for all k ∈ Γ(n) with k > 2,

    \langle W_1, X_2\rangle - (\tau - s)\langle X_2, X_2\rangle \ge |\langle W_1, X_k\rangle - (\tau - s)\langle X_2, X_k\rangle|.

Let U_2, U_3, ..., U_{p(n)} be the random variables defined by U_k = ⟨W_1, X_k⟩. Note that the random variables U_k, k = 2, ..., p(n), are exchangeable. Let furthermore

    D = \langle X_2, X_2\rangle - \max_{k \in \Gamma(n),\ k > 2} |\langle X_2, X_k\rangle|.

The inequality above then implies

    U_2 \ge \max_{k \in \Gamma(n),\ k > 2} U_k + (\tau - s) D.


To show the claim, it thus suffices to show that

(A.22)    P\Bigl(U_2 \ge \max_{k \in \Gamma(n),\ k > 2} U_k + (\tau - s) D\Bigr) \to 0 \quad \text{for } n \to \infty.

Using τ − s ≥ 0,

    P\Bigl(U_2 \ge \max_{k \in \Gamma(n),\ k > 2} U_k + (\tau - s) D\Bigr) \le P\Bigl(U_2 \ge \max_{k \in \Gamma(n),\ k > 2} U_k\Bigr) + P(D < 0).

Using the assumption that s < 1, it follows by p(n) = o(n^γ) for some γ > 0 and a Bernstein-type inequality that

    P(D < 0) \to 0 \quad \text{for } n \to \infty.

Furthermore, as U_2, ..., U_{p(n)} are exchangeable,

    P\Bigl(U_2 \ge \max_{k \in \Gamma(n),\ k > 2} U_k\Bigr) = (p(n) - 1)^{-1} \to 0 \quad \text{for } n \to \infty,

which shows that (A.22) holds. The claim (A.20) follows.
It hence suffices to show that (0, τ, 0, 0, ...) with τ < s cannot be the oracle Lasso solution. Let τ_max be the maximal value of τ so that (0, τ, 0, ...) is a Lasso solution for some value λ > 0. By the previous assumption, τ_max < s. For τ < τ_max, the vector (0, τ, 0, ...) cannot be the oracle Lasso solution. We show in the following that (0, τ_max, 0, ...) cannot be an oracle Lasso solution either. Suppose that (0, τ_max, 0, 0, ...) is the Lasso solution θ^{a,λ} for some λ = λ̃ > 0. As τ_max is the maximal value such that (0, τ, 0, ...) is a Lasso solution, there exists some k ∈ Γ(n), k > 2, such that

    |n^{-1}\langle X_1 - \tau_{\max} X_2, X_2\rangle| = |n^{-1}\langle X_1 - \tau_{\max} X_2, X_k\rangle|,

and the absolute value of both components G_2 and G_k of the gradient is equal to λ̃. By appropriately reordering the variables we can assume that k = 3. Furthermore, it holds a.s. that

    \max_{k \in \Gamma(n),\ k > 3} |\langle X_1 - \tau_{\max} X_2, X_k\rangle| < \tilde\lambda.

Hence, for sufficiently small δλ > 0, a Lasso solution for the penalty λ̃ − δλ is given by

    (0,\ \tau_{\max} + \delta\theta_2,\ \delta\theta_3,\ 0, \ldots).

Let H_n be the empirical covariance matrix of (X_2, X_3). Assume w.l.o.g. that n^{-1}⟨X_1 − τ_max X_2, X_k⟩ > 0 and n^{-1}⟨X_2, X_2⟩ = n^{-1}⟨X_3, X_3⟩ = 1. Following, for example, Efron et al. ([6], page 417), the components (δθ_2, δθ_3) are then given by H_n^{-1}(1, 1)^T, from which it follows that δθ_2 = δθ_3, which we abbreviate by δθ in the following (one can accommodate a negative sign for n^{-1}⟨X_1 − τ_max X_2, X_k⟩ by


reversing the sign of δθ_3). Denote by L_{δθ} the squared error loss for this solution. Then, for sufficiently small δθ,

    L_{\delta\theta} - L_0 = E\bigl(X_1 - (\tau_{\max} + \delta\theta)X_2 + \delta\theta X_3\bigr)^2 - E\bigl(X_1 - \tau_{\max} X_2\bigr)^2
                           = \bigl(s - (\tau_{\max} + \delta\theta)\bigr)^2 + \delta\theta^2 - (s - \tau_{\max})^2
                           = -2(s - \tau_{\max})\,\delta\theta + 2\,\delta\theta^2.

It holds that L_{δθ} − L_0 < 0 for any 0 < δθ < (s − τ_max)/2, which shows that (0, τ, 0, ...) cannot be the oracle solution for τ < s. Together with (A.20), this completes the proof.

PROOF OF PROPOSITION 2. The subdifferential of the argument in (6),

    E\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m X_m\Bigr)^2 + \eta\|\theta\|_1,

with respect to θ_k, k ∈ Γ(n) \ {a}, is given by

    -2 E\Bigl\{\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta_m X_m\Bigr) X_k\Bigr\} + \eta e_k,

where e_k ∈ [−1, 1] if θ_k = 0, and e_k = sign(θ_k) if θ_k ≠ 0. Using the fact that ne_a(η) = ne_a, it follows as in Lemma A.1 that for all k ∈ ne_a,

(A.23)    2 E\Bigl\{\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta^a_m(\eta) X_m\Bigr) X_k\Bigr\} = \eta\,\mathrm{sign}(\theta^a_k)

and, for b ∉ ne_a,

(A.24)    2\Bigl|E\Bigl\{\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta^a_m(\eta) X_m\Bigr) X_b\Bigr\}\Bigr| \le \eta.

A variable X_b with b ∉ ne_a can be written as

    X_b = \sum_{k \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_k X_k + W_b,

where W_b is independent of {X_k; k ∈ cl_a}. Using this in (A.24) yields

    2\Bigl|\sum_{k \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_k\, E\Bigl\{\Bigl(X_a - \sum_{m \in \Gamma(n)} \theta^a_m(\eta) X_m\Bigr) X_k\Bigr\}\Bigr| \le \eta.

Using (A.23) and θ^a = θ^{a,ne_a}, it follows that

    \Bigl|\sum_{k \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_k\,\mathrm{sign}(\theta^{a,\mathrm{ne}_a}_k)\Bigr| \le 1,

which completes the proof.

PROOF OF THEOREM 1. The event ne^λ_a ⊄ ne_a is equivalent to the event that there exists some node b ∈ Γ(n) \ cl_a in the set of nonneighbors of node a such that the estimated coefficient θ^{a,λ}_b is not zero. Thus

(A.25)    P\bigl(\hat{\mathrm{ne}}^\lambda_a \subseteq \mathrm{ne}_a\bigr) = 1 - P\bigl(\exists b \in \Gamma(n)\setminus \mathrm{cl}_a:\ \hat\theta^{a,\lambda}_b \neq 0\bigr).

Consider the Lasso estimate θ^{a,ne_a,λ}, which is by (A.1) constrained to have nonzero components only in the neighborhood of node a ∈ Γ(n). Using |ne_a| = O(n^κ) with some κ < 1, we can assume w.l.o.g. that |ne_a| < n. This in turn implies (see, e.g., [13]) that θ^{a,ne_a,λ} is a.s. a unique solution to (A.1) with A = ne_a. Let ℰ be the event

    \max_{k \in \Gamma(n)\setminus \mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.

Conditional on the event ℰ, it follows from the first part of Lemma A.1 that θ^{a,ne_a,λ} is not only a solution of (A.1), with A = ne_a, but as well a solution of (3), where A = Γ(n) \ {a}. As θ^{a,ne_a,λ}_b = 0 for all b ∈ Γ(n) \ cl_a, it follows from the second part of Lemma A.1 that θ^{a,λ}_b = 0 for all b ∈ Γ(n) \ cl_a. Hence

    P\bigl(\exists b \in \Gamma(n)\setminus \mathrm{cl}_a:\ \hat\theta^{a,\lambda}_b \neq 0\bigr) \le 1 - P(\mathcal{E}) = P\Bigl(\max_{k \in \Gamma(n)\setminus \mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \lambda\Bigr),

where

(A.26)    G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ X_b\rangle.

Using Bonferroni's inequality and p(n) = O(n^γ) for any γ > 0, it suffices to show that there exists a constant c > 0 so that for all b ∈ Γ(n) \ cl_a,

(A.27)    P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \lambda\bigr) = O(\exp(-c n^{\varepsilon})).

One can write for any b ∈ Γ(n) \ cl_a,

(A.28)    X_b = \sum_{m \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_m X_m + V_b,

where V_b ~ N(0, σ_b²) for some σ_b² ≤ 1 and V_b is independent of {X_m; m ∈ cl_a}. Hence

    G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_m \langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ X_m\rangle - 2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ V_b\rangle.


By Lemma A.2, there exists some c > 0 so that with probability 1 − O(exp(−cn^ε)),

(A.29)    \mathrm{sign}(\hat\theta^{a,\mathrm{ne}_a,\lambda}_k) = \mathrm{sign}(\theta^{a,\mathrm{ne}_a}_k) \qquad \forall k \in \mathrm{ne}_a.

In this case, by Lemma A.1,

    2n^{-1}\sum_{m \in \mathrm{ne}_a} \theta^{b,\mathrm{ne}_a}_m \langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ X_m\rangle = \Bigl(\sum_{m \in \mathrm{ne}_a} \mathrm{sign}(\theta^{a,\mathrm{ne}_a}_m)\,\theta^{b,\mathrm{ne}_a}_m\Bigr)\lambda.

If (A.29) holds, the gradient is given by

(A.30)    G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda}) = -\Bigl(\sum_{m \in \mathrm{ne}_a} \mathrm{sign}(\theta^{a,\mathrm{ne}_a}_m)\,\theta^{b,\mathrm{ne}_a}_m\Bigr)\lambda - 2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ V_b\rangle.

Using Assumption 6 and Proposition 2, there exists some δ < 1 so that

    \Bigl|\sum_{m \in \mathrm{ne}_a} \mathrm{sign}(\theta^{a,\mathrm{ne}_a}_m)\,\theta^{b,\mathrm{ne}_a}_m\Bigr| \le \delta.

The absolute value of the coefficient G_b of the gradient in (A.26) is hence bounded with probability 1 − O(exp(−cn^ε)) by

(A.31)    |G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \le \delta\lambda + |2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ V_b\rangle|.

Conditional on X_{cl_a} = {X_k; k ∈ cl_a}, the random variable

    \langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ V_b\rangle

is normally distributed with mean zero and variance σ_b²‖X_a − Xθ^{a,ne_a,λ}‖²_2. On the one hand, σ_b² ≤ 1. On the other hand, by definition of θ^{a,ne_a,λ},

    \|X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda}\|_2^2 \le \|X_a\|_2^2.

Thus |2n^{-1}⟨X_a − Xθ^{a,ne_a,λ}, V_b⟩| is stochastically smaller than or equal to |2n^{-1}⟨X_a, V_b⟩|. Using (A.31), it remains to be shown that for some c > 0 and δ < 1,

    P\bigl(|2n^{-1}\langle X_a, V_b\rangle| \ge (1 - \delta)\lambda\bigr) = O(\exp(-c n^{\varepsilon})).

As V_b and X_a are independent, E(X_a V_b) = 0. Using the Gaussianity and bounded variance of both X_a and V_b, there exists some g < ∞ so that E(exp(|X_a V_b|)) < g. Hence, using Bernstein's inequality and the boundedness of λ, for some c > 0 and all b ∈ Γ(n) \ cl_a, P(|2n^{-1}⟨X_a, V_b⟩| ≥ (1 − δ)λ) = O(exp(−cnλ²)), and nλ² ~ d²n^ε. The claim (A.27) follows, which completes the proof.


PROOF OF PROPOSITION 3. Following a similar argument as in the proof of Theorem 1, up to (A.27), it is sufficient to show that for every a, b with b ∈ Γ(n) \ cl_a and |S_a(b)| > 1,

(A.32)    P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| > \lambda\bigr) \to 1 \quad \text{for } n \to \infty.

Using (A.30) in the proof of Theorem 1, one can conclude that for some δ > 1, with probability converging to 1 for n → ∞,

(A.33)    |G_b(\hat\theta^{a,\mathrm{ne}_a,\lambda})| \ge \delta\lambda - |2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ V_b\rangle|.

Using the identical argument as in the proof of Theorem 1 below (A.31), for the second term, for any g > 0,

    P\bigl(|2n^{-1}\langle X_a - X\hat\theta^{a,\mathrm{ne}_a,\lambda},\ V_b\rangle| > g\lambda\bigr) \to 0 \quad \text{for } n \to \infty,

which together with δ > 1 in (A.33) shows that (A.32) holds. This completes the proof.

PROOF OF THEOREM 2. First, P(ne_a ⊆ ne^λ_a) = 1 − P(∃b ∈ ne_a: θ^{a,λ}_b = 0). Let ℰ again be the event

(A.34)    \max_{k \in \Gamma(n)\setminus \mathrm{cl}_a} |G_k(\hat\theta^{a,\mathrm{ne}_a,\lambda})| < \lambda.

Conditional on ℰ, we can conclude as in the proof of Theorem 1 that θ^{a,ne_a,λ} and θ^{a,λ} are unique solutions to (A.1) and (3), respectively, and θ^{a,ne_a,λ} = θ^{a,λ}. Thus

    P\bigl(\exists b \in \mathrm{ne}_a:\ \hat\theta^{a,\lambda}_b = 0\bigr) \le P\bigl(\exists b \in \mathrm{ne}_a:\ \hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0\bigr) + P(\mathcal{E}^c).

It follows from the proof of Theorem 1 that there exists some c > 0 so that P(ℰ^c) = O(exp(−cn^ε)). Using Bonferroni's inequality, it hence remains to show that there exists some c > 0 so that for all b ∈ ne_a,

(A.35)    P\bigl(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0\bigr) = O(\exp(-c n^{\varepsilon})).

This follows from Lemma A.2, which completes the proof.

PROOF OF PROPOSITION 4. The proof of Proposition 4 is to a large extent analogous to the proofs of Theorems 1 and 2. Let ℰ again be the event (A.34). Conditional on the event ℰ, we can conclude as before that θ^{a,ne_a,λ} and θ^{a,λ} are unique solutions to (A.1) and (3), respectively, and θ^{a,ne_a,λ} = θ^{a,λ}. Thus, for any b ∈ ne_a,

    P(b \notin \hat{\mathrm{ne}}^\lambda_a) = P(\hat\theta^{a,\lambda}_b = 0) \ge P(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0 \mid \mathcal{E})\, P(\mathcal{E}).

Since P(ℰ) → 1 for n → ∞ by Theorem 1,

    P(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0 \mid \mathcal{E})\, P(\mathcal{E}) \to P(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0) \quad \text{for } n \to \infty.

It thus suffices to show that for all b ∈ ne_a with |π_ab| = O(n^{-(1-ξ)/2}) and ξ < ε,

    P(\hat\theta^{a,\mathrm{ne}_a,\lambda}_b = 0) \to 1 \quad \text{for } n \to \infty.

This holds if

(A.36)    P\bigl(|G_b(\hat\theta^{a,\mathrm{ne}_a\setminus\{b\},\lambda})| < \lambda\bigr) \to 1 \quad \text{for } n \to \infty,

as |G_b(θ^{a,ne_a\{b},λ})| < λ implies that θ^{a,ne_a\{b},λ} = θ^{a,ne_a,λ} and hence θ^{a,ne_a,λ}_b = 0. Using (A.7),

    |G_b(\hat\theta^{a,\mathrm{ne}_a\setminus\{b\},\lambda})| \le |2n^{-1}\langle R^\lambda(0),\ W_b\rangle| + \lambda\,\|\theta^{b,\mathrm{ne}_a\setminus\{b\}}\|_1.

By assumption, ‖θ^{b,ne_a\{b}}‖_1 ≤ ϑ < 1. It is thus sufficient to show that for any g > 0,

(A.37)    P\bigl(|2n^{-1}\langle R^\lambda(0),\ W_b\rangle| < g\lambda\bigr) \to 1 \quad \text{for } n \to \infty.

Analogously to (A.10), we can write

(A.38)    |2n^{-1}\langle R^\lambda(0),\ W_b\rangle| \le |2n^{-1}\langle R^\lambda(0),\ W_b^{\perp}\rangle| + |2n^{-1}\langle R^\lambda(0),\ W_b^{\parallel}\rangle|.

Using Lemma A.3, it follows for the last term on the right-hand side that for every g > 0,

    P\bigl(|2n^{-1}\langle R^\lambda(0),\ W_b^{\parallel}\rangle| < g\lambda\bigr) \to 1 \quad \text{for } n \to \infty.

Using (A.13) and (A.14), it hence remains to show that for every g > 0,

(A.39)    P\bigl(|2n^{-1}\theta^{a,\mathrm{ne}_a}_b \langle W_b^{\perp},\ W_b^{\perp}\rangle| < g\lambda\bigr) \to 1 \quad \text{for } n \to \infty.

We have already noted above that the term σ_b^{-2}⟨W_b^⊥, W_b^⊥⟩ follows a χ²_{n−|ne_a|} distribution and is hence stochastically smaller than a χ²_n-distributed random variable. By Assumption 2, σ_b² ≤ 1. Furthermore, using Assumption 2 and |π_ab| = O(n^{-(1-ξ)/2}), |θ^{a,ne_a}_b| = O(n^{-(1-ξ)/2}). Hence, with λ ~ dn^{-(1-ε)/2}, it follows that for some constant k > 0, λ/|θ^{a,ne_a}_b| ≥ k n^{(ε-ξ)/2}. Thus, for some constant c > 0,

(A.40)    P\bigl(|2n^{-1}\theta^{a,\mathrm{ne}_a}_b \langle W_b^{\perp},\ W_b^{\perp}\rangle| < g\lambda\bigr) \ge P\bigl(Z/n < c\, n^{(\varepsilon-\xi)/2}\bigr),

where Z follows a χ²_n distribution. By the properties of the χ² distribution and the assumption ξ < ε, the right-hand side in (A.40) converges to 1 for n → ∞, from which (A.39) and hence the claim follow.

PROOF OF THEOREM 3. A necessary condition for C^λ_a ⊄ C_a is that there exists an edge in E^λ joining two nodes in two different connectivity components. Hence

    P\bigl(\exists a \in \Gamma(n):\ \hat C^\lambda_a \not\subseteq C_a\bigr) \le p(n) \max_{a \in \Gamma(n)} P\bigl(\exists b \in \Gamma(n)\setminus C_a:\ b \in \hat{\mathrm{ne}}^\lambda_a\bigr).

Using the same arguments as in the proof of Theorem 1,

    P\bigl(\exists b \in \Gamma(n)\setminus C_a:\ b \in \hat{\mathrm{ne}}^\lambda_a\bigr) \le P\Bigl(\max_{b \in \Gamma(n)\setminus C_a} |G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda\Bigr),

where θ^{a,C_a,λ}, according to (A.1), has nonzero components only for variables in the connectivity component C_a of node a. Hence it is sufficient to show that

(A.41)    p(n)^2 \max_{a \in \Gamma(n),\ b \in \Gamma(n)\setminus C_a} P\bigl(|G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda\bigr) \le \alpha.

The gradient is given by G_b(θ^{a,C_a,λ}) = −2n^{-1}⟨X_a − Xθ^{a,C_a,λ}, X_b⟩. For all k ∈ C_a, the variables X_b and X_k are independent, as they are in different connectivity components. Hence, conditional on X_{C_a} = {X_k; k ∈ C_a},

    G_b(\hat\theta^{a,C_a,\lambda}) \sim \mathcal{N}(0,\ R^2/n),

where R² = 4n^{-1}‖X_a − Xθ^{a,C_a,λ}‖²_2, which is smaller than or equal to 4σ̂²_a = 4n^{-1}‖X_a‖²_2 by definition of θ^{a,C_a,λ}. Hence for all a ∈ Γ(n) and b ∈ Γ(n) \ C_a,

    P\bigl(|G_b(\hat\theta^{a,C_a,\lambda})| \ge \lambda \mid X_{C_a}\bigr) \le 2\tilde\Phi\bigl(\sqrt{n}\,\lambda/(2\hat\sigma_a)\bigr),

where Φ̃ = 1 − Φ. It follows for the λ proposed in (9) that P(|G_b(θ^{a,C_a,λ})| ≥ λ | X_{C_a}) ≤ α p(n)^{-2}, and therefore P(|G_b(θ^{a,C_a,λ})| ≥ λ) ≤ α p(n)^{-2}. Thus (A.41) follows, which completes the proof.

REFERENCES
[1] Buhl, S. (1993). On the existence of maximum-likelihood estimators for graphical Gaussian models. Scand. J. Statist. 20 263-270. MR1241392
[2] Chen, S., Donoho, D. and Saunders, M. (2001). Atomic decomposition by basis pursuit. SIAM Rev. 43 129-159. MR1854649
[3] Dempster, A. (1972). Covariance selection. Biometrics 28 157-175.
[4] Drton, M. and Perlman, M. (2004). Model selection for Gaussian concentration graphs. Biometrika 91 591-602. MR2090624
[5] Edwards, D. (2000). Introduction to Graphical Modelling, 2nd ed. Springer, New York. MR1880319
[6] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407-499. MR2060166
[7] Frank, I. and Friedman, J. (1993). A statistical view of some chemometrics regression tools (with discussion). Technometrics 35 109-148.
[8] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of over-parametrization. Bernoulli 10 971-988. MR2108039
[9] Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R. and Kadie, C. (2000). Dependency networks for inference, collaborative filtering and data visualization. J. Machine Learning Research 1 49-75.
[10] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist. 28 681-712. MR1792783
[11] Knight, K. and Fu, W. (2000). Asymptotics for lasso-type estimators. Ann. Statist. 28 1356-1378. MR1805787


[12] Lauritzen, S. (1996). Graphical Models. Clarendon Press, Oxford. MR1419991
[13] Osborne, M., Presnell, B. and Turlach, B. (2000). On the lasso and its dual. J. Comput. Graph. Statist. 9 319-337. MR1822089
[14] Shao, J. (1993). Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 486-494. MR1224373
[15] Speed, T. and Kiiveri, H. (1986). Gaussian Markov distributions over finite graphs. Ann. Statist. 14 138-150. MR0829559
[16] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. MR1379242
[17] van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York. MR1385671

Seminar für Statistik
ETH Zürich
CH-8092 Zürich
Switzerland
E-mail: nicolai@stat.math.ethz.ch
buhlmann@stat.math.ethz.ch
