Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

High Dimensional Robust Sparse Regression

Liu Liu Yanyao Shen Tianyang Li Constantine Caramanis


The University of Texas at Austin

Abstract preprocessing is essentially impossible when the num-


ber of variables scales with the number of samples.
We provide a novel – and to the best of We propose a computationally efficient estimator for
our knowledge, the first – algorithm for high outlier-robust sparse regression that has near-optimal
dimensional sparse regression with constant sample complexity, and is the first algorithm resilient
fraction of corruptions in explanatory and/or to a constant fraction of arbitrary outliers with cor-
response variables. Our algorithm recovers rupted covariates and/or response variables. Unless we
the true sparse parameters with sub-linear specifically mention otherwise, all future mentions of
sample complexity, in the presence of a con- outliers mean corruptions in covariates and/or response
stant fraction of arbitrary corruptions. Our variables.
main contribution is a robust variant of Itera- We assume that the authentic samples are indepen-
tive Hard Thresholding. Using this, we pro- dent and identically distributed (i.i.d.) drawn from an
vide accurate estimators: when the covariance uncorrupted distribution P , where P represents the
matrix in sparse regression is identity, our er- linear model yi = x> ∗
i β + ξi , where xi ∼ N (0, Σ), and
ror guarantee is near information-theoretically ∗ d
β ∈ R is the true parameter (see Section 1.3 for com-
optimal. We then deal with robust sparse plete details and definitions). To model the corruptions,
regression with unknown structured covari- the adversary can choose an arbitrary -fraction of the
ance matrix. We propose a filtering algorithm authentic samples, and replace them with arbitrary
which consists of a novel randomized outlier values. We refer to the observations after corruption as
removal technique for robust sparse mean es- -corrupted samples (Definition 1.1). This corruption
timation that may be of interest in its own model allows the adversary to select an -fraction of
right: the filtering algorithm is flexible enough authentic samples to delete and corrupt, hence it is
to deal with unknown covariance. Also, it is stronger than Huber’s -contamination model Huber
orderwise more efficient computationally than (1964), where the adversary independently corrupts
the ellipsoid algorithm. Using sub-linear sam- each sample with probability .
ple complexity, our algorithm achieves the
best known (and first) error guarantee. We Outlier-robust regression is a classical problem within
demonstrate the effectiveness on large-scale robust statistics (e.g., Rousseeuw and Leroy (2005)),
sparse regression problems with arbitrary cor- yet even in the low-dimensional setting, efficient al-
ruptions. gorithms robust to corruption in the covariates have
proved elusive, until recent breakthroughs in Prasad
et al. (2018); Diakonikolas et al. (2019a) and Klivans
1 Introduction et al. (2018), which built on important results in Ro-
bust Mean Estimation Diakonikolas et al. (2016); Lai
Learning in the presence of arbitrarily (even adver- et al. (2016) and Sums of Squares Barak and Steurer
sarially) corrupted outliers in the training data has a (2016), respectively.
long history in Robust Statistics (Huber, 2011; Hampel In the sparse setting, the parameter β ∗ we seek to
et al., 2011; Tukey, 1975), and has recently received recover is also k-sparse, and a key goal is to provide
much renewed attention. The high dimensional set- recovery guarantees with sample complexity scaling
ting poses particular challenges as outlier removal via with k, and sublinearly with d. Without outliers, by
now classical results (e.g., Donoho (2006)) show that
Proceedings of the 23rd International Conference on Artificial n = Ω(k log d) samples from a i.i.d sub-Gaussian dis-
Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. tribution are enough to give recovery guarantees on β ∗
PMLR: Volume 108. Copyright 2020 by the author(s). with and without additive noise. These strong assump-
High Dimensional Robust Sparse Regression

tions on the probabilistic distribution are necessary, SDP framework gives polynomial time algorithms, they
since in the worst case, sparse recovery is known to be are often not practical (Hopkins et al., 2016).
NP-hard Bandeira et al. (2013); Zhang et al. (2014).
Much less appears to be known in the high-dimensional
Sparsity recovery with a constant fraction of arbitrary regime. Exponential time algorithm, such as Gao
corruption is fundamentally hard. For instance, to the (2017); Johnson and Preparata (1978), optimizes Tukey
best of our knowledge, there’s no previous work can depth Tukey (1975); Chen et al. (2018). Their results
provide exact recovery for sparse linear equations with reveal that handling a constant fraction of outliers
arbitrary corruption in polynomial time. In contrast, a ( = const.) is actually minimax-optimal. Work in
simple exhaustive search can easily enumerate the sam- Chen et al. (2013) first provided a polynomial time
ples and recover the sparse parameter in exponential algorithm for this problem. They show that replacing
time. the standard inner product in Matching Pursuit with
a trimmed version, one can √ recover from an -fraction
In this work, we seek to give an efficient, sample-
of outliers, with  = O(1/ k). Very recently, Liu et al.
complexity optimal algorithm that recovers β ∗ to
(2019) considered more general sparsity constrained
within accuracy depending on  (the fraction of out-
M -estimation by using a trimmed estimator in each
liers). In the case of no additive noise, we are interested
step of gradient
√ descent, yet the robustness guarantee
in algorithms that can guarantee exact recovery, inde-
 = O(1/ k) is still sub-optimal. Another approach
pendent of .
follows as a byproduct of a recent algorithm for ro-
bust sparse mean estimation, in Balakrishnan et al.
1.1 Related work (2017a). However, their error guarantee scales with
kβ ∗ k2 , and moreover, does not provide exact recovery
The last 10 years have seen a resurgence in interest in in the adversarial corruption case without stochastic
robust statistics, including the problem of resilience noise (i.e., noise variance σ 2 = 0). We note that this
to outliers in the data. Important problems attacked is an inevitable consequence of their approach, as they
have included PCA (Klivans et al., 2009; Xu et al., directly use sparse mean estimation on {yi xi }, rather
2012, 2013; Lai et al., 2016; Diakonikolas et al., 2016), than considering Maximum Likelihood Estimation.
and more recently robust regression (as in this pa-
per) (Prasad et al., 2018; Diakonikolas et al., 2019a,b; Robust mean estimation. The idea in (Prasad
Klivans et al., 2018) and robust mean estimation (Di- et al., 2018; Diakonikolas et al., 2019a,b) is to leverage
akonikolas et al., 2016; Lai et al., 2016; Balakrishnan recent breakthroughs in robust mean estimation. Very
et al., 2017a), among others. We focus now on the recently, (Diakonikolas et al., 2016; Lai et al., 2016) pro-
recent work most related to the present paper. vided the first robust mean estimation algorithms that
can handle a constant fraction of outliers (though Lai
Robust regression. Earlier work in robust regres- et al. (2016) incurs a small (logarithmic) dependence
sion considers corruption only in the output, and shows in the dimension). Following their work, Balakrishnan
that algorithms nearly as efficient as for regression with- et al. (2017a) extended the ellipsoid algorithm from
out outliers succeeds in parameter recovery, even with Diakonikolas et al. (2016) to robust sparse mean esti-
a constant fraction of outliers (Li, 2013; Nguyen and mation in high dimensions. They show that k-sparse
Tran, 2013; Bhatia et al., 2015, 2017; Karmalkar and mean estimation in Rd with a constant fraction of out-
Price, 2018). Yet these algorithms (and their analysis) liers can be done with n = Ω k 2 log (d) samples. The
focus on corruption in y, and do not seem to extend k 2 term appears to be necessary, as n = Ω(k 2 ) follows
to the setting of corrupted covariates – the setting of from an oracle-based lower bound Diakonikolas et al.
this work. In the low dimensional setting, there has (2017).
been remarkable recent progress. The work in Klivans
et al. (2018) shows that the Sum of Squares (SOS) 1.2 Main contributions
based semidefinite hierarchy can be used for solving
robust regression. Essentially concurrent to the SOS • Our result is a robust variant of Iterative Hard
work, (Chen et al., 2017; Holland and Ikeda, 2017; Thresholding (IHT) Blumensath and Davies (2009).
Prasad et al., 2018; Diakonikolas et al., 2019a) use We provide a deterministic stability result showing
robust gradient descent for empirical risk minimization, that IHT works with any robust sparse mean esti-
by using robust mean estimation as a subroutine to mation algorithm. We show our robust IHT does
compute robust gradients at each iteration. Diakoniko- not accumulate the errors of a (any) robust sparse
las et al. (2019b) uses filtering algorithm Diakonikolas mean estimation subroutine for computing the gra-
et al. (2016) for robust regression. Computationally, dient. Specifically, robust IHT produces a final solu-
these latter works scale better than the algorithms in tion whose error is orderwise the same as the error
Klivans et al. (2018), as although the Sum of Squares guaranteed by an single use of the robust mean esti-
Liu Liu, Yanyao Shen, Tianyang Li, Constantine Caramanis

mation subroutine. We refer to Definition 2.1 and or known identity) the result is strong enough to
Theorem 2.1 for the precise statement. Thus our guarantee exact recovery when either σ or  goes
result can be viewed as a meta-theorem that can be to zero. We demonstrate the practical effectiveness
coupled with any robust sparse mean estimator. of our filtering algorithm in Section 6. This is the
content of Section 4 and Section 5.
• Coupling robust IHT with a robust sparse mean esti-
mation subroutine based on a version of the ellipsoid 1.3 Setup, Notation and Outline
algorithm given and analyzed in Balakrishnan et al.
(2017a), our results Corollary 3.1 show that given In this subsection, we formally define the corruption
-corrupted sparse regression samples with identity model and the sparse regression model. We first intro-
covariance, we recover β ∗ within additive error O(σ) duce the -corrupted samples described above:
(which is minimax optimal Gao (2017)). The proof Definition 1.1 (-corrupted samples). Let {zi , i ∈ G}
of the ellipsoid algorithm’s performance in Balakr- be i.i.d. observations follow from a distribution P . The
ishnan et al. (2017a) hinges on obtaining an upper -corrupted samples {zi , i ∈ S} are generated by the
bound on the sparse operator norm (their Lemmas following process: an adversary chooses an arbitrary
A.2 and A.3). As we show (see Appendix B), the -fraction of the samples in G and modifies them with
statement of Lemma A.3 seems to be incorrect, and arbitrary values. After the corruption, we use S to
the general approach of upper bounding the sparse denote the observations, and use B = S \ G to denote
operator norm may not work. Nevertheless, the al- the corruptions.
gorithm performance they claim is correct, as we
show through a different avenue (see Lemma D.3 in The parameter  represents the fraction of outliers.
Appendix D.3). Throughout, we assume that it is a (small) constant,
independent of dimension or other problem parameters.
Using this ellipsoid algorithm, In particular, we ob-
Furthermore, we assume that the distribution P is the
tain exact recovery if either the fraction of outliers
standard Gaussian-design AWGN linear model.
goes to zero (this is just ordinary sparse regression),
or in the presence of a constant fraction of outliers Model 1.1. The observations {zi = (yi , xi ), i ∈ G}
but with the additive noise term going to zero (this follow from the linear model yi = x> ∗
i β + ξi , where
∗ d
is the case of robust sparse linear equations). To β ∈ R is the model parameter, and assumed to
the best of our knowledge, this is the first result be k-sparse.  We assume that xi ∼ N (0, Σ) and
that shows exact recovery for robust sparse linear ξi ∼ N 0, σ 2 , where Σ is the normalized covariance
equations with a constant fraction of outliers. This matrix with Σjj ≤ 1 for all j ∈ [d]. We denote µα
is the content of Section 3. as the smallest eigenvalue of Σ, and µβ as its largest
eigenvalue. They are assumed to be universal constants
• For robust sparse regression with unknown covari- in this paper, and we denote the constant cκ = µβ /µα .
ance matrix, we consider the wide class of sparse
covariance matrices Bickel and Levina (2008). We As in Balakrishnan et al. (2017a), we pre-process by
then prove a result that may be of interest in its own removing “obvious” outliers; we henceforth assume that
right: we provide a novel robust sparse mean estima- all authentic and corrupted points are within a radius
tion algorithm that is based on a filtering algorithm bounded by a polynomial in n, d and 1/.
for sequentially screening and removing potential Notation. We denote the hard thresholding operator
outliers. We show that the filtering algorithm is of sparsity k 0 by Pk0 . We define the k-sparse operator
flexible enough to deal with unknown covariance, norm as kM kek,op = maxkvk =1,kvk ≤ek |v > M v|, where
2 0
whereas the ellipsoid algorithm cannot. It also runs M is not required to be positive-semidefinite (p.s.d.).
a factor of O(d2 ) faster than the ellipsoid algorithm. We use trace inner produce hA, Bi to denote Tr A> B .

If the covariance matrix is sufficiently sparse, our We use Ei∈u S to denote the expectation operator ob-
filtering algorithm gives a robust sparse mean esti- tained by the uniform distribution over all samples i
mation algorithm, that can then be coupled with in a set S. Finally, we use the notation O(·)
e to hide
our meta-theorem. Together, these two guarantee √ the dependency on poly log(1/), and Ω(·) to hide the
recovery of β ∗ within an additive error of O(σ ).
e
dependency on poly log(k) in our bounds.
In the case of unknown covariance, this is the best
(and in fact, only) result we are aware of for robust
sparse regression. We note that it can be applied to 2 Hard thresholding with robust
the case of known and identity covariance, though it gradient estimation
is weaker than the optimal results we obtain using
the computationally more expensive ellipsoid algo- In this section, we present our method of using robust
rithm. Nevertheless, in both cases (unknown sparse, sparse gradient updates in IHT. We then show sta-
High Dimensional Robust Sparse Regression

Algorithm 1 Robust sparse regression with RSGE distribution P , at the point β.


1: Input: Samples {yi , xi }N i=1 , RSGE subroutine. Definition 2.1 (ψ ()-RSGE). Given n (k, d, , ν) -
2: Output: The estimation β. b corrupted samples {zi }ni=1 from Model 1.1, we call
3: Parameters: Hard thresholding parameter k 0 . Gb (β) a ψ ()-RSGE, if given {zi }n , G
i=1
b (β) guaran-
4: Split samples into T subsets of size n. β 0 = 0. 2 2
tees kG (β) − G (β)k2 ≤ α()kG (β)k2 + ψ (), with
b
5: for t = 0 to T − 1, do probability at least 1 − ν.
6: At current β t , calculate all gradients for current
n samples: git = xi x> Here, we use n (k, d, , ν) to denote the sample com-
t

i β − yi , i ∈ [n].
For {git }ni=1 , We use a RSGE to get bt plexity as a function of (k, d, , ν), and note that the
7: G .  definition of RSGE does not require Σ to be iden-
8: Update the parameter: β t+1 = Pk0 β t − η G bt .
tity matrix. The parameters α() and ψ () will be
9: end for specified by concrete robust sparse mean estimators
b = βT .
10: Output the estimation β in subsequent sections. Equipped with Definition 2.1,
we propose Algorithm 1, which takes any RSGE as a
subroutine in line 7, and runs a robust variant of IHT
with the estimated sparse gradient G b t at each iteration
tistical recovery guarantees given any accurate robust 1
sparse gradient estimation, which is formally defined in line 8.
in Definition 2.1. Global linear convergence and parameter recov-
We define the notation for the stochastic gradient ery guarantees. In each single IHT update step,
gi corresponding to the ith point zi , and the pop- RSGE introduces a controlled amount of error. Theo-
ulation gradient for zi ∼ P based on Model 1.1, rem 2.1 gives a global linear convergence guarantee for
git = xi x> t t t Algorithm 1 by showing that IHT does not accumulate
i β − yi , and G = Ezi ∼P (gi ) , where
P is the distribution of the authentic points. too much error. p In particular, we are able to recover β ∗
Since Ezi ∼P xi x>

= Σ, the population mean within error O( ψ ()) given any ψ ()-RSGE subrou-
i
of all authentic gradients is given by Gt = tine. We give the proof of Theorem 2.1 in Appendix A.
Ezi ∼P xi x> (β t
− β ∗
)

= Σ(β t
− β ∗ ). The hyper-parameter k 0 = c2κ k guarantees global linear
i
convergence of IHT when cκ > 1 (when Σ 6= Id ). This
In the uncorrupted case where all samples {zi , i ∈ G} setup has been used in Jain et al. (2014); Shen and Li
follow from Model 1.1, a single iteration of IHT updates (2017), and is proved to be necessary in Liu and Bar-
β t via β t+1 = Pk0 (β t − Ei∈u G git ). Here, the hard ber (2018). Note that Theorem 2.1 is a deterministic
thresholding operator Pk0 selects the k 0 largest elements stability result in nature, and we obtain probabilistic
in magnitude, and the parameter k 0 is proportional to k results by certifying the RSGE condition.
(specified in Theorem 2.1). However, given -corrupted
Theorem 2.1 (Meta-theorem). Suppose we observe
samples {zi , i ∈ S} according to Definition 1.1, the
N (k, d, , ν) -corrupted samples from Model 1.1. Al-
IHT update based on empirical average of all gradient
gorithm 1, with ψ ()-RSGE defined in Definition 2.1,
samples {gi , i ∈ S} can be arbitrarily bad.
p η = 1/µβ outputs β, such that kβ −
with step size b b
The key goal in this paper is to find a robust estimate ∗
β k2 = O( ψ ()), with probability at least
Gb t to replace Gt in each step of IHT, with sample p 1 − ν, by
setting k 0 = c2κ k and T = Θ(log(kβ ∗ k2 / ψ())). The
complexity sub-linear in the dimension d. For instance, sample complexity is N (k, d, , ν) = n (k, d, , ν/T ) T .
we consider robust sparse regression with Σ = Id .
Then, Gt = β t − β ∗ is guaranteed to be (k 0 + k)-sparse
in each iteration of IHT. In this case, given -corrupted 3 Robust sparse regression with
samples, we can use a robust sparse mean estimator near-optimal guarantee
|S|
to recover the unknown true Gt from {git }i=1 , with
sub-linear sample complexity. In this section, we provide near optimal statistical
guarantee for robust sparse regression when the co-
More generally, we propose Robust Sparse Gradient variance matrix is identity. Under the assumption
Estimator (RSGE) for gradient estimation given - Σ = Id , Balakrishnan et al. (2017a) proposes a ro-
corrupted samples, as defined in Definition 2.1, which bust sparse regression estimator based on robust sparse
guarantees that the deviation between the robust es-
1
timate Gb (β) and true G (β), with sample complexity Our results require sample splitting to maintain inde-
n  d. For a fixed k-sparse parameter β, we drop pendence between subsequent iterations, though we believe
this is an artifact of our analysis. Similar approach has been
the superscript t without abuse of notation, and use used in Balakrishnan et al. (2017b); Prasad et al. (2018)
gi in place of git , and G in place of Gt ; G (β) denotes for theoretical analysis. We do not use sample splitting
the population gradient over the authentic samples’ technique in the experiments.
Liu Liu, Yanyao Shen, Tianyang Li, Constantine Caramanis

Algorithm 2 Separation oracle for robust sparse esti- subject to H < 0, kHk1,1 ≤ e
k, Tr (H) = 1. (1)
mation Balakrishnan et al. (2017a)
1: Input: Weights from the previous iteration Here, G,
b Σ
b are weighted first and second order moment
{wi , i ∈ S}, gradient samples {gi , i ∈ S}. estimates from -corrupted samples, and F : Rd →
2: Output: Weight {wi0 , i ∈ S} Rd×d is a function with closed-form F (G)
b = kGkb 2 Id +
2
3: Parameters: Hard thresholding parameter e k, pa- > 2
GG + σ Id . Given the population mean G, we have
b b
>
rameter ρsep . F (G) = Ezi ∼P ((gi − G) (gi − G) ), which calculates
4: Compute the weighted sample mean G
e = the underlying true covariance matrix. We provide
P  more details about the calculation of F (·), as well as
i∈S wi gi , and G = P2ek G .
b e
5: Compute the weighted sample covariance matrix
some smoothness properties, in Appendix C.
  >
b =P
Σ wi gi − Gb gi − G b . The key component in the separation oracle Algo-
i∈S
rithm 2 is to use convex relaxation of Sparse PCA
6: Let λ∗ be the optimal value, and H ∗ be the corre-
eq. (1). This idea generalizes existing work on using
sponding solution of the following program
PCA to detect outliers in low dimensional robust mean

b −F G
 
b ·H ,
 estimation Diakonikolas et al. (2016); Lai et al. (2016).
max Tr Σ
H To gain some intuition for eq. (1), if gi is an outlier,
subject to H < 0, kHk1,1 ≤ e
k, Tr (H) = 1. then the optimal solution of eq. (1), H ∗ , may detect
the direction of this outlier. And this outlier will be
7: if λ∗ ≤ ρsep , then return “Yes”. down-weighted in the output of Algorithm 2 by the
return
8:  The hyperplane: `(w0 )  = separating hyperplane. Finally, Algorithm 2 will ter-
> minate with λ∗ ≤ ρsep (line 7) and output the robust
wi0 gi − G gi − G − F G , H − λ∗ .

P  
b b b
i∈S sparse mean estimation of the gradients G. b

Indeed, the ellipsoid-algorithm-based robust sparse


mean estimator gives a RSGE, which we can combine
mean estimation on {yi xi , i ∈ S}, leveraging the fact with Theorem 2.1 to obtain stronger results. We state
that Ezi ∼P (yi xi ) = β ∗ . With sample complexity these as Corollary 3.1. We note again that the analysis
2
N = Ω k log(d/ν)

2 , this algorithm produces a βe such in Balakrishnan et al. (2017a) has a flaw. Their Lemma
that kβe − β ∗ k22 = O(e 2 (kβ ∗ k2 + σ 2 )), with probabil- A.3 is incorrect, as our counterexample in Appendix B
2
ity at least 1 − ν. Using Theorem 2.1, we show that demonstrates. We provide a correct route of analysis
we can obtain significantly stronger statistical guaran- in Lemma D.3 of Appendix D.
tees which are statistically optimal; in particular, our
guarantees are independent of kβ ∗ k2 and yield exact 3.2 Near-optimal statistical guarantees
recovery when σ = 0.
Corollary 3.1. Suppose we observe N (k, d, , ν) -
3.1 RSGE via the ellipsoid algorithm corrupted samples from Model 1.1 with Σ = Id .
By setting e k = k 0 + k, if we use the ellipsoid al-
More specifically, the ellipsoid-based robust sparse gorithm for robust sparse  gradient estimation with
mean estimation algorithm Balakrishnan et al. (2017a) ρsep = Θ  kGt k22 + σ 2 , it requires N (k, d, , ν) =
2
deals with outliers by trying to optimize the set of Ω k log(dT
2
/ν) 
T samples, and guarantees ψ () =
weights {wi , i ∈ S} on each of the samples in Rd – ide- Oe 2 σ 2 . Hence, Algorithm 1 outputs β, b such that
ally outliers would receive lower weight and hence their ∗
impact would be minimized. Since the set of weights kβ − β k2 = O (σ),
b e probability at least 1 − ν, by
 with 
kβ ∗ k2
is convex, this can be approached using a separation setting T = Θ log σ .
oracle Algorithm 2. The Algorithm 2 depends on a
convex relaxation of Sparse PCA, and the hard thresh- Any one-dimensional robust variance estimate of the
olding parameter is ek = k 0 + k, as the population mean regression residual {ri , i ∈ [n]} in each iteration is
of all authentic gradient samples Gt is guaranteed to sufficient for ρsep , where ri = yi − x> t
i β . For a desired
0
be (k 0 + k)-sparse. In line 4 and 5, we calculate the error level  ≥ , we only require sample complexity
2
weighted mean and covariance based on a hard thresh- N (k, d, , ν) = Ω k log(dT
 02
/ν) 
T . Hence, we can achieve
olding operator. In line 6 of Algorithm 2, with each call
p 
2
statistical error O σ k log (d)/N ∨  . Our error
e
to the relaxation of Sparse PCA, we obtain an optimal bound is nearly optimal compared to the information-
value, λ∗ , and optimal solution, H ∗ , to the problem:
p 
theoretically optimal O σ k log (d)/N ∨  in Gao
    (2017), as the k 2 term is necessary by an oracle-based
λ∗ = max Tr Σ b −F G b ·H ,
SQ lower bound Diakonikolas et al. (2017).
H
High Dimensional Robust Sparse Regression

Proof sketch of Corollary 3.1 The key to the proof relies Algorithm 3 RSGE via filtering
on showing that λ∗ controls the quality of the weights of 1: Input: A set Sin .
the current iteration, i.e., small λ∗ means good weights 2: Output: A set Sout or sparse mean vector G.
b
and thus a good current solution. Showing this relies 3: Parameters: Hard thresholding parameter e k, pa-
on using λ∗ to control Σ−F
b (G).
b Lemma A.3 in Balakr- rameter ρsep .
ishnan et al. (2017a) claims that λ∗ ≥ kΣ b − F (G)k
b e .
k,op 4: Compute the sample mean G

e = Ei∈ S gi , and
u in
As we show in Appendix B, however, this need not 
Gb=P e G e .
hold. This is because the trace norm maximization 2k
5: Compute the sample covariance matrix
eq. (1) is not a valid convex relaxation for the ek-sparse
operator norm when the term Σ − F (G) is not p.s.d.
b b   >
b = Ei∈ S gi − G
Σ b gi − G .
(which indeed it need not be). We provide a differ-
b
u in

ent line of analysis in Lemma D.3, essentially showing


that even without the claimed (incorrect) bound, λ∗ 6: Solve the following convex program:
can still provide the control we need. With the cor-  
max Tr Σb ·H ,
rected analysis for λ∗ , the ellipsoid algorithm guaran- H
tees kG b − Gk2 = O(e 2 (kβ − β ∗ k2 + σ 2 )) with proba-
2 2
subject to H < 0, kHk1,1 ≤ e
k, Tr (H) = 1. (2)
bility at least
 1 − ν. Therefore, the algorithm provides
an O e 2 σ 2 -RSGE.
Let λ∗ be the optimal value, and H ∗ be the corre-
sponding solution.
4 Robust sparse mean estimation via 7: if λ∗ ≤ ρsep , then return with G. b
filtering 8: Calculate projection score for each i ∈ Sin :
   > 
From a computational viewpoint, the time complexity τi = Tr H ∗ · gi − G
b gi − G
b . (3)
of Algorithm 1 depends on the RSGE in each iterate.
The time complexity of the ellipsoid algorithm is indeed 9: Randomly remove a sample r from Sin according
polynomial in the dimension, but it requires O d2 to
calls to a relaxation of Sparse PCA (Bubeck (2015)). τi
In this section, we introduce a faster algorithm as a Pr (gi is removed) = P . (4)
i∈Sin τi
RSGE, which only requires O (n) calls of Sparse PCA
(recall that n only scales with k 2 log d). Importantly,
10: return the set Sout = Sin \ {r}.
this RSGE is flexible enough to deal with unknown
covariance matrix, yet the ellipsoid algorithm cannot.
Before we move to the result for unknown covariance
matrix in Section 5, we first introduce Algorithm 3 Algorithm 3 can be used for other robust sparse func-
and analyze its performance when the covariance is tional estimation problems (e.g., robust sparse mean
identity. These supporting Lemmas will be later used estimation for N (µ, Id ), where µ ∈ Rd is k-sparse). To
in the unknown case. use Algorithm 3 as a RSGE given n gradient samples
(denoted as Sin ), we call Algorithm 3 repeatedly on Sin
Our proposed RSGE (Algorithm 3) attempts to remove and then on its output, Sout , until it returns a robust
one outlier at each iteration, as long as a good solution estimator G.b The next theorem provides guarantees
has not already been identified. It first estimates the on this iterative application of Algorithm 3.
gradient G b by hard thresholding (line 4) and then
2
Theorem 4.1. Suppose we observe n = Ω k log(d/ν)

estimates the corresponding sample covariance matrix 
Σ
b (line 5). By solving (a relaxation of) Sparse PCA, -corrupted samples from Model 1.1 with Σ = Id . Let
we obtain a scalar λ∗ as well as a matrix H ∗ . If λ∗ Sin be an -corrupted set of gradient samples {git }ni=1 .
is smaller than the predetermined threshold ρsep , we By setting ek = k 0 + k, if we run Algorithm 3 iteratively
have a certificate that the effect of the outliers is well- with initial set Sin , and
 subsequently on Sout , and use
controlled (specified in eq. (5)). Otherwise, we compute ρsep = Cγ kGt k22 + σ 2 , then this repeated use of Algo-
a score for each sample based on H ∗ , and discard one rithm 3 will stop after at most 1.1γγ−1 n iterations, and
of the samples according to a probability distribution t t t 2 e  kGt k2 + σ 2 ,

where each sample’s probability of being discarded is output G , such that kG −G k2 = O
b b
2

proportional to the score we have computed. 2 with probability at least 1 − ν − exp (−Θ (n)). Here,
Cγ is a constant depending on γ, where γ ≥ 4 is a
2
Although we remove one sample in Algorithm 3, our constant.
theoretical analysis naturally extend to removing constant
number of outliers. This speeds up the algorithm in practice, yet shares the same computational complexity
Liu Liu, Yanyao Shen, Tianyang Li, Constantine Caramanis

hand, when λ∗ ≥ ρsep , the contribution of i∈Sgood τi


P
Thus, Theorem 4.1 shows that with high probability,
Algorithm 3 provides a Robust  Sparse Gradient Es- is relatively small, which can be obtained through con-
timator where ψ () = Oe σ 2 . For example, we can centration inequalities for the samples in Sgood . Based
take ν = d−Θ(1) . Combining now with Theorem 2.1, we on Lemma 4.1, if λ∗ ≤ ρsep , then the RHS of eq. (5) is
obtain an error guarantee for robust sparse regression. bounded, leading to the error guarantee of kG b t − Gt k2 .
2

Corollary 4.1. Suppose we observe N (k, d, , ν) - On the other hand, if λ ≥ ρsep , we can show that
corrupted samples from Model 1.1 with Σ = Id . Un- eq. (4) is more likely to throw out samples of Sbad
der the same setting as Theorem 4.1, if we use Algo- rather than Sgood . Iteratively applying Algorithm 3 on
rithm 3 for robust sparse the remaining samples, we can remove those outliers
 2 gradient  estimation, it requires with large effect, and keep the remaining outliers’ ef-
k log(dT /ν)
N (k, d, , ν) = Ω  T samples, and T =
fect well-controlled. This leads to the final bounds in
e (σ √)
  ∗ 
kβ k2
Θ log σ√ , then we have kβb − β ∗ k2 = O Theorem 4.1.
with probability at least 1 − ν − T exp (−Θ (n)).
5 Robust sparse regression with
Similarp to Section 3, we can achieve statistical error
Oe σ k 2 log (d)/N ∨ √ . The scaling of  in Corol- unknown covariance
lary 4.1 is Oe (√). These guarantees are worse than
In this section, we consider robust sparse regression
Oe () achieved by ellipsoid methods. Nevertheless, this
with unknown covariance matrix Σ, which has addi-
result is strong enough to guarantee exact recovery
tional sparsity structure. Formally, we define the sparse
when either σ or  goes to zero. The simulation of
covariance matrices as follows:
robust estimation for the filtering algorithm is in Sec-
tion 6. Model 5.1 (Sparse covariance matrices). In Model 1.1,
the authentic covariates {xi , i ∈ G} are drawn from
The key step in Algorithm 3 is outlier removal eq. (4) N (0, Σ). We assume that each row and column of Σ
based on the solution of Sparse PCA’s convex relaxation is r-sparse, but the positions of the non-zero entries
eq. (2). We describe the outlier removal below, and are unknown.
then give the proofs in Appendix E and Appendix F.
Outlier removal guarantees in Algorithm 3. We Model 5.1 is widely studied in high dimensional statis-
denote samples in the input set Sin as gi . This input tics Bickel and Levina (2008); El Karoui (2008); Wain-
set Sin can be partitioned into two parts: Sgood = {i : wright (2019). Under Model 5.1, for the population
i ∈ G and i ∈ Sin }, and Sbad = {i : i ∈ B and i ∈ gradient Gt = EP xi x> t ∗ t
i (β − β ) = Σω , where we
Sin }. Lemma 4.1 shows that Algorithm 3 can return use ω to denote the (k + k)-sparse vector β t − β ∗ , we
t 0

a guaranteed gradient estimate, or the outlier removal can guarantee the kGt k0 = kΣω t k0 ≤ r(k 0 + k). Hence,
step eq. (4) is likely to discard an outlier. The guarantee we can use the filtering algorithm (Algorithm 3) with
on the Poutlier removal step eq.P(4) hinges on the fact k = r(k 0 + k) as a RSGE for robust sparse regression
e
that if i∈Sgood τi is less than i∈Sbad τi , we can show with unknown Σ. When the covariance is unknown,
eq. (4) is likely to remove an outlier. we cannot evaluate F (·) a priori, thus the ellipsoid al-
2 gorithm is not applicable to this case. And we provide
Lemma 4.1. Suppose we observe n = Ω k log(d/ν)

 - error guarantees as follows.
corrupted samples from Model 1.1 with Σ = Id . Let Sin
be an -corrupted set {git }ni=1 . Algorithm 3 computes Theorem 5.1. Suppose we observe N (k, d, , ν) -
λ∗ that satisfies corrupted samples from Model 1.1, where the co-
variates xi ’s follow from Model 5.1. If we use
Algorithm 3 for 2 robust sparse gradient estimation,
   > 
λ∗ ≥ max v > Ei∈ S gi − G
u in
b gi − G b v. r k2 log(dT /ν)

kvk2 =1,kvk0 ≤e
k it requires Ωe
 T samples, and T =
e (σ √),
  ∗ 
(5) Θ log σ√
kβ k2
, then, we havekβb − β ∗ k2 = O

2
 with probability at least 1 − ν − T exp (−Θ (n)).
If λ∗ ≥ ρsep = Cγ kGt k2 + σ 2 , then with probability
at least 1 − ν, we have i∈Sgood τi ≤ γ1 i∈Sin τi , where
P P
The proof of Theorem 5.1 is collected in Appendix G,
τi is defined in eq. (3), Cγ is a constant depending on and the main technique hinges on previous analysis
γ, and γ ≥ 4 is a constant. for the identity covariance case (Theorem 4.1 and
Lemma 4.1). In the case of unknown covariance, this
The proofs are collected in Appendix E. In a nutshell, is the best (and in fact, only) recovery guarantee we
eq. (5) is a natural convex relaxation for the sparsity are aware of for robust sparse regression. We show the
constraint {v : kvk2 = 1, kvk0 ≤ e k}. On the other performance of robust estimation using our filtering
High Dimensional Robust Sparse Regression

1.6
3
1.6 4
1.4
epsilon = 0.1 epsilon = 0.1
epsilon = 0.1 σ2 = 0.0
2
epsilon = 0.15 1.4 2
Rescaled relative MSE epsilon = 0.15
epsilon = 0.15
σ2 = 0.1

Rescaled relative MSE


1.2 epsilon = 0.2 1
epsilon = 0.2 0

log(parameter error)

log(parameter error)
1.2 epsilon = 0.2
1 0 -2 σ2 = 0.2
1
-1 -4
0.8
0.8 -6
-2
0.6
0.6 -8
-3
0.4 -10
0.4
-4
-12
0.2 0.2
-5 -14
0 0
4 5 6 7 8 500 1000 1500 2000 -6 -16
0 2 4 6 8 10 12 14 16 18 20 1 2 3 4 5 6 7 8 9 10
Sparsity Dimension Iterations Iterations

Figure 1: Simulations for Algorithm 3 showing the de- Figure 2: Empirical illustration of the linear convergence
pendence of relative MSE on sparsity and dimension. For of log(kβ t − β ∗ k22 ) vs. iteration counts in the Algorithm 1.
each parameter, we choose corresponding sample complex- In all cases, we fix k = 5, d = 500, and choose the sample
ity n ∝ k2 log(d)/. Different curves for  ∈ {0.1, 0.15, 0.2} complexity n ∝ 1/. The left plot considers different  with
are the average of 15 trials. Consistent with the theory, the fixed σ 2 = 0.1. The right plot considers different σ 2 with
rescaled relative MSE’s are nearly independent of sparsity fixed  = 0.1. As expected, the convergence is linear, and
and dimension. Furthermore, by rescaling for different , flatten out at the level of the final error.
three curves have the same magnitude.
3 5

epsilon = 0.1 σ2 = 0.0


2
epsilon = 0.15 0 σ2 = 0.1

log(parameter error)
epsilon = 0.2

log(parameter error)
1 σ2 = 0.2
algorithm with unknown covariance in Section 6, and 0
-5

we observe same linear convergence as Section 4. -1


-10

-2

-15
-3

6 Numerical results -4
0 5 10
Iterations
15 20
-20
1 2 3 4 5 6
Iterations
7 8 9 10

We provide the complete details for our experiment Figure 3: When the unknown covariance matrix is a
2
setup and experiments on wall-clock time in Appendix. Toeplitz matrix with a decay Σij = exp−(i−j) , and the
other settings are the same as Figure 2. We observe similar
Robust sparse mean estimation. We first demon- linear convergence as Figure 2.
strate the performance of Algorithm 3 for robust sparse
mean estimation, and then move to Algorithm 1 for
robust sparse regression. For the robust sparse gra- Robust sparse regression. We use Algorithm 1 for
dient estimation, we generate samples through gi = robust sparse regression. Similarly as in robust sparse
xi x>
i G − xi ξi , where the unknown true mean G is k-
mean estimation, we use Algorithm 3 as our Robust
sparse. The authentic xi ’s are generated from N (0, Id ). Sparse Gradient Estimator, and leverage the Sparse
We set σ = 0, since the main part of the error in robust PCA solver from d’Aspremont et al. (2008). In the
sparse mean estimation is G. We plot the relative MSE simulation, we fix d = 500, and k = 5, hence the
2
of parameter recovery, defined as kG−Gk
b 2
2 /kGk2 , with
corresponding sample complexity is n ∝ 1/. We do not
respect to different sparsities and dimensions. use the sample splitting technique in the simulations.
Parameter error vs. sparsity k. We fix the dimension To show the performance of Algorithm 1 under dif-
to be d = 50. We solve the trace norm maximization in ferent settings, we use different levels of  and σ in
2
Algorithm 3 using CVX Grant et al. (2008). We solve Figure 2, and track the parameter error kβ t − β ∗ k2
robust sparse gradient estimation under different levels of Algorithm 1 in each iteration. Consistent with the
of outlier fraction  and different sparsity values k. theory, the algorithm displays linear convergence, and
the error curves flatten out at the level of the final error.
Parameter error vs. dimension d. We fix k = 5. We use Furthermore, Algorithm 1 achieves machine precision
a Sparse PCA solver from d’Aspremont et al. (2008) when σ 2 = 0 in the right plot of Figure 2.
which is much more efficient for higher dimensions.
We run robust sparse gradient estimation Algorithm 3 We then study the empirical performance of robust
under different levels of outlier fraction  and different sparse regression with unknown covariance matrix Σ
dimensions d. following from Model 5.1. We use the same experimen-
tal setup as in identity covariance case, but modify the
For each parameter, the corresponding number of sam- covariance matrix to be a Toeplitz matrix with a decay
ples required for the authentic data is n ∝ k 2 log(d)/ 2
Σij = exp−(i−j) . Under this setting, the covariance
according to Theorem 4.1. Therefore, we add n/(1 − ) matrix is sparse, thus follows from Model 5.1. Figure 3
outliers (so that the outliers are an -fraction of indicates that we have nearly the same performance as
the total samples), and then run Algorithm 3. Ac- the Σ = Id case.
cording to Theorem 4.1, the rescaled relative MSE:
kGb − Gk2 /(kGk2 ) should be independent of the pa-
2 2
rameters {, k, d}. We plot this in Figure 1, and these
plots validate our theorem on the sample complexity
in robust sparse mean estimation problems.
Liu Liu, Yanyao Shen, Tianyang Li, Constantine Caramanis

7 Acknowledgments d’Aspremont, A., Bach, F., and Ghaoui, L. E. (2008).


Optimal solutions for sparse principal component
This work was partially funded by NSF grants 1609279, analysis. Journal of Machine Learning Research,
1646522, and 1704778, and the authors would like to 9(Jul):1269–1294.
thank Simon S. Du for helpful discussions. d’Aspremont, A., Ghaoui, L. E., Jordan, M. I., and
Lanckriet, G. R. (2007). A direct formulation for
References sparse PCA using semidefinite programming. SIAM
Review, 49(3):434–448.
Balakrishnan, S., Du, S. S., Li, J., and Singh, A.
(2017a). Computationally efficient robust sparse es- Diakonikolas, I., Kamath, G., Kane, D., Li, J., Stein-
timation in high dimensions. In Proceedings of the hardt, J., and Stewart, A. (2019a). Sever: A robust
2017 Conference on Learning Theory. meta-algorithm for stochastic optimization. In In-
ternational Conference on Machine Learning, pages
Balakrishnan, S., Wainwright, M. J., and Yu, B. 1596–1606.
(2017b). Statistical guarantees for the em algorithm:
From population to sample-based analysis. The An- Diakonikolas, I., Kamath, G., Kane, D. M., Li, J.,
nals of Statistics, 45(1):77–120. Moitra, A., and Stewart, A. (2016). Robust estima-
tors in high dimensions without the computational
Bandeira, A. S., Dobriban, E., Mixon, D. G., and Sawin, intractability. In Foundations of Computer Science
W. F. (2013). Certifying the restricted isometry (FOCS), 2016 IEEE 57th Annual Symposium on,
property is hard. IEEE Transactions on Information pages 655–664. IEEE.
Theory, 59(6):3448–3450.
Diakonikolas, I., Kane, D. M., and Stewart, A. (2017).
Barak, B. and Steurer, D. (2016). Proofs, beliefs, and al- Statistical query lower bounds for robust estima-
gorithms through the lens of sum-of-squares. Course tion of high-dimensional gaussians and gaussian mix-
notes: http://www. sumofsquares. org/public/index. tures. In Foundations of Computer Science (FOCS),
html. 2017 IEEE 58th Annual Symposium on, pages 73–84.
Bhatia, K., Jain, P., and Kar, P. (2015). Robust regres- IEEE.
sion via hard thresholding. In Advances in Neural Diakonikolas, I., Kong, W., and Stewart, A. (2019b).
Information Processing Systems, pages 721–729. Efficient algorithms and lower bounds for robust
Bhatia, K., Jain, P., and Kar, P. (2017). Consistent linear regression. In Proceedings of the Thirtieth An-
robust regression. In Advances in Neural Information nual ACM-SIAM Symposium on Discrete Algorithms,
Processing Systems, pages 2107–2116. pages 2745–2754. SIAM.

Bickel, P. J. and Levina, E. (2008). Covariance regu- Donoho, D. L. (2006). Compressed sensing. IEEE
larization by thresholding. The Annals of Statistics, Transactions on information theory, 52(4):1289–
36(6):2577–2604. 1306.

Blumensath, T. and Davies, M. E. (2009). Iterative El Karoui, N. (2008). Operator norm consistent estima-
hard thresholding for compressed sensing. Applied tion of large-dimensional sparse covariance matrices.
and computational harmonic analysis, 27(3):265–274. The Annals of Statistics, 36(6):2717–2756.

Bubeck, S. (2015). Convex optimization: Algorithms Gao, C. (2017). Robust regression via mutivariate
and complexity. Foundations and Trends® in Ma- regression depth. arXiv preprint arXiv:1702.04656.
chine Learning, 8(3-4):231–357. Grant, M., Boyd, S., and Ye, Y. (2008). Cvx: Matlab
Chen, M., Gao, C., and Ren, Z. (2018). Robust covari- software for disciplined convex programming.
ance and scatter matrix estimation under hubers con- Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J.,
tamination model. Ann. Statist., 46(5):1932–1960. and Stahel, W. A. (2011). Robust statistics: the
Chen, Y., Caramanis, C., and Mannor, S. (2013). Ro- approach based on influence functions, volume 196.
bust sparse regression under adversarial corruption. John Wiley & Sons.
In International Conference on Machine Learning, Holland, M. J. and Ikeda, K. (2017). Efficient learn-
pages 774–782. ing with robust gradient descent. arXiv preprint
Chen, Y., Su, L., and Xu, J. (2017). Distributed arXiv:1706.00182.
statistical machine learning in adversarial settings: Hopkins, S. B., Schramm, T., Shi, J., and Steurer, D.
Byzantine gradient descent. Proceedings of the ACM (2016). Fast spectral algorithms from sum-of-squares
on Measurement and Analysis of Computing Systems, proofs: tensor decomposition and planted sparse
1(2):44. vectors. In Proceedings of the forty-eighth annual
High Dimensional Robust Sparse Regression

ACM symposium on Theory of Computing, pages Shen, J. and Li, P. (2017). A tight bound of hard thresh-
178–191. ACM. olding. The Journal of Machine Learning Research,
18(1):7650–7691.
Huber, P. J. (1964). Robust estimation of a location
parameter. The annals of mathematical statistics, Stein, C. M. (1981). Estimation of the mean of a multi-
pages 73–101. variate normal distribution. The annals of Statistics,
pages 1135–1151.
Huber, P. J. (2011). Robust statistics. In International
Encyclopedia of Statistical Science, pages 1248–1251. Tukey, J. W. (1975). Mathematics and the picturing of
Springer. data. In Proceedings of the International Congress of
Mathematicians, Vancouver, 1975, volume 2, pages
Jain, P., Tewari, A., and Kar, P. (2014). On itera- 523–531.
tive hard thresholding methods for high-dimensional
m-estimation. In Advances in Neural Information Vershynin, R. (2010). Introduction to the non-
Processing Systems, pages 685–693. asymptotic analysis of random matrices. arXiv
preprint arXiv:1011.3027.
Johnson, D. S. and Preparata, F. P. (1978). The densest
hemisphere problem. Theoretical Computer Science, Vu, V. Q., Cho, J., Lei, J., and Rohe, K. (2013). Fan-
6(1):93–107. tope projection and selection: A near-optimal convex
relaxation of sparse PCA. In Advances in neural in-
Karmalkar, S. and Price, E. (2018). Compressed sens- formation processing systems, pages 2670–2678.
ing with adversarial sparse noise via l1 regression. In
Wainwright, M. (2019). High-dimensional statistics:
2nd Symposium on Simplicity in Algorithms (SOSA
A non-asymptotic viewpoint. Cambridge University
2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Infor-
Press.
matik.
Xu, H., Caramanis, C., and Mannor, S. (2013). Outlier-
Klivans, A., Kothari, P. K., and Meka, R. (2018). Ef-
robust PCA: the high-dimensional case. IEEE trans-
ficient algorithms for outlier-robust regression. In
actions on information theory, 59(1):546–572.
Conference On Learning Theory, pages 1420–1430.
Xu, H., Caramanis, C., and Sanghavi, S. (2012). Robust
Klivans, A. R., Long, P. M., and Servedio, R. A. (2009). PCA via outlier pursuit. IEEE Transactions on
Learning halfspaces with malicious noise. Journal of Information Theory, 58(5):3047–3064.
Machine Learning Research, 10(Dec):2715–2740.
Zhang, Y., Wainwright, M. J., and Jordan, M. I. (2014).
Lai, K. A., Rao, A. B., and Vempala, S. (2016). Agnos- Lower bounds on the performance of polynomial-time
tic estimation of mean and covariance. In Founda- algorithms for sparse linear regression. In Conference
tions of Computer Science (FOCS), 2016 IEEE 57th on Learning Theory, pages 921–948.
Annual Symposium on, pages 665–674. IEEE.
Li, X. (2013). Compressed sensing and matrix comple-
tion with constant proportion of corruptions. Con-
structive Approximation, 37(1):73–99.
Liu, H. and Barber, R. F. (2018). Between hard and
soft thresholding: optimal iterative thresholding al-
gorithms. arXiv preprint arXiv:1804.08841.
Liu, L., Li, T., and Caramanis, C. (2019). High dimen-
sional robust m-estimation: Arbitrary corruption
and heavy tails. arXiv preprint arXiv:1901.08237.
Nguyen, N. H. and Tran, T. D. (2013). Exact re-
coverability from dense corrupted observations via
l1-minimization. IEEE transactions on information
theory, 59(4):2017–2035.
Prasad, A., Suggala, A. S., Balakrishnan, S.,
and Ravikumar, P. (2018). Robust estimation
via robust gradient estimation. arXiv preprint
arXiv:1802.06485.
Rousseeuw, P. J. and Leroy, A. M. (2005). Robust
regression and outlier detection, volume 589. John
wiley & sons.

You might also like