Liu et al. (2020), High Dimensional Robust Sparse Regression
…assumptions on the probabilistic distribution are necessary, since in the worst case, sparse recovery is known to be NP-hard Bandeira et al. (2013); Zhang et al. (2014).

Sparse recovery with a constant fraction of arbitrary corruption is fundamentally hard. For instance, to the best of our knowledge, no previous work can provide exact recovery for sparse linear equations with arbitrary corruption in polynomial time. In contrast, a simple exhaustive search can easily enumerate the samples and recover the sparse parameter in exponential time.

In this work, we seek to give an efficient, sample-complexity optimal algorithm that recovers β∗ to within accuracy depending on ε (the fraction of outliers). In the case of no additive noise, we are interested in algorithms that can guarantee exact recovery, independent of ε.

1.1 Related work

The last 10 years have seen a resurgence of interest in robust statistics, including the problem of resilience to outliers in the data. Important problems attacked have included PCA (Klivans et al., 2009; Xu et al., 2012, 2013; Lai et al., 2016; Diakonikolas et al., 2016), and more recently robust regression (as in this paper) (Prasad et al., 2018; Diakonikolas et al., 2019a,b; Klivans et al., 2018) and robust mean estimation (Diakonikolas et al., 2016; Lai et al., 2016; Balakrishnan et al., 2017a), among others. We focus now on the recent work most related to the present paper.

Robust regression. Earlier work in robust regression considers corruption only in the output, and shows that algorithms nearly as efficient as for regression without outliers succeed in parameter recovery, even with a constant fraction of outliers (Li, 2013; Nguyen and Tran, 2013; Bhatia et al., 2015, 2017; Karmalkar and Price, 2018). Yet these algorithms (and their analysis) focus on corruption in y, and do not seem to extend to the setting of corrupted covariates – the setting of this work. In the low dimensional setting, there has been remarkable recent progress. The work in Klivans et al. (2018) shows that the Sum of Squares (SOS) based semidefinite hierarchy can be used for solving robust regression. Essentially concurrent to the SOS work, (Chen et al., 2017; Holland and Ikeda, 2017; Prasad et al., 2018; Diakonikolas et al., 2019a) use robust gradient descent for empirical risk minimization, using robust mean estimation as a subroutine to compute robust gradients at each iteration. Diakonikolas et al. (2019b) use the filtering algorithm of Diakonikolas et al. (2016) for robust regression. Computationally, these latter works scale better than the algorithms in Klivans et al. (2018): although the Sum of Squares SDP framework gives polynomial time algorithms, they are often not practical (Hopkins et al., 2016).

Much less appears to be known in the high-dimensional regime. Exponential time algorithms, such as Gao (2017); Johnson and Preparata (1978), optimize Tukey depth Tukey (1975); Chen et al. (2018). Their results reveal that handling a constant fraction of outliers (ε = const.) is actually minimax-optimal. Work in Chen et al. (2013) first provided a polynomial time algorithm for this problem. They show that by replacing the standard inner product in Matching Pursuit with a trimmed version, one can recover from an ε-fraction of outliers, with ε = O(1/√k). Very recently, Liu et al. (2019) considered more general sparsity-constrained M-estimation by using a trimmed estimator in each step of gradient descent, yet the robustness guarantee ε = O(1/√k) is still sub-optimal. Another approach follows as a byproduct of a recent algorithm for robust sparse mean estimation, in Balakrishnan et al. (2017a). However, their error guarantee scales with ‖β∗‖₂, and moreover, does not provide exact recovery in the adversarial corruption case without stochastic noise (i.e., noise variance σ² = 0). We note that this is an inevitable consequence of their approach, as they directly use sparse mean estimation on {yᵢxᵢ}, rather than considering Maximum Likelihood Estimation.

Robust mean estimation. The idea in (Prasad et al., 2018; Diakonikolas et al., 2019a,b) is to leverage recent breakthroughs in robust mean estimation. Very recently, (Diakonikolas et al., 2016; Lai et al., 2016) provided the first robust mean estimation algorithms that can handle a constant fraction of outliers (though Lai et al. (2016) incurs a small (logarithmic) dependence on the dimension). Following their work, Balakrishnan et al. (2017a) extended the ellipsoid algorithm from Diakonikolas et al. (2016) to robust sparse mean estimation in high dimensions. They show that k-sparse mean estimation in R^d with a constant fraction of outliers can be done with n = Ω(k² log(d)) samples. The k² term appears to be necessary, as n = Ω(k²) follows from an oracle-based lower bound Diakonikolas et al. (2017).
1.2 Main contributions

• Our result is a robust variant of Iterative Hard Thresholding (IHT) Blumensath and Davies (2009). We provide a deterministic stability result showing that IHT works with any robust sparse mean estimation algorithm. We show that our robust IHT does not accumulate the errors of a (any) robust sparse mean estimation subroutine used for computing the gradient. Specifically, robust IHT produces a final solution whose error is orderwise the same as the error guaranteed by a single use of the robust sparse mean estimation subroutine. We refer to Definition 2.1 and Theorem 2.1 for the precise statement. Thus our result can be viewed as a meta-theorem that can be coupled with any robust sparse mean estimator.

• Coupling robust IHT with a robust sparse mean estimation subroutine based on a version of the ellipsoid algorithm given and analyzed in Balakrishnan et al. (2017a), our results (Corollary 3.1) show that given ε-corrupted sparse regression samples with identity covariance, we recover β∗ within additive error O(σε) (which is minimax optimal Gao (2017)). The proof of the ellipsoid algorithm's performance in Balakrishnan et al. (2017a) hinges on obtaining an upper bound on the sparse operator norm (their Lemmas A.2 and A.3). As we show (see Appendix B), the statement of Lemma A.3 seems to be incorrect, and the general approach of upper bounding the sparse operator norm may not work. Nevertheless, the algorithm performance they claim is correct, as we show through a different avenue (see Lemma D.3 in Appendix D.3).

Using this ellipsoid algorithm, in particular, we obtain exact recovery if either the fraction of outliers goes to zero (this is just ordinary sparse regression), or in the presence of a constant fraction of outliers but with the additive noise term going to zero (this is the case of robust sparse linear equations). To the best of our knowledge, this is the first result that shows exact recovery for robust sparse linear equations with a constant fraction of outliers. This is the content of Section 3.

• For robust sparse regression with unknown covariance matrix, we consider the wide class of sparse covariance matrices Bickel and Levina (2008). We then prove a result that may be of interest in its own right: we provide a novel robust sparse mean estimation algorithm that is based on a filtering algorithm for sequentially screening and removing potential outliers. We show that the filtering algorithm is flexible enough to deal with unknown covariance, whereas the ellipsoid algorithm cannot. It also runs a factor of O(d²) faster than the ellipsoid algorithm. If the covariance matrix is sufficiently sparse, our filtering algorithm gives a robust sparse mean estimation algorithm that can then be coupled with our meta-theorem. Together, these two guarantee recovery of β∗ within an additive error of Õ(σ√ε). In the case of unknown covariance, this is the best (and in fact, only) result we are aware of for robust sparse regression. We note that it can be applied to the case of known and identity covariance, though it is weaker than the optimal results we obtain using the computationally more expensive ellipsoid algorithm. Nevertheless, in both cases (unknown sparse, or known identity) the result is strong enough to guarantee exact recovery when either σ or ε goes to zero. We demonstrate the practical effectiveness of our filtering algorithm in Section 6. This is the content of Section 4 and Section 5.

1.3 Setup, Notation and Outline

In this subsection, we formally define the corruption model and the sparse regression model. We first introduce the ε-corrupted samples described above:

Definition 1.1 (ε-corrupted samples). Let {zᵢ, i ∈ G} be i.i.d. observations following a distribution P. The ε-corrupted samples {zᵢ, i ∈ S} are generated by the following process: an adversary chooses an arbitrary ε-fraction of the samples in G and modifies them with arbitrary values. After the corruption, we use S to denote the observations, and use B = S \ G to denote the corruptions.

The parameter ε represents the fraction of outliers. Throughout, we assume that it is a (small) constant, independent of dimension or other problem parameters. Furthermore, we assume that the distribution P is the standard Gaussian-design AWGN linear model.

Model 1.1. The observations {zᵢ = (yᵢ, xᵢ), i ∈ G} follow the linear model yᵢ = xᵢ^⊤β∗ + ξᵢ, where β∗ ∈ R^d is the model parameter and is assumed to be k-sparse. We assume that xᵢ ∼ N(0, Σ) and ξᵢ ∼ N(0, σ²), where Σ is the normalized covariance matrix with Σⱼⱼ ≤ 1 for all j ∈ [d]. We denote by μα the smallest eigenvalue of Σ, and by μβ its largest eigenvalue. They are assumed to be universal constants in this paper, and we denote the constant cκ = μβ/μα.

As in Balakrishnan et al. (2017a), we pre-process by removing "obvious" outliers; we henceforth assume that all authentic and corrupted points are within a radius bounded by a polynomial in n, d and 1/ε.

Notation. We denote the hard thresholding operator of sparsity k′ by P_{k′}. We define the k̃-sparse operator norm as ‖M‖_{k̃,op} = max_{‖v‖₂=1, ‖v‖₀≤k̃} |v^⊤Mv|, where M is not required to be positive-semidefinite (p.s.d.). We use the trace inner product ⟨A, B⟩ to denote Tr(A^⊤B). We use E_{i∈u S} to denote the expectation operator obtained from the uniform distribution over all samples i in a set S. Finally, we use the notation Õ(·) to hide the dependency on poly log(1/ε), and Ω̃(·) to hide the dependency on poly log(k) in our bounds.
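To make the setup concrete, here is a minimal NumPy sketch of drawing ε-corrupted samples in the sense of Definition 1.1 from Model 1.1 (with Σ = I_d), together with the hard thresholding operator P_k from the Notation paragraph. The function names, the parameter values, and the particular corruption used (overwriting an ε-fraction with large arbitrary values) are illustrative assumptions, not the paper's adversary.

```python
import numpy as np

def hard_threshold(v, k):
    """P_k: keep the k largest-magnitude coordinates of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def generate_corrupted_samples(n, d, k, eps, sigma, rng):
    """Draw n samples from Model 1.1 with Sigma = I_d, then let a toy adversary
    overwrite an eps-fraction of them with arbitrary values (Definition 1.1)."""
    beta_star = np.zeros(d)
    support = rng.choice(d, size=k, replace=False)
    beta_star[support] = rng.standard_normal(k)            # k-sparse parameter
    X = rng.standard_normal((n, d))                         # x_i ~ N(0, I_d)
    y = X @ beta_star + sigma * rng.standard_normal(n)      # AWGN responses
    bad = rng.choice(n, size=int(eps * n), replace=False)   # adversary's picks
    X[bad] = 10.0 * rng.standard_normal((len(bad), d))      # arbitrary covariates
    y[bad] = 100.0 * rng.standard_normal(len(bad))          # arbitrary responses
    return X, y, beta_star, bad

rng = np.random.default_rng(0)
X, y, beta_star, bad = generate_corrupted_samples(2000, 100, 5, 0.1, 0.1, rng)
print(hard_threshold(np.array([0.1, -2.0, 0.3, 1.5]), 2))   # keeps -2.0 and 1.5
```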
2 Hard thresholding with robust gradient estimation

In this section, we present our method of using robust sparse gradient updates in IHT. We then show stability …

Algorithm 2 Separation oracle for robust sparse estimation Balakrishnan et al. (2017a)
1: Input: Weights from the previous iteration {wᵢ, i ∈ S}, gradient samples {gᵢ, i ∈ S}.
2: Output: Weights {wᵢ′, i ∈ S}.
3: Parameters: Hard thresholding parameter k̃, parameter ρ_sep.
4: Compute the weighted sample mean G̃ = Σ_{i∈S} wᵢgᵢ, and Ĝ = P_{2k̃}(G̃).
5: Compute the weighted sample covariance matrix Σ̂ = Σ_{i∈S} wᵢ(gᵢ − Ĝ)(gᵢ − Ĝ)^⊤.
6: Let λ∗ be the optimal value, and H∗ be the corresponding solution, of the following program:
   max_H Tr[(Σ̂ − F(Ĝ)) · H], subject to H ⪰ 0, ‖H‖₁,₁ ≤ k̃, Tr(H) = 1.
7: If λ∗ ≤ ρ_sep, then return "Yes".
8: Return the hyperplane ℓ(w′) = Σ_{i∈S} wᵢ′ ⟨(gᵢ − Ĝ)(gᵢ − Ĝ)^⊤ − F(Ĝ), H∗⟩ − λ∗.

The program in line 6 of Algorithm 2 is the following convex relaxation:

   max_H Tr[(Σ̂ − F(Ĝ)) · H], subject to H ⪰ 0, ‖H‖₁,₁ ≤ k̃, Tr(H) = 1.   (1)

Here, Ĝ, Σ̂ are weighted first and second order moment estimates from ε-corrupted samples, and F : R^d → R^{d×d} is a function with the closed form F(Ĝ) = ‖Ĝ‖₂² I_d + ĜĜ^⊤ + σ² I_d. Given the population mean G, we have F(G) = E_{zᵢ∼P}[(gᵢ − G)(gᵢ − G)^⊤], which is the underlying true covariance matrix. We provide more details about the calculation of F(·), as well as some smoothness properties, in Appendix C.

The key component in the separation oracle Algorithm 2 is to use a convex relaxation of Sparse PCA, eq. (1). This idea generalizes existing work on using PCA to detect outliers in low dimensional robust mean estimation Diakonikolas et al. (2016); Lai et al. (2016). To gain some intuition for eq. (1), if gᵢ is an outlier, then the optimal solution of eq. (1), H∗, may detect the direction of this outlier, and this outlier will be down-weighted in the output of Algorithm 2 by the separating hyperplane. Finally, Algorithm 2 will terminate with λ∗ ≤ ρ_sep (line 7) and output the robust sparse mean estimate of the gradients, Ĝ.
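For intuition, the following is a hedged cvxpy sketch of the convex program in eq. (1), with F(Ĝ) formed from the closed form stated above. The weighting scheme, the hyperplane of line 8, and the toy quantities (d, k̃, σ, Ĝ) are omitted or assumed; this is a sketch of the relaxation, not a faithful implementation of Algorithm 2.

```python
import cvxpy as cp
import numpy as np

def separation_program(Sigma_hat, F_hat, k_tilde):
    """Eq. (1): maximize <Sigma_hat - F(G_hat), H> over PSD H with
    ||H||_{1,1} <= k_tilde and Tr(H) = 1.  Returns (lambda_star, H_star)."""
    d = Sigma_hat.shape[0]
    H = cp.Variable((d, d), PSD=True)
    objective = cp.Maximize(cp.trace((Sigma_hat - F_hat) @ H))
    constraints = [cp.sum(cp.abs(H)) <= k_tilde,   # entrywise l1 norm ||H||_{1,1}
                   cp.trace(H) == 1]
    prob = cp.Problem(objective, constraints)
    lam_star = prob.solve()
    return lam_star, H.value

# Toy usage with illustrative quantities (d, k_tilde, sigma, G_hat are assumptions).
d, k_tilde, sigma = 20, 5, 0.1
G_hat = np.zeros(d); G_hat[:3] = 1.0                          # sparse mean estimate
F_hat = np.linalg.norm(G_hat) ** 2 * np.eye(d) \
        + np.outer(G_hat, G_hat) + sigma ** 2 * np.eye(d)     # closed form F(G_hat)
g = np.random.default_rng(1).standard_normal((200, d)) + G_hat
Sigma_hat = (g - G_hat).T @ (g - G_hat) / len(g)              # sample second moment
lam_star, H_star = separation_program(Sigma_hat, F_hat, k_tilde)
print(lam_star)
```

Every k̃-sparse unit direction is feasible here, since for ‖v‖₂ = 1 and ‖v‖₀ ≤ k̃ the rank-one matrix vv^⊤ satisfies ‖vv^⊤‖₁,₁ = ‖v‖₁² ≤ k̃ and Tr(vv^⊤) = 1; this is the sense in which the program relaxes Sparse PCA, so its optimal value dominates the k̃-sparse quadratic form.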
Proof sketch of Corollary 3.1. The key to the proof relies on showing that λ∗ controls the quality of the weights of the current iteration, i.e., small λ∗ means good weights and thus a good current solution. Showing this relies on using λ∗ to control Σ̂ − F(Ĝ). Lemma A.3 in Balakrishnan et al. (2017a) claims that λ∗ ≥ ‖Σ̂ − F(Ĝ)‖_{k̃,op}. As we show in Appendix B, however, this need not hold. This is because the trace norm maximization eq. (1) is not a valid convex relaxation for the k̃-sparse operator norm when the term Σ̂ − F(Ĝ) is not p.s.d. (which indeed it need not be). We provide a different …

Algorithm 3 RSGE via filtering
1: Input: A set S_in.
2: Output: A set S_out or a sparse mean vector Ĝ.
3: Parameters: Hard thresholding parameter k̃, parameter ρ_sep.
4: Compute the sample mean G̃ = E_{i∈u S_in} gᵢ, and Ĝ = P_{2k̃}(G̃).
5: Compute the sample covariance matrix Σ̂ = E_{i∈u S_in}[(gᵢ − Ĝ)(gᵢ − Ĝ)^⊤].
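Steps 6 onward of Algorithm 3 (the scores τᵢ of eq. (3) and the removal rule of eq. (4)) are not reproduced in this excerpt, so the following Python sketch is only a plausible reading of the filtering idea: it reuses hard_threshold and separation_program from the earlier snippets, scores each sample by its energy along H∗, and removes the highest-scoring sample until λ∗ ≤ ρ_sep. The score definition and the removal rule here are assumptions, not the paper's exact eq. (3) and eq. (4).

```python
import numpy as np

def filtering_rsge(g, k_tilde, rho_sep, F_of, solve_eq1, max_iter=100):
    """Hedged sketch of an RSGE via filtering in the spirit of Algorithm 3.
    g: (n, d) array of gradient samples; F_of(G_hat): model covariance at G_hat;
    solve_eq1: a solver for the eq. (1) relaxation, e.g. separation_program above."""
    keep = np.arange(len(g))
    for _ in range(max_iter):
        G_tilde = g[keep].mean(axis=0)                    # step 4: sample mean
        G_hat = hard_threshold(G_tilde, 2 * k_tilde)      # step 4: P_{2k~}
        centered = g[keep] - G_hat
        Sigma_hat = centered.T @ centered / len(keep)     # step 5: sample covariance
        lam, H_star = solve_eq1(Sigma_hat, F_of(G_hat), k_tilde)
        if lam <= rho_sep:                                # certified: return the mean
            return G_hat, keep
        # Assumed removal rule: score tau_i = (g_i - G_hat)^T H* (g_i - G_hat),
        # then drop the highest-scoring sample.  Removing several samples per
        # round, as the footnote notes, only speeds this up.
        tau = np.einsum('ij,jk,ik->i', centered, H_star, centered)
        keep = np.delete(keep, np.argmax(tau))
    return G_hat, keep
```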
…proportional to the score we have computed.²

…with probability at least 1 − ν − exp(−Θ(n)). Here, Cγ is a constant depending on γ, where γ ≥ 4 is a constant.

² Although we remove one sample in Algorithm 3, our theoretical analysis naturally extends to removing a constant number of outliers. This speeds up the algorithm in practice, yet shares the same computational complexity.
…a guaranteed gradient estimate, or the outlier removal step eq. (4) is likely to discard an outlier. The guarantee on the outlier removal step eq. (4) hinges on the fact that if Σ_{i∈S_good} τᵢ is less than Σ_{i∈S_bad} τᵢ, we can show eq. (4) is likely to remove an outlier.

Lemma 4.1. Suppose we observe n = Ω(k² log(d/ν)) ε-corrupted samples from Model 1.1 with Σ = I_d. Let S_in be an ε-corrupted set {gᵢᵗ}, i = 1, …, n. Algorithm 3 computes λ∗ that satisfies

   λ∗ ≥ max_{‖v‖₂=1, ‖v‖₀≤k̃} v^⊤ E_{i∈u S_in}[(gᵢ − Ĝ)(gᵢ − Ĝ)^⊤] v.   (5)

If λ∗ ≥ ρ_sep = Cγ(‖Gᵗ‖₂² + σ²), then with probability at least 1 − ν, we have Σ_{i∈S_good} τᵢ ≤ (1/γ) Σ_{i∈S_in} τᵢ, where τᵢ is defined in eq. (3), Cγ is a constant depending on γ, and γ ≥ 4 is a constant.

The proofs are collected in Appendix E. In a nutshell, eq. (5) is a natural convex relaxation for the sparsity constraint {v : ‖v‖₂ = 1, ‖v‖₀ ≤ k̃}. On the other …

…can guarantee that ‖Gᵗ‖₀ = ‖Σωᵗ‖₀ ≤ r(k′ + k). Hence, we can use the filtering algorithm (Algorithm 3) with k̃ = r(k′ + k) as a RSGE for robust sparse regression with unknown Σ. When the covariance is unknown, we cannot evaluate F(·) a priori, thus the ellipsoid algorithm is not applicable to this case, and we provide error guarantees as follows.

Theorem 5.1. Suppose we observe N(k, d, ε, ν) ε-corrupted samples from Model 1.1, where the covariates xᵢ follow Model 5.1. If we use Algorithm 3 for robust sparse gradient estimation, it requires Ω̃(r k² log(dT/ν)) · T samples, with T = Θ(log(‖β∗‖₂/(σ√ε))); then we have ‖β̂ − β∗‖₂ = Õ(σ√ε), with probability at least 1 − ν − T exp(−Θ(n)).

The proof of Theorem 5.1 is collected in Appendix G, and the main technique hinges on the previous analysis for the identity covariance case (Theorem 4.1 and Lemma 4.1). In the case of unknown covariance, this is the best (and in fact, only) recovery guarantee we are aware of for robust sparse regression. We show the performance of robust estimation using our filtering algorithm with unknown covariance in Section 6.
Figure 1: Simulations for Algorithm 3 showing the dependence of relative MSE on sparsity and dimension. For each parameter, we choose the corresponding sample complexity n ∝ k² log(d)/ε. Different curves for ε ∈ {0.1, 0.15, 0.2} are the average of 15 trials. Consistent with the theory, the rescaled relative MSEs are nearly independent of sparsity and dimension. Furthermore, by rescaling for different ε, the three curves have the same magnitude.

Figure 2: Empirical illustration of the linear convergence of log(‖βᵗ − β∗‖₂²) vs. iteration count in Algorithm 1. In all cases, we fix k = 5, d = 500, and choose the sample complexity n ∝ 1/ε. The left plot considers different ε with fixed σ² = 0.1. The right plot considers different σ² with fixed ε = 0.1. As expected, the convergence is linear and flattens out at the level of the final error.
6 Numerical results
We provide the complete details of our experimental setup and experiments on wall-clock time in the Appendix.

Robust sparse mean estimation. We first demonstrate the performance of Algorithm 3 for robust sparse mean estimation, and then move to Algorithm 1 for robust sparse regression. For robust sparse gradient estimation, we generate samples through gᵢ = xᵢxᵢ^⊤G − xᵢξᵢ, where the unknown true mean G is k-sparse. The authentic xᵢ's are generated from N(0, I_d). We set σ = 0, since the main part of the error in robust sparse mean estimation is G. We plot the relative MSE of parameter recovery, defined as ‖Ĝ − G‖₂²/‖G‖₂², with respect to different sparsities and dimensions.

Parameter error vs. sparsity k. We fix the dimension to be d = 50. We solve the trace norm maximization in Algorithm 3 using CVX Grant et al. (2008). We solve robust sparse gradient estimation under different levels of the outlier fraction ε and different sparsity values k.

Parameter error vs. dimension d. We fix k = 5. We use a Sparse PCA solver from d'Aspremont et al. (2008), which is much more efficient for higher dimensions. We run robust sparse gradient estimation (Algorithm 3) under different levels of the outlier fraction ε and different dimensions d.

For each parameter, the corresponding number of samples required for the authentic data is n ∝ k² log(d)/ε according to Theorem 4.1. Therefore, we add εn/(1 − ε) outliers (so that the outliers are an ε-fraction of the total samples), and then run Algorithm 3. According to Theorem 4.1, the rescaled relative MSE ‖Ĝ − G‖₂²/(ε‖G‖₂²) should be independent of the parameters {ε, k, d}. We plot this in Figure 1, and these plots validate our theorem on the sample complexity in robust sparse mean estimation problems.

Robust sparse regression. We use Algorithm 1 for robust sparse regression. As in robust sparse mean estimation, we use Algorithm 3 as our Robust Sparse Gradient Estimator, and leverage the Sparse PCA solver from d'Aspremont et al. (2008). In the simulation, we fix d = 500 and k = 5, hence the corresponding sample complexity is n ∝ 1/ε. We do not use the sample splitting technique in the simulations.

To show the performance of Algorithm 1 under different settings, we use different levels of ε and σ in Figure 2, and track the parameter error ‖βᵗ − β∗‖₂² of Algorithm 1 in each iteration. Consistent with the theory, the algorithm displays linear convergence, and the error curves flatten out at the level of the final error. Furthermore, Algorithm 1 achieves machine precision when σ² = 0 in the right plot of Figure 2.

We then study the empirical performance of robust sparse regression with an unknown covariance matrix Σ following Model 5.1. We use the same experimental setup as in the identity covariance case, but modify the covariance matrix to be a Toeplitz matrix with decay Σᵢⱼ = exp(−(i − j)²). Under this setting, the covariance matrix is sparse, and thus falls under Model 5.1. Figure 3 indicates that we have nearly the same performance as in the Σ = I_d case.

Figure 3: When the unknown covariance matrix is a Toeplitz matrix with decay Σᵢⱼ = exp(−(i − j)²) and the other settings are the same as in Figure 2, we observe similar linear convergence as in Figure 2.
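As a concrete illustration of the experimental setup just described, the following NumPy sketch builds n ∝ k² log(d)/ε authentic samples plus εn/(1 − ε) outliers for the robust sparse mean estimation experiment, the rescaled relative MSE reported in Figure 1, and the Toeplitz covariance Σᵢⱼ = exp(−(i − j)²) used in Figure 3. The proportionality constant C and the simple outlier model are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, eps, C = 500, 5, 0.1, 4             # C is an assumed proportionality constant

# n ∝ k^2 log(d)/eps authentic samples, plus eps*n/(1-eps) outliers, so that the
# outliers form an eps-fraction of the total sample.
n = int(C * k ** 2 * np.log(d) / eps)
n_out = int(eps * n / (1 - eps))

# Robust sparse mean estimation data (Figure 1): x_i ~ N(0, I_d), sigma = 0,
# g_i = x_i x_i^T G for a k-sparse true mean G, with arbitrary outliers appended.
G = np.zeros(d)
G[rng.choice(d, size=k, replace=False)] = 1.0
x = rng.standard_normal((n, d))
samples = np.vstack([x * (x @ G)[:, None],
                     50.0 * rng.standard_normal((n_out, d))])

def rescaled_relative_mse(G_hat, G, eps):
    """||G_hat - G||_2^2 / (eps * ||G||_2^2), the quantity plotted in Figure 1."""
    return np.linalg.norm(G_hat - G) ** 2 / (eps * np.linalg.norm(G) ** 2)

# Unknown-covariance experiments (Figure 3): sparse Toeplitz covariance with
# decay Sigma_ij = exp(-(i - j)^2), which falls under Model 5.1.
idx = np.arange(d)
Sigma = np.exp(-(idx[:, None] - idx[None, :]) ** 2)
```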
References

Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577–2604.

Blumensath, T. and Davies, M. E. (2009). Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274.

Bubeck, S. (2015). Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357.

Chen, M., Gao, C., and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics, 46(5):1932–1960.

Chen, Y., Caramanis, C., and Mannor, S. (2013). Robust sparse regression under adversarial corruption. In International Conference on Machine Learning, pages 774–782.

Chen, Y., Su, L., and Xu, J. (2017). Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):44.

Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.

El Karoui, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics, 36(6):2717–2756.

Gao, C. (2017). Robust regression via multivariate regression depth. arXiv preprint arXiv:1702.04656.

Grant, M., Boyd, S., and Ye, Y. (2008). CVX: Matlab software for disciplined convex programming.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (2011). Robust Statistics: The Approach Based on Influence Functions, volume 196. John Wiley & Sons.

Holland, M. J. and Ikeda, K. (2017). Efficient learning with robust gradient descent. arXiv preprint arXiv:1706.00182.

Hopkins, S. B., Schramm, T., Shi, J., and Steurer, D. (2016). Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 178–191. ACM.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, pages 73–101.

Huber, P. J. (2011). Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer.

Jain, P., Tewari, A., and Kar, P. (2014). On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693.

Johnson, D. S. and Preparata, F. P. (1978). The densest hemisphere problem. Theoretical Computer Science, 6(1):93–107.

Karmalkar, S. and Price, E. (2018). Compressed sensing with adversarial sparse noise via L1 regression. In 2nd Symposium on Simplicity in Algorithms (SOSA 2019). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Klivans, A., Kothari, P. K., and Meka, R. (2018). Efficient algorithms for outlier-robust regression. In Conference On Learning Theory, pages 1420–1430.

Klivans, A. R., Long, P. M., and Servedio, R. A. (2009). Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740.

Lai, K. A., Rao, A. B., and Vempala, S. (2016). Agnostic estimation of mean and covariance. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 665–674. IEEE.

Li, X. (2013). Compressed sensing and matrix completion with constant proportion of corruptions. Constructive Approximation, 37(1):73–99.

Liu, H. and Barber, R. F. (2018). Between hard and soft thresholding: optimal iterative thresholding algorithms. arXiv preprint arXiv:1804.08841.

Liu, L., Li, T., and Caramanis, C. (2019). High dimensional robust M-estimation: Arbitrary corruption and heavy tails. arXiv preprint arXiv:1901.08237.

Nguyen, N. H. and Tran, T. D. (2013). Exact recoverability from dense corrupted observations via L1-minimization. IEEE Transactions on Information Theory, 59(4):2017–2035.

Prasad, A., Suggala, A. S., Balakrishnan, S., and Ravikumar, P. (2018). Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485.

Rousseeuw, P. J. and Leroy, A. M. (2005). Robust Regression and Outlier Detection, volume 589. John Wiley & Sons.

Shen, J. and Li, P. (2017). A tight bound of hard thresholding. The Journal of Machine Learning Research, 18(1):7650–7691.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pages 1135–1151.

Tukey, J. W. (1975). Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523–531.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Vu, V. Q., Cho, J., Lei, J., and Rohe, K. (2013). Fantope projection and selection: A near-optimal convex relaxation of sparse PCA. In Advances in Neural Information Processing Systems, pages 2670–2678.

Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

Xu, H., Caramanis, C., and Mannor, S. (2013). Outlier-robust PCA: the high-dimensional case. IEEE Transactions on Information Theory, 59(1):546–572.

Xu, H., Caramanis, C., and Sanghavi, S. (2012). Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064.

Zhang, Y., Wainwright, M. J., and Jordan, M. I. (2014). Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. In Conference on Learning Theory, pages 921–948.