Received: 13 March 2020 Revised: 6 May 2020 Accepted: 2 June 2020

DOI: 10.1002/wics.1524

ADVANCED REVIEW

Robust linear regression for high-dimensional data: An overview

Peter Filzmoser | Klaus Nordhausen

Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria

Correspondence
Peter Filzmoser, Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Wiedner Hauptstraße 8-10, 1040 Vienna, Austria.
Email: p.filzmoser@tuwien.ac.at

Funding information
European Information and Technology Raw Materials, Grant/Award Number: 16329

Abstract
Digitization as the process of converting information into numbers leads to bigger and more complex data sets, bigger also with respect to the number of measured variables. This makes it harder or impossible for the practitioner to identify outliers or observations that are inconsistent with an underlying model. Classical least-squares based procedures can be affected by those outliers. In the regression context, this means that the parameter estimates are biased, with consequences on the validity of the statistical inference, on regression diagnostics, and on the prediction accuracy. Robust regression methods aim at assigning appropriate weights to observations that deviate from the model. While robust regression techniques are widely known in the low-dimensional case, researchers and practitioners might still not be very familiar with developments in this direction for high-dimensional data. Recently, different strategies have been proposed for robust regression in the high-dimensional case, typically based on dimension reduction, on shrinkage, including sparsity, and on combinations of such techniques. A very recent concept is downweighting single cells of the data matrix rather than complete observations, with the goal to make better use of the model-consistent information, and thus to achieve higher efficiency of the parameter estimates.

This article is categorized under:
Statistical and Graphical Methods of Data Analysis > Robust Methods
Statistical and Graphical Methods of Data Analysis > Analysis of High Dimensional Data
Statistical and Graphical Methods of Data Analysis > Dimension Reduction

KEYWORDS
dimension reduction, high-dimensional data, outlier, regression, sparsity

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2020 The Authors. WIREs Computational Statistics published by Wiley Periodicals LLC.

1 | INTRODUCTION

Linear regression is one of the most popular modeling approaches as it often gives useful and interpretable insight into the data. In recent years, however, the structure of data sets has changed. While earlier the number of observations, n,
usually clearly exceeded the number of explanatory variables, p, nowadays often n ≈ p or even n < p, which is known as high-dimensional data. These new data sets posed new challenges for regression models, and two strategies were followed to address this issue. The first strategy attempts to reduce the dimensionality of the explanatory data matrix prior to regression modeling. The second strategy is to regularize the estimating equations to push for sparse solutions, that is, many regression coefficients are estimated as exactly zero. For a general overview see, for example, Hastie, Tibshirani, and Wainwright (2015).
Aside from the dimensionality issue, it is well established that classical regression approaches are highly nonrobust, which means that already a single observation can have a huge impact on the whole estimator. For traditional n > p data sets, there exists a huge literature showing how to reduce the impact of atypical observations using robust regression models, summarized, for example, in the textbooks of Maronna, Martin, and Yohai (2006) and Maronna, Martin, Yohai, and Salibian-Barrera (2019). However, the two strategies mentioned above for dealing with high-dimensional data also suffer from poor robustness properties, and in recent years the robustness community has suggested corresponding robust strategies.
This overview presents recent results for both strategies which posed many new challenges for robust methods. The
paper starts with some general concepts of robustness, particularly devoted to the regression setting. This is followed by
introducing robust regression methods for the low-dimensional situation. Furthermore, some recently developed robust
regression strategies for the high-dimensional case are explained and illustrated in a practical example. The paper ends
with a short overview of models for cellwise contamination, alternative regression models, and with a conclusion.

1.1 | Concept of robustness

Most statistical approaches start with a parametric model and then develop optimal methods under the model, usually using maximum-likelihood theory. As all models are, however, just approximations, the question is how these methods perform when the data do not exactly follow the model. For example, one of the most common model assumptions is to assume Gaussian errors, which means that optimal methods are based on the mean and the variance. But how do mean and variance perform if the data do not follow a normal distribution? Two scenarios are now possible: (a) the data are not normal at all but follow, for example, a symmetric distribution with heavier tails, or (b) most observed data actually follow the Gaussian model but a few observations do not. These few atypical observations are then usually denoted outliers. It is by now well established that methods which assume Gaussianity have a considerable loss in efficiency in scenario (a) and might even completely fail in scenario (b). The goal of robust statistics is to safeguard against such model assumption violations. The challenge of robust statistics is then to find a good balance between efficiency at the target model and, at the same time, guaranteeing valid inference under model violations. Thus, the goal is to fit the majority of the data well: in scenario (a) the center of the data should be fitted well and heavy tails tolerated, while in scenario (b) the influence of the atypical observations is limited.

1.2 | Characterizing robust estimators

To discuss robust methods, one needs measures to assess the robustness of an estimator. For example, when estimating a univariate location one could follow the behavior of the estimate μ̂ based on a sample x_1, …, x_n when an additional observation x_{n+1} is added, which can range over the whole real line, that is,

$$\hat{\mu}(x_1, \ldots, x_n, x_{n+1}) - \hat{\mu}(x_1, \ldots, x_n). \qquad (1)$$

The result reveals the sensitivity of the estimator to an additional observation, depending on the value of this obser-
vation, and is known as the sensitivity curve. The asymptotic generalization of the sensitivity curve is the influence
function (IF), which measures the impact that a point mass contamination has on an estimator. Robust estimators
should have a bounded IF, thus limiting the influence of single outliers. For example, the IF of the sample mean is
unbounded while the IF of the median is bounded.
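As an illustration, the quantity in Equation (1) can be computed directly. The following minimal R sketch (sample size, grid, and seed are arbitrary choices made here for illustration) contrasts the unbounded sensitivity of the mean with the bounded sensitivity of the median.

```r
## Sensitivity curves (Equation (1)) for the mean and the median
set.seed(1)
x <- rnorm(20)                                  # clean sample x_1, ..., x_n
x_new <- seq(-100, 100, length.out = 401)       # values of the additional observation

sensitivity_curve <- function(est, x, x_new) {
  sapply(x_new, function(z) est(c(x, z)) - est(x))
}

sc_mean   <- sensitivity_curve(mean, x, x_new)    # grows without bound
sc_median <- sensitivity_curve(median, x, x_new)  # stays bounded

matplot(x_new, cbind(sc_mean, sc_median), type = "l", lty = 1, col = 1:2,
        xlab = "additional observation", ylab = "sensitivity")
legend("topleft", legend = c("mean", "median"), col = 1:2, lty = 1)
```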
The IF measures the impact of small (infinitesimal) contamination, but how many atypical observations can be tol-
erated? This is measured by the breakdown point (BP) which, roughly speaking, measures the largest fraction of arbi-
trarily placed outliers an estimator can tolerate while still giving reasonable values for the quantity of interest. As
robust methods should represent the majority of the data, a natural upper limit for the BP is 50%. From a robustness
point of view, a positive BP is desirable, and the higher the BP, the better. For example, the mean has a BP of 0% while
the median has 50%.
Thus, the ideal robust estimator has a bounded IF and a high BP, and it would still be efficient at the target model.
For more details on IF and BP see, for example, Hampel, Ronchetti, Rousseeuw, and Stahel (2005) and Maronna
et al. (2019).

1.3 | Outliers in regression

The focus in this paper is on regression models, and in this setting different types of atypical observations are possible.
The first type of atypical points are observations which have values in the response that do not correspond to the
model. Such values are usually called vertical outliers. Another type is when there are atypical values in the explanatory
variables which do not fulfill the relationship with the response. These values are referred to as bad leverage points.
Values which are extreme in the response and in the explanatory variables but follow the model are denoted good lever-
age points. In general, good leverage points have a big impact on the model fit which is, however, not problematic since
they correspond to the model. On the other hand, bad leverage points and vertical outliers require robust methods,
because they may affect classical least-squares estimation. The various proposed robust regression methods differ in how they deal with vertical outliers and bad leverage points, where vertical outliers are often easier to deal with.
There are also different plots available for regression diagnostics, which help to distinguish the three different types of atypical observations. For a detailed discussion see, for example, Rousseeuw and Leroy (2003) and Rousseeuw and van Zomeren (1990).

2 | ROBUST REGRESSION IN LOW DIMENSION

Consider a matrix X of dimension n × (p + 1), containing the information of n observations for p predictor variables. The ith row of the matrix is denoted as the column vector x_i = (1, x_{i1}, …, x_{ip})^T, and the first entry 1 will take care of an intercept in the regression model. Thus, the first column of X is a vector of ones. Furthermore, we have n observations of the response, that is, y = (y_1, …, y_n)^T. Then the linear regression model is given as

$$y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i \quad \text{for } i = 1, \ldots, n, \qquad (2)$$

where β = (β_0, β_1, …, β_p)^T is the vector of unknown regression coefficients, with the intercept term β_0, and ε_i denotes the error terms, which are assumed to be i.i.d. random variables, distributed according to N(0, σ²).
Suppose we have an estimator β̂ for β. Then the residuals are given by r_i = r_i(β̂) = y_i − x_i^T β̂, for i = 1, …, n. Since the residuals refer to the error of the regression fit, their size should be kept small. This can be formalized into an objective function, which eventually defines the estimator.
The best known estimator is the least squares (LS) estimator, which is defined as

$$\hat{\boldsymbol{\beta}}_{LS} = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n} r_i(\boldsymbol{\beta})^2. \qquad (3)$$

This results in an explicit formula for the solution,

$$\hat{\boldsymbol{\beta}}_{LS} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}, \qquad (4)$$

and in desirable statistical properties; see, for example, Rousseeuw and Leroy (2003) for details. It is clear that in order to compute the LS estimator, the matrix X^T X needs to have full rank, which is not the case in problems where the number of variables exceeds the number of observations (high-dimensional setting). The treatment of this problem, together with a review of robust procedures, is the main focus of this contribution.

2.1 | The problem with the least-squares estimator

In spite of the good properties of the LS estimator and the explicit formula for its computation, there are several issues
that need to be considered:

• As mentioned above, the lack of full rank of the matrix X^T X leads to computational problems. However, high correlation among the explanatory variables alone can already lead to numerical instabilities. In particular, they can make the statistical inference for the regression parameters unreliable, because the diagonal elements of the inverse of this matrix are involved in the construction of test statistics.
• A violation of the strict model assumptions (normal distribution, independence, equal variance σ²) can lead to unreliable tests as well, but also to spoiled parameter estimates. The presence of outliers is already sufficient for severe model violations, and they can have a huge impact on the LS estimator.
• The different types of outliers in this context have different severity on the LS estimator:
  - Vertical outliers will typically shift the regression line (hyperplane) to a certain extent.
  - Good leverage points can even lead to a more precise estimation of the regression parameters, but this depends very much on how close such an observation follows the model.
  - Bad leverage points are the worst type of outliers, and such observations can even lever the regression line (hyperplane) and thus make LS estimation completely useless.

As a consequence, regression diagnostics based on LS fits can also be misleading in the presence of outliers. In the following we will describe alternative regression estimators which are robust against some or all of these types of outliers.

2.2 | M-estimators

One can generalize the objective function (3) as

$$\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho(r_i(\boldsymbol{\beta})), \qquad (5)$$

where ρ is an appropriate function applied to the residuals. Taking ρ(r) = r² yields the LS-estimator. For ρ(r) = |r| one obtains the so-called L1-estimator which minimizes the sum of the absolute residuals.
Since certain equivariance properties are desirable (e.g., rescaling the response should not alter the robustness
behavior), it is necessary to scale the residuals. This results in the regression M-estimator

$$\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n} \rho\!\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}}\right), \qquad (6)$$

where σ̂ is a robust scale estimator of the residuals. The scale σ̂ can either be estimated previously, or jointly with the regression parameters.
It is crucial to select an appropriate ρ function. The sensitivity of the LS-estimator to outliers is based on the fact that
large (scaled) residuals have a huge impact because of the square. On the other hand, in order to obtain high efficiency,
the ρ function needs to be similar to that of LS. One widely used function is Tukey’s bisquare function, defined as
$$\rho(r) = \begin{cases} 3\left(\dfrac{r}{k}\right)^2 - 3\left(\dfrac{r}{k}\right)^4 + \left(\dfrac{r}{k}\right)^6 & \text{for } |r| \le k, \\ 1 & \text{else.} \end{cases} \qquad (7)$$
For k → ∞, the estimate tends to LS, and thus it becomes more efficient but also less robust. In practice, k is selected such that reasonable efficiency but also a certain level of robustness is attained. This ρ function is also bounded, which implies that its derivative ψ ≔ ρ′ goes down to zero for larger (absolute) scaled residuals. In fact, based on the derivative ψ one can formulate a necessary condition for Equation (6) as

$$\sum_{i=1}^{n} \psi\!\left(\frac{r_i(\boldsymbol{\beta})}{\hat{\sigma}}\right)\mathbf{x}_i = \mathbf{0}, \qquad (8)$$

which is called the system of M-estimating equations. Based on a bounded ρ function, the resulting ψ function leads to the effect that outlying “large” values of x_i will not dominate the sum if they are connected to large (absolute) residuals, which correspond to bad leverage points.
With the notations W(r) = ψ(r)/r and w_i = W(r_i(β)/σ̂), Equation (8) can be rewritten as

$$\sum_{i=1}^{n} w_i \left(y_i - \mathbf{x}_i^T\boldsymbol{\beta}\right)\mathbf{x}_i = \mathbf{0}. \qquad (9)$$

This reveals the M-estimator as a weighted version of the normal equations. Therefore, the estimator can be seen as
weighted LS, with the weights depending on the data. A robust estimator has to make sure that observations with large
(absolute) residuals receive a small weight, which implies that W(r) has to decrease to zero fast enough for large r.
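The reweighting view of Equation (9) can be made concrete in a few lines of R. The sketch below is a minimal illustration, not the algorithm used in practice: it assumes that a robust residual scale σ̂ is already available, and the tuning constant k = 4.685 is the value commonly used for the bisquare to reach 95% efficiency at the normal model.

```r
## Tukey bisquare rho (Equation (7)) and the induced weight function W(r) = psi(r)/r
rho_bisquare <- function(r, k = 4.685) {
  u <- pmin(abs(r) / k, 1)
  3 * u^2 - 3 * u^4 + u^6                    # equals 1 for |r| >= k
}
weight_bisquare <- function(r, k = 4.685) {
  ifelse(abs(r) <= k, (1 - (r / k)^2)^2, 0)  # proportional to psi(r)/r
}

## One iteratively reweighted least-squares step for Equation (9):
## weighted LS with data-dependent weights, for a given robust scale sigma_hat
irls_step <- function(X, y, beta, sigma_hat, k = 4.685) {
  r <- as.numeric(y - X %*% beta)            # current residuals
  w <- weight_bisquare(r / sigma_hat, k)     # small weights for large scaled residuals
  Xw <- X * sqrt(w)                          # row-wise weighting of the design matrix
  solve(crossprod(Xw), crossprod(Xw, y * sqrt(w)))  # weighted normal equations
}
```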

2.3 | S- and MM-estimators

A crucial point is to estimate the robust residual scale in Equation (6). The robustness properties of the resulting estima-
tor depend on the robustness properties of the scale estimator. One prominent choice is the M-estimator of scale, defined
as the solution of the implicit equation

$$\frac{1}{n}\sum_{i=1}^{n} \rho\!\left(\frac{r_i}{\hat{\sigma}}\right) = b, \qquad (10)$$

where ρ is a bounded (and appropriately chosen) ρ-function and b is a constant. With this choice, regression S-estimators
are defined as

$$\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \hat{\sigma}(\mathbf{r}(\boldsymbol{\beta})), \qquad (11)$$

with the vector of residuals r(β) = (r_1(β), …, r_n(β))^T. Regression S-estimators achieve the maximum BP of 50%, but suffer from low efficiency. However, using the S-estimator and the resulting σ̂ as an initial estimator for the M-estimator (6) yields the so-called MM-estimator, which inherits the BP of the S-estimator, and has tunable efficiency.
The MM-estimator is implemented in the R package robustbase (Maechler et al., 2018) as the function
lmrob().
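A typical call looks as follows; the data frame dat with response y and the predictor columns is a placeholder here, not part of the example data used later in this article.

```r
## MM regression with robustbase::lmrob()
library(robustbase)
fit_mm <- lmrob(y ~ ., data = dat)   # S-estimate for scale and start, then M-step
summary(fit_mm)                      # robust coefficient estimates and inference
plot(fit_mm$rweights, ylab = "robustness weight")  # small weights indicate outliers
```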

2.4 | LTS regression

A regression estimator with high BP, already proposed in 1984, is the least trimmed squares (LTS) estimator
(Rousseeuw, 1984). This estimator is still very popular because of its intuitive definition, and because of a fast algorithm
for its computation (Rousseeuw & Van Driessen, 2006). This estimator is defined as
$$\hat{\boldsymbol{\beta}} = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{h} \left(r^2(\boldsymbol{\beta})\right)_{i:n}, \qquad (12)$$

with the vector r²(β) = (r_1², …, r_n²)^T of squared residuals, and the order statistics of the squared residuals (r²(β))_{1:n} ≤ … ≤ (r²(β))_{n:n}. The parameter h determines the BP of the estimator, but also its efficiency. The maximum BP (and lowest efficiency) is attained with h = [(n + p + 1)/2], where [a] is the integer part of a.
The so-called Fast-LTS algorithm (Rousseeuw & Van Driessen, 2006) works as follows:

Step 1: Fix h and select randomly h out of the n observations, determined by the set of the corresponding indexes H_1. Perform with these observations LS regression to obtain the estimator β̂_{H_1}.
Step 2: Compute the residuals for all n observations based on β̂_{H_1}, i.e.,

$$r_i(\hat{\boldsymbol{\beta}}_{H_1}) = y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_{H_1} \quad \text{for } i = 1, \ldots, n,$$

and sort the squared residuals in ascending order, $r^2_{(1)}(\hat{\boldsymbol{\beta}}_{H_1}) \le \cdots \le r^2_{(h)}(\hat{\boldsymbol{\beta}}_{H_1}) \le \cdots \le r^2_{(n)}(\hat{\boldsymbol{\beta}}_{H_1})$.
Step 3: Take the observations which refer in Step 2 to the h smallest squared residuals, and call the corresponding index set H_2. Perform LS regression with these observations to obtain the estimator β̂_{H_2}.

It has been shown (Rousseeuw & Van Driessen, 2006) that the objective function of LTS decreases successively:

$$\sum_{i \in H_2} r_i^2(\hat{\boldsymbol{\beta}}_{H_2}) \le \sum_{i \in H_1} r_i^2(\hat{\boldsymbol{\beta}}_{H_1}).$$

Thus, Steps 2 and 3 are alternated until convergence. These two steps are also called C-steps (Rousseeuw & Van Driessen, 2006), and they are used in several other robustness procedures.

The final result may still depend on the initialization in Step 1. Therefore, in practice one should restart the whole
procedure several times in order to approximate the global minimum of the minimization problem (12). Note that also
the initialization is usually done in a more efficient way (Rousseeuw & Van Driessen, 2006).
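To make the mechanics of the C-steps explicit, the following bare-bones R sketch implements Steps 1 to 3 with random initial subsets. It assumes that X already contains the column of ones for the intercept, and it is only meant as an illustration of the principle, not as a replacement for the optimized implementation in robustbase::ltsReg().

```r
## Minimal LTS via C-steps (Equation (12)); illustration only
lts_cstep <- function(X, y, h, n_start = 50, max_iter = 100) {
  n <- nrow(X)
  best <- list(obj = Inf, beta = NULL)
  for (s in seq_len(n_start)) {
    H <- sample(n, h)                                  # Step 1: random h-subset
    for (it in seq_len(max_iter)) {
      beta <- coef(lm.fit(X[H, , drop = FALSE], y[H])) # LS fit on the current subset
      r2 <- as.numeric((y - X %*% beta)^2)             # Step 2: squared residuals, all n
      H_new <- order(r2)[seq_len(h)]                   # Step 3: h smallest squared residuals
      if (setequal(H_new, H)) break                    # C-steps converged
      H <- H_new
    }
    obj <- sum(sort(r2)[seq_len(h)])                   # LTS objective for this start
    if (obj < best$obj) best <- list(obj = obj, beta = beta)
  }
  best
}
```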

3 | ROBUST REGRESSION IN HIGH DIMENSION

High-dimensional data arise in many application fields, such as chemometrics with spectral data, or bioinformatics with genetic information. Also in many technical and industrial applications it is more and more common to measure many variables, for example, with sensors. The predictive power does not necessarily improve with an increasing number of variables. However, classical LS regression no longer works if the number of variables p exceeds the number of observations n (see discussion above). The same numerical problem occurs for robust regression estimators, because they are based either on weighted LS (e.g., M-estimators) or on trimmed LS estimation (LTS). Therefore, other approaches have been developed to cope with this problem, such as dimension reduction of the explanatory variables, or shrinkage estimators. Robust proposals of these techniques are available, and some of them are discussed in the following. Note that these methods also work in the case p < n, and they could lead to simpler models or models with higher predictive power compared to models from LS regression. Nevertheless, the focus below is on the case p > n.

3.1 | Methods based on dimension reduction

Reducing the dimensionality of the space of the explanatory variables leads to smaller and simpler models and thus
tries to avoid overfitting. As a consequence, this may lead to better prediction performance. The two approaches
discussed below follow this idea: new “latent” variables are constructed as linear combinations of the input variables,
and they are used to replace the original matrix of explanatory variables for regression. However, the way these
latent variables are constructed differs: For principal component regression (PCR), they are constructed just based on
information from the explanatory variables, while for partial least squares (PLS) regression they also take into account
the response variable.

3.2 | Robust PCR

The basic idea of robust PCR is very simple: use the matrix of the p predictor variables, compute robust principal com-
ponents (PCs), and take the first k PC scores as new explanatory variables in a robust regression, where the optimal
choice of k is usually determined by cross-validation. There are many robust PC analysis methods available, typically
based on robust covariance estimation (see Maronna et al., 2019). However, for the case p > n there are fewer possibilities since, for example, prominent robust covariance estimators like the minimum covariance determinant (MCD) estimator (Rousseeuw & Van Driessen, 1999) no longer work in this case because of singularity. There are, however, covariance estimators which also work in the high-dimensional situation (Hubert, Rousseeuw, & Van den Branden, 2005; Maronna & Zamar, 2002), and methods based on the principle of projection pursuit (Croux, Filzmoser, & Oliveira, 2007; Croux & Ruiz-Gazen, 2005). Since the number of PCs to be used is then typically much smaller than n, any robust regression estimator can be employed. More details can be found, for example, in Hubert and Verboven (2003) and Han and Liu (2013).

3.3 | Robust partial least squares regression

PLS regression also constructs a set of linear combinations of the predictor variables for regression, but unlike PCR it
also makes use of the response y for this construction. We assume that the p predictor variables are arranged in the
n × p matrix X (now without a column of ones for the intercept), and that the columns are centered and—depending
on the application—also scaled. We consider the regression model for the centered response

$$y_i = \mathbf{t}_i^T \boldsymbol{\gamma} + \varepsilon_i, \qquad (13)$$

for i = 1, …, n, with the regression parameters γ, a vector of length q < p, typically much smaller than p in the high-dimensional case. The score vectors t_i are arranged as rows in an n × q score matrix T. However, T can not be observed directly; we obtain each column of T sequentially, for k = 1, 2, …, q, by using the PLS criterion

$$\mathbf{w}_k = \operatorname{argmax}_{\mathbf{w}} \operatorname{Cov}(\mathbf{y}, \mathbf{X}\mathbf{w}), \qquad (14)$$

under the constraints ‖w_k‖ = 1 and Cov(Xw_k, Xw_j) = 0 for 1 ≤ j < k. The vectors w_k with k = 1, 2, …, q are called
loadings, and they are collected in the columns of the matrix W. The score matrix is then T = XW, and y can be written
as

$$\mathbf{y} = \mathbf{T}\boldsymbol{\gamma} + \boldsymbol{\varepsilon} = (\mathbf{X}\mathbf{W})\boldsymbol{\gamma} + \boldsymbol{\varepsilon} = \mathbf{X}\underbrace{(\mathbf{W}\boldsymbol{\gamma})}_{\boldsymbol{\beta}} + \boldsymbol{\varepsilon}. \qquad (15)$$

In other words, PLS performs a regression on a weighted version of X which contains incomplete or partial informa-
tion (thus the name of the method). If LS regression is used for this purpose, this leads to the name partial least squares.
Since PLS also uses y to determine the PLS weights, this method is supposed to have better prediction performance than PCR, or at least leads to a smaller number of components and thus potentially to a more stable model. The number of PLS components, q, is an important tuning parameter, which is usually determined by a cross-validation scheme (Filzmoser, Liebmann, & Varmuza, 2009). Selecting q too low will result in a loss of valuable information for modeling, and taking q too big results in overfitting to the training data. In both cases, the prediction performance will in general be lower compared to an “optimal” choice of q.
In order to robustify this method, two steps are crucial: the covariance in Equation (14) needs to be estimated
robustly, and robust regression in Equation (15) is required. The method called PRM (partial robust M-estimator) uses
weights for outliers in the space of the explanatory variables and in the response to do a weighted covariance estima-
tion, followed by an M-regression (Serneels, Croux, Filzmoser, & Van Espen, 2005). The iterative algorithm leads to a
highly robust estimator. However, several other robust proposals are available, see for instance Filzmoser, Serneels,
Maronna, and Van Espen (2009) and Aylin and Agostinelli (2017).
Naturally, any robust dimension reduction method which is assumed to capture all information relevant for the regression can be combined with subsequent robust regression methods. The literature offers many further possibilities, see for example Croux, Filzmoser, and Fritz (2013), Hubert, Reynkens, Schmitt, and Verdonck (2016), Wang and Van Aelst (2019b), and Dupuis and Victoria-Feser (2013).
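As a usage sketch, the PRM estimator is available in the R package sprm (Serneels & Hoffmann, 2016); the function name prms() and its arguments (number of latent components a, weight function fun) are taken from the package documentation and should be treated as assumptions that may differ between versions, and dat/dat_test are placeholder data frames.

```r
## Partial robust M regression (PRM) with the sprm package (call sketched, see above)
library(sprm)
fit_prm <- prms(y ~ ., data = dat, a = 3, fun = "Hampel")  # 3 latent components
pred <- predict(fit_prm, newdata = dat_test)               # predictions for new data
```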

3.4 | Methods based on shrinkage

The idea of shrinkage estimators is to continuously shrink the contributions of the explanatory variables in the regression model. In contrast to variable selection, which is a discrete process, this can lead to smaller prediction errors and more stability. As before, the focus here is on the case p > n.

3.5 | Robust ridge and Liu regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size (Hoerl & Kennard, 1970). The
Ridge coefficients minimize a penalized residual sum of squares,
$$\hat{\boldsymbol{\beta}}_{Ridge} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} \left(y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}, \qquad (16)$$

and the penalty is on the sum of squared regression coefficients, excluding the intercept. The penalty parameter λ ≥ 0
controls the amount of shrinkage: while λ = 0 leads to the LS solution, bigger values of λ result in smaller (squared)
regression coefficients. Thus, criterion (16) is a compromise between minimizing the residual sum-of-squares, and keep-
ing the size of the (squared) coefficients small. Due to the form of this criterion, the Ridge estimator is given by an
explicit formula,

 
$$\hat{\boldsymbol{\beta}}_{Ridge} = \left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p\right)^{-1}\mathbf{X}^T\mathbf{y} \qquad (17)$$

with the identity matrix Ip of order p, which makes it computationally attractive. Different proposals are available to
select the optimal Ridge parameter λ, most commonly some form of cross-validation.
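Because of the closed form in Equation (17), the classical Ridge estimator is easy to compute directly. The sketch below assumes centered (and possibly scaled) predictors and a centered response, so that the intercept can be handled separately and is not penalized.

```r
## Ridge estimator (Equation (17)) from its closed form
ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, y))  # (X'X + lambda*I)^{-1} X'y
}

## lambda is typically chosen by cross-validation over a grid, e.g.:
## lambda_grid <- 10^seq(-3, 3, length.out = 50)
```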
Obviously, criterion (16) is not robust against outliers. A highly robust estimator for the case p > n has been intro-
duced in Maronna (2011), and the principle is based on MM regression. The classical Ridge estimator can be written in
terms of the normal equations, and this is also possible for the MM-type estimator:
     
$$\mathbf{w}^T\left(\mathbf{y} - \hat{\beta}_0 \mathbf{1}_n - \mathbf{X}\hat{\boldsymbol{\beta}}_1\right) = 0 \quad \text{and} \quad \left(\mathbf{X}^T\mathbf{W}\mathbf{X} + \lambda\mathbf{I}_p\right)\hat{\boldsymbol{\beta}}_1 = \mathbf{X}^T\mathbf{W}\left(\mathbf{y} - \hat{\beta}_0\mathbf{1}_n\right). \qquad (18)$$

Here, 1_n is a vector of ones of length n, β̂_1 contains all regression coefficients except the intercept, w is a vector of length n with weights, and W = diag(w). The weights are based on robustly scaled residuals, where the scale is estimated by an M-scale. Starting with an initial S-estimator of scale, an iterative scheme leads to a Ridge estimator which is robust against vertical outliers and leverage points.
An estimator with similar properties as the Ridge estimator has been proposed in Liu (1993) as
   
$$\hat{\boldsymbol{\beta}}_{Liu} = \left(\mathbf{X}^T\mathbf{X} + \mathbf{I}_p\right)^{-1}\left(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}_p\right)\hat{\boldsymbol{\beta}}_{LS}, \qquad (19)$$

where 0 < λ < 1 is the shrinkage parameter, and β̂_LS the LS estimator. As an advantage over the Ridge estimator, the Liu estimator is a linear function of λ, and thus easier to compute. Several proposals have been made to robustify the Liu estimator, but the challenge is to obtain an estimator which is robust against both types of outliers, and still works in the case p > n. The proposal of Filzmoser and Kurnaz (2018) provides these properties, and the idea is similar to Maronna (2011) by including weights, in this case originating from a PRM estimator (Serneels et al., 2005).

3.6 | Robust Lasso regression

Similar to Ridge regression, Lasso regression minimizes a penalized residual sum-of-squares, but the penalty is not of
an L2 type, but of L1 type:

$$\hat{\boldsymbol{\beta}}_{Lasso} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + n\lambda \sum_{j=1}^{p} |\beta_j| \right\} \qquad (20)$$

The multiplication of the penalty term with the number of observations is not important, but it is useful once we turn
to a robustified version of this criterion. The L1 penalty has the effect that some regression coefficients will shrink to
exactly zero, which corresponds to a variable selection (Tibshirani, 1996). In practice, the tuning parameter λ is selected
by a cross-validation scheme. The bigger λ is chosen, the more emphasis is on keeping the L1-norm of the regression
coefficients small, which implies that more and more regression coefficients will be set to zero. Unlike in Ridge regres-
sion, there is no explicit formula to compute the Lasso regression coefficients. However, there are fast algorithms in the
framework of Least Angle Regression to compute the solution (Efron, Hastie, Johnstone, & Tibshirani, 2004).
The (finite-sample) BP of the Lasso estimator is 1/n. A robust Lasso estimator with BP (n − h + 1)/n is the so-called
sparse LTS estimator, defined as
$$\hat{\boldsymbol{\beta}}_{sparseLTS} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{h} \left(r^2(\boldsymbol{\beta})\right)_{i:n} + h\lambda \sum_{j=1}^{p} |\beta_j| \right\}, \qquad (21)$$

where the same notation as for the LTS estimator in Equation (12) is used, and h is the trimming parameter, with h ≤ n
(Alfons, Croux, & Gelper, 2013). Thus, this estimator is highly robust and also resistant to leverage points. Moreover, an
algorithm in the spirit of the Fast-LTS algorithm (Rousseeuw & Van Driessen, 2006) has been proposed and
implemented in the R package robustHD (Alfons, 2016). Simulations have demonstrated that this estimator has clear
advantages over other robust Lasso estimators (H. Wang, Li, & Jiang, 2007; Khan, Van Aelst, & Zamar, 2007). A Lasso estimator using MM regression has been introduced in Smucler and Yohai (2017). The authors even extend to more general penalty types, and prove robustness and asymptotic properties.
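A usage sketch for the sparse LTS estimator is given below; the mode = "fraction" interface (lambda given as a fraction of the smallest penalty that shrinks all coefficients to zero) and the subset fraction alpha = 0.75 follow the robustHD documentation, and in practice lambda would be selected over a grid by cross-validation.

```r
## Sparse LTS (Equation (21)) with the robustHD package
library(robustHD)
fit_slts <- sparseLTS(x, y, lambda = 0.05, mode = "fraction", alpha = 0.75)
coef(fit_slts)   # sparse and robust coefficient vector
```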

3.7 | Robust Elastic Net regression

The Elastic Net estimator is defined as

$$\hat{\boldsymbol{\beta}}_{enet} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda P_\alpha(\boldsymbol{\beta}) \right\}, \qquad (22)$$

with a convex combination of the L1 and L2 penalty,


$$P_\alpha(\boldsymbol{\beta}) = \sum_{j=1}^{p} \left( \frac{1}{2}(1-\alpha)\beta_j^2 + \alpha|\beta_j| \right) \qquad (23)$$

(Zou & Hastie, 2005). This penalty allows to overcome the limitation of the Lasso estimator to select at most n vari-
ables, which is often not enough in settings where p is much larger than n. Moreover, this estimator makes it possible
that groups of correlated variables enter the model, and not only single variables from such blocks. A very efficient algo-
rithm for the computation of the Elastic Net estimator is available in the R package glmnet (Friedman, Hastie, Simon,
& Tibshirani, 2016).
Following the ideas of the previous section, a trimmed version of the Elastic Net estimator has been proposed (Kur-
naz, Hoffmann, & Filzmoser, 2018b). Here, also the logistic regression model in the high-dimensional case is treated
robustly. The procedures are implemented in the R package enetLTS (Kurnaz, Hoffmann, & Filzmoser, 2018a).
IFs in the context of many penalized regression estimators as discussed above are considered in Öllerer, Croux, and
Alfons (2015).
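The two estimators can be used side by side as sketched below; the cv.glmnet() call is standard, whereas the enetLTS() arguments (grid of alphas, trimming proportion hsize) are stated from the package description and may differ between versions of enetLTS.

```r
## Classical elastic net (Equation (22)) and its trimmed, robust counterpart
library(glmnet)
cv_fit <- cv.glmnet(x, y, alpha = 0.5)     # cross-validated lambda for a fixed alpha
coef(cv_fit, s = "lambda.min")

library(enetLTS)
fit_rob <- enetLTS(x, y, family = "gaussian",
                   alphas = seq(0.1, 0.9, by = 0.2), hsize = 0.75)
coef(fit_rob)
```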

3.8 | Combining dimension reduction and shrinkage

PLS regression results in a coefficient vector which usually involves all variables. This is undesirable in case there are
many noise variables, that is, variables which do not have any predictive power. The Lasso idea can also be applied in this context: since the covariance between the response and a linear combination of the predictors needs to be maximized, see Equation (14), it is necessary to include an L1 penalty on the coefficient vector of the linear combination. Chun and Keleş (2010) proposed a slightly modified objective function which imposes sparsity on a surrogate vector, and not directly on the vector of the linear combination, which allows sufficient sparsity to be obtained. It turns out that the
resulting sparse PLS loadings vector—compare with Equation (14)—has an exact solution given by an explicit formula.
A robust sparse PLS version has been introduced in Hoffmann, Serneels, Filzmoser, and Croux (2015), called sparse
PRM estimator, since the weighting principle comes from the PRM estimator (Serneels et al., 2005). Weights for
outlyingness are iteratively assigned to the observations, and a sparse NIPALS algorithm (Lee, Lee, Lee, &
Pawitan, 2011) is employed to achieve sparsity. Note that this leads to the selection of two tuning parameters, the num-
ber of PLS components and the sparsity parameter, which is done by a repeated cross-validation. The algorithm is
implemented in the R package sprm (Serneels & Hoffmann, 2016).
Since PLS is often applied to classification problems in terms of a binary response variable, this idea has also been extended to a robust sparse PLS version for the two-group problem (Hoffmann, Filzmoser, Serneels, & Varmuza, 2016).
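A call to the sparse PRM estimator might look as follows; the function name sprms() with the number of components a and the sparsity parameter eta is taken from the sprm package documentation and is an assumption here, with tuning values chosen only for illustration (in the example section below they are selected by repeated cross-validation).

```r
## Sparse partial robust M regression (sparse PRM) with the sprm package (call sketched)
library(sprm)
fit_sprm <- sprms(y ~ ., data = dat, a = 8, eta = 0.2, fun = "Hampel")
fit_sprm$coefficients   # sparse robust coefficient vector (component name assumed)
```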

3.9 | Example

Some of the methods mentioned above will be illustrated in the following based on an example with data from a pro-
duction process. The data were observed over time, in total 648 observations are available, and the values of 468 predic-
tors should be used to explain the outcome response. For confidentiality reasons, only the anonymized (centered and
scaled) data have been made available at https://cstat.tuwien.ac.at/data.RData. The time series character of the data
will be ignored in the following analyses for simplicity. A training set of randomly selected 3/4 of the observations is
used to fit the different models, and the evaluation is done on the remaining 1/4 of the test data. The 10% trimmed root
mean-squared error (RMSEt) is chosen as evaluation measure; here, 10% of the biggest squared residuals are trimmed,
and thus this measure provides a certain level of robustness to outliers in the residuals.
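The evaluation measure can be computed as follows; the function is a direct implementation of the description above, and the train/test split mimics the 3/4 versus 1/4 partition (the seed is arbitrary).

```r
## 10% trimmed root mean-squared error: drop the 10% largest squared residuals
rmse_trimmed <- function(y, y_hat, trim = 0.1) {
  sq_res <- sort((y - y_hat)^2)                        # squared residuals, ascending
  keep <- seq_len(floor((1 - trim) * length(sq_res)))  # keep the smallest 90%
  sqrt(mean(sq_res[keep]))
}

## Random 3/4 training / 1/4 test split (X: predictor matrix, y: response)
set.seed(123)
idx_train <- sample(nrow(X), size = floor(0.75 * nrow(X)))
```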
Figure 1 presents results from LS-regression, which is still feasible with 486 training set observations and 468 predic-
tors. The left plot shows the fitted values ^y versus the response y. The model fits well, with a 10% trimmed RMSE of
0.135. However, when applying the model to the test data, the RMSEt increases by a factor of 10 to 1.396, and the mid-
dle and right plot (zoomed) show predictions versus the response. It is obvious that the predictions are rather poor,
which is the result of overfitting to the training data. Moreover, outliers could have had an impact on the parameter
estimation—at least one huge outlier is visible in the middle plot.
FIGURE 1 Results from least squares-regression: fitted values versus response (left), prediction versus response (middle) and zoom (right); numbers on top are 10%-trimmed root mean-squared error (panel titles: Training (0.135), Test (1.396), Test (zoom))

FIGURE 2 Results from MM regression: fitted values versus response (left), prediction versus response (middle) and zoom (right); numbers on top are 10%-trimmed root mean-squared error; color of the symbols indicates outlyingness information (panel titles: Training (0.062), Test (1.606), Test (zoom))

A next attempt is to apply robust regression; the MM-estimator as an alternative to classical LS-regression is used for this purpose. Figure 2 presents the results in a similar way as before. In addition, color information indicates outlyingness: For the left plot, red symbol color is used if the observation weights w_i are smaller than 0.5, referring to
observations that deviate from the model to a certain degree. For the middle and right plot, red color is used if the absolute predicted residuals exceed 2.5σ̂, where σ̂ is the robustly estimated residual scale from the MM-estimator based on the training data. However, this seems not to be very useful as an outlier indication, because about twice as many observations would be flagged as outliers than as regular observations for the test data. When looking at the RMSEt, the same problem as in LS-regression with overfitting to the training data is apparent, and the robustness of MM-regression did not at all help to improve the prediction accuracy.
Assuming that several of the explanatory variables are not important for the model, Elastic Net regression could be
the method of choice. Using the R function cv.glmnet from the R package glmnet, the optimal tuning parameters
can be selected by cross-validation, resulting in λ = 0.0174 and α = 0.61. This leads to a model with 106 predictors, and
thus to moderate sparsity. Figure 3 shows the plots with the model fit and model prediction, and the RMSEt values indi-
cate that the overfit problem has been addressed appropriately, since training and test error are in the same order of
magnitude. It is, however, hard to say if there was an impact of potential outliers on the model estimation.
FIGURE 3 Results from Elastic Net regression: fitted values versus response (left), prediction versus response (middle) and zoom (right); numbers on top are 10%-trimmed root mean-squared error (panel titles: Training (0.271), Test (0.296), Test (zoom))

FIGURE 4 Results from robust Elastic Net regression: fitted values versus response (left), prediction versus response (middle) and zoom (right); numbers on top are 10%-trimmed root mean-squared error; color of the symbols indicates outlyingness information (panel titles: Training (0.243), Test (0.186), Test (zoom))

The robust counterpart to Elastic Net regression based on trimming results in the optimal model with the parameters λ = 0.0281 and α = 0.625, and with 78 explanatory variables entering the model. Figure 4 presents the corresponding plots. The RMSEt for the training data improved when compared to classical Elastic Net regression, and
the prediction error for the test data even reduced again. The color information in the training data plot is based on the 0/1 weights for outlyingness after model fit, and for the test data outlyingness is based on whether the absolute residuals exceed 2.5σ̂, where σ̂ is a robust scale estimate (MAD) of the training set residuals. This fitted model may already be quite useful for practical purposes, since it provides information on the relevant predictors, it leads to a good prediction error, and it allows to identify outliers in the training and in the test set.
In the final analyses, dimension reduction and shrinkage are combined. Figure 5 presents the results of sparse PLS,
while Figure 6 shows those from sparse PRM. It seems that sparse PLS suffers from outliers, which are visible in the
plots for the training data fits and for the test data predictions. Outliers are also visible in the corresponding plots for
sparse PRM, but they are not affecting the estimation. The trimmed RMSE values for the robust analyses are quite com-
parable to those from robust Elastic Net regression, and also here outlier diagnostics is possible based on a robustly esti-
mated residual scale (color information in the plots).
FIGURE 5 Results from sparse PLS: fitted values versus response (left), prediction versus response (middle) and zoom (right); numbers on top are 10%-trimmed root mean-squared error (panel titles: Training (0.347), Test (0.278), Test (zoom))

FIGURE 6 Results from sparse partial robust M-estimator: fitted values versus response (left), prediction versus response (middle) and zoom (right); numbers on top are 10%-trimmed root mean-squared error; color of the symbols indicates outlyingness information (panel titles: Training (0.274), Test (0.186), Test (zoom))

Note that for these analyses, two tuning parameters need to be selected: the number of PLS (PRM) components, q, and the sparsity parameter s ∈ [0, 1), affecting the number of selected variables in the model (Hoffmann et al., 2015). Figure 7 presents a visual display for robust PRM to select these parameters. In the upper plot, the 10%-trimmed RMSE is visualized for all combinations of the prechosen values for q and s. The RMSEt values result from a repeated cross-validation, and the minimum is achieved for q = 8 components and s = 0.21. The bottom plot shows the number of nonzero coefficients for all considered models. For the optimal parameter combination one obtains a quite nonsparse model with 459 nonzero coefficients.

FIGURE 7 Results from sparse partial robust M-estimator: 10%-trimmed root mean-squared error from cross-validation for all combinations of numbers of components q and sparsity parameter s (top; minimum at q = 8, s = 0.21); resulting number of nonzero coefficients for all parameter combinations (bottom)

4 | CELLWISE CONTAMINATION

All the previous methods downweight whole observations, which seems quite reasonable when there are many observations relative to the number of variables. However, especially in high-dimensional data settings as considered here, it appears to be rather drastic to downweight a whole observation when, for example, only one variable has an atypical entry not fitting the model. A recent direction of robust methods considers in particular such settings, where only single cells (variables) of an observation could be outliers; the underlying model is denoted as the cellwise contamination model. Robust methods in this case try to identify the contaminated cells, replace them with reasonable values, and use the imputed data set to obtain robust estimates (Rousseeuw & Van den Bossche, 2018).
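One way to follow this detect-and-impute strategy is sketched below, combining the cell outlier detection of Rousseeuw and Van den Bossche (2018), as implemented in the cellWise package, with a subsequent robust regression. The component name Ximp for the imputed predictor matrix is taken from the package documentation and should be treated as an assumption.

```r
## Cellwise cleaning of the predictors, followed by a (robust) regression fit
library(cellWise)
library(robustbase)

ddc_out <- DDC(X)               # detect deviating cells in the predictor matrix
X_clean <- ddc_out$Ximp         # predictors with flagged cells imputed
fit <- lmrob(y ~ ., data = data.frame(y = y, X_clean))
```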
Methods assuming a cellwise contamination are, however, still not that well investigated. For example, in a regres-
sion setting assuming n > p, MM regression was considered in Filzmoser, Höppner, Ortner, Serneels, and Ver-
donck (2020), a three step procedure based on S-estimators in Leung, Zhang, and Zamar (2016) and the so-called
shooting S-estimator in Öllerer, Alfons, and Croux (2016), combining ideas from simple S-regression with the 'shooting algorithm', which is a coordinate descent algorithm.
The shooting S-estimator was recently extended to the high-dimensional setting in Bottmer, Croux, and
Wilms (2020). The main idea of this sparse shooting S-estimator is to cyclically estimate the coefficients of β by solving
a penalized S regression given by
$$\hat{\beta}_j = \operatorname{argmin}_{\beta_j \in \mathbb{R}} \left\{ \hat{\sigma}_j(\beta_j) + \lambda |\beta_j| \right\} \quad \text{for } j = 1, \ldots, p, \qquad (24)$$

where σ̂_j(β_j) is the solution s of

$$\frac{1}{n}\sum_{i=1}^{n} \rho\!\left(\frac{\tilde{y}_i^{(j)} - x_{ij}\beta_j}{s}\right) = \delta. \qquad (25)$$

The response variable in Equation (25) is

$$\tilde{y}_i^{(j)} = y_i - \sum_{k \ne j} \tilde{x}_{ik}\hat{\beta}_k, \qquad (26)$$

where x̃_{ik} is the (i, k)-th cell of the predictor matrix X, possibly cleaned if considered outlying. For details on how to obtain x̃_{ik} and an initial value for β, see Bottmer et al. (2020).
Naturally, in a cellwise contamination model, equivariance concepts become invalid, because an affine transforma-
tion of X would imply that an observation with, for example, one contaminated cell could have all cells contaminated
after the transformation. This effect is known as error propagation and readers interested in cellwise contamination
models are referred to Alqallaf, Van Aelst, Yohai, and Zamar (2009) for further details.
5 | OTHER METHODS

Regression approaches coming from a nonparametric perspective, based on signs and ranks, are also well established and enjoy some robustness properties.
Median regression, or LAD (least absolute deviation) regression, is probably the most established nonparametric
regression approach. The LAD estimator is defined as

$$\hat{\boldsymbol{\beta}}_{LAD} = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n} |r_i(\boldsymbol{\beta})|, \qquad (27)$$

thus using an L1-norm rather than the L2-norm of the LS approach. It is interesting to note that also the LAD estimator
can be seen as an M-estimator, see Equation (5) and comments below. However, one drawback of this estimator is that
it is only robust against vertical outliers, but it does not protect against leverage points (Maronna et al., 2006).
The LAD Lasso estimator (H. Wang et al., 2007) is then, analogously to the LS Lasso estimator, defined as
$$\hat{\boldsymbol{\beta}}_{LAD\ Lasso} = \operatorname{argmin}_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^{n} \left|y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right| + n\lambda \sum_{j=1}^{p}|\beta_j| \right\}. \qquad (28)$$

The LAD Lasso is computationally quite easy. Denote {(ỹ_i, x̃_i)}, i = 1, …, n + p, as an augmented data set where (ỹ_i, x̃_i) = (y_i, x_i) if 1 ≤ i ≤ n, and (ỹ_i, x̃_i) = (0, nλ e_{i−n}) if n + 1 ≤ i ≤ n + p, where e_i is the p-variate unit vector with a one as its i-th element. Then Equation (28) can be expressed as

$$\hat{\boldsymbol{\beta}}_{LAD\ Lasso} = \operatorname{argmin}_{\boldsymbol{\beta}} \sum_{i=1}^{n+p} |\tilde{r}_i(\boldsymbol{\beta})|, \qquad (29)$$

where r̃_i(β) = ỹ_i − x̃_i^T β. Thus, this estimator can be computed as a regular LAD estimator. In R this is possible, for example,
with the package quantreg (Koenker, 2019). The LAD Lasso has, as the regular Lasso, also the oracle property, and many variants and extensions have been suggested, like for example Arslan (2012) and Y. Wang and Zhu (2017). Also multivariate extensions have been investigated (Möttönen & Sillanpää, 2015). Rank regression is less popular, but also in that case Lasso variants are possible (Kim, Ollila, & Koivunen, 2015).
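The augmentation described above translates directly into code; the sketch below uses quantreg::rq() at tau = 0.5 for the LAD fits and is written for centered data without an intercept, so that the p pseudo-observations contribute exactly nλ|β_j| to the objective.

```r
## LAD regression (Equation (27)) and LAD Lasso via the augmented data set (Equation (29))
library(quantreg)

fit_lad <- rq(y ~ X, tau = 0.5)            # plain LAD: median (quantile 0.5) regression

lad_lasso <- function(X, y, lambda) {
  n <- nrow(X); p <- ncol(X)
  X_aug <- rbind(X, n * lambda * diag(p))  # p pseudo-observations (0, n*lambda*e_j)
  y_aug <- c(y, rep(0, p))
  rq(y_aug ~ X_aug - 1, tau = 0.5)         # regular LAD fit on the augmented data
}
```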
We want to point out that this review of robust sparse regression approaches is far from complete, and many other
proposals are available in the literature. For the interested reader we refer for example to Dupuis and Victoria-
Feser (2013), Chang, Roberts, and Welsh (2018), Xu and Zhang (2019), Cohen Freue, Kepplinger, Salibin-Barrera, and
Smucler (2019), Wang and Van Aelst (2019a) for further approaches and references. There are also alternative
approaches to robustness available, such as methods based on a minimal distance concept, see Scott (2001) or Basu,
Shioya, and Park (2011), for an overview.

6 | CONCLUSIONS

The focus of this contribution was on robust linear regression methods for high-dimensional data. As in the low-dimen-
sional case, there are two types of outliers that need to be taken care of: vertical outliers, which are outliers in the
response variable deviating from the linear trend, and bad leverage points, which are outliers in the explanatory vari-
ables that clearly deviate from the linear trend. The various versions of robust regression techniques differ in how they deal with these outlier types. Basically, the main strategy is to downweight such observations in order to
reduce their effect on the estimation of the regression parameters. In other words, the influence of outlying observa-
tions on the estimator should be bounded—for classical least-squares estimation this is not the case. A further goal of
robust regression estimators is to achieve a high BP and at the same time high efficiency under specific model
assumptions.
In recent years, there was a lot of research activity in this direction. Since traditional M-estimators can no longer be
applied in the situation where the number of variables exceeds the number of observations, alternative approaches
needed to be considered which, however, are available from the nonrobust setting. One research direction was on
robust dimension reduction of the explanatory variables, linked with a robust regression on the extracted latent vari-
ables. This resulted in robust counterparts to PC and partial least-squares regression. Another approach was based on
robust versions of Ridge and Lasso regression, which are shrinkage estimators with a penalty on the size of the regres-
sion coefficients. Especially the concept of sparsity is useful in many high-dimensional problems, and thus also robust
versions of sparse partial least-squares regression, as an example, are available.
There might still be a lot of potential in methods that follow the cellwise contamination model. Such methods do
not necessarily downweight the complete information of outlying observations, but they only reduce the impact of out-
lying cells in such observations. This concept seems particularly useful in the high-dimensional case, where cellwise
downweighting still allows a lot of valuable information from “unusual” observations to be used for modeling.
There are of course many robust methods available also for other purposes than regression, but listing those would
clearly exceed any space limit of this contribution. Generally speaking, robust methods usually follow the principle of
downweighting outlying observations (or cells) before traditional estimation techniques are applied. Depending on the
method, the assigned weights can be simply zero or one, corresponding to excluding or including the observation (cell)
in the estimation, or they can be continuous, usually in the interval [0, 1]. In either case, robust methods do not throw
away or waste information, since the weights are carefully chosen with respect to the underlying model. Practically, this
means that an iterative procedure is employed, where the estimates of the parameters need to be (robustly) initialized
and updated until they stabilize. Thus, in general, robust estimators require more computation time compared to non-
robust estimators, and the computation may also be more “tricky.” For this reason, not only the theoretical properties
of robust estimators are important, but also the algorithms for their computation, and the implementation of the algo-
rithms in software packages. Fortunately, there are many procedures available which are well designed with respect to
all three aspects, and thus ready for use in practical applications.

ACKNOWLEDGMENTS
The authors want to thank the editors and reviewers for helpful remarks and comments. They also acknowledge sup-
port from the UpDeep project (Upscaling deep buried geochemical exploration techniques into European business,
2017–2020), funded by the European Information and Technology Raw Materials.

CONFLICT OF INTEREST
The authors have declared no conflicts of interest for this article.

AUTHOR CONTRIBUTIONS
Peter Filzmoser: Methodology. Klaus Nordhausen: Methodology.

ORCID
Peter Filzmoser https://orcid.org/0000-0002-8014-4682

RELATED WIREs ARTICLE
Robust statistics: A selective overview and new directions

FURTHER READING
Robust high-dimensional regression from a signal processing view is, for example, discussed in Zoubir, Koivunen,
Ollila, and Muma (2018).

REFERENCES
Alfons, A. (2016). robustHD: Robust methods for high dimensional data (R package version 0.4.0) [Computer software manual]. Vienna, Aus-
tria. Retrieved from http://CRAN.R-project.org/package=robustHD.
Alfons, A., Croux, C., & Gelper, S. (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals
of Applied Statistics, 7(1), 226–248.
Alqallaf, F., Van Aelst, S., Yohai, V. J., & Zamar, R. H. (2009). Propagation of outliers in multivariate data. The Annals of Statistics, 37(1),
311–331.
Arslan, O. (2012). Weighted LAD-LASSO method for robust parameter estimation and variable selection in regression. Computational Statis-
tics & Data Analysis, 56(6), 1952–1965.
Aylin, A., & Agostinelli, C. (2017). Robust iteratively reweighted SIMPLS. Journal of Chemometrics, 31, e2881.
Basu, A., Shioya, H., & Park, C. (2011). Statistical inference: The minimum distance approach. Boca Raton, FL: CRC Press.
Bottmer, L., Croux, C., & Wilms, I. (2020). Sparse regression for large data sets with outliere. Retrieved from https://sites.google.com/view/
iwilms/ongoing-research
Chang, L., Roberts, S., & Welsh, A. (2018). Robust lasso regression using Tukey's biweight criterion. Technometrics, 60(1), 36–47.
Chun, H., & Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of
the Royal Statistical Society: Series B (Statistical Methodology), 72(1), 3–25.
Cohen Freue, G. V., Kepplinger, D., Salibin-Barrera, M., & Smucler, E. (2019). Robust elastic net estimators for variable selection and identi-
fication of proteomic biomarkers. Annals of Applied Statistics, 13, 2065–2090.
Croux, C., Filzmoser, P., & Fritz, H. (2013). Robust sparse principal component analysis. Technometrics, 55, 202–214.
Croux, C., Filzmoser, P., & Oliveira, M. (2007). Algorithms for projection-pursuit robust principal component analysis. Chemometrics and
Intelligent Laboratory Systems, 87, 218–225.
Croux, C., & Ruiz-Gazen, A. (2005). High breakdown estimators for principal components: The projection-pursuit approach revisited. Jour-
nal of Multivariate Analysis, 95, 206–226.
Dupuis, D. J., & Victoria-Feser, M.-P. (2013). Robust VIF regression with application to variable selection in large data sets. Annals of Applied
Statistics, 7, 319–341.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Filzmoser, P., Höppner, S., Ortner, I., Serneels, S., & Verdonck, T. (2020). Cellwise robust M regression. Computational Statistics & Data
Analysis, 147, 106944. https://doi.org/10.1016/j.csda.2020.106944
Filzmoser, P., & Kurnaz, F. (2018). A robust Liu regression estimator. Communications in Statistics: Simulation and Computation, 47(2),
432–443.
Filzmoser, P., Liebmann, B., & Varmuza, K. (2009). Repeated double cross validation. Journal of Chemometrics: A Journal of the
Chemometrics Society, 23(4), 160–171.
Filzmoser, P., Serneels, S., Maronna, R., & Van Espen, P. (2009). Robust multivariate methods in chemometrics. In S. Brown, R. Tauler, &
B. Walczak (Eds.), Comprehensive chemometrics: Chemical and biochemical data analysis (pp. 681–722). Amsterdam, Netherlands:
Elsevier.
Friedman, J., Hastie, T., Simon, N., & Tibshirani, R. (2016). glmnet: Lasso and elastic net regularized generalized linear models (R package ver-
sion 2.0-5) [Computer software manual]. Vienna, Austria. Retrieved from http://CRAN.R-project.org/package=glmnet
Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (2005). Robust statistics: The approach based on influence functions. New York, NY:
Wiley.
Han, F., & Liu, H. (2013). Robust sparse principal component regression under the high dimensional elliptical model. In NIPS0 13: Proceed-
ings of the 26th International Conference on Neural Information Processing Systems, 2, 1941–1949.
Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: The Lasso and generalizations. Boca Raton, FL: Chap-
man & Hall/CRC.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Hoffmann, I., Filzmoser, P., Serneels, S., & Varmuza, K. (2016). Sparse and robust PLS for binary classification. Journal of Chemometrics, 30
(4), 153–162.
Hoffmann, I., Serneels, S., Filzmoser, P., & Croux, C. (2015). Sparse partial robust M regression. Chemometrics and Intelligent Laboratory Sys-
tems, 149, 50–59.
Hubert, M., Reynkens, T., Schmitt, E., & Verdonck, T. (2016). Sparse PCA for high-dimensional data with outliers. Technometrics, 58,
424–434.
Hubert, M., Rousseeuw, P., & Van den Branden, K. (2005). ROBPCA: A new approach to robust principal component analysis.
Technometrics, 47(1), 64–79.
Hubert, M., & Verboven, S. (2003). A robust PCR method for high-dimensional regressors. Journal of Chemometrics, 17(8–9), 438–452.
Khan, J., Van Aelst, S., & Zamar, R. (2007). Robust linear model selection based on least angle regression. Journal of the American Statistical
Association, 102(480), 1289–1299.
Kim, H., Ollila, E., & Koivunen, V. (2015). New robust lasso method based on ranks. In 2015 23rd European Signal Processing Conference
(EUSIPCO) (pp. 699–703). Nice, France. https://doi.org/10.1109/EUSIPCO.2015.7362473.
Koenker, R. (2019). quantreg: Quantile regression (R package version 5.54) [Computer software manual]. Retrieved from https://CRAN.R-
project.org/package=quantreg
Kurnaz, F., Hoffmann, I., & Filzmoser, P. (2018a). enetLTS: Robust and sparse methods for high dimensional linear and logistic regression (R
package version 0.1.0) [Computer software manual]. Retrieved from https://CRAN.R-project.org/package=enetLTS
Kurnaz, F., Hoffmann, I., & Filzmoser, P. (2018b). Robust and sparse estimation methods for high-dimensional linear and logistic regression.
Chemometrics and Intelligent Laboratory Systems, 172, 211–222.
Lee, D., Lee, W., Lee, Y., & Pawitan, Y. (2011). Sparse partial least-squares regression and its applications to high-throughput data analysis.
Chemometrics and Intelligent Laboratory Systems, 109(1), 1–8.
Leung, A., Zhang, H., & Zamar, R. (2016). Robust regression estimation and inference in the presence of cellwise and casewise contamina-
tion. Computational Statistics & Data Analysis, 99, 1–11.
Liu, K. (1993). A new class of biased estimates in linear regression. Communications in Statistics: Theory and Methods, 22, 393–402.
Maechler, M., Rousseeuw, P., Croux, C., Todorov, V., Ruckstuhl, A., Salibian-Barrera, M.,… di Palma, M. A. (2018). robustbase: Basic robust
statistics (R package version 0.93-3) [Computer software manual]. Retrieved from http://robustbase.r-forge.r-project.org/
Maronna, R. (2011). Robust ridge regression for high-dimensional data. Technometrics, 53(1), 44–53.
Maronna, R., Martin, D., & Yohai, V. (2006). Robust statistics: Theory and methods. Toronto, ON: John Wiley & Sons Canada Ltd.
Maronna, R., Martin, D., Yohai, V., & Salibian-Barrera, M. (2019). Robust statistics: Theory and methods (with R) (2nd ed.). Hoboken, NJ:
John Wiley & Sons.
Maronna, R., & Zamar, R. (2002). Robust estimates of location and dispersion for high-dimensional data sets. Technometrics, 44(4), 307–317.
Möttönen, J., & Sillanpää, M. J. (2015). Robust variable selection and coefficient estimation in multivariate multiple regression using LAD-
lasso. In K. Nordhausen & S. Taskinen (Eds.), Modern nonparametric, robust and multivariate methods: Festschrift in honour of Hannu
Oja (pp. 235–247). Cham, Switzerland: Springer International Publishing.
Öllerer, V., Alfons, A., & Croux, C. (2016). The shooting S-estimator for robust regression. Computational Statistics, 31(3), 829–844.
Öllerer, V., Croux, C., & Alfons, A. (2015). The influence function of penalized regression estimators. Statistics, 49(4), 741–765. https://doi.
org/10.1080/02331888.2014.922563
Rousseeuw, P. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871–880.
Rousseeuw, P., & Leroy, A. (2003). Robust regression and outlier detection (2nd ed.). New York, NY: John Wiley & Sons.
Rousseeuw, P., & Van den Bossche, W. (2018). Detecting deviating data cells. Technometrics, 60(2), 135–145.
Rousseeuw, P., & Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
Rousseeuw, P., & Van Driessen, K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1), 29–45.
Rousseeuw, P., & van Zomeren, B. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Associa-
tion, 85(411), 633–639.
Scott, D. W. (2001). Parametric statistical modeling by minimum integrated square error. Technometrics, 43(3), 274–285.
Serneels, S., Croux, C., Filzmoser, P., & Van Espen, P. (2005). Partial robust M-regression. Chemometrics and Intelligent Laboratory Systems,
79(1–2), 55–64.
Serneels, S., & Hoffmann, I. (2016). sprm: Sparse and non-sparse partial robust m regression and classification (R package version 1.2.2) [Com-
puter software manual]. Retrieved from https://CRAN.R-project.org/package=sprm
Smucler, E., & Yohai, V. (2017). Robust and sparse estimators for linear regression models. Computational Statistics & Data Analysis, 111,
116–130.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series (Methodological), 58(1),
267–288.
Wang, H., Li, G., & Jiang, G. (2007). Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Busi-
ness & Economic Statistics, 25(3), 347–355.
Wang, Y., & Van Aelst, S. (2019a). Robust variable screening for regression using factor profiling. Statistical Analysis and Data Mining: The
ASA Data Science Journal, 12, 70–87.
Wang, Y., & Van Aelst, S. (2019b). Sparse principal component analysis based on least trimmed squares. Technometrics, 1–13. https://doi.
org/10.1080/00401706.2019.1671234
Wang, Y., & Zhu, L. (2017). Variable selection and parameter estimation via WLADSCAD with a diverging number of parameters. Journal of
the Korean Statistical Society, 46(3), 390–403.
Xu, S., & Zhang, C.-X. (2019). Robust sparse regression by modeling noise as a mixture of gaussians. Journal of Applied Statistics, 46(10),
1738–1755.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67(2),
301–320.
Zoubir, A. M., Koivunen, V., Ollila, E., & Muma, M. (2018). Robust statistics for signal processing. Cambridge, England: Cambridge University
Press.

How to cite this article: Filzmoser P, Nordhausen K. Robust linear regression for high-dimensional data: An
overview. WIREs Comput Stat. 2020;e1524. https://doi.org/10.1002/wics.1524
