Professional Documents
Culture Documents
Robust Regression and Outlier Detection With The ROBUSTREG Procedure PDF
Robust Regression and Outlier Detection With The ROBUSTREG Procedure PDF
net/publication/281873841
CITATIONS READS
19 1,076
1 author:
Colin Chen
J.P.Morgan
21 PUBLICATIONS 310 CITATIONS
SEE PROFILE
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Colin Chen on 19 September 2015.
Paper 265-27
Robust Regression and Outlier Detection with the
ROBUSTREG Procedure
Colin Chen, SAS Institute Inc., Cary, NC
1
SUGI 27 Statistics and Data Analysis
substantially improve the Ordinary Least Squares The OLS analysis of Figure 1 indicates that GAP and
(OLS) results for the growth study of De Long and EQP have a significant influence on GDP at the 5%
Summers. level.
De Long and Summers (1991) studied the national
The REG Procedure
growth of 61 countries from 1960 to 1985 using OLS. Model: MODEL1
Dependent Variable: GDP
data growth;
input country$ GDP LFG EQP NEQ GAP @@;
Parameter Estimates
datalines;
Argentin 0.0089 0.0118 0.0214 0.2286 0.6079 Parameter Standard
Austria 0.0332 0.0014 0.0991 0.1349 0.5809
Belgium 0.0256 0.0061 0.0684 0.1653 0.4109
Variable DF Estimate Error t Value Pr > |t|
Bolivia 0.0124 0.0209 0.0167 0.1133 0.8634
Botswana 0.0676 0.0239 0.1310 0.1490 0.9474 Intercept 1 -0.01430 0.01028 -1.39 0.1697
Brazil 0.0437 0.0306 0.0646 0.1588 0.8498
Cameroon 0.0458 0.0169 0.0415 0.0885 0.9333
LFG 1 -0.02981 0.19838 -0.15 0.8811
Canada 0.0169 0.0261 0.0771 0.1529 0.1783 GAP 1 0.02026 0.00917 2.21 0.0313
Chile 0.0021 0.0216 0.0154 0.2846 0.5402 EQP 1 0.26538 0.06529 4.06 0.0002
Colombia 0.0239 0.0266 0.0229 0.1553 0.7695
CostaRic 0.0121 0.0354 0.0433 0.1067 0.7043
NEQ 1 0.06236 0.03482 1.79 0.0787
Denmark 0.0187 0.0115 0.0688 0.1834 0.4079
Dominica 0.0199 0.0280 0.0321 0.1379 0.8293
Ecuador 0.0283 0.0274 0.0303 0.2097
ElSalvad 0.0046 0.0316 0.0223 0.0577
0.8205
0.8414
Figure 1. OLS Estimates
Ethiopia 0.0094 0.0206 0.0212 0.0288 0.9805
Finland
France
0.0301 0.0083 0.1206 0.2494
0.0292 0.0089 0.0879 0.1767
0.5589
0.4708
The following statements invoke the ROBUSTREG
Germany
Greece
0.0259 0.0047 0.0890 0.1885
0.0446 0.0044 0.0655 0.2245
0.4585
0.7924
procedure with the default M estimation.
Guatemal 0.0149 0.0242 0.0384 0.0516 0.7885
Honduras 0.0148 0.0303 0.0446 0.0954 0.8850
HongKong 0.0484 0.0359 0.0767 0.1233 0.7471
India 0.0115 0.0170 0.0278 0.1448 0.9356 proc robustreg data=growth;
Indonesi 0.0345 0.0213 0.0221 0.1179 0.9243 model GDP = LFG GAP EQP NEQ /
Ireland 0.0288 0.0081 0.0814 0.1879 0.6457
Israel 0.0452 0.0305 0.1112 0.1788 0.6816 diagnostics leverage;
Italy 0.0362 0.0038 0.0683 0.1790 0.5441
IvoryCoa 0.0278 0.0274 0.0243 0.0957 0.9207
output out=robout r=resid sr=stdres;
Jamaica 0.0055 0.0201 0.0609 0.1455 0.8229 run;
Japan 0.0535 0.0117 0.1223 0.2464 0.7484
Kenya 0.0146 0.0346 0.0462 0.1268 0.9415
Korea 0.0479 0.0282 0.0557 0.1842 0.8807
Luxembou 0.0236 0.0064 0.0711 0.1944 0.2863
Madagasc -0.0102 0.0203 0.0219 0.0481 0.9217 Figure 2 displays model information and summary
Malawi 0.0153 0.0226 0.0361 0.0935 0.9628
Malaysia 0.0332 0.0316 0.0446 0.1878 0.7853 statistics for variables in the model. Figure 3 displays
Mali 0.0044 0.0184 0.0433 0.0267 0.9478
Mexico 0.0198 0.0349 0.0273 0.1687 0.5921 the M estimates. Besides GAP and EQP , the robust
Morocco 0.0243 0.0281 0.0260 0.0540 0.8405
Netherla 0.0231 0.0146 0.0778 0.1781 0.3605 analysis also indicates N EQ has significant impact
Nigeria -0.0047 0.0283 0.0358 0.0842 0.8579
Norway 0.0260 0.0150 0.0701 0.2199 0.3755 on GDP . This new finding is explained by Figure 4,
Pakistan 0.0295 0.0258 0.0263 0.0880 0.9180
Panama 0.0295 0.0279 0.0388 0.2212 0.8015 which shows that Zambia, the sixtieth country in the
Paraguay 0.0261 0.0299 0.0189 0.1011 0.8458
Peru 0.0107 0.0271 0.0267 0.0933 0.7406 data, is an outlier. Figure 4 also displays leverage
Philippi 0.0179 0.0253 0.0445 0.0974 0.8747
Portugal 0.0318 0.0118 0.0729 0.1571 0.8033 points; however, there are no serious high leverage
Senegal -0.0011 0.0274 0.0193 0.0807 0.8884
Spain 0.0373 0.0069 0.0397 0.1305 0.6613 points.
SriLanka 0.0137 0.0207 0.0138 0.1352 0.8555
Tanzania 0.0184 0.0276 0.0860 0.0940 0.9762
Thailand 0.0341 0.0278 0.0395 0.1412 0.9174
Tunisia 0.0279 0.0256 0.0428 0.0972 0.7838
U.K. 0.0189 0.0048 0.0694 0.1132 0.4307
The ROBUSTREG Procedure
U.S. 0.0133 0.0189 0.0762 0.1356 0.0000
Uruguay 0.0041 0.0052 0.0155 0.1154 0.5782 Model Information
Venezuel 0.0120 0.0378 0.0340 0.0760 0.4974
Zambia -0.0110 0.0275 0.0702 0.2012 0.8695
Zimbabwe 0.0110 0.0309 0.0843 0.1257 0.8875 Data Set MYLIB.GROWTH
; Dependent Variable GDP
Number of Covariates 4
Number of Observations 61
Name of Method M-Estimation
The regression equation they used is
GDP = β0 +β1 LF G+β2 GAP +β3 EQP +β4 N EQ+, Summary Statistics
Standard
where the response variable is the GDP growth Variable Q1 Median Q3 Mean Deviation
per worker (GDP ) and the regressors are the con- LFG 0.0118 0.0239 0.02805 0.02113 0.009794
stant term, labor force growth (LF G), relative GDP GAP 0.57955 0.8015 0.88625 0.725777 0.21807
EQP 0.0265 0.0433 0.072 0.052325 0.029622
gap (GAP ), equipment investment (EQP ), and non- NEQ 0.09555 0.1356 0.1812 0.139856 0.056966
equipment investment (N EQ). GDP 0.01205 0.0231 0.03095 0.022384 0.015516
LFG 0.009489
proc reg data=growth; GAP 0.177764
model GDP = LFG GAP EQP NEQ ; EQP 0.032469
NEQ 0.062418
run; GDP 0.014974
2
SUGI 27 Statistics and Data Analysis
Intercept 0.0106
Figure 5. LTS estimates
LFG 0.5775
GAP 0.0038 Figure 6 displays outlier and leverage point diagnos-
EQP <.0001 tics based on the LTS estimates. Figure 7 displays the
NEQ 0.0069 final weighted least square estimates, which are iden-
Scale
tical to those reported in Zaman, Rousseeuw, and
Figure 3. M estimates Orhan (2001).
Diagnostics Profile
Diagnostics Profile
Name Percentage Cutoff
Name Percentage Cutoff
Outlier 0.0164 3.0000
Outlier 0.0164 3.0000 Leverage 0.2131 3.3382
Leverage 0.2131 3.3382
3
SUGI 27 Statistics and Data Analysis
Intercept 0.0165 As shown in the growth study, the OLS estimate can
LFG 0.7995 be significantly influenced by a single outlier. To
GAP 0.0026
EQP <.0001 bound the influence of outliers, Huber (1973) intro-
NEQ 0.0064 duced the M estimate.
Scale
Huber-type Estimates
Figure 7. Final Weighted LS estimates
Instead of minimizing a sum of squares, a Huber-type
The following section provides some theoretical back-
ground for robust estimates. M estimator θ̂M of θ minimizes a sum of less rapidly
increasing functions of the residuals:
Robust Estimates n
X ri
Let X = (xij ) denote an n × p matrix, y = (y1 , ..., yn ) T Q(θ) = ρ( )
i=1
σ
a given n-vector of responses, and θ = (θ1 , ..., θp )T an
unknown p-vector of parameters or coefficients whose
components have to be estimated. The matrix X where r = y − Xθ. For the OLS estimate, ρ is the
is called a design matrix. Consider the usual linear quadratic function.
model If σ is known, by taking derivatives with respect to θ,
θ̂M is also a solution of the system of p equations:
y = Xθ + e
n
X ri
where e = (e1 , ..., en )T is an n-vector of unknown er- ψ( )xij = 0, j = 1, ..., p
σ
rors. It is assumed that (for given X) the components i=1
ei of e are independent and identically distributed ac-
cording to a distribution L(·/σ), where σ is a scale pa- where ψ = ρ0 . If ρ is convex, θ̂M is the unique solution.
rameter (usually unknown). Often L(·/σ) = Φ(·), the
standard normal distribution with density φ(s) = (1/ PROC ROBUSTREG solves this system by using it-
√
2π)exp(−s2 /2). r = (r1 , ..., rn )T denotes the n- eratively reweighted least squares (IRLS). The weight
vector of residuals for a given value of θ and by xTi function w(x) is defined as
the i-th row of the matrix X.
ψ(x)
The Ordinary Least Squares (OLS) estimate θ̂LS of θ w(x) =
x
is obtained as the solution of the problem
PROC ROBUSTREG provides ten kinds of weight
min QLS (θ) functions (corresponding to ten ρ-functions) through
θ
the WEIGHTFUNCTION= option in the MODEL state-
where QLS (θ) = 1
Pn 2 ment. The scale parameter σ can be specified using
2 i=1 ri .
the SCALE= option in the PROC statement.
Taking the partial derivatives of QLS with respect to
the components of θ and setting them equal to zero If σ is unknown, then the function
yields the normal equations
n
X ri
T T
Q(θ, σ) = [ρ( ) + a]σ
XX θ = X y i=1
σ
4
SUGI 27 Statistics and Data Analysis
2. METHOD=M(SCALE=TUKEY<(D=d)>) This
option obtains σ̂ by solving the supplementary Both OLS estimation and M estimation suggest that
equation observations 11 to 14 are serious outliers. However,
n
these four observations were generated from the un-
1 X ri derlying model and observations 1 to 10 were con-
χd ( ) = β
n − p i=1 σ taminated. The reason that OLS estimation and M
estimation do not pick up the bad observations is that
where they cannot distinguish good leverage points (obser-
vations 11 to 14) from bad leverage points (observa-
3x2 3x4 x6
d2 − d4 + d6 if |x| < d tions 1 to 10). In such cases, high breakdown value
χd (x) =
1 otherwise, estimates are needed.
0 LTS estimate
R d being Tukey’s Biweight function, and β =
χ
χd (s)dΦ(s) is the constant such that the so- The least trimmed squares (LTS) estimate proposed
lution σ̂ is asymptotically consistent when L(·/ by Rousseeuw (1984) is defined as the p-vector
σ) = Φ(·) (refer to Hampel et al. 1986, p. 149).
You can specify d by the D= option. By default, θ̂LT S = arg min QLT S (θ)
d = 2.5. θ
h
σ̂ (m+1) = medni=1 |yi − xTi θ̂(m) |/β0 X
2
QLT S (θ) = r(i)
where β0 = Φ−1 (.75) is the constant such that i=1
5
SUGI 27 Statistics and Data Analysis
6
SUGI 27 Statistics and Data Analysis
Variable MAD
Diagnostics Profile
x1 1.927383
x2 1.630862 Name Percentage Cutoff
x3 1.779123
y 0.889561 Outlier 0.1333 3.0000
Leverage 0.1867 3.0575
Figure 10 displays parameter estimates for covariates Intercept 1 -0.1805 0.0968 -0.3702 0.0093 3.47
x1 1 0.0814 0.0618 -0.0397 0.2025 1.73
and scale. Two robust estimates of the scale pa- x2 1 0.0399 0.0375 -0.0336 0.1134 1.13
rameter are displayed. The weighted scale estimate x3 1 -0.0517 0.0328 -0.1159 0.0126 2.48
Scale 0.5165
(Wscale) is a more efficient estimate of the scale pa-
rameter. Parameter Estimates
for Final Weighted
LS
The ROBUSTREG Procedure
Parameter Pr > ChiSq
LTS Parameter Estimates
Intercept 0.0623
Parameter DF Estimate x1 0.1879
x2 0.2875
Intercept 1 -0.3431 x3 0.1150
x1 1 0.0901 Scale
x2 1 0.0703
x3 1 -0.0731 Figure 12. Final Weighted LS Estimates
Scale 0.7451
WScale 0.5749
MM estimate
Figure 10. LTS Parameter Estimates MM estimation is a combination of high breakdown
Figure 11 displays outlier and leverage point diagnos- value estimation and efficient estimation, which was
tics. The ID variable index is used to identify the ob- introduced by Yohai (1987). It has three steps:
servations. The first ten observations are identified
as outliers and observations 11 to 14 are identified as 1. Compute an initial (consistent) high break-
0
good leverage points. down value estimate θ̂ . PROC ROBUSTREG
provides two kinds of estimates as the initial
estimate, the LTS estimate and the S estimate.
By default, PROC ROBUSTREG uses the LTS
7
SUGI 27 Statistics and Data Analysis
estimate because of its speed, efficiency, and for ρ corresponding to the two kinds of χ func-
high breakdown value. The breakdown value tions, respectively.
of the final MM estimate is decided by the Tukey: With the option CHIF=TUKEY,
breakdown value of the initial LTS estimate and
the constant k0 in the CHI function. To use the ρ(s) = χk1 (s) =
S estimate as the initial estimate, you need
3( ks1 )2 − 3( ks1 )4 + ( ks1 )6 , if |s| ≤ k1
to specify the INITEST=S option in the PROC
statement. In this case, the breakdown value 1 otherwise
of the final MM estimate is decided only by
the constant k0 . Instead of computing the LTS where k1 can be specified by the K1= option.
estimate or the S estimate as initial estimates, The default k1 is 3.440 such that the MM
you can also specify the initial estimate using estimate has 85% asymptotic efficiency with the
the INEST= option in the PROC statement. Gaussian distribution.
0
2. Find σ̂ such that Yohai: With the option CHIF=YOHAI,
0
such that QM M (θ̂M M ) ≤ QM M (θ̂ ). The al-
gorithm for M estimate is used here. PROC
ROBUSTREG provides two kinds of functions
8
SUGI 27 Statistics and Data Analysis
and the robust deviance is defined as the optimal [1/(n − p)] (ψ(ri ))2 −1
P
value of the objective function on the σ 2 -scale: H2: K W
[(1/n) (ψ 0 (ri ))]
P
X yi − xTi θ̂ 1
D = 2(ŝ)2 ρ( )
X
ŝ H3: K −1 (ψ(ri ))2 W −1 (X T X)W −1
(n − p)
where ρ is the objective function for the robust esti- 0
mate, µ̂ is the robust location estimator, and ŝ is the where K = 1 + np var(ψ
(Eψ 0 )2
)
is the correction factor and
robust scale estimator in the full model. Wjk =
P 0
ψ (ri )xij xik . Refer to Huber (1981, p. 173)
The Information Criterion is a powerful tool for model for more details.
selection. The counterpart of the Akaike (1974) AIC Linear Tests
criterion for robust regression is defined as
Two tests are available in PROC ROBUSTREG for the
n
X canonical linear hypothesis
AICR = 2 ρ(ri:p ) + αp
i=1 H0 : θj = 0, j = q + 1, ..., p
9
SUGI 27 Statistics and Data Analysis
Asymptotically Sn2 ∼ λχ2p−q under H0 , where the stan- The following statements invoke the GLM procedure
R 0 for a standard ANOVA.
dardization factor is λ = ψ 2 (s)dΦ(s)/ ψ (s)dΦ(s)
R
Rn2 -test:
The results in Figure 14 indicate that neither treatment
The test statistic for the Rn2 -test is defined as
is significant at the 10% level.
−1
Rn2 = n(θ̂q+1 , ..., θ̂p )H22 (θ̂q+1 , ..., θ̂p )T The GLM Procedure
10
SUGI 27 Statistics and Data Analysis
Parameter Estimates
Parameter Estimates
Intercept <.0001
T1 0 0.0184
T1 1 .
T2 0 0.0081
T2 1 .
T1*T2 0 0 0.9490
T1*T2 0 1 .
T1*T2 1 0 .
T1*T2 1 1 .
Scale
Diagnostics
Robust
Obs Residual Outlier
4 5.7722 *
Diagnostics Profile
11
SUGI 27 Statistics and Data Analysis
diagnostics(all) leverage;
id index; Scalable
run; speedup with
Scalable intercept
title "Standard Residual - numObs speedup adjustment
Robust Distance Plot";
symbol v=plus h=2.5 pct; 50000 0.75835 1.26137
proc gplot data=Diagnostics; 100000 2.20000 2.24629
plot RResidual * Robustdis / 200000 3.41848 3.44375
hminor = 0 300000 4.13016 4.01907
vminor = 0 400000 4.68829 4.28268
vaxis = axis1 500000 4.91398 4.33051
vref = -3 3 750000 5.79769 4.94009
frame; 1000000 5.80328 4.30432
axis1 label = ( r=0 a=90 );
run;
RobustReg Timing and Speedup Results
title "Robust Distance - for Wide Data (5000 observations)
Mahalanobis Distance Plot";
symbol v=plus h=2.5 pct; num num Time 8 Unthreaded Scalable
proc gplot data=Diagnostics; Vars Obs threads time speedup
plot Robustdis * Mahalanobis /
hminor = 0 50 5000 17.69 24.06 1.36009
vminor = 0 100 5000 40.29 76.45 1.89749
vaxis = axis1 200 5000 128.14 319.80 2.49571
vref = 3.0575 250 5000 207.21 520.15 2.51026
frame;
axis1 label = ( r=0 a=90 );
run;
References
These plots are helpful in identifying outliers, good,
and bad leverage points. Akaike, H. (1974), “A New Look at the Statistical
Identification Model,” IEEE Trans. Automat
Control 19, 716–723.
Scalability
Brownlee, K. A. (1965), Statistical Theory and
The ROBUSTREG procedure implements parallel al-
Methodology in Science and Engineering, 2nd
gorithms for LTS- and S estimation. You can use the
ed. John Wiley & Sons, New York.
global SAS option CPUCOUNTS to specify the num-
ber of threads to use in the computation: Cohen, R. A. (2002), “SAS Meets Big Iron:
High Performance Computing in SAS Analytic
OPTIONS CPUCOUNTS=1-256|ACTUAL ;
Procedures”, Proceedings of the Twenty-
More details about multithreading in SAS Version 9 Seventh Annual SAS Users Group International
can be found in Cohen (2002). Conference, Cary, NC: SAS Institute Inc.
The following table contains some empirical results for Coleman, D. Holland, P., Kaden, N., Klema, V.,
LTS estimation we got from using a single processor and Peters, S. C. (1980), “A system of subrou-
and multiple processors (with 8 processors) on a SUN tines for iteratively reweighted least-squares com-
multi-processor workstation (time in seconds): putations,” ACM Transactions on Mathematical
Software, 6, 327-336.
RobustReg Timing and Speedup Results De Long, J.B., Summers, L.H. (1991), “Equipment
for Narrow Data (10 regressors)
investment and economic growth”. Quarterly
num Time 8 Unthreaded Journal of Economics, 106, 445-501.
numObs Vars threads time
50000 10 7.78 5.90 Hampel, F. R., Ronchetti, E.M., Rousseeuw, P.J.
100000 10 10.70 23.54 and Stahel, W.A. (1986), Robust Statistics, The
200000 10 23.49 80.30 Approach Based on Influence Functions, John
300000 10 41.41 171.03
Wiley & Sons, New York.
400000 10 63.20 296.30
500000 10 93.00 457.00 Hawkins, D. M., Bradu, D. and Kass, G. V. (1984),
750000 10 173.00 1003.00
1000000 10 305.00 1770.00 “Location of several outliers in multiple regression
data using elemental sets,” Technometrics , 26,
197-208.
RobustReg Timing and Speedup Results
for Narrow Data (10 regressors)
12
SUGI 27 Statistics and Data Analysis
Holland, P. and Welsch, R. (1977), “Robust regres- Statistics and Diagnostics, Part II, Springer-
sion using interatively reweighted least-squares,” Verlag, New Work.
Commun. Statist. Theor. Meth. 6, 813-827.
Yohai, V.J. and Zamar, R.H. (1997), "Optimal locally
Huber, P.J. (1973), “Robust regression: Asymptotics, robust M estimate of regression". Journal of
conjectures and Monte Carlo,” Ann. Stat., 1, 799- Statist. Planning and Inference, 64, 309-323.
821.
Zaman, A., Rousseeuw, P.J., Orhan, M. (2001),
Huber, P.J. (1981), Robust Statistics. John Wiley & “Econometric applications of high-breakdown
Sons, New York. robust regression techniques”, Econometrics
Letters, 71, 1-8.
Marazzi, A. (1993), Algorithm, Routines, and S
Functions for Robust Statistics, Wassworth &
Brooks / Cole, Pacific Grove, CA.
Acknowledgments
Ronchetti, E. (1985), “Robust Model Selection in
Regression,” Statistics and Probability Letters, 3, The author would like to thank Dr. Robert Rodriguez,
21-23. Professor Peter Rousseeuw, Victor Yohai, Zamma
Ruben, Alfio Marazzi, Xuming He and Roger Koenker
Rousseeuw, P.J. (1984), “Least Median of Squares for reviewing, comments, and discussions.
Regression,” Journal of the American Statistical
Association, 79, 871-880.
Contact Information
Rousseeuw, P.J. and Hubert, M. (1996), “Recent
Development in PROGRESS,” Computational Lin (Colin) Chen, SAS Institute Inc., SAS Campus
Statistics and Data Analysis, 21, 67-85. Drive, Cary, NC 27513. Phone (919) 531-6388, FAX
(919) 677-4444, Email Lin.Chen@sas.com.
Rousseeuw, P.J. and Leroy, A.M. (1987), Robust
Regression and Outlier Detection, Wiley- SAS software and SAS/STAT are registered trade-
Interscience, New York (Series in Applied marks or trademarks of SAS Institute Inc. in the USA
Probability and Statistics), 329 pages. ISBN and other countries. indicates USA registration.
0-471-85233-3.
Rousseeuw, P.J. and Van Driessen, K. (1999), “A
Fast Algorithm for the Minimum Covariance
Determinant Estimator,” Technometrics , 41,
212-223.
Rousseeuw, P.J. and Van Driessen, K. (1998),
“Computing LTS Regression for Large Data
Sets,” Technical Report, University of Antwerp,
submitted.
Rousseeuw, P.J. and Yohai, V. (1984), “Robust
Regression by Means of S estimators”, in Robust
and Nonlinear Time Series Analysis, edited by
J. Franke, W. Härdle, and R.D. Martin, Lecture
Notes in Statistics 26, Springer Verlag, New York,
256-274.
Ruppert, D. (1992), “Computing S Estimators
for Regression and Multivariate
Location/Dispersion,” Journal of Computational
and Graphical Statistics, 1, 253-270.
Yohai V.J. (1987), “High Breakdown Point and High
Efficiency Robust Estimates for Regression,”
Annals of Statistics, 15, 642-656.
Yohai V.J., Stahel, W.A. and Zamar, R.H. (1991), “A
Procedure for Robust Estimation and Inference
in Linear Regression,” in Stahel, W.A. and
Weisberg, S.W., Eds., Directions in Robust
13