
Chapter 3

Diagnostics and Remedial Measures
Introduction
• We must examine the aptness of the model for the data before inferences based on the model are undertaken
• In this chapter, we discuss some simple graphic methods for studying the appropriateness of a model and some formal statistical tests
Diagnostics for predictor
variables
Graphic diagnostics
• We need diagnostic info. about the predictor
variable to see if there are any outlying X values
that could influence the appropriateness of the
fitted regression function
• Diagnostic info. about the range and concentration
of the X levels is useful for ascertaining the range
of validity for the regression analysis

Example-Toluca Company

Example - Toluca Company
Figure 3.2(a) Dot Plot
• Used when the no. of observations in the data set is not large
• Shows that in a no. of cases several runs were made for the same lot size
Figure 3.2(b) Sequence Plot
• Contains a time sequence plot of the lot sizes
• Shows the time sequence more effectively
• Should be utilized whenever data are obtained in a sequence
Figure 3.2(c) Stem-and-Leaf Plot
• Provides info. similar to a frequency histogram
• M - Median
• H - First and third quartiles
Figure 3.2(d) Box Plot
• Shows max, min, 1st and 3rd quartiles, and median lot size
• Useful when there are many observations in the data set
• Lot sizes are fairly symmetrically distributed
Residuals
• Direct diagnostic plots for the response variable Y are ordinarily not too useful in regression analysis because the values of the observations on the response variable are a function of the level of the predictor variable. Instead, we examine the residuals.
• Residual: ei = Yi − Ŷi
Properties of Residuals
• Mean of residuals:
  ē = Σei / n = 0
• Variance of residuals:
  s² = Σ(ei − ē)² / (n − 2) = Σei² / (n − 2) = SSE / (n − 2) = MSE
• Non-independence
  – Residuals ei are not independent random variables because they involve the fitted values Ŷi, which are based on the same fitted regression function.
  – Residuals are subject to 2 constraints:
      Σei = 0
      ΣXiei = 0
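The two constraints can be verified numerically. A minimal sketch in Python (the data here are made up for illustration, not the Toluca data):

```python
import numpy as np

# Illustrative data (hypothetical, not the Toluca data)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Ordinary least squares estimates b0, b1
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()

e = Y - (b0 + b1 * X)                    # residuals e_i = Y_i - Yhat_i
mse = np.sum(e ** 2) / (len(X) - 2)      # SSE / (n - 2)

# The two constraints hold (up to floating-point rounding)
print(abs(e.sum()) < 1e-9)        # sum of e_i = 0
print(abs((X * e).sum()) < 1e-9)  # sum of X_i * e_i = 0
```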
Semistudentized Residuals
• Semistudentized residual, ei*:
  ei* = (ei − ē) / √MSE = ei / √MSE
  – If √MSE were an estimate of the standard deviation of the residual ei, we would call ei* a studentized residual.
  – However, the standard deviation of ei is complex and varies for the different residuals ei, and √MSE is only an approximation of the standard deviation of ei. Hence, ei* is called a semistudentized residual.
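Semistudentization is just a rescaling of each residual by √MSE. A quick numerical illustration (hypothetical residuals, not from the text):

```python
import numpy as np

# Hypothetical residuals from some fitted line; they sum to 0 by construction
e = np.array([-1.2, 0.4, 2.1, -0.8, -0.5])
mse = np.sum(e ** 2) / (len(e) - 2)   # SSE / (n - 2)

e_star = e / np.sqrt(mse)             # semistudentized residuals e_i*
print(np.round(e_star, 2))
```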
Departures from model
• Residuals can be used to examine six important
types of departures from the normal error
regression model
1. Regression function is not linear
2. Error terms do not have constant variance
3. Error terms are not independent
4. Model fits all but one or a few outliers
5. Error terms are not normally distributed
6. One or several important predictor variables have been omitted from the model
Diagnostics for residuals
Some informal diagnostic plots of residuals can
provide info. on departures from model:
• Plot of residuals against predictor variables
• Plot of absolute or squared residuals against predictor
variable
• Plot of residuals against fitted values
• Plot of residuals against time or other sequence
• Plots of residuals against omitted predictor variables
• Box plot of residuals
• Normal probability plot of residuals

Nonlinearity of regression function
• Can be studied from a plot of residuals vs. the predictor variable, a plot of residuals vs. fitted values, or a scatter plot
• Data for the figures below: Table 3.1 (p. 105)
• Figure 3.3a strongly suggests that a linear regression function is not appropriate.
• Figure 3.3b strongly suggests lack of fit of the linear regression fn. – residuals depart from 0 in a systematic fashion.
Advantages of Residual Plot over Scatter Plot
• A residual plot can easily be used for examining other facets of the aptness of the model.
• It is difficult to study the appropriateness of a linear regression function from the scatter plot due to scaling – the observations Yi are close to the fitted values Ŷi.
• A residual plot can show any systematic pattern in the deviations around the fitted regression line.
• Fig. 3.4a – Linear regression model is appropriate. The residuals fall within a horizontal band centered around 0.
• Fig. 3.4b – Departure from the linear regression model. A curvilinear regression fn. seems more appropriate.
Fig. 3.4 Prototype residual plots

Nonconstancy of Error Variance
• Plots of the residuals against the predictor variable or against the fitted values are also useful for checking whether the variance of the error term is constant
• Refer to Fig. 3.5(a), p. 107 – the error variance increases with the age of the women
• Plots of the absolute values of the residuals or of the squared residuals against X or against the fitted values Ŷ are also useful for diagnosing nonconstancy of the error variance
• Refer to Fig. 3.5(b) – residuals tend to be larger in absolute magnitude for older-aged women
Presence of Outliers
• Outliers are extreme observations
• Residual outliers can be identified from residual plots against X or Ŷ, as well as from box plots, stem-and-leaf plots, and dot plots of the residuals.
• Plotting semistudentized residuals (SR) is helpful for distinguishing outlying obs., since it then becomes easy to identify residuals that lie many s.d. from zero
• Rule of thumb: treat an SR with absolute value of 4 or more as an outlier
• Refer to Fig. 3.6 (p. 108) – example of an outlier
• Outliers can cause great difficulty
• If an outlier resulted from a mistake or an extraneous effect, it should be discarded
• With the least squares mtd., a fitted line may be pulled disproportionately towards an outlying obs.
• This could cause a misleading fit if the outliers are indeed due to errors
• However, outliers may convey significant info.
• Safe rule: Discard an outlier if there is evidence that it represents an error in recording, a miscalculation, a malfunctioning of equipment, or a similar type of circumstance
Residual Plot with outlier

• When there are only a small no. of cases and an outlier is
present, the fitted regression can be so distorted that the
residual plot may inappropriately suggest a lack of fit of
the regression model
– Lead to a systematic pattern of deviations from the
fitted line for the other observations

Nonindependence of Error Terms
• Whenever data are obtained in a time sequence or some
other type of sequence, a sequence plot of the residuals
should be prepared
• The purpose of plotting the residuals against time or in some other type of sequence is to see if there is any correlation between error terms that are near each other in the sequence
• Refer to Fig. 3.8a
• When the error terms are independent, we expect the residuals in a sequence plot to fluctuate randomly around the baseline 0
Nonnormality of Error Terms
• The normality of the error terms can be studied informally by examining the residuals in a variety of graphic ways.
• Distribution Plots
  – A box plot of the residuals is helpful for obtaining summary information about the symmetry of the residuals and about possible outliers
  – A histogram, dot plot, or stem-and-leaf plot of the residuals is also helpful for detecting gross departures from normality
  – However, the no. of cases should be large for these plots to convey reliable info. about the shape of the dist. of the error terms
• Comparison of frequencies
  – Another possibility when the no. of cases is reasonably large is to compare actual frequencies of the residuals against expected frequencies under normality
  – Determine whether approx. 68% of the residuals fall within ±√MSE or about 90% fall within ±1.645√MSE
  – When the sample size is moderately large, corresponding t values may be used
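A rough numerical check of these frequencies (simulated residuals stand in for real ones here; a sketch only):

```python
import numpy as np

rng = np.random.default_rng(42)
e = rng.normal(0.0, 2.0, size=5000)              # stand-in residuals
root_mse = np.sqrt(np.sum(e ** 2) / (len(e) - 2))

frac_1 = np.mean(np.abs(e) <= root_mse)             # expect about 0.68
frac_1645 = np.mean(np.abs(e) <= 1.645 * root_mse)  # expect about 0.90
print(round(frac_1, 2), round(frac_1645, 2))
```

Gross departures from these proportions in real residuals would point toward nonnormal error terms.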
• Normal probability Plot
– Prepare normal probability plot of the residuals
– Each residual is plotted against its expected
value under normality
– A plot that is nearly linear suggests agreement
with normality, whereas a plot that departs
substantially from linearity suggests that the
error distribution is not normal

• Residuals and expected values under normality (Toluca Co. example): MSE = 2,384
• To find the expected value under normality, we utilize the fact that the expected value of the error terms is zero and the s.d. is estimated by √MSE
• The expected value of the kth smallest obs. in a random sample of n is approx.:
    √MSE · z[(k − .375) / (n + .25)]
• Normal probability plots when the error term distribution is not normal – see Fig. 3.9 (p. 112)
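The expected values can be computed with the standard normal inverse CDF. A sketch using only the Python standard library, with n = 25 and the MSE value quoted above (the function name is ours):

```python
import math
from statistics import NormalDist

def expected_order_stats(n, mse):
    """Approximate expected value of the k-th smallest residual,
    k = 1..n, under normality: sqrt(MSE) * z[(k - .375) / (n + .25)]."""
    z = NormalDist().inv_cdf
    s = math.sqrt(mse)
    return [s * z((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]

vals = expected_order_stats(25, 2384)   # Toluca example: n = 25, MSE = 2,384
```

Plotting the ordered residuals against `vals` gives the normal probability plot; near-linearity supports normality.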
F test for lack of fit
• Formal test for determining whether a
specific type of regression function
adequately fits the data
• The lack of fit test assumes that the observations Y for a given X are (1) independent and (2) normally distributed, and that (3) the distributions of Y have the same variance σ²
• Requires repeat obs. at one or more X levels – replications. The resulting obs. are called replicates.
Example (p. 120)
• In an experiment involving 12 similar but scattered suburban branch offices
of a commercial bank, holders of checking accounts at the offices were
offered gifts for setting up money market accounts. Minimum initial deposits
in the new money market account were specified to qualify for the gift. The
value of the gift was directly proportional to the specified minimum deposit.
Various levels of minimum deposit and related gift values were used in the
experiment in order to ascertain the relation between the specified minimum
deposit and gift value, on the one hand, and number of accounts opened at the
office, on the other. Altogether, six levels of minimum deposit and
proportional gift value were used, with two of the branch offices assigned at
random to each level. One branch office had a fire during the period and was
dropped from the study. Refer to Table 3.4a where X = amount of min.
deposit and Y = no. of new money market accounts that were opened and
qualified for the gift during test period.

• A linear regression fn. was fitted:
    Ŷ = 50.72251 + 0.48670X
• The ANOVA table is shown in Table 3.4b.
• Fig. 3.11 shows the scatter plot with the fitted regression line – it indicates that a linear regression fn. is inappropriate
• To test this formally, we use the general linear test approach
Notation:
• Denote the different X levels as X1, …, Xc
• nj = no. of replicates for the jth level of X (n1 = n2 = n3 = n5 = n6 = 2 and n4 = 1). Total no. of obs.: n = Σ nj (sum over j = 1, …, c)
• Denote the observed value of the response variable for the ith replicate at the jth level of X by Yij (i = 1, …, nj; j = 1, …, c)
• Denote the mean of the Y obs. at level X = Xj by Ȳj
• Full Model
  – The full model (FM) used for the lack of fit test makes the same assumptions as the simple linear regression model except for assuming a linear regression relation, the subject of the test:
      Yij = μj + εij    (Full model)
    where: μj are parameters, j = 1, …, c
           εij are independent N(0, σ²)
  – Since the error terms have expectation 0, E{Yij} = μj
  – Thus, the parameter μj (j = 1, …, c) is the mean response when X = Xj
  – To fit the full model to the data, we need the least squares or max. likelihood estimators for μj, which are simply the sample means Ȳj
  – Thus, the estimated expected value for obs. Yij is Ȳj, and the SSE for the full model is:
      SSE(F) = ΣjΣi (Yij − Ȳj)² = SSPE
    where SSPE = pure error sum of squares
  – Note that SSPE is the sum of squared deviations of the Yij around the mean Ȳj at each X level, summed over all X levels
– For the bank example, we have:
    SSPE = (28 − 35)² + (42 − 35)² + … + (124 − 114)² + (104 − 114)² = 1,148
– Degrees of freedom for SSPE:
    dfF = Σj (nj − 1) = Σj nj − c = n − c
– For the bank example, dfF = 11 − 6 = 5

• Reduced Model
  – For testing the appropriateness of a linear regression relation:
      H0: E{Y} = β0 + β1X
      Ha: E{Y} ≠ β0 + β1X
  – Thus, H0 postulates that μj in the full model is linearly related to Xj:
      μj = β0 + β1Xj
  – The reduced model under H0 is therefore:
      Yij = β0 + β1Xj + εij    (Reduced model)
– Note that the reduced model is the ordinary simple linear regression model. Thus,
    Ŷij = b0 + b1Xj
– Hence, the error sum of squares for the reduced model is the usual SSE:
    SSE(R) = ΣΣ (Yij − b0 − b1Xj)² = ΣΣ (Yij − Ŷij)² = SSE
– We also know that the df assoc. with SSE(R) are:
    dfR = n − 2

Test Statistic
– The general linear test statistic:
    F* = [SSE(R) − SSE(F)] / (dfR − dfF) ÷ [SSE(F) / dfF]
  here becomes:
    F* = (SSE − SSPE) / [(n − 2) − (n − c)] ÷ [SSPE / (n − c)]
– SSE − SSPE = SSLF (lack of fit sum of squares)


• Thus,
    F* = [SSLF / (c − 2)] ÷ [SSPE / (n − c)] = MSLF / MSPE
  where MSLF = lack of fit mean square and MSPE = pure error mean square
• Decision rule:
    If F* ≤ F(1 − α; c − 2, n − c), conclude H0
    If F* > F(1 − α; c − 2, n − c), conclude Ha
• Bank example:
    SSPE = 1,148.0                      n − c = 11 − 6 = 5
    SSE = 14,741.6                      c − 2 = 6 − 2 = 4
    SSLF = 14,741.6 − 1,148.0 = 13,593.6
• If α = 0.01, then F(.99; 4, 5) = 11.4.
• Since F* = 14.80 > 11.4, we conclude Ha, i.e. the regression fn. is not linear. P-value = 0.006
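The arithmetic for the bank example can be assembled directly from the sums of squares quoted above (a sketch only):

```python
# Lack-of-fit F test for the bank example, using the quantities from the slides
sse, sspe = 14741.6, 1148.0     # SSE = SSE(R), SSPE = SSE(F)
n, c = 11, 6                    # 11 observations, 6 distinct X levels

sslf = sse - sspe               # lack-of-fit sum of squares = 13,593.6
mslf = sslf / (c - 2)           # MSLF with c - 2 = 4 df
mspe = sspe / (n - c)           # MSPE with n - c = 5 df

f_star = mslf / mspe
print(f"{f_star:.2f}")          # 14.80, matching the slides
```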
Remedial Measures
• If the simple linear regression model is not appropriate for a data set:
  – Abandon the linear regression model and develop and use a more appropriate model
  – Employ some transformation on the data so that the linear regression model is appropriate for the transformed data
Transformations
• Simple transformations of X or Y or both are often
sufficient to make the simple linear regression
model appropriate
• Transformation for nonlinear relation only
– When the dist. of the error terms is reasonably
close to a normal dist. and the error terms have
approx. constant variance, transformation on X
should be attempted
– Transformation on Y may change the shape of
the dist. of the error terms from the normal dist.
and affect constancy of error term variances

• Fig. 3.13 shows some prototype nonlinear regression relations with constant error variance and presents simple transformations on X to linearize the regression relationship without affecting the dist. of Y
• Scatter plots and residual plots should then be prepared to decide which transformation is most effective
Example
• Data from an experiment on the effect of the no. of days of training received (X) on performance (Y) in a battery of simulated sales situations are presented in Table 3.7 (p. 130) for the 10 participants in the study. A scatter plot is shown in Fig. 3.14a. We shall consider the square root transformation:
    X′ = √X
• Using the transformed X′, we obtain the following fitted regression fn.:
    Ŷ = −10.33 + 83.45X′
• Fig. 3.14c shows a plot of residuals vs. X′ – no evidence of lack of fit or of strongly unequal variances
• Fig. 3.14d shows a normal probability plot of the residuals – no strong indications of substantial departures from normality
• Thus, the simple linear regression model appears to be appropriate here for the transformed data
• The fitted regression fn. in the original units of X can be obtained as follows:
    Ŷ = −10.33 + 83.45√X
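The transform-then-fit step can be sketched as follows (the data below are hypothetical stand-ins; Table 3.7 has the real values):

```python
import math

# Hypothetical (days of training, performance) pairs; Table 3.7 has the real data
X = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
Y = [45, 70, 90, 105, 120, 132]

Xp = [math.sqrt(x) for x in X]                 # X' = sqrt(X)

# Ordinary least squares of Y on X'
n = len(Xp)
mx, my = sum(Xp) / n, sum(Y) / n
b1 = sum((x - mx) * (y - my) for x, y in zip(Xp, Y)) / sum((x - mx) ** 2 for x in Xp)
b0 = my - b1 * mx

# A fitted value in the original units: Yhat = b0 + b1 * sqrt(X)
yhat_at_2 = b0 + b1 * math.sqrt(2.0)
```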
Transformations for Nonnormality and
Unequal Error Variances
• Unequal error variances and nonnormality of the error
terms frequently appear together
• Transformation on Y is needed since the shapes and
spreads of the dist. of Y need to be changed
• Sometimes, a simultaneous transformation on X may be
necessary
• Frequently the departures take the form of increasing
skewness and increasing variability of the dist. of the error
terms as the mean response E{Y} increases
• Fig. 3.15 contains some prototype regression relations
where the skewness and error variance increase with the
mean response E{Y} and also presents some simple
transformations on Y for these cases

Example
• Data on age (X) and plasma level of a polyamine (Y) for a portion of the 25 healthy children in a study are presented in cols. 1 and 2 of Table 3.8 (p. 133). These data are plotted in Fig. 3.16a as a scatter plot
• We try the logarithmic transformation Y′ = log10 Y. Fig. 3.16b shows the scatter plot – a reasonably linear relationship with constant variability
• The fitted simple linear regression model on the transformed Y data is given by:
    Ŷ′ = 1.135 − 0.1023X
• A plot of the residuals vs. X (Fig. 3.16c) and a normal probability plot of the residuals (Fig. 3.16d) support the appropriateness of the regression model on Y′
Exploration of Shape of Regression Function
• When the scatter plot is complex, it becomes difficult to see the nature of the regression relationship, if any, from the plot.
• It is helpful to explore the nature of the regression relationship by fitting a smoothed curve without any constraints on the regression function.
• These smoothed curves are called nonparametric regression curves – useful not only for exploring regression relationships but also for confirming the nature of the regression function
Smoothing mtds. for obtaining smoothed curves for time series data:
• Method of moving averages
  – Uses the mean of the Y observations for adjacent time periods to obtain smoothed values.
• Method of running medians
  – The median is used as the average measure in order to reduce the influence of outlying observations.
• Band regression
  – Used when the X values are not equally spaced
  – Divides the data set into a number of groups (bands) consisting of adjacent cases according to their X levels.
  – For each band, the median X value and the median Y value are calculated, and the points defined by the pairs of these median values are then connected by straight lines.
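Minimal sketches of the first two methods, with a window of 3 adjacent periods (the window width and function names are our choices, not from the text):

```python
from statistics import mean, median

def moving_average(y, w=3):
    """Smoothed value at each position = mean of a window of w adjacent Y's."""
    return [mean(y[i:i + w]) for i in range(len(y) - w + 1)]

def running_median(y, w=3):
    """Same idea, but the median damps the influence of outlying observations."""
    return [median(y[i:i + w]) for i in range(len(y) - w + 1)]

series = [3, 4, 5, 40, 6, 7, 8]       # one outlying observation
print(moving_average(series))          # the outlier inflates nearby means
print(running_median(series))          # the medians largely ignore it
```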
• Lowess Method (locally weighted regression scatter plot smoothing)
  – Obtains a smoothed curve by fitting successive linear regression functions in local neighborhoods.
  – The method is similar to the moving average and running median mtds.
  – However, it obtains the smoothed Y value at a given X by fitting a linear regression to the data in the neighborhood of the X value and then using the fitted value at X as the smoothed value.
• The Lowess method uses a number of refinements in obtaining the final smoothed values to improve the smoothing and to make the procedure robust to outlying observations:
  1. The linear regression is weighted to give cases further from the middle X level in each neighborhood smaller weights
  2. To make the procedure robust to outlying observations, the linear regression fitting is repeated, with the weights revised so that cases that had large residuals in the first fitting receive smaller weights in the second fitting
  3. To improve the robustness of the procedure further, step 2 is repeated one or more times, revising the weights according to the size of the residuals in the latest fitting
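A minimal single-point version of the idea, using tricube neighborhood weights and one weighted linear fit (the robustness iterations of steps 2-3 are omitted; this is a sketch, not the full Lowess procedure):

```python
import numpy as np

def lowess_at(x0, X, Y, frac=0.6):
    """Smoothed Y value at x0: fit a weighted straight line to the
    nearest frac*n points (tricube weights) and evaluate it at x0."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    k = max(3, int(frac * len(X)))
    d = np.abs(X - x0)
    idx = np.argsort(d)[:k]                  # the local neighborhood
    h = d[idx].max()
    w = (1.0 - (d[idx] / h) ** 3) ** 3       # tricube: far points get ~0 weight
    A = np.column_stack([np.ones(k), X[idx]])
    # weighted least squares: solve (A'WA) beta = A'W y
    AW = A * w[:, None]
    beta = np.linalg.solve(AW.T @ A, AW.T @ Y[idx])
    return beta[0] + beta[1] * x0

# On exactly linear data the local fit reproduces the line
X = np.linspace(0.0, 10.0, 21)
Y = 2.0 * X + 1.0
print(round(lowess_at(5.0, X, Y), 6))   # 11.0
```

Sweeping `x0` over a grid of X values and connecting the smoothed points traces out the Lowess curve.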
Use of Smoothed Curves to Confirm the Fitted Regression Function
• A smoothed curve can be used to confirm the regression function chosen.
• Procedure
  – The smoothed curve is plotted together with the confidence band for the fitted regression function.
  – If the smoothed curve falls within the confidence band, we have supporting evidence of the appropriateness of the fitted regression function.
