
SW388R7
Data Analysis &
Computers II

Multiple Regression Assumptions and Outliers

Slide 1

Multiple Regression and Assumptions
Multiple Regression and Outliers
Strategy for Solving Problems
Practice Problems

SW388R7
Data Analysis &
Computers II

Multiple Regression and Assumptions

Slide 2

Multiple regression is most effective at identifying the
relationship between a dependent variable and a
combination of independent variables when its
underlying assumptions are satisfied: each of the
metric variables is normally distributed, the
relationships between metric variables are linear,
and the relationship between metric and
dichotomous variables is homoscedastic.
Failing to satisfy the assumptions does not mean that
our answer is wrong. It means that our solution may
under-report the strength of the relationships.

SW388R7
Data Analysis &
Computers II

Multiple Regression and Outliers

Slide 3

Outliers can distort the regression results. When an
outlier is included in the analysis, it pulls the
regression line towards itself. This can result in a
solution that is more accurate for the outlier, but
less accurate for all of the other cases in the data
set.
We will check for univariate outliers on the
dependent variable and multivariate outliers on the
independent variables.

SW388R7
Data Analysis &
Computers II

Relationship between assumptions and outliers

Slide 4

The problems of satisfying assumptions and detecting
outliers are intertwined. For example, if a case has
a value on the dependent variable that is an outlier,
it will affect the skew, and hence, the normality of
the distribution.
Removing an outlier may improve the distribution of
a variable.
Transforming a variable may reduce the likelihood
that the value for a case will be characterized as an
outlier.

SW388R7
Data Analysis &
Computers II

Order of analysis is important

Slide 5

The order in which we check assumptions and detect
outliers will affect our results because we may get a
different subset of cases in the final analysis.
In order to maximize the number of cases available
to the analysis, we will evaluate assumptions first.
We will substitute any transformations of variables
that enable us to satisfy the assumptions.
We will use any transformed variables that are
required in our analysis to detect outliers.

SW388R7
Data Analysis &
Computers II

Strategy for solving problems

Slide 6

Our strategy for solving problems about violations of
assumptions and outliers will include the following steps:

1. Run the type of regression specified in the problem statement on the
   variables, using the full data set.
2. Test the dependent variable for normality. If it does not satisfy the
   criteria for normality unless transformed, substitute the transformed
   variable in the remaining tests that call for the use of the dependent
   variable.
3. Test for normality, linearity, and homoscedasticity using the scripts.
   Decide which transformations should be used.
4. Substitute transformations and run the regression entering all
   independent variables, saving studentized residuals and Mahalanobis
   distance scores. Compute probabilities for D.
5. Remove the outliers (studentized residual greater than 3 or
   Mahalanobis D with p <= 0.001), and run the regression with the
   method and variables specified in the problem.
6. Compare R² for the analysis using transformed variables and omitting
   outliers (step 5) to the R² obtained for the model using all data and
   original variables (step 1).

SW388R7
Data Analysis &
Computers II

Transforming dependent variables

Slide 7

We will use the following logic to transform variables:

If the dependent variable is not normally distributed:
  Try log, square root, and inverse transformations.
  Use the first transformed variable that satisfies
  the normality criteria.
  If no transformation satisfies the normality criteria,
  use the untransformed variable and add a caution for
  violation of the assumption.
If a transformation satisfies normality, use the
transformed variable in the tests of the independent
variables.
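
As a sketch of what the script's checkbox transformations amount to in
SPSS syntax (the target names like lg_earn are illustrative; the "1 +"
and "24 -" constants follow the formulas the deck itself uses to avoid
zeros and to reflect negatively skewed variables):

* Candidate transformations, tried in order: log, square root, inverse.
* For a positively skewed variable such as earnrs, add 1 to avoid
* taking the log of zero or dividing by zero.
COMPUTE lg_earn = LG10(1 + earnrs).
COMPUTE sr_earn = SQRT(1 + earnrs).
COMPUTE in_earn = -1 / (1 + earnrs).
* For a negatively skewed variable such as rincom98 (maximum 23),
* reflect about (maximum + 1) before transforming.
COMPUTE lg_rinc = LG10(24 - rincom98).
EXECUTE.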

SW388R7
Data Analysis &
Computers II

Transforming independent variables - 1

Slide 8

If the independent variable is normally distributed and
linearly related to the dependent variable, use it as is.
If the independent variable is normally distributed but
not linearly related to the dependent variable:
  Try log, square root, square, and inverse
  transformations. Use the first transformed variable
  that satisfies the linearity criteria and does not
  violate the normality criteria.
  If no transformation satisfies the linearity criteria
  and does not violate the normality criteria, use the
  untransformed variable and add a caution for
  violation of the assumption.

SW388R7
Data Analysis &
Computers II

Transforming independent variables - 2

Slide 9

If the independent variable is linearly related to the
dependent variable but not normally distributed:
  Try log, square root, and inverse transformations.
  Use the first transformed variable that satisfies
  the normality criteria, still has a significant
  correlation, and does not reduce the correlation.
  If no transformation satisfies the normality criteria
  with a significant correlation, use the untransformed
  variable and add a caution for violation of the
  assumption.

SW388R7
Data Analysis &
Computers II

Transforming independent variables - 3

Slide 10

If the independent variable is not linearly related to the
dependent variable and not normally distributed:
  Try log, square root, square, and inverse
  transformations. Use the first transformed variable
  that satisfies the normality criteria and has a
  significant correlation.
  If no transformation satisfies the normality criteria
  with a significant correlation, use the untransformed
  variable and add a caution for violation of the
  assumption.

Impact of transformations
and omitting outliers

SW388R7
Data Analysis &
Computers II
Slide 11

We evaluate the regression assumptions and detect
outliers with a view toward strengthening the
relationship.
This may not happen. The regression may be the
same, it may be weaker, or it may be stronger. We
cannot be certain of the impact until we run the
regression again.
In the end, we may opt not to exclude outliers and
not to employ transformations; the analysis informs
us of the consequences of doing either.

SW388R7
Data Analysis &
Computers II

Notes

Slide 12

Whenever you start a new problem, make sure you
have removed variables created for previous analyses
and have included all cases back into the data set.
I have added the square transformation to the
checkboxes for transformations in the normality
script. Since this is an option for linearity, we need
to be able to evaluate its impact on normality.

If you change the options for output in pivot tables
from labels to names, you will get an error message
when you use the linearity script. To solve the
problem, change the option for output in pivot tables
back to labels.

SW388R7
Data Analysis &
Computers II

Problem 1

Slide 13

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7
Data Analysis &
Computers II

Dissecting problem 1 - 1

Slide 14

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.01 as alpha for the
regression analysis as well as for testing assumptions.

SW388R7
Data Analysis &
Computers II

Dissecting problem 1 - 2

Slide 15

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The method for selecting variables is derived from the research question.
In this problem we are asked to identify the best subset of predictors,
so we do a stepwise multiple regression.

SW388R7
Data Analysis &
Computers II

Dissecting problem 1 - 3

Slide 16

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The purpose of testing for assumptions and outliers is to identify a
stronger model. The main question to be answered in this problem is
whether or not the use of transformed variables to satisfy assumptions
and the removal of outliers improves the overall relationship between
the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression
analysis after substituting transformed variables and eliminating
outliers is 10.8% higher than a regression analysis using the original
format for all variables and including all cases.

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 17

To start out, we run a stepwise multiple regression
analysis with income98 as the dependent variable
and sex, earnrs, and rincom98 as the independent
variables.
We select stepwise as the method to select the
best subset of predictors.
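
For reference, the dialog selections above correspond roughly to the
following SPSS syntax (a sketch; the PIN/POUT entry and removal criteria
shown are SPSS defaults, not values given in the slides):

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT income98
  /METHOD=STEPWISE sex earnrs rincom98.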

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 18

Prior to any transformations of variables
to satisfy the assumptions of multiple
regression or removal of outliers, the
proportion of variance in the dependent
variable explained by the independent
variables (R²) was 51.1%. This is the
benchmark that we will use to evaluate
the utility of transformations and the
elimination of outliers.

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 19

For this particular question, we are not interested in the
statistical significance of the overall relationship prior to
transformations and removing outliers. In fact, it is
possible that the relationship is not statistically significant
due to variables that are not normal, relationships that
are not linear, and the inclusion of outliers.

SW388R7
Data Analysis &
Computers II
Slide 20

Normality of the dependent variable:
total family income

In evaluating assumptions, the first step is to
examine the normality of the dependent
variable. If it is not normally distributed, or
cannot be normalized with a transformation, it
can affect the relationships with all other
variables.

To test the normality of the dependent
variable, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the list box of variables to test.

Second, click on the OK button to
produce the output.
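
The normality script reports the same kind of statistics we could obtain
directly from the EXAMINE procedure; a minimal equivalent sketch, assuming
the GSS2000.sav variable names used in the slides:

EXAMINE VARIABLES=income98
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.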

SW388R7
Data Analysis &
Computers II
Slide 21

Normality of the dependent variable:
total family income

Descriptives: TOTAL FAMILY INCOME
                          Statistic   Std. Error
Mean                        15.67        .349
95% CI for Mean (Lower)     14.98
95% CI for Mean (Upper)     16.36
5% Trimmed Mean             15.95
Median                      17.00
Variance                    27.951
Std. Deviation               5.287
Minimum                          1
Maximum                         23
Range                           22
Interquartile Range           8.00
Skewness                     -.628        .161
Kurtosis                     -.248        .320

The dependent variable "total family income"
[income98] satisfies the criteria for a normal
distribution. The skewness (-0.628) and kurtosis
(-0.248) were both between -1.0 and +1.0. No
transformation is necessary.

SW388R7
Data Analysis &
Computers II
Slide 22

Linearity and independent variable:
how many in family earned money

To evaluate the linearity of the relationship
between number of earners and total family
income, run the script for the assumption of
linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the text box for the
dependent variable.

Second, move the independent variable,
EARNRS, to the list box for independent
variables.

Third, click on the OK button to
produce the output.
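
The linearity script essentially computes the candidate transformations
and correlates each with the dependent variable. A hand-rolled equivalent
might look like this (the lg_earn, sq_earn, etc. names are illustrative;
the formulas are the ones shown in the script's output):

COMPUTE lg_earn = LG10(1 + earnrs).
COMPUTE sq_earn = earnrs ** 2.
COMPUTE sr_earn = SQRT(1 + earnrs).
COMPUTE in_earn = -1 / (1 + earnrs).
EXECUTE.
CORRELATIONS
  /VARIABLES=income98 WITH earnrs lg_earn sq_earn sr_earn in_earn
  /PRINT=TWOTAIL.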

SW388R7
Data Analysis &
Computers II
Slide 23

Linearity and independent variable:
how many in family earned money

Correlations with TOTAL FAMILY INCOME [income98] (N = 228):
HOW MANY IN FAMILY EARNED MONEY [earnrs]     r = .505**
Logarithm of EARNRS [LG10(1+EARNRS)]         r = .536**
Square of EARNRS [(EARNRS)**2]               r = .376**
Square Root of EARNRS [SQRT(1+EARNRS)]       r = .527**
Inverse of EARNRS [-1/(1+EARNRS)]            r = .526**
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "how many in family earned money"
[earnrs] satisfies the criteria for the assumption of linearity
with the dependent variable "total family income" [income98],
but does not satisfy the assumption of normality.
The evidence of linearity in the relationship between the
independent variable "how many in family earned money" [earnrs]
and the dependent variable "total family income" [income98] was
the statistical significance of the correlation coefficient
(r = 0.505). The probability for the correlation coefficient
was <0.001, less than or equal to the level of significance of
0.01. We reject the null hypothesis that r = 0 and conclude
that there is a linear relationship between the variables.

SW388R7
Data Analysis &
Computers II
Slide 24

Normality of independent variable:
how many in family earned money

After evaluating the dependent variable, we
examine the normality of each metric
variable and linearity of its relationship with
the dependent variable.
To test the normality of number of earners in
family, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
EARNRS to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 25

Normality of independent variable:
how many in family earned money

Descriptives: HOW MANY IN FAMILY EARNED MONEY
                          Statistic   Std. Error
Mean                         1.43        .061
95% CI for Mean (Lower)      1.31
95% CI for Mean (Upper)      1.56
5% Trimmed Mean              1.37
Median                       1.00
Variance                    1.015
Std. Deviation              1.008
Minimum                         0
Maximum                         5
Range                           5
Interquartile Range          1.00
Skewness                     .742        .149
Kurtosis                    1.324        .296

The independent variable "how many in family earned money" [earnrs]
satisfies the criteria for the assumption of linearity with the dependent
variable "total family income" [income98], but does not satisfy the
assumption of normality.
In evaluating normality, the skewness (0.742) was between -1.0 and
+1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.

SW388R7
Data Analysis &
Computers II
Slide 26

Normality of independent variable:
how many in family earned money

The logarithmic transformation improves the normality
of "how many in family earned money" [earnrs] without
a reduction in the strength of the relationship to
"total family income" [income98]. In evaluating
normality, the skewness (-0.483) and kurtosis (-0.309)
were both within the range of acceptable values from
-1.0 to +1.0. The correlation coefficient for the
transformed variable is 0.536.

The square root transformation also has values of
skewness and kurtosis in the acceptable range.
However, by our order of preference for which
transformation to use, the logarithm is preferred.

Transformation for how many in family earned money

SW388R7
Data Analysis &
Computers II
Slide 27

The independent variable, how many in family
earned money, had a linear relationship to the
dependent variable, total family income.
The logarithmic transformation improves the
normality of "how many in family earned money"
[earnrs] without a reduction in the strength of the
relationship to "total family income" [income98].
We will substitute the logarithmic transformation of
how many in family earned money in the regression
analysis.

SW388R7
Data Analysis &
Computers II
Slide 28

Normality of independent variable:
respondents income

After evaluating the dependent variable, we
examine the normality of each metric
variable and linearity of its relationship with
the dependent variable.
To test the normality of respondents income,
run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
RINCOM98 to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 29

Normality of independent variable:
respondents income

Descriptives: RESPONDENTS INCOME
                          Statistic   Std. Error
Mean                        13.35        .419
95% CI for Mean (Lower)     12.52
95% CI for Mean (Upper)     14.18
5% Trimmed Mean             13.54
Median                      15.00
Variance                   29.535
Std. Deviation              5.435
Minimum                         1
Maximum                        23
Range                          22
Interquartile Range          8.00
Skewness                    -.686        .187
Kurtosis                    -.253        .373

The independent variable "income" [rincom98] satisfies the criteria for
both the assumption of normality and the assumption of linearity with
the dependent variable "total family income" [income98].
In evaluating normality, the skewness (-0.686) and kurtosis (-0.253)
were both within the range of acceptable values from -1.0 to +1.0.

SW388R7
Data Analysis &
Computers II
Slide 30

Linearity and independent variable:
respondents income

To evaluate the linearity of the relationship
between respondents income and total
family income, run the script for the
assumption of linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the text box for the
dependent variable.

Second, move the independent variable,
RINCOM98, to the list box for independent
variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 31

Linearity and independent variable:
respondents income

Correlations with TOTAL FAMILY INCOME [income98] (N = 163):
RESPONDENTS INCOME [rincom98]                 r =  .577**
Logarithm of RINCOM98 [LG10(24-RINCOM98)]     r = -.595**
Square of RINCOM98 [(RINCOM98)**2]            r =  .613**
Square Root of RINCOM98 [SQRT(24-RINCOM98)]   r = -.601**
Inverse of RINCOM98 [-1/(24-RINCOM98)]        r = -.434**
**. Correlation is significant at the 0.01 level (2-tailed).

The evidence of linearity in the relationship between the
independent variable "income" [rincom98] and the dependent
variable "total family income" [income98] was the statistical
significance of the correlation coefficient (r = 0.577). The
probability for the correlation coefficient was <0.001, less
than or equal to the level of significance of 0.01. We reject
the null hypothesis that r = 0 and conclude that there is a
linear relationship between the variables.

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 32

To evaluate the homoscedasticity of the
relationship between sex and total family
income, run the script for the assumption of
homogeneity of variance:
HomoscedasticityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the text box for the
dependent variable.

Second, move the independent variable,
SEX, to the list box for independent variables.

Third, click on the OK button to
produce the output.
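
The homoscedasticity script reports the Levene statistic; the same test
can be requested directly. A sketch using ONEWAY, whose homogeneity-of-
variance option computes Levene's test for income98 across the categories
of sex:

ONEWAY income98 BY sex
  /STATISTICS HOMOGENEITY.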

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 33

Based on the Levene Test, the variance in
"total family income" [income98] is
homogeneous for the categories of "sex" [sex].
The probability associated with the Levene
Statistic (0.031) is greater than the level of
significance (0.01), so we fail to reject the
null hypothesis and conclude that the
homoscedasticity assumption is satisfied.

SW388R7
Data Analysis &
Computers II

Adding a transformed variable

Slide 34

Even though we do not need a transformation
for any of the variables in this analysis, we
will demonstrate how to use a script, such as
the normality script, to add a transformed
variable to the data set, e.g. a logarithmic
transformation for highest year of school.

First, move the variable that we want to
transform to the list box of variables to test.

Second, mark the checkbox for the
transformation we want to add to the data
set, and clear the other checkboxes.

Third, clear the checkbox for Delete
transformed variables from the data. This
will save the transformed variable.

Fourth, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

The transformed variable in the data editor

Slide 35

If we scroll to the extreme right in the data
editor, we see that the transformed variable
has been added to the data set.

Whenever we add transformed variables to
the data set, we should be sure to delete
them before starting another analysis.

SW388R7
Data Analysis &
Computers II

The regression to identify outliers

Slide 36

We use the regression procedure to identify
both univariate and multivariate outliers.
We start with the same dialog we used for
the last analysis, in which income98 was the
dependent variable and sex, earnrs, and
rincom98 were the independent variables.

First, we substitute the logarithmic
transformation of earnrs, logearn, into the
list of independent variables.

Second, we change the method of entry from
Stepwise to Enter so that all variables will be
included in the detection of outliers.

Third, we want to save the calculated values
of the outlier statistics to the data set.
Click on the Save button to specify what we
want to save.

SW388R7
Data Analysis &
Computers II

Saving the measures of outliers

Slide 37

First, mark the checkbox for Studentized
residuals in the Residuals panel. Studentized
residuals are z-scores computed for a case
based on the data for all other cases in the
data set.

Second, mark the checkbox for Mahalanobis
in the Distances panel. This will compute
Mahalanobis distances for the set of
independent variables.

Third, click on the OK button to complete
the specifications.
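
Together, the last two dialogs correspond to syntax along these lines (a
sketch; logearn is the name the deck uses for the saved logarithmic
transformation of earnrs):

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT income98
  /METHOD=ENTER sex logearn rincom98
  /SAVE SRESID MAHAL.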

SW388R7
Data Analysis &
Computers II

The variables for identifying outliers

Slide 38

The values for identifying univariate outliers
on the dependent variable are in a column
which SPSS has named sre_1.

The values for identifying multivariate
outliers on the independent variables are in
a column which SPSS has named mah_1.

SW388R7
Data Analysis &
Computers II

Computing the probability for Mahalanobis D

Slide 39

To compute the probability of D, we will use
an SPSS function in a Compute command.

First, select the Compute command from
the Transform menu.

SW388R7
Data Analysis &
Computers II

Formula for probability for Mahalanobis D

Slide 40

First, in the target variable text box, type the
name "p_mah_1" as an acronym for the probability
of mah_1, the Mahalanobis D score.

Second, to complete the specifications for the
CDF.CHISQ function, type the name of the variable
containing the D scores, mah_1, followed by a
comma, followed by the number of variables used
in the calculations, 3.

Third, click on the OK button to signal
completion of the Compute Variable dialog.

Since the CDF function (cumulative distribution
function) computes the cumulative probability from
the left end of the distribution up through a given
value, we subtract it from 1 to obtain the
probability in the upper tail of the distribution.
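
The completed Compute dialog is equivalent to this one-line syntax (3 is
the number of independent variables, the degrees of freedom for the
chi-square distribution of D):

COMPUTE p_mah_1 = 1 - CDF.CHISQ(mah_1, 3).
EXECUTE.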

SW388R7
Data Analysis &
Computers II

Multivariate outliers

Slide 41

Using the probabilities computed in p_mah_1
to identify outliers, scroll down through the list
of cases to see if we can find cases with a
probability less than 0.001.
There are no outliers for the set of
independent variables.

SW388R7
Data Analysis &
Computers II

Univariate outliers

Slide 42

Similarly, we can scroll down the values of
sre_1, the studentized residuals, to find
outliers with values larger than 3.0
(regardless of sign).

Based on this criterion, there are 4 outliers.
There are 4 cases that have a score on the
dependent variable that is sufficiently
unusual to be considered outliers
(case 20000357: studentized residual=3.08;
case 20000416: studentized residual=3.57;
case 20001379: studentized residual=3.27;
case 20002702: studentized residual=-3.23).

SW388R7
Data Analysis &
Computers II

Omitting the outliers

Slide 43

To omit the outliers from the analysis, we
select the cases that are not outliers.

First, select the Select Cases command
from the Data menu.

SW388R7
Data Analysis &
Computers II

Specifying the condition to omit outliers

Slide 44

First, mark the If condition is satisfied
option button to indicate that we will enter
a specific condition for including cases.

Second, click on the If button to specify
the criteria for inclusion in the analysis.

SW388R7
Data Analysis &
Computers II

The formula for omitting outliers

Slide 45

To eliminate the outliers, we request the
cases that are not outliers.
The formula specifies that we should include
cases if the studentized residual (regardless
of sign) is less than 3 and the probability for
Mahalanobis D is higher than the level of
significance, 0.001.

After typing in the formula, click on the
Continue button to close the dialog box.
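
Pasted as syntax, the Select Cases dialog produces a filter like the
following sketch (filter_$ is the name SPSS assigns to the filter
variable it creates):

USE ALL.
COMPUTE filter_$ = (ABS(sre_1) < 3 AND p_mah_1 > 0.001).
FILTER BY filter_$.
EXECUTE.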

SW388R7
Data Analysis &
Computers II

Completing the request for the selection

Slide 46

To complete the
request, we click on
the OK button.

SW388R7
Data Analysis &
Computers II

The omitted outliers

Slide 47

SPSS identifies the excluded cases by
drawing a slash mark through the case
number. Most of the slashes are for cases
with missing data, but we also see that the
cases identified as outliers are included in
those that will be omitted.

SW388R7
Data Analysis &
Computers II

Running the regression without outliers

Slide 48

We run the regression again, excluding the
outliers.
Select the Regression | Linear command
from the Analyze menu.

SW388R7
Data Analysis &
Computers II

Opening the save options dialog

Slide 49

We specify the dependent and independent
variables, substituting any transformed
variables required by assumptions.

When we used regression to detect outliers,
we entered all variables. Now we are testing
the relationship specified in the problem, so
we change the method to Stepwise.

On our last run, we instructed SPSS to save
studentized residuals and Mahalanobis
distance. To prevent these values from being
calculated again, click on the Save button.

SW388R7
Data Analysis &
Computers II

Clearing the request to save outlier data

Slide 50

First, clear the checkbox for Studentized
residuals.

Second, clear the checkbox for Mahalanobis
distance.

Third, click on the OK button to complete
the specifications.

SW388R7
Data Analysis &
Computers II

Opening the statistics options dialog

Slide 51

Once we have removed outliers, we need to
check the sample size requirement for
regression. Since we will need the descriptive
statistics for this, click on the Statistics
button.

SW388R7
Data Analysis &
Computers II

Requesting descriptive statistics

Slide 52

First, mark the checkbox for Descriptives.

Second, click on the Continue button to
complete the specifications.

SW388R7
Data Analysis &
Computers II

Requesting the output

Slide 53

Having specified the output needed for the
analysis, we click on the OK button to obtain
the regression output.

SW388R7
Data Analysis &
Computers II

Sample size requirement

Slide 54

The minimum ratio of valid cases to independent
variables for stepwise multiple regression is 5 to 1.
After removing 4 outliers, there are 159 valid cases
and 3 independent variables.
The ratio of cases to independent variables for this
analysis is 53.0 to 1, which satisfies the minimum
requirement. In addition, the ratio of 53.0 to 1
satisfies the preferred ratio of 50 to 1.

Descriptive Statistics
                                        Mean    Std. Deviation    N
TOTAL FAMILY INCOME                    17.09         4.073       159
RESPONDENTS SEX                         1.55          .499       159
RESPONDENTS INCOME                     13.76         5.133       159
Logarithm of EARNRS [LG10(1+EARNRS)]  .424896      .1156559      159

SW388R7
Data Analysis &
Computers II

Significance of regression relationship

Slide 55

ANOVA(d)
Model             Sum of Squares    df    Mean Square        F      Sig.
1  Regression          1122.398      1      1122.398    117.541   .000(a)
   Residual            1499.187    157         9.549
   Total               2621.585    158
2  Regression          1572.722      2       786.361    116.957   .000(b)
   Residual            1048.863    156         6.723
   Total               2621.585    158
3  Regression          1623.976      3       541.325     84.107   .000(c)
   Residual             997.609    155         6.436
   Total               2621.585    158

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME

The probability of the F statistic (84.107) for the regression
relationship which includes these variables is <0.001, less
than or equal to the level of significance of 0.01. We reject
the null hypothesis that there is no relationship between the
best subset of independent variables and the dependent
variable (R² = 0).
We support the research hypothesis that there is a
statistically significant relationship between the best subset
of independent variables and the dependent variable.

SW388R7
Data Analysis &
Computers II

Increase in proportion of variance

Slide 56

Model Summary
Model      R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .654(a)     .428           .424                    3.090
2       .775(b)     .600           .595                    2.593
3       .787(c)     .619           .612                    2.537

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)], RESPONDENTS SEX

Prior to any transformations of variables to satisfy
the assumptions of multiple regression or removal
of outliers, the proportion of variance in the
dependent variable explained by the independent
variables (R²) was 51.1%.

After transformed variables were substituted to
satisfy assumptions and outliers were removed
from the sample, the proportion of variance
explained by the regression analysis was 61.9%, a
difference of 10.8%.

The answer to the question is true with caution.
A caution is added because of the inclusion of
ordinal level variables.

SW388R7
Data Analysis &
Computers II

Problem 2

Slide 57

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7
Data Analysis &
Computers II

Dissecting problem 2 - 1

Slide 58

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.05 as alpha for the
regression analysis and the more conservative 0.01 as the alpha in
testing assumptions.

SW388R7
Data Analysis &
Computers II

Dissecting problem 2 - 2

Slide 59

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The method for selecting variables is derived from the research
question. If we are asked to examine a relationship without any
statement about control variables or the best subset of variables,
we do a standard multiple regression.

SW388R7
Data Analysis &
Computers II

Dissecting problem 2 - 3

Slide 60

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The purpose of testing for assumptions and outliers is to identify a
stronger model. The main question to be answered in this problem is
whether or not the use of transformed variables to satisfy assumptions
and the removal of outliers improves the overall relationship between
the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression
analysis after substituting transformed variables and eliminating
outliers is 3.6% higher than a regression analysis using the original
format for all variables and including all cases.

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 61

To start out, we run a standard multiple
regression analysis with prestg80 as the
dependent variable and age, educ, and sex
as the independent variables.
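
As a sketch, the standard (simultaneous-entry) regression for problem 2
corresponds to syntax like this, using the variable names from the slides:

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT prestg80
  /METHOD=ENTER age educ sex.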

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 62

Prior to any transformations of variables
to satisfy the assumptions of multiple
regression or removal of outliers, the
proportion of variance in the dependent
variable explained by the independent
variables (R²) was 27.1%. This is the
benchmark that we will use to evaluate
the utility of transformations and the
elimination of outliers.

For this particular question, we are not interested in the
statistical significance of the overall relationship prior to
transformations and removing outliers. In fact, it is
possible that the relationship is not statistically significant
due to variables that are not normal, relationships that
are not linear, and the inclusion of outliers.

SW388R7
Data Analysis &
Computers II

Normality of the dependent variable

Slide 63

In evaluating assumptions, the first step is to
examine the normality of the dependent
variable. If it is not normally distributed, or
cannot be normalized with a transformation, it
can affect the relationships with all other
variables.

To test the normality of the dependent
variable, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Normality of the dependent variable

Slide 64

The dependent variable "occupational prestige


score" [prestg80] satisfies the criteria for a
normal distribution. The skewness (0.401) and
kurtosis (-0.630) were both between -1.0 and
+1.0. No transformation is necessary.

SW388R7
Data Analysis &
Computers II

Normality of independent variable: Age

Slide 65

After evaluating the dependent variable, we
examine the normality of each metric
variable and linearity of its relationship with
the dependent variable.
To test the normality of age, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
AGE to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Normality of independent variable: Age

Slide 66

Descriptives: AGE OF RESPONDENT
                          Statistic   Std. Error
Mean                        45.99       1.023
95% CI for Mean (Lower)     43.98
95% CI for Mean (Upper)     48.00
5% Trimmed Mean             45.31
Median                      43.50
Variance                  282.465
Std. Deviation             16.807
Minimum                        19
Maximum                        89
Range                          70
Interquartile Range         24.00
Skewness                     .595        .148
Kurtosis                    -.351        .295

The independent variable "age" [age] satisfies the criteria for the
assumption of normality, but does not satisfy the assumption of
linearity with the dependent variable "occupational prestige score"
[prestg80].
In evaluating normality, the skewness (0.595) and kurtosis (-0.351)
were both within the range of acceptable values from -1.0 to +1.0.

SW388R7
Data Analysis &
Computers II

Linearity and independent variable: Age

Slide 67

To evaluate the linearity of the relationship
between age and occupational prestige, run
the script for the assumption of linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the text box for the
dependent variable.

Second, move the independent variable,
AGE, to the list box for independent variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Linearity and independent variable: Age

Slide 68

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) [prestg80]
(N = 255):
AGE OF RESPONDENT [age]           r =  .024   (Sig. .706)
Logarithm of AGE [LG10(AGE)]      r =  .059   (Sig. .348)
Square of AGE [(AGE)**2]          r = -.004   (Sig. .956)
Square Root of AGE [SQRT(AGE)]    r =  .041   (Sig. .518)
Inverse of AGE [-1/(AGE)]         r =  .096   (Sig. .128)

The evidence of nonlinearity in the relationship between the
independent variable "age" [age] and the dependent variable
"occupational prestige score" [prestg80] was the lack of
statistical significance of the correlation coefficient
(r = 0.024). The probability for the correlation coefficient
was 0.706, greater than the level of significance of 0.01. We
cannot reject the null hypothesis that r = 0, and cannot
conclude that there is a linear relationship between the
variables.
Since none of the transformations to improve linearity were
successful, it is an indication that the problem may be a weak
relationship, rather than a curvilinear relationship correctable
by using a transformation. A weak relationship is not a
violation of the assumption of linearity, and does not require
a caution.

SW388R7
Data Analysis &
Computers II

Transformation for Age

Slide 69

The independent variable age satisfied the criteria
for normality.
The independent variable age did not have a linear
relationship to the dependent variable occupational
prestige. However, none of the transformations
linearized the relationship.
No transformation will be used - it would not help
linearity and is not needed for normality.

SW388R7
Data Analysis &
Computers II
Slide 70

Linearity and independent variable:
Highest year of school completed

To evaluate the linearity of the relationship
between highest year of school and
occupational prestige, run the script for the
assumption of linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the text box for the
dependent variable.

Second, move the independent variable,
EDUC, to the list box for independent
variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 71

Linearity and independent variable:
Highest year of school completed

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) [prestg80]
(N = 254):
HIGHEST YEAR OF SCHOOL COMPLETED [educ]   r =  .495**
Logarithm of EDUC [LG10(21-EDUC)]         r = -.512**
Square of EDUC [(EDUC)**2]                r =  .528**
Square Root of EDUC [SQRT(21-EDUC)]       r = -.518**
Inverse of EDUC [-1/(21-EDUC)]            r = -.423**
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "highest year of school completed"
[educ] satisfies the criteria for the assumption of linearity
with the dependent variable "occupational prestige score"
[prestg80], but does not satisfy the assumption of normality.
The evidence of linearity in the relationship between the
independent variable "highest year of school completed" [educ]
and the dependent variable "occupational prestige score"
[prestg80] was the statistical significance of the correlation
coefficient (r = 0.495). The probability for the correlation
coefficient was <0.001, less than or equal to the level of
significance of 0.01. We reject the null hypothesis that r = 0
and conclude that there is a linear relationship between the
variables.

SW388R7
Data Analysis &
Computers II
Slide 72

Normality of independent variable:
Highest year of school completed

To test the normality of EDUC, highest year
of school completed, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
EDUC to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 73

Normality of independent variable:
Highest year of school completed

Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED
                          Statistic   Std. Error
Mean                        13.12        .179
95% CI for Mean (Lower)     12.77
95% CI for Mean (Upper)     13.47
5% Trimmed Mean             13.14
Median                      13.00
Variance                    8.583
Std. Deviation              2.930
Minimum                         2
Maximum                        20
Range                          18
Interquartile Range          3.00
Skewness                    -.137        .149
Kurtosis                    1.246        .296

In evaluating normality, the skewness (-0.137) was between -1.0
and +1.0, but the kurtosis (1.246) was outside the range from -1.0
to +1.0. None of the transformations for normalizing the distribution
of "highest year of school completed" [educ] were effective.

SW388R7
Data Analysis &
Computers II

Transformation for highest year of school

Slide 74

The independent variable, highest year of school,


had a linear relationship to the dependent variable,
occupational prestige.
The independent variable, highest year of school, did
not satisfy the criteria for normality. None of the
transformations for normalizing the distribution of
"highest year of school completed" [educ] were
effective.
No transformation will be used - it would not help
normality and is not needed for linearity. A caution
should be added to any findings.

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 75

To evaluate the homoscedasticity of the
relationship between sex and occupational
prestige, run the script for the assumption of
homogeneity of variance:
HomoscedasticityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the text box for the
dependent variable.

Second, move the independent variable,
SEX, to the list box for independent variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 76

Based on the Levene Test, the variance in
"occupational prestige score" [prestg80] is
homogeneous for the categories of "sex" [sex].
The probability associated with the Levene
Statistic (0.808) is greater than the level of
significance (0.01), so we fail to reject the
null hypothesis and conclude that the
homoscedasticity assumption is satisfied.
Even if we violated the assumption, we would
not do a transformation, since it could impact
the relationships of the other independent
variables with the dependent variable.

SW388R7
Data Analysis &
Computers II

Adding a transformed variable

Slide 77

Even though we do not need a transformation
for any of the variables in this analysis, we
will demonstrate how to use a script, such as
the normality script, to add a transformed
variable to the data set, e.g. a logarithmic
transformation for highest year of school.

First, move the variable that we want to
transform to the list box of variables to test.

Second, mark the checkbox for the
transformation we want to add to the data
set, and clear the other checkboxes.

Third, clear the checkbox for Delete
transformed variables from the data. This
will save the transformed variable.

Fourth, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

The transformed variable in the data editor

Slide 78

If we scroll to the extreme right in the data
editor, we see that the transformed variable
has been added to the data set.

Whenever we add transformed variables to
the data set, we should be sure to delete
them before starting another analysis.

SW388R7
Data Analysis &
Computers II

The regression to identify outliers

Slide 79

We can use the regression procedure to
identify both univariate and multivariate
outliers.
We start with the same dialog we used for
the last analysis, in which prestg80 was the
dependent variable and age, educ, and sex
were the independent variables.
If we needed to use any transformed
variables, we would substitute them now.

We will save the calculated values of the
outlier statistics to the data set.
Click on the Save button to specify what
we want to save.

SW388R7
Data Analysis &
Computers II

Saving the measures of outliers

Slide 80

First, mark the checkbox for Studentized
residuals in the Residuals panel. Studentized
residuals are z-scores computed for a case
based on the data for all other cases in the
data set.

Second, mark the checkbox for Mahalanobis
in the Distances panel. This will compute
Mahalanobis distances for the set of
independent variables.

Third, click on the OK button to complete
the specifications.

SW388R7
Data Analysis &
Computers II

The variables for identifying outliers

Slide 81

The values for identifying univariate outliers
on the dependent variable are in a column
which SPSS has named sre_1.

The values for identifying multivariate
outliers on the independent variables are in
a column which SPSS has named mah_1.

SW388R7
Data Analysis &
Computers II

Computing the probability for Mahalanobis D

Slide 82

To compute the probability of D, we will use
an SPSS function in a Compute command.

First, select the Compute command from
the Transform menu.

SW388R7
Data Analysis &
Computers II

Formula for probability for Mahalanobis D

Slide 83

First, in the target variable text box, type the
name "p_mah_1" as an acronym for the probability
of mah_1, the Mahalanobis D score.

Second, to complete the specifications for the
CDF.CHISQ function, type the name of the variable
containing the D scores, mah_1, followed by a
comma, followed by the number of variables used
in the calculations, 3.

Third, click on the OK button to signal
completion of the Compute Variable dialog.

Since the CDF function (cumulative distribution
function) computes the cumulative probability from
the left end of the distribution up through a given
value, we subtract it from 1 to obtain the
probability in the upper tail of the distribution.

SW388R7
Data Analysis &
Computers II

The multivariate outlier

Slide 84

Using the probabilities computed in p_mah_1
to identify outliers, scroll down through the list
of cases to see the one case with a probability
less than 0.001.
There is 1 case that has a combination of
scores on the independent variables that is
sufficiently unusual to be considered an outlier
(case 20001984: Mahalanobis D=16.97,
p=0.0007).

SW388R7
Data Analysis &
Computers II

The univariate outlier

Slide 85

Similarly, we can scroll down the values of
sre_1, the studentized residuals, to see the
one outlier with a value larger than 3.0.
There is 1 case that has a score on the
dependent variable that is sufficiently
unusual to be considered an outlier (case
20000391: studentized residual=4.14).

SW388R7
Data Analysis &
Computers II

Omitting the outliers

Slide 86

To omit the outliers from the analysis, we
select the cases that are not outliers.

First, select the Select Cases command
from the Data menu.

SW388R7
Data Analysis &
Computers II

Specifying the condition to omit outliers

Slide 87

First, mark the If condition is satisfied
option button to indicate that we will enter
a specific condition for including cases.

Second, click on the If button to specify
the criteria for inclusion in the analysis.

SW388R7
Data Analysis &
Computers II

The formula for omitting outliers

Slide 88

To eliminate the outliers, we request the
cases that are not outliers.
The formula specifies that we should include
cases if the studentized residual (regardless
of sign) is less than 3 and the probability for
Mahalanobis D is higher than the level of
significance, 0.001.

After typing in the formula, click on the
Continue button to close the dialog box.

SW388R7
Data Analysis &
Computers II

Completing the request for the selection

Slide 89

To complete the
request, we click on
the OK button.

SW388R7
Data Analysis &
Computers II

The omitted multivariate outlier

Slide 90

SPSS identifies the excluded cases by
drawing a slash mark through the case
number. Most of the slashes are for cases
with missing data, but we also see that the
case with the low probability for Mahalanobis
distance is included in those that will be
omitted.

SW388R7
Data Analysis &
Computers II

Running the regression without outliers

Slide 91

We run the regression again, excluding the
outliers.
Select the Regression | Linear command
from the Analyze menu.

SW388R7
Data Analysis &
Computers II

Opening the save options dialog

Slide 92

We specify the dependent and independent
variables. If we wanted to use any
transformed variables, we would substitute
them now.

On our last run, we instructed SPSS to save
studentized residuals and Mahalanobis
distance. To prevent these values from being
calculated again, click on the Save button.

SW388R7
Data Analysis &
Computers II

Clearing the request to save outlier data

Slide 93

First, clear the checkbox for Studentized
residuals.

Second, clear the checkbox for Mahalanobis
distance.

Third, click on the OK button to complete
the specifications.

SW388R7
Data Analysis &
Computers II

Opening the statistics options dialog

Slide 94

Once we have removed outliers, we need to
check the sample size requirement for
regression. Since we will need the descriptive
statistics for this, click on the Statistics
button.

SW388R7
Data Analysis &
Computers II

Requesting descriptive statistics

Slide 95

First, mark the checkbox for Descriptives.

Second, click on the Continue button to
complete the specifications.

SW388R7
Data Analysis &
Computers II

Requesting the output

Slide 96

Having specified the output needed for the
analysis, we click on the OK button to obtain
the regression output.

SW388R7
Data Analysis &
Computers II

Sample size requirement

Slide 97

The minimum ratio of valid cases to independent
variables for multiple regression is 5 to 1. After
removing 2 outliers, there are 252 valid cases and
3 independent variables.
The ratio of cases to independent variables for this
analysis is 84.0 to 1, which satisfies the minimum
requirement. In addition, the ratio of 84.0 to 1
satisfies the preferred ratio of 15 to 1.

SW388R7
Data Analysis &
Computers II

Significance of regression relationship

Slide 98

The probability of the F statistic (36.639) for the
overall regression relationship is <0.001, less than
or equal to the level of significance of 0.05. We
reject the null hypothesis that there is no
relationship between the set of independent
variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a
statistically significant relationship between the
set of independent variables and the dependent
variable.

SW388R7
Data Analysis &
Computers II

Increase in proportion of variance

Slide 99

Prior to any transformations of variables to satisfy
the assumptions of multiple regression or removal
of outliers, the proportion of variance in the
dependent variable explained by the independent
variables (R²) was 27.1%. No transformed
variables were substituted to satisfy assumptions,
but outliers were removed from the sample.
The proportion of variance explained by the
regression analysis after removing outliers was
30.7%, a difference of 3.6%.

The answer to the question is true with caution.
A caution is added because of a violation of
regression assumptions.

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 1

Slide 100

The following is a guide to the decision process for answering
problems about the impact of assumptions and outliers on analysis:

Dependent variable metric?
Independent variables metric or dichotomous?
  No  -> Inappropriate application of a statistic
  Yes -> Ratio of cases to independent variables at least 5 to 1?
           No  -> Inappropriate application of a statistic
           Yes -> Run baseline regression and record R² for future
                  reference, using the method for including variables
                  identified in the research question.

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 2

Slide 101

Is the dependent variable normally distributed?
  No  -> Try:
         1. Logarithmic transformation
         2. Square root transformation
         3. Inverse transformation
         If unsuccessful, add caution
  Yes -> continue

Metric IVs normally distributed and linearly related to DV?
  No  -> Try:
         1. Logarithmic transformation
         2. Square root transformation
         (3. Square transformation)
         4. Inverse transformation
         If unsuccessful, add caution
  Yes -> continue

DV is homoscedastic for categories of dichotomous IVs?
  No  -> Add caution
  Yes -> continue

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 3

Slide 102

Substituting any transformed variables, run the
regression using direct entry to include all variables,
to request statistics for detecting outliers.

Are there univariate outliers (DV) or multivariate outliers (IVs)?
  Yes -> Remove outliers from data
  No  -> continue

Ratio of cases to independent variables at least 5 to 1?
  No  -> Inappropriate application of a statistic
  Yes -> Run regression again using transformed
         variables and eliminating outliers

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 4

Slide 103

Probability of ANOVA test of regression
less than/equal to level of significance?
  No  -> False
  Yes -> continue

Increase in R² correct?
  No  -> False
  Yes -> continue

Satisfies ratio for preferred sample size: 15 to 1
(stepwise: 50 to 1)?
  No  -> True with caution
  Yes -> continue

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 5

Slide 104

Other cautions added for ordinal variables
or violation of assumptions?
  Yes -> True with caution
  No  -> True
