
SW388R7
Data Analysis &
Computers II

Multiple Regression Assumptions and Outliers

Slide 1

Multiple Regression and Assumptions
Multiple Regression and Outliers
Strategy for Solving Problems
Practice Problems

SW388R7
Data Analysis &
Computers II

Multiple Regression and Assumptions

Slide 2

Multiple regression is most effective at identifying the
relationship between a dependent variable and a
combination of independent variables when its
underlying assumptions are satisfied: each of the
metric variables is normally distributed, the
relationships between metric variables are linear,
and the relationship between metric and
dichotomous variables is homoscedastic.
Failing to satisfy the assumptions does not mean that
our answer is wrong. It means that our solution may
under-report the strength of the relationships.

SW388R7
Data Analysis &
Computers II

Multiple Regression and Outliers

Slide 3

Outliers can distort the regression results. When an
outlier is included in the analysis, it pulls the
regression line towards itself. This can result in a
solution that is more accurate for the outlier, but
less accurate for all of the other cases in the data
set.
We will check for univariate outliers on the
dependent variable and multivariate outliers on the
independent variables.

SW388R7
Data Analysis &
Computers II

Relationship between assumptions and outliers

Slide 4

The problems of satisfying assumptions and detecting
outliers are intertwined. For example, if a case has
a value on the dependent variable that is an outlier,
it will affect the skew, and hence, the normality of
the distribution.
Removing an outlier may improve the distribution of
a variable.
Transforming a variable may reduce the likelihood
that the value for a case will be characterized as an
outlier.

SW388R7
Data Analysis &
Computers II

Order of analysis is important

Slide 5

The order in which we check assumptions and detect
outliers will affect our results because we may get a
different subset of cases in the final analysis.
In order to maximize the number of cases available
to the analysis, we will evaluate assumptions first.
We will substitute any transformations of variables
that enable us to satisfy the assumptions.
We will use any transformed variables that are
required in our analysis to detect outliers.

SW388R7
Data Analysis &
Computers II

Strategy for solving problems

Slide 6

Our strategy for solving problems about violations of
assumptions and outliers will include the following steps:

1. Run the type of regression specified in the problem statement on the
   variables, using the full data set.
2. Test the dependent variable for normality. If it does not satisfy the
   criteria for normality unless transformed, substitute the transformed
   variable in the remaining tests that call for the use of the dependent
   variable.
3. Test for normality, linearity, and homoscedasticity using the scripts.
   Decide which transformations should be used.
4. Substitute transformations and run the regression entering all
   independent variables, saving studentized residuals and Mahalanobis
   distance scores. Compute probabilities for D.
5. Remove the outliers (studentized residual greater than 3 or
   Mahalanobis D with p <= 0.001), and run the regression with the
   method and variables specified in the problem.
6. Compare R² for the analysis using transformed variables and omitting
   outliers (step 5) to the R² obtained for the model using all data and
   original variables (step 1).

SW388R7
Data Analysis &
Computers II

Transforming dependent variables

Slide 7

We will use the following logic to transform variables:

If the dependent variable is not normally distributed:
  Try log, square root, and inverse transformations.
  Use the first transformed variable that satisfies
  the normality criteria.
  If no transformation satisfies the normality criteria,
  use the untransformed variable and add a caution for
  violation of the assumption.
If a transformation satisfies normality, use the
transformed variable in the tests of the independent
variables.
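
As a sketch of what the script's checkbox transformations amount to in
SPSS syntax (the target names like lg_earn are illustrative; the "1 +"
and "24 -" constants follow the formulas the deck itself uses to avoid
zeros and to reflect negatively skewed variables):

* Candidate transformations, tried in order: log, square root, inverse.
* For a positively skewed variable such as earnrs, add 1 to avoid
* taking the log of zero or dividing by zero.
COMPUTE lg_earn = LG10(1 + earnrs).
COMPUTE sr_earn = SQRT(1 + earnrs).
COMPUTE in_earn = -1 / (1 + earnrs).
* For a negatively skewed variable such as rincom98 (maximum 23),
* reflect about (maximum + 1) before transforming.
COMPUTE lg_rinc = LG10(24 - rincom98).
EXECUTE.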

SW388R7
Data Analysis &
Computers II

Transforming independent variables - 1

Slide 8

If the independent variable is normally distributed and
linearly related to the dependent variable, use it as is.
If the independent variable is normally distributed but
not linearly related to the dependent variable:
  Try log, square root, square, and inverse
  transformations. Use the first transformed variable
  that satisfies the linearity criteria and does not
  violate the normality criteria.
  If no transformation satisfies the linearity criteria
  and does not violate the normality criteria, use the
  untransformed variable and add a caution for
  violation of the assumption.

SW388R7
Data Analysis &
Computers II

Transforming independent variables - 2

Slide 9

If the independent variable is linearly related to the
dependent variable but not normally distributed:
  Try log, square root, and inverse transformations.
  Use the first transformed variable that satisfies
  the normality criteria, still has a significant
  correlation, and does not reduce the correlation.
  If no transformation satisfies the normality criteria
  with a significant correlation, use the untransformed
  variable and add a caution for violation of the
  assumption.

SW388R7
Data Analysis &
Computers II

Transforming independent variables - 3

Slide 10

If the independent variable is not linearly related to the
dependent variable and not normally distributed:
  Try log, square root, square, and inverse
  transformations. Use the first transformed variable
  that satisfies the normality criteria and has a
  significant correlation.
  If no transformation satisfies the normality criteria
  with a significant correlation, use the untransformed
  variable and add a caution for violation of the
  assumption.

Impact of transformations
and omitting outliers

SW388R7
Data Analysis &
Computers II
Slide 11

We evaluate the regression assumptions and detect
outliers with a view toward strengthening the
relationship.
This may not happen. The regression may be the
same, it may be weaker, or it may be stronger. We
cannot be certain of the impact until we run the
regression again.
In the end, we may opt not to exclude outliers and
not to employ transformations; the analysis informs
us of the consequences of doing either.

SW388R7
Data Analysis &
Computers II

Notes

Slide 12

Whenever you start a new problem, make sure you
have removed variables created for previous analyses
and have included all cases back into the data set.
I have added the square transformation to the
checkboxes for transformations in the normality
script. Since this is an option for linearity, we need
to be able to evaluate its impact on normality.

If you change the options for output in pivot tables
from labels to names, you will get an error message
when you use the linearity script. To solve the
problem, change the option for output in pivot tables
back to labels.

SW388R7
Data Analysis &
Computers II

Problem 1

Slide 13

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7
Data Analysis &
Computers II

Dissecting problem 1 - 1

Slide 14

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.01 as alpha for the
regression analysis as well as for testing assumptions.

SW388R7
Data Analysis &
Computers II

Dissecting problem 1 - 2

Slide 15

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The method for selecting variables is derived from the research question.
In this problem we are asked to identify the best subset of predictors,
so we do a stepwise multiple regression.

SW388R7
Data Analysis &
Computers II

Dissecting problem 1 - 3

Slide 16

In the dataset GSS2000.sav, is the following statement true, false, or an
incorrect application of a statistic? Assume that there is no problem with
missing data. Use a level of significance of 0.01 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to identify the best subset of predictors
of "total family income" [income98] from the list: "sex" [sex], "how many
in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions
and removing outliers, the total proportion of variance explained by the
regression analysis increased by 10.8%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The purpose of testing for assumptions and outliers is to identify a
stronger model. The main question to be answered in this problem is
whether or not the use of transformed variables to satisfy assumptions
and the removal of outliers improves the overall relationship between
the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression
analysis after substituting transformed variables and eliminating
outliers is 10.8% higher than a regression analysis using the original
format for all variables and including all cases.

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 17

To start out, we run a stepwise multiple regression
analysis with income98 as the dependent variable
and sex, earnrs, and rincom98 as the independent
variables.
We select stepwise as the method to select the
best subset of predictors.
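
For reference, the dialog selections above correspond roughly to the
following SPSS syntax (a sketch; the PIN/POUT entry and removal criteria
shown are SPSS defaults, not values given in the slides):

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT income98
  /METHOD=STEPWISE sex earnrs rincom98.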

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 18

Prior to any transformations of variables
to satisfy the assumptions of multiple
regression or removal of outliers, the
proportion of variance in the dependent
variable explained by the independent
variables (R²) was 51.1%. This is the
benchmark that we will use to evaluate
the utility of transformations and the
elimination of outliers.

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 19

For this particular question, we are not interested in the
statistical significance of the overall relationship prior to
transformations and removing outliers. In fact, it is
possible that the relationship is not statistically significant
due to variables that are not normal, relationships that
are not linear, and the inclusion of outliers.

SW388R7
Data Analysis &
Computers II
Slide 20

Normality of the dependent variable:
total family income

In evaluating assumptions, the first step is to
examine the normality of the dependent
variable. If it is not normally distributed, or
cannot be normalized with a transformation, it
can affect the relationships with all other
variables.

To test the normality of the dependent
variable, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the list box of variables to test.

Second, click on the OK button to
produce the output.
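
The normality script reports the same kind of statistics we could obtain
directly from the EXAMINE procedure; a minimal equivalent sketch, assuming
the GSS2000.sav variable names used in the slides:

EXAMINE VARIABLES=income98
  /PLOT HISTOGRAM NPPLOT
  /STATISTICS DESCRIPTIVES.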

SW388R7
Data Analysis &
Computers II
Slide 21

Normality of the dependent variable:
total family income

Descriptives: TOTAL FAMILY INCOME
                          Statistic   Std. Error
Mean                        15.67        .349
95% CI for Mean (Lower)     14.98
95% CI for Mean (Upper)     16.36
5% Trimmed Mean             15.95
Median                      17.00
Variance                    27.951
Std. Deviation               5.287
Minimum                          1
Maximum                         23
Range                           22
Interquartile Range           8.00
Skewness                     -.628        .161
Kurtosis                     -.248        .320

The dependent variable "total family income"
[income98] satisfies the criteria for a normal
distribution. The skewness (-0.628) and kurtosis
(-0.248) were both between -1.0 and +1.0. No
transformation is necessary.

SW388R7
Data Analysis &
Computers II
Slide 22

Linearity and independent variable:
how many in family earned money

To evaluate the linearity of the relationship
between number of earners and total family
income, run the script for the assumption of
linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the text box for the
dependent variable.

Second, move the independent variable,
EARNRS, to the list box for independent
variables.

Third, click on the OK button to
produce the output.
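
The linearity script essentially computes the candidate transformations
and correlates each with the dependent variable. A hand-rolled equivalent
might look like this (the lg_earn, sq_earn, etc. names are illustrative;
the formulas are the ones shown in the script's output):

COMPUTE lg_earn = LG10(1 + earnrs).
COMPUTE sq_earn = earnrs ** 2.
COMPUTE sr_earn = SQRT(1 + earnrs).
COMPUTE in_earn = -1 / (1 + earnrs).
EXECUTE.
CORRELATIONS
  /VARIABLES=income98 WITH earnrs lg_earn sq_earn sr_earn in_earn
  /PRINT=TWOTAIL.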

SW388R7
Data Analysis &
Computers II
Slide 23

Linearity and independent variable:
how many in family earned money

Correlations with TOTAL FAMILY INCOME [income98] (N = 228):
HOW MANY IN FAMILY EARNED MONEY [earnrs]     r = .505**
Logarithm of EARNRS [LG10(1+EARNRS)]         r = .536**
Square of EARNRS [(EARNRS)**2]               r = .376**
Square Root of EARNRS [SQRT(1+EARNRS)]       r = .527**
Inverse of EARNRS [-1/(1+EARNRS)]            r = .526**
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "how many in family earned money"
[earnrs] satisfies the criteria for the assumption of linearity
with the dependent variable "total family income" [income98],
but does not satisfy the assumption of normality.
The evidence of linearity in the relationship between the
independent variable "how many in family earned money" [earnrs]
and the dependent variable "total family income" [income98] was
the statistical significance of the correlation coefficient
(r = 0.505). The probability for the correlation coefficient
was <0.001, less than or equal to the level of significance of
0.01. We reject the null hypothesis that r = 0 and conclude
that there is a linear relationship between the variables.

SW388R7
Data Analysis &
Computers II
Slide 24

Normality of independent variable:
how many in family earned money

After evaluating the dependent variable, we
examine the normality of each metric
variable and linearity of its relationship with
the dependent variable.
To test the normality of number of earners in
family, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
EARNRS to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 25

Normality of independent variable:
how many in family earned money

Descriptives: HOW MANY IN FAMILY EARNED MONEY
                          Statistic   Std. Error
Mean                         1.43        .061
95% CI for Mean (Lower)      1.31
95% CI for Mean (Upper)      1.56
5% Trimmed Mean              1.37
Median                       1.00
Variance                    1.015
Std. Deviation              1.008
Minimum                         0
Maximum                         5
Range                           5
Interquartile Range          1.00
Skewness                     .742        .149
Kurtosis                    1.324        .296

The independent variable "how many in family earned money" [earnrs]
satisfies the criteria for the assumption of linearity with the dependent
variable "total family income" [income98], but does not satisfy the
assumption of normality.
In evaluating normality, the skewness (0.742) was between -1.0 and
+1.0, but the kurtosis (1.324) was outside the range from -1.0 to +1.0.

SW388R7
Data Analysis &
Computers II
Slide 26

Normality of independent variable:
how many in family earned money

The logarithmic transformation improves the normality
of "how many in family earned money" [earnrs] without
a reduction in the strength of the relationship to
"total family income" [income98]. In evaluating
normality, the skewness (-0.483) and kurtosis (-0.309)
were both within the range of acceptable values from
-1.0 to +1.0. The correlation coefficient for the
transformed variable is 0.536.

The square root transformation also has values of
skewness and kurtosis in the acceptable range.
However, by our order of preference for which
transformation to use, the logarithm is preferred.

Transformation for how many in family earned money

SW388R7
Data Analysis &
Computers II
Slide 27

The independent variable, how many in family
earned money, had a linear relationship to the
dependent variable, total family income.
The logarithmic transformation improves the
normality of "how many in family earned money"
[earnrs] without a reduction in the strength of the
relationship to "total family income" [income98].
We will substitute the logarithmic transformation of
how many in family earned money in the regression
analysis.

SW388R7
Data Analysis &
Computers II
Slide 28

Normality of independent variable:
respondents income

After evaluating the dependent variable, we
examine the normality of each metric
variable and linearity of its relationship with
the dependent variable.
To test the normality of respondents income,
run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
RINCOM98 to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 29

Normality of independent variable:
respondents income

Descriptives: RESPONDENTS INCOME
                          Statistic   Std. Error
Mean                        13.35        .419
95% CI for Mean (Lower)     12.52
95% CI for Mean (Upper)     14.18
5% Trimmed Mean             13.54
Median                      15.00
Variance                   29.535
Std. Deviation              5.435
Minimum                         1
Maximum                        23
Range                          22
Interquartile Range          8.00
Skewness                    -.686        .187
Kurtosis                    -.253        .373

The independent variable "income" [rincom98] satisfies the criteria for
both the assumption of normality and the assumption of linearity with
the dependent variable "total family income" [income98].
In evaluating normality, the skewness (-0.686) and kurtosis (-0.253)
were both within the range of acceptable values from -1.0 to +1.0.

SW388R7
Data Analysis &
Computers II
Slide 30

Linearity and independent variable:
respondents income

To evaluate the linearity of the relationship
between respondents income and total
family income, run the script for the
assumption of linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the text box for the
dependent variable.

Second, move the independent variable,
RINCOM98, to the list box for independent
variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 31

Linearity and independent variable:
respondents income

Correlations with TOTAL FAMILY INCOME [income98] (N = 163):
RESPONDENTS INCOME [rincom98]                 r =  .577**
Logarithm of RINCOM98 [LG10(24-RINCOM98)]     r = -.595**
Square of RINCOM98 [(RINCOM98)**2]            r =  .613**
Square Root of RINCOM98 [SQRT(24-RINCOM98)]   r = -.601**
Inverse of RINCOM98 [-1/(24-RINCOM98)]        r = -.434**
**. Correlation is significant at the 0.01 level (2-tailed).

The evidence of linearity in the relationship between the
independent variable "income" [rincom98] and the dependent
variable "total family income" [income98] was the statistical
significance of the correlation coefficient (r = 0.577). The
probability for the correlation coefficient was <0.001, less
than or equal to the level of significance of 0.01. We reject
the null hypothesis that r = 0 and conclude that there is a
linear relationship between the variables.

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 32

To evaluate the homoscedasticity of the
relationship between sex and total family
income, run the script for the assumption of
homogeneity of variance:
HomoscedasticityAssumptionAndTransformations.SBS

First, move the dependent variable
INCOME98 to the text box for the
dependent variable.

Second, move the independent variable,
SEX, to the list box for independent variables.

Third, click on the OK button to
produce the output.
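
The homoscedasticity script reports the Levene statistic; the same test
can be requested directly. A sketch using ONEWAY, whose homogeneity-of-
variance option computes Levene's test for income98 across the categories
of sex:

ONEWAY income98 BY sex
  /STATISTICS HOMOGENEITY.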

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 33

Based on the Levene Test, the variance in
"total family income" [income98] is
homogeneous for the categories of "sex" [sex].
The probability associated with the Levene
Statistic (0.031) is greater than the level of
significance (0.01), so we fail to reject the
null hypothesis and conclude that the
homoscedasticity assumption is satisfied.

SW388R7
Data Analysis &
Computers II

Adding a transformed variable

Slide 34

Even though we do not need a transformation
for any of the variables in this analysis, we
will demonstrate how to use a script, such as
the normality script, to add a transformed
variable to the data set, e.g. a logarithmic
transformation for highest year of school.

First, move the variable that we want to
transform to the list box of variables to test.

Second, mark the checkbox for the
transformation we want to add to the data
set, and clear the other checkboxes.

Third, clear the checkbox for Delete
transformed variables from the data. This
will save the transformed variable.

Fourth, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

The transformed variable in the data editor

Slide 35

If we scroll to the extreme right in the data
editor, we see that the transformed variable
has been added to the data set.

Whenever we add transformed variables to
the data set, we should be sure to delete
them before starting another analysis.

SW388R7
Data Analysis &
Computers II

The regression to identify outliers

Slide 36

We use the regression procedure to identify
both univariate and multivariate outliers.
We start with the same dialog we used for
the last analysis, in which income98 was the
dependent variable and sex, earnrs, and
rincom98 were the independent variables.

First, we substitute the logarithmic
transformation of earnrs, logearn, into the
list of independent variables.

Second, we change the method of entry from
Stepwise to Enter so that all variables will be
included in the detection of outliers.

Third, we want to save the calculated values
of the outlier statistics to the data set.
Click on the Save button to specify what we
want to save.

SW388R7
Data Analysis &
Computers II

Saving the measures of outliers

Slide 37

First, mark the checkbox for Studentized
residuals in the Residuals panel. Studentized
residuals are z-scores computed for a case
based on the data for all other cases in the
data set.

Second, mark the checkbox for Mahalanobis
in the Distances panel. This will compute
Mahalanobis distances for the set of
independent variables.

Third, click on the OK button to complete
the specifications.
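
Together, the last two dialogs correspond to syntax along these lines (a
sketch; logearn is the name the deck uses for the saved logarithmic
transformation of earnrs):

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT income98
  /METHOD=ENTER sex logearn rincom98
  /SAVE SRESID MAHAL.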

SW388R7
Data Analysis &
Computers II

The variables for identifying outliers

Slide 38

The values for identifying univariate outliers
on the dependent variable are in a column
which SPSS has named sre_1.

The values for identifying multivariate
outliers on the independent variables are in
a column which SPSS has named mah_1.

SW388R7
Data Analysis &
Computers II

Computing the probability for Mahalanobis D

Slide 39

To compute the probability of D, we will use
an SPSS function in a Compute command.

First, select the Compute command from
the Transform menu.

SW388R7
Data Analysis &
Computers II

Formula for probability for Mahalanobis D

Slide 40

First, in the target variable text box, type the
name "p_mah_1" as an acronym for the probability
of mah_1, the Mahalanobis D score.

Second, to complete the specifications for the
CDF.CHISQ function, type the name of the variable
containing the D scores, mah_1, followed by a
comma, followed by the number of variables used
in the calculations, 3.

Third, click on the OK button to signal
completion of the Compute Variable dialog.

Since the CDF function (cumulative distribution
function) computes the cumulative probability from
the left end of the distribution up through a given
value, we subtract it from 1 to obtain the
probability in the upper tail of the distribution.
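
The completed Compute dialog is equivalent to this one-line syntax (3 is
the number of independent variables, the degrees of freedom for the
chi-square distribution of D):

COMPUTE p_mah_1 = 1 - CDF.CHISQ(mah_1, 3).
EXECUTE.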

SW388R7
Data Analysis &
Computers II

Multivariate outliers

Slide 41

Using the probabilities computed in p_mah_1
to identify outliers, scroll down through the list
of cases to see if we can find cases with a
probability less than 0.001.
There are no outliers for the set of
independent variables.

SW388R7
Data Analysis &
Computers II

Univariate outliers

Slide 42

Similarly, we can scroll down the values of
sre_1, the studentized residuals, to find
outliers with values larger than 3.0
(regardless of sign).

Based on this criterion, there are 4 outliers.
There are 4 cases that have a score on the
dependent variable that is sufficiently
unusual to be considered outliers
(case 20000357: studentized residual=3.08;
case 20000416: studentized residual=3.57;
case 20001379: studentized residual=3.27;
case 20002702: studentized residual=-3.23).

SW388R7
Data Analysis &
Computers II

Omitting the outliers

Slide 43

To omit the outliers from the analysis, we
select the cases that are not outliers.

First, select the Select Cases command
from the Data menu.

SW388R7
Data Analysis &
Computers II

Specifying the condition to omit outliers

Slide 44

First, mark the If condition is satisfied
option button to indicate that we will enter
a specific condition for including cases.

Second, click on the If button to specify
the criteria for inclusion in the analysis.

SW388R7
Data Analysis &
Computers II

The formula for omitting outliers

Slide 45

To eliminate the outliers, we request the
cases that are not outliers.
The formula specifies that we should include
cases if the studentized residual (regardless
of sign) is less than 3 and the probability for
Mahalanobis D is higher than the level of
significance, 0.001.

After typing in the formula, click on the
Continue button to close the dialog box.
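
Pasted as syntax, the Select Cases dialog produces a filter like the
following sketch (filter_$ is the name SPSS assigns to the filter
variable it creates):

USE ALL.
COMPUTE filter_$ = (ABS(sre_1) < 3 AND p_mah_1 > 0.001).
FILTER BY filter_$.
EXECUTE.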

SW388R7
Data Analysis &
Computers II

Completing the request for the selection

Slide 46

To complete the
request, we click on
the OK button.

SW388R7
Data Analysis &
Computers II

The omitted outliers

Slide 47

SPSS identifies the excluded cases by
drawing a slash mark through the case
number. Most of the slashes are for cases
with missing data, but we also see that the
cases identified as outliers are included in
those that will be omitted.

SW388R7
Data Analysis &
Computers II

Running the regression without outliers

Slide 48

We run the regression again, excluding the
outliers.
Select the Regression | Linear command
from the Analyze menu.

SW388R7
Data Analysis &
Computers II

Opening the save options dialog

Slide 49

We specify the dependent and independent
variables, substituting any transformed
variables required by assumptions.

When we used regression to detect outliers,
we entered all variables. Now we are testing
the relationship specified in the problem, so
we change the method to Stepwise.

On our last run, we instructed SPSS to save
studentized residuals and Mahalanobis
distance. To prevent these values from being
calculated again, click on the Save button.

SW388R7
Data Analysis &
Computers II

Clearing the request to save outlier data

Slide 50

First, clear the checkbox for Studentized
residuals.

Second, clear the checkbox for Mahalanobis
distance.

Third, click on the OK button to complete
the specifications.

SW388R7
Data Analysis &
Computers II

Opening the statistics options dialog

Slide 51

Once we have removed outliers, we need to
check the sample size requirement for
regression. Since we will need the descriptive
statistics for this, click on the Statistics
button.

SW388R7
Data Analysis &
Computers II

Requesting descriptive statistics

Slide 52

First, mark the checkbox for Descriptives.

Second, click on the Continue button to
complete the specifications.

SW388R7
Data Analysis &
Computers II

Requesting the output

Slide 53

Having specified the output needed for the
analysis, we click on the OK button to obtain
the regression output.

SW388R7
Data Analysis &
Computers II

Sample size requirement

Slide 54

The minimum ratio of valid cases to independent
variables for stepwise multiple regression is 5 to 1.
After removing 4 outliers, there are 159 valid cases
and 3 independent variables.
The ratio of cases to independent variables for this
analysis is 53.0 to 1, which satisfies the minimum
requirement. In addition, the ratio of 53.0 to 1
satisfies the preferred ratio of 50 to 1.

Descriptive Statistics
                                        Mean    Std. Deviation    N
TOTAL FAMILY INCOME                    17.09         4.073       159
RESPONDENTS SEX                         1.55          .499       159
RESPONDENTS INCOME                     13.76         5.133       159
Logarithm of EARNRS [LG10(1+EARNRS)]  .424896      .1156559      159

SW388R7
Data Analysis &
Computers II

Significance of regression relationship

Slide 55

ANOVA(d)
Model             Sum of Squares    df    Mean Square        F      Sig.
1  Regression          1122.398      1      1122.398    117.541   .000(a)
   Residual            1499.187    157         9.549
   Total               2621.585    158
2  Regression          1572.722      2       786.361    116.957   .000(b)
   Residual            1048.863    156         6.723
   Total               2621.585    158
3  Regression          1623.976      3       541.325     84.107   .000(c)
   Residual             997.609    155         6.436
   Total               2621.585    158

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME

The probability of the F statistic (84.107) for the regression
relationship which includes these variables is <0.001, less
than or equal to the level of significance of 0.01. We reject
the null hypothesis that there is no relationship between the
best subset of independent variables and the dependent
variable (R² = 0).
We support the research hypothesis that there is a
statistically significant relationship between the best subset
of independent variables and the dependent variable.

SW388R7
Data Analysis &
Computers II

Increase in proportion of variance

Slide 56

Model Summary
Model      R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .654(a)     .428           .424                    3.090
2       .775(b)     .600           .595                    2.593
3       .787(c)     .619           .612                    2.537

a. Predictors: (Constant), RESPONDENTS INCOME
b. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)]
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS
   [LG10(1+EARNRS)], RESPONDENTS SEX

Prior to any transformations of variables to satisfy
the assumptions of multiple regression or removal
of outliers, the proportion of variance in the
dependent variable explained by the independent
variables (R²) was 51.1%.

After transformed variables were substituted to
satisfy assumptions and outliers were removed
from the sample, the proportion of variance
explained by the regression analysis was 61.9%, a
difference of 10.8%.

The answer to the question is true with caution.
A caution is added because of the inclusion of
ordinal level variables.

SW388R7
Data Analysis &
Computers II

Problem 2

Slide 57

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

SW388R7
Data Analysis &
Computers II

Dissecting problem 2 - 1

Slide 58

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The problem may give us different levels of significance for the
analysis. In this problem, we are told to use 0.05 as alpha for the
regression analysis and the more conservative 0.01 as the alpha in
testing assumptions.

SW388R7
Data Analysis &
Computers II

Dissecting problem 2 - 2

Slide 59

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The method for selecting variables is derived from the research
question. If we are asked to examine a relationship without any
statement about control variables or the best subset of variables,
we do a standard multiple regression.

SW388R7
Data Analysis &
Computers II

Dissecting problem 2 - 3

Slide 60

In the dataset GSS2000.sav, is the following statement true, false, or
an incorrect application of a statistic? Assume that there is no problem
with missing data. Use a level of significance of 0.05 for the regression
analysis. Use a level of significance of 0.01 for evaluating assumptions.
The research question requires us to examine the relationship of "age"
[age], "highest year of school completed" [educ], and "sex" [sex] to the
dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression
assumptions and removing outliers, the proportion of variance
explained by the regression analysis increased by 3.6%.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The purpose of testing for assumptions and outliers is to identify a
stronger model. The main question to be answered in this problem is
whether or not the use of transformed variables to satisfy assumptions
and the removal of outliers improves the overall relationship between
the independent variables and the dependent variable, as measured by R².
Specifically, the question asks whether or not the R² for a regression
analysis after substituting transformed variables and eliminating
outliers is 3.6% higher than a regression analysis using the original
format for all variables and including all cases.

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 61

To start out, we run a standard multiple
regression analysis with prestg80 as the
dependent variable and age, educ, and sex
as the independent variables.
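
As a sketch, the standard (simultaneous-entry) regression for problem 2
corresponds to syntax like this, using the variable names from the slides:

REGRESSION
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT prestg80
  /METHOD=ENTER age educ sex.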

SW388R7
Data Analysis &
Computers II

R before transformations or removing outliers

Slide 62

Prior to any transformations of variables
to satisfy the assumptions of multiple
regression or removal of outliers, the
proportion of variance in the dependent
variable explained by the independent
variables (R²) was 27.1%. This is the
benchmark that we will use to evaluate
the utility of transformations and the
elimination of outliers.

For this particular question, we are not interested in the
statistical significance of the overall relationship prior to
transformations and removing outliers. In fact, it is
possible that the relationship is not statistically significant
due to variables that are not normal, relationships that
are not linear, and the inclusion of outliers.

SW388R7
Data Analysis &
Computers II

Normality of the dependent variable

Slide 63

In evaluating assumptions, the first step is to
examine the normality of the dependent
variable. If it is not normally distributed, or
cannot be normalized with a transformation, it
can affect the relationships with all other
variables.

To test the normality of the dependent
variable, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Normality of the dependent variable

Slide 64

The dependent variable "occupational prestige


score" [prestg80] satisfies the criteria for a
normal distribution. The skewness (0.401) and
kurtosis (-0.630) were both between -1.0 and
+1.0. No transformation is necessary.

SW388R7
Data Analysis &
Computers II

Normality of independent variable: Age

Slide 65

After evaluating the dependent variable, we
examine the normality of each metric
variable and linearity of its relationship with
the dependent variable.
To test the normality of age, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
AGE to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Normality of independent variable: Age

Slide 66

Descriptives: AGE OF RESPONDENT
                          Statistic   Std. Error
Mean                        45.99       1.023
95% CI for Mean (Lower)     43.98
95% CI for Mean (Upper)     48.00
5% Trimmed Mean             45.31
Median                      43.50
Variance                  282.465
Std. Deviation             16.807
Minimum                        19
Maximum                        89
Range                          70
Interquartile Range         24.00
Skewness                     .595        .148
Kurtosis                    -.351        .295

The independent variable "age" [age] satisfies the criteria for the
assumption of normality, but does not satisfy the assumption of
linearity with the dependent variable "occupational prestige score"
[prestg80].
In evaluating normality, the skewness (0.595) and kurtosis (-0.351)
were both within the range of acceptable values from -1.0 to +1.0.

SW388R7
Data Analysis &
Computers II

Linearity and independent variable: Age

Slide 67

To evaluate the linearity of the relationship
between age and occupational prestige, run
the script for the assumption of linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the text box for the
dependent variable.

Second, move the independent variable,
AGE, to the list box for independent variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Linearity and independent variable: Age

Slide 68

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) [prestg80]
(N = 255):
AGE OF RESPONDENT [age]           r =  .024   (Sig. .706)
Logarithm of AGE [LG10(AGE)]      r =  .059   (Sig. .348)
Square of AGE [(AGE)**2]          r = -.004   (Sig. .956)
Square Root of AGE [SQRT(AGE)]    r =  .041   (Sig. .518)
Inverse of AGE [-1/(AGE)]         r =  .096   (Sig. .128)

The evidence of nonlinearity in the relationship between the
independent variable "age" [age] and the dependent variable
"occupational prestige score" [prestg80] was the lack of
statistical significance of the correlation coefficient
(r = 0.024). The probability for the correlation coefficient
was 0.706, greater than the level of significance of 0.01. We
cannot reject the null hypothesis that r = 0, and cannot
conclude that there is a linear relationship between the
variables.
Since none of the transformations to improve linearity were
successful, it is an indication that the problem may be a weak
relationship, rather than a curvilinear relationship correctable
by using a transformation. A weak relationship is not a
violation of the assumption of linearity, and does not require
a caution.

SW388R7
Data Analysis &
Computers II

Transformation for Age

Slide 69

The independent variable age satisfied the criteria
for normality.
The independent variable age did not have a linear
relationship to the dependent variable occupational
prestige. However, none of the transformations
linearized the relationship.
No transformation will be used - it would not help
linearity and is not needed for normality.

SW388R7
Data Analysis &
Computers II
Slide 70

Linearity and independent variable:
Highest year of school completed

To evaluate the linearity of the relationship
between highest year of school and
occupational prestige, run the script for the
assumption of linearity:
LinearityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the text box for the
dependent variable.

Second, move the independent variable,
EDUC, to the list box for independent
variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 71

Linearity and independent variable:
Highest year of school completed

Correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980) [prestg80]
(N = 254):
HIGHEST YEAR OF SCHOOL COMPLETED [educ]   r =  .495**
Logarithm of EDUC [LG10(21-EDUC)]         r = -.512**
Square of EDUC [(EDUC)**2]                r =  .528**
Square Root of EDUC [SQRT(21-EDUC)]       r = -.518**
Inverse of EDUC [-1/(21-EDUC)]            r = -.423**
**. Correlation is significant at the 0.01 level (2-tailed).

The independent variable "highest year of school completed"
[educ] satisfies the criteria for the assumption of linearity
with the dependent variable "occupational prestige score"
[prestg80], but does not satisfy the assumption of normality.
The evidence of linearity in the relationship between the
independent variable "highest year of school completed" [educ]
and the dependent variable "occupational prestige score"
[prestg80] was the statistical significance of the correlation
coefficient (r = 0.495). The probability for the correlation
coefficient was <0.001, less than or equal to the level of
significance of 0.01. We reject the null hypothesis that r = 0
and conclude that there is a linear relationship between the
variables.

SW388R7
Data Analysis &
Computers II
Slide 72

Normality of independent variable:
Highest year of school completed

To test the normality of EDUC, highest year
of school completed, run the script:
NormalityAssumptionAndTransformations.SBS

First, move the independent variable
EDUC to the list box of variables to test.

Second, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II
Slide 73

Normality of independent variable:
Highest year of school completed

Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED
                          Statistic   Std. Error
Mean                        13.12        .179
95% CI for Mean (Lower)     12.77
95% CI for Mean (Upper)     13.47
5% Trimmed Mean             13.14
Median                      13.00
Variance                    8.583
Std. Deviation              2.930
Minimum                         2
Maximum                        20
Range                          18
Interquartile Range          3.00
Skewness                    -.137        .149
Kurtosis                    1.246        .296

In evaluating normality, the skewness (-0.137) was between -1.0
and +1.0, but the kurtosis (1.246) was outside the range from -1.0
to +1.0. None of the transformations for normalizing the distribution
of "highest year of school completed" [educ] were effective.

SW388R7
Data Analysis &
Computers II

Transformation for highest year of school

Slide 74

The independent variable, highest year of school,


had a linear relationship to the dependent variable,
occupational prestige.
The independent variable, highest year of school, did
not satisfy the criteria for normality. None of the
transformations for normalizing the distribution of
"highest year of school completed" [educ] were
effective.
No transformation will be used - it would not help
normality and is not needed for linearity. A caution
should be added to any findings.

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 75

To evaluate the homoscedasticity of the
relationship between sex and occupational
prestige, run the script for the assumption of
homogeneity of variance:
HomoscedasticityAssumptionAndTransformations.SBS

First, move the dependent variable
PRESTG80 to the text box for the
dependent variable.

Second, move the independent variable,
SEX, to the list box for independent variables.

Third, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

Homoscedasticity: sex

Slide 76

Based on the Levene Test, the variance in
"occupational prestige score" [prestg80] is
homogeneous for the categories of "sex" [sex].
The probability associated with the Levene
Statistic (0.808) is greater than the level of
significance (0.01), so we fail to reject the
null hypothesis and conclude that the
homoscedasticity assumption is satisfied.
Even if we violated the assumption, we would
not do a transformation, since it could impact
the relationships of the other independent
variables with the dependent variable.

SW388R7
Data Analysis &
Computers II

Adding a transformed variable

Slide 77

Even though we do not need a transformation
for any of the variables in this analysis, we
will demonstrate how to use a script, such as
the normality script, to add a transformed
variable to the data set, e.g. a logarithmic
transformation for highest year of school.

First, move the variable that we want to
transform to the list box of variables to test.

Second, mark the checkbox for the
transformation we want to add to the data
set, and clear the other checkboxes.

Third, clear the checkbox for Delete
transformed variables from the data. This
will save the transformed variable.

Fourth, click on the OK button to
produce the output.

SW388R7
Data Analysis &
Computers II

The transformed variable in the data editor

Slide 78

If we scroll to the extreme right in the data
editor, we see that the transformed variable
has been added to the data set.

Whenever we add transformed variables to
the data set, we should be sure to delete
them before starting another analysis.

SW388R7
Data Analysis &
Computers II

The regression to identify outliers

Slide 79

We can use the regression procedure to
identify both univariate and multivariate
outliers.
We start with the same dialog we used for
the last analysis, in which prestg80 was the
dependent variable and age, educ, and sex
were the independent variables.
If we needed to use any transformed
variables, we would substitute them now.

We will save the calculated values of the
outlier statistics to the data set.
Click on the Save button to specify what
we want to save.

SW388R7
Data Analysis &
Computers II

Saving the measures of outliers

Slide 80

First, mark the checkbox for Studentized
residuals in the Residuals panel. Studentized
residuals are z-scores computed for a case
based on the data for all other cases in the
data set.

Second, mark the checkbox for Mahalanobis
in the Distances panel. This will compute
Mahalanobis distances for the set of
independent variables.

Third, click on the OK button to complete
the specifications.

SW388R7
Data Analysis &
Computers II

The variables for identifying outliers

Slide 81

The values for identifying univariate outliers
on the dependent variable are in a column
which SPSS has named sre_1.

The values for identifying multivariate
outliers on the independent variables are in
a column which SPSS has named mah_1.

SW388R7
Data Analysis &
Computers II

Computing the probability for Mahalanobis D

Slide 82

To compute the probability of D, we will use
an SPSS function in a Compute command.

First, select the Compute command from
the Transform menu.

SW388R7
Data Analysis &
Computers II

Formula for probability for Mahalanobis D

Slide 83

First, in the target variable text box, type the
name "p_mah_1" as an acronym for the probability
of mah_1, the Mahalanobis D score.

Second, to complete the specifications for the
CDF.CHISQ function, type the name of the variable
containing the D scores, mah_1, followed by a
comma, followed by the number of variables used
in the calculations, 3.

Third, click on the OK button to signal
completion of the Compute Variable dialog.

Since the CDF function (cumulative distribution
function) computes the cumulative probability from
the left end of the distribution up through a given
value, we subtract it from 1 to obtain the
probability in the upper tail of the distribution.

SW388R7
Data Analysis &
Computers II

The multivariate outlier

Slide 84

Using the probabilities computed in p_mah_1
to identify outliers, scroll down through the list
of cases to see the one case with a probability
less than 0.001.
There is 1 case that has a combination of
scores on the independent variables that is
sufficiently unusual to be considered an outlier
(case 20001984: Mahalanobis D=16.97,
p=0.0007).

SW388R7
Data Analysis &
Computers II

The univariate outlier

Slide 85

Similarly, we can scroll down the values of
sre_1, the studentized residuals, to see the
one outlier with a value larger than 3.0.
There is 1 case that has a score on the
dependent variable that is sufficiently
unusual to be considered an outlier (case
20000391: studentized residual=4.14).

SW388R7
Data Analysis &
Computers II

Omitting the outliers

Slide 86

To omit the outliers from the analysis, we
select the cases that are not outliers.

First, select the Select Cases command
from the Data menu.

SW388R7
Data Analysis &
Computers II

Specifying the condition to omit outliers

Slide 87

First, mark the If condition is satisfied
option button to indicate that we will enter
a specific condition for including cases.

Second, click on the If button to specify
the criteria for inclusion in the analysis.

SW388R7
Data Analysis &
Computers II

The formula for omitting outliers

Slide 88

To eliminate the outliers, we request the
cases that are not outliers.
The formula specifies that we should include
cases if the studentized residual (regardless
of sign) is less than 3 and the probability for
Mahalanobis D is higher than the level of
significance, 0.001.

After typing in the formula, click on the
Continue button to close the dialog box.

SW388R7
Data Analysis &
Computers II

Completing the request for the selection

Slide 89

To complete the
request, we click on
the OK button.

SW388R7
Data Analysis &
Computers II

The omitted multivariate outlier

Slide 90

SPSS identifies the excluded cases by
drawing a slash mark through the case
number. Most of the slashes are for cases
with missing data, but we also see that the
case with the low probability for Mahalanobis
distance is included in those that will be
omitted.

SW388R7
Data Analysis &
Computers II

Running the regression without outliers

Slide 91

We run the regression again, excluding the
outliers.
Select the Regression | Linear command
from the Analyze menu.

SW388R7
Data Analysis &
Computers II

Opening the save options dialog

Slide 92

We specify the dependent and independent
variables. If we wanted to use any
transformed variables, we would substitute
them now.

On our last run, we instructed SPSS to save
studentized residuals and Mahalanobis
distance. To prevent these values from being
calculated again, click on the Save button.

SW388R7
Data Analysis &
Computers II

Clearing the request to save outlier data

Slide 93

First, clear the checkbox for Studentized
residuals.

Second, clear the checkbox for Mahalanobis
distance.

Third, click on the OK button to complete
the specifications.

SW388R7
Data Analysis &
Computers II

Opening the statistics options dialog

Slide 94

Once we have removed outliers, we need to
check the sample size requirement for
regression. Since we will need the descriptive
statistics for this, click on the Statistics
button.

SW388R7
Data Analysis &
Computers II

Requesting descriptive statistics

Slide 95

First, mark the checkbox for Descriptives.

Second, click on the Continue button to
complete the specifications.

SW388R7
Data Analysis &
Computers II

Requesting the output

Slide 96

Having specified the output needed for the
analysis, we click on the OK button to obtain
the regression output.

SW388R7
Data Analysis &
Computers II

Sample size requirement

Slide 97

The minimum ratio of valid cases to independent
variables for multiple regression is 5 to 1. After
removing 2 outliers, there are 252 valid cases and
3 independent variables.
The ratio of cases to independent variables for this
analysis is 84.0 to 1, which satisfies the minimum
requirement. In addition, the ratio of 84.0 to 1
satisfies the preferred ratio of 15 to 1.

SW388R7
Data Analysis &
Computers II

Significance of regression relationship

Slide 98

The probability of the F statistic (36.639) for the
overall regression relationship is <0.001, less than
or equal to the level of significance of 0.05. We
reject the null hypothesis that there is no
relationship between the set of independent
variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a
statistically significant relationship between the
set of independent variables and the dependent
variable.

SW388R7
Data Analysis &
Computers II

Increase in proportion of variance

Slide 99

Prior to any transformations of variables to satisfy
the assumptions of multiple regression or removal
of outliers, the proportion of variance in the
dependent variable explained by the independent
variables (R²) was 27.1%. No transformed
variables were substituted to satisfy assumptions,
but outliers were removed from the sample.
The proportion of variance explained by the
regression analysis after removing outliers was
30.7%, a difference of 3.6%.

The answer to the question is true with caution.
A caution is added because of a violation of
regression assumptions.

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 1

Slide 100

The following is a guide to the decision process for answering
problems about the impact of assumptions and outliers on analysis:

Dependent variable metric?
Independent variables metric or dichotomous?
  No  -> Inappropriate application of a statistic
  Yes -> Ratio of cases to independent variables at least 5 to 1?
           No  -> Inappropriate application of a statistic
           Yes -> Run baseline regression and record R² for future
                  reference, using the method for including variables
                  identified in the research question.

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 2

Slide 101

Is the dependent variable normally distributed?
  No  -> Try:
         1. Logarithmic transformation
         2. Square root transformation
         3. Inverse transformation
         If unsuccessful, add caution
  Yes -> continue

Metric IVs normally distributed and linearly related to DV?
  No  -> Try:
         1. Logarithmic transformation
         2. Square root transformation
         (3. Square transformation)
         4. Inverse transformation
         If unsuccessful, add caution
  Yes -> continue

DV is homoscedastic for categories of dichotomous IVs?
  No  -> Add caution
  Yes -> continue

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 3

Slide 102

Substituting any transformed variables, run the
regression using direct entry to include all variables,
to request statistics for detecting outliers.

Are there univariate outliers (DV) or multivariate outliers (IVs)?
  Yes -> Remove outliers from data
  No  -> continue

Ratio of cases to independent variables at least 5 to 1?
  No  -> Inappropriate application of a statistic
  Yes -> Run regression again using transformed
         variables and eliminating outliers

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 4

Slide 103

Probability of ANOVA test of regression
less than/equal to level of significance?
  No  -> False
  Yes -> continue

Increase in R² correct?
  No  -> False
  Yes -> continue

Satisfies ratio for preferred sample size: 15 to 1
(stepwise: 50 to 1)?
  No  -> True with caution
  Yes -> continue

SW388R7
Data Analysis &
Computers II

Impact of assumptions and outliers - 5

Slide 104

Other cautions added for ordinal variables
or violation of assumptions?
  Yes -> True with caution
  No  -> True
