Multiple Regression - Assumptions and Outliers
SW388R7
Data Analysis &
Computers II
Impact of transformations
and omitting outliers
Notes
Problem 1
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 2
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 3
The ... of predictors of "total family income" [income98] from the list: "sex" [sex], "how many in family earned money" [earnrs], and "income" [rincom98].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the total proportion of variance explained by the regression analysis increased by 10.8%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 10.8% higher than a regression analysis using the original format for all variables and including all cases.
We select stepwise as
the method to select the
best subset of predictors.
                                    Statistic   Std. Error
Mean                                15.67       .349
95% Confidence Interval for Mean
  Lower Bound                       14.98
  Upper Bound                       16.36
5% Trimmed Mean                     15.95
Median                              17.00
Variance                            27.951
Std. Deviation                      5.287
Minimum                             1
Maximum                             23
Range                               22
Interquartile Range                 8.00
Skewness                            -.628       .161
Kurtosis                            -.248       .320
Correlations (row and column labels not recoverable): Pearson Correlation .606* and .932*, each with Sig. (2-tailed) .000 and N 269.
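                                    Statistic   Std. Error
Mean                                1.43        .061
95% Confidence Interval for Mean
  Lower Bound                       1.31
  Upper Bound                       1.56
5% Trimmed Mean                     1.37
Median                              1.00
Variance                            1.015
Std. Deviation                      1.008
Minimum                             0
Maximum                             5
Range                               5
Interquartile Range                 1.00
Skewness                            .742        .149
Kurtosis                            1.324       .296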
The logarithmic transformation improves the normality of "how many in family earned money" [earnrs] without a reduction in the strength of the relationship to "total family income" [income98]. In evaluating normality, the skewness (-0.483) and kurtosis (-0.309) were both within the range of acceptable values from -1.0 to +1.0. The correlation coefficient for the transformed variable is 0.536.
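The normality check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the earner counts are hypothetical data, the function names are ours, and the skewness and kurtosis formulas are the usual adjusted sample estimators, which we assume match what the statistics package reports. The transformation itself follows the LG10(1+EARNRS) form used in these slides.

```python
import math

def sample_skewness(values):
    """Adjusted sample skewness (the G1 estimator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)

def sample_kurtosis(values):
    """Adjusted sample excess kurtosis (the G2 estimator)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    fourth = sum(((v - mean) / sd) ** 4 for v in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

def within_rule_of_thumb(values, low=-1.0, high=1.0):
    """True when both skewness and kurtosis fall between low and high."""
    return (low <= sample_skewness(values) <= high
            and low <= sample_kurtosis(values) <= high)

def log_transform(values):
    """LG10(1 + x); adding 1 keeps zero counts defined."""
    return [math.log10(1 + v) for v in values]

# Hypothetical earner counts, not the course data set.
earners = [0, 1, 1, 1, 1, 2, 2, 2, 3, 5]
log_earners = log_transform(earners)

print("raw:", sample_skewness(earners), sample_kurtosis(earners))
print("log:", sample_skewness(log_earners), sample_kurtosis(log_earners))
```

With real data we would also compare the correlation of the raw and transformed variable with the dependent variable, since the transformation is only worth keeping if the relationship is not weakened.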
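                                    Statistic   Std. Error
Mean                                13.35       .419
95% Confidence Interval for Mean
  Lower Bound                       12.52
  Upper Bound                       14.18
5% Trimmed Mean                     13.54
Median                              15.00
Variance                            29.535
Std. Deviation                      5.435
Minimum                             1
Maximum                             23
Range                               22
Interquartile Range                 8.00
Skewness                            -.686       .187
Kurtosis                            -.253       .373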
Correlations (Pearson Correlation)
                                                        (1)       (2)       (3)       (4)       (5)       (6)
(1) TOTAL FAMILY INCOME                                 1         .577**    -.595**   .613**    -.601**   -.434**
(2) RESPONDENTS INCOME                                  .577**    1         -.922**   .967**    -.985**   -.602**
(3) Logarithm of RINCOM98 [LG10(24-RINCOM98)]           -.595**   -.922**   1         -.976**   .974**    .848**
(4) Square of RINCOM98 [(RINCOM98)**2]                  .613**    .967**    -.976**   1         -.993**   -.718**
(5) Square Root of RINCOM98 [SQRT(24-RINCOM98)]         -.601**   -.985**   .974**    -.993**   1         .714**
(6) Inverse of RINCOM98 [-1/(24-RINCOM98)]              -.434**   -.602**   .848**    -.718**   .714**    1
Sig. (2-tailed) is .000 for every off-diagonal correlation. N is 229 for TOTAL FAMILY INCOME with itself, 163 for each pair involving TOTAL FAMILY INCOME, and 168 for all other pairs.

The evidence of linearity in the relationship between the independent variable "income" [rincom98] and the dependent variable "total family income" [income98] was the statistical significance of the correlation coefficient (r = 0.577). The probability for the ...
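The four transformations named in the table labels can be written out as a short sketch. The formulas LG10(24-x), (x)**2, SQRT(24-x) and -1/(24-x) come from the labels; the function names and the sample values are hypothetical, and we assume the constant 24 was chosen as one more than the maximum score of 23 so that the reflected value is never below 1. Reflecting the variable in this way reverses the direction of the relationship, which is consistent with the negative correlations shown for the logarithm, square root, and inverse.

```python
import math

def reflected_transformations(values, reflect_constant):
    """Transformations for a negatively skewed variable.

    reflect_constant is assumed to be the largest possible value plus one,
    so reflect_constant - x is always at least 1.
    """
    reflected = [reflect_constant - v for v in values]
    return {
        "logarithm": [math.log10(r) for r in reflected],    # LG10(k - x)
        "square": [v ** 2 for v in values],                 # (x)**2
        "square_root": [math.sqrt(r) for r in reflected],   # SQRT(k - x)
        "inverse": [-1.0 / r for r in reflected],           # -1/(k - x)
    }

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical scores on a 1-23 scale, not the course data set.
income = [1, 5, 10, 15, 20, 23]
family_income = [2, 6, 9, 14, 18, 22]

transformed = reflected_transformations(income, reflect_constant=24)
print("original:", pearson_r(income, family_income))
for name, values in transformed.items():
    print(name, pearson_r(values, family_income))
```

When comparing candidates, it is the absolute size of each correlation that matters, not its sign.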
Homoscedasticity: sex
HomoscedasticityAssumptionAndTransformations.SBS
Whenever we add
transformed variables to
the data set, we should be
sure to delete them before
starting another analysis.
Third, click on
the OK button to
complete the
specifications.
Multivariate outliers
Univariate outliers
To complete the
request, we click on
the OK button.
Third, click on
the OK button to
complete the
specifications.
Second, click on
the Continue
button to
complete the
specifications.
Descriptive Statistics
                                          Mean      Std. Deviation   N
TOTAL FAMILY INCOME                       17.09     4.073            159
RESPONDENTS SEX                           1.55      .499             159
RESPONDENTS INCOME                        13.76     5.133            159
Logarithm of EARNRS [LG10(1+EARNRS)]      .424896   .1156559         159
ANOVA (d)
Model              Sum of Squares   df    Mean Square   F         Sig.
1   Regression     1122.398         1     1122.398      117.541   .000 (a)
    Residual       1499.187         157   9.549
    Total          2621.585         158
2   Regression     1572.722         2     786.361       116.957   .000 (b)
    Residual       1048.863         156   6.723
    Total          2621.585         158
3   Regression     1623.976         3     541.325       84.107    .000 (c)
    Residual       997.609          155   6.436
    Total          2621.585         158
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX
d. Dependent Variable: TOTAL FAMILY INCOME

The probability of the F statistic (84.107) for the regression relationship which includes these variables is <0.001, less than or equal to the level of significance of 0.01. We reject the null hypothesis that there is no relationship between the best subset of independent variables and the dependent variable (R² = 0).
We support the research hypothesis that there is a statistically significant relationship between the best subset of independent variables and the dependent variable.
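As a quick check on how the ANOVA entries fit together, the arithmetic can be sketched as follows. The sums of squares and degrees of freedom are the Model 3 values from the table; the function name is ours, and the relationships used (mean square = sum of squares / df, F = regression mean square / residual mean square, R² = regression sum of squares / total sum of squares) are the standard definitions rather than anything specific to this course.

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return mean squares, the F statistic, and R squared for one model."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    f_statistic = ms_regression / ms_residual
    # Total sum of squares is regression plus residual.
    r_squared = ss_regression / (ss_regression + ss_residual)
    return ms_regression, ms_residual, f_statistic, r_squared

# Model 3 values from the ANOVA table.
ms_reg, ms_res, f_stat, r2 = anova_summary(1623.976, 997.609, 3, 155)

print("Mean square (regression):", round(ms_reg, 3))
print("Mean square (residual):", round(ms_res, 3))
print("F:", round(f_stat, 3))
print("R squared:", round(r2, 3))
```

The results agree with the table to rounding, which is a useful sanity check when copying values out of the output by hand.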
Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .654 (a)   .428       .424                3.090
2       .775 (b)   .600       .595                2.593
3       .787 (c)   .619       .612                2.537
c. Predictors: (Constant), RESPONDENTS INCOME, Logarithm of EARNRS [LG10(1+EARNRS)], RESPONDENTS SEX

Prior to any transformations of variables to satisfy the assumptions of multiple regression or removal of outliers, the proportion of variance in the dependent variable explained by the independent variables (R²) was 51.1%.
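The comparison the problem asks for is simple arithmetic, sketched here with the values reported on these slides: a baseline R² of 51.1%, and an R² of .619 for Model 3 with 159 cases and 3 predictors. The function names are ours, and the adjusted R² formula is the standard one, which we assume is what the Model Summary reports.

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjustment of R squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

def r_squared_increase(r2_baseline, r2_revised):
    """Difference in proportion of variance explained."""
    return r2_revised - r2_baseline

# Values taken from the Model Summary and the baseline regression.
r2_baseline = 0.511
r2_revised = 0.619
n_cases = 159

print("Adjusted R squared, model 3:",
      round(adjusted_r_squared(r2_revised, n_cases, 3), 3))
print("Increase in R squared:",
      round(r_squared_increase(r2_baseline, r2_revised), 3))
```

The increase of 0.108 corresponds to the 10.8% stated in the problem.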
Problem 2
True
True with caution
False
Inappropriate application of a statistic
Dissecting problem 2 - 1
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 2
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 2 - 3
The ... of "age" [age], "highest year of school completed" [educ], and "sex" [sex] to the dependent variable "occupational prestige score" [prestg80].
After substituting transformed variables to satisfy regression assumptions and removing outliers, the proportion of variance explained by the regression analysis increased by 3.6%.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Specifically, the question asks whether or not the R² for a regression analysis after substituting transformed variables and eliminating outliers is 3.6% higher than a regression analysis using the original format for all variables and including all cases.
Descriptives
AGE OF RESPONDENT                   Statistic   Std. Error
Mean                                45.99       1.023
95% Confidence Interval for Mean
  Lower Bound                       43.98
  Upper Bound                       48.00
5% Trimmed Mean                     45.31
Median                              43.50
Variance                            282.465
Std. Deviation                      16.807
Minimum                             19
Maximum                             89
Range                               70
Interquartile Range                 24.00
Skewness                            .595        .148
Kurtosis                            -.351       .295

The independent variable "age" [age] satisfies the criteria for the assumption of normality, but does not satisfy the assumption of linearity with the dependent variable "occupational prestige score" [prestg80].
In evaluating normality, the skewness (0.595) and kurtosis (-0.351) were both within the range of acceptable values from -1.0 to +1.0.
Correlations (Pearson Correlation)
                                               (1)      (2)      (3)      (4)      (5)      (6)
(1) RS OCCUPATIONAL PRESTIGE SCORE (1980)      1        .024     .059     -.004    .041     .096
(2) AGE OF RESPONDENT                          .024     1        .979**   .983**   .995**   .916**
(3) Logarithm of AGE [LG10(AGE)]               .059     .979**   1        .926**   .994**   .978**
(4) Square of AGE [(AGE)**2]                   -.004    .983**   .926**   1        .960**   .832**
(5) Square Root of AGE [SQRT(AGE)]             .041     .995**   .994**   .960**   1        .951**
(6) Inverse of AGE [-1/(AGE)]                  .096     .916**   .978**   .832**   .951**   1
Sig. (2-tailed) for correlations with RS OCCUPATIONAL PRESTIGE SCORE (1980): AGE .706, Logarithm .348, Square .956, Square Root .518, Inverse .128. Sig. (2-tailed) is .000 for all other off-diagonal correlations. N is 255 for each pair involving the prestige score and 270 for all other pairs.

The evidence of nonlinearity in the relationship between the independent variable "age" [age] and the dependent variable "occupational prestige score" [prestg80] was the lack of statistical significance of the correlation coefficient (r = 0.024). The probability for the correlation coefficient was 0.706, greater than the level of significance of 0.01. We cannot reject the null hypothesis that r = 0, and cannot conclude that there is a linear relationship between the variables.
Since none of the transformations to improve linearity were successful, it is an indication that the problem may be a weak relationship, rather than a curvilinear relationship correctable by using a transformation. A weak relationship is not a violation of the assumption of linearity, and does not require a caution.
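The decision about whether a correlation supports linearity can be sketched as a small function. The test statistic t = r * sqrt((n - 2) / (1 - r²)) is the standard one for testing r = 0. To keep the sketch free of third-party libraries, the probability is approximated with the standard normal distribution instead of the t distribution; that substitution is our assumption, it is reasonable only for large samples such as the ones here, and it will differ slightly from the probabilities in the output. The function names are ours.

```python
import math
from statistics import NormalDist

def correlation_p_value(r, n):
    """Approximate two-tailed probability for the null hypothesis r = 0.

    Uses the normal approximation to the t distribution, which is an
    assumption of this sketch and only suitable for large n.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * (1 - NormalDist().cdf(abs(t)))

def is_significant(r, n, alpha=0.01):
    """True when the probability is less than or equal to alpha."""
    return correlation_p_value(r, n) <= alpha

# Correlation of age with occupational prestige score: r = 0.024, N = 255.
p_age = correlation_p_value(0.024, 255)
print("age:", round(p_age, 3), is_significant(0.024, 255))

# Correlation of education with occupational prestige score: r = 0.495, N = 254.
print("educ:", is_significant(0.495, 254))
```

The approximate probability for age is close to the 0.706 shown in the table, and the conclusion at the 0.01 level is the same.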
LinearityAssumptionAndTransformations.SBS
Correlations (Pearson Correlation)
                                               (1)       (2)       (3)       (4)       (5)       (6)
(1) RS OCCUPATIONAL PRESTIGE SCORE (1980)      1         .495**    -.512**   .528**    -.518**   -.423**
(2) HIGHEST YEAR OF SCHOOL COMPLETED           .495**    1         -.920**   .980**    -.982**   -.699**
(3) Logarithm of EDUC [LG10(21-EDUC)]          -.512**   -.920**   1         -.969**   .977**    .915**
(4) Square of EDUC [(EDUC)**2]                 .528**    .980**    -.969**   1         ...       -.789**
(5) Square Root of EDUC [SQRT(21-EDUC)]        -.518**   -.982**   .977**    ...       1         .812**
(6) Inverse of EDUC [-1/(21-EDUC)]             -.423**   -.699**   .915**    -.789**   .812**    1
Sig. (2-tailed) is .000 for every off-diagonal correlation shown. N is 255 for the prestige score with itself, 254 for each pair involving the prestige score, and 269 for all other pairs. The correlation between the square and the square root of EDUC is not present in the extracted output.

The independent variable "highest year of school completed" [educ] satisfies the criteria for the assumption of linearity with the dependent variable "occupational prestige score" [prestg80], but does not satisfy the assumption of normality. The evidence of linearity in the relationship between the independent variable "highest year of school completed" [educ] and the dependent variable "occupational ...
                                    Statistic   Std. Error
Mean                                13.12       .179
95% Confidence Interval for Mean
  Lower Bound                       12.77
  Upper Bound                       13.47
5% Trimmed Mean                     13.14
Median                              13.00
Variance                            8.583
Std. Deviation                      2.930
Minimum                             2
Maximum                             20
Range                               18
Interquartile Range                 3.00
Skewness                            -.137       .149
Kurtosis                            1.246       .296
Homoscedasticity: sex
HomoscedasticityAssumptionAndTransformations.SBS
Whenever we add
transformed variables to
the data set, we should be
sure to delete them before
starting another analysis.
Third, click on
the OK button to
complete the
specifications.
To complete the
request, we click on
the OK button.
Third, click on
the OK button to
complete the
specifications.
Second, click on
the Continue
button to
complete the
specifications.
No
Inappropriate application of a statistic
Yes
Ratio of cases to independent variables at least 5 to 1?
Yes
Run baseline regression and record R² for future reference, using method for including variables identified in the research question.
No
Inappropriate application of a statistic
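The sample-size check in the steps above amounts to a single comparison. A minimal sketch, with a hypothetical function name and using the 159 cases and 3 independent variables from the first problem as the example:

```python
def meets_minimum_ratio(n_cases, n_independent_variables, minimum_ratio=5):
    """True when there are at least minimum_ratio cases per independent variable."""
    return n_cases / n_independent_variables >= minimum_ratio

# 159 valid cases and 3 independent variables gives a ratio of 53 to 1.
print(159 / 3, meets_minimum_ratio(159, 3))
```

Because removing outliers reduces the number of cases, the same check is repeated before the regression is run again.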
No
Try:
1. Logarithmic transformation
2. Square root transformation
3. Inverse transformation
If unsuccessful, add caution
Yes
No
Try:
1. Logarithmic transformation
2. Square root transformation
(3. Square transformation)
4. Inverse transformation
If unsuccessful, add caution
Yes
DV is homoscedastic for categories of dichotomous IVs?
Yes
No
Add caution
Yes
Remove outliers from data
No
Ratio of cases to independent variables at least 5 to 1?
Yes
Run regression again using transformed variables and eliminating outliers
No
Inappropriate application of a statistic
Yes
No
False
Yes
Increase in R² correct?
No
False
Yes
Yes
No
Yes
Yes
True with caution
No
True