STT153A Regression Analysis
Dataset
We are using a synthetic dataset on 999 household heads with the following variables: sex, mstatus, age, occupation, educyrs, distwork, workyrs, annualinc, famsize, and ressize.
Correlation Coefficient
The correlation coefficient r measures the linear relationship between two variables:
• -1 < r < 1
• (+) direct linear relationship
• (-) inverse linear relationship
• Correlation is NOT causation
The correlation can be visualized through scatterplots.
Interpretation Guideline
Rule of thumb:
|r| = 0.0 : no correlation
0.0 < |r| < 0.2 : very weak correlation
0.2 < |r| < 0.4 : weak correlation
0.4 < |r| < 0.6 : moderately strong correlation
0.6 < |r| < 0.8 : strong correlation
0.8 < |r| < 1.0 : very strong correlation
|r| = 1.0 : perfect correlation
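As a side note, both the coefficient and the rule of thumb above are easy to compute by hand. A minimal Python sketch (for illustration only; the course software computes this for you):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def strength(r):
    """Classify |r| using the rule of thumb above."""
    a = abs(r)
    if a >= 1.0:
        return "perfect correlation"
    if a > 0.8:
        return "very strong correlation"
    if a > 0.6:
        return "strong correlation"
    if a > 0.4:
        return "moderately strong correlation"
    if a > 0.2:
        return "weak correlation"
    if a > 0.0:
        return "very weak correlation"
    return "no correlation"
```

Boundary values are assigned to the lower category here; the guideline itself leaves them ambiguous.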
Scatterplots
[Dialog: Graphs → 2D Scatterplots. Variables: X = annualinc, Y = ressize; graph type: Regular; fit: Linear; regression bands: Off]
Scatterplots
Scatterplot of ressize against annualinc
[Scatterplot with fitted line y = -0.667 + 0.768x; data: HHData, 11v * 999c]
The scatterplot shows a trend where household heads with high annualinc tend to have high ressize.
Simple Linear Regression Model
The idea is that we can explain the relationship of one predictor X and another variable Y through the equation
Yᵢ = β₀ + β₁Xᵢ + εᵢ
with fitted (estimated) model
Ŷ = b₀ + b₁X
Simple Linear Regression Model
[Dialog: regression results options, including ANOVA (overall goodness of fit), covariance of coefficients, stepwise regression summary, and regression coefficients]
Ŷ = -0.6670 + 0.7680X
Interpretation:
R² = 0.5761: 57.61% of the total variation in ressize can be explained by annualinc.
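The least-squares estimates and R² above can be checked by hand with the closed-form formulas. A hedged Python sketch (the tiny dataset is illustrative, not the HHData set):

```python
def fit_slr(x, y):
    """Least-squares estimates: intercept b0 and slope b1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

def r_squared(x, y, b0, b1):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(y) / len(y)
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Toy data: fit gives b0 = 1.1, b1 = 0.6, R² = 0.9
b0, b1 = fit_slr([0, 1, 2, 3], [1, 2, 2, 3])
```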
Multiple Regression
With a coefficient of determination of 57.61%, the model predicting ressize using annualinc can still be improved.
For the next example, let's use all possible predictors (the FULL MODEL).
Multiple Regression
[Dialog: variable selection. Variables 1-11: Subject, sex, mstatus, age, occupation, educyrs, distwork, workyrs, annualinc, famsize, ressize. Tip: use the "Show appropriate variables only" option to pre-screen variable lists and show categorical and continuous variables. Press F1 for more information.]
Multiple Regression
Regression Summary for Dependent Variable: ressize (HHData)
R = .84942340, R² = .72152011, Adjusted R² = .71983576, F(6,992) = 428.37, p < 0.0000, Std. Error of estimate: 5.4647

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(992)     p-value
Intercept                               -7.66840  1.485763        -5.16125   0.000000
age         0.098798   0.035011         0.09146   0.032410        2.82194    0.004869
educyrs     0.008312   0.01903          0.04353   0.099686        0.43671    0.662418
distwork    0.002302   0.016876         0.00245   0.017951        0.13639    0.891544
workyrs     0.082220   0.034186         0.10509   0.043697        2.40506    0.016352
annualinc   0.545621   0.021628         0.55206   0.021883        25.22781   0.000000
famsize     0.362981   0.018436         2.84849   0.144670        19.68960   0.000000
Why are some rows highlighted RED while others are BLACK?
Examples:
• age: p-value = 0.004869 < 0.05, so we reject H0; the coefficient of age is significant.
• Backward stepwise
Starting with all possible predictors (the full model), we delete "insignificant" predictors one at a time.
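The elimination loop itself can be sketched in a few lines of Python. This is only an illustration of the control flow: the `refit` callback and the p-value lookup table below are hypothetical stand-ins for actually re-estimating the model after each drop, which the statistics software does for you.

```python
def backward_eliminate(predictors, refit, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value until every
    remaining p-value is below alpha.  `refit` stands in for re-estimating
    the model: it maps a sorted tuple of predictor names to a dict
    {predictor: p-value}."""
    current = list(predictors)
    while current:
        pvalues = refit(tuple(sorted(current)))
        worst = max(pvalues, key=pvalues.get)
        if pvalues[worst] < alpha:
            break  # everything left is "significant"
        current.remove(worst)
    return current

# Hypothetical p-values standing in for actual refits at each step
fits = {
    ("age", "distwork", "educyrs"): {"age": 0.005, "distwork": 0.89, "educyrs": 0.66},
    ("age", "educyrs"): {"age": 0.004, "educyrs": 0.70},
    ("age",): {"age": 0.003},
}
```

With this table, `backward_eliminate(["age", "distwork", "educyrs"], lambda k: fits[k])` drops distwork, then educyrs, and keeps age.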
Variable Selection and Model Building
Regression Summary for Dependent Variable: ressize (HHData)
R = .84938976, R² = .72146297, Adjusted R² = .72034210, F(4,994) = 643.66, p < 0.0000, Std. Error of estimate: 5.4597

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(994)     p-value
Intercept                               -7.11224  0.828559        -8.58387   0.000000
annualinc   0.549812   0.019475         0.55630   0.019705        28.23104   0.000000
famsize     0.362171   0.018293         2.84214   0.143592        19.79321   0.000000
age         0.100584   0.034733         0.09311   0.032153        2.89592    0.003863
workyrs     0.079783   0.033652         0.10193   0.043014        2.37035    0.017937
Backward Stepwise
Regression Summary for Dependent Variable: ressize (HHData)
R = .84846207, R² = .71988789, Adjusted R² = .71904333, F(3,995) = 852.38, p < 0.0000, Std. Error of estimate: 5.4724

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(995)     p-value
Intercept                               -8.36767  0.638746        -13.1002   0.000000
age         0.170577   0.018340         0.15791   0.016973        9.3036     0.000000
annualinc   0.546926   0.019432         0.55333   0.019712        28.0723    0.000000
famsize     0.363613   0.018330         2.85349   0.143845        19.8373    0.000000
Adjusted R-Squared
A measure of goodness-of-fit like R-Squared that also captures model parsimony (simplicity) by penalizing the number of predictors.
Formula:
Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - k - 1)
where n is the number of observations and k is the number of predictors.
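Plugging the full model's numbers into the formula above (R² = 0.72152011, n = 999, k = 6) reproduces the reported Adjusted R² of about 0.7198. A small Python helper:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1),
    where n = number of observations, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Full model from the regression summary: R² = 0.72152011, n = 999, k = 6
adj = adjusted_r_squared(0.72152011, 999, 6)  # ≈ 0.71984
```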
Comparing the candidate models: as expected, MODEL 2 has the highest R² since it has the most predictors.
However, MODEL 3 has the highest Adjusted R², making it the preferred model once parsimony is taken into account.
Residuals
e = Observed - Predicted
We can extract residuals from a chosen model.
Evidence for normality of residuals:
• Residual Histogram
• Normal Probability Plot
[Residual histogram with fitted normal curve. Would you say that the residuals (blue bars) follow a normal distribution (red curve)?]
Normal Probability Plot (Q-Q plot)
If the residuals (blue dots) follow the red line closely and consistently, then evidence points towards normality of residuals.
Tests for Normality
[Dialog: Residuals/Predicted tab, "Save residuals & predicted"]
First, save the residuals as a dataset.
[Spreadsheet: saved residuals for the 999 cases. Columns: ressize, Predicted, Residuals, StandardPredicted, StandardResidual, StdErrorPredicted, MahalanobisDistance, DeletedResidual, CookDistance]
Tests for Normality
H0: the residuals follow a normal distribution
Ha: the residuals DO NOT follow a normal distribution
Predicted Values vs. Residuals
[Scatterplot: Predicted Values vs. Residuals, with 0.95 confidence interval]
Observed Values vs. Residuals
[Scatterplot: Observed Values vs. Residuals, dependent variable: ressize, with 0.95 confidence interval]
Independence
The residuals from the linear regression model must be statistically independent.
[Output: Durbin-Watson d and serial correlation of residuals (HHData)]
Durbin-Watson d estimate: 1.7966131
The accompanying serial correlation estimate is interpreted in the same way as Pearson's r, so there is a WEAK dependence among residuals.
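The Durbin-Watson statistic is simple to compute from the residual series: d = Σ(eₜ - eₜ₋₁)² / Σeₜ², ranging from 0 to 4, with d ≈ 2 indicating little serial correlation. A minimal Python sketch:

```python
def durbin_watson(residuals):
    """Durbin-Watson d: 0 to 4; d near 2 means little serial correlation.
    Values toward 0 suggest positive, toward 4 negative autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den
```

A perfectly alternating residual series pushes d toward 4, while a slowly drifting one pushes d toward 0.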
Multicollinearity
To check, we can:
1. Check correlations AMONG PREDICTORS
2. Check if variance inflation factor (VIF) > 10 and/or tolerance < 0.1
[Dialog: Multiple Regression Results (Step 3), Residuals/assumptions/prediction tab]
With n = 999, the cutoff is 4/n ≈ 0.004. Every observation with a Cook's distance greater than 0.004 can be considered an outlier.
Dummy Variables
Regression analysis is intended for predicting a QUANTITATIVE variable using a set of QUANTITATIVE predictors. Qualitative predictors must first be encoded as 0/1 dummy variables:
sex = 1 if male, 0 if female
mstatus becomes two dummy variables:
married = 1 if married, 0 if not
widowed = 1 if widowed, 0 if not
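The encoding above can be sketched in Python. One dummy is created per non-reference level; here "single" is an assumed reference level for illustration (the slides do not name one):

```python
def make_dummies(values, reference):
    """Encode a categorical column as 0/1 dummy columns, one per
    non-reference level.  The reference level is all zeros."""
    levels = [v for v in sorted(set(values)) if v != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

# mstatus becomes two 0/1 columns, married and widowed:
# make_dummies(["married", "single", "widowed", "married"], "single")
# → {"married": [1, 0, 0, 1], "widowed": [0, 0, 1, 0]}
```

Dropping the reference level avoids perfect multicollinearity between the dummies and the intercept.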
Example: Full Model with Dummy Variable sex
Since sex is already encoded as 1's and 0's, we can readily include it.
Regression Summary for Dependent Variable: ressize (HHData)
R = .84954541, R² = .72172741, Adjusted R² = .71976181, F(7,991) = 367.18, p < 0.0000, Std. Error of estimate: 5.4654

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(991)     p-value
Intercept                               -8.17730  1.599651        -5.11193   0.000000
sex         0.017021   0.019810         0.43977   0.511326        0.85921    0.390433
age         (values not legible in the source)
educyrs     0.012640   0.019692         0.06620   0.103130        0.64192    0.521075
distwork    -0.000403  0.017169         -0.00043  0.018263        -0.02345   0.981293
workyrs     0.077556   0.034619         0.09913   0.044250        2.24023    0.025294
annualinc   0.537227   0.023734         0.54357   0.024014        22.63511   0.000000
famsize     0.361365   0.018533         2.83532   0.145439        19.49332   0.000000
• A male household head is EXPECTED to have a ressize 0.4398 units BIGGER than a female household head (the reference, coded 0).
• You can apply everything discussed in this topic to this new model (i.e., model building and residual analysis).