STT153A Regression Analysis
Dataset
We are using a synthetic dataset on 999 household heads with the following variables: sex, mstatus, age, occupation, educyrs, distwork, workyrs, annualinc, famsize, and ressize.
Correlation Coefficient
The correlation coefficient r measures the linear relationship between two variables:
• -1 < r < 1
• (+) direct linear relationship
• (-) inverse linear relationship
• Correlation is NOT causation
The correlation can be visualized through scatterplots.
Interpretation Guideline
Rule of thumb:
|r| = 0.0 : no correlation
0.0 < |r| < 0.2 : very weak correlation
0.2 < |r| < 0.4 : weak correlation
0.4 < |r| < 0.6 : moderately strong correlation
0.6 < |r| < 0.8 : strong correlation
0.8 < |r| < 1.0 : very strong correlation
|r| = 1.0 : perfect correlation
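As a side note, both the coefficient and the rule of thumb above are easy to compute by hand. A minimal Python sketch (for illustration only; the course software computes this for you):

```python
def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def strength(r):
    """Classify |r| using the rule of thumb above."""
    a = abs(r)
    if a >= 1.0:
        return "perfect correlation"
    if a > 0.8:
        return "very strong correlation"
    if a > 0.6:
        return "strong correlation"
    if a > 0.4:
        return "moderately strong correlation"
    if a > 0.2:
        return "weak correlation"
    if a > 0.0:
        return "very weak correlation"
    return "no correlation"
```

Boundary values are assigned to the lower category here; the guideline itself leaves them ambiguous.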
Scatterplots
[Dialog: Graphs → 2D Scatterplots. Variables: X = annualinc, Y = ressize; graph type: Regular; fit: Linear; regression bands: Off]
Scatterplots
Scatterplot of ressize against annualinc
[Scatterplot with fitted line y = -0.667 + 0.768x; data: HHData, 11v * 999c]
The scatterplot shows a trend where household heads with high annualinc tend to have high ressize.
Simple Linear Regression Model
The idea is that we can explain the relationship of one predictor X and another variable Y through the equation
Yᵢ = β₀ + β₁Xᵢ + εᵢ
with fitted (estimated) model
Ŷ = b₀ + b₁X
Simple Linear Regression Model
[Dialog: regression results options, including ANOVA (overall goodness of fit), covariance of coefficients, stepwise regression summary, and regression coefficients]
Ŷ = -0.6670 + 0.7680X
Interpretation:
R² = 0.5761: 57.61% of the total variation in ressize can be explained by annualinc.
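The least-squares estimates and R² above can be checked by hand with the closed-form formulas. A hedged Python sketch (the tiny dataset is illustrative, not the HHData set):

```python
def fit_slr(x, y):
    """Least-squares estimates: intercept b0 and slope b1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return b0, b1

def r_squared(x, y, b0, b1):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(y) / len(y)
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Toy data: fit gives b0 = 1.1, b1 = 0.6, R² = 0.9
b0, b1 = fit_slr([0, 1, 2, 3], [1, 2, 2, 3])
```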
Multiple Regression
With a coefficient of determination of 57.61%, the model predicting ressize using annualinc can still be improved.
For the next example, let's use all possible predictors (the FULL MODEL).
Multiple Regression
[Dialog: variable selection. Variables 1-11: Subject, sex, mstatus, age, occupation, educyrs, distwork, workyrs, annualinc, famsize, ressize. Tip: use the "Show appropriate variables only" option to pre-screen variable lists and show categorical and continuous variables. Press F1 for more information.]
Multiple Regression
Regression Summary for Dependent Variable: ressize (HHData)
R = .84942340, R² = .72152011, Adjusted R² = .71983576, F(6,992) = 428.37, p < 0.0000, Std. Error of estimate: 5.4647

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(992)     p-value
Intercept                               -7.66840  1.485763        -5.16125   0.000000
age         0.098798   0.035011         0.09146   0.032410        2.82194    0.004869
educyrs     0.008312   0.01903          0.04353   0.099686        0.43671    0.662418
distwork    0.002302   0.016876         0.00245   0.017951        0.13639    0.891544
workyrs     0.082220   0.034186         0.10509   0.043697        2.40506    0.016352
annualinc   0.545621   0.021628         0.55206   0.021883        25.22781   0.000000
famsize     0.362981   0.018436         2.84849   0.144670        19.68960   0.000000
Why are some rows highlighted RED while others are BLACK?
Examples:
• age: p-value = 0.004869 < 0.05, so we reject H0; the coefficient of age is significant.
• Backward stepwise
Starting with all possible predictors (the full model), we delete "insignificant" predictors one at a time.
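The elimination loop itself can be sketched in a few lines of Python. This is only an illustration of the control flow: the `refit` callback and the p-value lookup table below are hypothetical stand-ins for actually re-estimating the model after each drop, which the statistics software does for you.

```python
def backward_eliminate(predictors, refit, alpha=0.05):
    """Repeatedly drop the predictor with the largest p-value until every
    remaining p-value is below alpha.  `refit` stands in for re-estimating
    the model: it maps a sorted tuple of predictor names to a dict
    {predictor: p-value}."""
    current = list(predictors)
    while current:
        pvalues = refit(tuple(sorted(current)))
        worst = max(pvalues, key=pvalues.get)
        if pvalues[worst] < alpha:
            break  # everything left is "significant"
        current.remove(worst)
    return current

# Hypothetical p-values standing in for actual refits at each step
fits = {
    ("age", "distwork", "educyrs"): {"age": 0.005, "distwork": 0.89, "educyrs": 0.66},
    ("age", "educyrs"): {"age": 0.004, "educyrs": 0.70},
    ("age",): {"age": 0.003},
}
```

With this table, `backward_eliminate(["age", "distwork", "educyrs"], lambda k: fits[k])` drops distwork, then educyrs, and keeps age.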
Variable Selection and Model Building
Regression Summary for Dependent Variable: ressize (HHData)
R = .84938976, R² = .72146297, Adjusted R² = .72034210, F(4,994) = 643.66, p < 0.0000, Std. Error of estimate: 5.4597

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(994)     p-value
Intercept                               -7.11224  0.828559        -8.58387   0.000000
annualinc   0.549812   0.019475         0.55630   0.019705        28.23104   0.000000
famsize     0.362171   0.018293         2.84214   0.143592        19.79321   0.000000
age         0.100584   0.034733         0.09311   0.032153        2.89592    0.003863
workyrs     0.079783   0.033652         0.10193   0.043014        2.37035    0.017937
Backward Stepwise
Regression Summary for Dependent Variable: ressize (HHData)
R = .84846207, R² = .71988789, Adjusted R² = .71904333, F(3,995) = 852.38, p < 0.0000, Std. Error of estimate: 5.4724

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(995)     p-value
Intercept                               -8.36767  0.638746        -13.1002   0.000000
age         0.170577   0.018340         0.15791   0.016973        9.3036     0.000000
annualinc   0.546926   0.019432         0.55333   0.019712        28.0723    0.000000
famsize     0.363613   0.018330         2.85349   0.143845        19.8373    0.000000
Adjusted R-Squared
A measure of goodness-of-fit like R-Squared that also captures model parsimony (simplicity) by penalizing the number of predictors.
Formula:
Adjusted R² = 1 - (1 - R²) × (n - 1) / (n - k - 1)
where n is the number of observations and k is the number of predictors.
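Plugging the full model's numbers into the formula above (R² = 0.72152011, n = 999, k = 6) reproduces the reported Adjusted R² of about 0.7198. A small Python helper:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1),
    where n = number of observations, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Full model from the regression summary: R² = 0.72152011, n = 999, k = 6
adj = adjusted_r_squared(0.72152011, 999, 6)  # ≈ 0.71984
```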
Comparing the candidate models: as expected, MODEL 2 has the highest R² since it has the most predictors.
However, MODEL 3 has the highest Adjusted R², making it the preferred model once parsimony is taken into account.
Residuals
e = Observed - Predicted
We can extract residuals from a chosen model.
Evidence for normality of residuals:
• Residual Histogram
• Normal Probability Plot
[Residual histogram with fitted normal curve. Would you say that the residuals (blue bars) follow a normal distribution (red curve)?]
Normal Probability Plot (Q-Q plot)
If the residuals (blue dots) follow the red line closely and consistently, then evidence points towards normality of residuals.
Tests for Normality
[Dialog: Residuals/Predicted tab, "Save residuals & predicted"]
First, save the residuals as a dataset.
[Spreadsheet: saved residuals for the 999 cases. Columns: ressize, Predicted, Residuals, StandardPredicted, StandardResidual, StdErrorPredicted, MahalanobisDistance, DeletedResidual, CookDistance]
Tests for Normality
H0: the residuals follow a normal distribution
Ha: the residuals DO NOT follow a normal distribution
Predicted Values vs. Residuals
[Scatterplot: Predicted Values vs. Residuals, with 0.95 confidence interval]
Observed Values vs. Residuals
[Scatterplot: Observed Values vs. Residuals, dependent variable: ressize, with 0.95 confidence interval]
Independence
The residuals from the linear regression model must be statistically independent.
[Output: Durbin-Watson d and serial correlation of residuals (HHData)]
Durbin-Watson d estimate: 1.7966131
The accompanying serial correlation estimate is interpreted in the same way as Pearson's r, so there is a WEAK dependence among residuals.
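The Durbin-Watson statistic is simple to compute from the residual series: d = Σ(eₜ - eₜ₋₁)² / Σeₜ², ranging from 0 to 4, with d ≈ 2 indicating little serial correlation. A minimal Python sketch:

```python
def durbin_watson(residuals):
    """Durbin-Watson d: 0 to 4; d near 2 means little serial correlation.
    Values toward 0 suggest positive, toward 4 negative autocorrelation."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den
```

A perfectly alternating residual series pushes d toward 4, while a slowly drifting one pushes d toward 0.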
Multicollinearity
To check, we can:
1. Check correlations AMONG PREDICTORS
2. Check if variance inflation factor (VIF) > 10 and/or tolerance < 0.1
[Dialog: Multiple Regression Results (Step 3), Residuals/assumptions/prediction tab]
With n = 999, the cutoff is 4/n ≈ 0.004. Every observation with a Cook's distance greater than 0.004 can be considered an outlier.
Dummy Variables
Regression analysis is intended for predicting a QUANTITATIVE variable using a set of QUANTITATIVE predictors. Qualitative predictors must first be encoded as 0/1 dummy variables:
sex = 1 if male, 0 if female
mstatus becomes two dummy variables:
married = 1 if married, 0 if not
widowed = 1 if widowed, 0 if not
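The encoding above can be sketched in Python. One dummy is created per non-reference level; here "single" is an assumed reference level for illustration (the slides do not name one):

```python
def make_dummies(values, reference):
    """Encode a categorical column as 0/1 dummy columns, one per
    non-reference level.  The reference level is all zeros."""
    levels = [v for v in sorted(set(values)) if v != reference]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

# mstatus becomes two 0/1 columns, married and widowed:
# make_dummies(["married", "single", "widowed", "married"], "single")
# → {"married": [1, 0, 0, 1], "widowed": [0, 0, 1, 0]}
```

Dropping the reference level avoids perfect multicollinearity between the dummies and the intercept.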
Example: Full Model with Dummy Variable sex
Since sex is already encoded as 1's and 0's, we can readily include it.
Regression Summary for Dependent Variable: ressize (HHData)
R = .84954541, R² = .72172741, Adjusted R² = .71976181, F(7,991) = 367.18, p < 0.0000, Std. Error of estimate: 5.4654

N=999       b*         Std.Err. of b*   b         Std.Err. of b   t(991)     p-value
Intercept                               -8.17730  1.599651        -5.11193   0.000000
sex         0.017021   0.019810         0.43977   0.511326        0.85921    0.390433
age         (values not legible in the source)
educyrs     0.012640   0.019692         0.06620   0.103130        0.64192    0.521075
distwork    -0.000403  0.017169         -0.00043  0.018263        -0.02345   0.981293
workyrs     0.077556   0.034619         0.09913   0.044250        2.24023    0.025294
annualinc   0.537227   0.023734         0.54357   0.024014        22.63511   0.000000
famsize     0.361365   0.018533         2.83532   0.145439        19.49332   0.000000
• A male household head is EXPECTED to have a ressize 0.4398 units BIGGER than a female household head (the reference, coded 0).
• You can apply everything discussed in this topic to this new model (i.e., model building and residual analysis).