
Chapter 4 Drawing Conclusions

STAT 3008 Applied Regression Analysis

Department of Statistics
The Chinese University of Hong Kong

2020/21 Term 1

Dr. LEE Pak Kuen, Philip

Chapter Outline
• Ch 2 & Ch 3: OLS Estimates, MLE, Properties, Confidence
Intervals, Hypothesis Testing
• Ch 4: Interpreting the results from the OLS estimates, including
Section 4.1: Understanding Parameter Estimates
- How we should interpret the parameters
Section 4.2: Experimental vs. Observational Explanatory Variables
- How the way the data were collected affects the inference
Section 4.3: Notes on R2
- Situations when R2 is useful/not useful

Section 4.1
Understanding Parameter Estimates

Interpretation of Beta
Multiple Linear Regression: E(Y | X) = β0 + β1x1 + ... + βpxp
Consider the fitted regression line for the Fuel data:
Ê(Fuel | X) = 154.19 - 4.23 Tax + 0.47 Dlic - 6.14 Income + 18.54 log2(Miles)
• βi is the rate of change of y on xi, after adjusting for the other
variables. I.e. unit of βi = unit of y / unit of xi.
E.g. Fuel decreases by 6.14 gallons when Income increases by $1000;
Fuel increases by 18.54 gallons when Miles is doubled (since log2(2x) = 1 + log2(x))
Issues of the above interpretation:
• The sign/magnitude of the estimates may not be consistent with
your intuition, since we assume the other terms stay unchanged
even though correlations exist between the terms.
• The value of a parameter estimate may change if the other terms
are replaced by linear combinations of the terms in the model.
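For reference, the fitted line above can be reproduced in R. This is a
minimal sketch, assuming the fuel2001 data frame from the alr3 package;
the variable constructions below (per-person fuel, per-1000-population
drivers) are assumptions about how the slide's variables were built:

library(alr3)                        # provides the fuel2001 data
f <- fuel2001
f$Fuel <- 1000 * f$FuelC / f$Pop     # fuel consumption per person (assumed)
f$Dlic <- 1000 * f$Drivers / f$Pop   # licensed drivers per 1000 population (assumed)
f$logMiles <- log2(f$Miles)          # base-2 log: +1 unit = road miles doubled
summary(lm(Fuel ~ Tax + Dlic + Income + logMiles, data = f))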
Example - Berkeley Guidance Study
Berkeley Guidance Study: We want to relate the growth of weights
to the somatotype (body type) of girls (n = 70).
Response: Somatotype (Y) at age 18, a scale of 1 to 7 that quantifies
the body shape of a person based on photos
(1 = very thin, 4 = average, 7 = obese)

Explanatory Variables:
1. WT2 = Weight (in kg) at Age 2
2. WT9 = Weight (in kg) at Age 9
3. WT18 = Weight (in kg) at Age 18

Data Info: http://people.reed.edu/~jones/141/Berkeley.html


Example - Berkeley Guidance Study
Data Set in R: BGSgirls.txt in the
“alr3” library
Scatterplot Matrix - Findings:
1. Relationships between different
pairs of variables are (i) positive
and (ii) quite linear
2. ρ(WT9,WT18) > ρ(WT2,WT9)
> ρ(WT2,WT18)
3. ρ(Soma,WT18) > ρ(Soma,WT9)
> ρ(Soma,WT2)
Question: Which terms would you
include in the regression model?
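These findings can be reproduced in R (a sketch, assuming the BGSgirls
data from the alr3 package, as used on the following slides):

library(alr3)
vars <- BGSgirls[, c("WT2", "WT9", "WT18", "Soma")]
pairs(vars)           # scatterplot matrix of the weights and somatotype
round(cor(vars), 3)   # pairwise sample correlations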
Example - Berkeley Guidance Study
• Model 1 (Baseline Model, 3 explanatory variables):
Soma  0  W T2WT2  W T9WT9  W T18WT18  e
Fitted Model 1: Ŝoma  1.59  0.116 WT2  0.056 WT9  0.048 WT18

Question: Since β̂WT2 = -0.116 < 0, does it imply that girls who are
heavy at Age 2 will be thinner later on?
Possible Explanation 1: βWT2 is NOT significantly different from 0
Hypotheses: H0: βWT2 = 0 vs H1: βWT2 ≠ 0
p-value = 0.06 > α = 0.05. We do not reject H0 at α = 0.05.
Conclusion: We do not have sufficient evidence that βWT2 is
different from 0
Possible Explanation 2**: Strong correlations between WT2, WT9
and WT18
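Explanation 1 can be checked directly in R (a sketch; the full code
appears on slide 13):

m1 <- lm(Soma ~ WT2 + WT9 + WT18, data = BGSgirls)   # requires library(alr3)
summary(m1)$coefficients["WT2", ]   # t-test of H0: beta_WT2 = 0 gives p ≈ 0.065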
Example - Berkeley Guidance Study
Model 2: Try to reduce/remove the correlations between terms
Define the new terms:
(1) DW9 = WT9 - WT2 (Weight gain from age 2 to 9)
(2) DW18 = WT18 - WT9 (Weight gain from age 9 to 18)
Fitted Model 2: Ŝoma = 1.59 - 0.011 WT2 + 0.105 DW9 + 0.048 DW18
Problem: Similar to before, but now β̂WT2 = -0.011, compared with -0.116 in Model 1
=> Try to use Hypothesis Testing to argue that βWT2 is zero.
H0: βWT2 = 0 vs H1: βWT2 ≠ 0
Decision: Since p-value = 0.42 > 0.05, we do not reject H0 at α =0.05.
Conclusion: We do not have sufficient evidence that βWT2 is different from 0

Final Model: Ŝoma = 1.59 + 0.105 DW9 + 0.048 DW18
Example - Berkeley Guidance Study
Final Model: Ŝoma = 1.59 + 0.105 DW9 + 0.048 DW18
Interpretation of the Final Model
1) For each kilogram gained in weight from age 2 to 9, somatotype
is expected to increase by 0.105, given that the weight gain from
age 9 to 18 remains unchanged.
2) For each kilogram gained in weight from age 9 to 18, somatotype
is expected to increase by 0.048, given that the weight gain from
age 2 to 9 remains unchanged.
Finding #1: If the terms are highly correlated, change of variables
(WT9 -> DW9 and WT18 -> DW18) would help to interpret the
model.
Fitted Model 1: Ŝoma = 1.59 - 0.116 WT2 + 0.056 WT9 + 0.048 WT18
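As a sketch, the final model can be refit directly in R (note: the slide
simply drops the WT2 term from Model 2, so the refit coefficients will
differ slightly from 1.59, 0.105 and 0.048):

library(alr3)
bg <- transform(BGSgirls, DW9 = WT9 - WT2, DW18 = WT18 - WT9)
coef(lm(Soma ~ DW9 + DW18, data = bg))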
Example - Berkeley Guidance Study
Final Model: Ŝoma = 1.59 + 0.105 DW9 + 0.048 DW18

Better scatterplot matrix after redefining the EVs?
[Slide shows the scatterplot matrix and correlation matrices for the
redefined terms (WT2, DW9, DW18) alongside Soma.]
OLS Estimates – Linear Transformation
Finding #2: Linear transformation of terms in the multiple linear
regression would not alter the least squares estimates:
β̂DW9 (Model 2) = β̂WT9 + β̂WT18 (Model 1),   β̂DW18 (Model 2) = β̂WT18 (Model 1)
Fitted Model 1: Ŝoma = 1.59 - 0.116 WT2 + 0.056 WT9 + 0.048 WT18
Fitted Model 2: Ŝoma = 1.59 - 0.011 WT2 + 0.105 DW9 + 0.048 DW18
Check: Start from the OLS estimates for Model 2:
Soma = 1.59 - 0.011 WT2 + 0.105 DW9 + 0.048 DW18
     = 1.59 - 0.011 WT2 + 0.105 (WT9 - WT2) + 0.048 (WT18 - WT9)
     = 1.59 - 0.116 WT2 + 0.056 WT9 + 0.048 WT18
which gives the same OLS estimates as in Model 1!!
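The equivalence can also be verified numerically: the two models produce
identical fitted values and residuals (a sketch, using model1 and model2
as defined in the R code on slide 13):

all.equal(fitted(model1), fitted(model2))   # TRUE: the same fitted regression
all.equal(resid(model1), resid(model2))     # TRUE: the same residuals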
Unidentifiable Model
Question: Can we include ALL the
five terms (WT2, WT9, WT18, DW9
and DW18) in the regression model?
Answer: No, because the model is
unidentifiable (definition next page)

Model 3: Soma = β0 + βWT2 WT2 + βWT9 WT9 + βWT18 WT18 + βDW9 DW9 + βDW18 DW18 + e


Linear transformation from Fitted Model 1:
Fitted Soma = 1.59 - 0.116 WT2 + 0.056 WT9 + 0.048 WT18
            = 1.59 - (0.116 - a) WT2 + (0.056 - a + b) WT9
              + (0.048 - b) WT18 + a DW9 + b DW18
This holds for ANY choice of a and b
=> Estimates not uniquely determined => Identification problem
Example - Berkeley Guidance Study (R Codes)
library(car); library(alr3)
y <- BGSgirls$Soma
WT2 <- BGSgirls$WT2; WT9 <- BGSgirls$WT9; WT18 <- BGSgirls$WT18
DW9 <- WT9 - WT2; DW18 <- WT18 - WT9
model1 <- lm(y ~ WT2 + WT9 + WT18)
model1
Coefficients:
(Intercept)      WT2      WT9     WT18
    1.59210 -0.11564  0.05625  0.04834
model2 <- lm(y ~ WT2 + DW9 + DW18)
model2
Coefficients:
(Intercept)      WT2      DW9     DW18
    1.59210 -0.01106  0.10459  0.04834
model3 <- lm(y ~ WT2 + WT9 + WT18 + DW9 + DW18)
model3
Coefficients:
(Intercept)      WT2      WT9     WT18    DW9   DW18
    1.59210 -0.11564  0.05625  0.04834     NA     NA
summary(model3)
Coefficients: (2 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.59210    0.67425   2.361  0.02117 *
WT2          -0.11564    0.06169  -1.874  0.06530 .
WT9           0.05625    0.02011   2.797  0.00675 **
WT18          0.04834    0.01060   4.559 2.28e-05 ***
DW9                NA         NA      NA       NA
DW18               NA         NA      NA       NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.543 on 66 degrees of freedom
Multiple R-squared: 0.5658, Adjusted R-squared: 0.5461
F-statistic: 28.67 on 3 and 66 DF, p-value: 5.497e-12
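R reports the aliased terms as NA. The alias() function (a standard lm
diagnostic, not shown on the slide) displays the exact linear
dependencies it found:

alias(model3)   # reports DW9 and DW18 as exact linear combinations of WT2, WT9, WT18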
Unidentifiable Model
Definition: A model is identifiable if the parameter values can be
uniquely determined from an infinite number of observations.
Definition: A model is unidentifiable if the parameter values
cannot be uniquely determined, even from an infinite number of observations.
STAT3008: Unidentifiable regression model
= the parameters cannot be expressed in a unique way

What happens to the OLS estimates when the regression model is unidentifiable?


Consider the X matrix with 5 terms (i.e. an n×6 matrix).
If the columns of X are linearly dependent:
=> X is not of full rank => X'X is not of full rank => det(X'X) = 0
=> β̂ = (X'X)⁻¹ X'Y does not exist; in fact, the normal equations
(X'X)β̂ = X'Y have infinitely many solutions
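A sketch of checking this rank deficiency by hand, using the variables
defined in the R code on slide 13:

X <- cbind(1, WT2, WT9, WT18, DW9, DW18)   # the n x 6 matrix of terms
qr(X)$rank                 # 4, not 6: the columns are linearly dependent
# solve(t(X) %*% X)        # would fail: X'X is (computationally) singular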
Aliased vs Multicollinearity
Definition: A term (or EV) is said to be aliased to other terms if it can be
expressed as a linear combination of those other terms.
E.g. DW9 is aliased to WT2 and WT9; DW18 is aliased to WT9 and WT18.
Existence of an aliased term => unidentifiable regression model
Definition: Multicollinearity is a statistical phenomenon in which two or
more terms in a multiple regression model are highly correlated.
• Multicollinearity => OLS estimates may change erratically in response to
small changes in the model or the data:
Suppose two or more terms are highly correlated
=> some columns of the matrix X are very similar
=> X'X is nearly singular, with det(X'X) very close to 0
=> det((X'X)⁻¹) is large => unstable β̂ = (X'X)⁻¹ X'Y
=> Var(β̂) = σ² (X'X)⁻¹ could be very large
• Presence of an aliased term => perfect multicollinearity
=> infinitely many solutions for the OLS estimates
R Code - Multicollinearity
library(car); library(alr3)
y <- BGSgirls$Soma; WT18 <- BGSgirls$WT18
model1 <- lm(y ~ WT18); summary(model1)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.790364   0.474865   1.664    0.101
WT18        0.066710   0.007862   8.485 2.88e-12 ***
Residual standard error: 0.5658 on 68 degrees of freedom
Multiple R-squared: 0.5143, Adjusted R-squared: 0.5071
F-statistic: 72 on 1 and 68 DF, p-value: 2.885e-12
Fitted model: Ŝoma = 0.7904 + 0.0667 WT18

# Synthetically create a weight at age 17.9 (WT179) which is almost identical to WT18
WT179 <- WT18; WT179[1] <- WT18[1] - 0.01
model2 <- lm(y ~ WT18 + WT179); summary(model2)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7704     0.4773   1.614    0.111
WT18         42.1192    57.2297   0.736    0.464
WT179       -42.0522    57.2294  -0.735    0.465
Residual standard error: 0.5677 on 67 degrees of freedom
Multiple R-squared: 0.5182, Adjusted R-squared: 0.5038
F-statistic: 36.02 on 2 and 67 DF, p-value: 2.384e-11
Fitted model: Ŝoma = 0.7704 + 42.1192 WT18 - 42.0522 WT179
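The instability can be quantified with vif() from the car package
(a sketch; a variance inflation factor near 1 is ideal, and the
near-duplicate WT179 makes the values here enormous):

library(car)
vif(model2)   # variance inflation factor for each term in model2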
Section Summary
1) Model Interpretation: If the terms are highly correlated, a change
of variables makes the parameter estimates easier to interpret.
2) A linear transformation of terms does not alter the OLS estimates.
3) Aliased Term: Presence of an aliased term => the regression model
becomes unidentifiable.
4) Multicollinearity: If some of the terms are highly correlated,
the OLS estimates become unstable, with huge variances.

Section 4.2
Experimental vs. Observational Explanatory
Variables

Experimental vs. Observational Explanatory Variables
Types of Explanatory Variables (EVs) in regression analysis:
• Experimental EVs: Values are under the control of the experimenter
and are assigned based on a randomization scheme
• Observational EVs: Values are observed via sampling and are beyond
the control of the experimenter
Example: Investigating the factors affecting crop yield
• Experimental EVs: Amount of fertilizer, water, spacing of
plants, etc.
• Observational EVs: Soil fertility, temperature, weather
Experimentation vs. Observation
Primary difference between the two types of EVs:
*** Different inferences can be made ***
Experimentation vs. Observation
Example: Does more mobile phone usage decrease brain activity?

Experimental Study
1. Find a group of people and randomly assign them into groups
(Randomization helps to average out the lurking variables)
2. For each group, force them to use a mobile phone for a different
amount of time (X) [could be unethical]
3. Measure their brain activities (Y)
4. Regress brain activities (Y) on time (X)
Possible Conclusion: More phone usage causes lower brain
activity (a causal conclusion)
Experimentation vs. Observation
Example: Does more mobile phone usage decrease brain activity?

Observational Study
1. Find a group of people via sampling (e.g. on the street
randomly?)
2. Measure their brain activities (Y) and mobile phone habits
(e.g. hours/week, average time to sleep, etc.)
3. Regress brain activities (Y) on time (X)
Possible Conclusion: More phone usage is associated with lower
brain activity (an associational conclusion, not a causal one)
Example – Cake (Ch6)
• Problem: To study the palatability
score of cake (i.e. acceptability in
terms of taste) by baking temperature
and baking time
• Variables:
Y = Score for the cake (the higher the better)
X1 = baking time (in minutes)
X2 = baking temperature (in °F)
• n = 14 observations: 6 cakes with x1 = 35 minutes and x2 = 350°F, and
the other 8 cakes with x1 and x2 scattered around (x1, x2) = (35, 350).
Experimental study:
1) Explanatory variables are controlled by the experimenter
2) Lurking variables (if any) are averaged out
3) Strong inference: baking time and temperature at the optimum
(x1*, x2*) will give the best cake.
Section 4.3
Notes on R2

Notes on R2
• R2 tends to be large if the X values are dispersed
• R2 tends to be small if the X values are concentrated
=> Need to be careful about sampling!
[Slide shows three scatterplots of similar data over differently
dispersed X ranges, with R2 = 0.241, 0.372 and 0.027.]
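A small simulation (a sketch) makes this concrete: the same mean
function and error variance produce very different R2 values depending
on how dispersed the x values are:

set.seed(1)
n <- 100
x.wide <- runif(n, 0, 10)                    # dispersed X
x.narrow <- runif(n, 4.5, 5.5)               # concentrated X
y.wide <- 1 + 0.5 * x.wide + rnorm(n)        # same mean function and error SD
y.narrow <- 1 + 0.5 * x.narrow + rnorm(n)
summary(lm(y.wide ~ x.wide))$r.squared       # large
summary(lm(y.narrow ~ x.narrow))$r.squared   # close to 0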
Notes on R2
Simple linear regression: R2 is a useful measure of goodness-of-fit
if and only if the scatterplot looks like a sample from a bivariate
normal distribution (elliptical bivariate pdf).

R2 is a useful goodness-of-fit measure when the scatterplot is
roughly elliptical.
R2 is NOT a useful goodness-of-fit measure in the presence of:
• a leverage point
• a non-linear mean function
• a lurking variable
[Slide shows example scatterplots for each case.]
Notes on R2
• Multiple linear regression: R2 is a useful goodness-of-fit
measure if the variables (terms and responses) follow a
multivariate normal distribution (i.e. an ellipsoidal joint pdf)
• This is very hard to justify from the data, especially when (1) the
number of variables is large but (2) the data are sparse
• Residual plots are extremely helpful: look for possible
non-null behavior in the residuals
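Such a residual plot takes one line in R (a sketch, using model1 from
slide 13):

plot(fitted(model1), resid(model1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # look for curvature or fanning around this line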
