Linear Regression


Regression Analysis

Correlation and Regression


Choosing a method for a continuous outcome variable (e.g. cognitive function):

Are the observations independent or correlated?

Independent observations:
▪ t-test: compares means between two independent groups
▪ ANOVA: compares means between more than two independent groups
▪ Pearson’s correlation coefficient (linear correlation): shows linear correlation between two continuous variables
▪ Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes

Correlated observations:
▪ Paired t-test
▪ Repeated-measures ANOVA
▪ Mixed models / GEE modeling

Alternatives if the normality assumption is violated (and the sample size is small):
▪ Non-parametric statistics
General Terms

❖ Variance
❖ Covariance
❖ Correlation
Covariance

▪ Measures the direction of the linear relationship between two variables.

▪ Ranges from −∞ to +∞.

▪ When the unit of measurement is changed for one or both of the two variables, the covariance value changes.

▪ cov(X,Y) > 0: X and Y are positively correlated

▪ cov(X,Y) < 0: X and Y are negatively (inversely) correlated

▪ cov(X,Y) = 0: X and Y are uncorrelated (no linear relationship); note that zero covariance does not by itself imply independence


Correlation
▪ Measures the relative strength of the linear relationship between two variables
▪ Unit-less
▪ Ranges between –1 and 1
▪ The closer to –1, the stronger the negative linear relationship
▪ The closer to 1, the stronger the positive linear relationship
▪ The closer to 0, the weaker any linear relationship
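
To make the distinction concrete, here is a small sketch (NumPy, with made-up numbers) that computes the sample covariance and the Pearson correlation, and shows that rescaling a variable changes the covariance but not the correlation:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])     # e.g. study hours
y = np.array([65.0, 70.0, 72.0, 80.0, 88.0])  # e.g. test score

cov_xy = np.cov(x, y)[0, 1]        # sample covariance (off-diagonal entry)
corr_xy = np.corrcoef(x, y)[0, 1]  # Pearson correlation, unit-less

# Changing the units of x (hours -> minutes) changes the covariance ...
cov_scaled = np.cov(x * 60, y)[0, 1]
# ... but leaves the correlation unchanged
corr_scaled = np.corrcoef(x * 60, y)[0, 1]

print(cov_xy, corr_xy)          # covariance in original units, correlation in [-1, 1]
print(cov_scaled, corr_scaled)  # covariance is 60x larger, correlation identical
```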
Simple vs. Multiple
Simple regression: β1 represents the unit change in Y per unit change in X. It does not take into account any other variable besides the single independent variable.
Multiple regression: βi represents the unit change in Y per unit change in Xi. It takes into account the effect of the other Xi's.
Assumptions
Linearity - the Y variable is linearly related to the value of the X variable.
Independence of Errors - the errors (residuals) are independent for each value of X.
Homoscedasticity - the variation around the line of regression is constant for all values of X.
Normality - the values of Y are normally distributed at each value of X.
Simple Regression
A statistical model that utilizes one
quantitative independent variable
“X” to predict the quantitative
dependent variable “Y.”
Multiple Regression
A statistical model that utilizes two
or more quantitative and qualitative
explanatory variables (x1,..., xp) to
predict a quantitative dependent
variable Y.
Caution: as a rule of thumb, include at least two or more quantitative explanatory variables.
Linear Model
Relationship between one dependent & two
or more independent variables is a linear
Function.
Y = β0 + β1X1 + β2X2 + ... + βPXP + ε

where β0 is the population Y-intercept, β1, ..., βP are the population slopes, ε is the random error, Y is the dependent (response) variable, and X1, ..., XP are the independent (explanatory) variables.
Linear Model
Linear regression is a method that summarizes how the average
values of a numerical outcome variable vary over subpopulations
defined by linear functions of predictors.

We shall fit a series of regressions predicting cognitive test scores of three- and four-year-old children given characteristics of their mothers.

We start by modeling the children’s test scores given an indicator for whether the mother graduated from high school (coded as 1) or not (coded as 0). The fitted model is

kid.score = 78 + 12 · mom.hs + error

or, in terms of predicted values,

kid.score = 78 + 12 · mom.hs
Linear Model
Here, kid.score denotes either predicted or expected test score
given the mom.hs predictor. This model summarizes the
difference in average test scores between the children of
mothers who completed high school and those with mothers
who did not.

The intercept, 78, is the average (or predicted) score for children
whose mothers did not complete high school. To see this
algebraically, consider that to obtain predicted scores for these
children we would just plug 0 into this equation. To obtain
average test scores for children (or the predicted score for a
single child) whose mothers were high school graduates, we
would just plug 1 into this equation to obtain 78 + 12 · 1 = 91.
Linear Model
The difference between these two subpopulation means is equal
to the coefficient on mom.hs. This coefficient tells us that
children of mothers who have completed high school score 12
points higher on average than children of mothers who have
not completed high school.
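
A minimal sketch of how a model like this could be fit in Python with statsmodels; the data frame and column names (kid_score, mom_hs) are assumptions for illustration, not the original data set:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame; the real data from the source are not reproduced here.
kidiq = pd.DataFrame({
    "kid_score": [65, 98, 85, 83, 115, 98, 69, 106],
    "mom_hs":    [0, 1, 1, 0, 1, 1, 0, 1],
})

fit = smf.ols("kid_score ~ mom_hs", data=kidiq).fit()
print(fit.params)  # Intercept: mean score when mom_hs = 0; mom_hs: difference in group means
```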
Linear Model
Regression with a continuous predictor

If we regress instead on a continuous predictor, mother’s score on an IQ test, the fitted model is
kid.score = 26 + 0.6 · mom.iq + error
If we compare average child test scores for subpopulations that
differ in maternal IQ by 1 point, we expect to see that the group
with higher maternal IQ achieves 0.6 points more on average.
Perhaps a more interesting comparison would be between groups
of children whose mothers’ IQ differed by 10 points—these
children would be expected to have scores that differed by 6
points on average.
Regression Model with
Multiple Predictors
Consider a linear regression predicting child test scores from
maternal education and maternal IQ.
The fitted model is
kid.score = 26 + 6 · mom.hs + 0.6 · mom.iq + error
1. The intercept. If a child had a mother with an IQ of 0 and
who did not complete high school (thus, mom.hs = 0), then we
would predict this child’s test score to be 26. This is not a
useful prediction, since no mothers have IQs of 0.
2. The coefficient of maternal high school completion.
Comparing children whose mothers have the same IQ, but who
differed in whether they completed high school, the model
predicts an expected difference of 6 in their test scores.
Regression Model with
Multiple Predictors
3. The coefficient of maternal IQ. Comparing children with the
same value of mom.hs, but whose mothers differ by 1 point in
IQ, we would expect to see a difference of 0.6 points in the
child’s test score (equivalently, a difference of 10 in mothers’
IQs corresponds to a difference of 6 points for their children).
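
Under the same assumptions (a small made-up data frame with kid_score, mom_hs, and mom_iq columns), the two-predictor model could be fit as follows:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data frame; the real data from the source are not reproduced here.
kidiq = pd.DataFrame({
    "kid_score": [65, 98, 85, 83, 115, 98, 69, 106],
    "mom_hs":    [0, 1, 1, 0, 1, 1, 0, 1],
    "mom_iq":    [92, 108, 101, 89, 121, 110, 84, 113],
})

fit2 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit()
print(fit2.params)

# The mom_hs coefficient: predicted difference between children whose mothers have
# the same IQ but differ in high-school completion.
# Ten points of maternal IQ correspond to 10 * (mom_iq coefficient) points of kid_score.
```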
Example

Data were collected on the depth of a dive of penguins and the duration of the
dive. The following linear model is a fairly good summary of the data, where t
is the duration of the dive in minutes and d is the depth of the dive in yards.

The equation for the model is d = 0.015 + 2.915t

Interpret the slope: if the duration of the dive increases by 1 minute, we predict the depth of the dive will increase by approximately 2.915 yards.

Interpret the intercept: if the duration of the dive is 0 minutes, then we predict the depth of the dive is 0.015 yards.

Comments: The interpretation of the intercept doesn’t make sense in the real
world. It isn’t reasonable for the duration of a dive to be near t = 0 , because
that’s too short for a dive. If data with x-values near zero wouldn’t make sense,
then usually the interpretation of the intercept won’t seem realistic in the real
world. It is, however, acceptable (even required) to interpret this as a coefficient
in the model.
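
A tiny worked check of the penguin model: the coefficients come from the equation above, while the function name and inputs are just for illustration:

```python
def dive_depth(t_minutes):
    """Predicted dive depth in yards for a dive lasting t_minutes, per the fitted model."""
    return 0.015 + 2.915 * t_minutes

print(dive_depth(1))   # about 2.93 yards predicted for a 1-minute dive
print(dive_depth(10))  # about 29.17 yards predicted for a 10-minute dive
```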
Interaction Regression Model
Hypothesizes interaction between pairs of X
variables
– Response to one X variable varies at different
levels of another X variable
Contains two-way cross product terms
Y = β0 + β1x1 + β2x2 + β3x1x2 + ε
Can be combined with other models
e.g. dummy variable models
Interaction Regression Model
We can include an interaction between mom.hs and mom.iq—that is, a new predictor defined as the product of these two variables. This allows the slope to vary across subgroups. The fitted model is

kid.score = −11 + 51 · mom.hs + 1.1 · mom.iq − 0.5 · mom.hs · mom.iq + error
Interaction Regression Model
1. The intercept represents the predicted test scores for
children whose mothers did not complete high school and
had IQs of 0—not a meaningful scenario.
2. The coefficient of mom.hs can be conceived as the
difference between the predicted test scores for children
whose mothers did not complete high school and had IQs of
0, and children whose mothers did complete high school
and had IQs of 0. You can see this by just plugging in the
appropriate numbers and comparing the equations. Since it
is implausible to imagine mothers with IQs of zero, this
coefficient is not easily interpretable.
Interaction Regression Model
3. The coefficient of mom.iq can be thought of as the comparison
of mean test scores across children whose mothers did not
complete high school, but whose mothers differ by 1 point in IQ.
This is the slope of the dark line in Figure 3.4.
4. The coefficient on the interaction term represents the
difference in the slope for mom.iq, comparing children with
mothers who did and did not complete high school.
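
Plugging the fitted coefficients from the text into a small helper makes the group-specific slopes explicit (the function name and the IQ value of 100 are illustrative choices, not from the source):

```python
# Coefficients of the fitted interaction model quoted in the text
b0, b_hs, b_iq, b_int = -11, 51, 1.1, -0.5

def predicted_score(mom_hs, mom_iq):
    return b0 + b_hs * mom_hs + b_iq * mom_iq + b_int * mom_hs * mom_iq

# Slope on mom.iq within each group
print(b_iq)          # 1.1 for children whose mothers did not complete high school
print(b_iq + b_int)  # 0.6 for children whose mothers completed high school

# Illustrative predictions at a maternal IQ of 100
print(predicted_score(0, 100), predicted_score(1, 100))  # 99.0 and 100.0
```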
Effect of Interaction
Given:
Yi = β0 + β1X1i + β2X2i + β3X1iX2i + εi

Without the interaction term, the effect of X1 on Y is measured by β1.
With the interaction term, the effect of X1 on Y is measured by β1 + β3X2.
– The effect increases as X2i increases (when β3 > 0).
Interaction Example
Y = 1 + 2X1 + 3X2 + 4X1X2

At X2 = 1: Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
At X2 = 0: Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1

The effect (slope) of X1 on Y does depend on the X2 value.
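
A short numerical check of the two lines above (plain Python, illustrative only):

```python
def y(x1, x2):
    return 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2

# Slope of X1 at X2 = 0 is 2; at X2 = 1 it is 2 + 4 = 6
print(y(1, 0) - y(0, 0))  # 2
print(y(1, 1) - y(0, 1))  # 6
```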
When should we look for interactions?

Interactions can be important. In practice, inputs that have large main effects also tend to have large interactions with other inputs (however, small main effects do not preclude the possibility of large interactions). For example, smoking has a huge effect on cancer. In epidemiological studies of other carcinogens, it is crucial to adjust for smoking both as a main effect and as an interaction. High levels of home radon exposure are associated with greater likelihood of cancer—but this difference is much greater for smokers than for nonsmokers.
Method of Least Squares
The straight line that best fits the data: determine the straight line for which the differences between the actual values (Y) and the values that would be predicted from the fitted line of regression (Ŷ) are as small as possible.
Method of Least Squares
This ‘line of best fit’ is found by ascertaining which line, of all
of the possible lines that could be drawn, results in the least
amount of difference between the observed data points and the
line.
Method of Least Squares
Our objective is to make use of the sample data on Y and X and obtain the “best” estimates of the population parameters.

The most commonly used procedure for regression analysis is called ordinary least squares (OLS).

The OLS procedure minimizes the sum of squared residuals. From the theoretical regression model

Yi = β0 + β1Xi + εi,

we want to obtain an estimated regression equation

Ŷi = β̂0 + β̂1Xi.
Method of Least Squares
OLS is a technique that is used to obtain β̂0 and β̂1. The OLS procedure minimizes

Σᵢ eᵢ² = Σᵢ (Yi − Ŷi)² = Σᵢ (Yi − β̂0 − β̂1Xi)²   (sums run over i = 1, ..., n)

with respect to β̂0 and β̂1. Solving the minimization problem results in the following expressions:

β̂1 = Σᵢ (Xi − X̄)(Yi − Ȳ) / Σᵢ (Xi − X̄)²  =  (Σᵢ XiYi − nX̄Ȳ) / (Σᵢ Xi² − nX̄²)

β̂0 = Ȳ − β̂1X̄

Notice that different datasets will produce different values for β̂0 and β̂1.
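
A short sketch (NumPy, made-up data) that computes β̂1 and β̂0 directly from these formulas and cross-checks them against NumPy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical sample data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates from the formulas above
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Cross-check with NumPy's built-in least-squares line fit
b1_np, b0_np = np.polyfit(x, y, deg=1)  # returns slope, then intercept

print(b0, b1)        # hand-computed intercept and slope
print(b0_np, b1_np)  # should match
```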


Coefficient of Multiple
Determination
R²Y.12...P

The proportion of the variation in Y that is explained by the set of explanatory variables selected.
Multiple Regression Equation
Ŷ = b0 + b1x1 + b2x2 + ... + bPxP
where:
b0 = Y-intercept (a constant value)
b1 = slope of Y with variable x1, holding the effects of variables x2, x3, ..., xP constant
bP = slope of Y with variable xP, holding all other variables’ effects constant
Mini-Case
Develop a model for estimating heating oil used for a single-family home in the month of January, based on average temperature and amount of insulation in inches.

Oil (Gal)   Temp (°F)   Insulation (in)
275.30      40           3
363.80      27           3
164.30      40          10
 40.80      73           6
 94.30      64           6
230.90      34           6
366.70       9           6
300.60       8          10
237.80      23          10
121.40      63           3
 31.40      65          10
203.50      41           6
441.10      21           3
323.00      38           3
 52.50      58          10
Multiple Regression Equation
[mini-case]
Dependent variable: Gallons Consumed
---------------------------------------------------------------------------
                              Standard         t
Parameter       Estimate      Error            Statistic      p-Value
---------------------------------------------------------------------------
CONSTANT        562.151       21.0931          26.6509        < 0.001
Insulation      -20.0123      2.34251          -8.54313       < 0.001
Temperature     -5.43658      0.336216         -16.1699       < 0.001
---------------------------------------------------------------------------
R-squared = 96.561 percent
R-squared (adjusted for d.f.) = 95.9879 percent
Standard Error of Est. = 26.0138
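
The fitted equation, R-squared, and the prediction shown on the following slides can be reproduced with a sketch like this (statsmodels is assumed; the data are the fifteen observations listed above):

```python
import pandas as pd
import statsmodels.formula.api as smf

heating = pd.DataFrame({
    "oil":        [275.3, 363.8, 164.3, 40.8, 94.3, 230.9, 366.7, 300.6,
                   237.8, 121.4, 31.4, 203.5, 441.1, 323.0, 52.5],
    "temp":       [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
    "insulation": [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
})

fit = smf.ols("oil ~ temp + insulation", data=heating).fit()
print(fit.params)    # should be close to the output above (about 562.15, -5.44, -20.01)
print(fit.rsquared)  # about 0.9656

# Predicted consumption at 15 deg F with 10 inches of insulation
new_home = pd.DataFrame({"temp": [15], "insulation": [10]})
print(fit.predict(new_home))  # about 280 gallons
```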
Multiple Regression Equation
[mini-case]

Y-hat = 562.15 - 5.44x1 - 20.01x2

where: x1 = temperature [degrees F]
       x2 = attic insulation [inches]
Multiple Regression Equation
[mini-case]

Y-hat = 562.15 - 5.44x1 - 20.01x2


thus:
For a home with zero inches of attic insulation and an outside temperature of 0 °F, 562.15 gallons of heating oil would be consumed.
Multiple Regression Equation
[mini-case]

Y-hat = 562.15 - 5.44x1 - 20.01x2


For a home with zero attic insulation and an outside temperature
of zero, 562.15 gallons of heating oil would be consumed. [
caution .. data boundaries .. extrapolation ]

For each one-degree (°F) increase in temperature, at a given amount of attic insulation, heating oil consumption drops by 5.44 gallons.
Multiple Regression Equation
[mini-case]

Y-hat = 562.15 - 5.44x1 - 20.01x2


For a home with zero attic insulation and an outside temperature of zero,
562 gallons of heating oil would be consumed. [ caution … ]
For each one-degree (°F) increase in temperature, at a given amount of attic insulation, heating oil consumption drops by 5.44 gallons.

For each additional inch of attic insulation, at a given temperature, heating oil consumption drops by 20.01 gallons.
Multiple Regression Prediction
[mini-case]

Y-hat = 562.15 - 5.44x1 - 20.01x2


with x1 = 15 °F and x2 = 10 inches

Y-hat = 562.15 - 5.44(15) - 20.01(10)


= 280.45 gallons consumed
Coefficient of Multiple Determination
[mini-case]

R²Y.12 = 0.9656

96.56 percent of the variation in heating oil consumption can be explained by the variation in temperature and insulation.
Coefficient of Multiple Determination
Adjusted

Proportion of variation in Y ‘explained’ by all X variables taken together
Reflects:
– Sample size
– Number of independent variables
Smaller [more conservative] than R²Y.12
Used to compare models
Coefficient of Multiple Determination
(adjusted)

R²(adj) Y.12...P

The proportion of the variation in Y that is explained by the set of independent [explanatory] variables selected, adjusted for the number of independent variables and the sample size.
Multicollinearity
High correlation between X variables
Coefficients measure combined effect
Leads to unstable coefficients depending on
X variables in model
Always exists; matter of degree
Example: Using both total number of rooms
and number of bedrooms as explanatory
variables in same model
Detecting Multicollinearity
Examine correlation matrix
– Correlations between pairs of X variables are higher than their correlations with the Y variable
Few remedies
– Obtain new sample data
– Eliminate one correlated X variable
Multicollinearity

A VIF of 1 means that there is no correlation between the jth predictor and the remaining predictor variables, and hence the variance of bj is not inflated at all. The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
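
A sketch of how VIFs could be computed with statsmodels, using the temperature and insulation predictors from the mini-case as an illustration:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictor values from the heating-oil mini-case
X = pd.DataFrame({
    "temp":       [40, 27, 40, 73, 64, 34, 9, 8, 23, 63, 65, 41, 21, 38, 58],
    "insulation": [3, 3, 10, 6, 6, 6, 6, 10, 10, 3, 10, 6, 3, 3, 10],
})
X = sm.add_constant(X)

# One VIF per column; the constant's VIF is usually ignored
vifs = {col: variance_inflation_factor(X.values, i) for i, col in enumerate(X.columns)}
print(vifs)  # VIFs near 1 suggest little collinearity between temp and insulation
```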
Dummy-Variable Regression Model

Involves a categorical X variable with two levels
– e.g., female-male, employed-not employed, etc.
Variable levels coded 0 & 1
Dummy Variables

Permits use of As part of Diagnostic


qualitative data Checking;
(e.g.: seasonal, class incorporate outliers
standing, location, (i.e.: large residuals)
gender). and influence
measures.
0, 1 coding
(nominative data)
Dummy Variable Regression
Simplest case: Additive Model (no interaction)
• Example
  - Dependent variable: income
  - One quantitative independent variable: education
  - One dichotomous (can take two values) independent variable: gender
• Assume the effect of either independent variable is the same, regardless of the value of the other variable (additivity, parallel regression lines) – see the ANOVA slides about interactions.
• Usual assumptions on the statistical errors: independent, zero means, constant variance, normally distributed, fixed X's or X independent of the statistical errors.
Dummy Variable Regression

Introducing a Dummy Regressor

One way of formulating the common-slope model is

Yi = α + β xi + γ Di + εi

where Di, called a dummy-variable regressor or an indicator variable, is coded 1 for men and 0 for women:

Di = 1 for men, 0 for women

Thus, for women the model becomes

Yi = α + β xi + γ(0) + εi = α + β xi + εi

and for men

Yi = α + β xi + γ(1) + εi = (α + γ) + β xi + εi

These regression equations are graphed in the following figure.
Dummy Variable Regression

(Figure: parameters in the additive dummy-regression model (additive: no interactions, that is, parallel regression lines). The D = 0 line has intercept α, the D = 1 line has intercept α + γ, and both lines share the common slope β.)
Dummy Variable Regression

Interpretation of parameters in the additive dummy-regression model Yi = α + β xi + γ Di + εi:

• γ gives the difference in intercepts for the two regression lines.
• Because these regression lines are parallel, γ represents the constant separation between the lines — the expected income advantage accruing to men when education is held constant.
• If men were disadvantaged relative to women, then γ would be negative.
• α gives the intercept for women, for whom D = 0.
• β is the common within-gender education slope.

(Figure: the two parallel regression lines, with the D = 1 line lying γ above the D = 0 line.)
Dummy Variable Regression: Example

Predicting Scholastic Aptitude Test scores from PISA scores and gender (males: 0, females: 1)

Coefficients (a. Dependent Variable: SAT Score)
Model             B        Std. Error   Beta     t        Sig.
1  (Constant)     37.326   3.743                 9.972    .000
   PISA Score     .216     .021         .397     10.413   .000
   Gender         9.971    .440         .863     22.647   .000

α̂ = 37.326
γ̂ = 9.971
α̂ + γ̂ = 37.326 + 9.971 = 47.297
β̂ = .216

Both main effects are significant.
Regression Equations for Males and Females

Coefficients (a. Dependent Variable: SAT Score)
Model             B        Std. Error   Beta     t        Sig.
1  (Constant)     37.326   3.743                 9.972    .000
   PISA Score     .216     .021         .397     10.413   .000
   Gender         9.971    .440         .863     22.647   .000

• Model equation for males (0): Yi = α + β xi + γ(0) + εi = α + β xi + εi
• Model equation for females (1): Yi = α + γ + β xi + εi

• Regression equation for males: Ŷi = 37.326 + .216 xi
• Regression equation for females: Ŷi = 47.297 + .216 xi
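
A sketch of fitting this kind of dummy-variable model in Python; the data and the column names sat, pisa, and gender are made up for illustration and will not reproduce the coefficients above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: gender coded 0 = male, 1 = female, as in the example
df = pd.DataFrame({
    "sat":    [55, 62, 70, 78, 60, 68, 75, 84],
    "pisa":   [80, 110, 150, 190, 85, 115, 155, 195],
    "gender": [0, 0, 0, 0, 1, 1, 1, 1],
})

fit = smf.ols("sat ~ pisa + gender", data=df).fit()
print(fit.params)

# Implied parallel-line equations for the two groups
a, b, g = fit.params["Intercept"], fit.params["pisa"], fit.params["gender"]
print(f"males:   sat-hat = {a:.3f} + {b:.3f} * pisa")
print(f"females: sat-hat = {a + g:.3f} + {b:.3f} * pisa")
```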
Regression Equations for Males and Females
Coefficients (a. Dependent Variable: SAT Score)
Model             B        Std. Error   Beta     t        Sig.
1  (Constant)     37.326   3.743                 9.972    .000
   PISA Score     .216     .021         .397     10.413   .000
   Gender         9.971    .440         .863     22.647   .000

• γ gives the difference in intercepts for the two regression lines.
• Because these regression lines are parallel, γ also represents the constant separation between the lines: the expected SAT score advantage of females when PISA test performance is held constant.
• α gives the intercept for males, for whom D = 0: the expected/average SAT performance of males at X = 0.
• β is the common within-gender slope: a one-unit increase in the PISA score is, on average, associated with a .216-unit increase in SAT scores.
Dummy Variable Regression: Interactions

Modeling Interactions
•Two explanatory variables interact in determining a response variable when
the partial effect of one depends on the value of the other

•So far we only studied models without interaction (additive model)

•Interaction between two variables means that the effect of one of the
variables depends on the value of the other variable
- Example: the effect of income (average income of incumbents in
dollars) on prestige (percentage score) is bigger for men than for women

Dummy Variable Regression: Interactions

Dummy variable interaction model illustrated I

(Figure: prestige (Y) plotted against income (X), with one line for D = 1 (males), intercept α + γ and slope β + δ, and one line for D = 0 (females), intercept α and slope β.)
• The within-gender regressions of prestige on income are not parallel — the slope for men is larger than the slope for women.
• Because the effect of income varies by gender, income and gender interact in affecting prestige.
• Interaction refers to the manner in which explanatory variables combine to affect a response variable, not to the relationship between the explanatory variables themselves.
Dummy Variable Regression: Interactions

Dummy variable interaction model illustrated II

(Figure: the same two lines as above: D = 1 (males) with intercept α + γ and slope β + δ, and D = 0 (females) with intercept α and slope β, plotted as prestige (Y) against income (X).)

• α and β are the intercept and slope for the regression of prestige on income among women
• γ gives the difference in intercepts between the male and female groups
• δ gives the difference in slopes between the two groups
Dummy Variable Regression: Interactions

Dummy variable interaction model illustrated III

(Figure: prestige (Y) versus income (X) once more, with the D = 1 (males) line having intercept α + γ and slope β + δ, and the D = 0 (females) line having intercept α and slope β.)

• α and α + γ are the expected prestige for females and males, respectively, at X = 0
• γ is thus the prestige difference at X = 0
• β is the constant increase in prestige per unit of income for females
• β + δ is the constant increase in prestige per unit of income for males
Dummy Variable Regression: Interactions

Constructing regressors
• Y = prestige, X = income, D = dummy for gender (0 or 1)
• The model then becomes Yi = α + β xi + γ Di + δ (xi Di) + εi
• Note that xi Di is a new regressor (the interaction/moderation term)
  - Note: the variable xi Di must be calculated before the analysis is run
• Regression equation for females (Di = 0):
  Yi = α + β xi + γ · 0 + δ (xi · 0) + εi = α + β xi + εi
• Regression equation for males (Di = 1):
  Yi = α + β xi + γ · 1 + δ (xi · 1) + εi = (α + γ) + (β + δ) xi + εi
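
A sketch of constructing the product regressor explicitly and fitting the interaction model (made-up prestige/income data; with a formula interface the interaction could also be written directly as income * gender):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: D = 1 for males, 0 for females, as in the slides
df = pd.DataFrame({
    "prestige": [30, 42, 55, 68, 28, 36, 44, 52],
    "income":   [20000, 40000, 60000, 80000, 20000, 40000, 60000, 80000],
    "gender":   [1, 1, 1, 1, 0, 0, 0, 0],
})

# Calculate the interaction regressor before running the analysis
df["income_x_gender"] = df["income"] * df["gender"]

fit = smf.ols("prestige ~ income + gender + income_x_gender", data=df).fit()
print(fit.params)

# Female slope is the income coefficient; the male slope adds the interaction coefficient
b, d = fit.params["income"], fit.params["income_x_gender"]
print("female slope:", b, " male slope:", b + d)
```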
Dummy Variable Regression: Interactions

Testing
• Testing for interaction is testing for a difference in slope between males and females → δ is the slope difference

• Statistical hypotheses:
  H0: δ = 0
  H1: δ ≠ 0
Dummy Variable Regression: Interactions

Example
                 B        SE      t       p-value
Intercept        12.113   4.263   2.841   .005
Gender           7.546    6.453   1.169   .244
Income           0.001    0.00    8.927   .000
Gender*Income    0.0004   0.00    3.536   .001

α̂ = 12.113
γ̂ = 7.546
α̂ + γ̂ = 12.113 + 7.546 = 19.66
β̂ = 0.001
δ̂ = 0.00044
β̂ + δ̂ = 0.001 + 0.00044 = 0.00144

→ For each one-unit increase in income, the expected prestige difference between males and females increases by about 0.00044 percentage points, on average; equivalently, the male slope (0.00144) exceeds the female slope (0.001) by δ̂.
Dummy Variable Regression
Dummy variables are built by choosing a reference level and creating a 0/1 indicator for each remaining level of the categorical variable, as in the sketch below:
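
A minimal sketch of building dummy variables with pandas (hypothetical data; pd.get_dummies with drop_first=True keeps one level as the reference category):

```python
import pandas as pd

# Hypothetical categorical variable
df = pd.DataFrame({"gender": ["male", "female", "female", "male"],
                   "income": [42000, 39000, 51000, 47000]})

# Create 0/1 indicator columns, dropping one level as the reference category
dummies = pd.get_dummies(df["gender"], prefix="gender", drop_first=True, dtype=int)
df = pd.concat([df, dummies], axis=1)
print(df)  # gender_male column: 1 for males, 0 for females (female is the reference level)
```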
