Applying Regression Analysis
Jean-Philippe Gauvin
Université de Montréal
January 7, 2016
Goals for Today
[Figure: scatter of Gauche-Droite Sociale (social left-right scale, -1 to .5) against Age (20-100)]
The Basic Linear Model (OLS)
[Figure: the same scatter of Gauche-Droite Sociale against Age]
Why a Line?
Yi = α + βXi + εi

Where:
- Yi is the dependent variable
- α is the intercept
- β is the slope
- Xi is the predictor
- εi is the error term
The Components of OLS
Ŷi = α + βXi

Where is the error term? OLS aims to keep the residuals as small as possible. In other words,

Yi = Ŷi + εi
εi = Yi − Ŷi
RSS = Σ εi² = Σ (Yi − Ŷi)²

β̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²

α̂ = Ȳ − β̂X̄

     +---------------------------+
     |   X    Y    Xbar    Ybar  |
     |---------------------------|
  1. |   2   10   4.375  12.625  |
  2. |   5   15   4.375  12.625  |
  3. |   6   16   4.375  12.625  |
  4. |   4   12   4.375  12.625  |
  5. |   8   14   4.375  12.625  |
     |---------------------------|
  6. |   1   12   4.375  12.625  |
  7. |   4   12   4.375  12.625  |
  8. |   5   10   4.375  12.625  |
     +---------------------------+
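The estimator formulas above can be checked by hand on the slide's eight observations; a minimal Python sketch (the course itself uses Stata and R):

```python
# Illustrative check of the OLS formulas using the table's 8 observations.
X = [2, 5, 6, 4, 8, 1, 4, 5]
Y = [10, 15, 16, 12, 14, 12, 12, 10]

xbar = sum(X) / len(X)  # 4.375, as in the Xbar column
ybar = sum(Y) / len(Y)  # 12.625, as in the Ybar column

# beta-hat = sum[(Xi - Xbar)(Yi - Ybar)] / sum[(Xi - Xbar)^2]
num = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
den = sum((x - xbar) ** 2 for x in X)
beta = num / den

# alpha-hat = Ybar - beta-hat * Xbar
alpha = ybar - beta * xbar

print(round(beta, 4), round(alpha, 3))  # 0.5646 10.155
```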
Estimating the Regression Line
[Figure: fitted regression line over the Gauche-Droite Sociale vs. Age scatter, with alpha = -.45 and beta = .004]
The Assumptions of OLS
[Figure: scatter of Feelings about Stephen Harper (0-100) against left/right self-placement (0-10)]
Example 1: The Bivariate Regression
Stata:
twoway (scatter harper leftright, jitter(20)) (lfit harper leftright)
[Figure: jittered scatter of harper against leftright with the linear fit]
Example 1: Stata
------------------------------------------------------------------------------
harper | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
leftright | 7.452856 .3677241 20.27 0.000 6.731535 8.174178
_cons | 5.717891 1.995533 2.87 0.004 1.803484 9.632297
------------------------------------------------------------------------------
F̂H = 5.71 + 7.45(LR)
Example 1: R
Call:
lm(formula = harper ~ leftright)
Residuals:
Min 1Q Median 3Q Max
-79.246 -22.982 2.112 21.924 79.376
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.7179 1.9955 2.865 0.00423 **
leftright 7.4529 0.3677 20.268 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The Multiple Regression Equation
Yi = β0 + β1X1 + β2X2 + ... + βnXn + εi

FH = β0 + β1LR + β2Age + ε
Example 2. The Multiple Regression
[Figure: two panels plotting harper (0-100) against ideology (B = 7.45) and against age (B = 0.12)]
Example 2. Stata
---------------------------------------------------------------------------
harper | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------+----------------------------------------------------------------
leftright | 7.488689 .3689223 20.30 0.000 6.765005 8.212373
age | .0417154 .0500968 0.83 0.405 -.0565554 .1399862
_cons | 2.990736 3.432887 0.87 0.384 -3.74327 9.724742
---------------------------------------------------------------------------
Example 2. R
> m2 <- lm(harper ~ leftright + age, data=data)
> summary(m2)
Call:
lm(formula = harper ~ leftright + age, data = data)
Residuals:
Min 1Q Median 3Q Max
-79.339 -21.988 2.211 21.543 79.612
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.99074 3.43289 0.871 0.384
leftright 7.48869 0.36892 20.299 <2e-16 ***
age 0.04172 0.05010 0.833 0.405
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Predicting Ŷ ?
F̂H = 2.99 + 7.49(LR) + .04(Age)
    = 2.99 + 7.49(1) + .04(50)
    = 12.48
. sum yhat if leftright==1 & age==50 /*lucky we have one obs = 12.56*/
------------------------------------------------------------------------------
| Delta-method
| Margin Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 12.5652 1.691971 7.43 0.000 9.246198 15.88419
------------------------------------------------------------------------------
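The plug-in prediction above is simple arithmetic; a small Python sketch (coefficients rounded as on the slide, which is why it gives 12.48 rather than the 12.57 that margins computes from the unrounded coefficients):

```python
# Plugging values into the fitted equation from Example 2:
# FH-hat = 2.99 + 7.49*LR + 0.04*Age  (rounded coefficients)
def predict_fh(lr, age):
    return 2.99 + 7.49 * lr + 0.04 * age

yhat = predict_fh(1, 50)
print(round(yhat, 2))  # 12.48
```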
Predicting Ŷ in R
Adding a Binary Variable

Y = a + b1*LR + b2*Gender + e

But what does the coefficient mean? Being male, rather than female, decreases feelings for Harper by 0.62 (not statistically significant).
[Figure: linear predictions over the left/right scale by gender; the female intercept is b0, the male intercept is b0 + b2, and b2 is the vertical gap between the two lines]
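The intercept-shift logic can be made concrete; in this Python sketch, b0 and b1 are hypothetical values, while b2 = -0.62 is the estimate quoted on the slide:

```python
# A dummy shifts the intercept: with Y = b0 + b1*LR + b2*Male,
# male and female predictions at the same LR differ by exactly b2.
b0, b1 = 30.0, 7.45   # hypothetical intercept and slope
b2 = -0.62            # gender coefficient from the slide

def yhat(lr, male):
    return b0 + b1 * lr + b2 * male

gap = yhat(5, male=1) - yhat(5, male=0)
print(gap)  # equals b2: -0.62
```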
Ex 4. Categorical Variables in Stata
In Stata, you can use tab varname, gen(newvar) to automatically create dummies, or simply use the factor-variable prefix i.varname, or even specify the baseline with the b1., b2., b3. prefixes.
. reg harper i.votechoice /*liberal is baseline*/
------------------------------------------------------------------------------
harper | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
votechoice |
Tories | 45.08578 1.168005 38.60 0.000 42.79546 47.3761
NDP | .1385281 1.388459 0.10 0.921 -2.584075 2.861131
BQ | -3.868295 1.912745 -2.02 0.043 -7.61896 -.1176298
Greens | -1.171874 2.515343 -0.47 0.641 -6.104162 3.760414
_cons | 29.37576 .9241736 31.79 0.000 27.56356 31.18795
------------------------------------------------------------------------------
Ex 4. Categorical Variables in R
Call:
lm(formula = harper ~ votechoice, data = data)
Residuals:
    Min      1Q  Median      3Q     Max
-74.462 -19.462   0.538  15.538  70.624
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)       29.3758     0.9242  31.786   <2e-16 ***
votechoiceTories  45.0858     1.1680  38.601   <2e-16 ***
votechoiceNDP      0.1385     1.3885   0.100   0.9205
votechoiceBQ      -3.8683     1.9127  -2.022   0.0432 *
votechoiceGreens  -1.1719     2.5153  -0.466   0.6413
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
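Since the model contains only dummies, the fitted values are just group means: the intercept is the baseline (Liberal) mean and each coefficient is a deviation from it. A quick Python check using the coefficients above:

```python
# With only dummies on the right-hand side, fitted values are group means.
intercept = 29.3758  # mean harper feeling among Liberal voters (baseline)
coef = {"Tories": 45.0858, "NDP": 0.1385, "BQ": -3.8683, "Greens": -1.1719}

# Each party's mean = baseline mean + that party's coefficient.
group_means = {party: intercept + b for party, b in coef.items()}
print(round(group_means["Tories"], 4))  # 74.4616
```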
[Figure: two panels of predicted writing score against social studies score, by gender (male vs. female)]
Example 5. Stata
-------------------------------------------------------------
write | Coef. Std. Err. t P>|t|
----------------+--------------------------------------------
female | 15.00001 5.09795 2.94 0.004
socst | .6247968 .0670709 9.32 0.000
|
female#c.socst | -.2047288 .0953726 -2.15 0.033
|
_cons | 17.7619 3.554993 5.00 0.000
-------------------------------------------------------------
Example 5. R
In R, categorical variables are already defined as factors (or not), so interactions are handled automatically.
> m5 <- lm(write~ female*socst, data=data2)
> summary(m5)
Call:
lm(formula = write ~ female * socst, data = data2)
Residuals:
Min 1Q Median 3Q Max
-18.6265 -4.3108 -0.0645 5.0429 16.4974
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.76190 3.55499 4.996 1.29e-06 ***
femalefemale 15.00001 5.09795 2.942 0.00365 **
socst 0.62480 0.06707 9.315 < 2e-16 ***
femalefemale:socst -0.20473 0.09537 -2.147 0.03305 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
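With the interaction, the socst slope is group-specific: b2 for males (the baseline) and b2 + b3 for females. A quick Python check with the coefficients above:

```python
# write = b0 + b1*female + b2*socst + b3*female*socst
# Coefficients from the interaction model output:
b0, b1, b2, b3 = 17.7619, 15.00001, 0.6247968, -0.2047288

male_slope = b2          # socst slope for males (baseline group)
female_slope = b2 + b3   # socst slope for females
print(round(male_slope, 3), round(female_slope, 3))  # 0.625 0.42
```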
[Figure: residuals plotted against fitted values]
Example 6. Normality of Residuals in Stata
You can plot the residuals against a theoretical normal distribution.
[Figure: standardized normal probability (P-P) plot of the residuals, Normal F[(r-m)/s] against Empirical P[i] = i/(N+1)]
[Figure: QQ plot of studentized residuals against t quantiles]
Example 6. Outlier Plot in Stata
You can identify the outliers by graphing a leverage plot with
lvr2plot
[Figure: leverage plot from lvr2plot]
[Figure: influence plot of studentized residuals, flagging observations 3810 and 3084]
Yi = β0 + β1X1 + β2X2 + ui

[Figure: residuals plotted against fitted values]
Maximum Likelihood Estimation
The world of MLE is the world of frequentist probability. Formally,
a probability is given by:
Pr (Y |M) = Pr (Data|Model)
Ideally, we would compute the inverse probability Pr (Model|Data),
but this is impossible. Luckily, the likelihood function helps us a
lot.
f(Y1, Y2, ..., Yn | θ) = ∏ f(Yi | θ) = L(θ | Y)   (product over i = 1, ..., N)

p(Y | θ) = L(θ | Y)
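The product form of the likelihood can be illustrated with the simplest possible case, a Bernoulli sample. This Python sketch (with hypothetical 0/1 data) maximizes the log likelihood over a grid and recovers the sample proportion:

```python
import math

# L(theta | Y) = prod f(Yi | theta); maximizing log L over a grid
# recovers the sample proportion for Bernoulli data.
Y = [1, 0, 1, 1, 0, 1, 1, 1]  # hypothetical 0/1 outcomes

def log_lik(theta, data):
    return sum(math.log(theta if y == 1 else 1 - theta) for y in data)

grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_lik(t, Y))
print(theta_hat)  # 0.75, the sample mean
```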
[Figure: linear prediction (left) and Pr(Party2) (right) plotted against Feelings about Stephen Harper, from "really dislike" to "really like"]
Example 7. Logit in Stata
Logistic regression models the log odds of the outcome, estimated by maximizing the log likelihood. Raw coefficients are thus hard to interpret.
. logit party2 harper leftright i.male age
------------------------------------------------------------------------------
party2 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
harper | .068045 .0046875 14.52 0.000 .0588576 .0772324
leftright | .5628441 .0684852 8.22 0.000 .4286156 .6970726
|
male |
Male | .1405912 .2059007 0.68 0.495 -.2629667 .5441492
age | .0091524 .0070297 1.30 0.193 -.0046257 .0229304
_cons | -7.549818 .6537195 -11.55 0.000 -8.831084 -6.268551
------------------------------------------------------------------------------
Example 7. Logit in Stata
------------------------------------------------------------------------------
party2 | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
harper | 1.070413 .0050176 14.52 0.000 1.060624 1.080293
leftright | 1.755659 .1202366 8.22 0.000 1.535131 2.007866
|
male |
Male | 1.150954 .2369822 0.68 0.495 .7687675 1.723142
age | 1.009194 .0070944 1.30 0.193 .995385 1.023195
_cons | .0005262 .000344 -11.55 0.000 .0001461 .001895
------------------------------------------------------------------------------
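The odds ratios in this table are just the exponentiated coefficients from the previous slide, which is easy to verify; a Python check:

```python
import math

# Odds ratios are exponentiated logit coefficients: OR = exp(b).
# Coefficients from the logit output above.
coefs = {"harper": 0.068045, "leftright": 0.5628441,
         "Male": 0.1405912, "age": 0.0091524}

odds_ratios = {k: math.exp(b) for k, b in coefs.items()}
print(round(odds_ratios["harper"], 4))     # 1.0704, matching the table
print(round(odds_ratios["leftright"], 4))  # 1.7557
```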
Example 7. Logit in R
> m7 <- glm(party2~harper+leftright+male+age, data=data, family="binomial")
> summary(m7)
Call:
glm(formula = party2 ~ harper + leftright + male + age, family = "binomial",
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4018 -0.3470 -0.1040 0.4138 3.1348
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.549818 0.653717 -11.549 <2e-16 ***
harper 0.068045 0.004688 14.516 <2e-16 ***
leftright 0.562844 0.068485 8.219 <2e-16 ***
maleMale 0.140591 0.205900 0.683 0.495
age 0.009152 0.007030 1.302 0.193
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
harper | .0134041 .0008582 15.62 0.000 .011722 .0150861
leftright | .1108737 .0135569 8.18 0.000 .0843027 .1374448
male | .0277301 .040754 0.68 0.496 -.0521462 .1076065
age | .0018029 .0013874 1.30 0.194 -.0009164 .0045222
------------------------------------------------------------------------------
Note: dy/dx for factor levels is the discrete change from the base level.
Example 7. Predicted Probabilities in Stata
You should then plot the predicted probabilities over a continuous predictor, either with margins or mcp, which will give you this graph:
[Figure: predicted Pr(party2) over harper (0-100)]
margins, at(harper=(1(1)100))
marginsplot
*or simply
mcp harper
Example 7. Marginal Effects in R
You can get marginal effects with the mfx package
install.packages("mfx")
library(mfx)
logitmfx(party2~harper+leftright+male+age, data=data, atmean =T)
Call:
logitmfx(formula = party2 ~ harper + leftright + male + age,
data = data, atmean = T)
Marginal Effects:
dF/dx Std. Err. z P>|z|
harper 0.0134041 0.0008582 15.6188 < 2.2e-16 ***
leftright 0.1108737 0.0135569 8.1784 2.876e-16 ***
maleMale 0.0277301 0.0407539 0.6804 0.4962
age 0.0018029 0.0013874 1.2995 0.1938
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
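For a continuous predictor, the logit marginal effect at a point is p(1 − p)β, where p is the predicted probability at that point. A Python sketch of the formula (the probability value here is hypothetical, not taken from the model):

```python
# Logit marginal effect at a point: dF/dx = p*(1-p)*b.
b_harper = 0.068045  # harper coefficient from the logit output
p = 0.30             # assumed predicted probability at the chosen point

mfx = p * (1 - p) * b_harper
print(round(mfx, 4))  # 0.0143
```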
Ordered Logit
------------------------------------------------------------------------------
toobilingual | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
votechoice |
Tories | 2.302053 .462271 4.15 0.000 1.553056 3.412271
NDP | 1.08919 .1980942 0.47 0.639 .7625939 1.555658
BQ | .3370468 .0866534 -4.23 0.000 .2036337 .5578672
Greens | 1.267178 .3692448 0.81 0.416 .7158219 2.243213
|
harper | .999143 .0027409 -0.31 0.755 .9937854 1.00453
leftright | 1.20156 .0428846 5.14 0.000 1.12038 1.288622
male | 1.223467 .1457727 1.69 0.090 .9686658 1.545292
age | 1.010642 .0041668 2.57 0.010 1.002508 1.018842
-------------+----------------------------------------------------------------
/cut1 | -.0463056 .3173242 -.6682495 .5756383
/cut2 | 2.024309 .3232185 1.390812 2.657806
/cut3 | 3.722098 .3387192 3.058221 4.385976
------------------------------------------------------------------------------
Example 8. Ordered Logit in Stata
Be careful: predicted probabilities now need to be computed for all outcomes.
[Figure: four panels of predictive margins with 95% CIs, Pr(Toobilingual==1) through Pr(Toobilingual==4), each plotted over left/right self-placement (0-10)]
Coefficients:
Value Std. Error t value
votechoiceTories 0.8338049 0.200806 4.1523
votechoiceNDP 0.0854366 0.181873 0.4698
votechoiceBQ -1.0875314 0.257097 -4.2300
votechoiceGreens 0.2367981 0.291392 0.8126
harper -0.0008574 0.002744 -0.3125
leftright 0.1836207 0.035691 5.1448
maleMale 0.2016890 0.119147 1.6928
age 0.0105862 0.004124 2.5672
Intercepts:
Value Std. Error t value
1|2 -0.0463 0.3174 -0.1459
2|3 2.0243 0.3232 6.2626
3|4 3.7221 0.3387 10.9880
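The cut points turn the single linear predictor into four category probabilities: P(Y ≤ k) = logistic(cut_k − Xβ). A Python sketch using the cut points above, with an illustrative Xβ = 0:

```python
import math

def logistic(z):
    return 1 / (1 + math.exp(-z))

# Cut points from the ordered logit output; xb = 0 is illustrative.
cuts = [-0.0463, 2.0243, 3.7221]
xb = 0.0

cum = [logistic(c - xb) for c in cuts]  # P(Y <= 1), P(Y <= 2), P(Y <= 3)
probs = [cum[0]] + [b - a for a, b in zip(cum, cum[1:])] + [1 - cum[-1]]
print([round(p, 3) for p in probs])  # four category probabilities, summing to 1
```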
------------------------------------------------------------------------------
votechoice | RRR Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
Libs |
harper | .9331527 .0046223 -13.97 0.000 .924137 .9422562
leftright | .6151022 .0443692 -6.74 0.000 .5340077 .7085118
male | .8724498 .1928675 -0.62 0.537 .5656791 1.345584
age | 1.003075 .0077044 0.40 0.689 .9880881 1.01829
_cons | 331.5598 228.216 8.43 0.000 86.03424 1277.77
-------------+----------------------------------------------------------------
Tories | (base outcome)
-------------+----------------------------------------------------------------
NDP |
harper | .935527 .0050303 -12.39 0.000 .9257195 .9454383
leftright | .5065051 .0400814 -8.60 0.000 .4337361 .5914828
male | .7520023 .1827952 -1.17 0.241 .4669934 1.210954
age | .9872603 .0082014 -1.54 0.123 .9713161 1.003466
_cons | 1235.203 891.2959 9.87 0.000 300.2823 5080.975
-------------+----------------------------------------------------------------
BQ |
harper | .9345699 .0061696 -10.25 0.000 .9225556 .9467407
leftright | .5878576 .0555462 -5.62 0.000 .4884756 .7074593
male | 1.078164 .3300636 0.25 0.806 .5917011 1.964569
age | .9714043 .01007 -2.80 0.005 .9518667 .991343
_cons | 520.7184 433.5257 7.51 0.000 101.8433 2662.4
-------------+----------------------------------------------------------------
Greens |
harper | .9379658 .0070879 -8.47 0.000 .9241761 .9519612
Example 9. Multinomial Logit in Stata
. mlogtest, iia
Call:
multinom(formula = votechoice ~ harper + leftright + age + male,
data = data)
Coefficients:
(Intercept) harper leftright age maleMale
Tories -5.8035817 0.069186831 0.48596775 -0.003074969 0.1365095
NDP 1.3150999 0.002542445 -0.19424343 -0.015891507 -0.1485712
BQ 0.4514183 0.001519217 -0.04532072 -0.032082301 0.2116362
Greens 1.4727807 0.005145085 -0.22967514 -0.046365245 0.2165980
Std. Errors:
(Intercept) harper leftright age maleMale
Tories 0.6883018 0.004953379 0.07213321 0.007680753 0.2210648
NDP 0.4518466 0.003892178 0.05559792 0.006537470 0.1906751
BQ 0.6231527 0.005509566 0.07768518 0.009155438 0.2705429
Greens 0.6837026 0.006601243 0.09366259 0.010719374 0.3188646
#choose baseline
data$vote2 <- relevel(data$votechoice, ref = "Tories")
m10.2 <- multinom(vote2 ~ harper + leftright + age + male, data = data)
summary(m10.2)
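Multinomial logit probabilities are a softmax over the category-specific linear predictors, with the omitted baseline category (here labeled Liberals) fixed at 0. A Python sketch using the multinom() coefficients above and a hypothetical respondent profile:

```python
import math

# Each party has its own linear predictor Xb; the baseline's is 0.
# Coefficient order: (intercept, harper, leftright, age, maleMale).
coef = {
    "Tories": (-5.8035817, 0.069186831, 0.48596775, -0.003074969, 0.1365095),
    "NDP":    (1.3150999, 0.002542445, -0.19424343, -0.015891507, -0.1485712),
    "BQ":     (0.4514183, 0.001519217, -0.04532072, -0.032082301, 0.2116362),
    "Greens": (1.4727807, 0.005145085, -0.22967514, -0.046365245, 0.2165980),
}
x = (1.0, 50.0, 5.0, 40.0, 1.0)  # hypothetical profile: harper=50, LR=5, age=40, male

xb = {p: sum(b * v for b, v in zip(bs, x)) for p, bs in coef.items()}
xb["Liberals"] = 0.0  # baseline category

denom = sum(math.exp(v) for v in xb.values())
probs = {p: math.exp(v) / denom for p, v in xb.items()}
print({p: round(pr, 3) for p, pr in probs.items()})  # sums to 1
```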
Thank You!
Any questions?
jean-philippe.gauvin@umontreal.ca