
Part 1b anova: application 1

In the file “smsa.txt” you can find information on some large Metropolitan Areas in the
US. The variables that are included are

Laborforce   Total number of persons in the labor force (in thousands)
Crimes       Total number of serious crimes
Incclass     Classification of the total income (in millions of dollars) received by
             all residents in the area (divided into 3 income classes)
Region       Geographical region classification: NE, NC, S, W
             (northeast, northcentral, south, west)

1. Check whether the mean number of serious crimes is significantly different in the
different regions. Describe where the differences are located, if any.

2. Check whether the mean number of crimes is significantly different in the different
income classes and whether these differences depend on the region. Describe the
results if differences have been detected.

3. Check whether the number of crimes still depends on the income class if you have
already accounted for the laborforce in the area. Explain.

M. VANDEBROEK - 2023-2024 - KU LEUVEN


Part 1b anova: application 1 - solutions

1. Check whether the mean number of serious crimes is significantly different in the
different regions. Describe where the differences are located, if any.
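library(car)      # provides Anova()
library(emmeans)  # provides emmeans() and pairs()
smsadat = read.table("smsa.txt", header = TRUE)
smsadat$incclass = as.factor(smsadat$incclass)
smsadat$region = as.factor(smsadat$region)
# fit with data= so that emmeans() can look up region by name
crimes_anova = aov(crimes ~ region, data = smsadat)
Anova(crimes_anova)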

Anova Table (Type II tests)

Response: smsadat$crimes
                   Sum Sq  Df F value  Pr(>F)
smsadat$region 3.8311e+09   3  3.1314 0.02842 *
Residuals      4.6491e+10 114
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> emm1=emmeans(crimes_anova, ~region)
> pairs(emm1,adjust="tukey")
 contrast estimate   SE  df t.ratio p.value
 NC - NE      6123 5710 114   1.072  0.7070
 NC - S       -526 4809 114  -0.109  0.9995
 NC - W     -12051 5710 114  -2.111  0.1559
 NE - S      -6650 5254 114  -1.266  0.5865
 NE - W     -18174 6089 114  -2.985  0.0180
 S - W      -11525 5254 114  -2.194  0.1312

P value adjustment: tukey method for comparing a family of 4 estimates

main conclusion: the mean number of crimes differs significantly between the regions
(p-value = 0.0284). The Tukey-adjusted pairwise comparisons show that only the West and
the North East differ significantly from each other (p-value = 0.0180), with the higher
mean in the West.
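
The two p-values quoted in this conclusion can be recomputed from the printed tables. The lines below are a minimal sketch in base R, assuming only the rounded numbers shown above (so small rounding differences are possible): the F value is the ratio of the two mean squares, and the Tukey adjustment refers sqrt(2) times the absolute t ratio to the studentized range distribution for a family of 4 means. The variable names are illustrative.

```r
# Values copied from the one-way ANOVA table above
ss_region <- 3.8311e9;  df_region <- 3
ss_resid  <- 4.6491e10; df_resid  <- 114

# F value = mean square for region / residual mean square
f_value <- (ss_region / df_region) / (ss_resid / df_resid)
p_anova <- pf(f_value, df_region, df_resid, lower.tail = FALSE)

# Tukey-adjusted p-value for the NE - W contrast, from its printed t ratio
t_ne_w  <- -2.985
p_tukey <- ptukey(sqrt(2) * abs(t_ne_w), nmeans = 4, df = df_resid,
                  lower.tail = FALSE)

print(c(F = f_value, p_anova = p_anova, p_tukey = p_tukey))
```

The printed values should agree with the F value, Pr(>F) and the NE - W p.value in the output above, up to rounding of the inputs.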

2. Check whether the mean number of crimes is significantly different in the different
income classes and whether these differences depend on the region. Describe the results
if differences have been detected.

smsadat$incclass = as.factor(smsadat$incclass)
smsadat$region = as.factor(smsadat$region)
# two-way model with interaction; data= lets emmeans() find the factors by name
twoway = aov(crimes ~ incclass * region, data = smsadat)
Anova(twoway)

Anova Table (Type II tests)

Response: smsadat$crimes
                                    Sum Sq  Df  F value    Pr(>F)
smsadat$incclass                2.9149e+10   2 107.5283 < 2.2e-16 ***
smsadat$region                  6.1569e+09   3  15.1417 2.858e-08 ***
smsadat$incclass:smsadat$region 2.9750e+09   6   3.6581  0.002432 **
Residuals                       1.4367e+10 106
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

main conclusion: there is a significant interaction effect between income class and
region (p-value = 0.00243), so the effect of the income class (= the differences between
the mean number of crimes in the different income classes) depends on the region and
the effect of the region depends on the income class. The following interaction plot
illustrates this.
interaction.plot(smsadat$incclass,smsadat$region,smsadat$crimes)
emm1=emmeans(twoway, ~region, by = "incclass")
pairs(emm1, adjust = "tukey")
emm2=emmeans(twoway, ~incclass, by = "region")
pairs(emm2, adjust = "tukey")
[Figure: interaction plot. x-axis: smsadat$incclass (1, 2, 3); y-axis: mean of
smsadat$crimes (about 10000 to 70000); one line per smsadat$region (W, S, NC, NE).]

incclass = 1:
 contrast estimate   SE  df t.ratio p.value
 NC - NE      4590 5048 106   0.909  0.7999
 NC - S      -2419 3981 106  -0.608  0.9294
 NC - W      -6806 4769 106  -1.427  0.4855
 NE - S      -7009 4526 106  -1.549  0.4123
 NE - W     -11396 5233 106  -2.178  0.1361
 S - W       -4387 4212 106  -1.041  0.7255

incclass = 2:
 contrast estimate    SE  df t.ratio p.value
 NC - NE     -1610  9741 106  -0.165  0.9984
 NC - S     -12423  7050 106  -1.762  0.2973
 NC - W     -11856  7810 106  -1.518  0.4303
 NE - S     -10814  9506 106  -1.138  0.6672
 NE - W     -10247 10082 106  -1.016  0.7403
 S - W         567  7515 106   0.075  0.9998

incclass = 3:
 contrast estimate   SE  df t.ratio p.value
 NC - NE     16079 4964 106   3.239  0.0086
 NC - S       -280 4691 106  -0.060  0.9999
 NC - W     -26799 5629 106  -4.761  <.0001
 NE - S     -16359 4691 106  -3.487  0.0039
 NE - W     -42878 5629 106  -7.617  <.0001
 S - W      -26519 5389 106  -4.921  <.0001

P value adjustment: tukey method for comparing a family of 4 estimates

and
region = NC:
 contrast              estimate   SE  df t.ratio p.value
 incclass1 - incclass2    -5871 6127 106  -0.958  0.6048
 incclass1 - incclass3   -34187 4769 106  -7.168  <.0001
 incclass2 - incclass3   -28317 6279 106  -4.509  <.0001

region = NE:
 contrast              estimate   SE  df t.ratio p.value
 incclass1 - incclass2   -12071 9101 106  -1.326  0.3838
 incclass1 - incclass3   -22699 5233 106  -4.338  0.0001
 incclass2 - incclass3   -10628 8949 106  -1.188  0.4631

region = S:
 contrast              estimate   SE  df t.ratio p.value
 incclass1 - incclass2   -15875 5293 106  -2.999  0.0094
 incclass1 - incclass3   -32048 3886 106  -8.247  <.0001
 incclass2 - incclass3   -16174 5681 106  -2.847  0.0145

region = W:
 contrast              estimate   SE  df t.ratio p.value
 incclass1 - incclass2   -10921 6798 106  -1.607  0.2473
 incclass1 - incclass3   -54181 5629 106  -9.625  <.0001
 incclass2 - incclass3   -43259 7297 106  -5.928  <.0001

P value adjustment: tukey method for comparing a family of 3 estimates

main conclusion: the slices indicate that the average number of crimes differs
significantly between the regions only for the third income class. In this income class,
region W has significantly more crimes than the other regions, regions S and NC have
similar crime numbers, and region NE has significantly fewer crimes than the other
regions. In every region the number of crimes is significantly larger in the third income
class than in the first, and it is also significantly larger than in the second income
class except in region NE (p-value = 0.4631). The difference in the number of crimes
between the first and second income class is significant only in region S
(p-value = 0.0094).

3. Check whether the number of crimes still depends on the income class if you have already
accounted for the laborforce in the area. Explain.
ancova_aov = aov(crimes~incclass*laborforce,data = smsadat)
Anova(ancova_aov)

Anova Table (Type II tests)

Response: crimes
                        Sum Sq  Df  F value Pr(>F)
incclass            2.0244e+08   2   1.3731 0.2576
laborforce          1.5063e+10   1 204.3250 <2e-16 ***
incclass:laborforce 1.7981e+08   2   1.2195 0.2993
Residuals           8.2566e+09 112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

main conclusion: as the interaction term is not significant (p-value = 0.2993, so the
lines can be assumed parallel), we can have a look at the main effects. As the effect
of incclass is not significant (p-value = 0.2576), the intercepts do not significantly
differ from each other, so there is indeed no significant difference between the average
number of crimes in the different income classes once the laborforce has been taken into
account. The laborforce has a significant effect on the number of crimes, as is also
visualized in the following plot.
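library(ggplot2)
# points coloured by income class, with the lines fitted by the ANCOVA model
ggplot(smsadat, aes(x = laborforce, y = crimes, colour = incclass)) +
  geom_point(aes(color = incclass)) +
  geom_smooth(formula = y ~ x, method = "lm",
              mapping = aes(y = predict(ancova_aov, smsadat)))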

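[Figure: scatter plot of crimes (y-axis, 0 to 90000) against laborforce (x-axis, 200 to
600), points coloured by incclass (1, 2, 3), with a fitted line per income class.]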
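
The reasoning about parallel lines and equal intercepts can also be read as a sequence of nested model comparisons. The sketch below illustrates this on simulated data (the data frame toy and all numbers in it are hypothetical stand-ins, not the smsa file), using only base R. Note that anova() on two nested lm fits uses the residual variance of the larger of the two models, whereas the Type II table above uses the residuals of the full interaction model for every test, so the F values are closely related but need not be identical.

```r
set.seed(1)
n <- 120
# Simulated stand-in for the smsa data (hypothetical numbers, not the real file)
toy <- data.frame(
  incclass   = factor(sample(1:3, n, replace = TRUE)),
  laborforce = runif(n, min = 50, max = 600)
)
# Here crimes depend on laborforce only, so all classes share one line
toy$crimes <- 2000 + 100 * toy$laborforce + rnorm(n, sd = 8000)

# Three nested models
separate_lines <- lm(crimes ~ incclass * laborforce, data = toy)  # own intercept and slope per class
parallel_lines <- lm(crimes ~ incclass + laborforce, data = toy)  # own intercept, common slope
single_line    <- lm(crimes ~ laborforce, data = toy)             # one line for all classes

# Step 1: is the interaction needed (are the slopes equal)?
test_slopes <- anova(parallel_lines, separate_lines)
# Step 2: given a common slope, do the intercepts differ between classes?
test_intercepts <- anova(single_line, parallel_lines)

print(test_slopes)
print(test_intercepts)
```

Each comparison has 2 numerator degrees of freedom, matching the Df column for incclass and incclass:laborforce in the table above.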

Part 1b anova: application 2

We cannot draw causal conclusions by observing simple before-and-after treatment
changes in outcomes, as factors other than the treatment may influence the outcome
over time. Difference-in-differences (DID) is an analytical approach that facilitates
causal inference even when randomization is not possible. It compares the
before-and-after changes in outcomes for the treatment and control groups to estimate
the overall impact of the treatment. So we have a control group that is not “treated”
and a treatment group that gets the “treatment”, and we measure
• the response in the control group at the start of the study, denoted by $\mu_{\text{control,pre}}$
• the response in the control group at the end of the study, denoted by $\mu_{\text{control,post}}$
• the response in the treatment group at the start of the study, denoted by $\mu_{\text{treatment,pre}}$
• the response in the treatment group at the end of the study, denoted by $\mu_{\text{treatment,post}}$
Remark that
• $\mu_{\text{control,post}} - \mu_{\text{control,pre}}$ measures the “normal evolution” (without treatment)
• $\mu_{\text{control,pre}} - \mu_{\text{treatment,pre}}$ compares the control and treatment group
• assume that without treatment, the “normal evolution” would be the same in both
groups, so one would expect
$\mu_{\text{control,post}} - \mu_{\text{control,pre}} = \mu_{\text{treatment,post}} - \mu_{\text{treatment,pre}}$
if there was no treatment
• the difference between $\mu_{\text{control,post}} - \mu_{\text{control,pre}}$ and
$\mu_{\text{treatment,post}} - \mu_{\text{treatment,pre}}$ measures the treatment effect
(guess where the name DID comes from!)
• using dummy coding with the control group as reference level for the group and the
pre-level as the reference level for the time variable, we get from
$\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$:
\begin{align*}
\mu_{\text{control,pre}} &= \mu \\
\mu_{\text{control,post}} &= \mu + \beta_{\text{post}} \\
\mu_{\text{treatment,pre}} &= \mu + \alpha_{\text{treatment}} \\
\mu_{\text{treatment,post}} &= \mu + \alpha_{\text{treatment}} + \beta_{\text{post}} + (\alpha\beta)_{\text{treatment,post}}
\end{align*}
and therefore

$$(\mu_{\text{control,post}} - \mu_{\text{control,pre}}) - (\mu_{\text{treatment,post}} - \mu_{\text{treatment,pre}}) = -(\alpha\beta)_{\text{treatment,post}}$$

so we need to test whether the interaction term is significant or not
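
The identity above can be checked numerically. The following is a minimal sketch on simulated data (the data frame toy, its column names and the effect sizes are all hypothetical), using base R only. It computes the difference-in-differences from the four cell means, taken here as treatment change minus control change and therefore equal to plus the interaction term, and compares it with the interaction coefficient of a model fitted with R's default dummy (treatment) coding.

```r
set.seed(42)
# Simulated example: 50 units per group, each measured pre and post
toy <- expand.grid(id = 1:50,
                   group = c("control", "treatment"),
                   time = c("pre", "post"))
# Reference levels: control for group, pre for time
toy$group <- factor(toy$group, levels = c("control", "treatment"))
toy$time  <- factor(toy$time, levels = c("pre", "post"))
toy$y <- 20 + 1 * (toy$group == "treatment") - 2 * (toy$time == "post") +
  3 * (toy$group == "treatment" & toy$time == "post") +
  rnorm(nrow(toy), sd = 5)

# Difference-in-differences from the four cell means
cell <- tapply(toy$y, list(toy$group, toy$time), mean)
did_means <- (cell["treatment", "post"] - cell["treatment", "pre"]) -
             (cell["control", "post"] - cell["control", "pre"])

# Interaction coefficient from the two-way model with dummy coding
fit <- lm(y ~ group * time, data = toy)
did_coef <- coef(fit)[["grouptreatment:timepost"]]

print(c(from_means = unname(did_means), from_model = did_coef))
```

The two numbers coincide because the model with interaction reproduces the four cell means exactly; the test of the interaction term is therefore the test of the DID estimate.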

Apply this method to study the effect of a minimum wage change on employment.
Assume that the minimum wage was changed in state Y but not in state X where it
stayed the same. Data were collected on the employment in 410 fastfood restaurants
in both states 2 months before the change took place and 7 months after the change.
The data are in DIDdata (fte = full time equivalents). Check whether the employment
decreased significantly in state Y as generally expected.

Part 1b anova: application 2 - solutions

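library(car)  # provides Anova()
DIDdata = read.table("DIDdata.txt", header = TRUE)
# state * time fits both main effects and the state:time interaction (the DID term)
DIDaov = aov(fte ~ state * time, data = DIDdata)
Anova(DIDaov)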
Anova Table (Type II tests)

Response: fte
           Sum Sq  Df F value  Pr(>F)
state         285   1  3.2239 0.07295 .
time            1   1  0.0065 0.93562
state:time    235   1  2.6598 0.10331
Residuals   69888 790
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

main conclusion: the interaction term is not significant (p-value = 0.1033), so we cannot
conclude from this analysis that the FTE in the state in which the minimum wage was changed
evolved significantly differently than in the state where the minimum wage remained unchanged.
A visualization is always helpful. As the interaction.plot function apparently cannot cope with
the missing values in the FTE variable, I first create another dataset without the observations
that have missing values in any of the variables.
Diddata.plot = DIDdata[complete.cases(DIDdata),]
interaction.plot(Diddata.plot$time, Diddata.plot$state,Diddata.plot$fte)

main conclusion: it looks as if the employment in state X, where the minimum wage stayed
constant, has decreased, whereas in state Y, with a higher minimum wage, the employment
increased. However, the anova test states that the deviation from parallel lines is just
random variation, as the unexplained variance is very large compared to these differences.
I don’t know how to visualize this large unexplained variance in R (let me know if you know
how to do it!); SAS yields the plot at the right, which is very illuminating.
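
One possible way to show the spread of the raw data next to the cell means in base R is to draw a box plot per state and time combination and overlay the mean profiles. The sketch below uses simulated data (the data frame toy, the level names "before" and "after", and all numbers are hypothetical, not the real DIDdata), and it is only one option among several; it does not reproduce the SAS plot.

```r
set.seed(7)
# Simulated stand-in for DIDdata (hypothetical values, not the real file)
toy <- data.frame(
  state = factor(rep(c("X", "Y"), each = 100)),
  time  = factor(rep(c("before", "after"), times = 100),
                 levels = c("before", "after")),
  fte   = rnorm(200, mean = 20, sd = 9)
)
toy$fte[sample(200, 10)] <- NA  # a few missing values, as in the exercise

# Cell means, ignoring missing values; rows are time, columns are state
cell <- tapply(toy$fte, list(toy$time, toy$state), mean, na.rm = TRUE)

# Box plots show the spread within each cell; the boxes are ordered
# before.X, after.X, before.Y, after.Y (missing values are dropped)
boxplot(fte ~ time + state, data = toy, xlab = "time.state", ylab = "fte")
lines(1:2, cell[, "X"], type = "b", pch = 19)  # mean profile of state X
lines(3:4, cell[, "Y"], type = "b", pch = 19)  # mean profile of state Y
```

When the boxes are much taller than the change in the mean profiles, the plot makes visible why a difference in slopes can fail to reach significance.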
