Part 1b

Part 1b anova: application 1
In the file “smsa.txt” you can find information on some large Metropolitan Areas in the
US. The variables that are included are
Laborforce   Total number of persons in the labor force (in thousands)
Crimes       Total number of serious crimes
Incclass     Classification of the total income (in millions of dollars) received by all
             residents in the area (divided into 3 income classes)
Region       Geographical region classification: NE, NC, S, W
             (northeast, northcentral, south, west)
1. Check whether the mean number of serious crimes is significantly different in the
different regions. Describe where the differences are located, if any.
2. Check whether the mean number of crimes is significantly different in the different
income classes and whether these differences depend on the region. Describe the
results if differences have been detected.
3. Check whether the number of crimes still depends on the income class once you have
already accounted for the laborforce in the area. Explain.
M. VANDEBROEK - 2023-2024 - KU LEUVEN

Part 1b anova: application 1 - solutions
1. Check whether the mean number of serious crimes is significantly different in the
different regions. Describe where the differences are located, if any.
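library(car)      # provides Anova()
library(emmeans)  # provides emmeans() and the pairwise comparisons used below
smsadat = read.table("smsa.txt", header = TRUE)
smsadat$incclass = as.factor(smsadat$incclass)
# one-way anova of the number of crimes on the region
crimes_anova = aov(smsadat$crimes ~ smsadat$region)
Anova(crimes_anova)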
Anova Table (Type II tests)
Response: smsadat$crimes
                   Sum Sq  Df F value  Pr(>F)
smsadat$region 3.8311e+09   3  3.1314 0.02842 *
Residuals      4.6491e+10 114
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
emm1 = emmeans(crimes_anova, ~region)
pairs(emm1, adjust = "tukey")
 contrast estimate   SE  df t.ratio p.value
 NC - NE      6123 5710 114   1.072  0.7070
 NC - S       -526 4809 114  -0.109  0.9995
 NC - W     -12051 5710 114  -2.111  0.1559
 NE - S      -6650 5254 114  -1.266  0.5865
 NE - W     -18174 6089 114  -2.985  0.0180
 S - W      -11525 5254 114  -2.194  0.1312
P value adjustment: tukey method for comparing a family of 4 estimates
main conclusion: there is a significant difference between the mean number of crimes in
the different regions (p-value = 0.0284), and the Tukey-adjusted pairwise comparisons show
that only the West and the North East differ significantly from each other
(adjusted p-value = 0.0180).
2. Check whether the mean number of crimes is significantly different in the different
income classes and whether these differences depend on the region. Describe the results
if differences have been detected.
smsadat$incclass = as.factor(smsadat$incclass)
smsadat$region = as.factor(smsadat$region)
twoway = aov(smsadat$crimes~smsadat$incclass*smsadat$region)
Anova(twoway)
Response: smsadat$crimes
                                    Sum Sq Df  F value    Pr(>F)
smsadat$incclass                2.9149e+10  2 107.5283 < 2.2e-16 ***
smsadat$region                  6.1569e+09  3  15.1417 2.858e-08 ***
smsadat$incclass:smsadat$region 2.9750e+09  6   3.6581  0.002432 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
main conclusion: there is a significant interaction effect between income class and
region (p-value = 0.00243), so the effect of the income class (= the differences between
the mean number of crimes in the different income classes) depends on the region and
the effect of the region depends on the income class. The following interaction plot
illustrates this.
interaction.plot(smsadat$incclass,smsadat$region,smsadat$crimes)
emm1=emmeans(twoway, ~region, by = "incclass")
pairs(emm1, adjust = "tukey")
emm2=emmeans(twoway, ~incclass, by = "region")
pairs(emm2, adjust = "tukey")
[Figure: interaction plot of the mean of smsadat$crimes (y-axis, 10000 to 70000) against
smsadat$incclass (x-axis, levels 1, 2, 3), with one line per smsadat$region (W, S, NC, NE)]
incclass = 1:
 contrast estimate    SE  df t.ratio p.value
 NC - NE      4590  5048 106   0.909  0.7999
 NC - S      -2419  3981 106  -0.608  0.9294
 NC - W      -6806  4769 106  -1.427  0.4855
 NE - S      -7009  4526 106  -1.549  0.4123
 NE - W     -11396  5233 106  -2.178  0.1361
 S - W       -4387  4212 106  -1.041  0.7255
incclass = 2:
 contrast estimate    SE  df t.ratio p.value
 NC - NE     -1610  9741 106  -0.165  0.9984
 NC - S     -12423  7050 106  -1.762  0.2973
 NC - W     -11856  7810 106  -1.518  0.4303
 NE - S     -10814  9506 106  -1.138  0.6672
 NE - W     -10247 10082 106  -1.016  0.7403
 S - W         567  7515 106   0.075  0.9998
incclass = 3:
 contrast estimate    SE  df t.ratio p.value
 NC - NE     16079  4964 106   3.239  0.0086
 NC - S       -280  4691 106  -0.060  0.9999
 NC - W     -26799  5629 106  -4.761  <.0001
 NE - S     -16359  4691 106  -3.487  0.0039
 NE - W     -42878  5629 106  -7.617  <.0001
 S - W      -26519  5389 106  -4.921  <.0001
and
region = NC:
 contrast              estimate   SE  df t.ratio p.value
 incclass1 - incclass2    -5871 6127 106  -0.958  0.6048
 incclass1 - incclass3   -34187 4769 106  -7.168  <.0001
region = NE:
region = S:
region = W:
main conclusion: the slices indicate that the average number of crimes differs
significantly between the regions only in the third income class. In this income class,
region W has significantly more crimes than the other regions, regions S and NC have
similar numbers of crimes, and region NE has significantly fewer crimes. Except in region
NE, the number of crimes is significantly larger in the third income class than in the
first and second income classes, whereas the difference between the first and second
income classes is not significant except in region S.
3. Check whether the number of crimes still depends on the income class once you have
already accounted for the laborforce in the area. Explain.
ancova_aov = aov(crimes~incclass*laborforce,data = smsadat)
Anova(ancova_aov)
Response: crimes
                        Sum Sq Df  F value Pr(>F)
incclass            2.0244e+08  2   1.3731 0.2576
laborforce          1.5063e+10  1 204.3250 <2e-16 ***
incclass:laborforce 1.7981e+08  2   1.2195 0.2993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
main conclusion: as the interaction term is not significant (p-value = 0.2993), the
lines can be assumed parallel and we can look at the main effects. As the effect of
incclass is not significant (p-value = 0.2576), the intercepts do not differ significantly
from each other, so there is indeed no evidence of a difference between the average number
of crimes in the different income classes once the laborforce has been taken into account.
The laborforce has a significant effect on the number of crimes, as is also visualized in
the following plot.
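library(ggplot2)
# crimes against laborforce, with the fitted values of ancova_aov drawn per income class
ggplot(smsadat, aes(x = laborforce, y = crimes, colour = incclass)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm",
              mapping = aes(y = predict(ancova_aov, smsadat)))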
[Figure: scatter plot of crimes (y-axis, 0 to 90000) against laborforce (x-axis, 200 to
600), with points and fitted lines coloured by incclass (1, 2, 3)]
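The reasoning above (first test for parallel lines, then for equal intercepts) can also be written as a sequence of nested model comparisons. The following is only a sketch on simulated data, not the smsa data and not the Type II table shown above: the variable names grp, x and y and all numbers are made up for illustration, and the tests use the base R anova() function for nested models.

```r
# Simulated data only: three groups that share the same true line.
set.seed(2)
n <- 40
sim <- data.frame(
  grp = factor(rep(c("1", "2", "3"), each = n)),
  x   = runif(3 * n, min = 50, max = 600)
)
sim$y <- 5000 + 100 * sim$x + rnorm(3 * n, sd = 5000)

full     <- lm(y ~ grp * x, data = sim)  # separate intercepts and slopes
parallel <- lm(y ~ grp + x, data = sim)  # separate intercepts, common slope
common   <- lm(y ~ x, data = sim)        # one line for all groups

# First comparison: are the slopes equal (parallel lines)?
print(anova(parallel, full))
# Second comparison: given a common slope, are the intercepts equal?
print(anova(common, parallel))
```

If the first test is not significant, the second one answers the question whether the groups still differ after accounting for the covariate.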

Part 1b anova: application 2
We cannot draw causal conclusions by observing simple before-and-after treatment
changes in outcomes, as factors other than the treatment may influence the outcome
over time. Difference-in-differences is an analytical approach that facilitates causal
inference even when randomization is not possible. It compares the before-and-after
changes in outcomes for the treatment and control groups to estimate the overall impact
of the treatment. So we have a control group that is not “treated” and a treatment group
that gets the “treatment”, and we measure
• the response in the control group at the start of the study, denoted by $\mu_{\text{control,pre}}$
• the response in the control group at the end of the study, denoted by $\mu_{\text{control,post}}$
• the response in the treatment group at the start of the study, denoted by $\mu_{\text{treatment,pre}}$
• the response in the treatment group at the end of the study, denoted by $\mu_{\text{treatment,post}}$
Note that
• $\mu_{\text{control,post}} - \mu_{\text{control,pre}}$ measures the "normal evolution" (without treatment)
• $\mu_{\text{control,pre}} - \mu_{\text{treatment,pre}}$ compares the control and treatment groups at the start
• we assume that without treatment the "normal evolution" would be the same in both
groups, so one would expect
$\mu_{\text{control,post}} - \mu_{\text{control,pre}} = \mu_{\text{treatment,post}} - \mu_{\text{treatment,pre}}$
if there was no treatment
• the difference between $\mu_{\text{control,post}} - \mu_{\text{control,pre}}$ and
$\mu_{\text{treatment,post}} - \mu_{\text{treatment,pre}}$ measures the treatment effect
(guess where the name DID comes from!)
• using dummy coding with the control group as reference level for the group and the
pre-level as reference level for the time variable, we get from
$\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}$:
$$
\begin{aligned}
\mu_{\text{control,pre}} &= \mu \\
\mu_{\text{control,post}} &= \mu + \beta_{\text{post}} \\
\mu_{\text{treatment,pre}} &= \mu + \alpha_{\text{treatment}} \\
\mu_{\text{treatment,post}} &= \mu + \alpha_{\text{treatment}} + \beta_{\text{post}} + (\alpha\beta)_{\text{treatment,post}}
\end{aligned}
$$
and therefore
$$
(\mu_{\text{control,post}} - \mu_{\text{control,pre}}) - (\mu_{\text{treatment,post}} - \mu_{\text{treatment,pre}}) = -(\alpha\beta)_{\text{treatment,post}}
$$
so we need to test whether the interaction term is significant or not.
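The identity above can be checked numerically. The following is a minimal sketch on simulated data (the group sizes, means and noise level are invented and have nothing to do with DIDdata): with the control group and the pre-level as reference levels, the interaction coefficient of a linear model equals the change in the treatment group minus the change in the control group, i.e. the negative of the left-hand side above.

```r
# Simulated data only: all numbers below are made up for illustration.
set.seed(1)
n <- 50
did <- expand.grid(id = 1:n,
                   group = c("control", "treatment"),
                   time = c("pre", "post"))
# Make the reference levels explicit: control group and pre-measurement.
did$group <- factor(did$group, levels = c("control", "treatment"))
did$time  <- factor(did$time,  levels = c("pre", "post"))

true_mean <- 20 + 2 * (did$group == "treatment") - 1 * (did$time == "post") +
  3 * (did$group == "treatment" & did$time == "post")
did$y <- true_mean + rnorm(nrow(did), sd = 4)

fit <- lm(y ~ group * time, data = did)

# Difference-in-differences computed directly from the four cell means.
m <- tapply(did$y, list(did$group, did$time), mean)
did_from_means <- (m["treatment", "post"] - m["treatment", "pre"]) -
                  (m["control", "post"]   - m["control", "pre"])

# The same quantity read off as the interaction coefficient.
did_from_model <- unname(coef(fit)["grouptreatment:timepost"])
print(c(means = did_from_means, model = did_from_model))
```

The two printed numbers coincide because the model with interaction reproduces the four cell means exactly, so testing the interaction term is the same as testing the difference-in-differences.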
Apply this method to study the effect of a minimum wage change on employment.
Assume that the minimum wage was changed in state Y but stayed the same in state X.
Data were collected on the employment in 410 fast-food restaurants in both states,
2 months before the change took place and 7 months after the change.
The data are in DIDdata (fte = full time equivalents). Check whether the employment
decreased significantly in state Y, as generally expected.
Part 1b anova: application 2 - solutions
library(car)  # provides Anova()
DIDdata = read.table("DIDdata.txt", header = TRUE)
DIDaov = aov(fte ~ state*time, data = DIDdata)
Anova(DIDaov)
Response: fte
           Sum Sq  Df F value  Pr(>F)
state         285   1  3.2239 0.07295 .
time            1   1  0.0065 0.93562
state:time    235   1  2.6598 0.10331
Residuals   69888 790
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
main conclusion: the interaction term is not significant (p-value = 0.1033), so we cannot
conclude from this analysis that the FTE in the state in which the minimum wage was changed
evolved significantly differently from the FTE in the state where the minimum wage remained
unchanged.
A visualization is always helpful. As the interaction.plot function apparently cannot cope with
the missing values in the FTE variable, I first create another dataset without the observations
that have missing values in any of the variables.
Diddata.plot = DIDdata[complete.cases(DIDdata),]
interaction.plot(Diddata.plot$time, Diddata.plot$state,Diddata.plot$fte)
main conclusion: it looks as if the employment decreased in state X, where the minimum wage
stayed constant, whereas it increased in state Y, with a higher minimum wage. The anova test,
however, indicates that the deviation from parallel lines can be attributed to random
variation, as the unexplained variance is very large compared to these differences. I don’t
know how to visualize this large unexplained variance in R (let me know if you know how to do
it!); SAS yields the plot at the right, which is very illuminating.
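One possible way to show the spread of the observations next to the cell means in base R is a boxplot per state and time combination with the means overlaid. This is only a sketch on simulated data and not a reproduction of the SAS plot: the data frame sim, its values and the column cell are invented for illustration, and with the real data one would use the complete cases of DIDdata instead.

```r
# Simulated data only: same mean in every cell, large spread.
set.seed(3)
n <- 100
sim <- expand.grid(id = 1:n, state = c("X", "Y"), time = c("before", "after"))
sim$time <- factor(sim$time, levels = c("before", "after"))
sim$fte  <- 20 + rnorm(nrow(sim), sd = 9)

# One factor with a level for every time and state combination.
sim$cell <- interaction(sim$time, sim$state)
cell_mean <- tapply(sim$fte, sim$cell, mean)
cell_sd   <- tapply(sim$fte, sim$cell, sd)
print(round(rbind(mean = cell_mean, sd = cell_sd), 1))

# Boxplots show the spread within each cell, red dots mark the cell means.
boxplot(fte ~ cell, data = sim, xlab = "time.state", ylab = "fte")
points(seq_along(cell_mean), cell_mean, pch = 19, col = "red")
```

When the boxes are much taller than the differences between the red dots, the picture shows why small differences in the cell means are not significant.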
