Part 1b
In the file “smsa.txt” you can find information on some large Metropolitan Areas in the
US. The variables that are included are
1. Check whether the mean number of serious crimes is significantly different in the
different regions. Describe where the differences are located, if any.
2. Check whether the mean number of crimes is significantly different in the different
income classes and whether these differences depend on the region. Describe the
results if differences have been detected.
3. Check whether the number of crimes still depends on the income class if you have
already accounted for the laborforce in the area. Explain.
1. Check whether the mean number of serious crimes is significantly different in the
different regions. Describe where the differences are located, if any.
library(car)  # provides Anova()
smsadat = read.table("smsa.txt", header = TRUE)
smsadat$incclass = as.factor(smsadat$incclass)
crimes_anova = aov(smsadat$crimes ~ smsadat$region)
Anova(crimes_anova)
Response: smsadat$crimes
Sum Sq Df F value Pr(>F)
smsadat$region 3.8311e+09 3 3.1314 0.02842 *
Residuals 4.6491e+10 114
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
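As a check on where the F value in this table comes from: it is the ratio of the two mean squares (each sum of squares divided by its degrees of freedom). Using only the numbers printed above:

```latex
F = \frac{SS_{\text{region}} / df_{\text{region}}}{SS_{\text{res}} / df_{\text{res}}}
  = \frac{3.8311 \times 10^{9} / 3}{4.6491 \times 10^{10} / 114}
  \approx \frac{1.2770 \times 10^{9}}{4.0782 \times 10^{8}}
  \approx 3.13
```

The reported p-value is the upper tail probability of this value under an F distribution with 3 and 114 degrees of freedom.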
main conclusion: the mean number of crimes differs significantly between the regions
(p-value = 0.0284). The Tukey intervals show that only the means of the West and
the North East differ significantly from each other.
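The conclusion refers to Tukey intervals, but the call that produced them is not shown. A minimal sketch of how such intervals can be obtained with base R's TukeyHSD is given here. The data frame toy and its values are simulated stand-ins (an assumption, since smsa.txt is not reproduced), so the printed numbers will not match the results above.

```r
# Hypothetical stand-in data: the real smsa.txt is not available here
set.seed(1)
toy = data.frame(
  region = factor(rep(c("NC", "NE", "S", "W"), each = 30)),
  crimes = c(rnorm(30, 40000, 15000), rnorm(30, 30000, 15000),
             rnorm(30, 42000, 15000), rnorm(30, 55000, 15000))
)

# One-way ANOVA, fitted with a data argument so the term is simply "region"
toy_anova = aov(crimes ~ region, data = toy)

# Tukey honest significant differences: one interval per pair of regions
tukey = TukeyHSD(toy_anova, "region", conf.level = 0.95)
print(tukey)

# A pair differs significantly at the 5% level if its adjusted p-value is below 0.05
sig = tukey$region[, "p adj"] < 0.05
print(names(sig)[sig])
```

With the real data, the same call on a model fitted as aov(crimes ~ region, data = smsadat) should give the intervals the conclusion refers to.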
2. Check whether the mean number of crimes is significantly different in the different
income classes and whether these differences depend on the region. Describe the results
if differences have been detected.
smsadat$incclass = as.factor(smsadat$incclass)
smsadat$region = as.factor(smsadat$region)
twoway = aov(smsadat$crimes~smsadat$incclass*smsadat$region)
Anova(twoway)
Response: smsadat$crimes
Sum Sq Df F value Pr(>F)
smsadat$incclass 2.9149e+10 2 107.5283 < 2.2e-16 ***
smsadat$region 6.1569e+09 3 15.1417 2.858e-08 ***
smsadat$incclass:smsadat$region 2.9750e+09 6 3.6581 0.002432 **
Residuals 1.4367e+10 106
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
main conclusion: there is a significant interaction effect between income class and
region (p-value = 0.00243), so the effect of the income class (= the differences between
the mean number of crimes in the different income classes) depends on the region and
the effect of the region depends on the income class. The following interaction plot
illustrates this.
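interaction.plot(smsadat$incclass, smsadat$region, smsadat$crimes)
library(emmeans)
# emmeans looks the factors up by name, so the same model is refitted with a data argument
twoway2 = aov(crimes ~ incclass*region, data = smsadat)
emm1 = emmeans(twoway2, ~region, by = "incclass")
pairs(emm1, adjust = "tukey")
emm2 = emmeans(twoway2, ~incclass, by = "region")
pairs(emm2, adjust = "tukey")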
[Interaction plot: mean of smsadat$crimes (y-axis, roughly 10000 to 70000) against smsadat$incclass (1, 2, 3), with one line per smsadat$region (W, S, NC, NE).]
incclass = 1:
contrast estimate SE df t.ratio p.value
NC - NE 4590 5048 106 0.909 0.7999
NC - S -2419 3981 106 -0.608 0.9294
NC - W -6806 4769 106 -1.427 0.4855
NE - S -7009 4526 106 -1.549 0.4123
NE - W -11396 5233 106 -2.178 0.1361
S - W -4387 4212 106 -1.041 0.7255
incclass = 2:
contrast estimate SE df t.ratio p.value
NC - NE -1610 9741 106 -0.165 0.9984
incclass = 3:
contrast estimate SE df t.ratio p.value
NC - NE 16079 4964 106 3.239 0.0086
NC - S -280 4691 106 -0.060 0.9999
NC - W -26799 5629 106 -4.761 <.0001
NE - S -16359 4691 106 -3.487 0.0039
NE - W -42878 5629 106 -7.617 <.0001
S - W -26519 5389 106 -4.921 <.0001
and
region = NC:
contrast estimate SE df t.ratio p.value
incclass1 - incclass2 -5871 6127 106 -0.958 0.6048
incclass1 - incclass3 -34187 4769 106 -7.168 <.0001
incclass2 - incclass3 -28317 6279 106 -4.509 <.0001
region = NE:
contrast estimate SE df t.ratio p.value
incclass1 - incclass2 -12071 9101 106 -1.326 0.3838
incclass1 - incclass3 -22699 5233 106 -4.338 0.0001
incclass2 - incclass3 -10628 8949 106 -1.188 0.4631
region = S:
contrast estimate SE df t.ratio p.value
incclass1 - incclass2 -15875 5293 106 -2.999 0.0094
incclass1 - incclass3 -32048 3886 106 -8.247 <.0001
incclass2 - incclass3 -16174 5681 106 -2.847 0.0145
region = W:
contrast estimate SE df t.ratio p.value
incclass1 - incclass2 -10921 6798 106 -1.607 0.2473
incclass1 - incclass3 -54181 5629 106 -9.625 <.0001
incclass2 - incclass3 -43259 7297 106 -5.928 <.0001
main conclusion: the slices indicate that the average number of crimes differs
significantly between the regions only for the third income class. In this income class,
region W has significantly more crimes than the other regions, region S and region NC
have similar crime numbers and region NE has significantly fewer crimes. Except for
region NE, the number of crimes is significantly larger in the third income class than
in the first and second income class, whereas the difference in number of crimes between
the first and second income class is only significant in region S (p-value = 0.0094).
3. Check whether the number of crimes still depends on the income class if you have already
accounted for the laborforce in the area. Explain.
ancova_aov = aov(crimes~incclass*laborforce,data = smsadat)
Anova(ancova_aov)
Response: crimes
Sum Sq Df F value Pr(>F)
incclass 2.0244e+08 2 1.3731 0.2576
laborforce 1.5063e+10 1 204.3250 <2e-16 ***
incclass:laborforce 1.7981e+08 2 1.2195 0.2993
Residuals 8.2566e+09 112
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
main conclusion: as the interaction term is not significant (p-value = 0.2993, so the
lines can be assumed parallel), we can have a look at the main effects. As the effect
of incclass is not significant, the intercepts do not significantly differ from each other,
so there is indeed no difference between the average number of crimes in the different
income classes if we have already taken into account the laborforce. The laborforce has
a significant effect on the number of crimes as is also visualized in the following plot.
library(ggplot2)
ggplot(smsadat, aes(x = laborforce, y = crimes, colour = incclass)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm",
              mapping = aes(y = predict(ancova_aov, smsadat)))
[Scatter plot: crimes (y-axis, 0 to 90000) against laborforce (x-axis, roughly 200 to 600), with points and fitted lines coloured by incclass (1, 2, 3).]
Apply this method to study the effect of a minimum wage change on employment.
Assume that the minimum wage was changed in state Y but not in state X where it
stayed the same. Data were collected on the employment in 410 fastfood restaurants
in both states 2 months before the change took place and 7 months after the change.
The data are in DIDdata (fte = full time equivalents). Check whether the employment
decreased significantly in state Y as generally expected.
DIDdata = read.table("DIDdata.txt",header=TRUE)
DIDaov = aov(fte ~state*time,data=DIDdata)
Anova(DIDaov)
Anova Table (Type II tests)
Response: fte
Sum Sq Df F value Pr(>F)
state 285 1 3.2239 0.07295 .
time 1 1 0.0065 0.93562
state:time 235 1 2.6598 0.10331
Residuals 69888 790
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
main conclusion: the interaction term is not significant, so we cannot conclude from this
analysis that the FTE in the state in which the minimum wage was changed evolved
significantly differently from the FTE in the state where the minimum wage remained unchanged.
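To make explicit why the state:time term is the one to look at: in a model with two states and two time points, the interaction coefficient equals the change over time in state Y minus the change over time in state X (the difference in differences that the name DIDdata appears to refer to). The sketch below illustrates this on simulated data. The data frame toy, its level names "before" and "after", and all values are hypothetical and not taken from DIDdata.txt.

```r
# Hypothetical stand-in data: 2 states x 2 time points, 50 restaurants per cell
set.seed(2)
toy = expand.grid(rep = 1:50, state = c("X", "Y"), time = c("before", "after"))
toy$state = factor(toy$state, levels = c("X", "Y"))
toy$time = factor(toy$time, levels = c("before", "after"))
toy$fte = rnorm(nrow(toy), mean = 20, sd = 9)

# Same model structure as fte ~ state*time, fitted as a linear model
fit = lm(fte ~ state * time, data = toy)

# Cell means and the difference in differences computed by hand
m = tapply(toy$fte, list(toy$state, toy$time), mean)
did = (m["Y", "after"] - m["Y", "before"]) - (m["X", "after"] - m["X", "before"])

# The interaction coefficient reproduces the hand-computed value
print(coef(fit)["stateY:timeafter"])
print(did)
```

A non-significant interaction therefore means that the two changes over time cannot be distinguished from each other given the residual variation.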
A visualization is always helpful. As the interaction.plot function apparently cannot cope with
the missing values in the FTE variable, I first create another dataset without the observations
that have missing values in any of the variables.
Diddata.plot = DIDdata[complete.cases(DIDdata),]
interaction.plot(Diddata.plot$time, Diddata.plot$state,Diddata.plot$fte)
main conclusion: it looks as if the employment in state X, where the minimum wage stayed
constant, has decreased, whereas in state Y, with a higher minimum wage, the employment
increased. However, the ANOVA test states that the deviation from parallel lines is just
random variation, as the unexplained variance is very large compared to these differences.
I don't know how to visualize this large unexplained variance in R (let me know if you know
how to do it!); SAS yields the plot at the right, which is very illuminating.
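One possible way to show the within-group spread in R is a boxplot per state-by-time cell with the cell means overlaid. This is only a sketch under assumptions: the data frame toy is simulated, the level names are hypothetical, and the result is not meant to reproduce the SAS plot.

```r
# Hypothetical stand-in data with the same structure as DIDdata (state, time, fte)
set.seed(3)
toy = data.frame(
  state = factor(rep(c("X", "Y"), each = 100)),
  time = factor(rep(rep(c("before", "after"), each = 50), times = 2),
                levels = c("before", "after"))
)
toy$fte = rnorm(nrow(toy), mean = 20, sd = 9)

# One box per state-by-time cell: the box heights show the unexplained spread
bp = boxplot(fte ~ time + state, data = toy, xlab = "time.state", ylab = "fte")

# Overlay the cell means, in the same order as the boxes (time varies fastest)
means = tapply(toy$fte, list(toy$time, toy$state), mean)
points(1:4, as.vector(means), pch = 19, col = "red")
```

If the boxes overlap heavily while the means differ only slightly, the plot conveys the same message as the non-significant interaction term.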