
1. Suppose you are interested in knowing if a job candidate's starting salary differs depending on
their educational background and whether or not the job candidate had previous work
experience. Educational background was categorized as arts or science/engineering.
Which test would you use?

1. Simple linear regression    4. One-way ANOVA
2. Multiple regression         5. Two-way ANOVA
3. Logistic regression         6. Repeated measures ANOVA

2. Researchers wanted to investigate whether there is a relationship between a person's number of
years of education and his or her driving record, recorded as total tickets in a lifetime.

Which test would you use?

1. Simple linear regression    4. One-way ANOVA
2. Multiple regression         5. Two-way ANOVA
3. Logistic regression         6. Repeated measures ANOVA

3. Imagine we have a weighted coin where the probability of heads is .65.

a. What is the probability of it landing on tails?

b. What are the odds of getting heads in a toss?

c. What is the odds ratio for heads?

d. Describe the odds ratio you found in context.

4. Researchers wanted to predict whether someone was hired or not based on their age in years,
gender, and years of education.

a. Which test would you use?

1. Simple linear regression    4. One-way ANOVA
2. Multiple regression         5. Two-way ANOVA
3. Logistic regression         6. Repeated measures ANOVA

b. What are the null and alternative hypotheses?

5. Researchers want to investigate the relationship between subjects’ scores on a test of empathy
and their gender, age in years, whether they were a parent or not, and their education level in
years.

Which test would you use?

1. Simple linear regression    4. One-way ANOVA
2. Multiple regression         5. Two-way ANOVA
3. Logistic regression         6. Repeated measures ANOVA

6. Use the SPSS output shown below to answer all parts of this question. Engineers want to use
three pond characteristics: depth, surface strength, and surface area, to predict whether ice type
was landfast or not. They randomly chose 350 ponds in northern Canada to collect data from.

a. What is the regression equation?

b. Assess the model utility.

c. Which predictors are significant?

d. Interpret the beta for depth (βdepth).

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      504.857 (a)         .156                   .208
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

Omnibus Tests of Model Coefficients
                 Chi-square   df   Sig.
Step 1   Step    70.456       3    .000
         Block   70.456       3    .000
         Model   70.456       3    .000

e. The researcher was interested in testing the complete second order model based on his
initial results above. What is the complete second order model?

f. Here is the model summary from the complete second order model (Below). What test
would you use to decide if the new model is better, and what are the hypotheses, test
statistic and degrees of freedom you would use?
Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      456.309 (a)         .249                   .332
a. Estimation terminated at iteration number 7 because parameter estimates changed by less than .001.

g. What are the assumptions, how do you check them, and if you can tell, are they satisfied?

h. What is the prediction equation?

i. What are the odds of ice being landfast if the pond has a depth of 2, a surface strength of 10,
and a surface area of 100?

j. What is the probability of ice being landfast for the same pond as in part i?

7. Researchers wanted to test whether taking a GMAT prep course would improve subjects' scores
on the GMAT. They had students take the GMAT before the prep course, midway through the
course, and after the course was over. They then compared their scores.

a. Which test would you use?

1. Simple linear regression    4. One-way ANOVA
2. Multiple regression         5. Two-way ANOVA
3. Logistic regression         6. Repeated measures ANOVA

b. What are the null and alternative hypotheses?

8. Researchers conducted an experiment investigating whether the average SAT score differed for
people with high or low short-term memory capacity. They randomly selected 100 incoming
freshmen and tested their short-term memory. 50 were classified as having low short-term
memory capacity and 50 were classified as having high short-term memory capacity. When
the researchers plotted the data, they noticed it was not normally distributed.

a. What test should they use to investigate their question?

b. What hypotheses would they be testing?

9. Researchers wanted to test whether staying up all night affected memory recall. They randomly
assigned subjects to three groups: one group stayed up all night, one group stayed up for half of
the night, and the third group slept normally. The next morning they recorded their
performance on a memory test.

a. Which test would you use?

1. Simple linear regression    4. One-way ANOVA
2. Multiple regression         5. Two-way ANOVA
3. Logistic regression         6. Repeated measures ANOVA

b. Can this study determine cause and effect?

10. Researchers wanted to investigate the correlation between the frequency of asthma attacks and
air quality in a small subset of asthmatics. They got a correlation value of .64, but only had a
sample of 10 subjects. Because their sample was small, they decided to use a randomization
test to obtain a p-value for the correlation. Their original sample is this:

# of attacks   13  5 10 18  3 25 12 17  6 12
air quality    52 43 44 54 38 57 51 59 57 60

What might one of their samples for the randomization test look like?
# of attacks
air quality

11. Researchers were interested in how birth weights related to a mother’s age and race. They
used a sample of 1450 births from North Carolina to answer the question. Race was
categorical, recording whether a mom was white, black, hispanic, or other. To model race,
they included three dummy variables: IsWhite, which was 1 if white, 0 if anything else;
IsBlack, which was 1 if black, 0 if anything else; and IsHispanic, which was 1 if hispanic, 0 if
anything else. They got the following output:

The researchers wondered whether race was adding anything to the model. To test this, they used a nested F
test. Here are the results of the reduced model:

a. What hypotheses would they test?

b. What would the test statistic and degrees of freedom be, and what table would you look it
up in?

12. A statistics professor was interested in how heart rate during exercise might be related to how
active a person is in general. They classified 232 students as to how active they were (1=not
active, 2=moderately active, 3= very active). Then they had them run up and down a flight of
stairs and measured their pulse rate. Some of the output from assumption checking is shown
below:

Levene's Test of Equality of Error Variances (a)
Dependent Variable: Active
F       df1   df2   Sig.
2.639   2     229   .074
Tests the null hypothesis that the error variance of the dependent variable is equal across groups.
a. Design: Intercept + Exercise

a. Is the Kruskal-Wallis test more appropriate for this dataset? Why or why not?

Use the output below from the Kruskal-Wallis test to answer:

b. Is the effect of exercise level significant?

c. Follow up the main effect with post hoc tests.

13. Investigators were interested in studying how many calories (kcal) people burned when
walking. They recorded this for a sample of 68 randomly selected people from a typical major
city. They were assigned randomly to a day to walk. They also recorded the minutes they
walked (min), whether it was raining or not (IsRain =1 when raining, 2 when not), and whether
it was the weekend (IsWeekend = 1 when weekend, 0 when weekday). They got the following
model below.
a. Did they meet the assumptions?

b. Is there any reason to use a non-parametric test?

c. When the researchers submitted their paper, the reviewers asked for a bootstrapping
analysis. They ran it and got the following output. Test and interpret the betas. Do any
of the interpretations change when considering the bootstrapping analysis?

Solutions
1. two-way ANOVA

2. simple linear regression

3.

a. P(tails) = 1 - P(heads) = 1 - .65 = .35

b. The odds of getting heads = P(heads) / (1 - P(heads)) = .65/(1-.65) = .65/.35 = 1.857

c. odds ratio for heads = (odds of heads) / (odds of tails) = [.65/(1-.65)] / [.35/(1-.35)] = 1.857/.538 = 3.449

d. The odds of getting heads are 3.45 times the odds of getting tails.
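
The arithmetic in answer 3 can be checked with a short Python sketch. The helper name `odds` and the variable names are illustrative choices, not part of the original solutions.

```python
# Odds and odds ratio for the weighted coin in question 3 (illustrative sketch).
p_heads = 0.65
p_tails = 1 - p_heads


def odds(p):
    """Convert a probability p into odds, p / (1 - p)."""
    return p / (1 - p)


odds_heads = odds(p_heads)            # .65 / .35
odds_tails = odds(p_tails)            # .35 / .65
odds_ratio = odds_heads / odds_tails  # odds of heads relative to odds of tails

print(round(p_tails, 2), round(odds_heads, 3), round(odds_tails, 3), round(odds_ratio, 3))
```

Rounded to the same number of decimals as the solutions, the printed values should match parts a through c.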

4.
a. Logistic regression
b. Gender = 1 when male, 0 when female.
H0: βage = βgender = βeducation = 0; Ha: the betas are not all zero

5. multiple regression
6.
a. Predicted log(odds) = .296 + 4.128(depth) + 47.123(strength) – 31.144(area)

b. H0: βdepth = βstrength = βarea = 0; Ha: the betas are not all zero

Since p <.001, we reject the null hypothesis, and conclude that taken together, depth,
surface strength, and surface area are useful in predicting the odds of whether ice is
landfast or not.

c. H0: βi = 0
Ha: βi ≠ 0
All three predictors (depth, surface strength, and surface area) have p<.001, so we can
reject the null hypothesis for each. This suggests there is a significant log-linear relationship
between each predictor and the odds of whether ice is landfast or not, after accounting for
the other predictors in the model.
d. For every one unit increase in depth, the odds of being landfast increase by a factor
of 62.065 for given values of surface area and surface strength.

e. log(odds) = β0 + β1(Depth) + β2(Surfacestrng) + β3(Surfacearea) + β4(Depth × Surfacestrng) +
β5(Depth × Surfacearea) + β6(Surfacestrng × Surfacearea) + β7(Depth^2) + β8(Surfacestrng^2) +
β9(Surfacearea^2)
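f. Nested likelihood ratio test
H0: β4 = β5 = β6 = β7 = β8 = β9 = 0
Ha: not all of these betas are zero
Test statistic = (-2 log likelihood of the reduced model) - (-2 log likelihood of the full model)
= 504.857 – 456.309
= 48.548
Degrees of freedom = number of betas in subset being tested = 6
The test statistic is compared to a chi-square distribution with 6 degrees of freedom.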
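
As a rough check on answer 6f, the following Python sketch computes the test statistic from the two -2 log likelihood values and a chi-square upper-tail probability. The function `chi2_sf_even_df` is a hypothetical helper written for this example; it uses a closed form that is valid only when the degrees of freedom are even, which happens to hold here (df = 6).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability P(X >= x).

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid only for positive even df (an assumption of this sketch).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


neg2ll_reduced = 504.857  # first-order model (3 predictors)
neg2ll_full = 456.309     # complete second-order model (9 terms)

lr_stat = neg2ll_reduced - neg2ll_full  # likelihood ratio test statistic
df = 6                                  # number of betas being tested
p_value = chi2_sf_even_df(lr_stat, df)

print(round(lr_stat, 3), df, p_value < 0.001)
```

With a statistical library available, a general chi-square survival function would be the more usual choice; the closed form is used here only to keep the sketch dependency-free.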

g. Assumptions:
Linearity: use the Box-Tidwell test, which tests the significance of the interaction between a
predictor and the log of that predictor. It needs to be run separately for each
quantitative predictor.
Independence: The ponds are all in a similar geographical area, though there is not
necessarily any reason to think the kind of ice in one pond affects the kind of ice in
another, so probably a reasonable assumption.
Randomness: The sample was randomly selected, suggesting the spinner model is
accurate.
h. Predicted log(odds) = .296 + 4.128(Depth) + 47.123(Surfacestrength) – 31.144(SurfaceArea)
i. We would plug in the numbers:

Predicted log(odds) = .296 + 4.128(2) + 47.123(10) – 31.144(100)
= .296 + 8.256 + 471.23 – 3114.4
= -2634.618
Odds = e^(-2634.618), which is essentially 0
j. For the probability, we can use the probability form of the model:

π = e^(.296 + 4.128Depth + 47.123Surfacestrength – 31.144SurfaceArea) / (1 + e^(.296 + 4.128Depth + 47.123Surfacestrength – 31.144SurfaceArea))

π = e^(-2634.618) / (1 + e^(-2634.618)), which is essentially 0
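
The plug-in calculation for parts i and j can be sketched in Python as follows. The coefficients come from the prediction equation in part h; the function name `logistic` and the variable names are illustrative assumptions.

```python
import math


def logistic(log_odds):
    """Convert log-odds to a probability without overflowing math.exp."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)  # safe: the argument is negative
    return e / (1.0 + e)


# Coefficients from the prediction equation in part h.
b0, b_depth, b_strength, b_area = 0.296, 4.128, 47.123, -31.144

# Pond from part i: depth 2, surface strength 10, surface area 100.
depth, strength, area = 2, 10, 100

log_odds = b0 + b_depth * depth + b_strength * strength + b_area * area
odds = math.exp(log_odds)   # underflows to 0.0 for a value this negative
prob = logistic(log_odds)

print(round(log_odds, 3), odds, prob)
```

Because the log-odds are so strongly negative, both the odds and the probability are indistinguishable from zero in floating point, which agrees with the hand calculation.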

7. Repeated measures ANOVA

8.
a. Wilcoxon-Mann-Whitney
b. H0: θ1 = θ2
Ha: θ1 ≠ θ2

9.
a. One-way ANOVA
b. Yes, because subjects were randomly assigned to the three groups.

10. Answers can vary here. The key is to use each air quality value exactly once, but randomly
reassigned among the # of attacks values. For example:

# of attacks   13  5 10 18  3 25 12 17  6 12
air quality    43 59 54 52 51 44 57 38 57 60
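
The reshuffling idea in answer 10 can be sketched as a full randomization test in Python. This is a minimal sketch under stated assumptions: the two-sided comparison, the 5000 shuffles, and the fixed seed are arbitrary choices made for this example, and `pearson_r` is a helper defined here rather than something from the original materials.

```python
import math
import random

# Original sample from question 10.
attacks = [13, 5, 10, 18, 3, 25, 12, 17, 6, 12]
air_quality = [52, 43, 44, 54, 38, 57, 51, 59, 57, 60]


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


observed_r = pearson_r(attacks, air_quality)

rng = random.Random(0)  # fixed seed (assumption) so the run is repeatable
n_shuffles = 5000       # number of randomization samples (assumption)
extreme = 0
for _ in range(n_shuffles):
    shuffled = air_quality[:]  # every air quality value is used exactly once
    rng.shuffle(shuffled)      # break the pairing with # of attacks
    if abs(pearson_r(attacks, shuffled)) >= abs(observed_r):
        extreme += 1

p_value = extreme / n_shuffles  # two-sided randomization p-value
print(round(observed_r, 2), p_value)
```

Each pass through the loop produces one randomization sample of the kind the question asks for; the p-value is the share of those samples whose correlation is at least as extreme as the observed one.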

11.
a. H0: β2 = β3 = β4 = 0
Ha: not all of these betas are 0

b. F = ((SSModel_full - SSModel_reduced) / # of predictors tested) / (SSE_full / (n-k-1))

F = ((25110.353 - 15421.439)/3) / (697223.763/(1450-4-1))
= 3229.638/482.508
= 6.693
Test statistic = 6.693
Degrees of freedom: numerator = # of predictors being tested, denominator = error
degrees of freedom for the full model
Numerator = 3, denominator = 1445
You would look this up in an F table.
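
The nested F statistic in answer 11b can be reproduced with a few lines of Python. The sums of squares are the values quoted in the solution; the variable names are illustrative.

```python
# Nested F test for the three race dummy variables (values from answer 11b).
ss_model_full = 25110.353     # SSModel with age + IsWhite + IsBlack + IsHispanic
ss_model_reduced = 15421.439  # SSModel with age only
sse_full = 697223.763         # SSE of the full model

n = 1450      # sample size
k_full = 4    # predictors in the full model
n_tested = 3  # predictors being tested (the race dummies)

df_num = n_tested
df_den = n - k_full - 1

f_stat = ((ss_model_full - ss_model_reduced) / df_num) / (sse_full / df_den)
print(round(f_stat, 3), df_num, df_den)
```

The printed statistic and degrees of freedom are the values that would then be looked up in the F table.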

12.
a. Levene's test (H0: the variances are all equal, Ha: the variances are not all equal) is close to
significant (p = .074), suggesting a trend toward unequal variances, though the
graph of the residuals looks OK. We don't have the data for the rule of thumb test. The
normal probability plot looks OK: the dots fall close to the line, suggesting that we don't
violate the assumption of normality. Our sample size is also fairly large, which suggests
a normal ANOVA would be OK. There do appear to be several outliers, which would be a
reason to run a non-parametric test. However, in general, it seems as if we might be OK
with a regular ANOVA.
b. H0: θ1 = θ2 = θ3
Ha: the medians are not all equal
Since p<.001, we can reject the null hypothesis. This suggests that the median heart rate
differs depending on how active a person is.

c. H0: θi = θj
Ha: θi ≠ θj

Since p<.05 for very active students versus both moderately active (p<.001) and not active
(p<.001) students, we can reject the null hypothesis for those comparisons. This suggests that very
active students have a lower median heart rate after running up and down stairs than students who
are not active or moderately active. Since p>.05 for not active versus moderately active students,
we cannot reject the null hypothesis for that comparison. We have no evidence that the median heart
rate after exercise differs between these two groups.
13.
a. Mean zero: met automatically.
Randomness and independence: participants were randomly selected and assigned to days to
walk, so we should be good.
Normality: the NPP plot looks good, the dots fall mostly on the line.
Equal variance: The residual plot looks fairly uniform for all values of the predicted values.
However, there are a few outliers, with residuals larger than 2.

b. There are a few outliers, but nothing too major like failing the assumptions of equal
variance or normality, or having a very small sample size.

c.
H0: βIsRain = 0
Ha: βIsRain ≠ 0
Since p<.05, we reject the null hypothesis. This suggests that there is a linear relationship
between average calories burned and whether it was raining or not, such that people on average
burned 4.954 fewer calories when it was raining, after accounting for the minutes walked and
whether it was the weekend. When we relax the assumptions for regression and try to account
for sampling variability by running a bootstrapping analysis, this result holds, as p<.05 and our
bias is only .023.

H0: βMin = 0
Ha: βMin ≠ 0
Since p<.05, we reject the null hypothesis. This suggests that there is a linear relationship
between average calories burned and the time people walk, such that for each additional minute
of walking average calories burned increases by 5.38, after accounting for whether it rained and
if it was the weekend. When we relax the assumptions for regression and try to account for
sampling variability by running a bootstrapping analysis, this result holds with p<.001 and a bias
of only .001.

H0: βIsWeekend = 0
Ha: βIsWeekend ≠ 0
Since p>.05, we fail to reject the null hypothesis. We have no evidence of a linear
relationship between average calories burned and whether it was the weekend or not, after
accounting for whether it rained and how long someone walked. When we relax the assumptions
for regression and try to account for sampling variability by running a bootstrapping analysis,
this result holds with p=.097.
