6130 Test 2.1 Practice

DEPARTMENT OF STATISTICS AND DATA SCIENCE

The Wharton School


University of Pennsylvania

Statistics 6130 Fall 2022


Practice Test 2.1

IMPORTANT: You have a single attempt for this test. Make sure you do not submit your
test until you are completely settled on your final answers.

Make sure you have scrolled carefully through the output to make sure you did not miss
any questions.

There are 26 multiple choice questions.

This is a closed book/closed notes test. You are allowed to use a calculator.

You can use a 1 page cheat sheet, both sides. It can be handwritten or typed.

You have one hour and twenty minutes to complete the exam.

The exam is to be completed individually. No interaction with other students is permitted.

The computer output associated with the questions should be considered an essential part
of the questions. The multiple-choice questions are equally weighted; the number of
correct answers determines your grade.

Unless otherwise stated, prediction intervals should use a 95% confidence level.
All logarithms are natural logs (that is, ln or log_e) unless otherwise noted.
You can use scratch paper for any calculations.

STOP
DO NOT TURN THE PAGE UNTIL YOU ARE INSTRUCTED TO PROCEED.
Stat 6130, Practice Test 2.1 -2 of 13-

1. In a simple regression model, the 95% confidence interval for β₁ is (5, 20). What
would you expect to happen if the sample size were increased by a factor of four?
a. The confidence interval would roughly halve in length
b. The confidence interval would roughly double in length
c. The confidence interval would be reduced by roughly a factor of 4
d. The confidence interval would be increased by roughly a factor of 4
e. None of the above
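
For checking the reasoning in question 1, here is a minimal sketch of how interval width responds to sample size. It assumes the standard error shrinks in proportion to 1/sqrt(n) with everything else held fixed; the function name `ci_width` and the illustrative values of `sigma_hat` and `n` are hypothetical and do not come from the test.

```python
import math

def ci_width(sigma_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate width of a 95% interval whose standard error is sigma_hat / sqrt(n)."""
    return 2 * z * sigma_hat / math.sqrt(n)

# Illustrative (made-up) inputs: only the ratio between the two widths matters.
base_width = ci_width(sigma_hat=10.0, n=100)
quadrupled_width = ci_width(sigma_hat=10.0, n=400)
width_ratio = quadrupled_width / base_width
print(width_ratio)
```

Under this assumption the ratio depends only on the factor by which n grows, not on the particular `sigma_hat` chosen.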

2. In using a simple regression model, which of the following actions requires the
normality of the error terms?
a. Estimating the intercept.
b. Estimating the slope.
c. Estimating σ_ε.
d. Using the rule of thumb for a prediction interval: (ŷ ± 2 RMSE).
e. All of the above actions require the normality assumption on the error terms.

3. Pollsters for Gallup want to conduct a poll of likely voters, and they want their
margin of error (for 95% confidence) to be ±3%. About how many likely voters
should they sample?
a. 100 people
b. 1,000 people
c. 10,000 people
d. 100,000 people
e. 1,000,000 people
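
The order of magnitude asked for in question 3 can be sketched with the usual margin-of-error formula for a proportion, z * sqrt(p(1 - p)/n). The sketch assumes the conservative worst case p = 0.5 and z = 1.96; the helper name `sample_size_for_margin` is hypothetical.

```python
import math

def sample_size_for_margin(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Smallest n for which z * sqrt(p * (1 - p) / n) is at most `margin`."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))

# A 3% margin at 95% confidence, using the worst-case proportion p = 0.5.
n_needed = sample_size_for_margin(0.03)
print(n_needed)
```

The result is on the order of a thousand respondents; a proportion further from 0.5 would need somewhat fewer.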

4. Which of the following will cause the width of a confidence interval for a population
mean to increase, all other things being held constant?
a. If you increase the level of confidence from 95% to 99.7%.
b. If the standard deviation is decreased from 50 to 30.
c. If you increase the sample size from 200 to 1000.
d. If you increase the sample size from 200 to 400.
e. None of the above.

5. In a hypothesis test in which the null hypothesis is that the average salary for dentists
is equal to $200,000, the p-value was 0.07. What is the correct interpretation of the
p-value?
a. The chance that the null hypothesis is true is 0.07.
b. The probability of Type II error is 0.07.
c. 7% of dentists in the sample earned under $200,000.
d. 7% of dentists in the sample earned over $200,000.
e. If the null is true, then the chance of observing a sample average more
extreme than the one observed is 0.07.

6. A random variable X can take on the values 10, 20 and 30 with probabilities 0.2, 0.5
and 0.3 respectively. What is E(X) = μ?
a. 0.5.
b. 20.
c. 21.25.
d. 21.
e. None of the above.
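
The expected value in question 6 is a probability-weighted sum, E(X) = Σ x · P(X = x). A minimal sketch using the values and probabilities stated in the question (the variable names are illustrative):

```python
values = [10, 20, 30]
probabilities = [0.2, 0.5, 0.3]

# A valid probability model must sum to 1 (up to floating-point error).
assert abs(sum(probabilities) - 1.0) < 1e-12

# E(X) is the sum of each value times its probability.
expected_value = sum(x * p for x, p in zip(values, probabilities))
print(expected_value)
```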

7. Below is a residual plot from a time-series regression of sales against day of the year.
Does it exhibit any problems regarding the standard regression assumptions?

[Figure: Residual by Row Plot. Residuals (axis from -0.6 to 0.4) plotted against Row Number (0 to 90).]

a. Collinearity.
b. Heteroscedasticity
c. Autocorrelation
d. Interaction
e. It doesn’t suggest any issues with the assumptions.

8. You have been asked to do 30 t-tests on a dataset. What is a valid concern with the
request if you were to use a p-value cut-off of 0.05 on each test?

a. Multiplicity.
b. 30 is not a large enough sample size for the Central Limit Theorem to be reliable.
c. Autocorrelation.
d. The p-values could all be biased toward zero.
e. None of the above are valid concerns.
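
To see why running many tests at a 0.05 cut-off can be a concern, the sketch below computes the chance of at least one false positive across 30 tests. It assumes the tests are independent and that every null hypothesis is true, which is a simplification; the function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) for independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** num_tests

num_tests = 30
fwer = familywise_error_rate(num_tests)
expected_false_positives = num_tests * 0.05  # average number of spurious rejections
bonferroni_cutoff = 0.05 / num_tests         # one common per-test correction
print(fwer, expected_false_positives, bonferroni_cutoff)
```

The Bonferroni cut-off shown is only one of several possible adjustments and is included for illustration.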

9. Which of the following would not be a potentially useful diagnostic to identify
collinearity?
a. The leverage plots.
b. A comparison of the significance of the overall F-statistic to the t-statistics.
c. The VIF’s.
d. The scatterplot matrix.
e. The normal quantile plot of the residuals.

Questions 10-12

The output above represents a simple random sample of 399 workers who began work in an
entry level job in the retail space. The data represents the number of years until their first
promotion.

10. What action would make the data look more symmetric?
a. If you remove the two most egregious outliers.
b. If you take the log of every observation.
c. If you multiply every number by a constant.
d. If you subtract 1.00526 from every number.
e. Squaring every number.

11. What is the approximate 95% confidence interval for the average number of years
until an entry level worker gets promoted?
a. [3.66, 4.23]
b. [1.06, 12.34]
c. [1.82, 5.20]
d. [0, 9.48]
e. The numbers above are not valid because the distribution of the sample is not
normal.

12. What is the p-value for the two-sided hypothesis test where the null is that the mean
number of years until promotion is equal to 3.5?
a. Between 0 and 0.05
b. Between 0.05 and 0.1
c. Between 0.1 and 0.2
d. Above 0.2
e. Impossible to compute from the data provided.

Questions 13-16

Low birth weight (LBW) in babies is well known to be associated with many negative health
outcomes. Identifying potential risk factors for LBW can be helpful in the development of
health interventions for pregnant mothers.

A study was recently conducted over the current year in a single specific district in rural
Nepal to investigate such risk factors. A simple random sample of 296 births (from a large
population of births in the district) were reviewed along with their associated administrative
records to provide data for the study.

The output below comes from a simple regression of birth weight (measured in grams)
against the age of the mother, in years.

13. Referencing the residual plots, which of the regression assumptions are clearly
violated?

i. Independence.
ii. Constant variance
iii. Approximate normality.

a. There is no strong evidence to doubt any of these assumptions.
b. (i) and (ii)
c. (ii) and (iii)
d. (iii)
e. (i)

14. If two mothers differed in age by 5 years, then by how much would you expect their
babies' birth weights to differ?

a. -521 to 1201, with 95% confidence.
b. 112 to 228, with 95% confidence.
c. 224 to 455, with 95% confidence.
d. 158 to 181, with 95% confidence.
e. None of the above.

15. In an identical study conducted in the previous year, the equivalent regression slope
was 35. Taking 35 as the null hypothesis value, what do you learn from this year's
study?
a. The R2 for the regression is too low to provide any reliable information.
b. The current study does not address this question in any way.
c. There has been a significant increase in the slope.
d. There has been a significant decrease in the slope.
e. There is no statistical evidence that the slope has changed from the previous
year.

16. Which of the following provides the best interpretation of the regression slope in this
simple regression?
a. For two mothers who differ in weight by 10 years, the older one's baby is
expected to be approximately 170 grams heavier.
b. Babies are expected to gain 34 grams in their first year of life.
c. For two mothers who differ in age by 5 years, the older one's baby is expected
to be approximately 170 grams heavier at birth.
d. The average weight of these children is 3396 grams.
e. After having taken into account mother's weight, for each additional year in
mother's age, a baby is expected to weigh an additional 34 grams.

Questions 17-23

Economists often try to understand the labor market. A group of economists tried to predict
people’s wages with their years of education and the number of years of experience in the
labor market. Given the nature of the variables and their relationships, it was found most
prudent to apply a natural log transformation to all of them. The output from the regression is
found below. All logarithms in the output are natural logs, that is, log base e.

17. Interpret the coefficient of Log(Education).
a. Holding years of experience constant, an extra year of education is predicted
to increase salary by 27%.
b. Holding years of experience constant, an extra year of education is predicted
to increase salary by 1.27%.
c. Holding years of experience constant, increasing years of education by one
percent is predicted to increase salary by 1.27%.
d. Holding years of experience constant, an extra year of education is predicted
to increase salary by 27%.
e. None of the above.

18. What is the approximate 95% prediction interval for somebody with 10 years of
experience and 16 years of education?
a. (39,236, 39,237)
b. (15,480, 99,454)
c. (18,290, 80,380)
d. (22,285, 54,230)
e. (345, 2213)

19. What does the residual by predicted plot tell you?
a. That the range of Log(salaries) is between -1.5 and 1.5.
b. That neither predictor variable has points with high leverage.
c. That the predictor variables are largely uncorrelated.
d. That there is one point that has high leverage.
e. None of the above

Understanding that there may be different labor conditions in different regions in the US, the
economist then considered whether a given worker came from the East, the Midwest, or the
West, and added the categorical variable Region to the model. The output follows.

20. Based on this fitted model and holding education and experience constant, in which
region is someone predicted to have the highest salary?
a. East
b. Midwest
c. West
d. Cannot say without considering the interaction between education and region.
e. Cannot say without the residual by predicted plot.

21. Is there a significant difference in salary between the regions, holding all else
constant? Take the MRM assumptions as valid.
a. Yes, and the p-value of the relevant test is 0.0017.
b. No, and the p-value of the relevant test is 0.9936.
c. Yes, because the intercept is statistically significant.
d. Yes, and the p-value of the relevant test is 0.0064.
e. There is no information available to provide an answer.

Finally, the interaction between Region and Log(Experience) was considered, with the output
below:

22. What does including the interaction allow the economist to now consider?
a. Whether the effect of experience on salary (holding education constant) is the
same across the different regions in the US.
b. Whether the intercepts of the respective regions’ lines are the same or
different.
c. Whether one region is better in terms of salary for a specific combination of
experience and education while another region might be better for a different
combination of experience and education.
d. (a) and (c).
e. None of the above.

23. What is a reasonable next modeling step here?
a. Remove Log(Experience)*Region[Midwest].
b. Remove Region[Midwest].
c. Remove Log(Experience)*Region.
d. Remove Region.
e. Remove both the Region and Log(Experience)*Region effects.

Questions 24-26

The organizer of a regional consumer electronics show has developed a joint probability
model that describes two features of the individuals who visit the show. The variable X
describes the distance an individual lives away from the show and takes one of three values
in the categories “0-5 miles”, “5-15 miles” and “more than 15 miles”. The random variable Y
describes when they make the decision to attend the show. Again, it is a three-level
categorical with values “The day of the show”, “The week before the show” and “More
than a week before the show”. The model has the potential to inform the show’s
management about their advertising spend, both geographically and temporally (i.e. in time
and space).

The table below presents the joint probability distribution:

                              Distance
Decision       0-5 miles   5-15 miles   15+ miles   TOTAL
Day of         0.1         0.1          W =         0.23
Week before    0.07        0.2          Z =
Week+          0.03        0.1          0.1
TOTAL

24. Two numbers have been omitted from the table, W and Z. What are they?
a. W = 0.1, Z = 0.
b. W = 0, Z = 0.1.
c. W = 0.03, Z = 0.27.
d. W = 0.05, Z = 0.05.
e. There is not enough information to determine these two numbers.

25. What is the probability that a person lives both 5-15 miles from the show and doesn’t
make their attendance decision on the day of the show?
a. 0.1.
b. 0.3.
c. 0.03.
d. 0.092.
e. None of the above.

26. Given that an individual lives within 5 miles of the show, what is the probability that
they make their attendance decision on the day of the show?
a. 1.0.
b. 0.046.
c. 0.1.
d. 0.5.
e. None of the above.
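
The calculations behind questions 24-26 can be sketched from the joint table: a missing cell follows from its row total, the whole table must sum to 1, and a conditional probability is a joint probability divided by the relevant marginal. The dictionary layout and key names below are illustrative choices; the probabilities are copied from the table.

```python
# Known joint probabilities keyed by (decision, distance), copied from the table.
known = {
    ("day_of", "0-5"): 0.10, ("day_of", "5-15"): 0.10,
    ("week_before", "0-5"): 0.07, ("week_before", "5-15"): 0.20,
    ("week_plus", "0-5"): 0.03, ("week_plus", "5-15"): 0.10,
    ("week_plus", "15+"): 0.10,
}
day_of_total = 0.23  # row total given in the table

# W completes the "Day of" row; Z is whatever remains so the table sums to 1.
w = day_of_total - known[("day_of", "0-5")] - known[("day_of", "5-15")]
z = 1.0 - sum(known.values()) - w
joint = {**known, ("day_of", "15+"): w, ("week_before", "15+"): z}

# P(5-15 miles AND decision not made on the day of the show).
p_mid_not_day = sum(p for (decision, distance), p in joint.items()
                    if distance == "5-15" and decision != "day_of")

# P(day of | 0-5 miles) = P(day of AND 0-5 miles) / P(0-5 miles).
p_near = sum(p for (_, distance), p in joint.items() if distance == "0-5")
p_day_given_near = joint[("day_of", "0-5")] / p_near

print(round(w, 2), round(z, 2), round(p_mid_not_day, 2), round(p_day_given_near, 2))
```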
