Sample Final Solutions
1. Suppose you are interested in examining the determinants of earnings. You have
information on the age of the individual as well as their level of education: high school
graduate, college graduate or graduate degree. Let Y= earnings, X1= age, X2= 1 if the
person has studied till high school or less and 0 otherwise, X3 = 1 if the person has
studied more than high school but earned less than a masters degree and 0 otherwise,
X4= 1 if the person has earned a masters level or higher degree and 0 otherwise. Which of
the following model specifications cannot be estimated?
Answer (B): X2, X3 and X4 represent dummy variables for the three possible categories.
We must always use one less dummy variable than the number of categories, otherwise
we run into the problem of perfect linear dependence in the model. This is because
X2+X3+X4=1 for all observations, and the variable multiplied with the intercept has a
value equal to 1 for all observations (this is implicit in the system).
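The linear dependence described above can be checked directly on a small example. The sketch below uses six made-up individuals (the category labels and data are illustrative assumptions, not taken from the exam) and confirms that X2 + X3 + X4 reproduces the intercept column.

```python
# Hypothetical education categories for six people (illustrative data only).
education = ["high_school", "college", "masters",
             "college", "high_school", "masters"]

# Intercept column and the three dummy columns X2, X3, X4.
intercept = [1 for _ in education]
x2 = [1 if e == "high_school" else 0 for e in education]
x3 = [1 if e == "college" else 0 for e in education]
x4 = [1 if e == "masters" else 0 for e in education]

# Each person falls in exactly one category, so the dummies sum to 1,
# which is identical to the intercept column.
dummy_sum = [a + b + c for a, b, c in zip(x2, x3, x4)]
is_perfectly_collinear = dummy_sum == intercept
print(dummy_sum, is_perfectly_collinear)
```

Dropping any one of the three dummies (or the intercept) removes the exact dependence.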
2. A manager of a newly opened coffee shop is struggling to manage the waiting time of
customers. Based on preliminary data he has collected, he hypothesizes that there is a
linear relationship between the average waiting time experienced by customers to
receive their drink and number of customers sitting in the coffee shop. He does not have
access to any statistical software but only has the following descriptive statistics.
The average waiting time for a customer who walks into an empty coffee shop is
(ignore set-up time)
A. 8.5 min
B. 9.24 min
C. 12.07 min
D. Not enough information
Answer (B): Use the data to calculate rxy = 0.85. Then the slope b1= 1.15. Knowing that the
line passes through the mean, we can calculate b0 = 9.24 min.
Questions 3 through 6 are based on the following situation:
A researcher intended to study the relationship between the number of major natural
calamities such as tornadoes, hurricanes, earthquakes, floods that occurred during a year (X)
and the average profit (in millions of dollars) of all insurance companies in the country in
that year (Y). She took a random sample of 10 years in which number of calamities per year
varied from 10 through 23 and found that the estimated least squares regression line is
ŷ = 212.6 − 1.9x.
3. The number 212.6 in the above regression can be interpreted most reasonably as
A. The part of the average profit of the insurance companies that is not associated
with the number of natural calamities
B. Change in profit of insurance companies associated with an additional natural
calamity
C. The average profits for all the insurance companies in the country in a year with
no major natural calamities
D. None of the above
Answer (A): A value of 0 for X makes sense but is outside the range of the data. Hence,
option (A) is better than option (C).
A. -1.9
B. Negative but cannot determine the magnitude.
C. +1.9
D. Positive but cannot determine the magnitude.
Answer (B): The slope in a simple linear regression will have the same sign as the correlation
between X and Y.
5. A randomly selected year had 24 major calamities, and the actual average profits in that
year were $200 million. The residual associated with this year is
A. $200 million
B. $167 million
C. $33 million
D. - $33 million
Answer (C): The estimated average profit is 212.6 – 1.9*24 = $167 million. Therefore, the
residual is 200 – 167 = $33 million.
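The residual arithmetic can be reproduced in a few lines. This is only a sketch of the calculation in the answer; the variable names are illustrative.

```python
# Fitted line from the problem: y-hat = 212.6 - 1.9 * x (profit in $ millions).
b0, b1 = 212.6, -1.9

calamities = 24          # x for the selected year
actual_profit = 200.0    # observed y for that year

predicted_profit = b0 + b1 * calamities
# Residual = actual value minus fitted value.
residual = actual_profit - predicted_profit
print(round(predicted_profit, 1), round(residual, 1))
```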
Answer (C): Both inaccuracies in estimating population parameters as well as the errors in
the population model (due to the variables left out) cause errors in estimation.
7. While comparing two regression models for the same response variable from the same
dataset, it was found that R2 of model A is 0.80 while that of model B is 0.512. Which of
the following is true about the ratio of the RMSE (se) for the two models (Model
A/Model B)?
Questions 8 through 10 are based on the following regression output which was obtained
from a study which linked Age and Smoking (1 = Smoker, 0 = Non-smoker) to the risk of
heart disease:
R-squared
Multiple R
Adj. R-squared 58.46%
Standard Error of Estimate 9.572
Durbin-Watson 1.684
Number of Observations 20
8. The missing value of R2 in the regression should be:
A. 62.84%
B. 72.97%
C. 59.46%
D. 58.46%
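One way to recover R2 from the reported adjusted R2 is to invert the adjustment formula, Adj. R2 = 1 − (1 − R2)(n − 1)/(n − k − 1). The sketch below assumes k = 2 predictors (Age and Smoking, as described in the setup) and n = 20 from the output; the small gap to the listed option presumably comes from rounding in the reported 58.46%.

```python
adj_r_squared = 0.5846   # reported Adj. R-squared
n = 20                   # number of observations
k = 2                    # assumed number of predictors: Age and Smoking

# Invert Adj R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1) to solve for R2.
r_squared = 1 - (1 - adj_r_squared) * (n - k - 1) / (n - 1)
print(f"{r_squared:.2%}")
```

The result is closest to option A.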
A. 3.066
B. 5.863
C. 2.853
D. 2.750
10. Assuming that the OLS assumptions hold, an unbiased estimate for standard deviation
of the error term (σε) is:
A. – 28.086
B. 91.621
C. 9.572
D. 16.707
Answer (C): That’s the same as the Root Mean Square Error. One can either take the square
root of the mean square error or directly read the value of standard error of estimate.
11. Which of the following statements are TRUE in reference to a simple linear regression
line of y on x?
I. The regression line will always pass through at least one of the sample
points (xi, yi)
II. The regression line will always pass through the point (x̄, ȳ), where x̄
and ȳ are respectively the sample means of x and y
III. The point (x̄, ȳ), where x̄ and ȳ are as above, is always included as one
of the points in the sample data set that OLS uses to estimate the
parameters of the model
Answer (B): The regression line does not have to pass through a sample point. Every one of
the data points can have a non-zero residual. What’s included in the data depends on the
data, not the model.
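A small numerical check illustrates the three statements. The four data points below are made up for illustration; the helper fit_line is a hypothetical name for a plain least squares fit.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1, x_bar, y_bar

# Illustrative data only.
xs = [1, 2, 3, 4]
ys = [2, 1, 4, 3]
b0, b1, x_bar, y_bar = fit_line(xs, ys)

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Statement II: the line passes through the point of sample means.
passes_through_means = abs((b0 + b1 * x_bar) - y_bar) < 1e-12
# Statement I fails here: every sample point has a non-zero residual.
no_point_on_line = all(abs(r) > 1e-9 for r in residuals)
# Statement III fails here: the point of means is not one of the observations.
means_in_sample = (x_bar, y_bar) in zip(xs, ys)
print(passes_through_means, no_point_on_line, means_in_sample)
```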
12. Consider a multiple regression model with two predictors. If the overall F-ratio is
significant, i.e., if the p-value associated with the F-ratio is less than the chosen
significance level α, then we can conclude that
Answer (D): It tests only slopes, not intercept, and doesn’t test them separately.
13. You did a multiple linear regression with a set of predictor variables and found that the
overall regression was significant, while the individual predictors were all insignificant.
The most likely explanation for this is that:
Answer (C): This is a classic symptom of multicollinearity and it is due to the inflation in
standard errors, not cancellations of slopes. Further since the overall test was significant, at
least one of the predictors has a linear relationship with the response.
14. The correlation between two variables in a sample equals zero. This implies that:
Answer (C): Since the sample correlation is zero, so will the R2 of simple linear regression.
Substituting into the formula for Adjusted R2 then gives the answer.
15. Consider the plot of residuals versus predictor below for a simple linear regression.
Which of the following statements is true for this regression?
[Figure: Standardised Residuals v Food (standardised residuals plotted against the predictor Food)]
Answer (B): The scatterplot indicates that there will be heteroskedasticity, which makes the
estimate for RMSE incorrect, which in turn makes the prediction intervals incorrect.
A. 25.56
B. 7.94
C. 28.24
D. 22.26
Answer (A): SSR = SST – SSE = 960 – 270 = 690. SSR/k = 690/3 = 230, and SSE/(n – k – 1) =
270/(34 – 4) = 9. F = (SSR/k)/(SSE/(n – k – 1)) = 230/9 = 25.56.
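The F-ratio calculation can be reproduced as follows. The values of SST, SSE, n and k are the ones used in the answer; the variable names are illustrative.

```python
sst = 960.0   # total sum of squares
sse = 270.0   # error (residual) sum of squares
n = 34        # number of observations
k = 3         # number of predictors

ssr = sst - sse                 # regression sum of squares
msr = ssr / k                   # mean square for regression
mse = sse / (n - k - 1)         # mean square error
f_ratio = msr / mse
print(ssr, msr, mse, round(f_ratio, 2))
```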
17. (IGNORE)
A total of 82 games were played in the 2006-2007 season of the National Hockey League
by every team. A team could win maximum of two points per match or a total of 164
points. For each of the 30 teams, data on the number of goals scored per game (Goals/G)
and the percentage of the 164 possible points they won (Win%) during the season were
collected. The least squares fit between the two had the following equation:
Estimated Win% = 0.932 + 19.022 × (Goals/G)
With RMSE = 2.2 and R2 = 0.40. If assumptions of Simple Regression Model are satisfied,
what is the probability that a team scoring 2.5 goals per game will have a Win% of 54.2
or more?
For a hint, please note that since information about Sp is not given, you can make the
usual approximation and use RMSE (i.e. Se) in place of Sp for your analysis. I expect that
you will be able to make such appropriate assumptions without these hints in the finals.
A. Less than 0.5%
B. Between 0.5% and 1%
C. Between 1% and 2.5%
D. Between 2.5% and 5%
Answer (B): The point estimate of Win% is 0.932+19.022*2.5 = 48.5. The actual Win% of a
team can be taken to follow a t-distribution with 30-2 =28 degrees of freedom. The mean of
this distribution is at the point estimate, 48.5, and the SE of this distribution is approximated
with RMSE. Thus, the relevant t-statistic is (54.2-48.5)/2.2 = 2.59. Hence, Pr (T>2.59) = 0.0075.
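The tail probability can be approximated without statistical tables. The sketch below integrates the t density numerically with Simpson's rule using only the standard library; the helper names t_pdf and t_upper_tail are illustrative. It uses the unrounded point estimate, so the t-statistic differs slightly from the 2.59 quoted above.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=20000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

point_estimate = 0.932 + 19.022 * 2.5    # predicted Win% at 2.5 goals per game
rmse = 2.2                               # used in place of Sp, as the hint suggests
df = 30 - 2                              # n - 2 for simple regression

t_stat = (54.2 - point_estimate) / rmse
prob = t_upper_tail(t_stat, df)
print(round(point_estimate, 3), round(t_stat, 2), round(prob, 4))
```

The probability falls between 0.5% and 1%, consistent with option B.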
18. The following estimated regression equation compares total compensation among top
executives in a large set of US public corporations in the 1990s. The variables in the data
set are:
Estimated ln(Earnings) = 3.86 − 0.28 Female + 0.37 ln(MarketValue) + 0.004 Return
From the above regression equation we can conclude that, controlling for return:
Answer (D): Concluding anything about the relationship between Gender and MarketValue
requires an interaction term between these two.
A professor decides to run an experiment to measure the effect of time pressure on final
exam scores. He gives each of the 285 students in his course the same final exam, but some
students have 90 minutes to complete the exam while others have 120 minutes to complete
it. Each student is randomly assigned one of the examination times based on the flip of a
coin. Consider a regression model of the form: Score = b0 + b1X + e.
19. The professor is considering two different choices for X. The first choice would be to
treat X as the time given for the exam (in the sample data set, it would only take values
of 90 and 120 minutes). The second choice would be to make X into a dummy variable,
and code it as 0 for students who are given 90 minutes, and code it as 1 otherwise.
Unable to choose between these two alternatives, the professor decided to include both
variables as X1 and X2 respectively in a multiple regression of Score on these two
predictors. Which of the following statements is true?
Answer (D): Adding both will introduce perfect multicollinearity in the model. The time
variable can be written as 90 + 30 × dummy, an exact linear function of the dummy and the
intercept. Therefore, OLS will not even be able to estimate the model.
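To see why OLS fails here, note that the normal equations require inverting X'X, which is singular when one column is an exact linear combination of the others. The sketch below uses five made-up students (the coin-flip outcomes are illustrative) and shows that the determinant of X'X is exactly zero.

```python
# Hypothetical coin-flip assignments: 0 = 90 minutes, 1 = 120 minutes.
dummy = [0, 1, 1, 0, 1]
minutes = [90 + 30 * d for d in dummy]   # time is an exact function of the dummy

# Design matrix rows: (intercept, X1 = minutes, X2 = dummy).
rows = [(1, m, d) for m, d in zip(minutes, dummy)]

# X'X, computed with integers so the arithmetic is exact.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

determinant = det3(xtx)
print(determinant)   # a zero determinant means X'X cannot be inverted
```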
20. After more deliberation, the professor decided to go with the time variable instead of the
dummy variable. The estimated simple linear equation was E(Score) = 60 + 0.5X. Based
on this, and the accompanying regression output, the professor estimated with
reasonable confidence that the extra 30 minutes resulted in an expected increase in score
of somewhere between 10 and 20 points. The uncertainty in this estimate is primarily
driven by:
A. Sampling variation
B. The difficulty of the exam
C. The choice of 90 and 120 minutes as the test times
D. None of the above
Answer (A): The uncertainty in estimating the mean is primarily due to the uncertainty in
estimating b0 and b1, which in turn is driven by the sample size and quality.
21. Which of the following might be the most likely driver of the intercept term 60 in the
above equation in question 20?
Answer (D): The intercept is the part of the score that is unrelated to time, and unrelated to
individual abilities. Further, it is common to all test takers. So, it reflects the difficulty of the
exam.
22. Consider a simple regression where the estimated coefficient of the x variable is greater
than 1. This necessarily implies that:
Answer (A): Recall that b1 = rxy(Sy/Sx). Since rxy ≤ 1, b1> 1 implies that Sy>Sx.
23. In the previous problem, suppose the coefficient of the x-variable is lower than 1. This
necessarily implies that
Answer (D): If rxy were 1, then we could have concluded that (B) was correct. However, the
errors in the regression model may (very likely) make rxy< 1. Therefore, we cannot conclude
that Sy/Sx< 1.
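A quick numerical counterexample makes the point for question 23. The correlation and standard deviations below are made-up values chosen for illustration: the slope is below 1 even though Sy exceeds Sx.

```python
# Illustrative values only.
r_xy = 0.5    # sample correlation between x and y
s_x = 2.0     # sample standard deviation of x
s_y = 3.0     # sample standard deviation of y

# Slope of the simple regression of y on x: b1 = r_xy * (Sy / Sx).
b1 = r_xy * (s_y / s_x)
print(b1, s_y > s_x)
```

So b1 < 1 alone does not pin down whether Sy/Sx is below or above 1.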
24. Let R2YX be the R2 value associated with the regression of Y (response) on X (predictor).
Let R2XY be the R2 value associated with the regression of X (response) on Y (predictor).
Which of the following is true?
A. R2XY = R2YX
B. R2XY>R2YX
C. R2XY<R2YX
D. None of the above have to be true
Answer (A): Recall that the R2 value for a simple linear regression is equal to r2XY. Therefore,
both the R2 values will be equal.
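The symmetry can be verified numerically by fitting both regressions and computing R2 as 1 − SSE/SST each time. The data and the helper name r_squared below are illustrative assumptions.

```python
def r_squared(xs, ys):
    """R^2 of the simple least squares regression of ys (response) on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 7]

r2_y_on_x = r_squared(x, y)   # Y as response, X as predictor
r2_x_on_y = r_squared(y, x)   # X as response, Y as predictor
print(round(r2_y_on_x, 6), round(r2_x_on_y, 6))
```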
25. (IGNORE)
An IT project manager uses a random sample of the projects he has managed in the past
4 years to estimate a linear relationship between delay in project completion (in days)
and the size of the code (’00s of lines): Estimated delay = 8.30 + 0.125*(code size). Which of
the following can we conclude without any further information on the data or the rest of
the regression output?
Answer (D): Having zero lines of code is not a realistic scenario hence Option A is ruled out.
Option B implies causality while Option C makes inference about the population. Hence
neither statement is justified without further information.
26. A multiple regression model with a person’s weight as response and the person’s height
and the average number of calories consumed per day as predictors was found to have
both slopes positive and significant. Assume further that the height is positively
correlated with calorie consumption. If we consider a simple linear regression model of
weight (response) on height (predictor), the estimated slope from this regression will be:
A. An unbiased estimate of the true population parameter
B. The estimate will be biased upwards
C. The estimate will be biased downwards
D. Cannot conclude any of the above without further information
Answer (B): Since calorie consumption is positively related to weight as well as to height, in
the absence of calorie consumption in the regression, the height variable does double duty
and picks up the effect of both variables. Therefore, the coefficient of the simple linear
regression of weight on height will be biased upwards.
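The "double duty" effect can be illustrated with a tiny constructed data set. In the sketch below, weight is built as exactly 2 × height + 3 × calories (all numbers and units are made up), so the coefficient on height in the full model is 2. The simple regression slope equals 2 plus 3 times the slope of calories on height, which is larger.

```python
def slope(xs, ys):
    """Least squares slope from the simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Illustrative data: calories are positively correlated with height.
height = [1, 2, 3, 4]
calories = [1, 3, 2, 4]
beta_height, beta_calories = 2, 3   # assumed coefficients in the full model
weight = [beta_height * h + beta_calories * c for h, c in zip(height, calories)]

short_slope = slope(height, weight)      # weight on height only
delta = slope(height, calories)          # calories on height
# Omitted-variable identity: short slope = beta_height + beta_calories * delta.
print(short_slope, beta_height + beta_calories * delta)
```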
27. A famous auction house in London is initiating a data analytics approach to understand
factors associated with the value of antique clocks. A regression of a random sample of
32 clocks sold in the last 10 years with Price (‘00$) gave the following output.
The general manager of the auction house claims that this is evidence against the
industry maxim that each additional year of a clock’s age is associated with an average
increase of $1500 in the value of the clock. Do you agree?
Answer (A): Here, the null hypothesis is the industry maxim, i.e., coefficient of age is 15.
Corresponding to the null hypothesis, the t-statistic is (12.736199-15)/0.90238 = -2.51. The p-
value for a two-tailed test for this is 0.018 (approximately 0.02 if you look it up in the t-table).
Hence, the null hypothesis can be rejected at the 5% significance level but not at the 1% level.
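The test statistic in the answer can be reproduced as below. Only the t-statistic is computed; the p-value additionally depends on the residual degrees of freedom from the regression output, which is not reproduced here.

```python
coef_age = 12.736199    # estimated coefficient of age (Price in '00$)
se_age = 0.90238        # standard error of that coefficient
null_value = 15.0       # industry maxim: $1500 per year = 15 in '00$

# t-statistic for H0: coefficient of age equals 15.
t_stat = (coef_age - null_value) / se_age
print(round(t_stat, 2))
```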
Questions 28 – 29 are related to the following description and the JMP reports that follow.
The dataset comes from a set of 420 school districts in California. The response variable is
the test scores of 5th grade students in these districts and is calculated as the average of math
and reading scores for students in each district. The Superintendent of education is
considering whether to decrease class sizes (decrease the student to teacher ratio, labeled
STR hereafter), and wondering whether this would improve student performance on test
scores. Of course, the flip side is that decreasing STR would increase costs and take up much
of the scarce financial resources these school districts have. The regression output of a
simple linear regression of test scores on class sizes (STR) is shown below.
[Figure: scatterplot of TestScr versus STR]
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   698.93295   9.467491    73.82     <.0001*
STR         -2.279808   0.479826    -4.75     <.0001*
Answer (D): Based on the p-value of the slope, as well as the F-ratio, the regression is
significant.
29. A second predictor, the percentage of students whose native language is not English
(PctEL), is added to the regression. While admitting that this leads to an impressive
increase in R2, the school superintendent decides to take a look at the scatterplot of Test
scores on PctEL. The scatterplot, reproduced below, MOST LIKELY indicates that:
[Figure: Bivariate Fit of TestScr By PctEL (scatterplot of TestScr versus PctEL)]
Answer (C): There is more variation in the response for small values of PctEL than for large
values of PctEL.
30. Suppose that you fit a simple regression line between response variable Y and predictor
variable X. Further, suppose that you fit a second regression line between response
variable Y and the predicted values of the response variable Ŷ. Which of the following
statements will be true about this second fitted line?
Answer (C): The first regression finds the best fit line. Hence, the predicted values at each X
remain the same after the second regression. Thus, the second regression line is Ŷi = b0+ b1
Ŷi. Thus, Ŷi(1-b1) = b0. This last equation must be true for all Ŷi. This can only happen if
b0=0 and b1=1.
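This result can be checked numerically: fit Y on X, compute the fitted values, then fit Y on those fitted values. The data and the helper name fit_line are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Illustrative data only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 7]

# First regression: Y on X, then the fitted values Y-hat.
b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]

# Second regression: Y on Y-hat.
c0, c1 = fit_line(y_hat, y)
print(round(b0, 6), round(b1, 6), round(c0, 6), round(c1, 6))
```

In the second regression the intercept comes out as 0 and the slope as 1, up to floating-point error.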