Sample Final Solutions


Sample Final Exam (SMMD – Term I)

Part A: Each question in this part is worth 1 point.

1. Suppose you are interested in examining the determinants of earnings. You have
information on the age of the individual as well as their level of education: high school
graduate, college graduate or graduate degree. Let Y= earnings, X1= age, X2= 1 if the
person has studied till high school or less and 0 otherwise, X3 = 1 if the person has
studied more than high school but earned less than a masters degree and 0 otherwise,
X4= 1 if the person has earned a masters level or higher degree and 0 otherwise. Which of
the following model specifications cannot be estimated?

A. Y = b0 + b1X1 + b2X2 + b3X3
B. Y = b0 + b1X1 + b2X2 + b3X3 + b4X4
C. Y = b0 + b1X1
D. None of the above.

Answer (B): X2, X3 and X4 are dummy variables for the three possible categories.
We must always use one fewer dummy variable than the number of categories; otherwise
the model suffers from perfect linear dependence. This is because
X2+X3+X4=1 for all observations, and the variable multiplied by the intercept also has a
value equal to 1 for all observations (this is implicit in the system).
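The dependence described above can be checked on a small example. The sketch below uses a made-up coding of six people (the rows are hypothetical, not data from the exam) and confirms that the three dummies sum to the implicit intercept column.

```python
# Hypothetical coding of six people into the three education categories.
# Each row is (X2, X3, X4): exactly one dummy equals 1 per person.
dummies = [
    (1, 0, 0),
    (0, 1, 0),
    (0, 0, 1),
    (1, 0, 0),
    (0, 0, 1),
    (0, 1, 0),
]

# The intercept corresponds to an implicit column of ones.
intercept_column = [1] * len(dummies)

# Summing the three dummies row by row reproduces that column exactly,
# which is the perfect linear dependence that blocks estimation of model B.
dummy_sums = [x2 + x3 + x4 for x2, x3, x4 in dummies]

print(dummy_sums == intercept_column)
```

Dropping any one of the three dummies (as in model A) removes the dependence.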

2. A manager of a newly opened coffee shop is struggling to manage the waiting time of
customers. Based on preliminary data he has collected, he hypothesizes that there is a
linear relationship between the average waiting time experienced by customers to
receive their drink and number of customers sitting in the coffee shop. He does not have
access to any statistical software but only has the following descriptive statistics.

Variable               Mean      Covariance matrix
                                 Waiting time    Number of customers
Waiting time           23 min    136 min²        86 cust-min
Number of customers    12 cust   86 cust-min     75 cust²

The average waiting time for a customer who walks into an empty coffee shop is
(ignore set-up time)

A. 8.5 min
B. 9.24 min
C. 12.07 min
D. Not enough information

Answer (B): Use the data to calculate rxy = 86/√(136 × 75) = 0.85. Then the slope is
b1 = rxy(Sy/Sx) = 86/75 = 1.15. Knowing that the line passes through the point of means,
we can calculate b0 = 23 − (86/75) × 12 = 9.24 min.
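The arithmetic in this answer can be reproduced from the summary table alone. This is a minimal sketch; the variable names are illustrative and only the numbers come from the question.

```python
import math

# Summary statistics from the table in question 2.
mean_wait = 23.0            # minutes
mean_customers = 12.0       # customers
var_wait = 136.0            # min^2
var_customers = 75.0        # cust^2
cov_wait_customers = 86.0   # cust-min

# Correlation, slope and intercept of waiting time on number of customers.
r_xy = cov_wait_customers / math.sqrt(var_wait * var_customers)
b1 = cov_wait_customers / var_customers
b0 = mean_wait - b1 * mean_customers

print(round(r_xy, 2), round(b1, 2), round(b0, 2))
```

The intercept b0 is the fitted waiting time when the number of customers is zero, which is what the question asks for.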

Questions 3 through 6 are based on the following situation:

A researcher intended to study the relationship between the number of major natural
calamities such as tornadoes, hurricanes, earthquakes, floods that occurred during a year (X)
and the average profit (in millions of dollars) of all insurance companies in the country in
that year (Y). She took a random sample of 10 years in which number of calamities per year
varied from 10 through 23 and found that the estimated least squares regression line is
ŷ = 212.6 − 1.9x.

3. The number 212.6 in the above regression can be interpreted most reasonably as

A. The part of the average profit of the insurance companies that is not associated
with the number of natural calamities
B. Change in profit of insurance companies associated with an additional natural
calamity
C. The average profits for all the insurance companies in the country in a year with
no major natural calamities
D. None of the above

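Answer (A): A value of 0 for X makes sense but is outside the range of the data (10 through
23), so reading the intercept as the profit in a calamity-free year would be an extrapolation.
Hence, option (A) is better than option (C).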

4. For the above regression equation, correlation between X and Y will be

A. -1.9
B. Negative but cannot determine the magnitude.
C. +1.9
D. Positive but cannot determine the magnitude.

Answer (B): The slope in a simple linear regression will have the same sign as the correlation
between X and Y.

5. A randomly selected year had 24 major calamities, and the actual average profits in that
year were $200 million. The residual associated with this year is

A. $200 million
B. $167 million
C. $33 million
D. - $33 million

Answer (C): The estimated average profits are 212.6 – 1.9*24 = 167. Therefore, the residual is
200 – 167 = 33 million
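As a quick check of this calculation, the sketch below evaluates the fitted line from the question at x = 24 and subtracts the result from the actual value. The function name is illustrative.

```python
def predicted_profit(calamities):
    """Fitted line from the question: profit (in $ millions) = 212.6 - 1.9 * x."""
    return 212.6 - 1.9 * calamities

actual = 200.0                 # actual average profit, $ millions
fitted = predicted_profit(24)  # estimated average profit for 24 calamities
residual = actual - fitted     # residual = actual minus fitted

print(round(fitted, 1), round(residual, 1))
```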

6. The reason for the residual in the previous question is:

A. Sampling variability – the coefficients were estimated from a random sample
B. Insurance company profits are determined by things other than number of
natural calamities
C. Both of the above
D. None of the above

Answer (C): Both inaccuracies in estimating population parameters as well as the errors in
the population model (due to the variables left out) cause errors in estimation.

7. While comparing two regression models for the same response variable from the same
dataset, it was found that R2 of model A is 0.80 while that of model B is 0.512. Which of
the following is true about the ratio of the RMSE (se) for the two models (Model
A/Model B)?

A. Its exact value cannot be determined based on the information given
B. It will be greater than 1
C. It is equal to 0.64
D. Both A and B
Answer (C): Using the definition of RMSE, we get
$\frac{RMSE_A}{RMSE_B} = \sqrt{\frac{SSE_A}{SSE_B}}$ because n-2 is the same
for both models (we are using the same dataset). Now use the fact that SSE = SST – SSR and
that SST_A = SST_B, again because we are using the same dataset. Then we can write
$\frac{RMSE_A}{RMSE_B} = \sqrt{\frac{1 - SSR_A/SST}{1 - SSR_B/SST}} = \sqrt{\frac{1 - R_A^2}{1 - R_B^2}}$.
Substituting appropriate values, you get the correct answer.
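The substitution can be carried out numerically. This sketch assumes, as the answer does, that both models have the same error degrees of freedom, so the ratio of RMSEs depends only on the two R2 values.

```python
import math

r2_a = 0.80    # R-squared of model A
r2_b = 0.512   # R-squared of model B

# With equal SST and equal error degrees of freedom,
# RMSE_A / RMSE_B = sqrt((1 - R2_A) / (1 - R2_B)).
ratio = math.sqrt((1 - r2_a) / (1 - r2_b))

print(round(ratio, 2))
```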

Questions 8 through 10 are based on the following regression output which was obtained
from a study which linked Age and Smoking (1 = Smoker, 0 = Non-smoker) to the risk of
heart disease:

Analysis of Variance, ANOVA

              Degrees        Sum of         Mean Square,
              Freedom, df    Squares, SS    MS              F-Ratio    p-Value
Regression    2              2633.388       1316.694        14.371     0.00022
Error         17             1557.562       91.621
Total         19             4190.950

Regression Equation Results

Dependent Variable, Y: RISK
RISK = -28.086 + 0.689 AGE + 14.396 SMOKER

Indep. X                    Standard                            95% Conf.   95% Conf.
Variables     Coefficient   Error      t Statistic   p-Value    Lower       Upper       VIF
Intercept     -28.086       16.707     -1.6811       0.11103    -63.334     7.163
AGE           0.689         0.25       2.7501        0.01367    0.16        1.217       1.203
SMOKER        14.396        4.695                               4.49        24.302      1.203

R-squared
Multiple R
Adj. R-squared                58.46%
Standard Error of Estimate    9.572
Durbin-Watson                 1.684
Number of Observations        20

8. The missing value of R2 in the regression should be:

A. 62.84%
B. 72.97%
C. 59.46%
D. 58.46%

Answer (A): R2 = SSR/SST = 2633.388/4190.95 = 62.84% (approx)

9. The missing value of the t-statistic for SMOKER should be:

A. 3.066
B. 5.863
C. 2.853
D. 2.750

Answer (A): t-statistic = b1/se(b1) = 14.396/4.695 = 3.066

10. Assuming that the OLS assumptions hold, an unbiased estimate for the standard deviation
of the error term (σε) is:

A. – 28.086
B. 91.621
C. 9.572
D. 16.707
Answer (C): That’s the same as the Root Mean Square Error. One can either take the square
root of the mean square error or directly read the value of standard error of estimate.
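The answers to questions 8 through 10 can all be recomputed from the printed output. The sketch below uses only numbers shown in the ANOVA and coefficient tables; the variable names are illustrative. It also recomputes the adjusted R2 as a consistency check against the 58.46% in the output.

```python
import math

# Values read from the ANOVA table.
ssr, sse, sst = 2633.388, 1557.562, 4190.950
df_error = 17
n, k = 20, 2   # observations and predictors

# Question 8: R-squared = SSR / SST.
r_squared = ssr / sst
# Consistency check against the reported adjusted R-squared.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
# Question 9: t statistic = coefficient / standard error for SMOKER.
t_smoker = 14.396 / 4.695
# Question 10: RMSE = square root of the mean square error.
rmse = math.sqrt(sse / df_error)

print(round(100 * r_squared, 2), round(100 * adj_r_squared, 2))
print(round(t_smoker, 3), round(rmse, 3))
```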

11. Which of the following statements are TRUE in reference to a simple linear regression
line of y on x?

I. The regression line will always pass through at least one of the sample
points (xi, yi)
II. The regression line will always pass through the point (x̄, ȳ), where x̄
and ȳ are respectively the sample means of x and y
III. The point (x̄, ȳ), where x̄ and ȳ are as above, is always included as one
of the points in the sample data set that OLS uses to estimate the
parameters of the model

A. I., II. and III.
B. II. only
C. I. and II. only
D. None of the statements is TRUE

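Answer (B): The regression line does not have to pass through a sample point; every one of
the data points can have a non-zero residual. It does always pass through the point of means,
because the OLS intercept is the mean of y minus the slope times the mean of x. Whether that
point is included in the data depends on the data, not the model.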

12. Consider a multiple regression model with two predictors. If the overall F-ratio is
significant, i.e., if the p-value associated with the F-ratio is less than the α-value, then
we can conclude that

A. b0, b1 and b2 are different from zero
B. b1 and b2 are different from zero
C. b1 or b2 is different from zero, but not both
D. Either b1 or b2 or both are different from zero

Answer (D): The overall F-test tests only the slopes, not the intercept, and doesn’t test them separately.

13. You did a multiple linear regression with a set of predictor variables and found that the
overall regression was significant, while the individual predictors were all insignificant.
The most likely explanation for this is that:

A. The response has no linear relationship with any of the predictors
B. The predictors are each cancelling out the other predictors’ effects
C. There is multicollinearity among the predictor variables
D. We’ll need to look at the value of R2 before trying to explain this

Answer (C): This is a classic symptom of multicollinearity and it is due to the inflation in
standard errors, not cancellations of slopes. Further since the overall test was significant, at
least one of the predictors has a linear relationship with the response.

14. The correlation between two variables in a sample equals zero. This implies that:

A. The two variables must be independent in the population
B. A regression with one of these variables as a response, and the other one as a
predictor will be significant
C. The adjusted R2 for the regression in alternative B will be negative
D. None of the above

Answer (C): Since the sample correlation is zero, so is the R2 of the simple linear regression.
Substituting R2 = 0 into the formula for Adjusted R2 gives 1 − (n−1)/(n−2) = −1/(n−2), which
is negative.
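The substitution can be checked for a few sample sizes. This sketch uses the standard adjusted R2 formula, 1 − (1 − R2)(n − 1)/(n − k − 1), with k = 1 predictor; the sample sizes are arbitrary illustrations.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With R-squared equal to zero in a simple regression (k = 1),
# the adjusted value is -1 / (n - 2), negative for every n > 2.
for n in (5, 30, 1000):
    print(n, adjusted_r2(0.0, n, 1))
```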

15. Consider the plot of residuals versus predictor below for a simple linear regression.
Which of the following statements is true for this regression?

[Figure: Standardised Residuals v Food. Vertical axis: standardised residuals (ticks from 0 down to -4); horizontal axis: Food (ticks from 0 to 30). The plot itself is not reproduced here.]

A. This regression is insignificant
B. The prediction intervals based on this regression will be incorrect
C. RMSE is more than 3
D. The errors are likely to be autocorrelated

Answer (B): The scatterplot indicates that there will be heteroskedasticity, which makes the
estimate for RMSE incorrect, which in turn makes the prediction intervals incorrect.

16. In a regression model involving 34 observations, the following estimated regression
model was obtained: ŷ = 48 + 2.5x1 + 1.2x2 − 0.7x3. For this model, the following
statistics were given: SST = 960 and SSE = 270. Then, the value of the F statistic for
testing the validity of this model is:

A. 25.56
B. 7.94
C. 28.24
D. 22.26

Answer (A): SSR = SST – SSE = 960 – 270 = 690. SSR/k = 690/3 = 230, and SSE/(n-k-1) =
270/(34-4) = 9. F = (SSR/k)/(SSE/(n-k-1)) = 230/9 = 25.56
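The F statistic can be recomputed directly from the quantities in the question. A minimal sketch, with illustrative variable names:

```python
n, k = 34, 3               # observations and predictors
sst, sse = 960.0, 270.0    # total and error sums of squares

ssr = sst - sse                  # regression sum of squares
msr = ssr / k                    # mean square for regression
mse = sse / (n - k - 1)          # mean square error
f_stat = msr / mse

print(round(f_stat, 2))
```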

17. (IGNORE)
A total of 82 games were played in the 2006-2007 season of the National Hockey League
by every team. A team could win maximum of two points per match or a total of 164
points. For each of the 30 teams, data on the number of goals scored per game (Goals/G)
and the percentage of the 164 possible points they won (Win%) during the season were
collected. The least squares fit between the two had the following equation:

Win% = 0.932 + 19.022 (Goals/G)

With RMSE = 2.2 and R2 = 0.40. If assumptions of Simple Regression Model are satisfied,
what is the probability that a team scoring 2.5 goals per game will have a Win% of 54.2
or more?

For a hint, please note that since information about Sp is not given, you can make the
usual approximation and use RMSE (i.e. Se) in place of Sp for your analysis. I expect that
you will be able to make such appropriate assumptions without these hints in the finals.

A. Less than 0.5%
B. Between 0.5% and 1%
C. Between 1% and 2.5%
D. Between 2.5% and 5%

Answer (B): The point estimate of Win% is 0.932+19.022*2.5 = 48.5. The actual Win% of a
team can be taken to follow a t-distribution with 30-2 =28 degrees of freedom. The mean of
this distribution is at the point estimate, 48.5, and the SE of this distribution is approximated
with RMSE. Thus, the relevant t-statistic is (54.2-48.5)/2.2 = 2.59. Hence, Pr (T>2.59) = 0.0075.
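The tail probability can be approximated without a t-table. The sketch below integrates the Student t density numerically with Simpson's rule using only the standard library; the helper names are illustrative. It uses the unrounded point estimate, so the t-statistic differs slightly from the 2.59 above, but the probability still falls in the range of option (B).

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

point_estimate = 0.932 + 19.022 * 2.5       # fitted Win% at 2.5 goals per game
t_stat = (54.2 - point_estimate) / 2.2      # RMSE used in place of Sp
p = t_upper_tail(t_stat, df=30 - 2)

print(round(t_stat, 2), round(p, 4))
```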

18. The following estimated regression equation compares total compensation among top
executives in a large set of US public corporations in the 1990s. The variables in the data
set are:

Earnings: Total compensation (in $ ‘000s)
Female: Dummy variable – Equals 1 for females and 0 for males
MarketValue: A measure of firm size (in $ millions)
Return: Stock return (a measure of firm performance in percentage points)

The estimated regression equation is (all predictors were significant):

Estimated ln(Earnings) = 3.86 − 0.28 Female + 0.37 ln(MarketValue) + 0.004 Return

We can conclude from the above regression equation that, controlling for return:

A. Females in larger companies suffer a smaller salary discount than those in smaller
companies
B. Females in smaller companies suffer a smaller salary discount than those in larger
companies
C. A 1% increase in size (as measured by MarketValue) decreases the salary
discount of female executives, on average, by 0.37%
D. None of the above

Answer (D): Concluding anything about the relationship between Gender and MarketValue
requires an interaction term between these two.

Questions 19 – 21 are based on the following problem.

A professor decides to run an experiment to measure the effect of time pressure on final
exam scores. He gives each of the 285 students in his course the same final exam, but some
students have 90 minutes to complete the exam while others have 120 minutes to complete
it. Each student is randomly assigned one of the examination times based on the flip of a
coin. Consider a regression model of the form: Score = b0 + b1X + e.

19. The professor is considering two different choices for X. The first choice would be to
treat X as the time given for the exam (in the sample data set, it would only take values
of 90 and 120 minutes). The second choice would be to make X into a dummy variable,
and code it as 0 for students who are given 90 minutes, and code it as 1 otherwise.
Unable to choose between these two alternatives, the professor decided to include both
variables as X1 and X2 respectively in a multiple regression of Score on these two
predictors. Which of the following statements is true?

A. The estimated coefficients of both predictors will be identical
B. The estimated coefficient of the time variable will be 30 times the coefficient of
the dummy variable
C. The estimated coefficient of the dummy variable will be 30 times the coefficient of
the time variable
D. None of the above

Answer (D): Adding both will introduce perfect multicollinearity in the model, because the
time variable can be written as 90 + 30 × dummy. Therefore, OLS will not even be able to
estimate the model.
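One way to see why OLS fails here is that the matrix X'X becomes singular when one column is an exact linear combination of the others. The sketch below uses a hypothetical assignment of six students (not data from the exam) and checks that the determinant of X'X is exactly zero.

```python
# Hypothetical assignment of six students: dummy is 0 for 90 minutes, 1 for 120.
dummy = [0, 1, 1, 0, 1, 0]
time = [90 + 30 * d for d in dummy]

# Design matrix rows: intercept, time (X1), dummy (X2).
rows = [(1, t, d) for t, d in zip(time, dummy)]

def gram(rows):
    """Return X'X for a list of row tuples."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# A zero determinant means X'X cannot be inverted, so the OLS
# normal equations have no unique solution.
print(det3(gram(rows)))
```

Integer arithmetic is used throughout so that the zero is exact rather than a rounding artifact.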

20. After more deliberation, the professor decided to go with the time variable instead of the
dummy variable. The estimated simple linear equation was E(Score) = 60 + 0.5X. Based
on this, and the accompanying regression output, the professor estimated with
reasonable confidence that the extra 30 minutes resulted in an expected increase in score
of somewhere between 10 and 20 points. The uncertainty in this estimate is primarily
driven by:

A. Sampling variation
B. The difficulty of the exam
C. The choice of 90 and 120 minutes as the test times
D. None of the above

Answer (A): The uncertainty in estimating the mean is primarily due to the uncertainty in
estimating b0 and b1, which in turn is driven by the sample size and quality.

21. Which of the following might be the most likely driver of the intercept term 60 in the
above equation in question 20?

A. The random allocation of students to the two groups
B. The variation in intelligence levels among students
C. The variation in susceptibility to time pressure among students
D. The degree of difficulty of the exam

Answer (D): The intercept is the part of the score that is unrelated to time, and unrelated to
individual abilities. Further, it is common to all test takers. So, it reflects the difficulty of the
exam.

22. Consider a simple regression where the estimated coefficient of the x variable is greater
than 1. This necessarily implies that:

A. The variance of y-variable is higher than the variance of x-variable
B. The variance of y-variable is lower than the variance of x-variable
C. The variance of y-variable is equal to the variance of x-variable
D. We cannot conclude any of the above without further information

Answer (A): Recall that b1 = rxy(Sy/Sx). Since rxy ≤ 1, b1 > 1 implies that Sy/Sx = b1/rxy ≥ b1 > 1, i.e., Sy > Sx.

23. In the previous problem, suppose the coefficient of the x-variable is lower than 1. This
necessarily implies that

A. The variance of y-variable is higher than the variance of x-variable
B. The variance of y-variable is lower than the variance of x-variable
C. The variance of y-variable is equal to the variance of x-variable
D. We cannot conclude any of the above without further information

Answer (D): If rxy were 1, then we could have concluded that (B) was correct. However, the
errors in the regression model may (very likely) make rxy < 1. In that case b1 < 1 is
compatible with Sy/Sx being below, equal to, or above 1, so we cannot conclude
that Sy/Sx < 1.
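Two made-up cases illustrate why no conclusion is possible. Both use the identity b1 = rxy(Sy/Sx) from the previous answer; the numbers are hypothetical and chosen only to show that a slope below 1 can occur with either ordering of the standard deviations.

```python
def slope(r_xy, s_y, s_x):
    """Simple-regression slope b1 = r_xy * (s_y / s_x)."""
    return r_xy * s_y / s_x

# Hypothetical case 1: y is more variable than x, yet the slope is below 1.
b1_case_1 = slope(0.5, 1.5, 1.0)   # s_y > s_x
# Hypothetical case 2: y is less variable than x, and the slope is also below 1.
b1_case_2 = slope(0.9, 0.8, 1.0)   # s_y < s_x

print(b1_case_1, b1_case_2)
```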

24. Let R2YX be the R2 value associated with the regression of Y (response) on X (predictor).
Let R2XY be the R2 value associated with the regression of X (response) on Y (predictor).
Which of the following is true?

A. R2XY = R2YX
B. R2XY>R2YX
C. R2XY<R2YX
D. None of the above have to be true

Answer (A): Recall that the R2 value for a simple linear regression is equal to r2XY. Therefore,
both the R2 values will be equal.
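The symmetry can be confirmed numerically by fitting both regressions. The data below are made up for illustration; the helper computes R2 as 1 − SSE/SST from an ordinary least-squares fit.

```python
def r_squared(x, y):
    """R-squared of the least-squares regression of y (response) on x (predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1 - sse / sst

# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

r2_y_on_x = r_squared(x, y)
r2_x_on_y = r_squared(y, x)
print(r2_y_on_x, r2_x_on_y)
```

The two slopes differ in general, but the two R2 values agree up to floating-point rounding.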

25. (IGNORE)
An IT project manager uses a random sample of the projects he has managed in the past
4 years to estimate a linear relationship between delay in project completion (in days)
and the size of the code (’00s of lines): Estimated delay = 8.30 + 0.125*(code size). Which of
the following can we conclude without any further information on the data or the rest of
the regression output?

A. Minimum delay in project completion is 8.30 days
B. Larger codes result in larger delays
C. There is a positive relationship between size of the code and project delay in the
company
D. None of the above.

Answer (D): Having zero lines of code is not a realistic scenario hence Option A is ruled out.
Option B implies causality while Option C makes inference about the population. Hence
neither statement is justified without further information.

Part B: Each question in this part is worth 2 points.

26. A multiple regression model with a person’s weight as response and the person’s height
and the average number of calories consumed per day as predictors was found to have
both slopes positive and significant. Assume further that the height is positively
correlated with calorie consumption. If we consider a simple linear regression model of
weight (response) on height (predictor), the estimated slope from this regression will be:

A. An unbiased estimate of the true population parameter
B. The estimate will be biased upwards
C. The estimate will be biased downwards
D. Cannot conclude any of the above without further information

Answer (B): Since calorie consumption is positively related to weight as well as to height, in
the absence of calorie consumption in the regression, the height variable does double duty
and picks up the effect of both variables. Therefore, the coefficient of the simple linear
regression of weight on height will be biased upwards.
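The direction of the bias can be stated with the standard omitted-variable formula. The symbols below are introduced here for illustration: $\beta_1$ and $\beta_2$ are the population slopes of height and calories in the multiple regression, and $\tilde{b}_1$ is the slope from the simple regression of weight on height.

```latex
E[\tilde{b}_1] \approx \beta_1 + \beta_2 \,
  \frac{\operatorname{Cov}(\text{Height}, \text{Calories})}{\operatorname{Var}(\text{Height})}
```

With $\beta_2 > 0$ and a positive covariance between height and calorie consumption, the second term is positive, which is the upward bias described in the answer.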

27. A famous auction house in London is initiating a data analytics approach to understand
factors associated with the value of antique clocks. A regression of a random sample of
32 clocks sold in the last 10 years with Price (‘00$) gave the following output.

The general manager of the auction house claims that this is evidence against the
industry maxim that each additional year of a clock’s age is associated with an average
increase of $1500 in the value of the clock. Do you agree?

A. Yes, at 5% significance level
B. Yes, at 1% significance level
C. Yes, at 0.1% significance level
D. Cannot be determined from the output

Answer (A): Here, the null hypothesis is the industry maxim, i.e., the coefficient of age is 15
(Price is measured in ‘00$, so $1500 corresponds to 15). Corresponding to the null hypothesis,
the t-statistic is (12.736199-15)/0.90238 = -2.51. The p-value for a two-tailed test for this
is 0.018 (approximately 0.02 if you look it up in the t-table).
Hence, the null hypothesis can be rejected at the 5% significance level but not at the 1% level.
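The test statistic can be reproduced from the two numbers quoted in the answer. This sketch covers only the t-statistic; the p-value additionally needs the error degrees of freedom from the regression output, which is not reproduced in this text.

```python
estimate = 12.736199       # coefficient of age quoted in the answer (Price in '00$)
std_error = 0.90238        # its standard error, also quoted in the answer
null_value = 1500 / 100    # the $1500 maxim expressed in hundreds of dollars

# Test of H0: coefficient = 15 against a two-sided alternative.
t_stat = (estimate - null_value) / std_error

print(round(t_stat, 2))
```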

Questions 28 – 29 are related to the following description and the JMP reports that follow.
The dataset comes from a set of 420 school districts in California. The response variable is
the test scores of 5th grade students in these districts and is calculated as the average of math
and reading scores for students in each district. The Superintendent of education is
considering whether to decrease class sizes (decrease the student to teacher ratio, labeled
STR hereafter), and wondering whether this would improve student performance on test
scores. Of course, the flip side is that decreasing STR would increase costs and take up much
of the scarce financial resources these school districts have. The regression output of a
simple linear regression of test scores on class sizes (STR) is shown below.

[Figure: plot of TestScr (vertical axis, ticks from 600 to 710) against STR (horizontal axis, ticks from 13 to 26). The plot itself is not reproduced here.]

TestScr = 698.93295 - 2.2798083*STR


Summary of Fit
RSquare                       0.05124
RSquare Adj                   0.04897
Root Mean Square Error        18.58097
Mean of Response              654.1565
Observations (or Sum Wgts)    420

Analysis of Variance
Source      DF     Sum of Squares    Mean Square    F Ratio
Model       1      7794.11           7794.11        22.5751
Error       418    144315.48         345.25         Prob > F
C. Total    419    152109.59                        <.0001*

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    698.93295    9.467491     73.82      <.0001*
STR          -2.279808    0.479826     -4.75      <.0001*

28. Based on the regression output, we can say that

A. The regression is not statistically significant because the R2 is only 5%
B. The regression is not statistically significant because the RMSE is much bigger
than the absolute value of slope
C. STR isn’t a significant predictor of test scores because the intercept term
dominates the slope term
D. None of the above

Answer (D): Based on the p-value of the slope, as well as the F-ratio, the regression is
significant.
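The printed output is internally consistent, which can be checked with a few lines. In a simple regression the F ratio equals the square of the slope's t ratio, and R2 equals the model sum of squares over the total. The sketch below uses only numbers from the JMP report; the variable names are illustrative.

```python
slope, slope_se = -2.279808, 0.479826
ss_model, ss_total = 7794.11, 152109.59
f_reported, r2_reported = 22.5751, 0.05124

t_ratio = slope / slope_se       # matches the reported -4.75
f_from_t = t_ratio ** 2          # equals the F ratio in simple regression
r_squared = ss_model / ss_total  # matches the reported RSquare

print(round(t_ratio, 2), round(f_from_t, 2), round(r_squared, 5))
```

A low R2 and a highly significant slope are compatible: with 420 observations, even a weak linear relationship is estimated precisely enough to be distinguished from zero.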

29. A second predictor, the percentage of students whose native language is not English
(PctEL), is added to the regression. While admitting that this leads to an impressive
increase in R2, the school superintendent decides to take a look at the scatterplot of Test
scores on PctEL. The scatterplot, reproduced below, MOST LIKELY indicates that:

[Figure: Bivariate Fit of TestScr By PctEL. Vertical axis: TestScr (ticks from 600 to 710); horizontal axis: PctEL (ticks from 0 to 90). The plot itself is not reproduced here.]

A. There are a number of influential points
B. The R2 number must have been misread because test scores seem to decrease
with PctEL
C. The errors in the multiple regression model will be heteroskedastic
D. The errors are correlated.

Answer (C): There is more variation in the response for small values of PctEL than for large
values of PctEL.

30. Suppose that you fit a simple regression line between response variable Y and predictor
variable X. Further, suppose that you fit a second regression line between response
variable Y and the predicted values of the response variable Ŷ. Which of the following
statements will be true about this second fitted line?

A. Its slope will be the same as the slope between Y and X
B. Its slope will be the reciprocal of the slope between Y and X
C. Its slope will be 1
D. Its slope will be zero

Answer (C): The first regression finds the best fit line. Hence, the predicted values at each X
remain the same after the second regression. Thus, the second regression line is Ŷi = b0 + b1Ŷi,
where b0 and b1 here denote the intercept and slope of the second regression.
Thus, Ŷi(1-b1) = b0. This last equation must be true for all Ŷi. This can only happen if
b0=0 and b1=1.
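The result can be verified on any sample. The sketch below fits Y on X, computes the predicted values, then fits Y on those predictions. The data are made up, and the names c0 and c1 are used here for the second regression's intercept and slope to keep them apart from the first fit.

```python
def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [3.2, 4.1, 6.3, 6.8, 9.4, 9.9, 12.5]

b0, b1 = ols(x, y)                     # first regression: Y on X
y_hat = [b0 + b1 * a for a in x]       # predicted values
c0, c1 = ols(y_hat, y)                 # second regression: Y on predicted Y

print(round(c0, 10), round(c1, 10))
```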
