Exam 3 (F20)


STAT 3113 Regression Analysis Exam 3 (Fall 2020) Name: Jacob Sheridan

The Boston Housing Dataset


In R, the MASS library contains the Boston data set, which has 506 rows and 14 columns. It records
medv (median house value) and 13 predictors for 506 neighborhoods around Boston. The data frame
contains the following columns:

* crim --- per capita crime rate by town.

* zn --- proportion of residential land zoned for lots over 25,000 sq.ft.

* indus --- proportion of non-retail business acres per town.

* chas --- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

* nox --- nitrogen oxides concentration (parts per 10 million).

* rm --- average number of rooms per dwelling.

* age --- proportion of owner-occupied units built prior to 1940.

* dis --- weighted mean of distances to five Boston employment centres.

* rad --- index of accessibility to radial highways.

* tax --- full-value property-tax rate per $10,000.

* ptratio --- pupil-teacher ratio by town.

* black --- $1000 (Bk - 0.63)^2$ where $Bk$ is the proportion of blacks by town.

* lstat --- lower status of the population (percent).

* medv --- median value of owner-occupied homes in $1000s. (dependent variable)

Use the following R code to load the data:

library(MASS)

names(Boston)

Question 1: Conduct a stepwise regression analysis of the Boston data using R to find the “best”
predictors of medv. Please use 0.15 for both α-to-remove and α-to-enter. Include the R output and
comment on the output.

It appears that lstat is the best predictor, followed by rm, then ptratio, then dis.
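
Base R's step() selects by AIC rather than by p-value, so the 0.15 thresholds need something else. The following is a minimal sketch, assuming partial F-tests from add1() and drop1() are an acceptable stand-in for the course's stepwise routine. The helper name stepwise_p and its arguments are hypothetical and are not part of the exam or of any package.

```r
library(MASS)  # provides the Boston data frame

# Hypothetical helper (not from the exam): stepwise selection driven by the
# partial F-test p-values reported by add1() and drop1().
stepwise_p <- function(data, response, alpha_enter = 0.15, alpha_remove = 0.15,
                       max_iter = 100) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  for (iter in seq_len(max_iter)) {   # max_iter guards against cycling
    changed <- FALSE
    rhs <- if (length(selected) > 0) selected else "1"
    current <- lm(reformulate(rhs, response), data = data)

    # Forward step: enter the candidate with the smallest p-value, if below alpha_enter.
    if (length(setdiff(candidates, selected)) > 0) {
      tab <- add1(current, scope = reformulate(candidates), test = "F")
      p <- tab[["Pr(>F)"]][-1]        # first row is "<none>"
      if (min(p) < alpha_enter) {
        selected <- c(selected, rownames(tab)[-1][which.min(p)])
        current <- lm(reformulate(selected, response), data = data)
        changed <- TRUE
      }
    }

    # Backward step: remove the term with the largest p-value, if above alpha_remove.
    if (length(selected) > 1) {
      tab <- drop1(current, test = "F")
      p <- tab[["Pr(>F)"]][-1]
      if (max(p) > alpha_remove) {
        selected <- setdiff(selected, rownames(tab)[-1][which.max(p)])
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  selected
}

selected <- stepwise_p(Boston, "medv")
print(selected)  # predictors in order of entry
summary(lm(reformulate(selected, "medv"), data = Boston))
```

The order of the returned vector is the order in which the predictors entered, which is what the answer above comments on.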

Question 2: What are the dangers associated with drawing inferences from the stepwise model?

Stepwise regression only considers the first-order terms it is given, so it doesn’t take interaction or higher-order terms into account.

Question 3: Use all-possible-regressions-selection to find the “best” predictors of medv. Include the
adjusted r-square plot and Cp plot, and illustrate how the choice is made.

The higher the adjusted R-squared and the lower the Cp, the better the model. Therefore, the point where
both graphs start to level out indicates roughly how many variables to include in the model.
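
The two plots can be produced along these lines. This is a sketch, assuming the leaps package is installed (the exam does not name a package); regsubsets() reports the best subset of each size rather than printing every possible model.

```r
library(MASS)   # Boston data
library(leaps)  # regsubsets(); assumed to be installed

# Best subset of each size, from 1 predictor up to all 13.
subsets <- regsubsets(medv ~ ., data = Boston, nvmax = 13)
s <- summary(subsets)

# Adjusted R-squared and Mallows' Cp against the number of predictors.
par(mfrow = c(1, 2))
plot(s$adjr2, type = "b", xlab = "Number of predictors", ylab = "Adjusted R-squared")
plot(s$cp, type = "b", xlab = "Number of predictors", ylab = "Cp")

# Terms (including the intercept) in the best model of a given size, here size 4.
names(which(s$which[4, ]))
```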

Question 4: Compare the results in Questions 1 and 3. Which independent variables are consistently
selected as the “best” predictors?

The variables that are consistently selected as the “best” are lstat, rm, ptratio, and dis.

Question 5: Fit the first-order linear model with 10 independent variables: crim, chas, nox, rm, dis,
rad, tax, ptratio, black, lstat. Plot the residual plots. Comment on the four residual plots one by one on
the issues you found. What model adjustments would you recommend?

There is a slight curvilinear trend in the Residuals vs Fitted plot. In the Normal Q-Q plot, the points
deviate from the line toward the right end. In the Scale-Location plot, the points aren’t very spread out.
The Residuals vs Leverage plot is OK, since none of the points fall beyond the Cook’s distance contours.
I would suggest transforming one of the variables.
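
A minimal sketch of the fit and the four diagnostic plots discussed above. The object name fit is chosen to match the crPlots(fit) call in Question 6.

```r
library(MASS)  # Boston data

# First-order model with the 10 predictors named in Question 5.
fit <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black + lstat,
          data = Boston)
summary(fit)

# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(fit)
```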

Question 6: Use the following code to plot the partial residual plots. Comment on the partial residual
plots. What do the plots reveal about the relationship between medv and the independent variables?

library(car)
crPlots(fit)

They reveal whether each variable has a positive or negative effect on medv. They also reveal that
most of the variables have a mostly linear relationship with medv, except rm, which has a curvilinear
relationship with medv.

Question 7: Fit the model with the 10 variables in Question 5 and add the second-order term of lstat, the
second-order term of rm, and the interactions between rad & lstat, rm & rad, crim & chas, and chas & nox.
Include the R output of the model fitting. Comment on the model fitted.

The addition of these terms has caused rm and black to become a lot less significant. The adjusted
R-squared has increased by 0.1333.
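
One way to fit this model and compare it with the first-order fit is sketched below. The object names fit and fit2 are illustrative, and the printed difference should be checked against the figure quoted above rather than taken from this sketch.

```r
library(MASS)  # Boston data

# First-order model from Question 5.
fit <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black + lstat,
          data = Boston)

# Add the two squared terms and the four interactions from Question 7.
fit2 <- update(fit, . ~ . + I(lstat^2) + I(rm^2) +
                 rad:lstat + rm:rad + crim:chas + chas:nox)
summary(fit2)

# Change in adjusted R-squared between the two models.
adj_gain <- summary(fit2)$adj.r.squared - summary(fit)$adj.r.squared
print(adj_gain)
```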

Question 8: For the model fitted in Question 7, is the normality assumption reasonably satisfied?

Yes
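
A sketch of how the normality check might be carried out for the Question 7 model, assuming a Q-Q plot and histogram of the standardized residuals are the intended evidence. The Shapiro-Wilk test is an optional extra that the exam does not ask for; with 506 observations it can flag small departures, so the plots are usually the main guide.

```r
library(MASS)  # Boston data

# Model from Question 7.
fit2 <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black + lstat +
             I(lstat^2) + I(rm^2) + rad:lstat + rm:rad + crim:chas + chas:nox,
           data = Boston)

# Q-Q plot and histogram of the standardized residuals.
res <- rstandard(fit2)
par(mfrow = c(1, 2))
qqnorm(res)
qqline(res)
hist(res, breaks = 30, main = "Standardized residuals", xlab = "Residual")

# Optional formal test (not required by the exam).
sw <- shapiro.test(residuals(fit2))
print(sw)
```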

Question 9: Use the Studentized Deleted Residual method to identify whether there are any outliers.
Hint: In Course Content -> R -> Chapter 8 Residual Analysis R code, find the code to get studentized
deleted residuals of the model you fit in Question 7.
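
The course's Chapter 8 code is not reproduced here. As a stand-in sketch, base R's rstudent() returns the studentized deleted residuals of a fitted lm object. The cutoff of 3 in absolute value is an assumed rule of thumb, not a threshold given in the exam.

```r
library(MASS)  # Boston data

# Model from Question 7.
fit2 <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black + lstat +
             I(lstat^2) + I(rm^2) + rad:lstat + rm:rad + crim:chas + chas:nox,
           data = Boston)

# Studentized deleted (externally studentized) residuals, one per observation.
t_del <- rstudent(fit2)

# Assumed rule of thumb: flag observations with |t| > 3 as potential outliers.
cutoff <- 3
outliers <- which(abs(t_del) > cutoff)
print(t_del[outliers])

plot(t_del, ylab = "Studentized deleted residual")
abline(h = c(-cutoff, cutoff), lty = 2)
```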

Question 10: Use the Cook’s Distance method to identify whether there are any influential observations.
Hint: In Course Content -> R -> Chapter 8 Residual Analysis R code, find the code to get Cook’s Distance
of the model you fit in Question 7.
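
Again as a stand-in for the course code, base R's cooks.distance() gives Cook's distance for each observation. The 4/n cutoff used below is an assumed rule of thumb; the course may use a different threshold, such as comparing against 1 or against a percentile of an F distribution.

```r
library(MASS)  # Boston data

# Model from Question 7.
fit2 <- lm(medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black + lstat +
             I(lstat^2) + I(rm^2) + rad:lstat + rm:rad + crim:chas + chas:nox,
           data = Boston)

# Cook's distance, one value per observation.
d <- cooks.distance(fit2)
n <- nobs(fit2)

# Assumed rule of thumb: flag observations with D > 4/n as potentially influential.
influential <- which(d > 4 / n)
print(sort(d[influential], decreasing = TRUE))

plot(d, type = "h", ylab = "Cook's distance")
abline(h = 4 / n, lty = 2)
```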
