
David Nguyen 740 Fall 2017 Kraker

Assignment Lesson 2 due 09/16/2017 at 11:59pm CDT

Important Note: in order to match the solutions, please use library(FNN)
for any k-nearest-neighbors predictions, throughout this assignment.

We will be continuing (from Webwork Lesson 1) to analyze a data set about
neighborhoods in Boston. We will begin by using these data to fit the model
to the full data set.

a. In Webwork Lesson 1, you saved the data file BostonStd.csv (with
variables age, rad, age.std, rad.std, and crim). You may either open the
dataset if you still have it, or re-make the set of variables from the
Boston data set in the library MASS.

Save the matrix of variables as BostonStd.

A. Show me how
B. OK, got it!

b. We want to fit the model for future use in making predictions. How many
data points are going to be used to fit the model?

# data points for fitting =

c. Use BostonStd to define a matrix x.std containing the variables age.std
and rad.std, and define y to be the variable crim.
A. Show me how
B. OK, got it!

Fit the 25-nearest-neighbors model on the full data set, using the two
variables in x.std (age.std and rad.std) to predict crim. Compute MSE
(using predicted values for the full data set) and report that value (out
to 5 decimal places) here:
MSE =
Answer(s) submitted:

(incorrect)

2. (2 points)
We will now use the standardized Boston data to assess the model (via
cross-validation). Two measures for model assessment are the mean-squared
error, MSE, for all the data and leave-one-out cross-validated measure
CV(n).

*IMPORTANT NOTE for any model requiring data standardization: Since the
entire model-fitting process includes both standardizing the data and
fitting the model to the data, every time we split the data, we must
re-standardize the training data.

a. Which model assessment measure must have the lower value?
A. MSE
B. CV(n)
b. Which model assessment measure do you expect to be more accurate?
A. MSE
B. CV(n)
c. Compute the leave-one-out cross-validated (LOOCV) measure, CV(506), for
25-nearest-neighbors model, reporting that value (out to 5 decimal places)
here: (*remember to restandardize the data within each fold)
CV(506) =
Answer(s) submitted:

(incorrect)

3. (2 points)
A third measure for model assessment is m-fold** cross-validated measure
CV(m). We will use 10-fold cross-validation to compute the value of the
measure CV(10).

*IMPORTANT NOTE for any model requiring data standardization: Since the
entire model-fitting process includes both standardizing the data and
fitting the model to the data, every time we split the data, we must
re-standardize the training data.
**We use "m" to denote the number of folds, to avoid confusion with the "k"
in nearest neighbors.

a. Starting with set.seed(100), use the sample function to come up with
cvgroups, containing the labels for a 10-fold split of the data.
A. OK, got it!
B. Show me how
b. Compute the 10-fold cross-validated measure, CV(10), for
25-nearest-neighbors model, reporting that value (out to 5 decimal places)
here: (*remember to restandardize the data within each fold)
CV(10) =
c. Identify the correct choice and reason for selecting between LOOCV and
m-fold CV for model assessment.
A. We should use LOOCV, since it results in a smaller value of the CV
measure.
B. We prefer to use LOOCV, since it results in a less variable estimate of
error than 10-fold cross-validation.
C. We use 10-fold cross-validation, as a good compromise for bias-variance
trade-off.
Answer(s) submitted:

(incorrect)

4. (1 point)
We now focus on using m-fold cross-validation for selecting between models.

a. Using the same type of loops as in problem 3, we could compute
assessment measure CV(10) for each model fit using number of nearest
neighbors, k = 1, 2, ..., 50. Doing so results in the information displayed
on the following graph.

One line represents the MSE values for using the fitted model to predict
the same data originally used to fit the model; the other line represents
the CV(10) values calculated via 10-fold cross-validation (which predicts
truly new data). Which measure's values are represented by the black solid
line?
A. MSE
B. CV(10)
b. Using the graph from part a., choose the preferred value for the number
of nearest neighbors k to make the best predictions.
A. 10
B. 5
C. 20
D. 25
E. 50
Answer(s) submitted:

(incorrect)

5. (2 points)
We will continue using the data in matrix BostonStd. We will be using these
data to fit a multiple linear regression model and to compute standard
errors via bootstrapping.

a. Define a new data frame, BostonTrans, that contains the variables age,
rad, and log.crim (the natural log transformation of the variable crim), in
that order. Note that log command refers to the natural log transformation.
A. Show me how
B. OK, got it!
b. Fit the multiple linear regression model of log.crim on the predictors
age and rad. Produce a summary of the regression model, and report the
standard errors (out to 6 decimal places) of the coefficients:
SE(intercept) =
SE(coefficient for age) =
SE(coefficient for rad) =
c. Define a function called beta.fn that takes inputdata (the data set to
use for model-fitting) and index (of which data points to use) as inputs
and returns the coefficients fit for the multiple linear regression model
of log.crim on predictors age and rad.
A. Show me how
B. OK, got it!
d. Use the boot function from the library boot to compute standard error
based on bootstrap sample estimates of the coefficients. Set the seed to be
100, and use 5000 bootstrap samples. Report the bootstrapped standard
errors (rounded to 6 decimal places):
SEB (intercept) =
SEB (coefficient for age) =
SEB (coefficient for rad) =
Answer(s) submitted:

(incorrect)
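The k-nearest-neighbors steps in problems 1 through 3 (fit on the full data, then cross-validate while re-standardizing the training data within each fold) could be sketched as follows. This is only a sketch under assumptions: it rebuilds the variables from MASS::Boston instead of reading BostonStd.csv, the helper name cv.knn is hypothetical, and the way cvgroups is built is one plausible use of sample that may not reproduce the course's solution values.

```r
library(MASS)  # Boston data set
library(FNN)   # knn.reg, as the assignment's Important Note requires

# Assumption: rebuild the raw variables from MASS::Boston
# rather than loading BostonStd.csv from Lesson 1.
x <- as.matrix(Boston[, c("age", "rad")])
y <- Boston$crim
n <- nrow(x)

# Problem 1: standardize on the full data, fit k = 25, predict the same data.
# Passing test = x.std is needed; with test = NULL, knn.reg returns
# leave-one-out predictions instead of fitted values.
x.std <- scale(x)
fit.all <- knn.reg(train = x.std, test = x.std, y = y, k = 25)
MSE <- mean((y - fit.all$pred)^2)

# Hypothetical helper: m-fold CV that re-standardizes within each fold.
cv.knn <- function(x, y, k, cvgroups) {
  pred <- numeric(length(y))
  for (g in unique(cvgroups)) {
    test <- cvgroups == g
    train.std <- scale(x[!test, , drop = FALSE])
    # Held-out rows are standardized with the TRAINING means and sds.
    test.std <- scale(x[test, , drop = FALSE],
                      center = attr(train.std, "scaled:center"),
                      scale  = attr(train.std, "scaled:scale"))
    pred[test] <- knn.reg(train = train.std, test = test.std,
                          y = y[!test], k = k)$pred
  }
  mean((y - pred)^2)
}

# Problem 3: labels for a 10-fold split (assumed construction), then CV(10).
set.seed(100)
m <- 10
cvgroups <- sample(rep(1:m, length.out = n))
CV10 <- cv.knn(x, y, k = 25, cvgroups = cvgroups)

# Problem 2: LOOCV is the special case with one observation per fold.
CVn <- cv.knn(x, y, k = 25, cvgroups = 1:n)
```

The same helper covers the model-selection loop described in problem 4, for example sapply(1:50, function(k) cv.knn(x, y, k, cvgroups)), which gives one CV(10) value per candidate k.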

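The regression and bootstrap steps in problem 5 might look like the sketch below. It assumes the variables are taken directly from MASS::Boston; the names fit, se.lm, bootoutput, and se.boot are illustrative and do not come from the assignment, and the bootstrap standard error is taken here as the standard deviation of each coefficient's replicates.

```r
library(MASS)  # Boston data set
library(boot)  # boot()

# Part a: age, rad, and the natural log of crim, in that order.
BostonTrans <- data.frame(age = Boston$age,
                          rad = Boston$rad,
                          log.crim = log(Boston$crim))

# Part b: model-based standard errors from the regression summary.
fit <- lm(log.crim ~ age + rad, data = BostonTrans)
se.lm <- summary(fit)$coefficients[, "Std. Error"]

# Part c: coefficients of the model refit on the rows selected by index.
beta.fn <- function(inputdata, index) {
  coef(lm(log.crim ~ age + rad, data = inputdata[index, ]))
}

# Part d: 5000 bootstrap resamples with seed 100.
set.seed(100)
bootoutput <- boot(BostonTrans, beta.fn, R = 5000)

# One standard error per coefficient (intercept, age, rad).
se.boot <- apply(bootoutput$t, 2, sd)
```

Printing bootoutput also shows the bootstrap standard errors in its "std. error" column, which can be compared against se.lm.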
Generated by ©WeBWorK, http://webwork.maa.org, Mathematical Association of America
