This document provides instructions for analyzing a dataset on Boston neighborhoods using k-nearest neighbors regression and linear regression models. It involves standardizing variables, fitting models to the full dataset and using cross-validation to evaluate model performance. Key steps include using k-nearest neighbors with k=25 to predict housing prices, computing mean squared error and leave-one-out cross-validation, then using 10-fold cross-validation and comparing models with different k values. Linear regression is also fitted to predict log-transformed housing prices from age and distance to employment centers, and standard errors are estimated using bootstrapping.
Important Note: in order to match the solutions, please use library(FNN) for any k-nearest-neighbors predictions throughout this assignment.

1.
We will be continuing (from WeBWorK Lesson 1) to analyze a data set about neighborhoods in Boston. We will begin by using these data to fit the model to the full data set.

a. In WeBWorK Lesson 1, you saved the data file BostonStd.csv (with variables age, rad, age.std, rad.std, and crim). You may either open the dataset if you still have it, or re-make the set of variables from the Boston data set in the library MASS. Save the matrix of variables as BostonStd.

b. We want to fit the model for future use in making predictions. How many data points are going to be used to fit the model?
# data points for fitting =
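Part a. might be sketched in R as below. This is only a sketch: it assumes Lesson 1 standardized age and rad to mean 0 and standard deviation 1 via scale(), which may differ from the exact recipe used to build your saved BostonStd.csv.

```r
# Re-making BostonStd from the MASS Boston data (a sketch; assumes
# Lesson 1 standardized age and rad to mean 0, sd 1 via scale()).
library(MASS)  # provides the Boston data set (506 neighborhoods)

BostonStd <- cbind(age     = Boston$age,
                   rad     = Boston$rad,
                   age.std = as.vector(scale(Boston$age)),
                   rad.std = as.vector(scale(Boston$rad)),
                   crim    = Boston$crim)
nrow(BostonStd)  # 506: all data points are used to fit the model
```

Since we are fitting the model for future predictions (not holding any data out), all 506 rows are used.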
c. Use BostonStd to define a matrix x.std containing the variables age.std and rad.std, and define y to be the variable crim.

Fit the 25-nearest-neighbors model on the full data set, using the two variables in x.std (age.std and rad.std) to predict crim. Compute the MSE (using predicted values for the full data set) and report that value (out to 5 decimal places) here:
MSE =
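Part c. could look like the following sketch, using knn.reg() from the FNN package as the note at the top of the assignment requires; the MSE compares the full-data predictions against the observed crim values.

```r
# Sketch: 25-nearest-neighbors fit on the full data set with FNN.
library(FNN)   # knn.reg() for k-nearest-neighbors regression
library(MASS)  # Boston data

x.std <- cbind(age.std = as.vector(scale(Boston$age)),
               rad.std = as.vector(scale(Boston$rad)))
y <- Boston$crim

# predict the same 506 points that were used to fit the model
fit25 <- knn.reg(train = x.std, test = x.std, y = y, k = 25)
MSE <- mean((y - fit25$pred)^2)
round(MSE, 5)   # report out to 5 decimal places
```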
2. (2 points)
We will now use the standardized Boston data to assess the model (via cross-validation). Two measures for model assessment are the mean-squared error, MSE, for all the data and the leave-one-out cross-validated measure CV(n).

*IMPORTANT NOTE for any model requiring data standardization: Since the entire model-fitting process includes both standardizing the data and fitting the model to the data, every time we split the data, we must re-standardize the training data.

a. Which model assessment measure must have the lower value?
A. MSE
B. CV(n)

b. Which model assessment measure do you expect to be more accurate?
A. MSE
B. CV(n)

c. Compute the leave-one-out cross-validated (LOOCV) measure, CV(506), for the 25-nearest-neighbors model, reporting that value (out to 5 decimal places) here: (*remember to re-standardize the data within each fold)
CV(506) =
3. (2 points)
A third measure for model assessment is the m-fold** cross-validated measure CV(m). We will use 10-fold cross-validation to compute the value of the measure CV(10).

*IMPORTANT NOTE for any model requiring data standardization: Since the entire model-fitting process includes both standardizing the data and fitting the model to the data, every time we split the data, we must re-standardize the training data.

**We use "m" to denote the number of folds, to avoid confusion with the "k" in nearest neighbors.

a. Starting with set.seed(100), use the sample function to come up with cvgroups, containing the labels for a 10-fold split of the data.

b. Compute the 10-fold cross-validated measure, CV(10), for the 25-nearest-neighbors model, reporting that value (out to 5 decimal places) here: (*remember to re-standardize the data within each fold)
CV(10) =

c. Identify the correct choice and reason for selecting between LOOCV and m-fold CV for model assessment.
A. We should use LOOCV, since it results in a smaller value of the CV measure.
B. We prefer to use LOOCV, since it results in a less variable estimate of error than 10-fold cross-validation.
C. We use 10-fold cross-validation, as a good compromise for the bias-variance trade-off.
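Parts a. and b. together might be sketched as follows. The fold-label recipe (balanced labels 1..10, permuted by sample()) is an assumption; if your course used a different recipe after set.seed(100), your cvgroups, and hence CV(10), will differ.

```r
# Sketch: 10-fold split labels, then CV(10) for the 25-NN model,
# re-standardizing the training data within each fold.
library(FNN)
library(MASS)

x <- cbind(Boston$age, Boston$rad)
y <- Boston$crim
n <- nrow(x)

set.seed(100)
cvgroups <- sample(rep(1:10, length.out = n), n)  # fold labels 1..10

pred <- numeric(n)
for (fold in 1:10) {
  test.id <- which(cvgroups == fold)
  train.x <- x[-test.id, , drop = FALSE]
  mu    <- colMeans(train.x)
  sigma <- apply(train.x, 2, sd)
  train.std <- scale(train.x, center = mu, scale = sigma)
  test.std  <- scale(x[test.id, , drop = FALSE],
                     center = mu, scale = sigma)
  pred[test.id] <- knn.reg(train = train.std, test = test.std,
                           y = y[-test.id], k = 25)$pred
}
CV10 <- mean((y - pred)^2)
round(CV10, 5)
```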
4. (1 point)
We now focus on the use of m-fold cross-validation for selecting between models.

a. Using the same type of loops as in problem 3, we could compute the assessment measure CV(10) for each model fit using number of nearest neighbors k = 1, 2, ..., 50. Doing so results in the information displayed on the following graph.

[Figure: two curves plotted against the number of nearest neighbors, k = 1, ..., 50.]

One line represents the MSE values for using the fitted model to predict the same data originally used to fit the model; the other line represents the CV(10) values calculated via 10-fold cross-validation (which predicts truly new data). Which measure's values are represented by the black solid line?
A. MSE
B. CV(10)

b. Using the graph from part a., choose the preferred value for the number of nearest neighbors, k, to make the best predictions.
A. 10
B. 5
C. 20
D. 25
E. 50

5. (2 points)
We will continue using the data in matrix BostonStd. We will be using these data to fit a multiple linear regression model and to compute standard errors via bootstrapping.

a. Define a new data frame, BostonTrans, that contains the variables age, rad, and log.crim (the natural log transformation of the variable crim), in that order. Note that the log command refers to the natural log transformation.

b. Fit the multiple linear regression model of log.crim on the predictors age and rad. Produce a summary of the regression model, and report the standard errors (out to 6 decimal places) of the coefficients:
SE(intercept) =
SE(coefficient for age) =
SE(coefficient for rad) =

c. Define a function called beta.fn that takes inputdata (the data set to use for model-fitting) and index (of which data points to use) as inputs and returns the coefficients fit for the multiple linear regression model of log.crim on predictors age and rad.

d. Use the boot function from the library boot to compute standard errors based on bootstrap sample estimates of the coefficients. Set the seed to be 100, and use 5000 bootstrap samples. Report the bootstrapped standard errors (rounded to 6 decimal places):
SEB(intercept) =
SEB(coefficient for age) =
SEB(coefficient for rad) =
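Problem 5 could be sketched as below. beta.fn follows the (data, indices) signature that boot() expects, and the bootstrapped standard errors are taken as the standard deviations of the 5000 resampled coefficient estimates; rebuilding BostonTrans directly from MASS's Boston data is an assumption.

```r
# Sketch for problem 5: regression SEs from summary(), then
# bootstrapped SEs via boot() with 5000 resamples.
library(MASS)
library(boot)

BostonTrans <- data.frame(age      = Boston$age,
                          rad      = Boston$rad,
                          log.crim = log(Boston$crim))

# b. multiple linear regression; the SEs sit in the summary table
fit <- lm(log.crim ~ age + rad, data = BostonTrans)
summary(fit)$coefficients[, "Std. Error"]

# c. coefficients refit on the rows picked out by index
beta.fn <- function(inputdata, index) {
  coef(lm(log.crim ~ age + rad, data = inputdata, subset = index))
}

# d. bootstrap: 5000 resamples, seed 100
set.seed(100)
boot.out <- boot(BostonTrans, beta.fn, R = 5000)
apply(boot.out$t, 2, sd)   # SEB for intercept, age, rad (round to 6 dp)
```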
Generated by WeBWorK, © http://webwork.maa.org, Mathematical Association of America