
M1 Miage

March 2021

Exercises about linear regression


Exercise 1: Simple regression, choice between 2 variables

We have collected data about 40 individuals. For each of these individuals, we have the values of 3 numerical variables x1, x2 and y. Using this data, we would like to build simple regression models in order to predict y. A snapshot of this dataset is given in the table below:

x1     x2     y
-2.3   -8.8    9
-0.6    1.8   -1
-1.2   -4.3    5
...    ...   ...

We have created two different simple regression models: the first one predicts y using x1 (model 1) and the other one predicts y using x2 (model 2). The equations of the obtained models are:

y = 3.7 + 1.2 × x1 and y = 1 − 0.9 × x2


We also know that It = Σ_{i=1}^{40} (yi − ȳ)² = 1298
1. For model 1, we have Im1 = 778. Compute the coefficient of determination of model 1, and Ir1.
2. For model 2, we have Im2 = 1255. Compute the coefficient of determination of model 2, and Ir2.
3. According to these results, which model seems better suited to predict y?
4. Given that for model 2, σ̂β1 = 0.03, perform a statistical test to determine whether the variable x2 significantly influences y (the quantile of Student's t-distribution is here equal to 1.98).
We now have the values of the 3 variables (x1, x2 and y) for 3 new individuals:

x1      x2      y
-5       3.3   -3
-0.8     0      1
2.7    -11.4   11
5. Which model (model 1 or model 2) is better at predicting y for these 3 new individuals? (Note: usually, in a prediction setting, you are not given the values of the target variable for new individuals. Here, I give them to you so that you can check whether our model choices seem correct.)
Exercise 2: Multiple regression
Using the same dataset, we are now going to use x1 and x2 to predict y.
The obtained multiple regression model (model 3) is:

y = 1.3 + 0.2 × x1 − 0.9 × x2

1. Given that for model 3, σ̂β1 = 0.06, perform a statistical test to determine whether x1 significantly influences y when x2 is already used (same quantile as in Exercise 1).
2. Given that Im3 = 1263, does model 3 seem more interesting than using just one variable?
3. Predict y using model 3 for the 3 new individuals given above. What do you conclude?
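The predictions asked for in question 5 of Exercise 1 and question 3 above can be compared numerically. A sketch that evaluates the three fitted equations on the new individuals and sums the squared prediction errors:

```python
# The three fitted equations, taken from the exercise statements.
models = {
    "model 1": lambda x1, x2: 3.7 + 1.2 * x1,
    "model 2": lambda x1, x2: 1.0 - 0.9 * x2,
    "model 3": lambda x1, x2: 1.3 + 0.2 * x1 - 0.9 * x2,
}

# The 3 new individuals: (x1, x2, y).
new_data = [(-5.0, 3.3, -3), (-0.8, 0.0, 1), (2.7, -11.4, 11)]

# Sum of squared prediction errors of each model on the new data.
for name, f in models.items():
    sse = sum((y - f(x1, x2)) ** 2 for x1, x2, y in new_data)
    print(name, round(sse, 2))
# model 1 20.0
# model 2 1.13
# model 3 1.34
```

The two models that use x2 make far smaller errors on these individuals than the model based on x1 alone.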

Exercise 3: Train / test split
We have collected data about 8 individuals described by 3 variables x1, x2, and y.
We aim to predict y using x1 and x2, and we would like to have an estimate of the generalization error of a multiple regression model that predicts y.
For this, we apply the train / test split strategy: the dataset is split into a training set of 6 individuals and a test set of 2 individuals. A multiple regression model is learned using the training set only.
The equation of the obtained model is: y = −89 + 16 × x1 + 2 × x2
The 2 individuals of the test set are:

Individual   x1    x2     y
1            1     54.5   38
2            1.5   49     36

Question: Give an estimate of the generalization error of this regression model.
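Under the train / test strategy, the generalization error is estimated by the mean squared error of the learned model on the held-out test set. A sketch of the computation:

```python
# Model learned on the training set (from the statement).
def predict(x1, x2):
    return -89 + 16 * x1 + 2 * x2

# The 2 test individuals: (x1, x2, y).
test_set = [(1.0, 54.5, 38), (1.5, 49.0, 36)]

# Squared error on each test individual, then their mean.
squared_errors = [(y - predict(x1, x2)) ** 2 for x1, x2, y in test_set]
mse = sum(squared_errors) / len(squared_errors)
print(squared_errors, mse)   # [4.0, 9.0] 6.5
```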


Exercise 4: K-fold cross-validation
We have a dataset of 4 individuals described by two variables x1 and y. We aim to predict y using x1 by a simple linear regression, and we would like to have an estimate of the generalization error of such a model. For this, we apply the K-fold cross-validation strategy with K = 2. The dataset is hence split into 2 folds: the first fold is composed of the first 2 individuals, and the second fold of the other 2.
The complete dataset is given below:

Individual   x1    y
1            1.2   49
2            0.9   37
3            1.5   52
4            0.5   20
The regression model obtained with the first fold is: y = 1 + 40 × x1.
The regression model obtained with the second fold is: y = 4 + 32 × x1.

Question: Estimate the generalization error of a regression model that predicts y using x1, by 2-fold cross-validation.
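In 2-fold cross-validation, each fold's model is evaluated on the individuals of the other fold, and all the squared errors are averaged. A sketch:

```python
def model_fold1(x1):
    return 1 + 40 * x1   # learned on fold 1 (individuals 1 and 2)

def model_fold2(x1):
    return 4 + 32 * x1   # learned on fold 2 (individuals 3 and 4)

fold1 = [(1.2, 49), (0.9, 37)]   # (x1, y)
fold2 = [(1.5, 52), (0.5, 20)]

# Each model is evaluated on the fold it was NOT trained on.
squared_errors = [(y - model_fold1(x1)) ** 2 for x1, y in fold2] \
               + [(y - model_fold2(x1)) ** 2 for x1, y in fold1]
cv_error = sum(squared_errors) / len(squared_errors)
print(round(cv_error, 2))   # 35.8
```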

Exercise 5: Variable selection


In this exercise, we will apply the different variable selection methods we have seen: exhaustive, forward, and backward. We have a dataset with individuals described by 4 predictive variables x1, x2, x3, x4 and a target variable y. We want to select the best model to predict y. The estimated generalization error is our selection criterion.
In the following table, you will find the estimated generalization errors of all the possible sub-models with 1 to 4 predictive variables:
1 variable    2 variables    3 variables      4 variables
{1}: 7.77     {1,2}: 5.84    {1,2,3}: 5.75    {1,2,3,4}: 5.77
{2}: 10.63    {1,3}: 7.22    {1,2,4}: 5.80
{3}: 9.77     {1,4}: 5.81    {1,3,4}: 7.23
{4}: 8.41     {2,3}: 8.60    {2,3,4}: 8.42
              {2,4}: 8.76
              {3,4}: 9.68
1. Which model is selected by the exhaustive procedure? How many models need to be compared?
2. Which model is selected by the forward selection procedure? How many models need to be compared?
3. Which model is selected by the backward selection procedure? How many models need to be compared?
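As a check on your pencil-and-paper answers, the three procedures can be simulated directly from the error table (copied into the dictionary below):

```python
# Estimated generalization errors copied from the table above.
errors = {
    (1,): 7.77, (2,): 10.63, (3,): 9.77, (4,): 8.41,
    (1, 2): 5.84, (1, 3): 7.22, (1, 4): 5.81,
    (2, 3): 8.60, (2, 4): 8.76, (3, 4): 9.68,
    (1, 2, 3): 5.75, (1, 2, 4): 5.80, (1, 3, 4): 7.23,
    (2, 3, 4): 8.42, (1, 2, 3, 4): 5.77,
}

# Exhaustive search: compare all 2^4 - 1 = 15 sub-models.
best_exhaustive = min(errors, key=errors.get)

# Forward selection: start empty and greedily add the variable whose
# addition gives the lowest estimated error at each step.
selected, path = [], []
while len(selected) < 4:
    best = min((tuple(sorted(selected + [v]))
                for v in (1, 2, 3, 4) if v not in selected),
               key=errors.get)
    selected = list(best)
    path.append(best)
best_forward = min(path, key=errors.get)   # best model seen on the path

# Backward elimination: start from the full model and greedily remove
# the variable whose removal gives the lowest estimated error.
selected, bpath = [1, 2, 3, 4], [(1, 2, 3, 4)]
while len(selected) > 1:
    best = min((tuple(sorted(set(selected) - {v})) for v in selected),
               key=errors.get)
    selected = list(best)
    bpath.append(best)
best_backward = min(bpath, key=errors.get)

print(best_exhaustive)         # (1, 2, 3)
print(path, best_forward)      # forward path ends at (1, 2, 3, 4)
print(bpath, best_backward)    # best model on the backward path: (1, 2, 3)
```

Note that the greedy procedures only ever compare the models along their own path, which is why their counts differ from the exhaustive count.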
