
Machine Learning and Data Analytics SS 2023

Alina Theiß mlda@dpo.rwth-aachen.de

Live Exercise Session: 26.04.2023

Exercise 1 – Linear Regression

Task 1 (Simple linear regression):

In the file Auto.csv you can find data of 397 different cars. Read the csv file with
the read_csv function of the Pandas package. In some rows we have question marks
indicating missing data. Mark them as NaN by using the following command to read
the data frame:
>>> auto_df = pd.read_csv('../Data/Auto.csv', na_values='?')
Then remove all rows that are not complete with the dropna() function.
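The effect of na_values and dropna() can be checked on a small in-memory table before loading the real file. The sketch below assumes a made-up three-row CSV (the values are illustrative and not taken from Auto.csv):

```python
import io

import pandas as pd

# Hypothetical miniature of Auto.csv; the numbers are made up for illustration.
csv_text = """mpg,horsepower,name
18.0,130,car a
25.0,?,car b
22.0,95,car c
"""

# na_values='?' turns the question mark into NaN, so 'horsepower' is parsed as numeric.
demo_df = pd.read_csv(io.StringIO(csv_text), na_values='?')
n_missing = int(demo_df['horsepower'].isna().sum())

# dropna() removes every row that contains at least one NaN.
demo_clean = demo_df.dropna()

print(n_missing, len(demo_clean))
```

Without na_values='?' the 'horsepower' column would be read as text, which is why the marker has to be declared when reading the file.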
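a) Perform a simple linear regression with 'mpg' as the response y and 'horsepower'
as the predictor X with the following functions included in the statsmodels.api
package. Note that sm.OLS does not add an intercept automatically, so X needs a
constant column (e.g. via sm.add_constant).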
>>> model = sm.OLS(y,X)
>>> estimate = model.fit()
Print the results and comment on the output. What is the predicted mpg
associated with a horsepower of 98?
b) Plot the data and the estimate. You can obtain the estimate values with the
following command:
>>> fitted_values = estimate.fittedvalues
c) Produce diagnostic plots of the least squares regression fit. You can obtain the
residuals, the studentized residuals and the leverages with the following functions:
>>> residuals = estimate.resid.values
>>> studentized_residuals = OLSInfluence(estimate).resid_studentized_internal
>>> leverages = OLSInfluence(estimate).hat_matrix_diag
Comment on any problem you see with the fit.

Task 2 (Multiple linear regression):

Consider again the data from Auto.csv.


a) Produce a scatterplot matrix which includes all the variables in the dataset. To
do this use the scatter_matrix command included in the Pandas package.
b) Compute the matrix of correlations between the variables by using the following
command:
>>> auto_df.corr()
Remark: auto_df is the Auto dataframe. The variable 'name' is qualitative and does not
belong in the correlation matrix. Pandas versions before 2.0 exclude it automatically;
from Pandas 2.0 on, call auto_df.corr(numeric_only=True) instead.
c) Perform a multiple linear regression with 'mpg' as the response y and all other
variables (except 'name') as the predictors X. Print the results and comment on
the output.
d) Compute the variance inflation factors with the following command:
>>> VIFs = [(predictor, variance_inflation_factor(X.values, i))
...         for i, predictor in enumerate(list(X))]
Comment on the output.
e) Produce diagnostic plots of the linear regression fit. Comment on any problems
you see with the fit. Does the residual plot suggest any unusually large outliers?
Does the leverage plot identify any observations with unusually high leverage?
f) Fit linear regression models with interaction effects. Do any interactions appear
to be statistically significant? Use the example:

mpg = β0 + β1 · weight + β2 · year + β3 · (weight · year). (1)

g) Try a few different transformations of the variables such as:

i) Add a weight^2 variable to (1).
ii) Add a weight^(1/2) variable to (1).
iii) Add a log(weight) variable to (1).
Comment on your findings.

Task 3 (Multiple linear regression with categorical variables):

Consider the data from the file carseats.csv.


a) Fit a multiple regression model to predict 'Sales' using 'Population', 'Urban', and
'US' with the following commands:
>>> model = smf.ols('Sales ~ Population + Urban + US', data=df)
>>> estimate = model.fit()
where df corresponds to your dataframe. Since we are using categorical variables
(Urban and US), it is easier to switch to the formula version. If you still want to
use the design matrix approach, you need to create a dummy matrix.


b) Provide an interpretation of each coefficient in the model.


c) On the basis of your response to the previous question, fit a smaller model that
only uses the predictors for which there is evidence of association with the
outcome.
d) How well do the models fit the data? Produce the diagnostic plots of the reduced
model.

Task 4 (Multiple linear regression - hypothesis testing, outliers, high-leverage points):

Consider the following model

y = 2 + 2x1 + 0.3x2 + ϵ

with x1 ∼ Uni(0, 1), x2 = 0.5x1 + 10ϵ and ϵ ∼ N(0, 1).


Note: Here we have [β0, β1, β2] = [2, 2, 0.3].
a) Generate the population with a size of 100. Use a random seed of 0 for the
randomly created values.
b) What is the correlation between x1 and x2? You can use the function
corrcoef(x1,x2) included in the numpy package. Create a scatterplot displaying
the relationship between the variables.
c) – Fit a least squares regression to predict y using x1 and x2. Describe the
results obtained.
– What are β̂0, β̂1 and β̂2? How do these relate to the true values?
– Can you reject the null hypothesis H0 : β1 = 0? How about the null
hypothesis H0 : β2 = 0?
d) – Fit a least squares regression to predict y using only x1. Comment on your
results. Can you reject the null hypothesis H0 : β1 = 0?
– Fit a least squares regression to predict y using only x2. Comment on your
results. Can you reject the null hypothesis H0 : β2 = 0?
– Do the previous results contradict each other?
e) Now assume that we obtain an additional observation, which is unfortunately
mismeasured: (x1 = 0.1, x2 = 0.8, y = 6). Re-fit the linear models using this new
data.
– What effect does this new observation have on the previous model containing
both variables?
– In each model, is this observation an outlier? A high-leverage point? Both?
Explain your answer.
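The data generation, the correlation check, and the leverage of the additional observation can be sketched with NumPy alone. Two assumptions go beyond the sheet: the seeded generator is np.random.default_rng(0) (np.random.seed(0) would give different numbers), and the noise terms in x2 and y are drawn independently although the sheet writes both as ϵ.

```python
import numpy as np

# Assumptions: NumPy's default_rng(0) as the seeded generator, and independent
# noise draws for x2 and y (the sheet uses the same symbol for both).
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0.0, 1.0, size=n)
x2 = 0.5 * x1 + 10.0 * rng.normal(0.0, 1.0, size=n)
y = 2.0 + 2.0 * x1 + 0.3 * x2 + rng.normal(0.0, 1.0, size=n)

# corrcoef returns the full 2x2 correlation matrix; the off-diagonal entry is corr(x1, x2).
corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]

# Append the additional observation (x1 = 0.1, x2 = 0.8, y = 6).
x1_new = np.append(x1, 0.1)
x2_new = np.append(x2, 0.8)
y_new = np.append(y, 6.0)

# Leverage = diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
X = np.column_stack([np.ones(n + 1), x1_new, x2_new])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverages = np.diag(H)

# A common rule of thumb compares each leverage with the average p / n.
average_leverage = X.shape[1] / X.shape[0]
print(corr_x1_x2, leverages[-1], average_leverage)
```

The leverage depends only on the predictor values, so it has to be recomputed for each of the three models (x1 and x2, only x1, only x2), whereas judging whether the point is an outlier requires the residual of the respective fit.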


Further documentation on the packages and functions used in this exercise:

• Pandas
• Matplotlib
• Numpy
• Statsmodels
• OLS in the statsmodels.api
• OLS in the statsmodels.formula.api
• OLSInfluence function
• Variance inflation factors function
