Linear Regression
Matteo Borrotti
Università Milano-Bicocca
Topic addressed
Simple linear regression
For $n = 31$ black cherry trees, we have measurements of the trunk diameter (measured about 1 m above the ground) and the amount of wood (volume) obtained after cutting the tree down.
We want to use the data to obtain an equation that allows us to predict the volume from the diameter, which is easily measurable:
$$\text{(volume)} \approx f(\text{diameter}).$$
For example, it can be used to decide how many and which trees to cut to obtain a
certain amount of wood, or to determine the "price" of a forest.
Diameter
 [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4
[13] 11.4 11.7 12.0 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0
[25] 16.3 17.3 17.5 17.9 18.0 18.0 20.6
Volume
 [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0
[13] 21.4 21.3 19.1 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3
[25] 42.6 55.4 55.7 58.3 51.5 51.0 77.0
[Figure: scatterplot of Volume against Diameter for the 31 cherry trees]
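These values coincide with R's built-in trees dataset (columns Girth and Volume), so, assuming that source, the plot can be reproduced with a minimal sketch:

  # Diameter and volume of the 31 black cherry trees; the values
  # match R's built-in 'trees' dataset (Girth, Volume).
  diameter <- trees$Girth
  volume   <- trees$Volume
  plot(diameter, volume, xlab = "Diameter", ylab = "Volume")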
If $y_1, \dots, y_n$ are the volume values and $x_1, \dots, x_n$ are the diameter values, then
$$y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \dots, n.$$
The last component, $\epsilon_i$, expresses the part of the volume fluctuations that is not related to the diameter, or that is not captured by the linear relationship.
The model that we have seen in the previous slide is commonly called the simple linear regression model.
In general, we want to describe the relationship between the variable $y$ and another variable $x$ by using the model
$$y = \alpha + \beta x + \epsilon.$$
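As an illustrative sketch, one can simulate data from this model in R; the parameter values below are arbitrary choices, not estimates from the cherry data:

  # Simulate n = 31 observations from y = alpha + beta*x + eps,
  # with arbitrary values alpha = 2, beta = 0.5 and Gaussian errors.
  set.seed(42)
  n   <- 31
  x   <- runif(n, min = 0, max = 10)
  eps <- rnorm(n, mean = 0, sd = 1)   # the error term epsilon
  y   <- 2 + 0.5 * x + eps
  plot(x, y)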
Galton examined the heights of the children (response variable y) as a function of the
height of the parents (explanatory variable x).
In his analysis, tall children came from tall parents and conversely short children came
from short parents.
Galton also noted a tendency for heights to move towards the average in the next generation, hence the term "regression".
If we had reasonable values of the parameters, say $\hat\alpha$ and $\hat\beta$, we could predict the volume of wood using
$$\text{(volume)} \approx \hat\alpha + \hat\beta \cdot \text{(diameter)},$$
that is,
$$y_1 \approx \hat\alpha + \hat\beta x_1, \quad y_2 \approx \hat\alpha + \hat\beta x_2, \quad \dots, \quad y_n \approx \hat\alpha + \hat\beta x_n.$$
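For instance, with the estimates implied by the summary statistics reported later in this document ($\hat\beta = 48.24/9.53 \approx 5.06$ and $\hat\alpha = 30.17 - 5.06 \cdot 13.25 \approx -36.9$), a tree with diameter 16 would be predicted to yield a volume of about $-36.9 + 5.06 \cdot 16 \approx 44$.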
[Figure: scatterplot of Volume against Diameter with a candidate straight line through the points]
The black line passes through the cloud of points and looks like an appropriate choice.
To make the previous insight operational, we need to decide what precisely is meant by
$$y_i \approx \hat\alpha + \hat\beta x_i, \qquad i = 1, \dots, n.$$
One possible solution is to choose the parameters that minimize the loss function
$$\ell(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \sum_{i=1}^{n} \epsilon_i^2.$$
This criterion is called the least squares method, since it minimizes the sum of squared
deviations, or the sum of squared errors.
The least squares criterion is very popular because the solution to the minimization
problem is simple to compute.
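As a minimal sketch in R (using the diameter and volume vectors defined above), the least squares fit is a single call to lm():

  diameter <- trees$Girth; volume <- trees$Volume    # as before
  fit <- lm(volume ~ diameter)   # least squares fit of volume on diameter
  coef(fit)                      # alpha-hat (intercept) and beta-hat (slope)
  predict(fit, newdata = data.frame(diameter = 16))  # predicted volume at diameter 16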
Least squares
The unique solution to the problem
$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
is
$$\hat\alpha = \bar{y} - \hat\beta \bar{x}, \qquad \hat\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}.$$
The solution of the problem is well defined only if $\mathrm{var}(x) > 0$.
This is very reasonable: the parameter $\beta$ indicates how much the response varies as the explanatory variable varies, but if $\mathrm{var}(x) = 0$ then the explanatory variable does not vary at all.
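The closed form can also be computed directly in R; note that var() and cov() in R use denominator n − 1 rather than n, but those factors cancel in the ratio:

  beta_hat  <- cov(diameter, volume) / var(diameter)    # the (n-1) factors cancel
  alpha_hat <- mean(volume) - beta_hat * mean(diameter)
  c(alpha_hat, beta_hat)   # same values as coef(fit)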
Proof. For a fixed value of $\beta$, the loss can be written as $\ell(\alpha, \beta) = \sum_{i=1}^{n} (w_i - \alpha)^2$, having $w_i = y_i - \beta x_i$ for each $i = 1, \dots, n$. We know that the value of $\alpha$ that minimizes this function is the arithmetic mean, so $\hat\alpha(\beta) = \bar{w} = \bar{y} - \beta \bar{x}$.
We have therefore reduced the initial problem to the following sub-problem:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}} \ell(\hat\alpha(\beta), \beta) = \arg\min_{\beta \in \mathbb{R}} \sum_{i=1}^{n} \big[(y_i - \bar{y}) - \beta(x_i - \bar{x})\big]^2.$$
Taking the derivative with respect to $\beta$ and setting it equal to 0, we obtain that
$$-2 \sum_{i=1}^{n} (x_i - \bar{x}) \big[(y_i - \bar{y}) - \beta(x_i - \bar{x})\big] = 0.$$
Then, if $\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, the solution of the problem is
$$\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)},$$
where the last equality is obtained by dividing the numerator and the denominator by $n$.
Note. Finally, to conclude the proof, it is necessary to verify that the solution found is a minimum point and not, for example, a maximum: here the second derivative, $2 \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, guarantees a minimum.
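A quick numerical sanity check of this minimality (the perturbations below are arbitrary):

  loss <- function(a, b) sum((volume - a - b * diameter)^2)
  loss(alpha_hat, beta_hat)        # value at the least squares solution
  loss(alpha_hat + 1, beta_hat)    # perturbing either parameter...
  loss(alpha_hat, beta_hat + 0.1)  # ...gives a strictly larger loss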
[Figure: scatterplot of Volume against Diameter with the fitted least squares line]
The fitted line describes the changes in volume well, with the possible exception of the most extreme observations.
The differences between the observed values of the response variable and the values
predicted by the model, that is
$$r_i = y_i - (\hat\alpha + \hat\beta x_i), \qquad i = 1, \dots, n,$$
are called residuals.
The variance of the residuals can be used to evaluate the goodness of fit of the model
to the data.
In fact, the smaller the variance of the residuals, the closer the regression line is to the
observations.
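In R, continuing with the objects defined above:

  r <- volume - (alpha_hat + beta_hat * diameter)   # residuals computed by hand
  # equivalently: r <- residuals(fit)
  mean(r)   # essentially 0 by construction
  var(r)    # the smaller this is, the better the fit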
Property. The variance of the residuals is never greater than that of the response variable:
$$\mathrm{var}(y) = \min_{\alpha \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha)^2 \ \ge\ \min_{(\alpha, \beta) \in \mathbb{R}^2} \frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \mathrm{var}(r).$$
More precisely,
$$\mathrm{var}(r) = \mathrm{var}(y) - \frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)}.$$
In fact, using the properties of variance, we get that
$$\mathrm{var}(r) = \frac{1}{n} \sum_{i=1}^{n} \big[(y_i - \hat\beta x_i) - (\bar{y} - \hat\beta \bar{x})\big]^2 = \mathrm{var}(y - \hat\beta x) = \mathrm{var}(y) + \hat\beta^2\, \mathrm{var}(x) - 2 \hat\beta\, \mathrm{cov}(x, y),$$
which equals $\mathrm{var}(y) - \mathrm{cov}(x, y)^2 / \mathrm{var}(x)$ after substituting $\hat\beta = \mathrm{cov}(x, y)/\mathrm{var}(x)$.
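This identity can be checked numerically; var_n and cov_n below are helper names introduced here for the denominator-n versions of variance and covariance:

  var_n <- function(z) mean((z - mean(z))^2)                   # variance with denominator n
  cov_n <- function(a, b) mean((a - mean(a)) * (b - mean(b)))  # covariance with denominator n
  var_n(r)                                                     # variance of the residuals
  var_n(volume) - cov_n(diameter, volume)^2 / var_n(diameter) # the same number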
The variance of the residuals depends on the scale of the observed phenomenon. Therefore the $R^2$ index is often used to evaluate the goodness of fit.
The index $R^2$ measures the fraction of the variance of the response variable (total variance) explained by the model:
$$R^2 = 1 - \frac{\mathrm{var}(r)}{\mathrm{var}(y)}, \qquad 0 \le R^2 \le 1.$$
We have that $R^2 = 0$ if $\mathrm{var}(r) = \mathrm{var}(y)$, that is, when the model does not “explain” the response at all.
We have that $R^2 = 1$ if $\mathrm{var}(r) = 0$, that is, when the model “explains” the response perfectly.
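In R, continuing the sketch above (with var_n and the residual vector r as previously defined):

  R2 <- 1 - var_n(r) / var_n(volume)
  R2                        # about 0.93 for the cherry tree data
  cor(diameter, volume)^2   # the same value, since R^2 = cor(x, y)^2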
For the cherry tree data we have
$$\bar{y} = 30.17, \quad \bar{x} = 13.25, \quad \mathrm{var}(x) = 9.53, \quad \mathrm{cov}(x, y) = 48.24.$$
We also know that $\sum_{i=1}^{n} y_i^2 = 36324.99$. Then
$$\mathrm{var}(y) = \frac{36324.99}{31} - 30.17^2 = 261.54, \qquad \mathrm{var}(r) = 261.54 - \frac{48.24^2}{9.53} = 17.35.$$
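These computations can be checked directly, and they yield the $R^2$ quoted below:

  36324.99 / 31 - 30.17^2    # var(y) = 261.54
  261.54 - 48.24^2 / 9.53    # var(r) = 17.35
  1 - 17.35 / 261.54         # R^2 = 0.934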
This equivalence, $R^2 = \mathrm{cor}(x, y)^2$, makes it clear that the correlation coefficient (and therefore the covariance) measures a linear relationship.
In fact, the coefficient R 2 and therefore cor(x , y ) capture the proximity of the data to a
straight line.
Note. In the case of the cherry trees, we have $R^2 = 0.934$ and $\mathrm{cor}(x, y)^2 = 0.967^2 = 0.935$.
This slight discrepancy is due to the various numerical approximations made.
The basic problem is the same (studying the relationships between variables), and so are the "ingredients" we have handled (means, variances and covariances).