Linear Regression

Foundations of Probability and Statistics


Matteo Borrotti

Università Milano-Bicocca

Matteo Borrotti (Milano-Bicocca)

Unit 8

Topics addressed
Simple linear regression

Least squares estimation

Mean and variance of the residuals; coefficient of determination (R²)



For n = 31 black cherry trees, we have measurements of the trunk diameter
(taken at about 1 m from the ground) and of the amount of wood (volume) obtained
from each tree after it was cut down.

We want to use the data to obtain an equation that allows us to predict the volume
from the diameter, which is easy to measure.

In other words, we are searching for a function f (·) such that

(volume) ≈ f (diameter).

Such an equation has several applications.

For example, it can be used to decide how many and which trees to cut to obtain a
certain amount of wood, or to determine the "price" of a forest.


Raw data

Diameter:
[1] 8.3 8.6 8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 20.6 11.3
[13] 11.4 11.4 11.7 12.0 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5
[25] 16.0 16.3 17.3 17.5 17.9 18.0 18.0

Volume:
[1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 77.0 24.2
[13] 21.0 21.4 21.3 19.1 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3
[25] 38.3 42.6 55.4 55.7 58.3 51.5 51.0






[Figure: scatterplot of volume against diameter]

We can compute the correlation:

cor(diameter, volume) = 0.967.

There is clearly a strong linear relationship.
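The correlation reported above can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the original slides: the names `diameter`, `volume`, `mean`, `cov` and `cor` are illustrative choices, and the data are the two vectors listed in the raw-data slide.

```python
import math

# Data copied from the raw-data slide (31 trees)
diameter = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 20.6, 11.3,
            11.4, 11.4, 11.7, 12.0, 12.9, 12.9, 13.3, 13.7, 13.8, 14.0, 14.2, 14.5,
            16.0, 16.3, 17.3, 17.5, 17.9, 18.0, 18.0]
volume = [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 77.0, 24.2,
          21.0, 21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3,
          38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51.0]


def mean(v):
    """Arithmetic mean of a list of numbers."""
    return sum(v) / len(v)


def cov(u, v):
    """Covariance with denominator n, as used throughout the slides."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def cor(u, v):
    """Correlation coefficient: cov(u, v) / sqrt(var(u) * var(v))."""
    return cov(u, v) / math.sqrt(cov(u, u) * cov(v, v))


r = cor(diameter, volume)
print(round(r, 3))  # 0.967
```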

First model

Initial hypothesis: linear relationship.

We can now define a linear model:

(volume) = α + β (diameter) + (error).

The last component expresses the part of the volume fluctuations not related to
diameter or which is not captured by the linear relationship.

If y1 , . . . , yn are the volume values and x1 , . . . , xn are the diameter values, then:

yi = α + βxi + ϵi , i = 1, . . . , n,

where ϵ1 , . . . , ϵn are the errors.


Linear regression model

The model seen in the previous slide is commonly called the simple linear
regression model.

In general, we want to describe the relationship between a variable y and another
variable x by using a model:
y = α + βx + ϵ.

The variable y is typically called the response variable or dependent variable.

The variable x is called the explanatory variable, regressor, covariate or
independent variable.

The values α, β ∈ R are parameters of the model.


Linear regression model

The term regression derives from a famous application made in 1886 by the biologist
and statistician Francis Galton, who examined the heights of children (response
variable y) as a function of the heights of their parents (explanatory variable x).

In his analysis, tall children came from tall parents and, conversely, short children
came from short parents.

Galton also noted a tendency for heights to shift towards the average height in the
next generation.

Galton called this phenomenon “regression towards mediocrity”.


Least squares method

In practice, it is necessary to find a way to determine the parameters α and β.

If we had a reasonable value of the parameters, say α̂ and β̂, we could predict the
volume of wood using

(volume) ≈ α̂ + β̂(diameter).

It seems reasonable to try to determine α̂ and β̂ in such a way as to obtain good
predictions on the observed dataset.

We want to find values for the parameters such that

y1 ≈ α̂ + β̂x1 ,
y2 ≈ α̂ + β̂x2 ,
⋮
yn ≈ α̂ + β̂xn .


Different choices of parameters





[Figure: scatterplot of volume against diameter with several candidate lines, in orange and black]

The orange lines represent suboptimal choices for forecasting purposes.

The black line passes through the cloud of points and looks like an appropriate choice.


Least squares method

To make the previous insight operational, we need to decide what precisely is meant by

yi ≈ α̂ + β̂xi , i = 1, . . . , n.

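One possible solution is to choose the parameters that minimize the loss function

$$\ell(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \sum_{i=1}^{n} \epsilon_i^2,$$

that is, α̂ and β̂ such that

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{(\alpha, \beta) \in \mathbb{R}^2} \ell(\alpha, \beta).$$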

This criterion is called the least squares method, since it minimizes the sum of squared
deviations, or the sum of squared errors.
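To make the criterion concrete, the loss can be evaluated for a few candidate lines on the cherry tree data. This is a minimal Python sketch, not part of the original slides: the function name `loss` is an illustrative choice, the first candidate uses the rounded estimates obtained later in this unit, and the other two candidates are arbitrary lines chosen only for comparison.

```python
# Data copied from the raw-data slide (31 trees)
diameter = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 20.6, 11.3,
            11.4, 11.4, 11.7, 12.0, 12.9, 12.9, 13.3, 13.7, 13.8, 14.0, 14.2, 14.5,
            16.0, 16.3, 17.3, 17.5, 17.9, 18.0, 18.0]
volume = [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 77.0, 24.2,
          21.0, 21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3,
          38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51.0]


def loss(alpha, beta, x, y):
    """Sum of squared errors of the line alpha + beta * x on the data (x, y)."""
    return sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))


# (intercept, slope) pairs: rounded least squares estimates from this unit,
# followed by two arbitrary lines (assumptions made for illustration only)
candidates = [(-36.88, 5.06), (-20.0, 4.0), (0.0, 2.3)]
losses = [loss(a, b, diameter, volume) for a, b in candidates]
for (a, b), value in zip(candidates, losses):
    print(f"alpha={a}, beta={b}: loss={value:.1f}")
```

Among these three candidates, the line with the least squares estimates has by far the smallest sum of squared errors.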


Least squares method

The least squares criterion is very popular because the solution to the minimization
problem is simple to compute.

Least squares
The unique solution to the problem

$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{(\alpha, \beta) \in \mathbb{R}^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

is

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \qquad \hat{\beta} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}.$$

The solution of the problem is well defined only if var(x ) > 0.

This is very reasonable: the parameter β indicates how much the response varies as the
explanatory variable varies, but if var(x ) = 0 then the explanatory variable does not
vary at all.
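The closed-form solution translates directly into code. The following is a minimal Python sketch under stated assumptions: the function name `least_squares` and the small datasets are made up for illustration and do not come from the slides. It also shows what happens when var(x) = 0.

```python
def least_squares(x, y):
    """Return (alpha_hat, beta_hat) for the simple linear regression of y on x.

    Uses beta_hat = cov(x, y) / var(x) and alpha_hat = y_bar - beta_hat * x_bar,
    with variance and covariance computed with denominator n.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    if var_x == 0:
        # The explanatory variable does not vary: the slope is not defined
        raise ValueError("var(x) = 0: the slope is not defined")
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    beta_hat = cov_xy / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Made-up points lying exactly on the line y = 1 + 2x
alpha_hat, beta_hat = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha_hat, beta_hat)  # 1.0 2.0

# Constant explanatory variable: var(x) = 0, so no slope can be estimated
try:
    least_squares([5, 5, 5], [1, 2, 3])
except ValueError as err:
    print(err)
```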



For each β, we already know the solution of the following problem

$$\arg\min_{\alpha \in \mathbb{R}} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \arg\min_{\alpha \in \mathbb{R}} \sum_{i=1}^{n} (w_i - \alpha)^2,$$

having set wi = yi − βxi for each i = 1, . . . , n. We know that the value that minimizes this
function is the arithmetic mean.

For any value of β, we get that

$$\hat{\alpha}(\beta) = \frac{1}{n}\sum_{i=1}^{n} w_i = \frac{1}{n}\sum_{i=1}^{n} (y_i - \beta x_i) = \bar{y} - \bar{x}\beta.$$

From the definition of α̂(β) it follows that, for each α and β,

ℓ(α, β) ≥ ℓ(α̂(β), β).



We have therefore reduced the initial problem to the following sub-problem

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}} \ell(\hat{\alpha}(\beta), \beta) = \arg\min_{\beta \in \mathbb{R}} \sum_{i=1}^{n} [(y_i - \bar{y}) - \beta(x_i - \bar{x})]^2$$

and we then set α̂ = α̂(β̂) = ȳ − β̂x̄ .

Taking the derivative with respect to β and setting it equal to 0, we obtain

$$-2 \sum_{i=1}^{n} (x_i - \bar{x})[(y_i - \bar{y}) - \beta(x_i - \bar{x})] = 0,$$

which we can rewrite as

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \beta \sum_{i=1}^{n} (x_i - \bar{x})^2.$$



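Then, if $\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, the solution of the problem is

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)},$$

where the last step is obtained by dividing both the numerator and the denominator by n.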

Note. Finally, to conclude the proof, it is necessary to verify that the solution found is
a minimum point and not, for example, a maximum.
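
One way to carry out this check is sketched below. It is an addition to the slides, and the name g for the reduced loss is introduced here only for illustration. Writing g(β) = ℓ(α̂(β), β), the second derivative with respect to β is

$$g''(\beta) = 2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = 2n\,\mathrm{var}(x) > 0,$$

so g is a convex parabola and its stationary point β̂ is a global minimum. Combined with the inequality ℓ(α, β) ≥ ℓ(α̂(β), β), this gives ℓ(α, β) ≥ ℓ(α̂(β), β) ≥ ℓ(α̂, β̂) for every pair (α, β).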


Parameters computation: cherry trees

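In this case, we have

$$\sum_{i=1}^{n} y_i = 935.3, \qquad \sum_{i=1}^{n} x_i = 410.7,$$
$$\sum_{i=1}^{n} x_i^2 = 5736.55, \qquad \sum_{i=1}^{n} x_i y_i = 13887.86.$$

So we can calculate the means, the variance and the covariance

$$\bar{y} = \frac{935.3}{31} = 30.17, \qquad \bar{x} = \frac{410.7}{31} = 13.25,$$
$$\mathrm{var}(x) = \frac{5736.55}{31} - 13.25^2 = 9.53, \qquad \mathrm{cov}(x, y) = \frac{13887.86}{31} - 13.25 \times 30.17 = 48.24.$$

We can then determine the parameters

$$\hat{\beta} = \frac{48.24}{9.53} = 5.06, \qquad \hat{\alpha} = 30.17 - 5.06 \times 13.25 = -36.88.$$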
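
The same computation can be repeated without rounding the intermediate quantities. The following is a minimal Python sketch that starts from the four sums given above; the variable names are illustrative choices, not part of the slides. Because no intermediate value is rounded, the estimates differ slightly from the hand computation (5.06 and −36.88).

```python
# Summary statistics taken from this slide
n = 31
sum_y, sum_x = 935.3, 410.7
sum_x2, sum_xy = 5736.55, 13887.86

# Means
y_bar = sum_y / n
x_bar = sum_x / n

# Variance and covariance (denominator n), via the "mean of products" shortcut
var_x = sum_x2 / n - x_bar ** 2
cov_xy = sum_xy / n - x_bar * y_bar

# Least squares estimates
beta_hat = cov_xy / var_x
alpha_hat = y_bar - beta_hat * x_bar

print(round(beta_hat, 2), round(alpha_hat, 2))  # 5.07 -36.94
```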


Scatterplot with regression line





[Figure: scatterplot of volume against diameter with the fitted regression line]

The ability to describe volume changes seems good, with the exception perhaps of the
most extreme observations.


Residuals: mean and variance

The differences between the observed values of the response variable and the values
predicted by the model, that is

ri = yi − (α̂ + β̂xi ), i = 1, . . . , n,

are called residuals.

Property. The mean of the residuals is 0. Indeed, their sum is

$$\sum_{i=1}^{n} r_i = \sum_{i=1}^{n} y_i - n\hat{\alpha} - \hat{\beta}\sum_{i=1}^{n} x_i = n\bar{y} - n(\bar{y} - \hat{\beta}\bar{x}) - n\hat{\beta}\bar{x} = 0.$$

The variance of the residuals can be used to evaluate the goodness of fit of the model
to the data.

In fact, the smaller the variance of the residuals, the closer the regression line is to the
observed data.
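
The zero-mean property can be checked numerically on the cherry tree data. This is a minimal Python sketch, not part of the original slides; the variable names are illustrative choices and the estimates are recomputed at full floating-point precision rather than taken from the rounded hand computation.

```python
# Data copied from the raw-data slide (31 trees)
diameter = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 20.6, 11.3,
            11.4, 11.4, 11.7, 12.0, 12.9, 12.9, 13.3, 13.7, 13.8, 14.0, 14.2, 14.5,
            16.0, 16.3, 17.3, 17.5, 17.9, 18.0, 18.0]
volume = [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 77.0, 24.2,
          21.0, 21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3,
          38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51.0]

n = len(diameter)
x_bar = sum(diameter) / n
y_bar = sum(volume) / n
var_x = sum((xi - x_bar) ** 2 for xi in diameter) / n
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(diameter, volume)) / n

# Least squares estimates
beta_hat = cov_xy / var_x
alpha_hat = y_bar - beta_hat * x_bar

# Residuals r_i = y_i - (alpha_hat + beta_hat * x_i)
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(diameter, volume)]
mean_r = sum(residuals) / n
var_r = sum((ri - mean_r) ** 2 for ri in residuals) / n

print(abs(mean_r) < 1e-9)  # True: the mean is zero up to floating-point rounding
print(round(var_r, 2))
```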


Residuals: mean and variance

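Property. The variance of the residuals is never greater than that of the response
variable. Indeed:

$$\mathrm{var}(y) = \min_{\alpha \in \mathbb{R}} \frac{1}{n}\sum_{i=1}^{n} (y_i - \alpha)^2 \ge \min_{(\alpha, \beta) \in \mathbb{R}^2} \frac{1}{n}\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \mathrm{var}(r).$$

Property. The variance of the residuals is equal to

$$\mathrm{var}(r) = \mathrm{var}(y) - \frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)}.$$

In fact, using the properties of variance, we get that

$$\begin{aligned}
\mathrm{var}(r) &= \frac{1}{n}\sum_{i=1}^{n} [(y_i - \hat{\beta}x_i) - (\bar{y} - \hat{\beta}\bar{x})]^2 = \mathrm{var}(y - \hat{\beta}x) \\
&= \mathrm{var}(y) + \hat{\beta}^2\,\mathrm{var}(x) - 2\hat{\beta}\,\mathrm{cov}(x, y) \\
&= \mathrm{var}(y) + \frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)} - 2\,\frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)} = \mathrm{var}(y) - \frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)}.
\end{aligned}$$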
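
Both properties can be verified numerically. The following is a minimal Python sketch on the cherry tree data, not part of the original slides; the helper names `mean`, `var` and `cov` are illustrative choices. It computes var(r) directly from the residuals and compares it with var(y) − cov(x, y)²/var(x).

```python
# Data copied from the raw-data slide (31 trees)
diameter = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 20.6, 11.3,
            11.4, 11.4, 11.7, 12.0, 12.9, 12.9, 13.3, 13.7, 13.8, 14.0, 14.2, 14.5,
            16.0, 16.3, 17.3, 17.5, 17.9, 18.0, 18.0]
volume = [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 77.0, 24.2,
          21.0, 21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3,
          38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51.0]


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Covariance with denominator n."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def var(v):
    """Variance with denominator n."""
    return cov(v, v)


beta_hat = cov(diameter, volume) / var(diameter)
alpha_hat = mean(volume) - beta_hat * mean(diameter)
residuals = [y - (alpha_hat + beta_hat * x) for x, y in zip(diameter, volume)]

var_r_direct = var(residuals)
var_r_formula = var(volume) - cov(diameter, volume) ** 2 / var(diameter)

print(abs(var_r_direct - var_r_formula) < 1e-9)  # True: the identity holds
print(var_r_direct <= var(volume))               # True: var(r) <= var(y)
```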


Coefficient of determination R²

The variance of the residuals depends on the scale of the observed phenomenon.
Therefore the R² index is often used to evaluate the goodness of fit.

Coefficient of determination R². The coefficient R² is defined as:

$$R^2 = 1 - \frac{\mathrm{var}(r)}{\mathrm{var}(y)}.$$

The index R² measures the fraction of the variance of the response variable (total variance)
explained by the model. Hence 0 ≤ R² ≤ 1.

We have R² = 0 if var(r ) = var(y ), that is, when the model does not “explain” the
response at all.

We have R² = 1 if var(r ) = 0, that is, when the model “explains” the response perfectly.


Coefficient of determination: cherry trees

Previously, we computed the following quantities:

ȳ = 30.17, x̄ = 13.25,
var(x ) = 9.53, cov(x , y ) = 48.24.

We also know that $\sum_{i=1}^{n} y_i^2 = 36324.99$, so

$$\mathrm{var}(y) = \frac{36324.99}{31} - 30.17^2 = 261.54, \qquad \mathrm{var}(r) = 261.54 - \frac{48.24^2}{9.53} = 17.35.$$

The coefficient of determination is

$$R^2 = 1 - \frac{17.35}{261.54} = 0.934,$$

that is, the model explains about 93% of the total variance of the response.
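
As a cross-check, R² can be recomputed from the sums alone without rounding any intermediate value. This is a minimal Python sketch, not part of the original slides; the variable names are illustrative choices and the five sums are the ones quoted in this unit.

```python
import math

# Sums quoted in the slides for the n = 31 cherry trees
n = 31
sum_x, sum_y = 410.7, 935.3
sum_x2, sum_y2, sum_xy = 5736.55, 36324.99, 13887.86

x_bar = sum_x / n
y_bar = sum_y / n
var_x = sum_x2 / n - x_bar ** 2
var_y = sum_y2 / n - y_bar ** 2
cov_xy = sum_xy / n - x_bar * y_bar

# Residual variance and coefficient of determination
var_r = var_y - cov_xy ** 2 / var_x
r2 = 1 - var_r / var_y

# Correlation coefficient, for comparison with R^2
cor_xy = cov_xy / math.sqrt(var_x * var_y)

print(round(r2, 4))           # 0.9353
print(round(cor_xy ** 2, 4))  # 0.9353
```

At full precision the result is about 0.935 rather than 0.934, which suggests that the small gap in the hand computation comes from rounding the intermediate quantities.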


Correlation and coefficient of determination

Property. The coefficient of determination is equal to the squared correlation coefficient.
In fact:

$$R^2 = 1 - \frac{\mathrm{var}(r)}{\mathrm{var}(y)} = \frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)\,\mathrm{var}(y)} = \mathrm{cor}(x, y)^2.$$

This equivalence makes it clear that the correlation coefficient (and therefore the
covariance) measures a linear relationship.

In fact, the coefficient R² and therefore cor(x , y ) capture the proximity of the data to a
straight line.

Note. In the case of cherry trees, we have R² = 0.934 and cor(x , y )² = 0.967² = 0.935.
This slight discrepancy is due to the various numerical approximations made.

If we had kept track of more decimals, we would have obtained

cor(x , y ) = 0.9671194, R² = 0.9353199.


Regression and correlation

The similarities with covariance and correlation are many.

The basic problem is the same (study of the relationships between variables) and the
"ingredients" we have handled as well (means, variances and covariances).

Despite this, note that there is an important difference.

In this unit we considered the effect of an explanatory variable on a response variable.

The variables were treated asymmetrically, as we were interested in a relationship of the
type diameter → volume.

When speaking about covariance and correlation, we treated the variables symmetrically.
We did not try to explain one on the basis of the other; we simply evaluated the
relationship between them.
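
The asymmetry can be seen numerically by swapping the roles of the two variables. The following is a minimal Python sketch on the cherry tree data, not part of the original slides; the names `b_yx` and `b_xy` are illustrative choices. The correlation formula is symmetric in x and y, while the slope of volume on diameter is not the reciprocal of the slope of diameter on volume. From the formulas above, the product of the two slopes equals cor(x, y)².

```python
import math

# Data copied from the raw-data slide (31 trees)
diameter = [8.3, 8.6, 8.8, 10.5, 10.7, 10.8, 11.0, 11.0, 11.1, 11.2, 20.6, 11.3,
            11.4, 11.4, 11.7, 12.0, 12.9, 12.9, 13.3, 13.7, 13.8, 14.0, 14.2, 14.5,
            16.0, 16.3, 17.3, 17.5, 17.9, 18.0, 18.0]
volume = [10.3, 10.3, 10.2, 16.4, 18.8, 19.7, 15.6, 18.2, 22.6, 19.9, 77.0, 24.2,
          21.0, 21.4, 21.3, 19.1, 22.2, 33.8, 27.4, 25.7, 24.9, 34.5, 31.7, 36.3,
          38.3, 42.6, 55.4, 55.7, 58.3, 51.5, 51.0]

n = len(diameter)
x_bar = sum(diameter) / n
y_bar = sum(volume) / n
var_x = sum((x - x_bar) ** 2 for x in diameter) / n
var_y = sum((y - y_bar) ** 2 for y in volume) / n
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(diameter, volume)) / n

b_yx = cov_xy / var_x  # slope when volume is regressed on diameter
b_xy = cov_xy / var_y  # slope when diameter is regressed on volume
cor_xy = cov_xy / math.sqrt(var_x * var_y)  # unchanged if x and y are swapped

print(round(b_yx, 3), round(1 / b_xy, 3))       # two different slopes in the same plane
print(abs(b_yx * b_xy - cor_xy ** 2) < 1e-9)    # True: product of slopes = cor^2
```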


