Linear Regression
Matteo Borrotti
Università Milano-Bicocca
Topic addressed
Simple linear regression
For $n = 31$ black cherry trees, we have measurements of the trunk diameter (measured about 1 m above the ground) and the amount of wood (volume) obtained after cutting the tree down.
We want to use the data to obtain an equation that allows us to predict the volume from the diameter, which is easily measurable:
$$\text{(volume)} \approx f(\text{diameter}).$$
For example, it can be used to decide how many and which trees to cut to obtain a
certain amount of wood, or to determine the "price" of a forest.
Diameter
 [1]  8.3  8.6  8.8 10.5 10.7 10.8 11.0 11.0 11.1 11.2 11.3 11.4
[13] 11.4 11.7 12.0 12.9 12.9 13.3 13.7 13.8 14.0 14.2 14.5 16.0
[25] 16.3 17.3 17.5 17.9 18.0 18.0 20.6
Volume
 [1] 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 24.2 21.0
[13] 21.4 21.3 19.1 22.2 33.8 27.4 25.7 24.9 34.5 31.7 36.3 38.3
[25] 42.6 55.4 55.7 58.3 51.5 51.0 77.0
[Figure: scatterplot of Volume against Diameter for the 31 cherry trees]
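These values coincide with R's built-in trees dataset (columns Girth and Volume), so, assuming that source, the plot can be reproduced with a minimal sketch:

  # Diameter and volume of the 31 black cherry trees; the values
  # match R's built-in 'trees' dataset (Girth, Volume).
  diameter <- trees$Girth
  volume   <- trees$Volume
  plot(diameter, volume, xlab = "Diameter", ylab = "Volume")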
If $y_1, \dots, y_n$ are the volume values and $x_1, \dots, x_n$ are the diameter values, then
$$y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \dots, n.$$
The last component, $\epsilon_i$, expresses the part of the volume fluctuations that is not related to the diameter, or that is not captured by the linear relationship.
The model that we have seen in the previous slide is commonly called the simple linear regression model.
In general, we want to describe the relationship between the variable $y$ and another variable $x$ by using the model
$$y = \alpha + \beta x + \epsilon.$$
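As an illustrative sketch, one can simulate data from this model in R; the parameter values below are arbitrary choices, not estimates from the cherry data:

  # Simulate n = 31 observations from y = alpha + beta*x + eps,
  # with arbitrary values alpha = 2, beta = 0.5 and Gaussian errors.
  set.seed(42)
  n   <- 31
  x   <- runif(n, min = 0, max = 10)
  eps <- rnorm(n, mean = 0, sd = 1)   # the error term epsilon
  y   <- 2 + 0.5 * x + eps
  plot(x, y)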
Galton examined the heights of the children (response variable y) as a function of the
height of the parents (explanatory variable x).
In his analysis, tall children came from tall parents and conversely short children came
from short parents.
Galton also noted a tendency for heights to move towards the average in the next generation, hence the term "regression".
If we had reasonable values of the parameters, say $\hat\alpha$ and $\hat\beta$, we could predict the volume of wood using
$$\text{(volume)} \approx \hat\alpha + \hat\beta \cdot \text{(diameter)},$$
that is,
$$y_1 \approx \hat\alpha + \hat\beta x_1, \quad y_2 \approx \hat\alpha + \hat\beta x_2, \quad \dots, \quad y_n \approx \hat\alpha + \hat\beta x_n.$$
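For instance, with the estimates implied by the summary statistics reported later in this document ($\hat\beta = 48.24/9.53 \approx 5.06$ and $\hat\alpha = 30.17 - 5.06 \cdot 13.25 \approx -36.9$), a tree with diameter 16 would be predicted to yield a volume of about $-36.9 + 5.06 \cdot 16 \approx 44$.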
[Figure: scatterplot of Volume against Diameter with a candidate straight line through the points]
The black line passes through the cloud of points and looks like an appropriate choice.
To make the previous insight operational, we need to decide what precisely is meant by
$$y_i \approx \hat\alpha + \hat\beta x_i, \qquad i = 1, \dots, n.$$
One possible solution is to choose the parameters that minimize the loss function
$$\ell(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \sum_{i=1}^{n} \epsilon_i^2.$$
This criterion is called the least squares method, since it minimizes the sum of squared
deviations, or the sum of squared errors.
The least squares criterion is very popular because the solution to the minimization
problem is simple to compute.
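As a minimal sketch in R (using the diameter and volume vectors defined above), the least squares fit is a single call to lm():

  diameter <- trees$Girth; volume <- trees$Volume    # as before
  fit <- lm(volume ~ diameter)   # least squares fit of volume on diameter
  coef(fit)                      # alpha-hat (intercept) and beta-hat (slope)
  predict(fit, newdata = data.frame(diameter = 16))  # predicted volume at diameter 16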
Least squares
The unique solution to the problem
$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
is
$$\hat\alpha = \bar{y} - \hat\beta \bar{x}, \qquad \hat\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}.$$
The solution of the problem is well defined only if $\mathrm{var}(x) > 0$.
This is very reasonable: the parameter $\beta$ indicates how much the response varies as the explanatory variable varies, but if $\mathrm{var}(x) = 0$ then the explanatory variable does not vary at all.
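The closed form can also be computed directly in R; note that var() and cov() in R use denominator n − 1 rather than n, but those factors cancel in the ratio:

  beta_hat  <- cov(diameter, volume) / var(diameter)    # the (n-1) factors cancel
  alpha_hat <- mean(volume) - beta_hat * mean(diameter)
  c(alpha_hat, beta_hat)   # same values as coef(fit)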
Proof. For a fixed value of $\beta$, the loss can be written as $\ell(\alpha, \beta) = \sum_{i=1}^{n} (w_i - \alpha)^2$, having $w_i = y_i - \beta x_i$ for each $i = 1, \dots, n$. We know that the value of $\alpha$ that minimizes this function is the arithmetic mean, so $\hat\alpha(\beta) = \bar{w} = \bar{y} - \beta \bar{x}$.
We have therefore reduced the initial problem to the following sub-problem:
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}} \ell(\hat\alpha(\beta), \beta) = \arg\min_{\beta \in \mathbb{R}} \sum_{i=1}^{n} \big[(y_i - \bar{y}) - \beta(x_i - \bar{x})\big]^2.$$
Taking the derivative with respect to $\beta$ and setting it equal to 0, we obtain that
$$-2 \sum_{i=1}^{n} (x_i - \bar{x}) \big[(y_i - \bar{y}) - \beta(x_i - \bar{x})\big] = 0.$$
Then, if $\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, the solution of the problem is
$$\hat\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)},$$
where the last equality is obtained by dividing the numerator and the denominator by $n$.
Note. Finally, to conclude the proof, it is necessary to verify that the solution found is a minimum point and not, for example, a maximum: here the second derivative, $2 \sum_{i=1}^{n} (x_i - \bar{x})^2 > 0$, guarantees a minimum.
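A quick numerical sanity check of this minimality (the perturbations below are arbitrary):

  loss <- function(a, b) sum((volume - a - b * diameter)^2)
  loss(alpha_hat, beta_hat)        # value at the least squares solution
  loss(alpha_hat + 1, beta_hat)    # perturbing either parameter...
  loss(alpha_hat, beta_hat + 0.1)  # ...gives a strictly larger loss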
[Figure: scatterplot of Volume against Diameter with the fitted least squares line]
The fitted line describes the changes in volume well, with the possible exception of the most extreme observations.
The differences between the observed values of the response variable and the values
predicted by the model, that is
$$r_i = y_i - (\hat\alpha + \hat\beta x_i), \qquad i = 1, \dots, n,$$
are called residuals.
The variance of the residuals can be used to evaluate the goodness of fit of the model
to the data.
In fact, the smaller the variance of the residuals, the closer the regression line is to the
observations.
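In R, continuing with the objects defined above:

  r <- volume - (alpha_hat + beta_hat * diameter)   # residuals computed by hand
  # equivalently: r <- residuals(fit)
  mean(r)   # essentially 0 by construction
  var(r)    # the smaller this is, the better the fit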
Property. The variance of the residuals is never greater than that of the response variable:
$$\mathrm{var}(y) = \min_{\alpha \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha)^2 \ \ge\ \min_{(\alpha, \beta) \in \mathbb{R}^2} \frac{1}{n} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2 = \mathrm{var}(r).$$
More precisely,
$$\mathrm{var}(r) = \mathrm{var}(y) - \frac{\mathrm{cov}(x, y)^2}{\mathrm{var}(x)}.$$
In fact, using the properties of variance, we get that
$$\mathrm{var}(r) = \frac{1}{n} \sum_{i=1}^{n} \big[(y_i - \hat\beta x_i) - (\bar{y} - \hat\beta \bar{x})\big]^2 = \mathrm{var}(y - \hat\beta x) = \mathrm{var}(y) + \hat\beta^2\, \mathrm{var}(x) - 2 \hat\beta\, \mathrm{cov}(x, y),$$
which equals $\mathrm{var}(y) - \mathrm{cov}(x, y)^2 / \mathrm{var}(x)$ after substituting $\hat\beta = \mathrm{cov}(x, y)/\mathrm{var}(x)$.
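This identity can be checked numerically; var_n and cov_n below are helper names introduced here for the denominator-n versions of variance and covariance:

  var_n <- function(z) mean((z - mean(z))^2)                   # variance with denominator n
  cov_n <- function(a, b) mean((a - mean(a)) * (b - mean(b)))  # covariance with denominator n
  var_n(r)                                                     # variance of the residuals
  var_n(volume) - cov_n(diameter, volume)^2 / var_n(diameter) # the same number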
The variance of the residuals depends on the scale of the observed phenomenon. Therefore the $R^2$ index is often used to evaluate the goodness of fit.
The index $R^2$ measures the fraction of the variance of the response variable (total variance) explained by the model:
$$R^2 = 1 - \frac{\mathrm{var}(r)}{\mathrm{var}(y)}, \qquad 0 \le R^2 \le 1.$$
We have that $R^2 = 0$ if $\mathrm{var}(r) = \mathrm{var}(y)$, that is, when the model does not “explain” the response at all.
We have that $R^2 = 1$ if $\mathrm{var}(r) = 0$, that is, when the model “explains” the response perfectly.
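In R, continuing the sketch above (with var_n and the residual vector r as previously defined):

  R2 <- 1 - var_n(r) / var_n(volume)
  R2                        # about 0.93 for the cherry tree data
  cor(diameter, volume)^2   # the same value, since R^2 = cor(x, y)^2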
For the cherry tree data we have
$$\bar{y} = 30.17, \quad \bar{x} = 13.25, \quad \mathrm{var}(x) = 9.53, \quad \mathrm{cov}(x, y) = 48.24.$$
We also know that $\sum_{i=1}^{n} y_i^2 = 36324.99$. Then
$$\mathrm{var}(y) = \frac{36324.99}{31} - 30.17^2 = 261.54, \qquad \mathrm{var}(r) = 261.54 - \frac{48.24^2}{9.53} = 17.35.$$
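These computations can be checked directly, and they yield the $R^2$ quoted below:

  36324.99 / 31 - 30.17^2    # var(y) = 261.54
  261.54 - 48.24^2 / 9.53    # var(r) = 17.35
  1 - 17.35 / 261.54         # R^2 = 0.934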
This equivalence, $R^2 = \mathrm{cor}(x, y)^2$, makes it clear that the correlation coefficient (and therefore the covariance) measures a linear relationship.
In fact, the coefficient R 2 and therefore cor(x , y ) capture the proximity of the data to a
straight line.
Note. In the case of the cherry trees, we have $R^2 = 0.934$ and $\mathrm{cor}(x, y)^2 = 0.967^2 = 0.935$.
This slight discrepancy is due to the various numerical approximations made.
The basic problem is the same (studying the relationships between variables), and so are the "ingredients" we have handled (means, variances and covariances).