Linear Regression

Nathanaël Carraz Rakotonirina

Mathématiques Informatique et Statistique Appliquées (MISA)

Université d’Antananarivo
The linear regression model is of the form

$$f(x; \theta) = w_1 x_1 + \dots + w_D x_D + b = w^\top x + b$$

- $\theta = (w, b)$: parameters
- $w$: weights
- $b$: bias
The bias $b$ can be absorbed into $w$ by defining $w = [b, w_1, \dots, w_D]$ and $x = [1, x_1, \dots, x_D]$, so
$$f(x; \theta) = w^\top x$$
The input $x$ can be replaced by a non-linear function of the inputs $\phi(x)$, called a basis expansion:
$$f(x; \theta) = w^\top \phi(x)$$
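
As a minimal sketch of the two ideas above, the following NumPy snippet evaluates the model with an explicit bias, with the bias absorbed into the weights, and with a polynomial basis expansion. The helper names `add_bias_feature` and `poly_features`, and the numeric values, are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def add_bias_feature(x):
    """Prepend a constant 1 so the bias becomes the first weight (hypothetical helper)."""
    return np.concatenate(([1.0], x))

def poly_features(x, degree):
    """Assumed basis expansion for a scalar input: phi(x) = [1, x, x^2, ..., x^degree]."""
    return np.array([x ** k for k in range(degree + 1)])

# Explicit-bias form: f(x) = w^T x + b
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([3.0, 4.0])
f_explicit = w @ x + b

# Absorbed-bias form: w = [b, w_1, ..., w_D], x = [1, x_1, ..., x_D]
w_aug = np.concatenate(([b], w))
x_aug = add_bias_feature(x)
f_absorbed = w_aug @ x_aug

# Basis expansion: the model stays linear in w, but is non-linear in x
w_poly = np.array([1.0, 0.0, 2.0])              # represents 1 + 2 x^2
f_poly = w_poly @ poly_features(3.0, degree=2)
```

Both `f_explicit` and `f_absorbed` compute the same value, which is the point of the absorption trick. The polynomial case shows that "linear" refers to the weights, not to the inputs.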
The general form of the linear regression model with all observations:

ŷ = Xw + b

- $N$: number of observations
- $D$: number of features
- $\hat{y} \in \mathbb{R}^N$: predictions
- $X \in \mathbb{R}^{N \times D}$: inputs (design matrix)
- $w \in \mathbb{R}^D$: weights
- $b \in \mathbb{R}$: bias
When the bias b is absorbed into w (by adding a column of ones to X):
ŷ = Xw
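
The matrix form can be sketched in NumPy as follows. The random data, the sizes `N` and `D`, and the names `X_aug` and `w_aug` are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3                      # illustrative sizes
X = rng.normal(size=(N, D))      # design matrix, one observation per row
w = rng.normal(size=D)           # weights
b = 0.7                          # bias

# Explicit-bias form: b is broadcast over all N predictions
y_hat = X @ w + b

# Absorbed-bias form: prepend a column of ones to X and b to w
X_aug = np.hstack([np.ones((N, 1)), X])   # shape (N, D + 1)
w_aug = np.concatenate(([b], w))          # shape (D + 1,)
y_hat_aug = X_aug @ w_aug
```

The two prediction vectors agree up to floating-point error, so the rest of the analysis can work with the shorter form without loss of generality.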
Loss function - Least squares

Find the parameters w that minimize the residual sum of squares (loss)
$$\mathrm{RSS}(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - f(x_i))^2 = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2$$

We can minimize it analytically or iteratively using gradient descent.
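
The iterative route can be sketched as plain batch gradient descent. This is a minimal illustration under assumed settings: the step size `lr`, the number of steps, and the synthetic data are arbitrary choices, not values from the slides. The update uses the gradient of the RSS with respect to w, which is X^T (Xw - y).

```python
import numpy as np

def rss(w, X, y):
    """Residual sum of squares with the 1/2 factor used in these slides."""
    r = X @ w - y
    return 0.5 * r @ r

def gradient_descent(X, y, lr=0.005, n_steps=2000):
    """Batch gradient descent on the RSS (lr and n_steps are assumed values)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)   # gradient of the RSS at the current w
        w = w - lr * grad
    return w

# Synthetic data with the bias absorbed as a leading column of ones
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
w_true = np.array([1.0, 2.0, -3.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

w_gd = gradient_descent(X, y)
```

The step size must be small enough relative to the largest eigenvalue of X^T X, otherwise the iterates diverge. In practice it is often tuned or set by a line search.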

Probabilistic Interpretation
The targets and inputs are related as follows

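$$y = w^\top x + \epsilon$$

where $\epsilon$ is the residual error between the predictions and the true response (unmodeled
effects/random noise). We assume $\epsilon$ has a Gaussian distribution, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so that

$$p(y \mid x; \theta) = \mathcal{N}(y; w^\top x, \sigma^2)$$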

where θ = (w , σ 2 ).
We estimate the parameters using Maximum Likelihood Estimation. We want the
parameters that maximize the likelihood $\prod_{i=1}^{N} p(y_i \mid x_i; \theta)$. It is easier to minimize the
negative log-likelihood
$$\mathrm{NLL}(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta)$$

It can be shown that minimizing the NLL is equivalent to minimizing the RSS.
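
A short sketch of why this holds, substituting the Gaussian density into the NLL and using the definition of the RSS with its 1/2 factor:

```latex
\begin{aligned}
\mathrm{NLL}(\theta)
  &= -\sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left( -\frac{(y_i - w^\top x_i)^2}{2\sigma^2} \right) \right] \\
  &= \frac{N}{2} \log(2\pi\sigma^2)
     + \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2 \\
  &= \frac{N}{2} \log(2\pi\sigma^2) + \frac{1}{\sigma^2} \mathrm{RSS}(w)
\end{aligned}
```

The first term does not depend on w and $1/\sigma^2$ is a positive constant, so for any fixed $\sigma^2$ the w that minimizes the NLL is the w that minimizes the RSS.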
Ordinary Least Squares
Our loss function is
$$J(w) = \mathrm{RSS}(w) = \frac{1}{2} \sum_{i=1}^{N} (y_i - w^\top x_i)^2 = \frac{1}{2} \|Xw - y\|_2^2 = \frac{1}{2} (Xw - y)^\top (Xw - y)$$

The gradient is given by

$$\nabla_w \mathrm{RSS}(w) = X^\top X w - X^\top y$$
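
The intermediate steps can be sketched by expanding the quadratic form and applying the standard identities $\nabla_w (w^\top A w) = 2Aw$ for symmetric $A$ and $\nabla_w (w^\top c) = c$:

```latex
\begin{aligned}
\mathrm{RSS}(w)
  &= \frac{1}{2} (Xw - y)^\top (Xw - y) \\
  &= \frac{1}{2} w^\top X^\top X w - w^\top X^\top y + \frac{1}{2} y^\top y \\
\nabla_w \mathrm{RSS}(w)
  &= X^\top X w - X^\top y = X^\top (Xw - y)
\end{aligned}
```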

Setting the gradient to zero gives the normal equations:
$$X^\top X w = X^\top y$$
The solution $\hat{w}$, called the ordinary least squares (OLS) solution, is given by

$$\hat{w} = (X^\top X)^{-1} X^\top y$$
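
A minimal NumPy sketch of this solution on a toy dataset follows. The function name `ols_normal_equations` and the data are illustrative assumptions. The sketch solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, and it assumes X^T X is invertible.

```python
import numpy as np

def ols_normal_equations(X, y):
    """Solve (X^T X) w = X^T y for w; assumes X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated exactly by y = 1 + 2 x, with the bias absorbed
# as a leading column of ones in the design matrix
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w_hat = ols_normal_equations(X, y)   # recovers intercept 1 and slope 2
```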

Is it a unique global minimum?
We check whether the Hessian is positive definite. It is given by

$$H(w) = \nabla_w^2 \mathrm{RSS}(w) = X^\top X$$
If the columns of X are linearly independent (X has full column rank), then H is positive definite and $\hat{w}$ is the
unique global minimum.
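
This condition can be checked numerically. The sketch below tests whether the smallest eigenvalue of X^T X is strictly positive. The function name and the tolerance `tol` are assumptions chosen for illustration, and a rank check via `np.linalg.matrix_rank` is an equivalent alternative.

```python
import numpy as np

def hessian_is_positive_definite(X, tol=1e-10):
    """Return True if H = X^T X has only eigenvalues above tol (assumed threshold)."""
    H = X.T @ X
    eigvals = np.linalg.eigvalsh(H)   # H is symmetric; eigenvalues come back ascending
    return bool(eigvals[0] > tol)

# Linearly independent columns: H is positive definite
X_full = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])

# Second column is twice the first: H is singular
X_dep = np.array([[1.0, 2.0],
                  [1.0, 2.0],
                  [1.0, 2.0]])

pd_full = hessian_is_positive_definite(X_full)
pd_dep = hessian_is_positive_definite(X_dep)
```

When the columns are dependent, H is only positive semi-definite and the normal equations have infinitely many solutions, so the minimizer is no longer unique.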
Numerical issues

The inverse should not be computed directly. $X^\top X$ can be singular or ill-conditioned.
There are alternatives:
- QR decomposition
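
As a hedged sketch of the QR route: writing X = QR with Q having orthonormal columns and R upper triangular, the normal equations reduce to R w = Q^T y whenever R is invertible, so X^T X is never formed. The function name `ols_qr` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ols_qr(X, y):
    """Least squares via reduced QR; assumes X has full column rank."""
    Q, R = np.linalg.qr(X)              # Q: (N, D) orthonormal columns, R: (D, D) upper triangular
    return np.linalg.solve(R, Q.T @ y)  # solve R w = Q^T y

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = rng.normal(size=20)

w_qr = ols_qr(X, y)
w_ref = np.linalg.lstsq(X, y, rcond=None)[0]   # NumPy's least-squares solver, for comparison
```

`np.linalg.solve` is a general solver and is used here only for brevity. A dedicated triangular solver (back substitution) would exploit the structure of R.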

Explore further
- Polynomial regression (other basis expansions)
- Weighted linear regression
- Bayesian linear regression

