05 Linear Regression Computation
Ignacio Sarmiento-Barbieri
1 Intro
2 Numerical Properties
3 OLS Computation
Traditional Computation
Gradient Descent
4 Further Readings
1 Intro
y = Xβ + u (1)
2 Numerical Properties
- Numerical properties have nothing to do with how the data was generated
- These properties hold for every data set, just because of the way β̂ was calculated
e = y − ŷ (3)
  = y − Xβ̂ (4)

Replacing β̂:

e = y − X(X'X)⁻¹X'y (5)
  = (I − X(X'X)⁻¹X')y (6)
- P_X = X(X'X)⁻¹X'
- M_X = I − P_X
- Both are symmetric
- Both are idempotent: AA = A
- P_X X = X, hence "projection matrix"
- M_X X = 0, hence "annihilator matrix"
We can write

SSR = e'e = u'M_X u (7)
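A minimal R sketch verifying these properties numerically (the matrix X below is an arbitrary illustrative example, not from the slides):

# Minimal sketch: check the properties of P_X and M_X numerically
set.seed(123)
X <- cbind(1, matrix(rnorm(20), ncol = 2))  # n = 10, k = 3, with intercept
P <- X %*% solve(t(X) %*% X) %*% t(X)       # projection matrix P_X
M <- diag(nrow(X)) - P                      # annihilator matrix M_X
all.equal(P, t(P))                          # symmetric
all.equal(P %*% P, P)                       # idempotent: PP = P
all.equal(P %*% X, X)                       # P_X X = X
max(abs(M %*% X))                           # M_X X = 0 (up to rounding error)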
- Linear model: Y = Xβ + u
- Split it: Y = X_1 β_1 + X_2 β_2 + u
- X = [X_1 X_2]; X is n × k, X_1 is n × k_1, X_2 is n × k_2, k = k_1 + k_2
- β = [β_1 β_2]
Theorem (Frisch-Waugh-Lovell)
1. The OLS estimates of β_2 from these two regressions are numerically identical:

y = X_1 β_1 + X_2 β_2 + u (8)
M_{X_1} y = M_{X_1} X_2 β_2 + residuals (9)

2. The OLS residuals from these two regressions are also numerically identical.
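A quick numerical check of the theorem, sketched in R with simulated data (all names and dimensions are illustrative):

# Sketch: FWL on simulated data
set.seed(42)
n  <- 100
X1 <- cbind(1, rnorm(n))                 # includes the intercept
X2 <- matrix(rnorm(2 * n), ncol = 2)
y  <- X1 %*% c(1, 2) + X2 %*% c(3, 4) + rnorm(n)

long  <- lm(y ~ X1 + X2 - 1)             # full regression (8)
M1y   <- resid(lm(y ~ X1 - 1))           # M_{X1} y
M1X2  <- resid(lm(X2 ~ X1 - 1))          # M_{X1} X2
short <- lm(M1y ~ M1X2 - 1)              # partialled-out regression (9)

coef(long)[3:4] - coef(short)            # ~0: identical estimates of beta_2
max(abs(resid(long) - resid(short)))     # ~0: identical residuals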
An application with multiple high-dimensional fixed effects:

Y = Xβ + D_1 λ + D_2 θ + D_3 γ + D_4 δ + u (11)
- Data set: 31.6 million observations, with 6.4 million individuals (i), 624 thousand firms (f), 115 thousand occupations (j), and 11 years (t)
- Storing the required indicator matrices would require 23.4 terabytes of memory
- From their paper:

"In our application, we first make use of the Frisch-Waugh-Lovell theorem to remove the influence of the three high-dimensional fixed effects from each individual variable, and, in a second step, implement the final regression using the transformed variables. With a correction to the degrees of freedom, this approach yields the exact least squares solution for the coefficients and standard errors."
Each element of the vector of parameter estimates β̂ is simply a weighted average of the elements of the vector y. Let c_j denote the j-th row of the matrix (X'X)⁻¹X'; then

β̂_j = c_j y (13)
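A small R sketch of this point (simulated data, names illustrative):

# Sketch: each coefficient is a fixed weighted sum of the y's
set.seed(1)
X <- cbind(1, rnorm(50))              # intercept plus one regressor
y <- 2 + 3 * X[, 2] + rnorm(50)
C <- solve(t(X) %*% X) %*% t(X)       # row j of C is c_j; depends on X only
drop(C %*% y)                         # beta_hat, coefficient by coefficient
coef(lm(y ~ X[, 2]))                  # same numbers from lm()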
[Figure: scatter of y against x; the OLS fit is y = 19.38 + 5.44x; with an added influential observation the fit shifts to y = 23.37 + 5.57x]
Give observation j its own dummy, where e_j is the j-th unit (indicator) vector:

y = Xβ + αe_j + u (14)

- M_{e_j} y = y − e_j(e_j'e_j)⁻¹e_j'y = y − y_j e_j
- M_{e_j} X is X with the j-th row replaced by zeros

With Z = [X e_j]:

y = P_Z y + M_Z y (18)
  = Xβ̂^(j) + α̂e_j + M_Z y (19)

where β̂^(j) is the OLS estimate of β when observation j gets its own dummy (equivalently, when it is excluded). Pre-multiplying by P_X (remember M_Z P_X = 0):

α̂ = û_j / (1 − h_j) (25)

where h_j is the j-th diagonal element of P_X.
Applications: Influential Observations
Finally we get

β̂^(j) − β̂ = −(1/(1 − h_j)) (X'X)⁻¹ X_j' û_j (26)
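A sketch checking (25)-(26) in R with simulated data (the observation index and all names are illustrative):

# Sketch: drop one observation and compare with formula (26)
set.seed(7)
n <- 30
X <- cbind(1, rnorm(n))
y <- 1 + 2 * X[, 2] + rnorm(n)
fit <- lm(y ~ X - 1)
j <- 5                                         # observation to drop (arbitrary)
h <- hatvalues(fit)[j]                         # h_j, j-th diagonal of P_X
u <- resid(fit)[j]                             # u_hat_j
fit_j <- lm(y[-j] ~ X[-j, ] - 1)               # OLS without observation j
coef(fit_j) - coef(fit)                        # direct beta_hat^(j) - beta_hat
drop(-(1 / (1 - h)) * solve(t(X) %*% X) %*% X[j, ] * u)  # formula (26)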
[Figure: the two fits y = 19.38 + 5.44x and y = 23.37 + 5.57x compared on the same scatter, illustrating the influence of a single observation]
3 OLS Computation
Traditional Computation
# OLS via the normal equations: beta_hat = (X'X)^(-1) X'y
beta <- solve(t(X) %*% X) %*% t(X) %*% y
Theorem
If A ∈ R^{n×k}, then there exist a Q ∈ R^{n×k} with orthonormal columns and an upper triangular R ∈ R^{k×k} such that A = QR.

- Orthogonal matrices:
  - Def: Q'Q = QQ' = I, so Q' = Q⁻¹
  - Prop: the product of orthogonal matrices is orthogonal; e.g., if A'A = I and B'B = I, then (AB)'(AB) = B'(A'A)B = B'B = I
- (Thin QR) If A ∈ R^{n×k} has full column rank, then the factorization A = Q_1 R_1 is unique, where Q_1 ∈ R^{n×k} and R_1 is upper triangular with positive diagonal entries
We can use this to get β̂:

(X'X)β̂ = X'y (28)
(R'Q'QR)β̂ = R'Q'y (29)
(R'R)β̂ = R'Q'y (30)
Rβ̂ = Q'y (31)

Solve by back substitution.
QR decomposition

X = [1 2; 1 2; 1 3],  y = [1; 4; 2] (32)

1. QR factorization X = QR:

Q = [−0.57 −0.41; −0.57 −0.41; −0.57 0.82],  R = [−1.73 −4.04; 0 0.81] (33)

2. Solve Rβ̂ = Q'y by back substitution:

[−1.73 −4.04; 0 0.81] [β_1; β_2] = [−4.04; −0.41] (34)
Note that R's lm() also returns many objects (fitted values, residuals, the qr object) that are the same size as X and y.
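A sketch of the same computation with base R's qr() tools, on the example in (32):

# Sketch: OLS via QR and back substitution
X <- matrix(c(1, 1, 1, 2, 2, 3), ncol = 2)
y <- c(1, 4, 2)
qx <- qr(X)                            # QR factorization of X
R <- qr.R(qx)                          # upper triangular R
Qty <- qr.qty(qx, y)[1:ncol(X)]        # first k entries of Q'y
backsolve(R, Qty)                      # beta_hat by back substitution
qr.coef(qx, y)                         # same answer; lm() also solves via QR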
3 OLS Computation
Gradient Descent
- To implement Gradient Descent, you need to compute the gradient of the cost function with respect to each model parameter:

β' = β − ε∇_β f(β) (35)

- In other words, you need to calculate how much the cost function changes if you change β just a little bit.
- You also need to define the learning step ε.
- Notice that this formula involves calculations over the full data set X at each Gradient Descent step!
- This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.
- As a result, it is terribly slow on very large data sets.
- However, Gradient Descent scales well with the number of variables: estimating a linear regression model with hundreds of thousands of features is much faster with Gradient Descent than with the normal equations or any decomposition. (A minimal sketch follows below.)
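A minimal batch gradient descent sketch in R; it reuses the toy data from the worked example that follows, and the learning rate and iteration count are illustrative choices:

# Sketch: batch gradient descent for y = alpha + beta * x
x <- c(8, 12, 16)
y <- c(5, 10, 12.5)
alpha <- -1; beta <- 2; eps <- 0.005
for (i in 1:10000) {
  r <- y - alpha - beta * x          # residuals at the current parameters
  g_alpha <- -2 * mean(r)            # gradient of the MSE w.r.t. alpha
  g_beta  <- -2 * mean(x * r)        # gradient of the MSE w.r.t. beta
  alpha <- alpha - eps * g_alpha
  beta  <- beta  - eps * g_beta
}
c(alpha, beta)                       # approaches the OLS solution
coef(lm(y ~ x))                      # closed-form answer for comparison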
[Figure: scatter of Ln(Wage) against Education]

The closed-form (normal equations) solution:

β̂ = (X'X)⁻¹X'y

beta <- solve(t(X) %*% X) %*% t(X) %*% y
lm(y ~ x, data)
The Gradient

Taking f to be the mean squared error, the gradient is

∂f/∂α = −(2/n) ∑_{i=1}^{n} (y_i − α − βx_i)
∂f/∂β = −(2/n) ∑_{i=1}^{n} x_i(y_i − α − βx_i)

Updating

α' = α − ε ∂f/∂α
β' = β − ε ∂f/∂β
Start with an initial guess α = −1, β = 2, and a learning rate ε = 0.005. With the data (x_i, y_i) = (8, 5), (12, 10), (16, 12.5), so n = 3, the first step gives

α1 = (−1) − 0.005 × (−2/3) × [(5 − (−1) − 2×8) + (10 − (−1) − 2×12) + (12.5 − (−1) − 2×16)]
β1 = 2 − 0.005 × (−2/3) × [8(5 − (−1) − 2×8) + 12(10 − (−1) − 2×12) + 16(12.5 − (−1) − 2×16)]

α1 = −1.1384
β1 = 0.2266
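These two updates can be checked in a couple of lines of R (a sketch; note the 1/n factor from the MSE):

x <- c(8, 12, 16); y <- c(5, 10, 12.5)
alpha <- -1; beta <- 2; eps <- 0.005
r <- y - alpha - beta * x
alpha - eps * (-2 * mean(r))         # -1.138333, the slide's alpha1
beta  - eps * (-2 * mean(x * r))     # 0.226667, the slide's beta1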
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 1]
For the second step, start from the updated values α1 = −1.1384 and β1 = 0.2266 (same learning rate ε = 0.005). Then we have

α2 = (−1.1384) − 0.005 × (−2/3) × [(5 − (−1.1384) − 0.2266×8) + (10 − (−1.1384) − 0.2266×12) + (12.5 − (−1.1384) − 0.2266×16)]
β2 = (0.2266) − 0.005 × (−2/3) × [8(5 − (−1.1384) − 0.2266×8) + 12(10 − (−1.1384) − 0.2266×12) + 16(12.5 − (−1.1384) − 0.2266×16)]

α2 = −1.0624
β2 = 1.212689
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 2]
α3 = −1.1056
β3 = 0.6646
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 3]
α4 = −1.082738
β4 = 0.9693922
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 4]
α7211 = −2.076246
β7211 = 0.9369499
[Figure: scatter of Ln(Wage) against Education with the converged fit after iteration 7211]
- Computing the gradient can be very time consuming: it uses the full data set at each step!
- At the opposite extreme, Stochastic Gradient Descent just picks a random observation at every step and computes the gradient based only on that single observation.
- This makes it easier to use distributed computing.
- At each step, instead of computing the gradients based on the full data set (as in Batch GD) or on just one observation (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of observations called mini-batches.
- The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from distributed computing and hardware optimization of matrix operations, especially when using GPUs. (See the sketch below.)
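A minimal mini-batch sketch in R (the simulated data, batch size, and learning rate are all illustrative choices):

# Sketch: mini-batch gradient descent
set.seed(10)
n <- 1000
x <- runif(n, 0, 20)
y <- -2 + x + rnorm(n)
alpha <- 0; beta <- 0; eps <- 0.005
m <- 32                                          # mini-batch size
for (i in 1:5000) {
  idx <- sample(n, m)                            # draw a random mini-batch
  r <- y[idx] - alpha - beta * x[idx]
  alpha <- alpha - eps * (-2 * mean(r))          # MSE gradient on the batch
  beta  <- beta  - eps * (-2 * mean(x[idx] * r))
}
c(alpha, beta)                                   # hovers near coef(lm(y ~ x))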
4 Further Readings
https://tinyurl.com/y3nzvkwh