05 Linear Regression Computation
Ignacio Sarmiento-Barbieri
1 Intro
2 Numerical Properties
3 OLS Computation
Traditional Computation
Gradient Descent
4 Further Readings
1 Intro
y = Xβ + u (1)
2 Numerical Properties
- Numerical properties have nothing to do with how the data was generated
- These properties hold for every data set, just because of the way β̂ was calculated
e = y − ŷ (3)
  = y − Xβ̂ (4)

Replacing β̂:

e = y − X(X'X)⁻¹X'y (5)
  = (I − X(X'X)⁻¹X')y (6)
- P_X = X(X'X)⁻¹X'
- M_X = I − P_X
- Both are symmetric
- Both are idempotent: AA = A
- P_X X = X, hence "projection matrix"
- M_X X = 0, hence "annihilator matrix"
We can write

SSR = e'e = u'M_X u (7)
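A minimal R sketch verifying these properties numerically (the matrix X below is an arbitrary illustrative example, not from the slides):

# Minimal sketch: check the properties of P_X and M_X numerically
set.seed(123)
X <- cbind(1, matrix(rnorm(20), ncol = 2))  # n = 10, k = 3, with intercept
P <- X %*% solve(t(X) %*% X) %*% t(X)       # projection matrix P_X
M <- diag(nrow(X)) - P                      # annihilator matrix M_X
all.equal(P, t(P))                          # symmetric
all.equal(P %*% P, P)                       # idempotent: PP = P
all.equal(P %*% X, X)                       # P_X X = X
max(abs(M %*% X))                           # M_X X = 0 (up to rounding error)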
- Linear model: Y = Xβ + u
- Split it: Y = X_1 β_1 + X_2 β_2 + u
- X = [X_1 X_2]; X is n × k, X_1 is n × k_1, X_2 is n × k_2, k = k_1 + k_2
- β = [β_1 β_2]
Theorem (Frisch-Waugh-Lovell)
1. The OLS estimates of β_2 from these two regressions are numerically identical:

y = X_1 β_1 + X_2 β_2 + u (8)
M_{X_1} y = M_{X_1} X_2 β_2 + residuals (9)

2. The OLS residuals from these two regressions are also numerically identical.
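A quick numerical check of the theorem, sketched in R with simulated data (all names and dimensions are illustrative):

# Sketch: FWL on simulated data
set.seed(42)
n  <- 100
X1 <- cbind(1, rnorm(n))                 # includes the intercept
X2 <- matrix(rnorm(2 * n), ncol = 2)
y  <- X1 %*% c(1, 2) + X2 %*% c(3, 4) + rnorm(n)

long  <- lm(y ~ X1 + X2 - 1)             # full regression (8)
M1y   <- resid(lm(y ~ X1 - 1))           # M_{X1} y
M1X2  <- resid(lm(X2 ~ X1 - 1))          # M_{X1} X2
short <- lm(M1y ~ M1X2 - 1)              # partialled-out regression (9)

coef(long)[3:4] - coef(short)            # ~0: identical estimates of beta_2
max(abs(resid(long) - resid(short)))     # ~0: identical residuals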
An application with multiple high-dimensional fixed effects:

Y = Xβ + D_1 λ + D_2 θ + D_3 γ + D_4 δ + u (11)
- Data set: 31.6 million observations, with 6.4 million individuals (i), 624 thousand firms (f), 115 thousand occupations (j), and 11 years (t)
- Storing the required indicator matrices would require 23.4 terabytes of memory
- From their paper:

"In our application, we first make use of the Frisch-Waugh-Lovell theorem to remove the influence of the three high-dimensional fixed effects from each individual variable, and, in a second step, implement the final regression using the transformed variables. With a correction to the degrees of freedom, this approach yields the exact least squares solution for the coefficients and standard errors."
Each element of the vector of parameter estimates β̂ is simply a weighted average of the elements of the vector y. Let c_j denote the j-th row of the matrix (X'X)⁻¹X'; then

β̂_j = c_j y (13)
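A small R sketch of this point (simulated data, names illustrative):

# Sketch: each coefficient is a fixed weighted sum of the y's
set.seed(1)
X <- cbind(1, rnorm(50))              # intercept plus one regressor
y <- 2 + 3 * X[, 2] + rnorm(50)
C <- solve(t(X) %*% X) %*% t(X)       # row j of C is c_j; depends on X only
drop(C %*% y)                         # beta_hat, coefficient by coefficient
coef(lm(y ~ X[, 2]))                  # same numbers from lm()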
[Figure: scatter of y against x; the OLS fit is y = 19.38 + 5.44x; with an added influential observation the fit shifts to y = 23.37 + 5.57x]
Give observation j its own dummy, where e_j is the j-th unit (indicator) vector:

y = Xβ + αe_j + u (14)

- M_{e_j} y = y − e_j(e_j'e_j)⁻¹e_j'y = y − y_j e_j
- M_{e_j} X is X with the j-th row replaced by zeros

With Z = [X e_j]:

y = P_Z y + M_Z y (18)
  = Xβ̂^(j) + α̂e_j + M_Z y (19)

where β̂^(j) is the OLS estimate of β when observation j gets its own dummy (equivalently, when it is excluded). Pre-multiplying by P_X (remember M_Z P_X = 0):

α̂ = û_j / (1 − h_j) (25)

where h_j is the j-th diagonal element of P_X.
Applications: Influential Observations
Finally we get

β̂^(j) − β̂ = −(1/(1 − h_j)) (X'X)⁻¹ X_j' û_j (26)
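A sketch checking (25)-(26) in R with simulated data (the observation index and all names are illustrative):

# Sketch: drop one observation and compare with formula (26)
set.seed(7)
n <- 30
X <- cbind(1, rnorm(n))
y <- 1 + 2 * X[, 2] + rnorm(n)
fit <- lm(y ~ X - 1)
j <- 5                                         # observation to drop (arbitrary)
h <- hatvalues(fit)[j]                         # h_j, j-th diagonal of P_X
u <- resid(fit)[j]                             # u_hat_j
fit_j <- lm(y[-j] ~ X[-j, ] - 1)               # OLS without observation j
coef(fit_j) - coef(fit)                        # direct beta_hat^(j) - beta_hat
drop(-(1 / (1 - h)) * solve(t(X) %*% X) %*% X[j, ] * u)  # formula (26)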
[Figure: the two fits y = 19.38 + 5.44x and y = 23.37 + 5.57x compared on the same scatter, illustrating the influence of a single observation]
3 OLS Computation
Traditional Computation
# OLS via the normal equations: beta_hat = (X'X)^(-1) X'y
beta <- solve(t(X) %*% X) %*% t(X) %*% y
Theorem
If A ∈ R^{n×k}, then there exist a Q ∈ R^{n×k} with orthonormal columns and an upper triangular R ∈ R^{k×k} such that A = QR.

- Orthogonal matrices:
  - Def: Q'Q = QQ' = I, so Q' = Q⁻¹
  - Prop: the product of orthogonal matrices is orthogonal; e.g., if A'A = I and B'B = I, then (AB)'(AB) = B'(A'A)B = B'B = I
- (Thin QR) If A ∈ R^{n×k} has full column rank, then the factorization A = Q_1 R_1 is unique, where Q_1 ∈ R^{n×k} and R_1 is upper triangular with positive diagonal entries
We can use this to get β̂:

(X'X)β̂ = X'y (28)
(R'Q'QR)β̂ = R'Q'y (29)
(R'R)β̂ = R'Q'y (30)
Rβ̂ = Q'y (31)

Solve by back substitution.
QR decomposition

X = [1 2; 1 2; 1 3],  y = [1; 4; 2] (32)

1. QR factorization X = QR:

Q = [−0.57 −0.41; −0.57 −0.41; −0.57 0.82],  R = [−1.73 −4.04; 0 0.81] (33)

2. Solve Rβ̂ = Q'y by back substitution:

[−1.73 −4.04; 0 0.81] [β_1; β_2] = [−4.04; −0.41] (34)
Note that R's lm() also returns many objects (fitted values, residuals, the qr object) that are the same size as X and y.
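A sketch of the same computation with base R's qr() tools, on the example in (32):

# Sketch: OLS via QR and back substitution
X <- matrix(c(1, 1, 1, 2, 2, 3), ncol = 2)
y <- c(1, 4, 2)
qx <- qr(X)                            # QR factorization of X
R <- qr.R(qx)                          # upper triangular R
Qty <- qr.qty(qx, y)[1:ncol(X)]        # first k entries of Q'y
backsolve(R, Qty)                      # beta_hat by back substitution
qr.coef(qx, y)                         # same answer; lm() also solves via QR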
3 OLS Computation
Gradient Descent
- To implement Gradient Descent, you need to compute the gradient of the cost function with respect to each model parameter:

β' = β − ε∇_β f(β) (35)

- In other words, you need to calculate how much the cost function changes if you change β just a little bit.
- You also need to define the learning step ε.
- Notice that this formula involves calculations over the full data set X at each Gradient Descent step!
- This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.
- As a result, it is terribly slow on very large data sets.
- However, Gradient Descent scales well with the number of variables: estimating a linear regression model with hundreds of thousands of features is much faster with Gradient Descent than with the normal equations or any decomposition. (A minimal sketch follows below.)
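A minimal batch gradient descent sketch in R; it reuses the toy data from the worked example that follows, and the learning rate and iteration count are illustrative choices:

# Sketch: batch gradient descent for y = alpha + beta * x
x <- c(8, 12, 16)
y <- c(5, 10, 12.5)
alpha <- -1; beta <- 2; eps <- 0.005
for (i in 1:10000) {
  r <- y - alpha - beta * x          # residuals at the current parameters
  g_alpha <- -2 * mean(r)            # gradient of the MSE w.r.t. alpha
  g_beta  <- -2 * mean(x * r)        # gradient of the MSE w.r.t. beta
  alpha <- alpha - eps * g_alpha
  beta  <- beta  - eps * g_beta
}
c(alpha, beta)                       # approaches the OLS solution
coef(lm(y ~ x))                      # closed-form answer for comparison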
[Figure: scatter of Ln(Wage) against Education]

The closed-form (normal equations) solution:

β̂ = (X'X)⁻¹X'y

beta <- solve(t(X) %*% X) %*% t(X) %*% y
lm(y ~ x, data)
The Gradient

Taking f to be the mean squared error, the gradient is

∂f/∂α = −(2/n) ∑_{i=1}^{n} (y_i − α − βx_i)
∂f/∂β = −(2/n) ∑_{i=1}^{n} x_i(y_i − α − βx_i)

Updating

α' = α − ε ∂f/∂α
β' = β − ε ∂f/∂β
Start with an initial guess α = −1, β = 2, and a learning rate ε = 0.005. With the data (x_i, y_i) = (8, 5), (12, 10), (16, 12.5), so n = 3, the first step gives

α1 = (−1) − 0.005 × (−2/3) × [(5 − (−1) − 2×8) + (10 − (−1) − 2×12) + (12.5 − (−1) − 2×16)]
β1 = 2 − 0.005 × (−2/3) × [8(5 − (−1) − 2×8) + 12(10 − (−1) − 2×12) + 16(12.5 − (−1) − 2×16)]

α1 = −1.1384
β1 = 0.2266
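These two updates can be checked in a couple of lines of R (a sketch; note the 1/n factor from the MSE):

x <- c(8, 12, 16); y <- c(5, 10, 12.5)
alpha <- -1; beta <- 2; eps <- 0.005
r <- y - alpha - beta * x
alpha - eps * (-2 * mean(r))         # -1.138333, the slide's alpha1
beta  - eps * (-2 * mean(x * r))     # 0.226667, the slide's beta1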
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 1]
For the second step, start from the updated values α1 = −1.1384 and β1 = 0.2266 (same learning rate ε = 0.005). Then we have

α2 = (−1.1384) − 0.005 × (−2/3) × [(5 − (−1.1384) − 0.2266×8) + (10 − (−1.1384) − 0.2266×12) + (12.5 − (−1.1384) − 0.2266×16)]
β2 = (0.2266) − 0.005 × (−2/3) × [8(5 − (−1.1384) − 0.2266×8) + 12(10 − (−1.1384) − 0.2266×12) + 16(12.5 − (−1.1384) − 0.2266×16)]

α2 = −1.0624
β2 = 1.212689
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 2]
α3 = −1.1056
β3 = 0.6646
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 3]
α4 = −1.082738
β4 = 0.9693922
[Figure: scatter of Ln(Wage) against Education with the fit after iteration 4]
α7211 = −2.076246
β7211 = 0.9369499
[Figure: scatter of Ln(Wage) against Education with the converged fit after iteration 7211]
- Computing the gradient can be very time consuming: it uses the full data set at each step!
- At the opposite extreme, Stochastic Gradient Descent just picks a random observation at every step and computes the gradient based only on that single observation.
- This makes it easier to use distributed computing.
- At each step, instead of computing the gradients based on the full data set (as in Batch GD) or on just one observation (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of observations called mini-batches.
- The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from distributed computing and hardware optimization of matrix operations, especially when using GPUs. (See the sketch below.)
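A minimal mini-batch sketch in R (the simulated data, batch size, and learning rate are all illustrative choices):

# Sketch: mini-batch gradient descent
set.seed(10)
n <- 1000
x <- runif(n, 0, 20)
y <- -2 + x + rnorm(n)
alpha <- 0; beta <- 0; eps <- 0.005
m <- 32                                          # mini-batch size
for (i in 1:5000) {
  idx <- sample(n, m)                            # draw a random mini-batch
  r <- y[idx] - alpha - beta * x[idx]
  alpha <- alpha - eps * (-2 * mean(r))          # MSE gradient on the batch
  beta  <- beta  - eps * (-2 * mean(x[idx] * r))
}
c(alpha, beta)                                   # hovers near coef(lm(y ~ x))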
4 Further Readings
https://tinyurl.com/y3nzvkwh