
Linear Regression Numerical Properties and Computation

Big Data and Machine Learning for Applied Economics

Ignacio Sarmiento-Barbieri

Universidad de los Andes



Agenda

1 Intro

2 Numerical Properties

3 OLS Computation
Traditional Computation
Gradient Descent

4 Further Readings





Intro

- Linear regression is the "work horse" of econometrics and (supervised) machine learning.
- It is very powerful in many contexts.
- There is a big "payday" to studying this model in detail.



Linear Regression Model

- If f(X) = Xβ, obtaining f(·) boils down to obtaining β

    y = Xβ + u                                   (1)

- where we can obtain β by minimizing the SSR (e′e)

    β̂ = (X′X)⁻¹X′y                               (2)





Numerical Properties

- Numerical properties have nothing to do with how the data was generated
- These properties hold for every data set, just because of the way that β̂ was calculated
- Davidson & MacKinnon, Greene, and Ruud have nice geometric interpretations
- They help when computing with big data



Projection
OLS Residuals:

    e = y − ŷ                                    (3)
      = y − Xβ̂                                   (4)

Replacing β̂,

    e = y − X(X′X)⁻¹X′y                          (5)
      = (I − X(X′X)⁻¹X′)y                        (6)

Define two matrices

- Projection matrix PX = X(X′X)⁻¹X′
- Annihilator (residual maker) matrix MX = (I − PX)

Projection

- PX = X(X′X)⁻¹X′
- MX = (I − PX)
- Both are symmetric
- Both are idempotent (AA = A)
- PX X = X, hence the name projection matrix
- MX X = 0, hence the name annihilator matrix

We can write

    SSR = e′e = u′MX u                           (7)

So we can relate the SSR to the true error term u
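
A quick numerical check of these properties in R (simulated data; forming PX explicitly is only a sanity check, sensible for small n):

set.seed(123)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

P <- X %*% solve(crossprod(X)) %*% t(X)    # PX = X(X'X)^(-1)X'
M <- diag(n) - P                           # MX = I - PX

max(abs(P - t(P)))                         # symmetric (~0)
max(abs(P %*% P - P))                      # idempotent (~0)
max(abs(P %*% X - X))                      # PX X = X (~0)
max(abs(M %*% X))                          # MX X = 0 (~0)

e <- drop(M %*% y)                         # OLS residuals
all.equal(sum(e^2), sum(residuals(lm(y ~ X - 1))^2))   # same SSR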



Frisch-Waugh-Lovell (FWL) Theorem

- Linear Model: y = Xβ + u
- Split it: y = X1β1 + X2β2 + u
- X = [X1 X2], where X is n × k, X1 is n × k1, X2 is n × k2, and k = k1 + k2
- β = [β1 β2]

Theorem
1 The OLS estimates of β2 from these equations

    y = X1β1 + X2β2 + u                          (8)
    MX1 y = MX1 X2β2 + residuals                 (9)

  are numerically identical
2 The OLS residuals from these regressions are also numerically identical
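
The theorem is easy to verify numerically. A short R sketch on simulated data (X1 includes the constant):

set.seed(123)
n  <- 200
X1 <- cbind(1, rnorm(n))                    # k1 = 2
X2 <- cbind(rnorm(n), rnorm(n))             # k2 = 2
y  <- drop(X1 %*% c(1, 0.5) + X2 %*% c(2, -1) + rnorm(n))

full <- lm(y ~ X1 + X2 - 1)                 # regression (8)

M1y  <- residuals(lm(y ~ X1 - 1))           # MX1 y
M1X2 <- residuals(lm(X2 ~ X1 - 1))          # MX1 X2, column by column
fwl  <- lm(M1y ~ M1X2 - 1)                  # regression (9)

cbind(coef(full)[3:4], coef(fwl))           # identical estimates of beta2
all.equal(residuals(full), residuals(fwl), check.attributes = FALSE)   # identical residuals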



Applications

- Why is FWL useful in the context of big volumes of data?
- It is a computationally inexpensive way of:
  - Removing nuisance parameters
    - E.g., the case of multiple fixed effects. The traditional way is to apply the within transformation with respect to the FE with the most categories and then add one dummy for each category of all the subsequent FEs
    - This is not feasible in certain instances
  - Computing certain diagnostic statistics: influential observations, R², LOOCV
  - Adding more data without having to compute everything again



Applications: Fixed Effects

- For example: Carneiro, Guimarães, & Portugal (2012), AEJ: Macroeconomics

    ln wijft = xit β + λi + θj + γf + δt + uijft             (10)

    Y = Xβ + D1λ + D2θ + D3γ + D4δ + u                       (11)

- The data set has 31.6 million observations, with 6.4 million individuals (i), 624 thousand firms (f), 115 thousand occupations (j), and 11 years (t)
- Storing the required indicator matrices would require 23.4 terabytes of memory
- From their paper:

  "In our application, we first make use of the Frisch-Waugh-Lovell theorem to remove the influence of the three high-dimensional fixed effects from each individual variable, and, in a second step, implement the final regression using the transformed variables. With a correction to the degrees of freedom, this approach yields the exact least squares solution for the coefficients and standard errors."
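
As an aside, one way to run this kind of regression in R is the fixest package, which likewise partials out high-dimensional fixed effects before the final step; this is only an illustrative sketch, and the data set and variable names below are hypothetical:

library(fixest)

# ln(wage) on x with worker, firm, job-title, and year fixed effects
est <- feols(ln_wage ~ x | worker_id + firm_id + job_title_id + year,
             data = wages)
summary(est)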



Applications: Influential Observations

Note the following

    β̂ = (X′X)⁻¹X′y                               (12)

Each element of the vector of parameter estimates β̂ is simply a weighted average of the elements of the vector y.
Let cj denote the j-th row of the matrix (X′X)⁻¹X′; then

    β̂j = cj y                                    (13)
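
A small R sketch of this point: the rows of (X′X)⁻¹X′ are the weights cj, and cj y reproduces each coefficient:

set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -1) + rnorm(n))

C <- solve(crossprod(X)) %*% t(X)      # k x n; row j is c_j
cbind(C %*% y, coef(lm(y ~ X - 1)))    # identical, column by column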



Applications: Influential Observations

[Figure sequence: scatter plot of y against x with two OLS fits, y = 19.38 + 5.44x and y = 23.37 + 5.57x, illustrating how a single influential observation shifts the fitted line]


Applications: Influential Observations
Consider a dummy variable ej, an n-vector with element j equal to 1 and the rest 0. Include it as a regressor:

    y = Xβ + αej + u                             (14)

Using FWL we can do

    Mej y = Mej Xβ + r                           (15)

- The estimates of β and the residuals from both regressions are identical
- These are the same estimates as those that would be obtained if we deleted observation j from the sample. We are going to denote this as β̂(j)

Note:
- Mej = I − ej(ej′ej)⁻¹ej′
- Mej y = y − ej(ej′ej)⁻¹ej′y = y − yj ej
- Mej X is X with the j-th row replaced by zeros


Applications: Influential Observations
Let's define a new matrix Z = [X, ej]

    y = Xβ + αej + u                             (16)
    y = Zθ + u                                   (17)

We can write it as

    y = PZ y + MZ y                              (18)
      = Xβ̂(j) + α̂ej + MZ y                       (19)

Pre-multiplying by PX (remember PX MZ = 0), and since PX y = Xβ̂,

    PX y = Xβ̂(j) + α̂PX ej                        (20)

    Xβ̂ = Xβ̂(j) + α̂PX ej                          (21)
    X(β̂ − β̂(j)) = α̂PX ej                         (22)

Applications: Influential Observations
How to calculate α̂? FWL once again:

    MX y = α̂MX ej + res                          (23)

    α̂ = (ej′MX ej)⁻¹ ej′MX y                      (24)

- ej′MX y is the j-th element of MX y, the vector of residuals from the regression including all observations
- ej′MX ej is just a scalar, the j-th diagonal element of MX

Then

    α̂ = ûj / (1 − hj)                            (25)

where hj is the j-th diagonal element of PX
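
A minimal R check of (25): the coefficient on the observation-j dummy equals the OLS residual divided by 1 − hj:

set.seed(1)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
j <- 7

ej  <- as.numeric(seq_len(n) == j)            # dummy for observation j
fit <- lm(y ~ x)

coef(lm(y ~ x + ej))["ej"]                    # alpha_hat
residuals(fit)[j] / (1 - hatvalues(fit)[j])   # u_hat_j / (1 - h_j), same number
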
Applications: Influential Observations
Finally we get

    β̂(j) − β̂ = − (1 / (1 − hj)) (X′X)⁻¹ Xj′ ûj   (26)

Influence depends on two factors (see the R sketch below):

- ûj : a large residual, related to the y coordinate
- hj : the leverage, related to the x coordinate
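
A short R sketch checking (26) against actually refitting without observation j (same simulated data as above):

set.seed(1)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
j <- 7

fit <- lm(y ~ x)
X   <- cbind(1, x)
h   <- hatvalues(fit)[j]
u   <- residuals(fit)[j]

beta_j <- coef(fit) - drop(solve(crossprod(X)) %*% X[j, ]) * u / (1 - h)

beta_j                        # beta_hat(j) from the formula
coef(lm(y[-j] ~ x[-j]))       # refit without observation j: identical
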
[Figure: the earlier scatter plot with OLS fits y = 19.38 + 5.44x and y = 23.37 + 5.57x]

HW: work out the case of y = α + βx + u (see ISLR).




QR decomposition

- Linear Model: Min Risk ⇐⇒ Min SSR

    β̂ = (X′X)⁻¹X′y                               (27)

- Involves inverting the k × k matrix X′X
- Requires allocating O(nk + k²) memory; if n is "big" we cannot store this in memory

Solving directly

beta<-solve(t(X)%*%X)%*%t(X)%*%y

may not be the smartest move



QR decomposition
Most software uses a QR decomposition.

Theorem: If A ∈ Rn×k then there exist a Q ∈ Rn×k with orthonormal columns and an upper triangular R ∈ Rk×k such that A = QR

- Orthogonal matrices:
  - Def: Q′Q = QQ′ = I, i.e. Q′ = Q⁻¹
  - Prop: the product of orthogonal matrices is orthogonal, e.g. if A′A = I and B′B = I then (AB)′(AB) = B′(A′A)B = B′B = I
- (Thin QR) If A ∈ Rn×k has full column rank then A = Q1R1 and the factorization is unique, where Q1 ∈ Rn×k and R1 is upper triangular with positive diagonal entries

We can use this to get β̂:

    (X′X)β̂ = X′y                                 (28)
    (R′Q′QR)β̂ = R′Q′y                            (29)
    (R′R)β̂ = R′Q′y                               (30)
    Rβ̂ = Q′y                                     (31)

Solve by back substitution

QR decomposition

   
        [ 1  2 ]          [ 1 ]
    X = [ 1  2 ]      y = [ 4 ]                  (32)
        [ 1  3 ]          [ 2 ]

1. QR factorization X = QR

        [ −0.57  −0.41 ]        [ −1.73  −4.04 ]
    Q = [ −0.57  −0.41 ]    R = [     0    0.81 ]  (33)
        [ −0.57   0.82 ]

2. Calculate Q′y = [−4.04, −0.41]′

3. Solve by back substitution

    [ −1.73  −4.04 ] [ β1 ]   [ −4.04 ]
    [     0    0.81 ] [ β2 ] = [ −0.41 ]           (34)

The solution is (3.5, −0.5)
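
A sketch of the same computation with base R's qr() (the column signs of Q and R may differ across implementations, but the solution is identical):

X <- matrix(c(1, 1, 1,
              2, 2, 3), ncol = 2)
y <- c(1, 4, 2)

qr_X <- qr(X)
Q <- qr.Q(qr_X)                     # n x k, orthonormal columns
R <- qr.R(qr_X)                     # k x k, upper triangular

backsolve(R, crossprod(Q, y))       # back substitution: 3.5, -0.5
qr.solve(X, y)                      # wraps the same steps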



QR decomposition
This is actually what R does under the hood

Note that R’s lm also returns many objects that have the same size as X and Y





Gradient-Based Descent

- Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
- The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

[Figure from Boehmke & Greenwell (2019)]





Batch Gradient Descent

- To implement Gradient Descent, you need to compute the gradient of the cost function with respect to each model parameter:

    β′ = β − e ∇β f(β)                           (35)

- In other words, you need to calculate how much the cost function will change if you change β just a little bit.
- You also need to define the learning step e (a minimal sketch of the update rule follows below).
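
A minimal sketch of this update rule in R, on a one-dimensional toy objective f(β) = (β − 3)²:

grad_f <- function(b) 2 * (b - 3)     # gradient of f

b <- 0                                # starting value
e <- 0.1                              # learning step
for (i in 1:100) b <- b - e * grad_f(b)
b                                     # converges to the minimizer, 3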



Batch Gradient Descent

[Figure from Boehmke & Greenwell (2019)]

- We can choose e in several different ways:
  - A popular approach is to set e to a small constant.
  - Sometimes, we can solve for the step size that makes the directional derivative vanish.
  - Another approach is to evaluate f(θ − e∇θ f(θ)) for several values of e and choose the one that results in the smallest objective function value. This is called a line search.

Batch Gradient Descent

- Notice that this formula involves calculations over the full data set X at each Gradient Descent step!
- This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.
- As a result, it is terribly slow on very large data sets.
- However, Gradient Descent scales well with the number of variables; estimating a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equations or any decomposition.



Batch Gradient Descent: Example

    log(wage)   Education (years)
       5                 8
      10                12
      12.5              16

[Figure: scatter plot of Ln(Wage) against Education]


Batch Gradient Descent: Example
    log(wage)   Education (years)
       5                 8
      10                12
      12.5              16

    β̂ = (X′X)⁻¹X′y

beta<-solve(t(X)%*%X)%*%t(X)%*%y
lm(y~x,data)

    y = −2.0833 + 0.9375 × Educ

[Figure: scatter plot of Ln(Wage) against Education with the OLS fit]

Batch Gradient Descent: Example
The Risk Function (the SSR scaled by 1/n, i.e. the mean squared error; the 1/n only rescales the learning step)

    f(θ) = (1/n) ∑_{i=1}^{n} (yi − (α + βxi))²

The Gradient

    ∂f/∂α = −(2/n) ∑_{i=1}^{n} (yi − α − βxi)
    ∂f/∂β = −(2/n) ∑_{i=1}^{n} xi (yi − α − βxi)

Updating

    α′ = α − e ∂f/∂α
    β′ = β − e ∂f/∂β
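
A sketch implementing these updates in R for the three-observation example (starting values and learning rate taken from the first-iteration slide below); running it for 7,211 iterations reproduces the path shown on the next slides:

x <- c(8, 12, 16)
y <- c(5, 10, 12.5)

alpha <- -1; beta <- 2
e <- 0.005
n <- length(y)

for (i in 1:7211) {
  res   <- y - alpha - beta * x
  g_a   <- -(2 / n) * sum(res)        # df/dalpha
  g_b   <- -(2 / n) * sum(x * res)    # df/dbeta
  alpha <- alpha - e * g_a
  beta  <- beta  - e * g_b
}
c(alpha, beta)    # approx (-2.076, 0.937); the OLS solution is (-2.0833, 0.9375)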


Batch Gradient Descent: Example
First Iteration

Start with an initial guess α = −1, β = 2, and a learning rate e = 0.005. Then we have

    α1 = (−1) − 0.005 × (−(2/3)) ((5 − (−1) − 2 × 8) + (10 − (−1) − 2 × 12) + (12.5 − (−1) − 2 × 16))
    β1 = 2 − 0.005 × (−(2/3)) (8 (5 − (−1) − 2 × 8) + 12 (10 − (−1) − 2 × 12) + 16 (12.5 − (−1) − 2 × 16))

    α1 = −1.1384
    β1 = 0.2266

[Figure: scatter plot of Ln(Wage) against Education with the fitted line after one iteration]


Batch Gradient Descent: Example
Second Iteration

Starting now from α1 = −1.1384 and β1 = 0.2266 (still with e = 0.005), we have

    α2 = (−1.1384) − 0.005 × (−(2/3)) ((5 − (−1.1384) − 0.2266 × 8) + (10 − (−1.1384) − 0.2266 × 12) + (12.5 − (−1.1384) − 0.2266 × 16))
    β2 = 0.2266 − 0.005 × (−(2/3)) (8 (5 − (−1.1384) − 0.2266 × 8) + 12 (10 − (−1.1384) − 0.2266 × 12) + 16 (12.5 − (−1.1384) − 0.2266 × 16))

    α2 = −1.0624
    β2 = 1.212689

[Figure: scatter plot of Ln(Wage) against Education with the fitted line after two iterations]


Batch Gradient Descent: Example
Third Iteration

    α3 ≈ −1.1056
    β3 ≈ 0.6646

[Figure: scatter plot of Ln(Wage) against Education with the fitted line after three iterations]


Batch Gradient Descent: Example
Fourth Iteration

    α4 = −1.082738
    β4 = 0.9693922

[Figure: scatter plot of Ln(Wage) against Education with the fitted line after four iterations]


Batch Gradient Descent: Example
Iteration 7,211

    α7211 = −2.076246
    β7211 = 0.9369499

    yOLS = −2.0833 + 0.9375 × Educ

[Figure: scatter plot of Ln(Wage) against Education with the fitted line after 7,211 iterations, close to the OLS fit]


Stochastic Gradient-Based Optimization

- Computing the gradient can be very time consuming: it uses the full data set at each step!
- At the opposite extreme, Stochastic Gradient Descent just picks a random observation at every step and computes the gradient based only on that single observation.
- It makes it easier to use distributed computing.



Stochastic Gradient-Based Optimization

[Figure from Boehmke & Greenwell (2019)]

- This makes the algorithm faster, but:
  - It adds some randomness to the descent along the risk function's gradient.
  - Although this randomness does not allow the algorithm to find the absolute global minimum, it can actually help the algorithm jump out of local minima and off plateaus to get sufficiently near the global minimum (see the sketch below).
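
A sketch of this idea in R for the same three-observation example; each step uses one randomly drawn observation, and a smaller constant learning step keeps the updates stable (drawing a small random subset instead of a single row would give the mini-batch variant discussed next):

set.seed(42)
x <- c(8, 12, 16)
y <- c(5, 10, 12.5)

alpha <- -1; beta <- 2
e <- 0.001

for (i in 1:50000) {
  j     <- sample(length(y), 1)          # one observation at random
  res   <- y[j] - alpha - beta * x[j]
  alpha <- alpha + 2 * e * res           # minus the gradient of (res)^2
  beta  <- beta  + 2 * e * res * x[j]
}
c(alpha, beta)    # jitters around the OLS solution (-2.0833, 0.9375) without settling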



Mini-batch Gradient Descent

- At each step, instead of computing the gradients on the full data set (as in Batch GD) or on just one observation (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of observations called mini-batches.
- The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from distributed computing and hardware optimization of matrix operations, especially when using GPUs.



Parallel vs Distributed
- An algorithm is parallel if it does many computations at once.
  - It needs to see all of the data.
- It is distributed if you can work with subsets of the data.
- Stata-MP is parallel (the license charges by core).
- R and Python can be both parallel and distributed.

https://tinyurl.com/y3nzvkwh



Further Readings
- Boehmke, B., & Greenwell, B. (2019). Hands-On Machine Learning with R. Chapman and Hall/CRC.
- Carneiro, A., Guimarães, P., & Portugal, P. (2012). Real Wages and the Business Cycle: Accounting for Worker, Firm, and Job Title Heterogeneity. American Economic Journal: Macroeconomics, 4(2): 133-152.
- Davidson, R., & MacKinnon, J. G. (2004). Econometric Theory and Methods. New York: Oxford University Press.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Greene, W. H. (2003). Econometric Analysis, fifth edition. New Jersey: Prentice Hall.
- Ruud, P. A. (2000). An Introduction to Classical Econometric Theory. OUP Catalogue.
- Taddy, M. (2019). Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions. McGraw Hill Professional.
