
Instrumental Variables: 2-Stage and 3-Stage Least Squares Regression of a Linear System of Equations


2009 LPGA Performance Statistics and
Prize Winnings
www.lpga.com
S.J. Callan and J.M. Thomas (2007). Modeling the Determinants of a Professional Golfer's Tournament Earnings, Journal of Sports Economics, Vol. 8, No. 4, pp. 394-411

Data Description
Prize Winnings and Performance Statistics for n = 146 professional women (LPGA) golfers for the 2009 season
Exogenous Performance Variables:
Average Driving Distance
Percentage of Fairways Reached on Drive
Percentage of Greens Reached in Regulation
Percentage of Sand Saves (in hole in 2 shots from close traps)
Average Putts per Hole on Greens Reached in Regulation
Numbers of Events, Events Completed, Rounds

Endogenous Result (Dependent & Independent) Variables:
Average Score per Round
Average Rank (Percentile in Tournaments)
Log(Prize Winnings)

Variables in Systems of Equations


Endogenous Variables: Jointly dependent (response) variables that are system determined. They can also appear as predictor variables in other equations
Exogenous Variables: Independent variables that do not depend on the endogenous variables
Predetermined Variables: Exogenous and lagged Endogenous variables
Instrumental Variables: Predetermined variables used to predict endogenous variables in first-stage regressions, with

System of Equations (Callan and Thomas, 2007)

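1. Average Score (per 18 holes) is related to the golfer's skills and experience (number of rounds played)
2. Average Rank (transformed to percentile) in tournaments is related to average score and the number of events she competed in
3. Season Earnings is related to average rank and the number of tournaments she completed

$$\text{SCORE}_i = \alpha_0 + \alpha_D D_i + \alpha_F F_i + \alpha_G G_i + \alpha_S S_i + \alpha_P P_i + \alpha_R R_i + \varepsilon_{1i}$$
$$\text{Rank}_i = \beta_0 + \beta_{SCORE}\,\text{SCORE}_i + \beta_E E_i + \varepsilon_{2i}$$
$$\ln(\text{Prize}_i) = \gamma_0 + \gamma_{RANK}\,\text{Rank}_i + \gamma_C C_i + \varepsilon_{3i}$$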

Potential Problems with Endogenous Predictors

When endogenous variables are included as predictors, they can be correlated with the error term for that equation, particularly when there are omitted variables that may be related to the outcome. This causes Ordinary Least Squares estimates to be biased and inconsistent.
In equation 2, SCORE may be correlated with the error term without a variable measuring average course difficulty (Callan and Thomas, p. 402).
In equation 3, Rank may be correlated with the error term without a variable measuring the golfer's human capital investment, such as diet and concentration level (Callan and Thomas, p. 402).
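
The bias described above can be illustrated with a small simulation. This is only a sketch on made-up data, not the LPGA data: the names `z`, `u`, `x`, `y`, the true slope of 2.0, and the random seed are all assumptions chosen for illustration. An omitted variable `u` drives both the predictor and the response, so the OLS slope drifts away from the true value, while a simple instrumental-variable slope based on `z` stays close to it.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def iv_slope(z, x, y):
    """Simple IV slope with one instrument: cov(z, y) / cov(z, x)."""
    n = len(z)
    mz, mx, my = sum(z) / n, sum(x) / n, sum(y) / n
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    szx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    return szy / szx

random.seed(2009)            # arbitrary seed, for reproducibility only
n = 5000
true_beta = 2.0              # assumed true effect of x on y

z = [random.gauss(0, 1) for _ in range(n)]   # exogenous instrument
u = [random.gauss(0, 1) for _ in range(n)]   # omitted variable
# x depends on the instrument AND on the omitted variable (endogenous)
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
# y depends on x, and the omitted variable ends up in the error term
y = [true_beta * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

ols = ols_slope(x, y)
iv = iv_slope(z, x, y)
print(f"true slope: {true_beta}  OLS: {ols:.3f}  IV: {iv:.3f}")
```

In this setup the OLS slope converges to about 3 rather than 2, because the predictor shares the omitted variable with the error term; the IV slope uses only the variation in `x` that comes from `z`.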

Model Building Process


1. Regress all endogenous variables (Score, Rank, and ln(Prize)) on all exogenous variables.
2. Obtain the predicted values for each endogenous variable, based on the regressions from step 1.
3. In the system of equations, replace any right-hand-side endogenous predictors with their fitted values from step 2.
4. Note that software (e.g. SAS and STATA) will fit all the regressions in step 1, even for an endogenous variable that does not appear as a predictor (ln(Prize) in this example).
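
The steps above can be sketched by hand for the simplest case of one endogenous predictor and one exogenous variable. The data below are hypothetical toy numbers, and `simple_ols` is a helper written for this illustration only. The cross-check at the end uses the fact that, with a single instrument, the second-stage slope equals the reduced-form slope divided by the first-stage slope.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical toy data: z is exogenous, x is the endogenous predictor.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
y = [5.0, 6.8, 9.1, 10.2, 13.0, 14.1]

# Steps 1-2: first-stage regression of x on z, then its fitted values.
a1, b1 = simple_ols(z, x)
x_hat = [a1 + b1 * zi for zi in z]

# Step 3: second stage, with x replaced by its fitted values.
a2, b2 = simple_ols(x_hat, y)

# Cross-check: reduced-form slope (y on z) divided by first-stage slope.
_, reduced_form = simple_ols(z, y)
iv_direct = reduced_form / b1

print(f"second-stage slope: {b2:.6f}  direct IV slope: {iv_direct:.6f}")
```

Only the coefficients from the second stage should be taken from a hand calculation like this; the standard errors printed by an ordinary regression routine in the second stage are not the 2SLS standard errors.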

First Stage Regressions for Score and Rank

The fitted (predicted) values for SCORE will be used in equation 2 in place of SCORE, and the fitted values for RANK in equation 3. Equation 1 has no right-hand-side endogenous predictors.

Equation 1) - SCORE is related to SKILLS and experience


All variables except average driving distance are significant.
All else equal:
Average SCORE decreases as Percent Fairways Hit increases (a 10% increase in fairways hit corresponds to a 0.19 decrease in SCORE)
Average SCORE decreases by 1.36 with a 10% increase in Greens in Regulation
Average SCORE decreases by 0.16 with a 10% increase in Sand Saves
Average SCORE increases by 1.32 with a 0.1 increase in putts per hole

Equation 2) - Rank is related to SCORE and Events


Rank (as Percentile, with 100 meaning the golfer won every tournament she played in) is:
Negatively associated with predicted SCORE (decreases by 12.5 with a unit increase in average SCORE)
Positively associated with number of Events (increases by 0.28 with a unit increase in # of EVENTS played)
Note: The estimated coefficients are correct, but the standard errors, t-tests, and Analysis of Variance are incorrect, because the second-stage regression computes its residuals from the fitted rather than the actual values of the endogenous predictor.

Equation 3) - ln(Prize) is related to Rank and Completed Events

Prize Winnings (in log form):
Increase with (Predicted) Rank. A 10% increase in Rank (percentile) increases ln(Prize) by 0.56
Increase with Completed Events. For each tournament completed, ln(Prize) increases by 0.080.
Note: The estimated coefficients are correct, but the standard errors, t-tests, and Analysis of Variance are
incorrect (see slide

Matrix Approach: Models w/ Endogenous Predictors


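Z: Matrix of Instrumental Variables: Intercept and 8 Exogenous variables
  Intercept, Drive, Fairway, Greens, SandSave, Putts, Rounds, Events, Completed
X: Matrix of Predictors for Model:
  Model 2: Intercept, Score (Actual, not predicted), Events
  Model 3: Intercept, Rank, Completed
Y: Vector of Responses:
  Model 2: Rank
  Model 3: ln(Prize)

2-Stage Least Squares Estimator and Estimated Variance-Covariance Matrix:

$$\hat{\beta}_{2SLS} = \left[X'Z(Z'Z)^{-1}Z'X\right]^{-1} X'Z(Z'Z)^{-1}Z'Y = \left(X'P_Z X\right)^{-1} X'P_Z Y \qquad P_Z = Z(Z'Z)^{-1}Z'$$

$$\hat{V}\left(\hat{\beta}_{2SLS}\right) = s^2 \left(X'P_Z X\right)^{-1} \qquad s^2 = \frac{SSE}{n - \operatorname{rank}(X)}$$

$$SSE = \left(Y - X\hat{\beta}_{2SLS}\right)'\left(Y - X\hat{\beta}_{2SLS}\right)$$

$$SSR = Y'\left[P_Z X\left(X'P_Z X\right)^{-1}X'P_Z - \frac{1}{n}J_n\right]Y \qquad R^2 = \frac{SSR}{SSR + SSE}$$

where $J_n$ is the $n \times n$ matrix of ones.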

Model 2: Rank = f(Score, Events)

Model 3: ln(Prize) = f(Rank, Completed)
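
As a rough illustration of the matrix form of the 2SLS estimator, the sketch below computes [X'Z (Z'Z)^-1 Z'X]^-1 X'Z (Z'Z)^-1 Z'Y using only the Python standard library. The helper functions and the toy data are assumptions made for this example; the response is generated without noise, so the estimator should return the generating coefficients (1, 2, -0.5) up to rounding. In practice a linear algebra library would replace the hand-written matrix routines.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def inverse(A):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("matrix is singular")
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]

def two_sls(X, Z, y):
    """2SLS coefficients; X and Z are lists of rows, y is a flat list."""
    Y = [[v] for v in y]
    Xt, Zt = transpose(X), transpose(Z)
    A = matmul(matmul(Xt, Z), inverse(matmul(Zt, Z)))      # X'Z (Z'Z)^-1
    beta = matmul(inverse(matmul(A, matmul(Zt, X))),        # (X'P_Z X)^-1
                  matmul(A, matmul(Zt, Y)))                 # X'P_Z Y
    return [b[0] for b in beta]

# Hypothetical toy data: z1, z2 exogenous; x endogenous.
z1 = [1, 2, 3, 4, 5, 6]
z2 = [1, 0, 2, 1, 3, 0]
x = [1.5, 1.0, 3.2, 2.9, 5.1, 2.2]
Z = [[1, a, b] for a, b in zip(z1, z2)]        # instruments
X = [[1, xi, a] for xi, a in zip(x, z1)]       # model predictors (actual x)
y = [1 + 2 * xi - 0.5 * a for xi, a in zip(x, z1)]   # noise-free response

beta = two_sls(X, Z, y)
print([round(b, 6) for b in beta])
```

The function never forms the n x n projection matrix P_Z; it works with the small cross-product matrices X'Z, Z'Z, and Z'Y, which is the usual design choice when n is large.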

Robust Estimate of Variance of 2SLS Estimator


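$$V\left(\varepsilon_{2i}\right) = \sigma_{2i}^2 \qquad V\left(\varepsilon_2\right) = \Sigma_2 = \begin{bmatrix} \sigma_{21}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{22}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{2n}^2 \end{bmatrix}$$

$$V\left(\hat{\beta}_{2SLS}\right) = V\left[\left(X'P_Z X\right)^{-1}X'P_Z Y\right] = \left(X'P_Z X\right)^{-1}X'P_Z \Sigma_2 P_Z X\left(X'P_Z X\right)^{-1}$$

$$= \left(X'P_Z X\right)^{-1}X'Z(Z'Z)^{-1}\,Z'\Sigma_2 Z\,(Z'Z)^{-1}Z'X\left(X'P_Z X\right)^{-1}$$

Replacing $Z'\Sigma_2 Z$ with its estimator:

$$S = Z'\begin{bmatrix} e_{21}^2 & 0 & \cdots & 0 \\ 0 & e_{22}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e_{2n}^2 \end{bmatrix}Z = \sum_{i=1}^{n} e_{2i}^2\, z_i z_i' \qquad e_{2i} = Y_{2i} - x_i'\hat{\beta}_{2SLS}$$

$$\hat{V}\left(\hat{\beta}_{2SLS}\right) = \left(X'P_Z X\right)^{-1}X'Z(Z'Z)^{-1}\,S\,(Z'Z)^{-1}Z'X\left(X'P_Z X\right)^{-1}$$

where $z_i'$ and $x_i'$ are the $i$-th rows of Z and X:

$$Z = \begin{bmatrix} z_1' \\ z_2' \\ \vdots \\ z_n' \end{bmatrix} \qquad X = \begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{bmatrix}$$

Exact same method for equation 3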
Results for Model 2: Rank = f(Score, Events)

Results for Model 3: ln(Prize) = f(Rank,Completed)
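
To make the robust (sandwich) variance concrete, the sketch below works through the simplest possible case: one predictor, one instrument, and no intercept, so that every matrix in the formula is a scalar. In that case the estimator reduces to sum(z*y) / sum(z*x), the model-based variance to s^2 * sum(z^2) / sum(z*x)^2, and the robust variance to sum(e^2 * z^2) / sum(z*x)^2. The function name and the three data points are hypothetical and chosen only so the arithmetic can be checked by hand.

```python
def iv_scalar(z, x, y):
    """2SLS slope with model-based and robust variance (scalar case).

    One predictor x, one instrument z, no intercept, so X'P_Z X,
    Z'Z and S are all scalars.
    """
    n = len(y)
    szx = sum(a * b for a, b in zip(z, x))     # Z'X
    szy = sum(a * b for a, b in zip(z, y))     # Z'Y
    szz = sum(a * a for a in z)                # Z'Z
    beta = szy / szx

    # Residuals use the ACTUAL x, not its first-stage fitted values.
    e = [yi - xi * beta for xi, yi in zip(x, y)]

    # Model-based variance: s^2 (X'P_Z X)^-1, with rank(X) = 1.
    s2 = sum(r * r for r in e) / (n - 1)
    v_model = s2 * szz / szx ** 2

    # Robust variance: S = sum of e_i^2 z_i^2 in the middle of the sandwich.
    S = sum(r * r * a * a for r, a in zip(e, z))
    v_robust = S / szx ** 2
    return beta, v_model, v_robust

# Hypothetical three-point data set, small enough to verify by hand.
z = [1.0, 2.0, 3.0]
x = [1.0, 2.0, 2.0]
y = [2.0, 3.0, 5.0]

beta, v_model, v_robust = iv_scalar(z, x, y)
print(f"beta = {beta:.4f}, model-based var = {v_model:.4f}, "
      f"robust var = {v_robust:.4f}")
```

With these numbers sum(z*x) = 11 and sum(z*y) = 23, so the slope is 23/11; the two variance estimates differ because the squared residuals are not constant across observations.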

3-Stage Least Squares


Extension of 2-Stage Least Squares that allows for a covariance structure among the system of equations
Errors from 2SLS are obtained, and used to estimate the within-individual (golfer) variance-covariance structure among the equations
The response vector is stacked, with the n responses from model 1 stacked over the n responses from model 2, which are stacked over the n responses from model 3.

Model Description - I

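Model 1: $Y_{1i} = \text{SCORE}_i = \alpha_0 + \alpha_D D_i + \alpha_F F_i + \alpha_G G_i + \alpha_S S_i + \alpha_P P_i + \alpha_R R_i + \varepsilon_{1i}$

Model 2: $Y_{2i} = \text{Rank}_i = \beta_0 + \beta_{SCORE}\,\text{SCORE}_i + \beta_E E_i + \varepsilon_{2i}$

Model 3: $Y_{3i} = \ln(\text{Prize}_i) = \gamma_0 + \gamma_{RANK}\,\text{Rank}_i + \gamma_C C_i + \varepsilon_{3i}$

$$Y_1 = \begin{bmatrix} Y_{11} \\ Y_{12} \\ \vdots \\ Y_{1,146} \end{bmatrix} \qquad Y_2 = \begin{bmatrix} Y_{21} \\ Y_{22} \\ \vdots \\ Y_{2,146} \end{bmatrix} \qquad Y_3 = \begin{bmatrix} Y_{31} \\ Y_{32} \\ \vdots \\ Y_{3,146} \end{bmatrix} \qquad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}$$

$$X_1 = \begin{bmatrix} 1 & D_1 & F_1 & G_1 & S_1 & P_1 & R_1 \\ 1 & D_2 & F_2 & G_2 & S_2 & P_2 & R_2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & D_{146} & F_{146} & G_{146} & S_{146} & P_{146} & R_{146} \end{bmatrix} \qquad X_2 = \begin{bmatrix} 1 & SC_1 & E_1 \\ 1 & SC_2 & E_2 \\ \vdots & \vdots & \vdots \\ 1 & SC_{146} & E_{146} \end{bmatrix} \qquad X_3 = \begin{bmatrix} 1 & RA_1 & C_1 \\ 1 & RA_2 & C_2 \\ \vdots & \vdots & \vdots \\ 1 & RA_{146} & C_{146} \end{bmatrix}$$

$$X = \begin{bmatrix} X_1 & 0 & 0 \\ 0 & X_2 & 0 \\ 0 & 0 & X_3 \end{bmatrix}$$

$e_{ki} = Y_{ki} - \hat{Y}_{ki}$, $k = 1, 2, 3$ are residuals from 2-Stage Least Squares Regressions

$$S_{11} = \frac{1}{146 - 7}\sum_{i=1}^{146} e_{1i}^2 \qquad S_{12} = \frac{1}{146 - (7+3)/2}\sum_{i=1}^{146} e_{1i}e_{2i}$$

and so on for $S_{13}$, $S_{22}$, $S_{23}$, $S_{33}$

$$S = \begin{bmatrix} S_{11} & S_{12} & S_{13} \\ S_{21} & S_{22} & S_{23} \\ S_{31} & S_{32} & S_{33} \end{bmatrix} \qquad W = S^{-1} \otimes Z(Z'Z)^{-1}Z' = S^{-1} \otimes P_Z$$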
Model Description - II
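$$\hat{\beta}_{3SLS} = \left(X'WX\right)^{-1}X'WY = \left\{X'\left[S^{-1} \otimes Z(Z'Z)^{-1}Z'\right]X\right\}^{-1} X'\left[S^{-1} \otimes Z(Z'Z)^{-1}Z'\right]Y$$

$$\hat{V}\left(\hat{\beta}_{3SLS}\right) = \left(X'WX\right)^{-1} = \left\{X'\left[S^{-1} \otimes Z(Z'Z)^{-1}Z'\right]X\right\}^{-1}$$

where:

$$S^{-1} = \begin{bmatrix} S^{11} & S^{12} & S^{13} \\ S^{21} & S^{22} & S^{23} \\ S^{31} & S^{32} & S^{33} \end{bmatrix} \qquad W = \begin{bmatrix} S^{11}P_Z & S^{12}P_Z & S^{13}P_Z \\ S^{21}P_Z & S^{22}P_Z & S^{23}P_Z \\ S^{31}P_Z & S^{32}P_Z & S^{33}P_Z \end{bmatrix}$$

$$X'WX = \begin{bmatrix} S^{11}X_1'P_Z X_1 & S^{12}X_1'P_Z X_2 & S^{13}X_1'P_Z X_3 \\ S^{21}X_2'P_Z X_1 & S^{22}X_2'P_Z X_2 & S^{23}X_2'P_Z X_3 \\ S^{31}X_3'P_Z X_1 & S^{32}X_3'P_Z X_2 & S^{33}X_3'P_Z X_3 \end{bmatrix}$$

$$X'WY = \begin{bmatrix} S^{11}X_1'P_Z Y_1 + S^{12}X_1'P_Z Y_2 + S^{13}X_1'P_Z Y_3 \\ S^{21}X_2'P_Z Y_1 + S^{22}X_2'P_Z Y_2 + S^{23}X_2'P_Z Y_3 \\ S^{31}X_3'P_Z Y_1 + S^{32}X_3'P_Z Y_2 + S^{33}X_3'P_Z Y_3 \end{bmatrix}$$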
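
The block structure of W comes from the Kronecker product: block (j, k) of the product of S^-1 and P_Z is the (j, k) element of S^-1 times the whole matrix P_Z. The sketch below shows this with a hand-written `kron` helper and deliberately tiny, made-up matrices (two equations, n = 2); neither the helper nor the numbers come from the golf example.

```python
def kron(A, B):
    """Kronecker product: block (j, k) of the result equals A[j][k] * B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

# Hypothetical inverse covariance matrix for a two-equation system.
S_inv = [[2.0, -1.0],
         [-1.0, 1.0]]

# Projection matrix P_Z for n = 2 when Z is a single column of ones.
P_Z = [[0.5, 0.5],
       [0.5, 0.5]]

W = kron(S_inv, P_Z)
for row in W:
    print(row)
```

The upper-left 2 x 2 block of W is 2.0 times P_Z and the off-diagonal blocks are -1.0 times P_Z, matching the layout of W given above. For the golf system W would be 438 x 438 (3 equations times 146 golfers), which is why the block expressions for X'WX and X'WY are used instead of forming W explicitly.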

Estimation Results

[Output tables for EQ 1, EQ 2, and EQ 3]

SAS Program
data lpga2009;
  infile 'lpga2009.dat';
  input golfer drive fairway green putts sandsv prize lnprize
        events girputts complete aveposrank rounds strokes;
  lnprize1 = log(prize);   /* natural log of prize winnings */
run;

/* 2-Stage Least Squares: the 8 exogenous variables are the instruments */
proc syslin 2sls data=lpga2009 out=regout;
  instruments drive fairway green girputts sandsv rounds events complete;
  strokes: model strokes = drive fairway green girputts sandsv rounds;
           output residual=e1;
  rank:    model aveposrank = strokes events;
           output residual=e2;
  prize:   model lnprize1 = aveposrank complete;
           output residual=e3;
run;

/* 3-Stage Least Squares on the same system of equations */
proc syslin 3sls data=lpga2009 itprint out=regout3;
  instruments drive fairway green girputts sandsv rounds events complete;
  strokes: model strokes = drive fairway green girputts sandsv rounds / xpx;
           output residual=e1;
  rank:    model aveposrank = strokes events / xpx;
           output residual=e2;
  prize:   model lnprize1 = aveposrank complete / xpx;
           output residual=e3;
run;

STATA Program
insheet using lpga_2009_meq.csv

generate lnprize=ln(prize)

* 2-Stage Least Squares for the three-equation system
reg3 (avestrokes = drive fairway green sandsvpct girputtshole rounds) ///
     (averagepospct = avestrokes events) ///
     (lnprize = averagepospct completed), 2sls

* 3-Stage Least Squares for the same system
reg3 (avestrokes = drive fairway green sandsvpct girputtshole rounds) ///
     (averagepospct = avestrokes events) ///
     (lnprize = averagepospct completed), 3sls
