TS101 (ver. 1.5)
Oscar Torres-Reyna
Data Consultant
otorres@princeton.edu
http://dss.princeton.edu/training/
PU/DSS/OTR
Date variable
-----STATA 10.x/11.x:
gen datevar = date(date1,"DMY", 2012)
format datevar %td /*For daily data*/
-----STATA 9.x:
gen datevar = date(date1,"dmy", 2012)
format datevar %td /*For daily data*/
If you have a format like date2, type:
-----STATA 10.x/11.x:
gen datevar = date(date2,"MDY", 2012)
format datevar %td /*For daily data*/
------STATA 9.x:
gen datevar = date(date2,"mdy", 2012)
format datevar %td /*For daily data*/
If you have a format like date3, type:
------STATA 10.x/11.x:
tostring date3, gen(date3a)
gen datevar=date(date3a,"YMD")
format datevar %td /*For daily data*/
------STATA 9.x:
tostring date3, gen(date3a)
gen year=substr(date3a,1,4)
gen month=substr(date3a,5,2)
gen day=substr(date3a,7,2)
destring year month day, replace
gen datevar1 = mdy(month,day,year)
format datevar1 %td /*For daily data*/
If you have a format like date4:
See http://www.princeton.edu/~otorres/Stata/
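Outside Stata, you can sanity-check these conversions with Python's datetime.strptime, which plays the same role as date() with a mask. The mask-to-format mapping below is my own illustration, not part of the handout:

```python
# Sketch: Stata-style date masks mapped to strptime format strings (illustrative).
from datetime import datetime

MASKS = {
    "DMY": "%d%b%Y",    # e.g. 21nov2006
    "MDY": "%m/%d/%Y",  # e.g. 11/21/2006
    "YMD": "%Y%m%d",    # e.g. 20061121 (date3 after tostring)
}

def parse_date(s: str, mask: str) -> datetime:
    """Parse a date string the way Stata's date(s, mask) would."""
    return datetime.strptime(s, MASKS[mask])
```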
2
PU/DSS/OTR
gen week = weekly(stringvar,"wy")
gen month = monthly(stringvar,"my")
gen quarter = quarterly(stringvar,"qy")
gen half = halfyearly(stringvar,"hy")
gen year = yearly(stringvar,"y")
NOTE: Remember to format the date variable accordingly. After creating it type:
format datevar %t?
Change ? with the correct format: w (week), m (monthly), q (quarterly), h (half), y (yearly).
If the components of the original date are stored in separate numeric variables:

gen daily = mdy(month,day,year)
gen week = yw(year, week)
gen month = ym(year, month)
gen quarter = yq(year, quarter)
gen half = yh(year, half)
To extract days of the week (Monday, Tuesday, etc.) use the function dow()
gen dayofweek= dow(date)
Replace date with the date variable in your dataset. This will create the variable dayofweek where 0 is Sunday, 1 is
Monday, etc. (type help dow for more details)
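Note the convention: Stata's dow() starts the week at Sunday (0), while Python's weekday() starts at Monday (0). A one-line conversion, as a sketch (the function name is mine):

```python
from datetime import date

def stata_dow(d: date) -> int:
    """Day of week in Stata's dow() convention: 0=Sunday ... 6=Saturday."""
    return (d.weekday() + 1) % 7   # Python weekday(): 0=Monday ... 6=Sunday
```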
To specify a range of dates (or integers in general) you can use the tin() and twithin() functions. tin() includes the
first and last date, twithin() does not. Use the format of the date variable in your dataset.
/* Make sure to set your data as time series before using tin/twithin */
tsset date
regress y x1 x2 if tin(01jan1995,01jun1995)
regress y x1 x2 if twithin(01jan2000,01jan2001)
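The only difference between the two functions is whether the endpoints are included; a minimal Python sketch of the same logic (function names mirror Stata's, everything else is mine):

```python
from datetime import date

def tin(d, a, b):
    """Inclusive range check, like Stata's tin(): a <= d <= b."""
    return a <= d <= b

def twithin(d, a, b):
    """Exclusive range check, like Stata's twithin(): a < d < b."""
    return a < d < b
```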
3
PU/DSS/OTR
. tsset datevar
If you have gaps in your time series (for example, no data are available for weekends), using lags across those missing dates complicates the analysis. In this case you may want to create a continuous time trend as follows:
gen time = _n
Then use it to set the time series:
tsset time
In the case of cross-sectional time series (panel data), type:
sort panel date
by panel: gen time = _n
xtset panel time
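The `by panel: gen time = _n` trick simply numbers observations consecutively within each panel, ignoring calendar gaps. A Python sketch of the same idea, assuming rows are already sorted by panel and date (all names are mine):

```python
# Number observations 1, 2, 3, ... within each panel, like by panel: gen time = _n.
from itertools import groupby

def panel_time(rows):
    """rows: list of (panel, date) tuples sorted by panel then date.
    Returns (panel, date, time) with a gap-free time index per panel."""
    out = []
    for panel, grp in groupby(rows, key=lambda r: r[0]):
        for t, (_, d) in enumerate(grp, start=1):
            out.append((panel, d, t))
    return out
```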
Subsetting: tin/twithin
Once the data are tsset (time series set) you can subset with two functions: tin ("times in", from a to b, inclusive) and twithin ("times within", strictly between a and b, excluding the endpoints). If you have yearly data, just include the years.
. list datevar unemp if tin(2000q1,2000q4)

       +---------------------+
       | datevar       unemp |
       |---------------------|
  173. |  2000q1    4.033333 |
  174. |  2000q2    3.933333 |
  175. |  2000q3           4 |
  176. |  2000q4         3.9 |
       +---------------------+

. list datevar unemp if twithin(2000q1,2000q4)

       +---------------------+
       | datevar       unemp |
       |---------------------|
  174. |  2000q2    3.933333 |
  175. |  2000q3           4 |
       +---------------------+
Merge/Append
See http://dss.princeton.edu/training/Merge101.pdf
Lag operators

. generate unempL1=L1.unemp
(1 missing value generated)

. generate unempL2=L2.unemp
(2 missing values generated)

. list datevar unemp unempL1 unempL2 in 1/5

       +--------------------------------------------+
       | datevar      unemp    unempL1    unempL2   |
       |--------------------------------------------|
    1. |  1957q1   3.933333          .          .   |
    2. |  1957q2        4.1   3.933333          .   |
    3. |  1957q3   4.233333        4.1   3.933333   |
    4. |  1957q4   4.933333   4.233333        4.1   |
    5. |  1958q1        6.3   4.933333   4.233333   |
       +--------------------------------------------+
. generate unempF1=F1.unemp
(1 missing value generated)

. generate unempF2=F2.unemp
(2 missing values generated)

. list datevar unemp unempF1 unempF2 in 1/5

       +--------------------------------------------+
       | datevar      unemp    unempF1    unempF2   |
       |--------------------------------------------|
    1. |  1957q1   3.933333        4.1   4.233333   |
    2. |  1957q2        4.1   4.233333   4.933333   |
    3. |  1957q3   4.233333   4.933333        6.3   |
    4. |  1957q4   4.933333        6.3   7.366667   |
    5. |  1958q1        6.3   7.366667   7.333333   |
       +--------------------------------------------+
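Lag (L.) and lead (F.) operators are plain shifts with missing-value padding; a Python sketch, checked against the unemp values listed above (function names are mine):

```python
def lag(xs, k):
    """Lk. operator: shift the series back k periods, padding with None."""
    return [None] * k + xs[:len(xs) - k]

def lead(xs, k):
    """Fk. operator: shift the series forward k periods, padding with None."""
    return xs[k:] + [None] * k
```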
Difference operators

. generate unempD1=D1.unemp
(1 missing value generated)

. generate unempD2=D2.unemp
(2 missing values generated)

. list datevar unemp unempD1 unempD2 in 1/5

       +--------------------------------------------+
       | datevar      unemp    unempD1    unempD2   |
       |--------------------------------------------|
    1. |  1957q1   3.933333          .          .   |
    2. |  1957q2        4.1   .1666665          .   |
    3. |  1957q3   4.233333   .1333332  -.0333333   |
    4. |  1957q4   4.933333   .7000003   .5666671   |
    5. |  1958q1        6.3   1.366667   .6666665   |
       +--------------------------------------------+
Seasonal operators

. generate unempS1=S1.unemp
. generate unempS2=S2.unemp

. list datevar unemp unempS1 unempS2 in 1/5

       +--------------------------------------------+
       | datevar      unemp    unempS1    unempS2   |
       |--------------------------------------------|
    1. |  1957q1   3.933333          .          .   |
    2. |  1957q2        4.1   .1666665          .   |
    3. |  1957q3   4.233333   .1333332   .2999997   |
    4. |  1957q4   4.933333   .7000003   .8333335   |
    5. |  1958q1        6.3   1.366667   2.066667   |
       +--------------------------------------------+

Note that S1.unemp equals D1.unemp: the first seasonal difference is just the first difference.
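D. and S. differ only in that D2 differences twice (the difference of the difference), while S2 subtracts the value two periods back. A Python sketch, checked against the table values above (rounded, since the listings show single-precision artifacts; function names are mine):

```python
def diff(xs, k=1):
    """Dk. operator: k-th difference (D2 = difference of the difference)."""
    for _ in range(k):
        xs = [None if (a is None or b is None) else b - a
              for a, b in zip([None] + xs[:-1], xs)]
    return xs

def sdiff(xs, k):
    """Sk. operator: seasonal difference x_t - x_(t-k)."""
    return [None if i < k else xs[i] - xs[i - k] for i in range(len(xs))]
```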
Correlograms: autocorrelation
To explore autocorrelation, the correlation between a variable and its previous values, use the command corrgram. The number of lags depends on theory, an AIC/BIC procedure, or experience. The output includes the autocorrelation coefficients (AC) and the partial autocorrelation coefficients (PAC) used to specify an ARIMA model.

. corrgram unemp, lags(12)
 LAG       AC        PAC         Q      Prob>Q
   1    0.9641     0.9650     182.2     0.0000
   2    0.8921    -0.6305    339.02     0.0000
   3    0.8045     0.1091    467.21     0.0000
   4    0.7184     0.0424    569.99     0.0000
   5    0.6473     0.0836    653.86     0.0000
   6    0.5892    -0.0989    723.72     0.0000
   7    0.5356    -0.0384    781.77     0.0000
   8    0.4827     0.0744    829.17     0.0000
   9    0.4385     0.1879     868.5     0.0000
  10    0.3984    -0.1832    901.14     0.0000
  11    0.3594    -0.1396    927.85     0.0000
  12    0.3219     0.0745     949.4     0.0000

(The [Autocorrelation] and [Partial Autocor] text plots of the output are omitted here.)
The graphic view of the AC column shows a slow decay in the trend, suggesting nonstationarity. See also the ac command.
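The AC column is the ordinary lag-k autocorrelation coefficient; a stdlib Python sketch of the computation (the function name is mine):

```python
def autocorr(xs, k):
    """Lag-k autocorrelation coefficient, as in corrgram's AC column."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[t] - mean) * (xs[t - k] - mean) for t in range(k, n))
    return cov / var
```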
Cross-correlogram (gdp and unemp)

 LAG      CORR
 -10   -0.1080
  -9   -0.1052
  -8   -0.1075
  -7   -0.1144
  -6   -0.1283
  -5   -0.1412
  -4   -0.1501
  -3   -0.1578
  -2   -0.1425
  -1   -0.1437
   0   -0.1853
   1   -0.1828
   2   -0.1685
   3   -0.1177
   4   -0.0716
   5   -0.0325
   6   -0.0111
   7   -0.0038
   8    0.0168
   9    0.0393
  10    0.0419

(The accompanying [Cross-correlation] plot is omitted here.)
At lag 0 there is a negative immediate correlation between GDP growth rate and unemployment. This means that a drop
in GDP causes an immediate increase in unemployment.
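The numbers in the CORR column are ordinary correlation coefficients between one series and leads/lags of the other; a stdlib Python sketch (the sign convention for k and the function name are my own choices):

```python
def cross_corr(xs, ys, k):
    """Correlation between x_t and y_(t+k); k=0 is the contemporaneous
    correlation shown at lag 0 in a cross-correlogram."""
    if k < 0:
        xs, ys, k = ys, xs, -k
    xs2, ys2 = xs[:len(xs) - k], ys[k:]
    n = len(xs2)
    mx, my = sum(xs2) / n, sum(ys2) / n
    num = sum((a - mx) * (b - my) for a, b in zip(xs2, ys2))
    den = (sum((a - mx) ** 2 for a in xs2) *
           sum((b - my) ** 2 for b in ys2)) ** 0.5
    return num / den
```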
Cross-correlogram

 LAG      CORR
 -10    0.3297
  -9    0.3150
  -8    0.2997
  -7    0.2846
  -6    0.2685
  -5    0.2585
  -4    0.2496
  -3    0.2349
  -2    0.2323
  -1    0.2373
   0    0.2575
   1    0.3095
   2    0.3845
   3    0.4576
   4    0.5273
   5    0.5850
   6    0.6278
   7    0.6548
   8    0.6663
   9    0.6522
  10    0.6237

(The accompanying [Cross-correlation] plot is omitted here.)
Lag selection
Too many lags could increase the error in the forecasts; too few could leave out relevant information*. Experience, knowledge, and theory are usually the best way to determine the number of lags needed. There are, however, information criterion procedures to help come up with a proper number. Three commonly used are Schwarz's Bayesian information criterion (SBIC), the Akaike information criterion (AIC), and the Hannan-Quinn information criterion (HQIC). All three are reported by the command varsoc in Stata.
. varsoc gdp cpi, maxlag(10)

Selection-order criteria
Sample:  1959q4 - 2005q1                        Number of obs      =       182

  lag       LL        LR      df     p       FPE        AIC       HQIC      SBIC
   0    -1294.75                           5293.32      14.25   14.2642   14.2852
   1    -467.289   1654.9     4   0.000    .622031    5.20098    5.2438   5.30661
   2    -401.381   131.82     4   0.000    .315041    4.52067   4.59204  4.69672*
   3    -396.232   10.299     4   0.036    .311102    4.50804   4.60796   4.75451
   4    -385.514  21.435*     4   0.000   .288988*   4.43422*  4.56268*    4.7511
   5     -383.92   3.1886     4   0.527    .296769    4.46066   4.61766   4.84796
   6    -381.135   5.5701     4   0.234    .300816    4.47401   4.65956   4.93173
   7    -379.062   4.1456     4   0.387    .307335    4.49519   4.70929   5.02332
   8    -375.483   7.1585     4   0.128    .308865    4.49981   4.74246   5.09836
   9    -370.817   9.3311     4   0.053    .306748     4.4925   4.76369   5.16147
  10    -370.585   .46392     4   0.977    .319888    4.53391   4.83364   5.27329

Endogenous:  gdp cpi
Exogenous:   _cons
When all three agree the selection is clear, but what happens when we get conflicting results? A paper from the CEPR suggests, in the context of VAR models, that AIC tends to be more accurate with monthly data, HQIC works better for quarterly data on samples over 120, and SBIC works fine with any sample size for quarterly data (on VEC models)**. In the example above we have quarterly data with 182 observations; HQIC suggests a lag of 4 (which is also suggested by AIC).
* See Stock & Watson for more details and on how to estimate BIC and SIC.
** Ivanov, V. and Kilian, L. 2001. 'A Practitioner's Guide to Lag-Order Selection for Vector Autoregressions'. CEPR Discussion Paper no. 2685. London, Centre for Economic Policy Research. http://www.cepr.org/pubs/dps/DP2685.asp.
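The criteria varsoc reports are simple transformations of the log likelihood (LL), the sample size N, and the parameter count tp; for a VAR with m variables and p lags, tp = m*(m*p + 1). The sketch below reproduces the lag-4 row of the table above to 4 decimals (the formulas follow Stata's per-observation scaling; this is my own cross-check, not part of the handout):

```python
from math import log

def aic(ll, tp, n):
    return (-2 * ll + 2 * tp) / n                 # Akaike

def hqic(ll, tp, n):
    return (-2 * ll + 2 * log(log(n)) * tp) / n   # Hannan-Quinn

def sbic(ll, tp, n):
    return (-2 * ll + log(n) * tp) / n            # Schwarz Bayesian

# Lag-4 row of the varsoc table: m=2 variables, p=4 lags, N=182, LL=-385.514
tp = 2 * (2 * 4 + 1)   # 18 parameters
```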
Unit roots
Having a unit root in a series means that there is more than one trend in the series. The two regressions below, estimated on different subsamples of the same data, illustrate this: the sign of the gdp coefficient flips across periods.
. regress

      Source |       SS       df       MS              Number of obs =      68
-------------+------------------------------           F(  1,    66) =   19.14
       Model |  36.1635247     1  36.1635247           Prob > F      =  0.0000
    Residual |  124.728158    66  1.88982058           R-squared     =  0.2248
-------------+------------------------------           Adj R-squared =  0.2130
       Total |  160.891683    67   2.4013684           Root MSE      =  1.3747

------------------------------------------------------------------------------
       unemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |  -.4435909   .1014046    -4.37   0.000    -.6460516   -.2411302
       _cons |   7.087789   .3672397    19.30   0.000     6.354571    7.821007
------------------------------------------------------------------------------

. regress

      Source |       SS       df       MS              Number of obs =      76
-------------+------------------------------           F(  1,    74) =    3.62
       Model |  8.83437339     1  8.83437339           Prob > F      =  0.0608
    Residual |  180.395848    74  2.43778172           R-squared     =  0.0467
-------------+------------------------------           Adj R-squared =  0.0338
       Total |  189.230221    75  2.52306961           Root MSE      =  1.5613

------------------------------------------------------------------------------
       unemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   .3306551    .173694     1.90   0.061    -.0154377    .6767479
       _cons |   5.997169   .2363599    25.37   0.000     5.526212    6.468126
------------------------------------------------------------------------------
Unit roots
Unemployment rate:

line unemp datevar

(Line plot of unemp against datevar, 1957q1-2005q1; the unemployment rate ranges roughly between 6 and 12.)
. dfuller unemp, lag(5)

Augmented Dickey-Fuller test for unit root         Number of obs   =       187

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -2.597            -3.481            -2.884            -2.574

Unit root.

. dfuller unempD1, lag(5)

Augmented Dickey-Fuller test for unit root         Number of obs   =       186

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -5.303            -3.481            -2.884            -2.574

No unit root.
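The decision rule is worth spelling out: reject the unit-root null only when the test statistic is more negative than the critical value. A trivial sketch (the function name is mine):

```python
def df_reject(test_stat, critical):
    """Dickey-Fuller decision rule: reject the unit-root null when the
    test statistic is MORE NEGATIVE than the chosen critical value."""
    return test_stat < critical
```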
regress unemp gdp
predict e, resid
dfuller e, lags(10)

. dfuller e, lags(10)

Augmented Dickey-Fuller test for unit root         Number of obs   =       181

                               ---------- Interpolated Dickey-Fuller ---------
                  Test         1% Critical       5% Critical      10% Critical
               Statistic           Value             Value             Value
------------------------------------------------------------------------------
 Z(t)             -2.535            -3.483            -2.885            -2.575

Unit root*.
*Critical value for one independent variable in the OLS regression, at 5% is -3.41 (Stock & Watson)
Granger causality: regression approach

. regress unemp L(1/4).unemp L(1/4).gdp

      Source |       SS       df       MS              Number of obs =     188
-------------+------------------------------           F(  8,   179) =  668.37
       Model |  373.501653     8  46.6877066           Prob > F      =  0.0000
    Residual |  12.5037411   179  .069853302           R-squared     =  0.9676
-------------+------------------------------           Adj R-squared =  0.9662
       Total |  386.005394   187  2.06419997           Root MSE      =   .2643

------------------------------------------------------------------------------
       unemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       unemp |
         L1. |   1.625708   .0763035    21.31   0.000     1.475138    1.776279
         L2. |  -.7695503   .1445769    -5.32   0.000    -1.054845     -.484256
         L3. |   .0868131   .1417562     0.61   0.541    -.1929152    .3665415
         L4. |   .0217041   .0726137     0.30   0.765    -.1215849    .1649931
         gdp |
         L1. |   .0060996   .0136043     0.45   0.654    -.0207458    .0329451
         L2. |  -.0189398   .0128618    -1.47   0.143    -.0443201    .0064405
         L3. |   .0247494   .0130617     1.89   0.060    -.0010253    .0505241
         L4. |    .003637   .0129079     0.28   0.778    -.0218343    .0291083
       _cons |   .1702419    .096857     1.76   0.081    -.0208865    .3613704
------------------------------------------------------------------------------

. test L.gdp L2.gdp L3.gdp L4.gdp

 ( 1)  L.gdp = 0
 ( 2)  L2.gdp = 0
 ( 3)  L3.gdp = 0
 ( 4)  L4.gdp = 0

       F(  4,   179) =    1.67
            Prob > F =    0.1601
. quietly var unemp gdp, lags(1/4)
. vargranger

Granger causality Wald tests

  Equation        Excluded |     chi2     df   Prob > chi2
  -------------------------+-------------------------------
  unemp           gdp      |   6.9953      4       0.136
  unemp           ALL      |   6.9953      4       0.136
  -------------------------+-------------------------------
  gdp             unemp    |   6.8658      4       0.143
  gdp             ALL      |   6.8658      4       0.143
The null hypothesis is that var1 does not Granger-cause var2. In both cases, we cannot reject the null: neither variable Granger-causes the other.
Testing for structural breaks
Step 1. Create a dummy variable equal to 0 before the suspected break date and 1 after it, e.g. generate break = (datevar > tq(...)), where tq(...) holds the last period before the break. Change tq with the correct date format: tw (week), tm (monthly), tq (quarterly), th (half), ty (yearly), and use the corresponding date format in the parenthesis.
Step 2. Create interaction terms between the lags of the independent variables and the lag of the dependent variable. We will assume lag 1 for this example (the number of lags depends on your theory/data).
generate break_unemp = break*l1.unemp
generate break_gdp = break*l1.gdp
Step 3. Run a regression of the outcome variable (in this case unemp) on the lagged regressors, the interactions, and the dummy for the break.
reg unemp l1.unemp l1.gdp break break_unemp break_gdp
Step 4. Run an F-test on the coefficients for the interactions and the dummy for the break
test break break_unemp break_gdp
. test break break_unemp break_gdp

 ( 1)  break = 0
 ( 2)  break_unemp = 0
 ( 3)  break_gdp = 0

       F(  3,   185) =    1.14
            Prob > F =    0.3351
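Steps 1-3 can be sketched outside Stata to see exactly what design matrix the F-test operates on; all names below are mine and the toy data is purely illustrative:

```python
# Build the break dummy and interaction terms with lag-1 regressors,
# mirroring Steps 1-3 (Chow-test design matrix sketch).
def chow_design(dates, unemp, gdp, break_date):
    """Returns rows of (outcome, l1.unemp, l1.gdp, break,
    break_unemp, break_gdp); the first observation is lost to the lag."""
    rows = []
    for t in range(1, len(dates)):
        brk = 1 if dates[t] >= break_date else 0
        l_unemp, l_gdp = unemp[t - 1], gdp[t - 1]
        rows.append((unemp[t], l_unemp, l_gdp,
                     brk, brk * l_unemp, brk * l_gdp))
    return rows
```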
Testing for a break at an unknown date: the Quandt Likelihood Ratio (QLR) test

/* STEP 1. Copy-and-paste-run the code below to a do-file, double-check the quotes (re-type them if necessary) */
/* Replace the words in bold with your own variables, do not change anything else */
/* The log file qlrtest.log will have the list for QLR statistics (use Word to read it) */
/* See next page for a graph */
/* STEP 2. Copy-and-paste-run the code below to a do-file, double-check the quotes (re-type them if necessary) */
/* Replace the words in bold with your own variables, do not change anything else */
/* The code will produce the graph shown in the next page */
/* The critical value 3.66 is for q=5 (constant and four lags) and 5% significance */
sum qlr`var'
local maxvalue=r(max)
gen maxdate=datevar if qlr`var'==`maxvalue'
local maxvalue1=round(`maxvalue',0.01)
local critical=3.66
/*Replace with the appropriate critical value (see Stock & Watson)*/
sum datevar
local mindate=r(min)
sum maxdate
local maxdate=r(max)
gen break=datevar if qlr`var'>=`critical' & qlr`var'!=.
dis "Below are the break dates..."
list datevar qlr`var' if break!=.
levelsof break, local(break1)
twoway tsline qlr`var', title(Testing for breaks in GDP per-capita (1957-2005)) ///
xlabel(`break1', angle(90) labsize(0.9) alternate) ///
yline(`critical') ytitle(QLR statistic) xtitle(Time) ///
ttext(`critical' `mindate' "Critical value 5% (`critical')", placement(ne)) ///
ttext(`maxvalue' `maxdate' "Max QLR = `maxvalue1'", placement(e))
. list datevar qlrgdp if break!=.

       +---------------------+
       | datevar     qlrgdp  |
       |---------------------|
  129. |  1989q1   3.823702  |
  131. |  1989q3   6.285852  |
  132. |  1989q4   6.902882  |
  133. |  1990q1   5.416068  |
  134. |  1990q2   7.769114  |
  135. |  1990q3   8.354294  |
  136. |  1990q4   5.399252  |
  137. |  1991q1   8.524492  |
  138. |  1991q2   6.007093  |
  139. |  1991q3    4.44151  |
  140. |  1991q4   4.946689  |
  141. |  1992q1   3.699911  |
  160. |  1996q4   3.899656  |
  161. |  1997q1   3.906271  |
       +---------------------+
(Graph: tsline of the QLR statistic over time, with the break dates 1989q1-1992q1 and 1996q4-1997q1 labeled on the x-axis and the 5% critical value line drawn at 3.66.)
Testing for white noise

. wntestq unemp

Portmanteau test for white noise
 Portmanteau (Q) statistic =  1044.6341
 Prob > chi2               =     0.0000

. wntestq unemp, lags(10)

Portmanteau test for white noise
 Portmanteau (Q) statistic =   901.1399
 Prob > chi2               =     0.0000

In both cases we reject the null hypothesis of white noise: there is serial correlation.
If your variable is not white noise, see the correlogram section above to determine the order of the autocorrelation.
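The Q reported by wntestq is the Ljung-Box portmanteau statistic, n(n+2) times the sum over k of rho_k^2/(n-k); a stdlib Python sketch (the function name is mine):

```python
def portmanteau_q(xs, m):
    """Ljung-Box portmanteau statistic over lags 1..m."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    q = 0.0
    for k in range(1, m + 1):
        rho = sum((xs[t] - mean) * (xs[t - k] - mean)
                  for t in range(k, n)) / var
        q += rho ** 2 / (n - k)
    return n * (n + 2) * q
```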
Testing for serial correlation

. regress unempd1 gdp

      Source |       SS       df       MS              Number of obs =     192
-------------+------------------------------           F(  1,   190) =    1.62
       Model |  .205043471     1  .205043471           Prob > F      =  0.2048
    Residual |  24.0656991   190  .126661574           R-squared     =  0.0084
-------------+------------------------------           Adj R-squared =  0.0032
       Total |  24.2707425   191   .12707195           Root MSE      =   .3559

------------------------------------------------------------------------------
     unempd1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   -.015947   .0125337    -1.27   0.205    -.0406701    .0087761
       _cons |   .0401123    .036596     1.10   0.274    -.0320743    .1122989
------------------------------------------------------------------------------
. estat dwatson

Durbin-Watson d-statistic(  2,   192) =  .7562744
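The Durbin-Watson d is computed directly from the residuals; values near 2 indicate no first-order serial correlation, and values near 0 (like the .76 above) indicate strong positive correlation. A sketch (the function name is mine):

```python
def durbin_watson(e):
    """Durbin-Watson d = sum (e_t - e_(t-1))^2 / sum e_t^2."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(x * x for x in e)
```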
. estat durbinalt

Durbin's alternative test for autocorrelation
    lags(p)  |        chi2        df
-------------+-----------------------
      1      |     118.790         1

. estat bgodfrey

Breusch-Godfrey LM test for autocorrelation
    lags(p)  |        chi2        df
-------------+-----------------------
      1      |      74.102         1

Both tests indicate serial correlation: the null of no serial correlation is rejected.
Correcting for serial correlation

. prais unemp gdp, corc

Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.9556
Iteration 2:  rho = 0.9660
Iteration 3:  rho = 0.9661
Iteration 4:  rho = 0.9661

Cochrane-Orcutt AR(1) regression -- iterated estimates

      Source |       SS       df       MS              Number of obs =     191
-------------+------------------------------           F(  1,   189) =    2.48
       Model |  .308369041     1  .308369041           Prob > F      =  0.1167
    Residual |  23.4694088   189  .124176766           R-squared     =  0.0130
-------------+------------------------------           Adj R-squared =  0.0077
       Total |  23.7777778   190  .125146199           Root MSE      =  .35239

------------------------------------------------------------------------------
       unemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   -.020264   .0128591    -1.58   0.117    -.0456298    .0051018
       _cons |   6.105931   .7526023     8.11   0.000     4.621351    7.590511
-------------+----------------------------------------------------------------
         rho |    .966115
------------------------------------------------------------------------------
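Each Cochrane-Orcutt iteration estimates rho from the OLS residuals and then quasi-differences the data; a minimal sketch of those two pieces, not the full iterated procedure (function names are mine):

```python
def estimate_rho(e):
    """AR(1) coefficient of the residuals, as in each Cochrane-Orcutt step."""
    num = sum(e[t] * e[t - 1] for t in range(1, len(e)))
    den = sum(e[t - 1] ** 2 for t in range(1, len(e)))
    return num / den

def co_transform(y, rho):
    """Quasi-difference y_t - rho*y_(t-1); drops the first observation,
    which is why prais, corc reports 191 obs instead of 192."""
    return [y[t] - rho * y[t - 1] for t in range(1, len(y))]
```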
Introduction to Stata (PDF), Christopher F. Baum, Boston College, USA. A 67-page description of Stata, its key
features and benefits, and other useful information. http://fmwww.bc.edu/GStat/docs/StataIntro.pdf
Books
Introduction to econometrics / James H. Stock, Mark W. Watson. 2nd ed., Boston: Pearson Addison
Wesley, 2007.
Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill.
Cambridge ; New York : Cambridge University Press, 2007.
Econometric analysis / William H. Greene. 6th ed., Upper Saddle River, N.J. : Prentice Hall, 2008.
Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert O.
Keohane, Sidney Verba, Princeton University Press, 1994.
Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King, Cambridge University Press, 1989.
Statistics with Stata (updated for version 9) / Lawrence Hamilton, Thomson Books/Cole, 2006.