Stata Step by Step

Setting up Stata

We are going to allocate 10 megabytes to the dataset. You do not want to allocate
too much memory to the dataset, because the more memory you allocate to it, the
less memory is available to run commands. Stata could slow down or even crash.
(In Stata 12 and later, memory is managed automatically and set mem is no longer
needed.)

set mem 10m

We can also decide whether or not Stata pauses with the “more” separation line
on the screen when it displays results:

set more on
set more off

Setting up a panel

Now we have to tell Stata that we have a panel dataset. We do it with the
command tsset, or with iis and tis:

iis idcode
tis year

or

tsset idcode year

In the previous command, idcode is the variable that identifies individuals in our
dataset, and year is the variable that identifies time periods. The order is always
the same: the panel variable first, then the time variable.
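
In recent Stata versions, the same declaration is usually made with xtset. The sketch below assumes that the data resemble Stata's nlswork example dataset, whose variable names match the ones used here (the source does not name its dataset).

```stata
* Assumption: nlswork stands in for the tutorial's dataset
webuse nlswork, clear

* Panel variable first, time variable second
xtset idcode year
```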

The commands referring to panel data in Stata almost always start with the prefix
xt. You can check for these commands by calling the help file for xt.

help xt

© Thierry Warin, 2006-2007 1


You should describe and summarize the dataset as usual before you perform
estimations. Stata has specific commands for describing and summarizing panel
datasets.

xtdes

xtsum

xtdes lets you observe the pattern of the data, such as the number of individuals
with different patterns of observations across time periods. In our case, we have
an unbalanced panel because not all individuals have observations for all years.

The xtsum command gives you general descriptive statistics of the variables in the
dataset, split into the overall, the between and the within variation. Overall
refers to the whole dataset.

Between refers to the variation of the individual means (the average of each
individual across time periods). Within refers to the variation of each
observation's deviation from the mean of its individual.
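
As a rough illustration of these three components, the pieces that xtsum summarizes can be built by hand. The sketch assumes the nlswork example data; the variable names m_i, w_it and first are made up for the illustration.

```stata
webuse nlswork, clear

* Between: the mean of ln_wage for each individual
bysort idcode: egen double m_i = mean(ln_wage)

* Within: deviation from the individual mean, with the overall mean added back
summarize ln_wage, meanonly
gen double w_it = ln_wage - m_i + r(mean)

* Flag one observation per individual so that each person counts once
egen first = tag(idcode)

summarize ln_wage          // overall variation
summarize m_i if first     // between variation
summarize w_it             // within variation
```

The standard deviations should be close to the ones reported by xtsum ln_wage.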

You may be interested in applying the panel data tabulate command to a variable,
for instance to the variable south, in order to obtain a one-way table.

xttab south

As in the previous commands, Stata will report the tabulation for the overall,
the between and the within variation.

How to generate variables

Generating variables

gen age2=age^2

gen ttl_exp2=ttl_exp^2

gen tenure2=tenure^2

Now, let's compute the average wage for each individual (across time periods).

bysort idcode: egen meanw=mean(ln_wage)

In this case, we did not run the sort command first and then use the by prefix.
We could have done so, but bysort combines both steps in a single command.

The command egen is an extension of the gen command to generate new variables.
As a general rule, use egen when the new variable is created with one of the
functions built into egen.

In our case, we used the function mean.
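
The two-step form mentioned above would look as follows. This is a sketch on the nlswork example data (an assumption), and meanw_alt is a hypothetical name.

```stata
webuse nlswork, clear

* One command: bysort sorts the data and applies the by prefix
bysort idcode: egen meanw = mean(ln_wage)

* Two commands: sort first, then use the by prefix
sort idcode
by idcode: egen meanw_alt = mean(ln_wage)

* Both variables should be identical
assert meanw == meanw_alt
```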

You can apply the command list to list the first 10 observations of the new
variable meanw.

list meanw in 1/10

And then apply the xtsum command to summarize the new variable.

xtsum meanw

You may want to obtain the average of the logarithm of wages for each year in the
panel.

bysort year: egen meanw1=mean(ln_wage)

And then you can apply the xttab command.

xttab meanw1

Generating dates

Let’s generate dates:

gen varname2 = date(varname1, "DMY")

And format the result as a date:

format varname2 %td
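
A small worked example may help. It assumes that the string dates are written day/month/year; the two dates are invented for the illustration. Recent Stata versions expect the mask in upper case ("DMY") and use %td as the daily date format.

```stata
clear
input str10 varname1
"15/03/2006"
"01/12/2007"
end

* Convert the string to a Stata daily date (days since 01jan1960)
gen varname2 = date(varname1, "DMY")

* Display the number as a readable date
format varname2 %td
list
```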

How to generate dummies

Generating general dummies

Let's generate the dummy variable black, which is not in our dataset.

gen black=1 if race==2


replace black=0 if black==.
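
Note that the two commands above also set black to 0 when race is missing. If that is not wanted, a one-line alternative keeps missing values missing. This is a sketch on the nlswork example data (an assumption), with the hypothetical name black2.

```stata
webuse nlswork, clear

* The logical expression is 1 when true and 0 when false;
* the if qualifier leaves black2 missing when race is missing
gen byte black2 = (race == 2) if !missing(race)
tabulate race black2, missing
```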

Suppose you want to generate a new variable called tenure1 that is equal to the
variable tenure lagged one period. Then you would use the time series lag
operator (l).

First, you would need to sort the dataset by idcode and year, and then
generate the new variable with the "by" prefix on the variable idcode.

sort idcode year


by idcode: gen tenure1=l.tenure

If you were interested in generating a new variable tenure3 equal to the first
difference of the variable tenure, you would use the time series d operator.

by idcode: gen tenure3=d.tenure

If you would like to generate a new variable tenure4 equal to the second lag of
the variable tenure, you would type:

by idcode: gen tenure4=l2.tenure

The same principle would apply to the operator d.
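
Once the data have been declared as a panel with tsset, the time series operators already work within each individual, so the by prefix is not required. A minimal sketch on the nlswork example data (an assumption; the new variable names are hypothetical):

```stata
webuse nlswork, clear
tsset idcode year

gen tenure_l1 = L.tenure     // first lag
gen tenure_d1 = D.tenure     // first difference: tenure - L.tenure
gen tenure_l2 = L2.tenure    // second lag
gen tenure_d2 = D2.tenure    // second difference (difference of the difference)

* The operators refer to the previous value of year, not to the previous row,
* so gaps between survey years produce missing values
list idcode year tenure tenure_l1 in 1/10
```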

Let's just save our data file with the changes that we made to it.

save, replace

Another way to generate dummies is the xi command. It takes the values (strings
of letters, for instance) of a designated variable (category, for instance) and
creates a dummy variable for each value, leaving one out as the base. To choose
which one is left out, change the default first (here, the most frequent value
is omitted):

char _dta[omit] prevalent
xi i.category
tabulate category
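
As an illustration, here is the same sequence applied to the variable race of the nlswork example data (an assumption, since the source's category variable is not available). xi names the dummies it creates with the prefix _I.

```stata
webuse nlswork, clear

* Omit the most frequent value instead of the smallest one
char _dta[omit] prevalent

* Create one dummy per value of race, minus the omitted one
xi i.race
describe _I*
```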

Generating time dummies

In order to do this, let's first generate our time dummies. We use the "tabulate"
command with the option "gen" in order to generate time dummies for each year
of our dataset.

We will name the time dummies with the prefix "y":

• we will get a first time dummy called "y1" which takes the value 1 if
year=1980, 0 otherwise,

• a second time dummy "y2" which takes the value 1 if year=1982, 0
otherwise, and similarly for the remaining years.

You could give any other name to your time dummies.

tab year, g(y)

Running OLS regressions

Let's now turn to estimation commands for panel data.

The first type of regression that you may run is a pooled OLS regression, which is
simply an OLS regression applied to the whole dataset. This regression ignores
the fact that you observe different individuals across time periods, and so it
does not account for the panel nature of the dataset.

reg ln_wage grade age ttl_exp tenure black not_smsa south

In the previous command, typing age includes only the variable age: when a name
matches a variable exactly, Stata does not treat it as an abbreviation. To include
the squared terms generated earlier, list them explicitly (age age2) or use a
wildcard such as age*, which matches every variable whose name starts with age.
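
A sketch of the regression with the squared term spelled out, on the nlswork example data (an assumption, so black is left out). The second form uses factor-variable notation, available in Stata 11 and later, and should give the same coefficients for age and its square.

```stata
webuse nlswork, clear
gen age2 = age^2

* Explicit list of the level and the squared term
reg ln_wage grade age age2 ttl_exp tenure not_smsa south

* Same model: c.age##c.age expands to age and age squared
reg ln_wage grade c.age##c.age ttl_exp tenure not_smsa south
```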

Suppose you want to observe the internal results saved in Stata associated with
the last estimation. This is valid for any regression that you perform. In order to
observe them, you would type:

ereturn list
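
The stored results can be used directly in later commands. A short sketch on the nlswork example data (an assumption):

```stata
webuse nlswork, clear
reg ln_wage grade age ttl_exp tenure not_smsa south

display e(N)       // number of observations used
display e(r2)      // R-squared
display _b[age]    // coefficient on age
display _se[age]   // its standard error
matrix list e(b)   // full coefficient vector
```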

If you want to control for some categories:

xi: reg dependent ind1 ind2 i.category1 i.category2 i.time

Let's perform a regression where only the variation of the means across
individuals is considered.

This is the between regression.

xtreg ln_wage grade age ttl_exp tenure black not_smsa south, be

Running Panel regressions

In empirical work with panel data, you always have to choose between two
alternative estimators: fixed effects (also called within, or least squares
dummy variables - LSDV) and random effects (estimated by feasible generalized
least squares - FGLS).

In panel data, the error term is modeled as a sum of components.

In the two-way model, the error term is the sum of three components:
1. an individual-specific effect,
2. a time-specific effect,
3. and an additional idiosyncratic term.

In the one-way model, the error term is the sum of two components:
1. an individual-specific effect,
2. and the idiosyncratic term.
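
In symbols, writing v_{it} for the idiosyncratic term as in the within regression shown further on, the two specifications can be sketched as follows (\mu_i, \lambda_t and u_{it} are notation chosen here, not taken from the source):

```latex
\text{two-way:}\quad u_{it} = \mu_i + \lambda_t + v_{it}
\qquad\qquad
\text{one-way:}\quad u_{it} = \mu_i + v_{it}
```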

It is absolutely fundamental that the error term is not correlated with the
independent variables.

• If there is no correlation, then the random effects model should be used
because it is more efficient: it is a weighted average of the between and
within estimators.

• But, if there is correlation between the individual and/or time effects and
the independent variables, then the individual and time effects must be
estimated as dummy variables (fixed effects model) in order to solve the
endogeneity problem.

The fixed effects (or within) regression is an OLS regression of the form:

$$(y_{it} - \bar{y}_{i.} - \bar{y}_{.t} + \bar{y}_{..}) = (x_{it} - \bar{x}_{i.} - \bar{x}_{.t} + \bar{x}_{..})\beta + (v_{it} - \bar{v}_{i.} - \bar{v}_{.t} + \bar{v}_{..})$$

where $\bar{y}_{i.}$, $\bar{x}_{i.}$ and $\bar{v}_{i.}$ are the means of the
respective variables (and the error) within the individual across time,
$\bar{y}_{.t}$, $\bar{x}_{.t}$ and $\bar{v}_{.t}$ are the means of the respective
variables (and the error) within each time period across individuals, and
$\bar{y}_{..}$, $\bar{x}_{..}$ and $\bar{v}_{..}$ are the overall means of the
respective variables (and the error).
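
For the one-way case (individual effects only), the transformation can be checked by hand. The sketch below assumes the nlswork example data and uses a single regressor; ybar, xbar, y_w and x_w are hypothetical names. The slope from the regression on demeaned variables should match the one from xtreg with the fe option, although the standard errors differ because the manual regression does not adjust the degrees of freedom.

```stata
webuse nlswork, clear
xtset idcode year

* Keep observations where both variables are observed
keep if !missing(ln_wage, tenure)

* Individual means across time
bysort idcode: egen double ybar = mean(ln_wage)
bysort idcode: egen double xbar = mean(tenure)

* Deviations from the individual means
gen double y_w = ln_wage - ybar
gen double x_w = tenure - xbar

* OLS on the demeaned data, then the built-in within estimator
reg y_w x_w, noconstant
xtreg ln_wage tenure, fe
```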

Choosing between Fixed effects and Random effects? The Hausman test

The generally accepted way of choosing between fixed and random effects is
running a Hausman test.

Statistically, fixed effects are always a reasonable thing to do with panel data
(they always give consistent results) but they may not be the most efficient model
to run. Random effects is a more efficient estimator and so gives smaller standard
errors, so you should run random effects if it is statistically justifiable to do so.

The Hausman test checks a more efficient model against a less efficient but
consistent model to make sure that the more efficient model also gives consistent
results.
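
In its usual textbook form (stated here as background, not taken from the source), the statistic compares the two coefficient vectors and is asymptotically chi-squared under the null hypothesis:

```latex
H = (\hat\beta_{FE} - \hat\beta_{RE})'
    \left[ \widehat{V}_{FE} - \widehat{V}_{RE} \right]^{-1}
    (\hat\beta_{FE} - \hat\beta_{RE}) \;\sim\; \chi^2_k
```

where k is the number of coefficients being compared.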

To run a Hausman test comparing fixed with random effects in Stata, you need to
first estimate the fixed effects model, save the coefficients so that you can
compare them with the results of the next model, estimate the random effects
model, and then do the comparison.

xtreg dependentvar independentvar1 independentvar2... , fe
estimates store fixed
xtreg dependentvar independentvar1 independentvar2... , re
estimates store random
hausman fixed random

The Hausman test tests the null hypothesis that the coefficients estimated by the
efficient random effects estimator are the same as the ones estimated by the
consistent fixed effects estimator. If the difference is insignificant (P-value,
Prob>chi2, larger than .05), then it is safe to use random effects. If you get a
significant P-value, however, you should use fixed effects.
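
A concrete run of these steps might look as follows. It is a sketch on the nlswork example data (an assumption), with an illustrative choice of regressors.

```stata
webuse nlswork, clear
xtset idcode year

* Consistent estimator first
xtreg ln_wage age ttl_exp tenure not_smsa south, fe
estimates store fixed

* Efficient estimator second
xtreg ln_wage age ttl_exp tenure not_smsa south, re
estimates store random

* Compare the two sets of coefficients
hausman fixed random
display r(chi2)   // test statistic
display r(p)      // p-value, to be compared with .05
```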

If you want a fixed effects model with robust standard errors, you can use the
following command:

areg ln_wage grade age ttl_exp tenure black not_smsa south, absorb(idcode) robust
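
An alternative sketch uses xtreg itself with standard errors clustered by individual, here on the nlswork example data (an assumption, with an illustrative list of regressors). With the same regressors, the coefficients should match those of areg.

```stata
webuse nlswork, clear
xtset idcode year

* Within estimator with standard errors clustered on the panel variable
xtreg ln_wage age ttl_exp tenure not_smsa south, fe vce(cluster idcode)
```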

You may be interested in running a maximum likelihood estimation in panel data.
You would type:

xtreg ln_wage grade age ttl_exp tenure black not_smsa south, mle

If you qualify for a fixed effects model, should you include time effects?

Another important question when you are doing empirical work with panel data is
whether or not to include time effects (time dummies) in your fixed effects
model.

In order to perform the test for the inclusion of time dummies in our fixed effects
regression,
1. first we run fixed effects including the time dummies. In the next fixed
effects regression, the time dummies are written as the variable range y2-y15
(see “Generating time dummies”), but you could type them all if you prefer.
The range assumes 15 survey years, with y1 left out as the base year; adjust
it to your data. The bare abbreviation y would be ambiguous, and the wildcard
y* would also match the variable year.

xtreg ln_wage grade age ttl_exp tenure black not_smsa south y2-y15, fe

2. Second, we apply the "testparm" command. It is the test for time
dummies, whose null hypothesis is that the coefficients on the time dummies
are jointly equal to zero.

testparm y2-y15

3. We reject the null hypothesis if the p-value is smaller than 10%. The time
dummies are then jointly significant, and as a consequence our fixed
effects regression should include time effects.
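
In Stata 11 and later, the same test can be sketched without creating the dummies by hand, using the factor-variable term i.year. The example assumes the nlswork data and an illustrative list of regressors.

```stata
webuse nlswork, clear
xtset idcode year

* i.year adds one dummy per year, leaving out a base year automatically
xtreg ln_wage ttl_exp tenure not_smsa south i.year, fe

* Joint test that all year coefficients are zero
testparm i.year
display r(p)   // p-value, to be compared with the 10% threshold
```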

Fixed effects or random effects when time dummies are involved: a test

What if the inclusion of time dummies in our regression would permit us to
use a random effects model for the individual effects?

(This question is not usually considered in typical empirical work; the purpose
here is to show you an additional test for random effects in panel data.)

1. First, we will run a random effects regression including our time
dummies (the range y2-y15 assumes 15 survey years, with y1 as the base year;
adjust it to your data),

xtreg ln_wage grade age ttl_exp tenure black not_smsa south y2-y15, re

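2. and then we will apply the "xttest0" command, the Breusch and Pagan Lagrange
multiplier test for random effects. Its null hypothesis is that the variance
of the individual effects is zero, that is, that there are no random effects.

xttest0

3. The null hypothesis is rejected if the p-value is smaller than 10%. Rejection
means that individual effects are present, so pooled OLS is not appropriate.
The test does not by itself discriminate between random and fixed effects;
that choice is made with the Hausman test described earlier.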

GMM estimations

Two additional commands that are very useful in empirical work are the Arellano
and Bond estimator (difference GMM) and the Arellano and Bover estimator
(system GMM).

Both commands allow you to deal with dynamic panels (where lags of the
dependent variable are used as independent variables) as well as with problems
of endogeneity.

You may want to have a look at them. The commands are respectively "xtabond"
and "xtabond2". "xtabond" is a built-in command in Stata, so in order to check
how it works, just type:

help xtabond

"xtabond2" is not a built-in command in Stata. If you want to look at it, you
must first get it from the net (this is another feature of Stata: you can
always get additional commands from the net). You type the following:

findit xtabond2

To install the command, follow the links shown in the search results.
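
xtabond2 is also distributed through the SSC archive, so a shorter route is to install it directly:

```stata
* Download and install xtabond2 from the SSC archive (needed only once)
ssc install xtabond2

* Then open its help file
help xtabond2
```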

How does it work?

The xtabond2 command estimates dynamic models with either the difference GMM
estimator or the system GMM estimator.

xtabond2 dep_variable ind_variables [if] [in], noleveleq gmm(list1, options1) iv(list2, options2) two robust small

1. When noleveleq is specified, the difference GMM estimator is used.
Otherwise, if noleveleq is not specified, the system GMM estimator is used.

2. gmm(list1, options1):
• list1 is the list of the non-exogenous independent variables
• options1 may take the following values: lag(a b), eq(diff), eq(level),
eq(both) and collapse
o lag(a b) means that for the equation in difference, the lagged
levels of each variable from list1, dated from t-a to t-b, will be
used as instruments; whereas for the equation in level, the first
differences dated t-a+1 will be used as instruments. If b is given
as a period (.), b is infinite. The defaults are a=1 and b=. (no
upper limit). Example: gmm(x y, lag(2 .)) ⇒ all the lags of x and y
of at least two periods will be used as instruments. Example 2:
gmm(x, lag(1 2)) gmm(y, lag(2 3)) ⇒ for variable x, the first and
second lags will be used as instruments, whereas for variable y,
the second and third lags will be used as instruments.
o Options eq(diff), eq(level) or eq(both) mean that the instruments
are used respectively for the equation in first difference, the
equation in level, or for both. By default, the option is eq(both).
o Option collapse reduces the size of the instrument matrix and
helps to prevent the bias that arises in small samples when
the number of instruments is close to the number of observations.
But it reduces the statistical efficiency of the estimator in large
samples.

3. iv(list2, options2):
• list2 is the list of variables that are strictly exogenous, and options2 may
take the following values: eq(diff), eq(level), eq(both), pass and mz.
o eq(diff), eq(level), and eq(both): see above
o By default, the exogenous variables are differenced to serve as
instruments in the equations in first difference, and are used
undifferenced to serve as instruments in the equations in level. The
pass option prevents the exogenous variables from being
differenced when they serve as instruments in the equations in
first difference. Example: iv(z, eq(level)) iv(x, eq(diff) pass)
uses z as an instrument in the equation in level and x, in level,
as an instrument in the equation in difference.
o Option mz replaces the missing values of the exogenous variables
by zero, so that observations with missing data on the exogenous
variables can be included in the regression. This option
impacts the coefficients only if the variables are exogenous.

4. Option two:
• This option requests the two-step GMM estimator. Although the two-step
estimator is asymptotically more efficient, its standard errors are
biased downward in finite samples. To fix this issue, the xtabond2 command
applies a finite-sample correction to the covariance matrix when two is
combined with robust. So far, there is no test to know whether the
one-step GMM estimator or the two-step GMM estimator should be used.

5. Option robust:
• This option makes the standard errors, and therefore the t-tests, robust to
heteroscedasticity.

6. Option small:
• This option reports t-statistics instead of z-statistics.
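
To see how the options fit together, here is a sketch on Stata's abdata example dataset (an assumption; it is not the dataset used above). It requires xtabond2 to be installed, and the choice of which variables to treat as exogenous is purely illustrative.

```stata
webuse abdata, clear
xtset id year

* Difference GMM: noleveleq drops the equation in level.
* L.n is instrumented GMM-style; w, k and the year dummies are
* treated as strictly exogenous (illustrative assumption).
xtabond2 n L.n w k yr1980-yr1984, gmm(L.n, lag(1 .)) ///
    iv(w k yr1980-yr1984) noleveleq two robust small

* System GMM: same specification without noleveleq
xtabond2 n L.n w k yr1980-yr1984, gmm(L.n, lag(1 .)) ///
    iv(w k yr1980-yr1984) two robust small
```

Adding collapse inside gmm() would reduce the number of instruments, as described in point 2.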
