Stata Step by Step

Setting up Stata
We are going to allocate 10 megabytes to the dataset. You do not want to allocate
too much memory to the dataset, because the more memory you allocate to it,
the less memory is left to run the commands. You could slow Stata down or even
crash it. (From Stata 12 onward, memory is managed automatically and this
command is no longer needed.)
set mem 10m
We can also decide whether or not to have the “more” separation line on the
screen when the software displays results:
set more on
set more off
Setting up a panel
Now, we have to tell Stata that we have a panel dataset. We do it with the
command tsset, or with iis and tis:
iis idcode
tis year
or
tsset idcode year
In the previous command, idcode is the variable that identifies individuals in our
dataset and year is the variable that identifies time periods. The order is always
the same: the individual identifier first, then the time variable.
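A minimal sketch of the same declaration, assuming that the variables of this tutorial come from Stata's example dataset nlswork (the tutorial does not name its data file) and that you run Stata 10 or later, where xtset is the newer command for declaring a panel:

```stata
* Assumption: nlswork is the dataset behind idcode, year and ln_wage
webuse nlswork, clear
* declare the panel: individual identifier first, time variable second
xtset idcode year
* xtset leaves the declared variable names in r()
display "panel variable: `r(panelvar)'   time variable: `r(timevar)'"
```

Typing xtset (or tsset) alone, without arguments, shows the current declaration again.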
The commands referring to panel data in Stata almost always start with the prefix
xt. You can check for these commands by calling the help file for xt.
help xt
© Thierry Warin, 2006-2007 1

You should describe and summarize the dataset as usual before you perform
estimations. Stata has specific commands for describing and summarizing panel
datasets.
xtdes
xtsum
xtdes lets you observe the pattern of the data, such as the number of individuals
with each pattern of observations across time periods. In our case, we have
an unbalanced panel because not all individuals have observations for all years.
The xtsum command gives you general descriptive statistics of the variables in the
dataset, split into the overall, the between and the within variation. Overall
refers to the whole dataset.
Between refers to the variation of the individual means (each individual's average
across time periods). Within refers to the variation of the deviations from each
individual's own mean.
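The three components can be rebuilt by hand, which may help to see what xtsum reports. This is only a sketch: it assumes the nlswork example dataset, and the names grand, ibar, within and first are hypothetical helper variables.

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
* overall mean of ln_wage and the mean of each individual
egen double grand = mean(ln_wage)
bysort idcode: egen double ibar = mean(ln_wage)
* within component, re-centred on the overall mean as xtsum does
gen double within = ln_wage - ibar + grand
* flag one observation per individual so each person counts once
egen byte first = tag(idcode)
* overall variation
summarize ln_wage
* between variation (individual means, one per individual)
summarize ibar if first
* within variation
summarize within
* compare with the three blocks reported by xtsum
xtsum ln_wage
```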
You may be interested in applying the panel data tabulate command to a variable,
for instance to the variable south, in order to obtain a one-way table.
xttab south
As in the previous commands, Stata will report the tabulation for the overall
variation, the within and the between variation.
How to generate variables
Generating variables
gen age2=age^2
gen ttl_exp2=ttl_exp^2
gen tenure2=tenure^2

Now, let's compute the average wage for each individual (across time periods).
bysort idcode: egen meanw=mean(ln_wage)
In this case, we did not first apply the sort command and then the by prefix.
We could have done it, but bysort does both in a single command, so you can
always use it as a shortcut for the by prefix.
The command egen is an extension of the gen command to generate new
variables. The general rule is to use egen when you want to generate a new
variable that is created using a function inside Stata.
In our case, we used the function mean.
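As a quick check that the two routes give the same result, here is a small sketch. It assumes the nlswork example dataset; meanw_a and meanw_b are hypothetical names used only for the comparison.

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
* two-step version: sort first, then use the by prefix
sort idcode
by idcode: egen meanw_a = mean(ln_wage)
* one-step version used in the text
bysort idcode: egen meanw_b = mean(ln_wage)
* stops with an error if the two variables ever differ
assert meanw_a == meanw_b
```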
You can apply the command list to list the first 10 observations of the new
variable meanw.
list meanw in 1/10
And then apply the xtsum command to summarize the new variable.
xtsum meanw
You may want to obtain the average of the logarithm of wages for each year in the
panel.
bysort year: egen meanw1=mean(ln_wage)
And then you can apply the xttab command.
xttab meanw1
Generating dates
Let’s generate dates:
gen varname2 = date(varname1, "DMY")
And format:
format varname2 %td
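A self-contained sketch of the conversion, using two made-up date strings typed in with input (the values are illustrative only):

```stata
clear
* two example dates stored as text, day first
input str10 varname1
"15/03/2006"
"01/12/2007"
end
* convert the text to a Stata daily date and display it as a date
gen varname2 = date(varname1, "DMY")
format varname2 %td
list
```

The mask "DMY" tells Stata that the string holds day, month and year in that order; a string such as 03/15/2006 would need "MDY" instead.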
How to generate dummies
Generating general dummies
Let's generate the dummy variable black, which is not in our dataset.
gen black=1 if race==2
replace black=0 if black==.
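The two commands above also code observations with a missing race as 0. A one-line variant that keeps them missing could look like this (a sketch, assuming the nlswork example dataset, where race equals 2 for black):

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
* (race == 2) evaluates to 1 or 0; the if qualifier leaves missing race as missing
gen black = (race == 2) if !missing(race)
tab race black, missing
```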
Suppose you want to generate a new variable called tenure1 that is equal to the
variable tenure lagged one period. Then you would use the time series lag
operator (l). First, you would need to sort the dataset according to idcode and
year, and then generate the new variable with the "by" prefix on the variable idcode.
sort idcode year
by idcode: gen tenure1=l.tenure
If you were interested in generating a new variable tenure3 equal to the first
difference of the variable tenure, you would use the time series d operator.
by idcode: gen tenure3=d.tenure
If you would like to generate a new variable tenure4 equal to the second lag of the
variable tenure, you would type:
by idcode: gen tenure4=l2.tenure
The same principle would apply to the operator d.
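Once the data are declared as a panel, the time series operators already work within each individual, so the operators can also be used without the by prefix. A sketch, assuming the nlswork example dataset; the variable names tenure_lag1, tenure_d1 and tenure_lag2 are hypothetical:

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
* lag, first difference and second lag, computed within each idcode
gen tenure_lag1 = L.tenure
gen tenure_d1   = D.tenure
gen tenure_lag2 = L2.tenure
list idcode year tenure tenure_lag1 tenure_d1 tenure_lag2 in 1/10
```

The operators work in units of the time variable, so L.tenure is missing whenever the previous year is not observed for that individual.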
Let's just save our data file with the changes that we made to it.
save, replace
Another way to generate dummies would be to use the xi command. It takes the
items (strings of letters, for instance) of a designated variable (category, for
instance) and creates a dummy variable for each item. By default xi omits the first
category as the base; to omit the most prevalent category instead, set the
following characteristic beforehand:
char _dta[omit] "prevalent"
xi i.category
tabulate category
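A short sketch of xi on a real variable, assuming the nlswork example dataset and using race as the categorical variable. The last command shows the factor-variable notation available from Stata 11, which is a different technique from xi and does not store any dummies:

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
* creates _Irace_2 and _Irace_3; the first category is the omitted base
xi i.race
describe _Irace_*
* alternative from Stata 11 onward: let the estimation command expand i.race
reg ln_wage i.race
```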
Generating time dummies
Let's generate our time dummies. We use the "tabulate" command with the
option "gen" in order to generate one time dummy for each year of our dataset.
We will name the time dummies with the prefix "y",
• and we will get a first time dummy called "y1" which takes the value 1 if
year=1980, 0 otherwise,
• a second time dummy "y2" which takes the value 1 if year=1982, 0
otherwise, and similarly for the remaining years.
You could give any other name to your time dummies.
tab year, g(y)

Running OLS regressions
Let's now turn to estimation commands for panel data.
The first type of regression that you may run is a pooled OLS regression, which is
simply an OLS regression applied to the whole dataset. This regression ignores
the fact that you observe the same individuals over several time periods, so it
does not account for the panel nature of the dataset.
reg ln_wage grade age ttl_exp tenure black not_smsa south
In the previous command, typing age includes only the variable age itself, not
age2. To include all the variables whose names start with age, use the wildcard
age*. Be careful with wildcards: tenure* would also include tenure1, tenure3 and
tenure4 generated above.
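The difference between a bare variable name and a wildcard can be seen in a short sketch (assuming the nlswork example dataset, in which age is the only variable whose name starts with age before age2 is created):

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
gen age2 = age^2
* bare name: only age enters the regression
reg ln_wage grade age not_smsa south
* wildcard: age and age2 both enter the regression
reg ln_wage grade age* not_smsa south
```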
Suppose you want to observe the internal results saved in Stata associated with
the last estimation. This is valid for any regression that you perform. In order to
observe them, you would type:
ereturn list
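Individual saved results can also be used directly after the estimation. A brief sketch, assuming the nlswork example dataset:

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
reg ln_wage grade age ttl_exp tenure not_smsa south
* everything the last estimation command saved
ereturn list
* single results: number of observations, R-squared, coefficient vector
display "observations: " e(N)
display "R-squared:    " e(r2)
matrix list e(b)
```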
If you want to control for some categories:
xi: reg dependent ind1 ind2 i.category1 i.category2 i.time
Let's perform a regression where only the variation of the means across
individuals is considered.
This is the between regression.
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, be
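To see what the between regression does, the sketch below runs it and then repeats it as plain OLS on the individual means. It assumes the nlswork example dataset and leaves out black, which is not in that file. Incomplete observations are dropped first so that both routes use the same data; the coefficients should then coincide.

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
* keep complete observations only, so both estimations use the same sample
keep if !missing(ln_wage, grade, age, ttl_exp, tenure, not_smsa, south)
xtreg ln_wage grade age ttl_exp tenure not_smsa south, be
scalar b_be = _b[age]
* one row per individual holding the means over time
collapse (mean) ln_wage grade age ttl_exp tenure not_smsa south, by(idcode)
reg ln_wage grade age ttl_exp tenure not_smsa south
display "between: " b_be "   OLS on means: " _b[age]
```

Note that collapse replaces the data in memory, so save your file beforehand if you still need it.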

Running Panel regressions
In empirical work with panel data, you almost always have to choose between
two alternative regressions: fixed effects (or within, or least squares dummy
variables - LSDV) estimation and random effects (or feasible generalized least
squares - FGLS) estimation.
In panel data, in the two-way model, the error term is the sum of three
components:
1. an individual-specific effect,
2. a time-specific effect,
3. and an additional idiosyncratic term.
In the one-way model, the error term is the sum of two components:
1. an individual-specific effect,
2. and an idiosyncratic term.
It is absolutely fundamental that the error term is not correlated with the
independent variables.
• If there is no correlation between the individual and/or time effects and the
independent variables, then the random effects model should be used
because it is the more efficient estimator (it is a weighted average of the
between and within estimations).
• But, if there is correlation between the individual and/or time effects and
the independent variables, then the individual and time effects must be
estimated as dummy variables (fixed effects model) in order to solve the
endogeneity problem.
The fixed effects (or within) regression is an OLS regression of the form:
(yit - yi. - y.t + y..) = (xit - xi. - x.t + x..)B + (vit - vi. - v.t + v..)
where yi., xi. and vi. are the means of the respective variables (and the error)
within each individual across time, y.t, x.t and v.t are the means of the respective
variables (and the error) within each time period across individuals, and y.., x..
and v.. are the overall means of the respective variables (and the error).
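The within transformation can be carried out by hand. The sketch below does it for the simpler one-way case (individual means only, no time means), which is a simplification of the two-way formula above. It assumes the nlswork example dataset; the m_ and w_ prefixes are hypothetical names. The slope coefficients of the two regressions should coincide, while the standard errors differ because the hand-made regression does not correct the degrees of freedom.

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
keep if !missing(ln_wage, age, ttl_exp, tenure)
* subtract each individual's own mean from every variable
foreach v in ln_wage age ttl_exp tenure {
    bysort idcode: egen double m_`v' = mean(`v')
    gen double w_`v' = `v' - m_`v'
}
* OLS on the demeaned variables
reg w_ln_wage w_age w_ttl_exp w_tenure, noconstant
scalar b_hand = _b[w_age]
* built-in fixed effects estimator
xtreg ln_wage age ttl_exp tenure, fe
display "by hand: " b_hand "   xtreg, fe: " _b[age]
```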
Choosing between Fixed effects and Random effects? The Hausman test
The generally accepted way of choosing between fixed and random effects is
running a Hausman test.
Statistically, fixed effects are always a reasonable thing to do with panel data
(they give consistent results even when the individual effects are correlated with
the independent variables) but they may not be the most efficient model to run.
Random effects give you smaller standard errors, as they are a more efficient
estimator, so you should run random effects if it is statistically justifiable to do so.

The Hausman test checks a more efficient model against a less efficient but
consistent model to make sure that the more efficient model also gives consistent
results.
To run a Hausman test comparing fixed with random effects in Stata, you need to
first estimate the fixed effects model, save the coefficients so that you can
compare them with the results of the next model, estimate the random effects
model, and then do the comparison.
1. xtreg dependentvar independentvar1 independentvar2... , fe
2. estimates store fixed
3. xtreg dependentvar independentvar1 independentvar2... , re
4. estimates store random
5. hausman fixed random
The Hausman test tests the null hypothesis that the coefficients estimated by the
efficient random effects estimator are the same as the ones estimated by the
consistent fixed effects estimator. If the difference is insignificant (P-value,
Prob>chi2, larger than .05) then it is safe to use random effects. If you get a
significant P-value, however, you should use fixed effects.
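The five steps can be run as follows on actual variables. This is a sketch under assumptions: it uses the nlswork example dataset and only time-varying regressors, since variables that are constant within an individual drop out of the fixed effects model.

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
* consistent estimator first
xtreg ln_wage age ttl_exp tenure not_smsa south, fe
estimates store fixed
* efficient estimator second
xtreg ln_wage age ttl_exp tenure not_smsa south, re
estimates store random
* consistent model first, efficient model second
hausman fixed random
display "chi2 = " r(chi2) "   p-value = " r(p)
```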
If you want a fixed effects model with robust standard errors, you can use the
following command:
areg ln_wage grade age ttl_exp tenure black not_smsa south, absorb(idcode) robust
You may be interested in running a maximum likelihood estimation in panel data.
You would type:
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, mle
If you qualify for a fixed effects model, should you include time effects?
Another important question when you are doing empirical work with panel data
is whether or not to include time effects (time dummies) in your fixed effects
model.
In order to perform the test for the inclusion of time dummies in our fixed effects
regression,
1. First, we run fixed effects including the time dummies. In the next fixed
effects regression, the time dummies are abbreviated to "y*" (see
“Generating time dummies”), but you could type them all if you prefer.
Note that y* also matches any other variable whose name starts with y,
such as year, so you may prefer to type the range from the first to the last
dummy instead.
xtreg ln_wage grade age ttl_exp tenure black not_smsa south y*, fe
2. Second, we apply the "testparm" command. It is the test for time
dummies, which assumes the null hypothesis that the time dummies are
not jointly significant.
testparm y*
3. We reject the null hypothesis that the time dummies are not jointly
significant if the p-value is smaller than 10%, and as a consequence our
fixed effects regression should include time effects.
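From Stata 11 onward the same test can be written with factor variables, which avoids creating the dummies and any name clash with year. This is a swapped-in notation, not the command used above; the sketch assumes the nlswork example dataset.

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
* i.year expands into one dummy per year, with a base year omitted
xtreg ln_wage age ttl_exp tenure not_smsa south i.year, fe
* joint test that all year dummies are zero
testparm i.year
display "p-value of the joint test: " r(p)
```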
Fixed effects or random effects when time dummies are involved: a test
What if the inclusion of time dummies in our regression would permit us to
use a random effects model for the individual effects?
[This question is not usually considered in typical empirical work; the purpose
here is to show you an additional test for random effects in panel data.]
1. First, we will run a random effects regression including our time
dummies,
xtreg ln_wage grade age ttl_exp tenure black not_smsa south y*, re
2. and then we will apply the "xttest0" command, the Breusch-Pagan
Lagrange multiplier test for random effects. Its null hypothesis is that the
variance of the individual effects is zero, that is, that there are no random
effects.
xttest0
3. The null hypothesis is rejected if the p-value is smaller than 10%. Rejection
means that individual effects are present, so pooled OLS is not appropriate.
The test does not by itself discriminate between random and fixed effects;
for that choice, use the Hausman test.
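A sketch of the two commands on actual variables, assuming the nlswork example dataset and writing the time dummies with the factor-variable notation i.year (Stata 11 or later) instead of the y dummies:

```stata
* Assumption: nlswork is the dataset used in this tutorial
webuse nlswork, clear
xtset idcode year
* random effects regression with year dummies
xtreg ln_wage age ttl_exp tenure not_smsa south i.year, re
* Breusch-Pagan LM test; must follow directly after xtreg, re
xttest0
```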

GMM estimations
Two additional commands that are very useful in empirical work are the Arellano
and Bond estimator (GMM estimator) and the Arellano and Bover estimator
(system GMM).
Both commands permit you to deal with dynamic panels (where you want to use
lags of the dependent variable as independent variables) as well as with problems
of endogeneity.
You may want to have a look at them. The commands are respectively "xtabond"
and "xtabond2". "xtabond" is a built-in command in Stata, so in order to check
how it works, just type:
help xtabond
"xtabond2" is not a built-in command in Stata. If you want to look at it, you
must first get it from the net (this is another feature of Stata: you can always get
additional commands from the net). You type the following:
findit xtabond2
The next steps to install the command should be obvious.
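If you prefer to install the command without clicking through the findit results, xtabond2 is also distributed through the SSC archive. A short sketch (it needs an internet connection):

```stata
* download and install xtabond2 from the SSC archive
ssc install xtabond2, replace
* confirm where the command was installed, then open its help file
which xtabond2
help xtabond2
```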
How does it work?
The xtabond2 command allows you to estimate dynamic models either with the
difference GMM estimator or with the system GMM estimator.
xtabond2 dep_variable ind_variables [if] [in], noleveleq gmm(list1, options1) iv(list2, options2) two robust small
1. When noleveleq is specified, the difference GMM estimator is used.
Otherwise, if noleveleq is not specified, the system GMM estimator is used.
2. gmm(list1, options1):
• list1 is the list of the non-exogenous independent variables
• options1 may take the following values: lag(a b), eq(diff), eq(level),
eq(both) and collapse
o lag(a b) means that for the equation in difference, the lagged
levels of each variable from list1, dated from t-a to t-b, will be
used as instruments; whereas for the equation in level, the first
differences dated t-a+1 will be used as instruments. If b is a dot
(.), it means b is infinite. By default, a=1 and b is a dot. Example:
gmm(x y, lag(2 .)) ⇒ all the lags of x and y of at least two periods
will be used as instruments. Example 2:
gmm(x, lag(1 2)) gmm(y, lag(2 3)) ⇒ for variable x, the first and
second lags will be used as instruments, whereas for variable y,
the second and third lags will be used as instruments.
o Options eq(diff), eq(level) or eq(both) mean that the instruments
must be used respectively for the equation in first difference, the
equation in level, or for both. By default, the option is eq(both).
o Option collapse reduces the size of the instrument matrix and
helps to prevent the overfitting bias that arises in small samples
when the number of instruments is close to the number of
observations. But it reduces the statistical efficiency of the
estimator in large samples.
3. iv(list2, options2):
• list2 is the list of variables that are strictly exogenous, and options2 may
take the following values: eq(diff), eq(level), eq(both), pass and mz.
o eq(diff), eq(level), and eq(both): see above
o By default, the exogenous variables are differenced to serve as
instruments in the equations in first difference, and are used
undifferenced to serve as instruments in the equations in level.
The pass option prevents the exogenous variables from being
differenced when they serve as instruments in the equations in
first difference. Example: gmm(z, eq(level)) gmm(x, eq(diff) pass)
allows you to use variable x in level as an instrument in the
equation in level as well as in the equation in difference.
o Option mz replaces the missing values of the exogenous variables
by zero, thus allowing you to include in the regression the
observations whose data on exogenous variables are missing. This
option impacts the coefficients only if the variables are exogenous.

4. Option two:
• This option specifies the use of the two-step GMM estimation. But
although this two-step estimation is asymptotically more efficient, it leads
to biased results. To fix this issue, the xtabond2 command applies a
finite-sample correction to the covariance matrix. So far, there is no
test to know whether the one-step GMM estimator or the two-step GMM
estimator should be used.
5. Option robust:
• This option corrects the t-test for heteroscedasticity.
6. Option small:
• This option replaces the z-statistics by the t-test results.
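To put the options together, here is a sketch of one difference GMM and one system GMM estimation. Everything in it is an assumption made for illustration: it uses Stata's example dataset abdata (employment n, wages w and capital k for a panel of firms) rather than the wage data of this tutorial, it treats w and k as strictly exogenous, and it spells the options out in full (laglimits and twostep for lag and two). It also assumes xtabond2 is already installed.

```stata
* Assumption: xtabond2 is installed (for example with: ssc install xtabond2)
webuse abdata, clear
xtset id year
* difference GMM: lagged employment is instrumented by its deeper lags,
* wages and capital are treated as strictly exogenous
xtabond2 n L.n w k, gmm(L.n, laglimits(1 .)) iv(w k) noleveleq twostep robust small
* system GMM (no noleveleq) with a collapsed instrument set
xtabond2 n L.n w k, gmm(L.n, laglimits(1 .) collapse) iv(w k) twostep robust small
```

Comparing the number of instruments reported by the two commands shows how much collapse shrinks the instrument matrix.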
