
Dr. Jeremiah Dittmar
j.e.dittmar@lse.ac.uk

Quantitative Approaches and Policy Analysis (PP455)


2022 - 2023
Important STATA commands

STATA is a tool we can use to think about policy questions.

While STATA has an enormous wealth of statistical routines, most empirical
studies rely on a few core commands. This summary provides a brief
overview of several of the most important commands of STATA. Note that this
summary is not a substitute for the online help, the internet or the manuals,
which are the ultimate source for detailed descriptions of each command.
Instead this summary is intended as a short reference for the most important
and useful commands that you will regularly encounter during the course.
For this purpose the commands are organized into four groups. The first
group are estimation commands and commands to compute standard errors.
These two types of commands are the only commands that actually perform
statistical analyses. All other commands are subsidiary commands to
manipulate data, present output and the like. In many empirical projects the
task of getting the data into the form necessary to estimate relationships
(“run a regression”) takes time and effort. Particularly for these data
manipulation steps it is absolutely crucial to document each step in a do-file.
One of the reasons why STATA is such a popular program is that its data
manipulation and graphing commands are very flexible and powerful.

Graphing Commands
graph One of STATA’s most powerful features is its ability to generate an
endless variety of graphs. For our purposes, most of the time we just need
scatter plots of two variables. To generate a scatter plot with the variables
var1 and var2 type graph twoway scatter var1 var2. Another typical
application is that you also want to include the fitted regression line of a
regression of var1 on var2 in your scatter plot. To do this you first regress
var1 on var2 and then generate the predicted values for var1 by typing
predict fit_var1, which will save the fitted values for the dependent variable
in the variable fit_var1. To generate the scatter plot with the fitted
regression line you need to type graph twoway (scatter var1 var2)
(line fit_var1 var2, sort). To visualise the distribution of the data use the
histogram and/or kdensity commands. See the internet and the manuals for a
myriad of options to add labels and comments to graphs, change the color and
appearance and many other features.
Note that graphs can be exported and saved as delightful image files
automatically. If you explore a type-setting software like LaTeX (free!), you
can automate significant amounts of the work involved in preparing
presentations and documents. For example, the image of the graph which your
Stata code saves will automatically appear in your document when you compile
it.
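The scatter plot with a fitted line and the export step can be sketched as follows. This is a minimal illustration, assuming the example dataset auto that ships with STATA; the variable name fit_price and the file name scatter_fit.png are made up for this example.

```stata
* Load an example dataset that is installed with STATA
sysuse auto, clear

* Regress price (playing the role of var1) on weight (playing the role of var2)
regress price weight

* Save the fitted values of the dependent variable in a new variable
predict fit_price

* Scatter plot plus fitted line; sort orders the points of the line by weight
graph twoway (scatter price weight) (line fit_price weight, sort)

* Save the graph that is currently displayed as an image file
graph export "scatter_fit.png", replace
```

The exported file is written to the working directory, so a document that includes scatter_fit.png picks up the new version each time the do-file is run.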

Estimation Commands
summarize creates basic descriptive statistics for each variable in your
dataset (number of observations, mean, standard deviation, minimum and
maximum). To get more detailed descriptive statistics for a particular
variable var1 type summarize var1, detail.
correlate computes the correlation coefficient between two variables. To
obtain the correlation coefficient between the variables var1 and var2 type
correlate var1 var2.
regress runs an OLS regression. The first variable listed after regress is the
dependent variable and any further variables (separated by a space) are the
explanatory variables. STATA automatically includes a constant term in the
regression unless you specify the option noconstant. Note that both in this
case and for all other commands such option choices have to be separated
from the main command line by a comma. If you wanted to regress the
variable outcome on the variable policy and exclude the constant, you would
therefore type regress outcome policy, noconstant.
Note that some (earlier) versions of STATA only allow you to include at
most 40 explanatory variables in a regression. With the command set matsize
800 this limit can be increased to 800 variables (you can also increase it to
some other number below 800), which is the current capacity constraint for
Intercooled STATA (STATA SE can deal with even more variables).
areg is one command to estimate a panel data model with the within
transformation. The somewhat more complicated alternative is xtreg. The only
difference between regress and areg is that you have to specify with the
option absorb which fixed effects should be removed with the within
transformation. Consider the following example. You have a dataset that
contains information for several states over a number of years. The dataset
is in the long form and contains a variable state which indicates to which
state the observation refers and also a variable year which indicates to
which year each observation refers. You want to regress the variable outcome
on the variable policy and control for state fixed effects. To do this you
would type areg outcome policy, absorb(state).
In STATA the within transformation can only be applied once in a regression
(unless you program it manually, which would be quite a bit of work). If
you want to estimate a model with both cross-section and time fixed effects,
most people use areg to eliminate either the cross-section or the time fixed
effects and control for the other fixed effects by including a set of dummy
variables as regular explanatory variables in the regression estimated with
areg. In the example above suppose your dataset contains information for
five years. You have used the command tabulate (described below) to
generate dummy variables year2 for the second year, year3 for the third year
and so on. To estimate the regression of outcome on policy with both
state and time fixed effects you would now type areg outcome policy
year2-year5, absorb(state). The statement year2-year5 is a shorthand
(very convenient, particularly in large datasets), which tells STATA to
include all variables from year2 to year5 in the regression.
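The two-way fixed effects regression described above can be tried out on simulated data. The following is only a sketch: the panel of 10 states and 5 years, the seed and the data generating process are invented for illustration, and the dummy variables are called yeardum1 to yeardum5 because they are created with tabulate year, generate(yeardum).

```stata
* Simulate a small panel: 10 states observed for 5 years (hypothetical data)
clear
set seed 12345
set obs 50
generate state = ceil(_n/5)
bysort state: generate year = 2000 + _n
generate policy  = runiform() > 0.5
generate outcome = 1 + 2*policy + 0.1*state + rnormal()

* Create one dummy variable per year: yeardum1, ..., yeardum5
tabulate year, generate(yeardum)

* State fixed effects are absorbed, year fixed effects enter as dummies
* (yeardum1 is left out as the base year)
areg outcome policy yeardum2-yeardum5, absorb(state)
```

The estimated coefficient on policy should come out in the neighbourhood of 2, the value used in the simulation, although with only 50 observations it will not be exact.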
ivreg estimates an instrumental variables (or two-stage least squares)
regression. It again operates in exactly the same way as regress with one
difference. Instead of the endogenous variable dodgy, that you want to
instrument for, you include the expression (dodgy=instrument) in the list of
variables. If you are lucky enough to have two instruments you would write
(dodgy=instrument1 instrument2). It is also possible to instrument
for several endogenous variables, but we will try to avoid such cases.
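In recent versions of STATA the official instrumental variables command is ivregress, which uses the same (dodgy=instrument) notation. The sketch below uses simulated data; the variable names u and control, the seed and the coefficients are assumptions made for this example only.

```stata
* Simulate data in which dodgy is correlated with the error term through u
clear
set seed 2023
set obs 500
generate instrument = rnormal()
generate u          = rnormal()
generate dodgy      = 0.8*instrument + u + rnormal()
generate control    = rnormal()
generate outcome    = 1 + 2*dodgy + 0.5*control - u + rnormal()

* OLS is biased here because dodgy and the error term share u
regress outcome dodgy control

* Two-stage least squares with instrument as the excluded instrument
ivregress 2sls outcome control (dodgy = instrument), vce(robust)
```

Comparing the two sets of estimates shows the point of the exercise: the OLS coefficient on dodgy is pulled away from 2, while the instrumental variables estimate should be close to it.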
probit estimates a probit regression. Even though this regression is
conceptually very different from an ordinary least squares regression it works
in exactly the same way as the command regress. If you use dprobit STATA
immediately produces the marginal effects of the explanatory variables at the
sample mean.
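In current versions of STATA the same marginal effects can be obtained with probit followed by the margins command. A minimal sketch, assuming the example dataset auto and an arbitrary choice of explanatory variables:

```stata
sysuse auto, clear

* Probit regression of the dummy variable foreign on mpg and weight
probit foreign mpg weight

* Marginal effects of all explanatory variables, evaluated at the sample means
margins, dydx(*) atmeans
```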
robust is an option for any linear regression (including instrumental
variables regression) and specifies that STATA should use heteroskedasticity
robust standard errors.
cluster is also an option for any linear regression (including instrumental
variables regression) and is one approach to obtaining standard errors which
are not only robust against heteroskedasticity but also against
autocorrelation. The estimation allows the error terms to be correlated in an
arbitrary way within each cluster, but assumes that there is no correlation
of the error terms across clusters. A typical example would be a panel of
states. In many applications it would be plausible to assume that the error
term is not correlated across different states, but might well be correlated
within each state over time. To control for this type of autocorrelation you
would type cluster(state) where state is the name of the variable that
contains your state identifier. If you use cluster there is no need to also
specify robust as STATA will assume this automatically.
test is the command to test linear hypotheses with an F-test after a
regression. Suppose, for example, that you have regressed the variable
outcome on the variables policy1 and policy2 and want to test the hypothesis
that the coefficients of the two explanatory variables are equal. To
do this you would type test policy1=policy2 after you have estimated
the regression. STATA will return both the value of the F-statistic and the
associated p-value. If the regression uses robust or clustered standard errors,
the reported F-statistic will also be robust against heteroskedasticity and the
form of autocorrelation assumed in cluster.
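A short illustration of robust standard errors followed by test, assuming the example dataset auto. The hypothesis that the coefficients on mpg and weight are equal has no economic meaning here and only shows the mechanics.

```stata
sysuse auto, clear

* OLS with heteroskedasticity robust standard errors
regress price mpg weight, robust

* F-test of the hypothesis that the two coefficients are equal
test mpg = weight

* The F-statistic and the p-value are also available as saved results
display r(F)
display r(p)
```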

Data Management
cd is the change directory command which is (or should be) used at the
beginning of do-files. The command tells STATA what the working directory
is in which it will by default look for datasets and also save log files. If you,
for example, wanted to change to the directory C:\mi460\exercise01 you
would type cd "C:\mi460\exercise01" (the quotation marks are not
necessary if the directory name does not contain spaces, but it does not do
any harm to always use them).
use is the command to open a dataset. If you type use mydata.dta STATA
will look for the dataset mydata.dta in the working directory (which you
have specified with cd). If you keep your datasets in the subdirectory data
of your working directory rather than the working directory itself, you would
type use data\mydata.dta. Note that for STATA to be able to work with
a dataset it needs to load it into the RAM of the computer. For this purpose
STATA reserves a small part of the RAM of the computer (the default is 1 MB).
If you are trying to open a dataset which is larger than this default, you first
need to increase the amount of memory that is allocated to STATA. If you
wanted to increase the memory allocated to STATA to 40MB (assuming that
your computer has a lot more than 40MB of RAM these days) you would need
to type set memory 40m before you open the dataset.
clear closes the dataset that STATA currently has open without saving any
changes to the data. STATA can only work on one dataset at a time. STATA will
complain if you try to open a new dataset unless you either save the dataset
currently open or type clear.
save is the command to save a dataset. It works in exactly the same way as
use. Suppose you want to save a dataset in the subdirectory of your working
directory called data. To save the dataset newdata.dta to this directory you
would type save data\newdata.dta.
describe will generate a list of the variables in your dataset. This list will be
particularly useful if your variables have labels which contain a short
description/definition of the variable. To attach a label to the variable policy,
for example, type label variable policy "text of your label".
list can be used to list the individual observations in your dataset. If you
wanted to display variables var1 and var2 you would type list var1 var2. If
you type list without any arguments, STATA will list the observations for all
variables in the dataset. It is often useful to use this command in combination
with the if clause, which is described further below.
sort sorts your observations into a particular order. Before you list data (and
also before you can use certain commands such as, for example, merge which
is described below), it is often necessary to sort your data. Typing sort var1
var2 will sort your observations in increasing order of var1. Observations
with the same value of var1 will be sorted in increasing order of variable
var2. Observations that have the same values of var1 and var2 will be placed
in an arbitrary (random) order. An extension of sort is gsort which has a
number of additional options.
generate is possibly the most widely used command apart from regress
and generates a new variable. The syntax is generate newvar=expression where
expression can be any function of existing variables. The mathematical
operations that you can use in expression have fairly logical symbols: +
for plus, - for minus, / for divide, * for multiply, ^ for exponents (i.e. var^2
computes the square of var) and ln(var) or log(var) for the natural logarithm
of var. It is also possible to generate lagged values of variables. To generate
a new variable lagvar1 which is equal to the first lag of var1 you have to type
generate lagvar1=var1[_n-1]. Note that STATA will mechanically use the value
one row above the current observation to create the lagged value. For this
operation to produce something sensible you may first have to sort your data
into the appropriate order.
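The need to sort before taking lags is easiest to see in a panel. The sketch below types in a tiny made-up dataset with two states and three years and generates the lag separately within each state, so that the first year of one state does not pick up the last year of the previous state.

```stata
* Hypothetical panel data, entered by hand for illustration
clear
input state year var1
1 2001 10
1 2002 12
1 2003 15
2 2001 20
2 2002 18
2 2003 25
end

* Sort by state and year, then take the lag within each state
sort state year
by state: generate lagvar1 = var1[_n-1]

* The first year of each state has no lag and is therefore missing
list, sepby(state)
```

Without the by state: prefix, lagvar1 for state 2 in 2001 would be 15, the 2003 value of state 1, which is not what a lag is supposed to be.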
replace can be used to replace a variable with some other variable or function
of variables. As in the case of generate the syntax is replace var=expression
where expression can be any function of existing variables. Replace is often
used in combination with the if clause, which is described further below.
drop can be used to drop either variables or observations from the dataset.
Drop is often used in combination with the if clause, which is described
further below. Suppose, for example, that you want to drop the variables temp1
and temp2 from the dataset. To do so you type drop temp1 temp2.
Alternatively, suppose that you want to drop all observations for which the
variable state takes on the value Alaska. To do so you would type drop if
state=="Alaska".
keep is the mirror image of drop. If you type keep var1 var2 STATA will
drop all variables other than var1 and var2.
tabulate can be used to create a table that lists all realizations of a variable
and the frequency with which they occur. If you type, for example, tabulate
year you will see how many separate values the variable year takes on and
how often each value occurs. You can also generate a two-way table of two
variables var1 and var2 by typing tabulate var1 var2.
A very convenient way to generate dummy variables for each year in your
dataset is to type tab year, gen(yeardum). This command will generate
one dummy variable for each year in your dataset. These dummy variables
will have the names yeardum1, yeardum2 and so on. You can use this
command to generate dummy variables for any categorical variable in your
dataset.
egen is one of the most powerful commands that STATA has and is a
generalization of generate. It has a wealth of options which are described in
detail in the STATA manuals. If you, for example, wanted to create a variable
average that contains the mean of the variable var1 you would type egen
average=mean(var1). This command is often used in combination with
the by clause, which is described further below.
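A short sketch of egen with and without by, assuming the example dataset auto. The variable names group_average and deviation are invented for this example.

```stata
sysuse auto, clear

* Mean of price over the whole dataset, repeated in every row
egen average = mean(price)

* Mean of price computed separately for domestic and foreign cars
bysort foreign: egen group_average = mean(price)

* Deviation of each car's price from the mean of its own group
generate deviation = price - group_average
```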
merge is a very powerful command to combine two (or more) datasets into
one dataset. Suppose you want to combine two datasets dataset1 and
dataset2, which contain different variables about a common set of people.
Suppose that in both datasets these people are uniquely identified by the
variable personid. That is, each person (i.e. each row in the dataset) has a
unique value of personid. In this case first sort both datasets by personid.
Then open dataset1 with the use command and type merge personid
using dataset2.dta. The resulting dataset will combine the variables from
both datasets for each person.
Consider one more example. Suppose that dataset1 and dataset2 are
panel datasets which contain data for a number of states over several years.
Suppose that both datasets contain a variable state with the name of each
state and a variable year with the calendar year. In this case the combination
of the variables state and year uniquely identifies each line in the
two datasets. To merge these two datasets you first sort both datasets with
sort state year. Then open dataset1 with the use command and type
merge state year using dataset2.dta. The resulting dataset will
combine the variables of both datasets.
It is of critical importance that you check carefully after each merge
whether STATA has done what you wanted it to do. One useful piece of help in
checking whether a merge has worked as intended is the variable _merge
that STATA automatically generates after each merge (and which you need
to drop before you can perform a second merge). See the online help and the
user guide for a detailed description of merge and further examples.
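Recent versions of STATA (version 11 and later) also accept a newer merge syntax in which you state the type of match explicitly, for example merge 1:1 for a one-to-one match, and which does not require the data to be sorted first. The following sketch uses two tiny made-up datasets; the variables age and wage and the temporary file demo are hypothetical.

```stata
* First dataset: personid and age, stored in a temporary file
clear
input personid age
1 30
2 45
3 52
end
tempfile demo
save `demo'

* Second dataset: personid and wage
clear
input personid wage
1 12.5
2 20
4 15
end

* One-to-one merge on personid
merge 1:1 personid using `demo'

* _merge==1: only in the data in memory, _merge==2: only in the using
* dataset, _merge==3: matched in both
tabulate _merge
list

* Drop _merge before performing another merge
drop _merge
```

Here persons 1 and 2 are matched, person 4 appears only in the data in memory and person 3 only in the using dataset, which is exactly the kind of pattern you should look for when checking a merge.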
append is similar to merge and can be used to add additional observations
to a dataset. If you have two datasets dataset1 and dataset2 that
contain the same variables, but different observations for these variables
(different time periods for example) you can combine these two datasets by
opening dataset1 and then typing append using dataset2.dta. For this to
work, variables have to have the same name in the two datasets.
reshape is a very powerful command to convert panel datasets between the
wide and the long form. Most of the time you want to convert datasets from
the wide to the long form. Suppose you have a dataset for a number of states.
The name of each state is contained in the first column of the dataset and this
variable is called state. The second column contains the variable inc1970
which is the income of each state in 1970. The third column contains inc1980,
which is the income of each state in 1980. To convert this dataset to the long
form type reshape long inc, i(state) j(year). See the STATA manuals
for a comprehensive description of this command and further examples.
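The reshape example can be reproduced with a dataset typed in by hand. The income figures below are invented purely for illustration.

```stata
* Wide form: one row per state, one income column per year (made-up numbers)
clear
input str10 state inc1970 inc1980
"Alabama" 2900 7500
"Alaska"  4600 13800
end

* Long form: one row per state and year, with the variables state, year and inc
reshape long inc, i(state) j(year)
list

* reshape wide inc, i(state) j(year) would convert the data back
```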
do can be used to execute a do-file. This command is particularly useful if
you want to break up your STATA programme into several do-files. You might,
for example, have a do-file that merges the various data sources and creates
and saves the final estimation dataset. The next do-file opens this dataset and
runs the regressions. In this case (and particularly if you have many more
do-files) it is very useful to have a file “master.do”, which contains
explanations of what the separate do-files do and which calls each do-file with
the command do "name.do" (as usual STATA will be looking in the working
directory for these do-files). While you would only execute one of your
do-files at a time during the actual empirical work to save on computing time,
executing this master file should enable you to replicate all your empirical
results.
#delimit can be useful when you are dealing with long STATA commands.
The default setting in STATA is that each line of your do-file is a separate
command. In other words, each command ends with a carriage return. If you
issue the command #delimit ; in a do-file (it can only be used in do-files)
STATA now requires a “;” at the end of each command, but commands can
stretch over several lines in your do-file. If you want to switch back to the
default setting later in the same do-file you would have to issue the command
#delimit cr.
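A minimal sketch of a do-file that switches the delimiter for one long command and then switches back, assuming the example dataset auto:

```stata
sysuse auto, clear

* From here on every command has to end with a semicolon
#delimit ;

regress price mpg weight length
    if foreign == 0,
    robust ;

* Back to the default: one command per line ;
#delimit cr

summarize price
```

Note that while the semicolon delimiter is active, a comment line that starts with * also has to end with a semicolon.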
Conditional Statements
if is a statement that can be combined with almost any command in STATA.
If you, for example, want to run a regression of outcome on policy
for the years after 1980 (assuming that your dataset also contains
observations for years before 1980) you would type regress outcome policy if
year>1980. Note that the if condition comes immediately after the main
command and is not separated by a comma. The if condition is also often
used in combination with generate, replace, drop, egen, list and many
other commands.
by is a statement that specifies that a particular command should be
computed for each of the groups specified through by separately. This sounds
more complicated than it is. Consider the following example. You have a
labor market dataset for a sample of men and women. You want to regress the
wage of a person (wage) on the years of experience (exp). Furthermore, you
first want to run this regression only on the observations for men (dummy
variable gender equal to zero) and then only on the observations for women
(dummy variable gender equal to one). To do this you first have to type sort
gender and then by gender: reg wage exp. Note that you could have
achieved the same result by first typing reg wage exp if gender==0 and
then reg wage exp if gender==1. However, if you have more than two
categories the by statement is much easier to handle.
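The same pattern can be tried with the example dataset auto, where the dummy variable foreign plays the role of gender. This is only an illustration of the mechanics.

```stata
sysuse auto, clear

* Two steps: sort first, then run the regression for each group
sort foreign
by foreign: regress price mpg

* bysort combines the sort and the by statement in a single step
bysort foreign: regress price mpg
```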

Presenting Output
log using is the most basic way to save a record of your output in a log file.
If you include the command log using "filename.log" at the beginning
of your do-file STATA will save the output of all following commands
in this file. STATA now supports several different types of log files. .log files
are the no-nonsense basic version, but the default is now a .smcl file, which
has a somewhat fancier layout. If you use a log file it is useful to include the
command capture log close at the beginning of your do-file to suppress
(unnecessary) error messages related to open log files.
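A do-file that keeps a log typically has the following skeleton. The options replace (overwrite an existing log file) and text (write a plain text log rather than a .smcl file) are additions made for this sketch, and the commands in the middle stand in for your actual analysis.

```stata
* Close any log file that is still open; capture suppresses the error
* message that would otherwise appear if no log file is open
capture log close

* Start a plain text log and overwrite an earlier version of the file
log using "filename.log", replace text

sysuse auto, clear
summarize price mpg

* Close the log at the end of the do-file
log close
```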
outreg One way to turn your regression results into the type of tables that you
see in empirical papers is to create them by hand (or copy and paste from
your log file). A much more efficient process is to use the outreg routine.
By essentially typing outreg followed by a list of variables you can save the
results for these variables to a text or Word file. See the long description of
outreg available on the internet. Unfortunately, outreg is not included in the
typical STATA installation and you have to download it from the web. This
is not difficult. Type net search outreg, nosj to get a list of locations
where you can download the ado-file that implements outreg and follow
the instructions.
esttab provides another basic way to view and present results. After you
estimate a regression, store the results using the eststo command (see help).
Then present or export the results in table form using the esttab command.
This lets you export in multiple, quite flexible formats. For example, as “.csv”,
as raw text, as “.tex”, etc.
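Like outreg, the commands eststo and esttab are user-written; they belong to the estout package, which can be installed from within STATA. A minimal sketch, assuming the example dataset auto and a made-up output file name results.csv:

```stata
* One-time installation (requires internet access):
* ssc install estout

sysuse auto, clear

* Remove results stored earlier, then store two regressions
eststo clear
eststo: regress price mpg
eststo: regress price mpg weight

* Show both regressions side by side, with standard errors in parentheses
esttab, se

* Export the same table; the format is taken from the file extension
esttab using "results.csv", replace se
```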
saved results Sometimes it is necessary to read a particular result of a
regression into a variable, so that this result can be used in some other place
in your do-file. One typical example is that you have estimated a regression
that contains the variable important as an explanatory variable and want to
store the estimated coefficient of this variable in a variable coeff_imp. To do
this you would have to type gen coeff_imp=_b[important]. See the manuals
for an overview of all saved results from an estimation that can be accessed
in this way.
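A short sketch of how saved results can be accessed after a regression, assuming the example dataset auto, with mpg playing the role of the variable important:

```stata
sysuse auto, clear
regress price mpg weight

* Store the estimated coefficient of mpg in a new variable
generate coeff_mpg = _b[mpg]

* Coefficient, standard error and number of observations
display _b[mpg]
display _se[mpg]
display e(N)

* List all results that the last estimation command has saved
ereturn list
```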
