
A Short Introduction to STATA

1) Introduction:

This session shows how to move from theoretical equations to tangible results using Stata.
Stata is a statistical package with a wide range of capabilities, including data management,
statistical and econometric analysis, and graphics. The user interface includes the following windows
(see Figure 1):
• Command Window (highlighted in red): the window where we type all commands;
• Results Window (highlighted in blue): the window that displays all the results and output generated
by the commands we have typed;
• Variables Window (highlighted in orange): the window that shows all the variables currently stored
in Stata’s memory. We can view these variables as in a spreadsheet by typing browse (br) in the
Command Window, followed by the variables to be displayed (if no variables are specified, Stata
shows all of them). To make changes to the data, we type edit in the Command Window instead.
• Command History (highlighted in green): the window that keeps a record of all the commands used
in each session.
• Current Working Directory (highlighted in black): the window that shows the directory on your
computer from which Stata reads files and to which it saves them. It can be changed by typing
cd path_to_the_new_directory in the Command Window (e.g. cd c:\desktop\Stata11\session1,
or cd "c:\desktop\Stata11\session 1" if the path contains a space), or from the Stata menu:
File/Change Working Directory.

Figure 1: Stata User Interface


2) Some Basic Commands:

To clear all the data held in Stata’s memory from the last session, we can type clear in the Command
Window.

When we need to learn the use of a command, like what options it allows, or to see some examples of
its uses, we can type help name_of_the_command or findit name_of_the_command in the Command
Window. Try help reg and findit reg, and see the differences.

If we are not sure about the name of the command we need, we can type search followed by a keyword
instead.

Any line in Stata that begins with a star (*) is treated as a comment and is not executed by Stata.

Stata can also be used as a calculator with the command display (e.g. display 4+5).
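
As a minimal sketch, the commands of this section can be tried one after another in the Command Window. The keyword passed to search and the second display expression are illustrative choices, not part of the original session.

```stata
* A line that starts with a star is a comment and is not executed
clear
help regress
* search takes a keyword when the command name is unknown
search regression
* display works as a calculator; this line prints 9
display 4+5
* functions can be used as well; this line prints 4
display sqrt(16)
```

Only the star form of comment is used here because it is the form this handout describes for the Command Window.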

3) Entering Data:
I. Input from .xls or .xlsx files

Suppose the original data source is an Excel file or workbook.

Econ526 students may recognize the data set from C. Dougherty’s textbook Introduction to
Econometrics; its file name is eaef21.xls. The command to import it into Stata is

import excel using eaef21, firstrow case(lower)

Here, excel cannot be omitted, because import also reads other formats, such as text files.
firstrow tells Stata to treat the first row of the Excel file as variable names. In this file they
are all in upper case, so case(lower) is added to convert the variable names to lower case. Variable
names are case sensitive in Stata: a capital letter and the same lower case letter name different
variables. Likewise, case(preserve) keeps the names unchanged from the Excel file, and case(upper)
gives upper case names.
II. Input from .csv files

A .csv file differs from an .xls file in that its data are separated by commas. Taking the same
data set as an example and saving it as a .csv file, you can use the following command to load it:
import delimited using eaef21.csv

Here, you don’t need to specify firstrow or case(lower): the first row of the .csv file serves as
the variable names, and they are converted to lower case automatically. Because a .csv file already
separates the data with commas, it is easier for Stata to pin down the data structure, so the
command is simpler. Another way to load a .csv file is the older command insheet:

insheet using eaef21.csv

These two commands yield the same result. Starting with Stata 13, insheet has been replaced by the
newer command import delimited, so if you are using an older version, use insheet. It still works in
up-to-date versions of Stata, although its help file may no longer be updated.

III. Input from .txt files

A .txt file may look like this:

This data set, "earnings", is taken from R. Davidson and J.G. MacKinnon, Econometric Theory and Methods,
New York, Oxford University Press, 2004. The first column is the observation number; columns 2 to 4 are
dummy variables for individuals in groups 1, 2 and 3 respectively. The last column is average annual
earnings in 1988 and 1989, measured in 1982 US dollars. There are no variable names in the first row,
so you have to supply the variable names yourself. The command for reading such .txt files is infile:

infile obs d1 d2 d3 earnings using earnings.txt

where obs is the variable name for the observation numbers, and d1, d2, d3 and earnings are the names of
the remaining columns, in order.
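
A quick check after reading a file with infile is to look at the first few rows and at summary statistics. The sketch below assumes earnings.txt is in the current working directory; the clear option and the choice of five rows are additions for illustration.

```stata
* Read the raw text file, replacing any data already in memory
infile obs d1 d2 d3 earnings using earnings.txt, clear
* Inspect the first five observations
list in 1/5
* Confirm that all five variables were read as numbers
summarize obs d1 d2 d3 earnings
```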
IV. Miscellaneous

It is also easy to generate a variable holding the observation numbers in a given data set:

gen n = _n

gen is short for generate, n is the name of the new variable, and _n is Stata’s built-in counter for
the current observation number. As an example, let’s regress earnings on the two dummies d1 and d2.

reg earnings d1 d2

If you want to run the regression without the first 500 observations, just add if _n > 500 to the
command:

reg earnings d1 d2 if _n > 500
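
A minimal sketch of how _n relates to the total number of observations _N, and of an in range that selects the same subsample as if _n > 500. It assumes the earnings data are loaded, contain more than 500 observations, and that gen n = _n has already been run.

```stata
* _N holds the total number of observations in memory
display _N
* List the observation number alongside earnings for the first five rows
list n earnings in 1/5
* Same subsample as if _n > 500, written with an in range (l stands for last)
reg earnings d1 d2 in 501/l
```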

Since referring to a specific observation with _n is quite handy, we don’t really need the variable obs
in our data set. To delete it, use drop:

drop obs

You can drop variables, and you can also drop some of the observations. Before we do that, let’s
preserve the data first so that we can restore it easily after this destructive trial.

preserve

drop if _n <=1000

restore

After the second command, Stata reports that 1,000 observations have been deleted. Once you have
preserved the data you can restore it, but only once per preserve! The reverse operation of drop is
keep.

keep earnings is equivalent to drop n d1 d2 d3

So that you do not forget what a particular variable measures, label it:

label var earnings "Average annual earnings"

var stands for variable, and the text inside the quotation marks is the label.

Stata stores its own data sets on the hard drive as .dta files. To open an existing data set, use the
following command:

use earnings

As in every case above, earnings.dta must be in the current working directory. Stata also ships with
27 example data sets of its own (in version 14). They remain available as long as your Stata
installation is intact, and they are used repeatedly as example data in Stata’s User Reference
Manual, which I highly recommend to anyone who wants to learn more. Please type

sysuse dir

to form an initial impression of these data sets. The command to load any of them is sysuse
(e.g. sysuse auto).
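
As a short illustration, the sketch below loads the auto data set mentioned above and takes a first look at it. The clear option and the two variables summarized are choices made here for the example.

```stata
* Load the shipped auto data set, replacing any data in memory
sysuse auto, clear
* Show variable names, storage types and labels
describe
* Summary statistics for two of its variables
summarize price mpg
```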

4) Exploring the Data:

Several commands help us explore and understand the data better. Type the following command to load
the NLSW88 dataset (National Longitudinal Survey of Women in 1988):

webuse nlsw88

or, if you need to clear data already in memory,

webuse nlsw88, clear

Now, try the following commands and see the differences between them:

describe

describe wage age

summarize wage

sum wage

summarize wage, detail

sum wage, de

list age race married

list age race married in 1/10

codebook wage

inspect wage

tab race collgrad

tab race collgrad, nolabel

tab race collgrad if wage>16.5

Note that when we add if followed by a condition (e.g. wage>16.5), the command is executed only
for those observations in the dataset that meet this condition.
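
Conditions after if can be combined with & (and) and | (or), and equality is written with a double equals sign. One caveat worth knowing: Stata treats a missing numeric value as larger than any number, so a condition such as wage>16.5 is also true for observations where wage is missing. The sketch below assumes the NLSW88 data are loaded; the particular conditions are illustrative.

```stata
* Count the observations that satisfy a condition
count if wage > 16.5
* Combine conditions with & (and); test equality with ==
summarize wage if collgrad == 1 & married == 1
* Exclude missing values explicitly when using >
tab race collgrad if wage > 16.5 & !missing(wage)
```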
5) Visualizations
A. Histograms

To see the distribution of a variable graphically, we use the command histogram or hist.

For example, type histogram wage in the Command Window, or hist wage, normal if you would like to
overlay a normal density curve. You should see the following picture.

[Figure: histogram of wage; horizontal axis: hourly wage (0 to 40); vertical axis: Density (0 to .15)]

The picture shows that wage is right skewed.

B. Scatter Graphs

graph twoway scatter wage tenure

graph twoway (scatter wage tenure)(lfit wage tenure)

We use lfit to overlay the linear prediction (fitted line) of wage on tenure.

scatter wage tenure

scatter wage tenure, by(race)

Note that in the context of graphs, by is used as an option (after a comma) rather than as a prefix.
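
For comparison, here is by used as a prefix with a non-graph command. A minimal sketch, assuming the NLSW88 data are loaded; bysort sorts the data by race first and then repeats the command for each group.

```stata
* by as a prefix: summarize wage separately for each race group
bysort race: summarize wage
* by as an option: one scatter panel per race group
scatter wage tenure, by(race)
```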

C. Matrix Graphs

graph matrix wage tenure hours

D. Box Graphs

graph box wage, over(race)

The following picture will be generated:


[Figure: box plots of hourly wage (0 to 40) for the groups white, black and other]

From the picture, it seems that the median wage does not differ much among the three groups,
even though whites have more high-wage outliers.

6) An OLS regression:

To run an OLS regression we can use the command regress or, in short, reg, followed by the dependent
variable (the one we want to explain) and the independent variable or variables (the ones that we
suspect explain the dependent variable). For example, the following command runs a regression of wage
on tenure, collgrad, and married:

reg wage tenure collgrad married

After running a regression, Stata temporarily stores some useful items (until another regression is
run). For example, we can generate the residuals of the regression with the command predict:

predict myresids, residuals

Residuals of the aforementioned regression are then saved in the variable myresids. Are the residuals
correlated with other variables that are perhaps missing from the regression? Use the command
correlate or a scatter graph to check this.
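
A minimal sketch of such a check, assuming the regression and the predict command above have just been run on the NLSW88 data. The variables age, hours and grade are chosen here only as examples of variables left out of the model.

```stata
* Correlations between the residuals and some omitted variables
correlate myresids age hours grade
* Visual check against one of them
scatter myresids hours
```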

7) Hypothesis Testing

Hypothesis testing is straightforward in Stata. For instance, to test whether the coefficient on tenure
equals zero:

test tenure = 0

which gives the result:

( 1) tenure = 0

F( 1, 2227) = 58.18
Prob > F = 0.0000

This is a test on a single coefficient. The joint significance test that the coefficients on collgrad
and married both equal zero is:

test collgrad = married = 0

which gives the result:

( 1) collgrad - married = 0

( 2) collgrad = 0

F( 2, 2227) = 80.20

Prob > F = 0.0000
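
Because the two restrictions above together imply collgrad = 0 and married = 0, the same joint hypothesis can also be stated by simply listing the coefficients, and it should report the same F statistic. A minimal sketch, assuming the regression of wage on tenure, collgrad and married is the most recent estimation:

```stata
* Joint test that both coefficients are zero
test collgrad married
* testparm accepts a variable list and performs the same joint test
testparm collgrad married
```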

The following commands get you fitted values 𝑦̂ and the residuals 𝑢̂

predict yhat, xb

predict u, res

The command that extracts them after the regression is predict; yhat and u are the names of the new
variables, the option xb tells Stata you want the fitted values, and res is short for residuals. You
will find two more variables in your variable list. Finally, all the useful information from the
regression is stored in the e-class (e stands for estimation) returns. Take a look at them by using
the following command after the regression:

ereturn list
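
Individual stored results can then be used in later commands. A minimal sketch, assuming the regression of wage on tenure, collgrad and married is the most recent estimation; the items shown are a small selection of what ereturn list reports.

```stata
* Number of observations and R-squared from the last regression
display e(N)
display e(r2)
* Coefficient vector stored as a matrix
matrix list e(b)
* Individual coefficient and standard error for tenure
display _b[tenure]
display _se[tenure]
```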

8) Extra Resources

http://www.stata.com/links/resources-for-learning-stata/

http://www.stata.com/links/video-tutorials/
