Professional Documents
Culture Documents
Stata Tutorial Dan Edelstein Kristi Thompson Data and Statistical Services
Stata Tutorial Dan Edelstein Kristi Thompson Data and Statistical Services
Stata Tutorial Dan Edelstein Kristi Thompson Data and Statistical Services
Dan Edelstein
Kristi Thompson
Stata is a statistical analysis package, used for exploring, graphing, summarizing and
manipulating data files. There are commands built into Stata that allow the user to do
statistical analysis such as cross-tabulation and regression on data sets.
Data Sets
A data set is just a file organized into rows and columns. The rows represent individual
units, or observations. An observation might be a person, a stock, a district, or a
company. The columns represent variables. A variable is a piece of information that has
been recorded about each observation.
As an example, we will be using a file with some data about people. Each observation is
a person, and the variables include each person's name, income, age, and race.
Stata has its own native format for datasets, and files in this format can be read directly
into Stata with the "use" command or by using the "Open" command on the File menu.
Files in Stata format have the file extension ".dta".
Other file types can be read into Stata, or converted to Stata format using conversion
programs, depending on the type of file. The data lab can provide assistance.
Problem Spreadsheet
Cleaned Spreadsheet
For rest of this tutorial we will be using a sample spreadsheet, sample.csv, that can be
downloaded from the AP501 blackboard site. Start by downloading this spreadsheet
from the web site and save it to your H: drive. In this tutorial, commands you should
type will be show like this:
.stata command
Output that stata gives you will be shown below the commands, on lines that do not start
with periods.
Commands, logs, help
Commands can be executed one at a time at the Stata prompt (command window in Stata
for Windows). Just type the command and hit the "Enter" key. Alternatively, groups of
commands can be entered into do files which can then be executed.
If you are not sure which command you need, you can type "search" and a keyword. Stata
will display a list of commands and other resources associated with that keyword, if there
are any. Click on the name of one of the commands or resources to display the help
screen.
Alternatively, you can use the "Help" menu. Click on "Stata Command" if you know the
command or "Search" if you don't.
Opening Stata
Stata starts in its default working folder or directory, typically C:\data or C:\stata. If you
don't change to something else, Stata will assume that any file name you type is in the
default directory. Since normally your data will be in other directories, you need the cd
(Change Directory) command:
. search cd
. cd "h:\"
"h:\" is where you saved the data spreadsheet. Now that you have told stata to work there,
you can read it in with a simple "insheet" command. But before you do that, you need to
open a log file:
Because you first cd'd to your h:\ drive, the log file "stata1.log" will be saved there.
Everything you type into Stata (commands) and all the output Stata gives you will be
saved in this file, exactly as it appeared on the screen. This is important for two reasons:
1. You will have a record of your results to refer to and paste in to your report when
you write up your analysis.
2. You will be able to see exactly what happened if you realize you made a mistake.
A log file with extension ".log" is a plain text file. This means you can open and read it in
MS Word or almost any other program. If you just issued the command
Stata would create the file "stata1.smcl", which is a file type that can only be opened by
Stata. This is usually inconvenient, so don't.
Describing and listing data
Now that a log file is open, read the data into Stata, then use the "describe" command to
find out something about it.
. describe
The describe command gives the names of each variable in your dataset and also gives
the variable type.
There are two types of variables in Stata: numeric and string. String variables are
indicated by variable types that begin with "str", followed by a number indicating the
maximum length of the string: str9, str5, etc. Different sizes of number have different
labels in the variable type column: byte, int, long, double, float. Numeric variables are
simple – they contain numbers. String variables contain text which can contain any
characters on the keyboard: letters, numbers, and special characters. We can do numeric
calculations and statistical analysis on numeric variables – we can’t on string variables.
So it's important to check your data to make sure that variables that should be numeric
actually are!
The list command lists values of the different variables in your dataset.
. list name
. list in 1/2
Lists the values of all variables for the first two observations. The lower number / higher
number syntax can be used with other commands too.
. list in 5/10
. list
Lists the value of every variable for every observation in the dataset-- usually not a good
idea!
. l sex income
Lists the value of sex and income for every observation. l (the letter L) is an abbreviation
for list.
. l name if income<50000
Lists the names of people with an income of less than 50,000. Note the use of the less-
than symbol to specify a condition. We'll see more of this later.
Two commands that are useful for getting basic descriptive statistics for your variables
are summarize and tabulate (abbreviated sum and tab respectively). sum gives the
number of valid observations, mean, standard deviation, minimum and maximum values
for any variables you specify. You can do the entire dataset at once:
. sum
The tabulate (or tab) command gives you a frequency distribution (for one variable) or a
crosstabulation (for two).
. tab sex
The summarize command can be combined with the tabulate command to produce
summaries of one variable for each value of another. The following table shows separate
summaries of income for males and females.
. tab sex, sum(income)
| Summary of income
sex | Mean Std. Dev. Freq.
------------+------------------------------------
Female | 27100 12251.531 10
Male | 45444.444 24734.142 9
------------+------------------------------------
Total | 35789.474 20868.847 19
Same as the earlier crosstab, but calculates Pearson's chi-squared for the hypothesis that
race and sex are independent.
| race
sex | White Black Other | Total
-----------+---------------------------------+----------
Female | 5 3 2 | 10
| 50.00 30.00 20.00 | 100.00
| 50.00 42.86 66.67 | 50.00
-----------+---------------------------------+----------
Male | 5 4 1 | 10
| 50.00 40.00 10.00 | 100.00
| 50.00 57.14 33.33 | 50.00
-----------+---------------------------------+----------
Total | 10 7 3 | 20
| 50.00 35.00 15.00 | 100.00
| 100.00 100.00 100.00 | 100.00
(Notice that we've started getting into performing statistical tests!) The tab command
usually only makes sense for categorical variables. Trying to tab an income variable with
hundreds of different values wouldn't be very enlightening. Stata sensibly refuses to do a
two-way table when the table would be excessively large.
Statistical tests
To test whether there is a significant difference in the means between two groups, use the
command ttest:
The by() construct is used to specify the variable that divides the dataset into two groups.
If the sex variable had three categories, say Male, Female, and Unknown, the ttest
command would not work. However, you could give the command
regress (abbreviated reg) is the command that runs simple linear regression. A binary
regression of income on age would look like this:
------------------------------------------------------------------------------
income | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | -78.67952 326.4362 -0.24 0.812 -767.3997 610.0406
_cons | 38891.1 13776.23 2.82 0.012 9825.808 67956.4
------------------------------------------------------------------------------
A multivariate regression might include sex as a control variable, as men and women
have different mean incomes. However, running
no observations
This is because sex is a string variable, with values of Male and Female. To include sex
in this regression, we need to modify the variable to make it useable.
The basic commands for creating new variables and modifying old ones in Stata are
generate (abbreviated gen), egen and replace.
. gen one = 1
. gen two = 1+1
. gen three = one+two
. gen bmi = (weight/(height*height)) * 703
. gen approx_health_score = round(health_score,1)
There is another command, egen, that is also used to create new variables. egen works
with a different set of functions than gen. There is no particular logic as to why there are
two variable creation commands - it's just an oddity that has to do with the way Stata was
written. Most egen functions work across all observations to produce variables that
summarize other variables. For example:
For information on what gen functions there are, search for "functions" in Stata's online
help. For information on egen functions, search for "egen".
The if qualifier
The if qualifier is used to isolate a set of observations with variables meeting some
particular criteria. Values on variables in a dataset are compared to values on other
variables or to numbers or strings using logical comparison operators. This is very often
used to create "dummy variables", 0-1 indicators used to indicate whether something is
true or false.
Operator Meaning
== equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to
!= or ~= not equal to
Pay special attention to that double equals sign! If you are testing for equality, use a
double equals sign (==). A single equals sign (=) is used to set something equal to
something else.
For example, if you want to list all information for a person in your dataset whose first
name is Sarah, you would type:
. list if name=="Sarah"
To display the names of people with an income less than or equal to $40,000:
if on its own is useful if you are interested in testing for only one thing at once, such as
the condition "Female". But let's say you want to find out the mean income for women in
your dataset between the ages of 25 and 34. What you need to do is take this series of
tests and combine them with the and operator, &.
Note that the if statement is included only once, and then the tests are simply stated one
after another. Also note that you need to write out the entire test statement each time:
is not allowed.
If you want to look at cases where at least one of two or more conditions is met, the or
operator | is needed.
. drop child_of_immigrant
Notice that this time a test for missing birthplaces was included.
This pair of commands looks correct, but is actually wrong, for a subtle, Stata-specific
reason. A peculiarity of Stata is that numerical missing, represented as a period (.), is
internally treated as an infinitely large number, the highest number possible. So if you are
testing for values greater than some number, missing values will always be included. This
can produce very strange results -- for example, people who show up as being both senior
citizens and in high school, because their age was missing. The correct sequence of
commands is:
The moral is, always check your variable creation statements and then ask yourself,
"What is happening to the missings?"
Replace can be used with all gen functions to change the values of existing variables, but
not with egen functions. However, you can use replace to modify variables created by
egen as well as those created by gen.
You normally want to use replace for second and later steps in multi-step variable
creations, just as we used it here. It is bad practice to "write over" existing variables,
because if you make a mistake there's no way to get the original data back. For example,
even if you decided that you only cared about health scores rounded to the nearest
integer,
is not recommended. It's always better to create a new variable. Stata doesn't have an
"undo" command, so once something is gone, it's gone!
Finally, we're ready to return to the sex and regression problem. Variables like sex, where
the values have no numeric meaning, are called categorical. Variables like this are
included in regressions by creating dummy variables, which are variables with the values
of 0 and 1. For sex, we create a variable called female with a value of 1 meaning female
and a value of 0 meaning not female (or male).
. tab sex
. tab race
You can save your data in Stata format using the save command.
. save sample
You can then open the Stata data set with the use command.
. use sample
Graphs
While Stata's graphs are often not the most visually appealing, they can be quite useful
for diagnosing data problems and for illustrating the results of analyses. For discovering
the different kinds of graphs available, it is often helpful to explore the menus, one of the
cases where Stata's menu interface is really helpful.
The basic command for drawing a bivariate graph in Stata is twoway. The command
twoway is followed by a keyword indicating the type of graph. To do a scatter plot
showing the relationship between age and income, type
To use the menus, click on graphics, twoway graph (scatterplot, line etc), then choose
scatter from the list on the right and select age and income as the x and y variables.
Twoway graphs can be overlaid; that is, you can draw two twoway graphs on the same
set of axes. A common use of this is to draw a scatterplot with a linear fit line laid
overtop of it to show how the regression line fits the data.
To use the menus, click on graphics, overlaid twoway graphs, then choose a scatterplot
as above for plot 1 and a linear prediction with age and income as the x and y variables
for plot 2.