Stata Basics
Lecture 1
Log File
Do File
Do-files are a very important feature of Stata. A do-file lets you save all your commands in a small text file
that you can run later to reproduce all your results (the do-file itself has to be created manually).
You can open a Do-file window from Window>Do-file Editor>New Do-file Editor, or by clicking the
shortcut button on the toolbar.
Help
1. You can copy and paste data into Stata from an Excel file (.xls extension) using the Data Editor.
2. You can open the Data Editor from Data>Data Editor>Data Editor (Edit). You can also type “ed”
in the Command window, or click the shortcut button on the toolbar.
3. Import data: File>Import>Excel Spreadsheet (important: treat the first row as variable names).
4. In the case of a Stata data file (.dta extension): File>Open, then select the file.
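The menu steps above also have command equivalents. The lines below are a minimal sketch; the file names mydata.xls and mydata.dta are placeholders, not files that come with these notes.

```stata
* Import an Excel sheet; firstrow treats the first row as variable names
import excel using "mydata.xls", firstrow clear

* Open a Stata data file; clear discards any data currently in memory
use "mydata.dta", clear
```

Putting these lines in a do-file means the data-loading step is recorded along with the rest of the analysis.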
Data Browser
We usually view data in the Data Browser. It looks the same as the Data Editor, but it does not allow you to
make any changes to the data. You can open the Data Browser from the button on the toolbar or simply type “br”.
Variable types
1. String: this type of variable can contain numbers, letters, and other symbols.
String variables are stored in “str” format. They are highlighted in red and cannot be used by
Stata in regressions unless converted to numeric with the destring or encode command.
2. Numeric: this type of variable can contain only numbers.
Numeric variables come in the following storage types: byte, int, long, float, and double.
If the data contain string variables that hold numbers, an easy way to convert them to numeric variables is
to use the “destring” command:
destring year month date, replace
This will convert the string variables named year, month and date to numeric variables, assuming the strings
really contain numbers.
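When a string variable holds text categories rather than digits, the encode command mentioned earlier is the usual alternative. This is a sketch that assumes a string variable called gender with values such as "male" and "female"; both gender and gender_num are illustrative names.

```stata
* Create a numeric copy of the string variable gender
encode gender, generate(gender_num)

* gender_num holds integer codes, with the original text attached as value labels
tab gender_num
```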
Make sure the variable name does not contain any spaces or symbols (underscores are allowed) and that the
label is written between inverted commas (double quotes).
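The commands this note refers to are not shown in these notes. Assuming it concerns renaming and labelling a variable, a minimal sketch with placeholder names would be:

```stata
* Give an existing variable a new name (no spaces or symbols in the name)
rename VariableName NewVariableName

* Attach a descriptive label, written between double quotes
label variable NewVariableName "Description of the variable"
```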
Dropping a variable
keep is another command for the same purpose. Instead of listing the variables to drop, you list the
variables that you want to retain; all other variables are dropped:
keep VariableName3
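The two commands can be compared side by side. This sketch assumes a dataset that holds exactly three variables with the placeholder names VariableName1, VariableName2 and VariableName3, and the two commands are alternatives rather than steps to run one after the other.

```stata
* Option 1: drop the variables you no longer need
drop VariableName1 VariableName2

* Option 2: name the variable to retain instead (same result on this dataset)
keep VariableName3
```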
The sort command is used to arrange your dataset in ascending order of a given variable:
sort VariableName
Count command
It counts the observations that satisfy specified conditions. A simple count returns the total number of
observations in the dataset. We can also combine the count command with an if condition to narrow the count.
count if VariableName==1
Summarize Command
This command is used to generate summary statistics for a given variable or a set of variables.
A simple summarize command followed by a variable or a list of variables will display the number
of observations, mean, standard deviation, minimum value, and maximum value.
sum VariableName
sum VariableName1 VariableName2
We can also generate additional summary statistics by using the “detail” option.
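The command line for the detail option is not shown in these notes. With the same placeholder variable name as above, it would look like this:

```stata
* detail adds percentiles, variance, skewness and kurtosis to the basic output
sum VariableName, detail
```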
Tabulate Command
A simple tabulate command followed by a single variable will display each distinct value of that variable
along with its frequency, percentage, and cumulative percentage. We can use the “missing” option to
find out the number of missing observations as well.
tab VariableName1
tab VariableName1, missing
A tabulate command followed by two variables will generate a cross-tabulation of those variables. We can
use both string and numeric variables with the tabulate command.
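A two-way tabulation uses the same syntax with a second variable. The variable names below are the placeholders used earlier in this section.

```stata
* Cross-tabulation of two variables (cell frequencies)
tab VariableName1 VariableName2

* Same table with row percentages added, and missing values shown as a category
tab VariableName1 VariableName2, row missing
```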
Generate Command
gen VariableNew=1
This command will generate a new variable named “VariableNew” in which every observation
equals 1. Stata commands are case-sensitive, so the command must be typed in lower case. Note that we
have used a single “equals” sign rather than a double one: a single sign is used when we are assigning
a value, whereas the double “equals” sign is used in an if condition to test for equality.
Generating a variable with an if condition creates a new variable, for example one named “Dummy”, that has
a value of 1 where gender is male. Such a command leaves the rest of the observations blank (missing,
displayed as dots). In order to fill those blanks, we use the “replace” command, in two steps.
In the first step, we replace the rest of the values with zero, which denotes female in the new
variable. Now remember that the variable “Male” may itself contain missing values, i.e. dots,
for some observations. To take them into account, we use a second replace command.
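The commands described in this passage are not shown in these notes. The following is one way the sequence could look; it assumes a string variable called gender with the values "male" and "female" and an empty string where gender is unknown. These names and values are assumptions made for illustration.

```stata
* Step 1: set Dummy to 1 for males; all other observations are left missing
gen Dummy = 1 if gender == "male"

* Step 2 (first replace): fill the remaining observations with 0
replace Dummy = 0 if Dummy == .

* Step 3 (second replace): set Dummy back to missing where gender is unknown
replace Dummy = . if missing(gender)
```

Without the last step, observations with unknown gender would be counted as female.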
gen NewVariableName=log(VariableName)
or, equivalently (both functions return the natural logarithm)
gen NewVariableName=ln(VariableName)
gen NewVariableName=VariableName^2
gen NewVariableName=VariableName1*VariableName2
Regression Commands
reg y x1 x2 x3
where y is the dependent variable and x1, x2 and x3 are the independent variables.
After the regression output, several post-estimation commands can be used, for example to
generate the estimated/fitted values of the dependent variable or the residual/error term:
predict fitted,xb
predict error,residual
In the above commands, “fitted” and “error” are the names of the generated variables that contain the
linear prediction/fitted values and the residuals from the estimated regression model, respectively.
Please note that post-estimation commands will only work after the regression model has been estimated.
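To see the whole sequence run end to end, the sketch below uses the auto example dataset that ships with Stata. The choice of dataset and of price, mpg and weight as variables is for illustration only and is not part of the original notes.

```stata
* Load the example dataset installed with Stata
sysuse auto, clear

* Estimate the model first: price on mpg and weight
reg price mpg weight

* Post-estimation: fitted values and residuals from the model just estimated
predict fitted, xb
predict error, residual

* Check the new variables; with a constant in the model the residuals average about zero
sum fitted error
```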
Typing “corr” by itself produces a correlation matrix for all variables in the dataset. If you specify the list
of variables, a correlation matrix for just those variables is displayed.
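The two forms described above look like this, using placeholder variable names. Note that corr drops any observation with a missing value in one of the listed variables.

```stata
* Correlation matrix for all variables in the dataset
corr

* Correlation matrix for a chosen list of variables
corr VariableName1 VariableName2 VariableName3
```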
Another useful feature of Stata is that regression output can be exported to Excel or Word
format as tables of the kind generally shown in published papers.
Stata can also combine the output of multiple regressions in the same table if the same file name is given
for each regression output to be combined, followed by append.
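These notes do not show which export command is meant. One widely used option that has an append option is the user-written outreg2 command, so the sketch below assumes outreg2; the file name results.xls and the auto dataset are placeholders for illustration.

```stata
* Assumption: outreg2 is user-written and installed once with: ssc install outreg2
sysuse auto, clear

* First model: create (or overwrite) the output file
reg price mpg
outreg2 using results.xls, replace

* Second model: same file name plus append adds a column to the same table
reg price mpg weight
outreg2 using results.xls, append
```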