SIDM2 Preparing Data For Stata Analysis 2017
It is essential to create and save a do file of commands, otherwise you will lose your work. Attention needs
to be paid to the presence of missing data, which Stata treats as infinitely large positive values.
You learnt to run commands from this do file as you went along, and to open and save datasets,
log files and graphs. You learnt to distinguish between different types of data. You learnt to use variable
labels to give a fuller description of each variable, and value labels to define categories (when coded
numerically, so that we know what the numbers represent). You learnt basic graph and table commands, how to
create new variables and amend their values, and how to use if statements.
This document describes how to read data into Stata in the first place. There is a recap on the different types of
data, and on why some variables often need to be changed between formats before analysis can begin. It
gives example code, and exercises (with solutions available), so that you can see how these commands are used in
practice.
Further topics: merging datasets in many different ways, reshaping datasets, looping in Stata, extracting saved
results into files, and efficient production of publication-quality tables.
This series teaches most of the material contained in Stata Data Management.doc, referred to as SDM. The
accompanying Stata Commands Crib Sheet.xls, referred to as SCCS, acts as a quick reference guide (and also
summarises some data analysis commands). The Stata manuals (accessed online and via help) and Stata help itself
are both excellent resources. The manuals teach statistics as well as Stata, and provide statistics references.
Contents
1. Reading data into Stata from other files
2. Recap on types of data
3. Converting strings to numeric and categorical data as necessary
4. Dealing with Dates
5. Checking for errors and missing data
6. When your dataset erroneously has 2 or more lines of data for a few patients
7. Recoding numeric data into groups
8. Extracting information from string variables
9. Further sources of help
SDM=Stata Data Management.doc
Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops
SCCS=Stata Commands Crib Sheet.xls 2.1
SIDM2 – Preparing data for analysis
1. Reading data into Stata from other files
cd "H:\_MPHTeaching\stata\dataman\wk2"
clear
import excel census, firstrow /* read data in from the Excel file census.xls; firstrow indicates that the first row is
treated as variable names or labels */
descr // look for string variables, storage type = str##
browse // look for string variables which appear in red
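Data often arrive as comma-separated text rather than Excel. The following is a minimal sketch, not part of the original handout: it uses the auto dataset shipped with Stata, and auto_copy.csv is a hypothetical file name written to the current working directory. It shows how a labelled categorical variable comes back as a string after a round trip through a text file.

```stata
sysuse auto, clear                                // example dataset shipped with Stata (an assumption, not the census file)
export delimited using "auto_copy.csv", replace   // write a CSV; value labels are written as text by default
clear
import delimited using "auto_copy.csv", varnames(1)  // varnames(1): first row holds the variable names
describe                                          // foreign is now a string variable (storage type str#)
```

After the import, make and foreign are string variables, so foreign would need to be converted back to a categorical variable before analysis.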
Student Exercise:
Read nlsw88dates2.xls into Stata. View the data, looking at the types of variables and seeing what it contains.
Summarise the data.
Here is some further information on what is contained in this data set, with commands for labelling appropriately:
label var grade "Current grade completed"
label var c_city "Lives in central city"
label var wage "Hourly Wage"
label var south "Lives in South"
label var union "Union Worker"
label var hours "Usual hours worked"
label var ttl_exp "Total Work Experience"
label var tenure "Job Tenure (years)"
label var quesday "Day of month that questionnaire was filled in (started to be filled in)"
label var quesmon "Month that questionnaire was filled in (started to be filled in)"
label var quesyr "Year that questionnaire was filled in (started to be filled in)"
label var quesfinish "Date that questionnaire was completed"
2. Recap on types of data
It is usually necessary to have data in numeric format in order to use it in Stata data analysis and most graph
commands. This includes dates and times in a Stata format recognised as such (showing up in black in the data
editor) and categorical data (showing as blue and looking like text). The main exception is a patient ID variable (or
hospital IDs, regions or similar), where it is generally okay to use a string variable.
Dates and categorical data are often read into Stata as string variables; this is generally the best way to do it. Hence
the need to recode them into different types of variables. You may also want to recode numeric data into
categories, and to recode categorical variables, perhaps by combining categories.
3. Converting strings to numeric and categorical data as necessary
*** converting from string variables to numeric variables (when most/all values already look like numbers)
destring medage2, replace // converts string to numeric data, keeping the same variable name
tab medage2 medage, miss // check new variable against another variable that looks the same
scatter medage2 medage // they are identical
summ medage medage2 // identical also in number of missing values
list if medage2==. // identical also in missing values, they are for the same observation
drop medage2 // we don't need 2 identical variables
encode state2, gen(state_2) // create categorical var (state_2) from string var, state2
tab state2 state_2 // compare the string version with the new categorical version
browse state2 state_2 // easier to compare this way
codebook state_2 // gives examples of coding
label list state_2 // gives the full numeric correspondence between numbers and value labels
drop state2 // string version is no longer needed
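Real string variables are often messier than the examples above. The following is a minimal sketch on made-up data (the variables income and sex are hypothetical, not from the workshop files). It shows destring coping with thousands separators and non-numeric entries, and string functions tidying inconsistent text before encode.

```stata
clear
input str8 income str10 sex
"1,200" "Male"
"950"   "male "
"n/a"   "FEMALE"
"2,050" "female"
end
* ignore(",") strips the commas; force sets remaining non-numeric values ("n/a") to missing
destring income, gen(income_num) ignore(",") force
list income income_num              // always check what force has set to missing
* tidy the text first, otherwise encode creates one category per spelling
gen sex_clean = lower(trim(sex))    // remove leading/trailing blanks, convert to lower case
encode sex_clean, gen(sex_cat)      // categories are numbered in alphabetical order
tab sex sex_cat                     // compare original strings with the new categories
label list sex_cat                  // numeric codes and their value labels
```

The force option is convenient but silently discards information, so it is safest to list the affected observations and confirm that they really are missing values.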
Student Exercise:
a) Look for string variables in nlsw88dates2.xls that look numeric/ as if they should be numeric. Create numeric
version of (one or more of) the variables.
b) Look for categorical variables and convert (some of) them also to numeric variables. For instance, do this for
industry and race. Does encode command work well for both/all? If not, then what approach shall we take?
c) Try help string function and decide whether it is a good strategy to use one or a few string functions to tidy
up the variables before using the encode command. For instance, could take just the first character and
change to lower case.
4. Dealing with Dates
gen dateofsurvey3=date( dateofsurvey, "DMY") /* converts from string to Stata date variable, string ordered day
month year DMY */
browse dateofsurvey dateofsurvey3 // dates are coded as number of days from a fixed date
format dateofsurvey3 %d /* display the date variable in a format that we can understand as a date (not a number)
*/
browse dateofsurvey dateofsurvey3 // check the date has been correctly recoded by Stata
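Dates may also arrive as separate day, month and year variables, and once in Stata date format they can be subtracted and compared. The following is a minimal sketch on made-up data (all variable names and values are hypothetical). It also illustrates why missing dates need care in comparisons, since Stata treats missing values as larger than any number.

```stata
clear
input day month year str12 finish
15 3 2011 "02/05/2011"
30 4 2011 ""
12 6 2011 "20/06/2011"
 . 5 2011 ""
end
gen qdate = mdy(month, day, year)        // note the argument order: month, day, year
gen fdate = date(finish, "DMY")          // string date ordered day month year
format qdate fdate %td                   // display as dates rather than numbers of days
gen gap_years = (fdate - qdate)/365.25   // interval in years (approximate)
list
count if qdate < mdy(4, 30, 2011)                    // dates before 30 April 2011
count if qdate > mdy(4, 30, 2011) & !missing(qdate)  // dates after; exclude missing explicitly
```

Without the !missing(qdate) condition, the observation with a missing date would be counted as being after 30 April 2011.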
Student Exercise:
a) Look at the nlsw88dates2 data set. Which variables look like dates, but are currently string variables? Change
these to Stata date format.
b) Now create a Stata date variable from the 3 variables which give questionnaire day, month and year.
Remember the function mdy for month, day and year. Use help date function and find it if necessary.
c) Find the time interval in years between questionnaire date and the date that the questionnaire was finally
filled in (quesfinish – for the few people where this is not missing).
d) Count how many questionnaire dates are before 30 April 2011. Count how many dates are after this date.
e) There are more date commands described in SDM chapter 7 Dates and time. Within Stata, type help
function and click on date and time functions. You will see more options here.
5. Checking for errors and missing data
* remember that last time we also checked that total populations added up; we can also check that values are
* smaller than the total population, and similar
* there are no obvious cross-tabulations to check here (e.g. do we have pregnant men? Do we have non-smokers
* with 10 cigs/day?)
Summarise all variables, and look at maximum and minimum values. Do they all appear to be valid values? Look at
histograms of continuous data to see if there are outlying values, and to see what the distributions look like. Do not
recode outlying values to missing (unless you are pretty confident that they are errors and report that you have done
this).
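The checks described above might be sketched as follows. This is an illustration only: it uses the nlsw88 dataset shipped with Stata rather than the workshop file nlsw88dates2, and the threshold of 80 hours is an arbitrary assumption.

```stata
sysuse nlsw88, clear                   // example dataset shipped with Stata (not the workshop file)
summarize                              // scan minimum and maximum values of every variable
misstable summarize                    // number of missing values per variable
histogram wage                         // look for outlying values and the shape of the distribution
count if hours > 80 & !missing(hours)  // implausibly long weeks; missing excluded explicitly
* consistency checks between related variables
count if married == 1 & never_married == 1             // should not both be true
count if tenure > ttl_exp & !missing(tenure, ttl_exp)  // job tenure longer than total experience
```

Counting and listing suspicious observations first, rather than recoding them straight away, leaves the decision about whether they are genuine errors to you.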
Note that for duplicates, the egen or the collapse commands, and explicit subscripting, can be useful when we want
data contained in both duplicates. See SIDM 3 section 3.6, i.e. next week’s class.
Student Exercise:
a) Check for further errors in nlsw88dates2. Do any numeric values look impossible, and need recoding to
missing?
b) Can you think of any variables that need to be consistent with each other, where you can check for this?
c) Read SDM chapter 6 and see if you can think of any times when these errors might apply.
d) Note that when you were creating new variables above you will always be looking out for errors as you go
along, in order to create appropriate values.
6. When your dataset erroneously has 2 or more lines of data for a few patients
help duplicates lists commands that help you to tidy up your data in situations like this. See SIDM3 for more details
and for other ways of dealing with these duplicates.
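As a minimal sketch of the duplicates commands, on made-up data (the variables id and age are hypothetical): patient 2 has two identical lines, while patient 3 has two lines that disagree.

```stata
clear
input id age
1 34
2 41
2 41
3 29
3 30
end
duplicates report id           // how many ids appear once, twice, ...
duplicates list id             // list the observations with a repeated id
duplicates tag id, gen(dup)    // dup = number of extra copies of each id
list if dup > 0                // inspect before deleting anything
duplicates drop                // drops only lines identical on ALL variables (second copy of id 2)
duplicates report id           // id 3 still has two conflicting lines to resolve by hand
```

Running duplicates drop without a variable list is the cautious choice, because it never discards a line that carries different information from the one that is kept.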
7. Recoding numeric data into groups
xtile marriage_bin=marriage, n(2) // divide number of marriages into 2 groups by the median
tab marriage_bin
summ marriage marriage_bin // check both have same amount of missing data
label list region2 // want to recode to get a binary variable for North/ South
recode region2 (1 2=1) (3 4=2), gen(region_bin) /* recode is very flexible for recoding individual values and ranges
of values */
label define region_lbl 1 "North" 2 "South/West" // adding informative labels - firstly we define the value label
label values region_bin region_lbl // now we attach the newly created value label to the values
tab region2 region_bin // now we tabulate against the original value that we recoded
tab state region_bin // tabulating against state - but not ideal since South/West contains some North West states
label list
tab state if region2==4 // list states in West region
tab state_2 if region2==4, nolabel // list numeric values of state_2 for states in West region
help recode // look for further details of command
recode state_2 (1 37 47 13 26 50 =1) (nonmissing=0), gen(northwest) /* create a variable specific for North West
states (=1 for them, =0 otherwise) */
tab state northwest // check what we have done
codebook region_bin // check coding of region_bin variable
gen south=region_bin-1 // want variable south=1 for southern states, =0 for northern states
replace south=0 if northwest==1 // use newly constructed northwest variable to recode as appropriate
tab state south // check the result
label define yesno 0 "No" 1 "Yes" // this is standard coding, commonly used for binary variables
label values south yesno // attach the newly defined value label to the variable south
label values northwest yesno // attach the same value label to the variable northwest
descr // shows names of value labels allocated to each variable (blank for variables with no value labels)
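Grouping a continuous variable follows the same pattern of recode, label, then check. The following is a minimal sketch using the nlsw88 dataset shipped with Stata (an assumption; the workshop file may differ), with hypothetical new variable and label names.

```stata
sysuse nlsw88, clear                   // example dataset shipped with Stata (not the workshop file)
xtile hours3 = hours, nq(3)            // 3 groups of roughly equal size
label var hours3 "Usual hours worked (thirds)"
label define thirds 1 "Lowest third" 2 "Middle third" 3 "Highest third"
label values hours3 thirds
tab hours3, miss                       // group sizes are unequal when many values are tied
* fixed-width groups: each group is labelled by its lower cut-point
egen age5 = cut(age), at(30, 35, 40, 45, 50)
tab age age5, miss                     // check every age has landed in the intended group
```

With egen's cut function, values outside the range of the cut-points are set to missing, so the final tabulation is worth checking for unexpected missing values.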
Student Exercises:
a) Create a new categorical variable for usual hours worked into 3 groups of roughly equal sizes, then label the
variable and its values (command xtile).
b) Create a new binary variable for age, using the median as a cut-off
c) Create a categorical variable for age, divided into 5 year age groups.
d) Create a categorical variable containing age in quintiles.
e) Create an ethnic group variable which is 1 for whites and 0 for other ethnic groups.
f) Create a new variable from industry, recoding into fewer categories, according to what you think is sensible.
8. Extracting information from string variables
help string function – read through the list if you need to do this, e.g. to extract the first letter of the state.
These functions also allow you to tidy up strings (e.g. remove leading and trailing blanks, convert all to lower case).
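A few of these string functions can be sketched on made-up data (the variable state and its values are hypothetical):

```stata
clear
input str20 state
"  New York"
"texas  "
"OHIO"
end
gen state_tidy = lower(trim(state))    // remove leading/trailing blanks, convert to lower case
gen first = substr(state_tidy, 1, 1)   // extract the first letter
gen state_name = proper(state_tidy)    // capitalise the first letter of each word
list
```

Tidying before extracting matters here: taking the first character of the untidied variable would return a blank for any value with leading spaces.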