
SIDM2 – Preparing data for analysis

Preparing Data for Analysis in Stata


Before you can analyse your data, you need to get your data into an
appropriate format, to enable Stata to work for you. To avoid rubbish results,
you need to check your data is sensible and free of nonsense values.

Recap on preceding workshop, SIDM1

It is essential to create and save a do file of commands, otherwise you will lose your work. You need to pay
attention to missing data, which Stata treats as larger than any number (in effect, as infinitely large positive values).
You learnt to run commands from this do file as you go along. You also learnt how to open and save datasets,
log files and graphs, and to distinguish between different types of data. You learnt to use variable
labels to give a fuller description of each variable, and value labels to define categories (when coded
numerically, so that we know what the numbers represent). You learnt basic graph and table commands, how to
create new variables and amend their values, and how to use if statements.
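
The point about missing values is easiest to see on a tiny made-up dataset. The sketch below is illustrative only: the variable age and its four values are invented, not taken from the workshop data.

```stata
* Made-up data: four people, one with age missing
clear
input age
25
40
.
70
end

count if age > 50                  // counts 2: the 70 AND the missing value
count if age > 50 & !missing(age)  // counts 1: the missing value is excluded
```

Adding !missing(varname) to an if condition is a common way to stop missing values being picked up by "greater than" comparisons.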

Learning objectives of this Session, SIDM2

This session describes how to read data into Stata in the first place. There is a recap on the different types of data,
and on why some variables often need to be changed between formats before analysis can begin. It
gives example code, and exercises (with solutions available), so that you can see how these commands are used in
practice.

Learning objectives of further workshops, SIDM3 and SIDM4.

Merging datasets in many different ways, reshaping datasets, looping in Stata, and extracting saved results
into files. Efficient production of publication quality tables.

Further resources complementary to this series

This series teaches most of the material contained in Stata Data Management.doc, referred to here as SDM. The
accompanying Stata commands crib sheet.xls, referred to as SCCS, acts as a quick reference guide (and also summarises
some data analysis commands). The Stata manuals (accessed online and via help) and Stata help itself are both
excellent resources. The manuals teach statistics as well as Stata, and provide statistics references.

Contents
1. Reading data into Stata from other files
2. Recap on types of data
3. Converting strings to numeric and categorical data as necessary
4. Dealing with Dates
5. Checking for errors and missing data
6. When your dataset erroneously has 2 or more lines of data for a few patients
7. Recoding numeric data into groups
8. Extracting information from string variables
9. Further sources of help
SDM=Stata Data Management.doc
Hilary Watt SIDM=Stata Introduction and Data Management.doc workshops
SCCS=Stata Commands Crib Sheet.xls 2.1

1. Reading data into Stata from other files


Now suppose that the census data we looked at last week was received as an Excel file, which we need to read into
Stata before we can analyse it. See SDM 2.3 Opening data from an Excel file. See SDM chapter 2 for reading in data
from other sources.

cd "H:\_MPHTeaching\stata\dataman\wk2"
clear
import excel census, firstrow /* reads data in from the Excel file census.xls; firstrow indicates that the first row
is treated as variable names or labels */
descr // look for string variables, storage type = str##
browse // look for string variables which appear in red
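
If the census file is not to hand, the same import step can be tried on the auto dataset that ships with Stata. This is a sketch under assumptions: the file name auto_demo.xls is hypothetical, and it is written to the current working directory.

```stata
* Round trip: write a shipped dataset to Excel, then read it back in
sysuse auto, clear
export excel using "auto_demo.xls", firstrow(variables) replace  // hypothetical file name

import excel using "auto_demo.xls", firstrow clear  // clear replaces the data in memory
describe   // make comes back as a string; check the storage type of each variable
```

Naming the file with using and adding the clear option makes the command safe to re-run from a do file when data are already in memory.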

Student Exercise:

Read nlsw88dates2.xls into Stata. View the data, looking at the types of variables and seeing what it contains.
Summarise the data.

Here is some further information on what is contained in this data set, with commands for labelling appropriately:
label var grade "Current grade completed"
label var c_city "Lives in central city"
label var wage "Hourly Wage"
label var south "Lives in South"
label var union "Union Worker"
label var hours "Usual hours worked"
label var ttl_exp "Total Work Experience"
label var tenure "Job Tenure (years)"

label var quesday "Day of month that questionnaire was filled in (started to be filled in)"
label var quesmon "Month that questionnaire was filled in (started to be filled in)"
label var quesyr "Year that questionnaire was filled in (started to be filled in)"
label var quesfinish "Date that questionnaire was completed"

2. Recap on types of data


There are 4 main types of data in Stata:
i) numeric (numerical with types int, byte, float, double – black in data browser)
ii) string (e.g. str2, str24 – red in data browser)
iii) categorical (i.e. numeric with value labels - blue in data browser)
iv) dates & times (numeric with format %d or %td or similar – black in data browser).
The describe command will detail data types, format, presence of value labels and variable labels.

It is usually necessary to have data in numeric format in order to use it in Stata data analysis and most graph
commands; this includes dates & times in Stata format recognised as such (showing up in black in the data editor) and
categorical data (showing as blue and looking like text). The main exception to this is a patient ID variable (or
hospital IDs or regions or similar), where it is generally okay to use a string variable.

Dates and categorical data are often read into Stata as string variables; this is generally the best way to do it.
Hence the need to recode into different types of variables. You may also want to recode numeric data into
categories, and to recode categorical variables, perhaps by combining categories.
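
As a quick illustration of how to find out which type each variable is, the sketch below uses the auto dataset that ships with Stata (not the workshop data), so the variable names make, price and foreign are specific to that example.

```stata
sysuse auto, clear
describe make price foreign   // make is a string (str18); foreign is numeric with a value label
ds, has(type string)          // lists only the string variables
ds, has(vallabel)             // lists only the variables with a value label attached
```

ds leaves the matching variable names in r(varlist), which is handy when there are too many variables to scan by eye.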

3. Converting strings to numeric and categorical data as necessary


See SDM 5.5 and 5.6 Converting strings to numeric data and categorical data

*** converting from string variables to numeric variables (when most/ all values already look like numbers)

destring medage2, replace // converts string to numeric data, keeping the same variable name
tab medage2 medage, miss // check new variable against another variable that looks the same
scatter medage2 medage // they are identical
summ medage medage2 // identical also in number of missing values
list if medage2==. // identical also in missing values, they are for the same observation
drop medage2 // we don't need 2 identical variables

destring marriage, replace // not converted: Stata reports that marriage contains non-numeric characters


help destring
destring marriage, force replace /* converts string var marriage into numeric var, with
missing value where there is any non-numeric data */
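
Because force silently turns every non-numeric value into missing, it can be worth listing those values first. The following is a minimal sketch on invented data (the variable income and its values are made up for illustration); it also shows the ignore() option, which strips named characters before converting.

```stata
clear
input str8 income
"1200"
"3,400"
"n/a"
"560"
end

list income if missing(real(income))   // the values that are not plain numbers

destring income, gen(income_forced) force             // "3,400" and "n/a" both become missing
destring income, gen(income_clean) ignore(",") force  // comma removed first; only "n/a" becomes missing
list
```

Using gen() rather than replace keeps the original string variable, so the two versions can be compared before the string is dropped.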

*** converting from strings to categorical variables


descr
encode region, gen(region2) // create a categorical (numeric) var (region2) from the string var, region
tab region region2 // compare the newly created and original variables
tab region region2, nolabel // compare the newly created and original vars without value labels
codebook region2 // see correspondence of values and value labels
drop region // the string version is no longer needed

encode state2, gen(state_2) // create categorical var (state_2) from string var, state2
tab state2 state_2 // compare the newly created and original variables
browse state2 state_2 // easier to compare this way
codebook state_2 // gives examples of coding
label list state_2 // gives the full correspondence between numbers and value labels
drop state2 // string version is no longer needed
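
By default encode numbers the categories 1, 2, 3, ... in alphabetical order of the string. If a particular coding is wanted, one option is to define the value label first and pass it to encode. The sketch below uses invented data; the variable severity and the label name sev_lbl are hypothetical.

```stata
clear
input str6 severity
"mild"
"severe"
"none"
"mild"
end

label define sev_lbl 0 "none" 1 "mild" 2 "severe"   // the coding we want
encode severity, gen(severity_num) label(sev_lbl)   // encode uses the existing label
tab severity severity_num, nolabel                  // check: none=0, mild=1, severe=2
```

Any string value not already in the label is added to it by encode, so it is still worth checking the result with label list.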

Student Exercise:

a) Look for string variables in nlsw88dates2.xls that look numeric, or as if they should be numeric. Create a numeric
version of (one or more of) these variables.
b) Look for categorical variables stored as strings and convert (some of) them to numeric variables with value labels.
For instance, do this for industry and race. Does the encode command work well for both/all? If not, then what
approach shall we take?
c) Try help string functions and decide whether it is a good strategy to use one or a few string functions to tidy
up the variables before using the encode command. For instance, you could take just the first character and
change it to lower case.


4. Dealing with Dates


See SDM chapter 7 on dates.

*** converting to Stata date variables from string variables

gen dateofsurvey3=date( dateofsurvey, "DMY") /* converts from string to Stata date variable, string ordered day
month year DMY */

browse dateofsurvey dateofsurvey3 // dates are coded as number of days from a fixed date

format dateofsurvey3 %d /* display the date variable in a format that we can understand as a date (not a number)
*/

browse dateofsurvey dateofsurvey3 // check the date has been correctly converted by Stata
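
Once dates are stored as Stata dates they are just numbers of days, so they can be subtracted and compared. The sketch below uses two invented rows; all variable names (visit_str, day, month, year and the generated variables) are made up for illustration and are not in the workshop datasets.

```stata
clear
input str10 visit_str day month year
"03/02/2011"  1 5 2011
"30/04/2011" 15 4 2011
end

gen visit = date(visit_str, "DMY")    // string in day/month/year order
gen ques  = mdy(month, day, year)     // note the argument order: month, day, year
format visit ques %td                 // display as dates rather than day counts

gen gap_years = (ques - visit)/365.25 // difference in days, converted to years
count if visit < td(30apr2011)        // td() writes a date literal; counts 1 here
list
```

Dividing by 365.25 gives an approximate interval in years, which is usually adequate for descriptive work.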

Student Exercise:

a) Look at the nlsw88dates2 data set. Which variables look like dates, but are currently string variables? Change
these to Stata date format.
b) Now create a Stata date variable from the 3 variables which give questionnaire day, month and year.
Remember the function mdy for month, day and year. Look it up in the date and time functions help if necessary.
c) Find the time interval in years between the questionnaire date and the date that the questionnaire was finally
filled in (quesfinish, for the few people where this is not missing).
d) Count how many questionnaire dates are before 30 April 2011. Count how many dates are after this date.
e) There are more date commands described in SDM chapter 7 Dates and time. Within Stata, type help
functions and click on date and time functions. You will see more options there.

5. Checking for errors and missing data

See SDM chapter 6 on looking for errors.

*** recode missing values

replace pop=. if pop<0 // impossible values recoded to missing

replace divorce=. if divorce==999999 // impossible/ implausible value recoded to missing

* remember last time we also checked that total populations added up; we can also check that values are
* smaller than the total population, and similar

* there are no obvious cross tabulations to check here, e.g. do we have pregnant men? Do we have non-smokers
* with 10 cigs/day?

Summarise all variables, and look at maximum and minimum values. Do they all appear to be valid values? Look at
histograms of continuous data to see if there are outlying values, and to see what the distributions look like. Do not
recode outlying values to missing (unless you are pretty confident that they are errors and report that you have done
this).
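
Consistency checks of the "pregnant men" kind can be written as counts or assertions. The following is a minimal sketch on invented data: the variables id, sex, pregnant, smoker and cigs are hypothetical and do not exist in the census file.

```stata
clear
input id str1 sex pregnant smoker cigs
1 "F" 1 0 0
2 "M" 0 1 10
3 "M" 1 0 0
4 "F" 0 0 10
end

tab sex pregnant, miss                            // cross tabulate the two variables
list if sex == "M" & pregnant == 1                // show the inconsistent rows
count if smoker == 0 & cigs > 0 & !missing(cigs)  // non-smokers reporting cigarettes

capture assert !(sex == "M" & pregnant == 1)      // assert fails if any row breaks the rule
display _rc                                       // nonzero return code means the check failed
```

Without capture, a failed assert stops the do file, which is useful when later commands should not run on inconsistent data.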


*** search for and drop duplicates

isid state // checks if state is unique on each row


help duplicates // see options for dealing with duplicates
duplicates report state // report any 2 or more rows with the same value for state
duplicates list state // list the duplicates
duplicates tag state, gen(dup) // tag the duplicates: new variable dup = number of other rows with the same state (0 if unique)
browse if dup>0 // browse the duplicates
help duplicates // look for any more useful options
*duplicates drop state
duplicates drop state, force // this drops one duplicate, despite the duplicates not being equal on other variables
browse if state=="Texas" // see the result
isid state // check if state is now unique, i.e. not the same on any 2 rows

Note that for duplicates, the egen or the collapse commands, and explicit subscripting, can be useful when we want
data contained in both duplicates. See SIDM 3 section 3.6, i.e. next week’s class.
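
As a taste of that approach, the sketch below combines information from duplicate rows before reducing to one row per state. The data are invented (three rows, two of them for Texas), and the new variable names nrows and pop_mean are made up for illustration.

```stata
clear
input str5 state pop
"Texas" 100
"Texas" 120
"Ohio"   50
end

bysort state: gen nrows = _N             // how many rows each state has
bysort state: egen pop_mean = mean(pop)  // combine the duplicate rows' values
bysort state: keep if _n == 1            // keep the first row for each state
isid state                               // now passes: state is unique
list state nrows pop_mean
```

collapse (mean) pop, by(state) would give a similar one-row-per-state result in a single step, at the cost of dropping variables that are not named.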

Student Exercise:

a) Check for further errors in nlsw88dates2. Do any numeric values look impossible, and need recoding to
missing?
b) Can you think of any variables that need to be consistent with each other, where you can check for this?
c) Read SDM chapter 6 and see if you can think of any situations in which these errors might apply.
d) Note that whenever you create new variables, as you did above, you should look out for errors as you go
along, in order to create appropriate values.

6. When your dataset erroneously has 2 or more lines of data for a few patients

help duplicates gives commands that help you to tidy up your data in situations like this. See SIDM3 for more details
and for other ways of dealing with these duplicates.

7. Recoding numeric data into groups


See SDM 5.8 and 5.9 recoding numeric and categorical variables and creating categorical variables from numeric
data.

***** recode numeric data into categories

xtile deathq5=death, n(5) // divide number of deaths into quintiles


tab deathq5
summ death deathq5 // checks both same amount of missing data

xtile marriage_bin=marriage, n(2) // divide number of marriages into 2 groups by the median
tab marriage_bin
summ marriage marriage_bin // check both have same amount of missing data

tab marriage_bin, sum(marriage) // check the result


tabstat marriage, by( marriage_bin) stat(n min max) // look at min and max in each group, comprehensive check
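
The same checks can be tried on the auto dataset that ships with Stata. This sketch is not part of the census example; price_q5 is a made-up variable name.

```stata
sysuse auto, clear
xtile price_q5 = price, nq(5)                 // quintiles of price, coded 1 to 5
tab price_q5                                  // groups should be of roughly equal size
tabstat price, by(price_q5) stat(n min max)   // ranges should not overlap between groups
count if missing(price) != missing(price_q5)  // 0 if missingness matches
```

Looking at the minimum and maximum within each group is a direct way to confirm where the cut-offs fell.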

hist marriage // see what might be sensible groupings


egen marriage50k=cut(marriage), at(50000 100000 150000) // divide marriages by pre-chosen cut-offs
tab marriage50k // see the result - there are many missing values
drop marriage50k // drop variable and try again
egen marriage50k=cut(marriage), at(0 50000 100000 150000 200000) /* specify values below and above range to
avoid missings */
count
summ marriage marriage50k // now we don't have too much missing data
tab marriage50k // see distribution across the groups
label define marrlbl 0 "0 to 49,999" 50000 "50,000 to 99,999" 100000 "100,000 to 149,999" 150000 "150,000 to 199,999" // defining labels
label values marriage50k marrlbl // attaching fully informative labels
tab marriage50k // tabulate now shows informative labels
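
The reason for the missing values in the first attempt is that egen's cut() function returns missing for any value below the first cut-off or at or above the last one. A minimal sketch on four invented values (x, g1, g2 and g3 are made-up names) shows this:

```stata
clear
input x
10
60
120
250
end

egen g1 = cut(x), at(50 100 150)               // 10 and 250 fall outside the range: missing
egen g2 = cut(x), at(0 50 100 150 300)         // every value now falls in a group
egen g3 = cut(x), at(0 50 100 150 300) icodes  // same groups, coded 0, 1, 2, 3
list
```

Each group takes the value of its lower cut-off unless icodes is specified, which is why the value labels above are attached to 0, 50000, 100000 and 150000.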

label list region2 // want to recode to get a binary variable for North/ South
recode region2 (1 2=1) (3 4=2), gen(region_bin) /* recode is very flexible for recoding individual values and ranges
of values */
label define region_lbl 1 "North" 2 "South/West" // adding informative labels - firstly we define the value label
label values region_bin region_lbl // now we attach the newly created value label to the values
tab region2 region_bin // now we tabulate against the original value that we recoded
tab state region_bin // tabulating against state - but not ideal since South/West contains some North West states

label list
tab state if region2==4 // list states in West region
tab state_2 if region2==4, nolabel // list numeric values of state_2 for states in West region
help recode // look for further details of command
recode state_2 (1 37 47 13 26 50 =1) (nonmissing=0), gen(northwest) /* create a variable specific for North West
states (=1 for them, =0 otherwise) */
tab state northwest // check what we have done
codebook region_bin // check coding of region_bin variable
gen south=region_bin-1 // want variable south=1 for southern states, =0 for northern states
replace south=0 if northwest==1 // use newly constructed northwest variable to recode as appropriate
tab state south // check the result
label define yesno 0 "No" 1 "Yes" // this is standard coding, commonly used for binary variables
label values south yesno // attach the newly defined value label to the variable south
label values northwest yesno // attach the same value label to the variable northwest
descr // shows names of value labels allocated to each variable (blank for variables with no value labels)
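
recode can also attach value labels in the same step as the recoding, which saves separate label define and label values commands. The sketch below uses the auto dataset that ships with Stata; the grouping of rep78 into "Poor" and "Good" and the variable name rep_good are purely illustrative.

```stata
sysuse auto, clear
recode rep78 (1 2 = 0 "Poor") (3 4 5 = 1 "Good"), gen(rep_good)  // labels given inside the rules
tab rep78 rep_good, miss     // check against the original; missing stays missing
label list rep_good          // the value label is named after the new variable
```

Tabulating the new variable against the original, with the miss option, remains the simplest check that the rules did what was intended.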


Student Exercises:

a) Create a new categorical variable for usual hours worked into 3 groups of roughly equal sizes, then label the
variable and its values (command xtile).
b) Create a new binary variable for age, using the median as a cut-off
c) Create a categorical variable for age, divided into 5 year age groups.
d) Create a categorical variable containing age in quintiles.
e) Create an ethnic group variable which is 1 for whites and 0 for other ethnic groups.
f) Create a new variable from industry, recoding into fewer categories, according to what you think is sensible.

8. Extracting information from string variables


There are many string functions that enable you to extract substrings, from within strings.

help string functions – read through the list if you need to do this, e.g. to extract the first letter of the State.

They also allow you to tidy up strings (e.g. remove leading and trailing blanks, convert all to small letters).
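
A short sketch of this kind of tidying, on invented data, is given below. The variable race and its inconsistent spellings are made up for illustration; the functions trim(), lower() and substr() are standard Stata string functions (newer versions of Stata also offer strtrim() as the name for trim()).

```stata
clear
input str12 race
" White"
"white "
"BLACK"
"black"
end

gen race_clean = lower(trim(race))    // remove leading/trailing blanks, convert to lower case
gen first = substr(race_clean, 1, 1)  // first character only
encode race_clean, gen(race_num)      // now 2 categories rather than 4
tab race race_num                     // check how the original spellings were grouped
```

Tidying the string before encode avoids ending up with separate categories for what is really the same value.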

9. Further sources of help


Look at the Stata commands crib sheet.xls. Many students find it useful to have a crib sheet on hand whilst analysing
data. What are pros and cons of these resources?
- Stata help
- Stata manuals
- Stata youtube videos
- Stata Commands Crib sheet.xls
- Stata Data Management.doc
- Menus in Stata to learn new commands/ find out what is available
