Data Analyses Stata Manual NYTS
! Not (for example, != means not equal to; !missing(var) means not missing).
= Assignment (e.g., gen x = 1 sets x equal to 1).
== Tests equality; used within a subset or conditional statement (i.e., after an if qualifier).
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
. Missing. NOTE: the syntax if var == . can also be written as if missing(var)
& And
| Or. Rather than specifying several categories of the same variable with multiple “|” operators, consider
inrange() or inlist(). E.g., replace vartotal = 1 if var == 1 | var == 2 | var == 3
can be rewritten as: replace vartotal = 1 if inlist(var,1,2,3) or alternatively as:
replace vartotal = 1 if inrange(var,1,3)
/ To (i.e., range). Mostly used with the recode command, e.g., recode var (1/10 = 1)(11/max =
2), gen(var2). Within recode rules, the highest and lowest values of the variable are represented by the
keywords max and min respectively.
* Stand-alone comment without a command on the same line, e.g., *This is a comment. When a
comment appears after a command on the same line, it is written as /*comment*/, e.g., tab var1 var2 /*What
an awesome cross-tabulation! */
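The operators above can be tried on a small example. The sketch below is only an illustration: it assumes Stata's bundled auto dataset (loaded with sysuse), and the variable names rep_low, rep_low2, and rep_grp are hypothetical.

```stata
* Illustration only: uses Stata's bundled auto dataset, not the NYTS data
sysuse auto, clear

* Flag repair records 1-3 using inlist(); missing values of rep78 stay missing
gen byte rep_low = 0 if !missing(rep78)
replace rep_low = 1 if inlist(rep78, 1, 2, 3)

* Same flag using inrange(); rep78 is an integer, so the two definitions agree
gen byte rep_low2 = 0 if !missing(rep78)
replace rep_low2 = 1 if inrange(rep78, 1, 3)
assert rep_low == rep_low2

* Range operator (/) and the max keyword inside recode rules
recode rep78 (1/3 = 1) (4/max = 2), gen(rep_grp)
tab rep78 rep_grp, missing   /* cross-check the recode against the original */
```

Creating a new variable with gen() rather than overwriting rep78 keeps the original values available for checking.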
You can import an Excel file into Stata using the dropdown menu as follows: File > Import > Excel
Spreadsheet (*.xls; *.xlsx). Alternatively, you can use the import excel command.
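A round trip with Stata's bundled auto dataset gives a self-contained way to try the import commands. This is a sketch, not part of the NYTS workflow: the file names auto_demo.xlsx and auto_demo.csv are hypothetical and are written to the current working directory. The same pattern applies to import delimited for .csv files.

```stata
* Illustration only: write small demo files, then read them back in
sysuse auto, clear

* Excel: firstrow(variables) writes variable names into the first row on export
export excel using "auto_demo.xlsx", firstrow(variables) replace
* firstrow tells import excel to treat the first row as variable names
import excel using "auto_demo.xlsx", firstrow clear
describe

* CSV: export, then re-import with import delimited
sysuse auto, clear
export delimited using "auto_demo.csv", replace
import delimited using "auto_demo.csv", clear
describe
```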
1. FROM SAS TRANSPORT FILE (.XPT): use the fdause command (renamed import sasxport in later Stata releases, and
import sasxport5 from Stata 16). In the example below, /Users/Zatum/Downloads/ is the location of the file on your
computer (i.e., the file path) and DEMO_H.XPT is the name of the dataset. To determine the file path, go to where the
file is located on your computer, right-click on it, and select Properties. Copy the text beside “Location”.
fdause "/Users/Zatum/Downloads/DEMO_H.XPT"
2. FROM CSV: You can import a .csv file into Stata using the dropdown menu as follows: File > Import > Text
data (delimited, *.csv, …). Alternatively, you can use the import delimited command.
3. FROM ACCESS (.MDB): Open the Access file and click the node at the top left corner to select the entire data table,
then copy it (Ctrl C). Open Stata and click on “Data Editor (Edit)” to open a blank data editor that will hold the data.
Click the first cell at the top left corner of the editor and paste (Ctrl V). Select the option to treat the first row as
variable names, not data.
4. FROM SAS (SAS7BDAT): Older versions of Stata have no straightforward way to read a .sas7bdat file (Stata 16 and later
provide the import sas command). Otherwise, the available options all involve another program, such as SAS or
Stat/Transfer, that can convert the file into a format that Stata can read (e.g., .xpt, .csv, .dta). The challenge,
however, is that many of these programs are not open source and may not be readily available. R is an open-source
program that we can use to convert a .sas7bdat file into a .dta file using the steps outlined below.
(a) Download and install R using the links below: Windows: https://cran.r-project.org/bin/windows/base/;
Mac: https://cran.r-project.org/bin/macosx/
(b) Install the sas7bdat and foreign packages in R. sas7bdat will be used to import the SAS file into R, while
foreign will be used to export the file from R as a Stata dataset (note that there are quotes around
the package names for installation).
install.packages(c("sas7bdat", "foreign"))
Note: The properties box is a quick way to see the number of variables and observations in your dataset.
(c) Load the packages after they have finished installing in R (note there are no quotes to call the libraries)
library(sas7bdat)
library(foreign)
(d) Read the dataset into R; the object name “y” has been assigned arbitrarily. The file.choose() function
lets you manually select the dataset from wherever it is located on your computer.
y = read.sas7bdat(file.choose())
(e) Export the dataset from R as a Stata dataset and save it somewhere on your computer. We are assigning the
arbitrary name “statafile” (you can change it to something else). Replace the file path in the example below
with the actual file path on your computer where you wish to save the file. If you want to save to your
desktop, simply right-click any file on your desktop and select Properties. Next, copy the text beside
“Location” and use it in place of the example path. Ensure all backslashes are doubled as shown below.
write.dta(y, "C:\\Users\\Zatum\\OneDrive\\Desktop\\statafile.dta")
(f) Now read the .dta file into Stata directly using File > Open within Stata.
To see a list of all the variables in the dataset, type the command describe or simply desc
1. Variables in Stata can be factor (numeric with value labels), numeric, or string (character). Stata uses a
color-coded system for variables, with three possible colors. Enter browse in the command window to see them.
2. Two useful commands that work in tandem for creating temporary working files are preserve and restore.
You can preserve a dataset, play as much with the dataset as you desire, and then restore back to the original state,
at which point all changes made are promptly discarded. You cannot restore when you have not preserved!
3. It is good practice to inspect your data to see the format the variables are stored in (e.g., numeric or string); to
see the number and pattern of missing values; and to ensure that the codebook faithfully reflects what is in the dataset.
Quickly examine the entire dataset with the browse command to see the extent of factor variables with value
labels (blue), numeric (black), or string (red) variables.
browse
You can also browse just one or more variables rather than the entire dataset.
browse sex
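The preserve and restore pair described in note 2 can be sketched as follows. This is an illustration on Stata's bundled auto dataset (an assumption, not the NYTS data); the two count commands should report the same number of observations.

```stata
* Illustration only: temporary changes between preserve and restore are discarded
sysuse auto, clear
count                     /* observations before any changes */

preserve                  /* take a snapshot of the data in memory */
keep if foreign == 1      /* temporary subset for a quick look */
summarize price
restore                   /* return to the snapshot; the subset is discarded */

count                     /* same number of observations as before */
```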
2. You cannot recode string variables. You can, however, use the replace command, the real() function, or the destring
command, depending on which is appropriate. Never modify, change, or delete an original variable, as you might need
it again later.
3. With few exceptions, everyday Stata commands contain at most one comma, which separates the command from its options.
5. You can run code either in batch mode (e.g., from a do-file) or as a single line of code from the command window. If
you wish to save the entire printout of results from the Results window, you can use the command translate @Results
filename.txt. The generated .txt file will be saved in the working directory.
6. Loop statements can avoid repetitious coding. Loop statements can be used when you want to execute the
exact same command either on several variables or on several categories/levels of the same variable. For example, if we
wanted a tabulation of several variables, we could execute a single foreach loop rather than tabulating them one at a
time.
Note that the opening symbol ` is a backtick, found on the key below the Esc key. The closing symbol ' is an
apostrophe, found on the key beside the Enter key.
The number of opening and closing curly brackets must correspond to the number of foreach statements. A loop with
only one foreach statement therefore has exactly one opening and one closing curly bracket.
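A minimal sketch of such a loop is shown below. It assumes Stata's bundled auto dataset rather than the NYTS data, and the variables tabulated are chosen only for illustration.

```stata
* Illustration only: tabulate several variables with one foreach statement
sysuse auto, clear
foreach var of varlist foreign rep78 headroom {
    tab `var'             /* `var' is replaced by each variable name in turn */
}
```

There is one foreach statement here, so there is exactly one opening and one closing curly bracket.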
7. When recoding or collapsing outcome variables (e.g., creating a binary variable), responses of “don’t know” or “not
sure” should be excluded because of the potential for misclassification; otherwise, justification should be provided for
collapsing them with another category. For outcomes measured on a Likert scale, computing the mean of the raw
responses is not recommended given truncation as well as the fact that the ensuing results have no meaningful
interpretation. The variable(s) could instead be dichotomized based on a priori determined study objectives, e.g.,
“strongly agree”/“agree” vs other responses.
8. Skip patterns: it is very important to recognize that some surveys use skip patterns, meaning that only individuals
eligible for a given question answered it, based on their responses to one or more preceding filter questions. To ensure
the correct denominator is being assessed, both the filter question(s) and the final question may need to be
incorporated as appropriate. For example, in several adult surveys, smoking status is determined by first asking
respondents if they have smoked at least 100 cigarettes in their lifetime. Those who answer yes are then asked if they
smoke now. Those who answer no to the first question are not asked the second question at all (i.e., they are skipped to
the next question). In creating a variable describing current smoking among all participants with these questions, a value
of 1 will be assigned to those who answered “yes” to both questions (i.e., have smoked 100+ cigarettes AND smoke
now). A value of 0 will be assigned to those who answered yes to the first question but no to the second question (i.e.,
have smoked 100+ cigarettes but do not smoke now). A value of 0 will also be assigned to those who answered no to
the first question and were skipped from answering the second question.
9. When creating a composite variable from two or more variables (e.g., the measure of “any tobacco use” in the 2017
NYTS denoting use of at least one of seven different tobacco product types), missing values can be handled in one of
two ways. The first (and more stringent) approach would be to analyze only individuals with complete information on
all variables of interest (listwise deletion). Under this approach, even individuals with information on all but one
variable would be excluded. The second (and less stringent) approach would be to exclude individuals only if they
were missing information on all variables of interest. Thus, an individual with information present for only one
variable would still be included in the analyses. The first approach may lead to loss of sample size and precision; also,
the sheer number of those excluded potentially increases the magnitude of selection bias if missingness was not at
random. The second approach, however, increases the likelihood of misclassification bias. For example, if an
individual only had information for one tobacco product (say cigars), for which he reported being a non-user, he
would be classified as a non-tobacco user, even if he used other forms of tobacco (for which data are missing).
Absence of evidence is not evidence of absence! Information should be provided on how missing data were dealt
with. Note that this MMWR used the second approach.
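The skip-pattern coding in note 8 and the two missing-data approaches in note 9 can be sketched on a small made-up dataset. Everything below is hypothetical and for illustration only: the five records and the variable names (ever100, smokenow, cig, cigar, ecig, cursmk, anytob, nmiss, anytob_cc) are not taken from the NYTS.

```stata
* Illustration only: five made-up respondents (1 = yes, 0 = no, . = missing)
clear
input ever100 smokenow cig cigar ecig
1 1 1 0 0
1 0 0 . 0
0 . 0 . .
. . . . .
0 . . 0 1
end

* Note 8: current smoking among all participants, using the filter question
gen byte cursmk = .
replace cursmk = 1 if ever100 == 1 & smokenow == 1
replace cursmk = 0 if ever100 == 1 & smokenow == 0
replace cursmk = 0 if ever100 == 0     /* skipped the second question */

* Note 9, second approach: missing only if ALL components are missing
egen anytob = rowtotal(cig cigar ecig), missing
recode anytob (1/max = 1)

* Note 9, first approach: complete cases only (listwise deletion)
egen nmiss = rowmiss(cig cigar ecig)
gen byte anytob_cc = anytob if nmiss == 0

list, clean
```

In this toy data, the third record is coded as a non-user under the second approach even though two of its three products are missing, which is the misclassification risk described in note 9; under the first approach it is set to missing.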
This article examined current use prevalence for 10 tobacco use measures: cigarettes, cigars, smokeless tobacco,
electronic cigarettes, hookah, pipe tobacco, bidis, any tobacco product, 2+ tobacco products, and any
combustible tobacco use. Results were stratified by school level (middle and high school), sex (male and
female), and race (white, black, Hispanic, other race).
Note: Use tab1 when you want to get the marginal distributions of multiple variables at the same time. Use
tab (without the “1”) when you want to tabulate or cross-tabulate variables.
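The difference between the two commands can be seen in a short sketch, again assuming Stata's bundled auto dataset rather than the NYTS data:

```stata
* Illustration only
sysuse auto, clear
tab1 foreign rep78        /* two separate one-way tables, one per variable */
tab foreign rep78         /* one two-way cross-tabulation of the two variables */
```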
. tab stratum
. tab psu
[Output not reproduced: one-way frequency tables (Freq., Percent, Cum.) for the recoded variables ccigt, ccigar, cslt,
celcigt, chookah, cpipe, csnus, cdissolv, and cbidis.]
/*create composite variable for any smokeless tobacco (i.e., dissolvable/snus/ chewing
tobacco, snuff, or dip). For command below, the ‘missing’ option allows a tally to be
generated for an individual as long as they have information present for at least one
variable; individuals are only assigned a missing value if they are missing information
on all variables assessed*/
egen csmokeless = rowtotal(csnus_2 cdissolv_2 cslt_2), missing
recode csmokeless (100/max = 100)
tab csmokeless
. tab csmokeless
. tab ctobany
. tab ctobcomb
/* create composite variable for use of 2+ tobacco products. First, we will create a
tally for total number of products used by each individual, then we will dichotomize it
into 0-1 vs 2+ */
egen ctob2 = rowtotal(ccigt_2 ccigar_2 celcigt_2 chookah_2 cpipe_2 csmokeless ///
    cbidis_2), missing
recode ctob2 (0 100 = 0)(200/700 = 100) /*numbers are in the hundreds because we
recoded as 0, 100*/
tab ctob2
. tab ctob2
. tab sex
[Output not reproduced: one-way frequency table (Freq., Percent, Cum.) for the recoded race variable race_m.]
STEP 4: Set data to survey mode using the weight, PSU, and stratum variables. Within the
1. Some surveys (e.g., certain telephone-based surveys that use random-digit dialing) may not involve a multi-stage
selection process and hence PSUs may not be available. In such cases, setting the data to survey mode will require
using only the weight and stratum variables:
2. Similarly, some surveys may involve a multi-stage selection process but may not involve stratification. In such cases,
the PSU variable will be present, but the stratum variable will be missing. In such cases, setting the data to survey
mode will require using only the weight variable and the PSU variable.
3. The weight variable is used to accurately estimate means and percentages. The PSU and strata variables are used to
accurately estimate measures of variance (e.g., 95% confidence intervals).
4. Since each survey year is individually weighted to represent the population for that year, when appending data from
multiple years to increase sample size for a point estimate, the weights have to be adjusted by dividing by the number
of years pooled. The newly adjusted weight variable would then be used to set the data to survey mode.
5. For the weight, stratum, or PSU variables, it is either all or none (all individuals have the variable, or no individual
does). There cannot be missing values for any of these variables. When appending data from multiple years, inspect
carefully to ensure that information is complete for all observations in the dataset. Otherwise, results may be
erroneous and invalid.
6. Certain analytic techniques (e.g., inverse probability weighting) create weights that are different from the survey
weights that came with the dataset. Since only one set of weights can be used in analyses, compute new weights by
multiplying the survey weights within the dataset by the weights generated from the statistical procedure. These
newly created weights can then be used to set the data to survey mode.
7. Occasionally (e.g., with restrictive inclusion criteria for analyses), Stata may produce output that contains
percentages but omits the confidence intervals. This occurs because some strata have only one PSU, in which case it is
impossible to estimate a measure of variance (e.g., standard error or confidence interval). The default option in Stata
for single units (i.e., strata with only one PSU) is to set the standard error (or ensuing confidence intervals) for such
strata to missing. To solve this problem, you can change the default settings in Stata such that the standard errors for
the single units are set to the grand mean across all strata instead of the stratum-specific means. Specify this when you
are setting the data to survey mode as follows:
8. Running a new svyset command overrides any previous svyset command(s). To correct an error made during
setting the data to survey mode, simply re-run the svyset command.
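The svyset variants described in the notes above can be sketched as follows. The example builds a small made-up dataset so that it runs on its own; the values are hypothetical, and wt and wt_pooled are hypothetical names for the weight variables (stratum and psu follow the names used earlier in this guide). Dividing by 3 stands in for pooling three survey years and is for illustration only.

```stata
* Illustration only: toy data with 2 strata, 2 PSUs per stratum
clear
input stratum psu wt smoker
1 1 10 0
1 1 12 100
1 2 11 0
1 2 9 0
2 3 8 100
2 3 10 0
2 4 12 100
2 4 11 0
end

* Standard design: weight, PSU, and stratum
svyset psu [pweight = wt], strata(stratum)
svy: mean smoker

* Note 1: no PSU variable; _n treats each observation as its own sampling unit
svyset _n [pweight = wt], strata(stratum)

* Note 2: PSU variable but no stratification
svyset psu [pweight = wt]

* Note 4: pooled years; divide the weight by the number of years pooled
gen wt_pooled = wt / 3
svyset psu [pweight = wt_pooled], strata(stratum)

* Note 7: strata with a single PSU are centered at the grand mean
svyset psu [pweight = wt], strata(stratum) singleunit(centered)

* Note 8: each svyset replaces the previous one; check the current settings
svyset
svydescribe
```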
STEP 5: Compute overall prevalence estimates for all the outcomes for middle and high school
students separately. For reference, the results from the MMWR are shown below. The estimates of interest are
shown in red.
/*To generate estimates for both middle and high school students (hsms) in the
same command, we can use the code below. Note that the double quotes
around "`l'" are needed only because hsms is a string variable. If it were not a
string variable (i.e., a factor variable), we would simply write it as `l' */
levelsof hsms, local(levels)
foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): mean `var'
    }
}
/*To simplify the command above, the following analyses will compute results
separately for high and middle school students*/
*high school students*
foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
    cbidis_2 ctobany ctob2 ctobcomb {
svy, subpop(if hsms == "HS"): mean `var'
}
. foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 cbidis_2 ct
> obany ctob2 ctobcomb{
2. svy, subpop(if hsms == "HS"): mean `var'
3. }
(running mean on estimation sample)
[Output not reproduced: for each of the 10 outcomes, svy: mean reports the Mean, Linearized Std. Err., and
95% Conf. Interval.]
1. All survey commands must be preceded by the svy prefix to account for the complex survey design and yield valid results.
Both the tabulate (or tab) and mean commands can generate point prevalence estimates with 95% confidence intervals. To use
mean, however, you must recode the outcome as 0-1 or 0-100 (where 1 or 100 represents a case, e.g., a smoker, and 0
represents a non-case, e.g., a non-smoker). The mean of the 0s and 1s is the same as the proportion of the population that
smokes (with 0-100 coding, the mean is the percentage).
2. The mean command is preferable for the simplicity of its output: only a single point estimate is generated, representing
the % that smoke, rather than two complementary percentages for smokers and nonsmokers. The mean command
also generates the standard error, which can be used to compute the relative standard error (RSE = standard error/mean).
3. Percentages and counts generated are only estimates of the true population parameter; the number of decimal places should
reflect this and not display an unreasonable degree of precision. Round percentages to 1 decimal place and
population counts to the nearest 100,000.
4. 95% confidence intervals do not have to be provided when a complete census of the study population is taken. Similarly,
parametrically computed confidence intervals are not scientifically justifiable for non-probability samples because there are
no associated sampling errors (no randomness in selection); there is thus no mathematical basis for computing standard
errors for such samples. If 95% confidence intervals are desired, they could be computed using non-parametric approaches
such as bootstrapping; the quantiles at 0.025 and 0.975 yield the bootstrapped 95% confidence intervals.
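The relative standard error in note 2 can be computed directly from the stored results of svy: mean. The sketch below runs on a small made-up dataset; the values and the variable names wt and smoker are hypothetical.

```stata
* Illustration only: toy data with a 0-100 outcome
clear
input stratum psu wt smoker
1 1 10 0
1 1 12 100
1 2 11 0
1 2 9 0
2 3 8 100
2 3 10 0
2 4 12 100
2 4 11 0
end
svyset psu [pweight = wt], strata(stratum)

* With 0-100 coding, the mean is the weighted percentage of smokers
svy: mean smoker

* RSE (%) = 100 * standard error / mean, using the stored estimates
display "RSE (%) = " %4.1f 100 * _se[smoker] / _b[smoker]
```

The displayed RSE can then be compared against whichever suppression threshold is being applied.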
STEP 6: Compute weighted population counts for all outcomes for middle and high school
students separately. For reference, the results from the MMWR are shown below. The estimates of interest are
shown in orange.
/* We can use a single loop statement to generate results for all the outcomes at
once for middle and high school students simultaneously. The output from the code
below produces results for both high school and middle school students. HOWEVER,
THE RESULTS PRESENTED BELOW ARE ONLY FOR HIGH SCHOOL STUDENTS. The numbers beside
0 represent the weighted counts of non-users of the specified tobacco product.
The numbers beside 100 are the weighted counts of users of the specified tobacco
product. For example, for current electronic cigarette use (celcigt), 1723292
high school students (~1.7 million) reported current use. Note that the
double quotes around "`l'" are needed only because hsms is a string variable. If it
were not a string variable (i.e., a factor variable), we would simply write it as
`l' */
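A sketch of a loop matching the comment above is given here. It is an assumption rather than the original code: it supposes the counts come from svy: tabulate with the count option, and the format() setting is added only so that large totals print in full. It relies on the hsms variable and the recoded outcome variables created in the earlier steps.

```stata
* Sketch under stated assumptions: weighted counts by school level
levelsof hsms, local(levels)
foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        /* count requests weighted counts instead of proportions */
        svy, subpop(if hsms == "`l'"): tab `var', count format(%12.0f)
    }
}
```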
Weighted counts by outcome (0 = non-users, 100 = users):

Outcome       0 (non-users)   100 (users)   Total
celcigt       13039887        1723292       14763179
ccigt         13549932        1123588       14673519
ccigar        13525742        1135367       14661109
csmokeless    14084680        812689        14897370
chookah       14162043        503694        14665737
cpipe         14501264        127164        14628428
cbidis        14522248        106179        14628428
ctobany       12027983        2933779       14961761
ctob2         13570037        1391724       14961761
ctobcomb      13009354        1941645       14950999
1. Deleting cases from a survey dataset can be problematic since it can lead to incorrect estimation of the standard
errors. For example, if you wanted to analyze the smoking prevalence among only high school students, and you
dropped all observations for middle school students, this would be inappropriate because the standard errors of the
estimates would be incorrectly estimated. In calculating subpopulation estimates, only the cases defined by the
subpopulation are used in the calculation of the estimate; however, all cases in the dataset should be used in the
calculation of the standard errors. For this reason, you should not use the commands drop, keep, or by when
subsetting subgroups with complex survey data. The appropriate Stata options for subgroup analyses are subpop and
over.
2. Suppression rules are used when dealing with subpopulation estimates to ensure that only precise estimates are
presented. Common suppression rules are: relative standard errors >30% or cell sample sizes < 30 persons.
For simplicity, we will generate stratified prevalence estimates for sex and race/ethnicity separately. For each
variable, results are analyzed for all products and school levels within the same code. The desired estimates
are shown below:
variable. If it were not a string variable (i.e., a factor variable), we would simply
write it as `l' */
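A sketch of the stratified analysis by sex is shown below. It is an assumption based on the layout of the output that follows (an Over column with female and male rows), which is consistent with the over() option of mean; the original loop is not reproduced in this copy. It relies on the hsms, sex, and recoded outcome variables created in the earlier steps.

```stata
* Sketch under stated assumptions: prevalence by sex within each school level
levelsof hsms, local(levels)
foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        /* over(sex) produces one estimate per category of sex */
        svy, subpop(if hsms == "`l'"): mean `var', over(sex)
    }
}
```

Replacing over(sex) with over(race4) would give the corresponding estimates by race, using the race4 variable that appears in the chi-squared code later in this guide.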
Linearized
Over Mean Std. Err. [95% Conf. Interval]
celcigt_2
female 9.933178 1.033186 7.870355 11.996
male 13.30177 1.245029 10.81599 15.78755
Linearized
Over Mean Std. Err. [95% Conf. Interval]
ccigt_2
female 7.570606 .7749495 6.02337 9.117843
male 7.574926 .6676701 6.241879 8.907972
Linearized
Over Mean Std. Err. [95% Conf. Interval]
ccigar_2
female 6.290991 .7172439 4.858967 7.723014
male 8.979454 .776574 7.428974 10.52993
Linearized
Over Mean Std. Err. [95% Conf. Interval]
csmokeless
female 3.097841 .434743 2.229849 3.965834
male 7.643461 1.064272 5.518573 9.768349
Linearized
Over Mean Std. Err. [95% Conf. Interval]
chookah_2
female 3.280372 .3791087 2.523457 4.037287
male 3.407643 .4659882 2.477268 4.338019
Linearized
Over Mean Std. Err. [95% Conf. Interval]
cpipe_2
female .5403658 .111872 .3170062 .7637254
male 1.049335 .1511733 .7475076 1.351162
Linearized
Over Mean Std. Err. [95% Conf. Interval]
cbidis_2
female .6124759 .1033142 .4062025 .8187494
male .6833788 .1648692 .3542069 1.012551
Linearized
Over Mean Std. Err. [95% Conf. Interval]
ctobany
female 17.57929 1.205825 15.17178 19.9868
male 21.39717 1.563038 18.27647 24.51788
Linearized
Over Mean Std. Err. [95% Conf. Interval]
ctob2
female 7.702139 .7875678 6.129709 9.274569
male 10.67469 .9431419 8.791646 12.55773
Linearized
Over Mean Std. Err. [95% Conf. Interval]
ctobcomb
female 12.24056 .9370825 10.36962 14.11151
male 13.4728 1.054905 11.36662 15.57899
/*Generating stratified estimates by race. The code below runs the analyses for
both middle and high school students. Results NOT shown for race for brevity. The
double quotes around "`l'" in the code below are needed only because hsms is a
string variable. If it were not a string variable (i.e., a factor variable), we
would simply write it as `l' */
1. Computed 95% Confidence intervals are a measure of the degree of precision of an estimate and should not be used
in lieu of a formal comparison of two estimates. Non-overlap of 95% confidence intervals always indicates that two
estimates differ statistically; however, the presence of an overlap does not preclude statistical significance. A formal
statistical test should therefore always be performed (e.g., a chi-squared test). The type of test will depend on the
variable type (e.g., categorical or continuous) and the underlying assumptions regarding distributions of the data
(parametric or non-parametric). Non-parametric tests are those that make no assumptions regarding parameters or
distributions. The table below shows some appropriate tests for bivariate testing based on variable type and
assumptions regarding underlying distributions.
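Variable types | Parametric tests | Non-parametric tests
Categorical with continuous, independent | ANOVA, Z statistic, Regression (linear or logistic), t-test for independent samples | Kruskal-Wallis test (non-parametric alternative to ANOVA)
Categorical with continuous, correlated | Repeated-measures ANOVA, Mixed-effects models, GEE, paired t-test | Sign test; Wilcoxon Signed Rank Test
Continuous with continuous | Pearson’s correlation | Spearman’s correlation
Count with categorical | Poisson regression | Mann Whitney U Test
Nested comparisons (e.g., nested multi-year estimates) | Nested Z test (formula below) |

Nested Z test:
$$Z = \frac{X_1 - X_2}{\sqrt{SE_1^2 + SE_2^2 - 2P \cdot SE_1^2}}$$
where $X_1$ and $SE_1$ are the estimate and standard error for the nested subgroup, $X_2$ and $SE_2$ are those for the
larger group that contains it, and $P$ is the proportion of the larger group made up by the subgroup.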
/* Chi-squared test for gender and racial differences in the use of the different
tobacco products, among middle and high school students separately (RESULTS SHOWN
FOR ONLY HIGH SCHOOL STUDENTS)*/
*sex*
levelsof hsms, local(levels)
foreach l of local levels {
foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
    cbidis_2 ctobany ctob2 ctobcomb {
svy, subpop(if hsms == "`l'"): tab `var' sex, pearson
}
}
Pearson chi-squared results for each outcome by sex (cross-tabulated cell proportions not reproduced, except for cpipe):

Outcome       Uncorrected chi2(1)   Design-based F(1, 66)   P
celcigt       48.5521               15.5947                 0.0002
ccigt         0.0001                0.0000                  0.9957
ccigar        44.7972               10.4239                 0.0019
csmokeless    178.6821              74.0356                 0.0000
chookah       0.2195                0.0733                  0.7874
cpipe         14.3164               6.7252                  0.0117
cbidis        0.3413                0.1404                  0.7090
ctobany       41.1560               13.7333                 0.0004
ctob2         46.8518               16.0548                 0.0002
ctobcomb      6.0052                1.9494                  0.1673

Cell proportions for cpipe by sex:
cpipe    female    male     Total
0        .492      .5       .992
100      .0027     .0053    .008
*race*
levelsof hsms, local(levels)
foreach l of local levels {
foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
    cbidis_2 ctobany ctob2 ctobcomb {
svy, subpop(if hsms == "`l'"): tab `var' race4, pearson
}
}
Check and re-check your code to ensure there are no bugs and all variables have been recoded
correctly.
Check to make sure the results in your spreadsheets or tables are the same as those in your Stata
console.
Check to see that imprecise estimates are not reported. For subgroup analyses, cells with fewer than
30 people may not provide precise estimates. Consider combining similar categories to increase cell
sample size. Relative Standard Error (RSE) cut-offs in the range of 30% to 50% have been used
in the scientific literature (with prevalence estimates whose RSE exceeds the cut-off considered statistically
unreliable). Estimates above the threshold should ideally be suppressed. RSEs are calculated by dividing the
standard error by the estimate (mean or percentage).
Check to ensure proper statistical tests have been conducted. Ninety-five percent confidence intervals
are merely an eyeball test and should not be used as a definitive statistical test to compare two
prevalence estimates. The absence of an overlap ALWAYS indicates a statistically significant
difference between the two estimates being compared. However, the presence of an overlap does NOT
always preclude significance.
Check the numbers and percentages for correctness in tables and figures, and that they correspond
with information in the text. Ensure tables and figures are able to stand alone with the appropriate
descriptive title and footnotes.
Check to ensure the description of the methods provides sufficient information so the results could be
duplicated by someone with access to the same data and information. This includes providing within
the manuscript detailed descriptions of analytical and/or statistical approaches used with clear
definitions of variables used.
When reporting sample sizes, use the unweighted numbers, NOT the weighted population counts. The
unweighted numbers are the persons who actually completed the survey. For example in the 2017
National Youth Tobacco Survey, a total of 17,872 students in middle and high school participated in the
survey, and the total weighted population count was 27.1 million. The number to be reported as the
sample size is the 17,872 number, NOT the 27.1 million number.
Report the response rate for the survey.
It is generally not enough to report only the p-value. Several valuable pieces of information cannot
be revealed solely by a p-value, such as the effect size or the consistency of a finding. Presenting
information on both the point estimates and the 95% confidence intervals is preferable because it
provides these estimates of magnitude of effect and consistency.
When reporting percentages, use weighted NOT unweighted percentages. Otherwise, results may not
be valid because the unweighted results are from a sample whose distributions (e.g., age, sex, race)
may be very different from the target population.
Inferences from the weighted analyses should be made to the target population rather than the
sampled population. For example, weighted prevalence of current e-cigarette use among high school
students was 11.7% from the 2017 NYTS. Appropriate language to report this result would be “11.7%
of U.S. high school students reported current e-cigarette use”, not “11.7% of sampled high school
students who participated in the survey reported current e-cigarette use”.
Typically, percentages are expressed to one decimal place, measures of association (e.g., odds ratios,
prevalence ratios, etc.) to two decimal places, and p-values to three decimal places.
Do not report a p-value as 0 (e.g., 0.0000). Rather, express it as < 0.0001.
Provide the percentage of respondents with missing data for key outcomes.
Describe any sensitivity analyses and rationale.
Suggested Citation: Step-by-Step Guide To Analyses of Complex Survey Data in Stata. Available at
www.zatumcorp.com. Accessed MM/DD/YYYY.
For comments or questions, please email at info@zatumcorp.com