
www.zatumcorp.com

Step-by-Step Guide to Analyses of Complex Survey Data in Stata

This analysis guide will replicate findings (2017 data) from the MMWR titled:
“Tobacco Product Use Among Middle and High School Students — United States, 2011–2017,” available
at https://www.cdc.gov/mmwr/volumes/67/wr/mm6722a3.htm. This publication was a secondary analysis of
data from the National Youth Tobacco Survey, a nationally representative survey of U.S. students in middle
and high school. The survey used a multistage sampling procedure to generate a nationally representative
sample. Analyses of these data must use the supplied weights and the design (stratum and PSU) variables in the
dataset for the results to be valid. This guide provides simple, step-by-step guidance on conducting analyses of
complex survey data in Stata, using this research article as a case study. The data can be downloaded at the
following site https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html. The data are available
for download in three different formats: SAS (.sas7bdat), Access (.mdb), and Excel (.xlsx).

Below is the table from the report we wish to replicate:



TIPS ON COMMON STATA OPERATORS

!  Not (for example, != means not equal to; !missing(var) means var is not missing.)
=  Assignment (e.g., gen var2 = 1 sets var2 to 1).
== Tests equality. Used within a subset or conditional statement (i.e., use after an if statement)
>  Greater than
>= Greater than or equal to
<  Less than
<= Less than or equal to
.  Missing. NOTE: the syntax if var == . can also be written as if missing(var). Stata treats numeric missing
   as larger than any number, so if var > 5 also selects missing values unless you add & !missing(var).
&  And
|  Or. Rather than specifying several categories of the same variable with multiple “|” operators, consider
   inrange() or inlist(). E.g., replace vartotal = 1 if var == 1 | var == 2 | var == 3
   can be rewritten as: replace vartotal = 1 if inlist(var,1,2,3) or alternatively as:
   replace vartotal = 1 if inrange(var,1,3)
/  To (i.e., range). Mostly used with the recode command, e.g., recode var (1/10 = 1)(11/max = 2), gen(var2).
   Within recode rules, the keywords max and min stand for the highest and lowest nonmissing values of the
   variable respectively.
*  Stand-alone comment without a command on the same line, e.g., *This is a comment. When a
   comment appears after a command, it is written as /*comment*/ e.g., tab var1 var2 /*What
   an awesome cross-tabulation! */
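
The operators above can be tried out on a small made-up dataset. This is a minimal sketch: the variable x stands in for var, and the flag_* and x_group names are illustrative only.

```stata
* Made-up data: four values plus one missing
clear
input x
1
2
3
7
.
end

* Three equivalent ways to flag values 1, 2, or 3
generate byte flag_or      = (x == 1 | x == 2 | x == 3)
generate byte flag_inlist  = inlist(x, 1, 2, 3)
generate byte flag_inrange = inrange(x, 1, 3)

* Numeric missing counts as larger than any number:
* the first count is 2 (the 7 and the missing value), the second is 1
count if x > 3
count if x > 3 & !missing(x)

* In recode rules, max is the highest nonmissing value, so the
* missing value is left as missing
recode x (1/3 = 1) (4/max = 2), generate(x_group)
list
```

The explicit !missing(x) check is the safer habit whenever a condition uses > or >=.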

STEP 1: Download and save the data.

• Download the Excel file from the internet at
  https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html.
• Extract the contents of the downloaded file to a folder somewhere on your computer (e.g., your
  desktop).

STEP 2: Read the Excel data into Stata.

You can import an Excel file into Stata using the dropdown menu as follows: “File” > “Import” > “Excel
Spreadsheet (*.xls; *.xlsx)”. Alternatively, you can use the import excel command. If the first row of the
worksheet holds the variable names, add the firstrow option so that Stata uses that row as variable names
rather than as data.

The command in Stata as an alternative to the drop-down menu would be

import excel "C:\Users\Zatum\OneDrive\Desktop\2017-nyts-dataset-and-codebook-microsoft-excel\nyts2017.xlsx", sheet("nyts2017") clear
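
The sketch below shows how the firstrow and case(lower) options of import excel behave. It writes and re-reads a tiny workbook, so it runs without the NYTS download. The file name demo_nyts.xlsx and its contents are made up for illustration. Whether the NYTS workbook needs these options is an assumption, based on the lowercase variable names used later in this guide.

```stata
* Build a tiny workbook in the working directory so the example is self-contained
clear
input ID CCIGT
1 1
2 2
3 2
end
export excel using "demo_nyts.xlsx", firstrow(variables) replace

* Read it back: firstrow takes variable names from row 1,
* and case(lower) converts those names to lowercase (id, ccigt)
import excel using "demo_nyts.xlsx", firstrow case(lower) clear
describe
```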

TIPS ON IMPORTING DATA INTO STATA FROM OTHER PROGRAMS

1. FROM SAS TRANSPORT FILE (.XPT): use the fdause command (renamed import sasxport in Stata 12 and
import sasxport5 in Stata 16). In the command below, the text inside the quotes is the location of the file on
your computer (i.e., file path), ending with the name of the dataset (DEMO_H.XPT). To determine the file path,
go to where the file is located on your computer, right click on it and select properties. Copy the text
beside “Location”.

fdause "/Users/Zatum/Downloads/DEMO_H.XPT"

2. FROM CSV: You can import a .csv file into Stata using the dropdown menu as follows: “File” > “Import” > “Text
data (delimited, *.csv, …)”. Alternatively, you can use the import delimited command.

import delimited "C:\Users\Zatum\OneDrive\Desktop\nyts2017.csv"

3. FROM ACCESS (.MDB): Open the Access file > click the node at the top left corner to select the entire table >
copy (Ctrl C) > open Stata and click on “Data Editor (Edit)” > click on the top left corner to select the entire
editor > paste (Ctrl V). Treat the first row as variable names, not data.

[Screenshots omitted. (1) In Access: click the node, then Ctrl C to copy the entire data; a progress bar appears
at bottom right. (2) In Stata: click Data Editor to open a blank data editor, which will hold the data. (3) In the
Stata Data Editor: click the first cell, then Ctrl V to paste the data, and select first row as variable names.]

4. FROM SAS (SAS7BDAT): Stata 16 and later can read a .sas7bdat file directly with the import sas command. In
earlier versions there is no straightforward way to read a .sas7bdat file into Stata; the available options all
involve another program such as SAS, Stat/Transfer, etc., that can convert the file into a format that Stata can
read (e.g., .xpt, .csv, .dta). The challenge, however, is that many of these programs are not open source and may
not be readily available. R, however, is an open source program which we can use to convert a .sas7bdat file
into a .dta file using the steps outlined below.

(a) Download and install R software from the Internet using the links below: Windows:
https://cran.r-project.org/bin/windows/base/; Mac: https://cran.r-project.org/bin/macosx/

(b) Install the sas7bdat and foreign packages in R. sas7bdat will be used to import the SAS file into R
while foreign will be used to export the file from R as a Stata dataset (note below there are quotes around
the packages for installation).

install.packages(c("sas7bdat", "foreign"))

[Screenshot callout: The properties box is a quick way to see the number of variables and observations in your
dataset.]

(c) Load the packages after they have finished installing in R (note there are no quotes to call the libraries)

library(sas7bdat)

library(foreign)

(d) Read the dataset into R; object “y” has been assigned arbitrarily. The file.choose() argument
allows you to manually select the dataset from wherever it is located on your computer.

y = read.sas7bdat(file.choose())

(e) Export the dataset from R as a Stata dataset and save it somewhere on your computer. We are assigning the
arbitrary name “statafile” (you can change it to something else). Replace the file path in the command below
with the actual file path on your computer where you wish to save the file. If you want to save to your
desktop, simply right click any file on your desktop and select properties. Next, copy the text beside
“Location” and use it in place of the path below. Ensure all backslashes are doubled as shown below.

write.dta(y, "C:\\Users\\Zatum\\OneDrive\\Desktop\\statafile.dta")

(f) Now read the .dta file into Stata directly using File > Open within Stata.

• To see a list of all the variables in the dataset, type the command describe or simply desc

TIPS ON DATA INSPECTION IN STATA

1. Variables in Stata could be factor, numeric, or character/string. Stata uses a color-coded system for variables,
with three possible color conventions. Enter browse in the command window to see:

o Black: numeric or factor variables without value labels
o Blue: factor variables with value labels
o Red: character/string

2. Two useful commands that work in tandem for creating temporary working files are preserve and restore.
You can preserve a dataset, play as much with the dataset as you desire, and then restore back to the original state,
at which point all changes made are promptly discarded. You cannot restore when you have not preserved!
3. It is good practice to inspect your data to see the format the variables are stored in (numeric or string); to
see the number and pattern of missing values; and to ensure that the codebook faithfully reflects what is in the
dataset.

4. Two options for ending a command in a Stata do-file are:

#delimit ; /*similar to SAS coding; useful for multi-line commands*/

#delimit cr /*carriage return (default). For multi-line commands, you have to enter /// at the end of each line that does not terminate a command */
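
Tips 2 and 4 can be sketched with the auto dataset that ships with Stata. It is used here only for illustration and is not part of the NYTS analysis. The block is meant to be run from a do-file, since #delimit is not available in the interactive command window.

```stata
* Example dataset shipped with Stata (illustration only)
sysuse auto, clear

* preserve/restore: changes made in between are discarded on restore
count
preserve
keep if foreign == 1
count
restore
count

* Semicolon delimiter: a command ends only at the semicolon
#delimit ;
summarize price
    mpg
    weight ;
#delimit cr

* Default delimiter: /// continues a command onto the next line
summarize price ///
    mpg         ///
    weight
```

The first and last count should report the same number of observations, because restore brings back the preserved copy.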

Quickly examine the entire dataset with the browse command to see the extent of factor variables with value
labels (blue), numeric (black), or string (red) variables.

browse

You can also browse just one or more variables rather than the entire dataset.

browse sex

STEP 3: Recode variables.

TIPS ON RECODING VARIABLES IN STATA


1. To compute percentages using svy: mean, you MUST recode your outcome so that cases are assigned a value of
1 (or 100) and non-cases are assigned a value of 0. In several surveys, responses of “yes” are coded as 1, while
responses of “no” are coded as 2. You may therefore need to recode 2 to 0. Assigning cases and non-cases
values of 1 and 0 respectively produces results as proportions (e.g., 0.56). Assigning cases and non-cases values
of 100 and 0 respectively produces results as percentages (e.g., 56%).

2. You cannot recode string variables. You can however use the replace command, the real() function, or the
destring command, depending on which is appropriate. Never modify, change, or delete an original variable as you
might need it again later.

3. With few exceptions, the maximum number of commas in everyday Stata commands is one.

5. You can run code either in batch mode (from a do-file) or as single lines of code from the command window. If
you wish to save the entire printout of results from the Results window, you can use the command
translate @Results filename.txt. The generated txt file will be saved in the working directory.

6. Loop statements can avoid repetitious coding. Loop statements can be used when you either want to execute the
exact same function on several variables, or on several categories/levels of the same variable. For example, if we
wanted a tabulation of several variables, we could execute the following loop rather than tabulating them one at a
time.

foreach var of varlist age sex race {
    tab `var'
}

Note that the opening symbol ` is a backtick, found on the key below the Esc key. The closing symbol ' is a
straight apostrophe, found on the key beside the Enter key.

The number of opening and closing curly brackets must correspond to the number of foreach statements. In the
example above, there is only one foreach statement, and so we have only one opening and one closing curly bracket.
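
Tip 1 can be checked on a small made-up variable. In this sketch, smoke (1 = yes, 2 = no) and the two recoded names are hypothetical. Two of the five made-up respondents are cases, so the 0/1 version has a mean of 0.4 and the 0/100 version a mean of 40.

```stata
* Made-up responses: 1 = yes, 2 = no
clear
input smoke
1
2
2
1
2
end

* Keep the original variable and create recoded copies
recode smoke (1 = 1)   (2 = 0), generate(smoke_01)
recode smoke (1 = 100) (2 = 0), generate(smoke_100)

* The mean of the 0/1 copy is a proportion; the mean of the 0/100 copy is a percentage
mean smoke_01 smoke_100
```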

7. When recoding or collapsing outcome variables (e.g., creating a binary variable), responses of “don’t know”, “not
sure” should be excluded because of potential for misclassification; otherwise, justification should be provided for
collapsing them with another category. For outcomes measured on a Likert scale, computing the mean of the raw
responses is not recommended given truncation as well as the fact that the ensuing results have no meaningful
interpretation. The variable(s) could instead be dichotomized based on a priori determined study objectives, e.g.,
“strongly agree”/“agree” vs. other responses.

8. Skip patterns: it is very important to recognize that some surveys use skip patterns, meaning that only individuals
eligible for a given question answered it based on their responses to one or more preceding filter questions. To ensure
the accurate denominator is being assessed, both the filter question(s) and the final question may need to be
incorporated as appropriate. For example, in several adult surveys, smoking status is determined by first asking
respondents if they have smoked at least 100 cigarettes in their lifetime. Those who answer yes are then asked if they
smoke now. Those who answer no to the first question are not asked the second question at all (i.e., skipped to the
next question). In creating a variable describing current smoking among all participants with these questions, a value
of 1 will be assigned to those who answered “yes” to both questions (i.e., have smoked 100+ cigarettes AND smoke
now). A value of 0 will be assigned to those who answered yes to the first question but no to the second question (i.e.,
have smoked 100+ cigarettes but do not smoke now). A value of 0 will also be assigned to those who answered no to
the first question and were skipped from answering the second question.

9. When creating a composite variable from two or more variables (e.g., the measure of “any tobacco use” in the 2017
NYTS denoting use of at least one of seven different tobacco product types), missing values can be handled in one of
two ways. The first (and more stringent approach) would be to analyze only individuals with complete information on
all variables of interest (listwise deletion). Under this approach, even individuals with information on all but one
variable would be excluded. The second and more lenient approach would be to exclude individuals only if they
were missing information on all variables of interest. Thus, an individual with information present for only one
variable would still be included in the analyses. The first approach may lead to loss of sample size and precision; also,
the sheer magnitude of those excluded potentially increases the magnitude of selection bias if missingness was not at
random. The second approach, however, increases the likelihood of misclassification bias. For example, if an
individual only had information for one tobacco product (say cigars), for which he reported being a non-user, he
would be classified as a non-tobacco user, even if he used other forms of tobacco (for which data are missing).
Absence of evidence is not evidence of absence! Information should be provided on how missing data were dealt
with. Note that this MMWR used the second approach.
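
The skip-pattern logic in tip 8 can be sketched as follows. The data and the variable names ever100, smokenow, and cursmoke are made up for illustration and are not NYTS variables. The sketch uses the 0/100 coding adopted later in this guide.

```stata
* Made-up data. ever100: 1 = yes, 2 = no.
* smokenow (1 = yes, 2 = no) is asked only if ever100 == 1
clear
input id ever100 smokenow
1 1 1
2 1 2
3 2 .
4 . .
end

generate cursmoke = .
* Yes to both questions: current smoker
replace cursmoke = 100 if ever100 == 1 & smokenow == 1
* Yes to the filter question, no to the follow-up: not a current smoker
replace cursmoke = 0 if ever100 == 1 & smokenow == 2
* No to the filter question (follow-up skipped): not a current smoker
replace cursmoke = 0 if ever100 == 2
list
```

Respondent 4, who did not answer the filter question, stays missing rather than being counted as a non-smoker.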

This article examined the prevalence of current use for 10 tobacco use measures: cigarettes, cigars, smokeless
tobacco, electronic cigarettes, hookah, pipe tobacco, bidis, any tobacco product, 2+ tobacco products, and any
combustible tobacco use. Results were stratified by school level (middle and high school), sex (male and
female) and race/ethnicity (white, black, Hispanic, other race).

OUTCOME VARIABLES ASSESSED IN THE 2017 NYTS MMWR

Tobacco type                     Variable name  Status in downloaded dataset  Categories in dataset  Categories desired for analyses
Cigarettes                       ccigt          Present                       1 = yes; 2 = no        100 = yes; 0 = no
Cigars                           ccigar         Present                       1 = yes; 2 = no        100 = yes; 0 = no
Chewing tobacco, snuff, or dip   cslt           Present                       1 = yes; 2 = no        100 = yes; 0 = no
Snus                             csnus          Present                       1 = yes; 2 = no        100 = yes; 0 = no
Dissolvable tobacco products     cdissolv       Present                       1 = yes; 2 = no        100 = yes; 0 = no
Electronic cigarettes            celcigt        Present                       1 = yes; 2 = no        100 = yes; 0 = no
Hookah                           chookah        Present                       1 = yes; 2 = no        100 = yes; 0 = no
Pipe tobacco                     cpipe          Present                       1 = yes; 2 = no        100 = yes; 0 = no
Bidis                            cbidis         Present                       1 = yes; 2 = no        100 = yes; 0 = no
Any tobacco product [1]          ctobany        Not present. To be derived                           100 = yes; 0 = no
2+ tobacco products [2]          ctob2          Not present. To be derived                           100 = yes; 0 = no
Any smokeless tobacco use [3]    csmokeless     Not present. To be derived                           100 = yes; 0 = no
Any combustible tobacco use [4]  ctobcomb       Not present. To be derived                           100 = yes; 0 = no

[1] Any tobacco use: any of e-cigarettes, cigarettes, cigars, smokeless tobacco, hookah, pipe tobacco, bidis
[2] Use of 2+ tobacco products: two or more of any of the following: e-cigarettes, cigarettes, cigars, smokeless
    tobacco, hookah, pipe tobacco, bidis
[3] Any smokeless tobacco use: chewing tobacco, snuff, dip, snus, or dissolvable tobacco products
[4] Any combustible tobacco product use: cigarettes, cigars, hookah, pipe tobacco, or bidis

*Tabulate tobacco use variables in dataset


. tab1 ccigt-cbidis

-> tabulation of ccigt

CCIGT Freq. Percent Cum.

1 973 5.58 5.58


2 16,461 94.42 100.00

Total 17,434 100.00

-> tabulation of ccigar

CCIGAR Freq. Percent Cum.

1 977 5.61 5.61


2 16,439 94.39 100.00

Total 17,416 100.00

-> tabulation of cslt

CSLT Freq. Percent Cum.

1 510 2.95 2.95


2 16,795 97.05 100.00

Total 17,305 100.00

-> tabulation of celcigt

CELCIGT Freq. Percent Cum.

1 1,360 7.74 7.74


2 16,210 92.26 100.00

Total 17,570 100.00

-> tabulation of chookah

CHOOKAH Freq. Percent Cum.

1 508 2.92 2.92


2 16,880 97.08 100.00

Total 17,388 100.00



-> tabulation of cpipe

CPIPE Freq. Percent Cum.

1 137 0.79 0.79


2 17,191 99.21 100.00

Total 17,328 100.00

-> tabulation of csnus

CSNUS Freq. Percent Cum.

1 249 1.44 1.44


2 17,079 98.56 100.00

Total 17,328 100.00

-> tabulation of cdissolv

CDISSOLV Freq. Percent Cum.

1 105 0.61 0.61


2 17,223 99.39 100.00

Total 17,328 100.00

-> tabulation of cbidis

CBIDIS Freq. Percent Cum.

1 105 0.61 0.61


2 17,223 99.39 100.00

Total 17,328 100.00

DEMOGRAPHIC VARIABLES ASSESSED IN THE 2017 NYTS MMWR

Demographic variable  Variable name  Categories in dataset          Categories desired for analyses
Race/ethnicity        race_m         1 = White, non-Hispanic        White
                                     2 = Black, non-Hispanic        Black
                                     3 = Hispanic                   Hispanic
                                     4 = Asian, non-Hispanic        Other
                                     5 = AI/AN, non-Hispanic        Other
                                     6 = NHOPI, non-Hispanic        Other
                                     7 = Multi-race, non-Hispanic   Other
School level          hsms           MS = Middle School             Middle
                                     HS = High School               High
Sex                   sex            1 = Female                     Female
                                     2 = Male                       Male

*Tabulate demographic variables in dataset


. tab1 race_m sex hsms

-> tabulation of race_m

RACE_M Freq. Percent Cum.

1 7,532 44.09 44.09


2 2,983 17.46 61.55
3 4,614 27.01 88.56
4 728 4.26 92.82
5 230 1.35 94.16
6 96 0.56 94.73
7 901 5.27 100.00

Total 17,084 100.00

-> tabulation of sex

SEX Freq. Percent Cum.

1 8,815 49.81 49.81


2 8,881 50.19 100.00

Total 17,696 100.00

-> tabulation of hsms

hsms Freq. Percent Cum.

HS 10,172 56.92 56.92


MS 7,700 43.08 100.00

Total 17,872 100.00

Note: Use tab1 when you want to get the marginal distributions of multiple variables at the same time. Use
tab (without the “1”) when you want to tabulate or cross-tabulate variables.

COMPLEX SURVEY DESIGN VARIABLES ASSESSED IN THE 2017 NYTS MMWR

Weight variable (variable name: finwgt)
Values: Range: 30.7 to 6505.1. Mean = 1518.3.
Rationale: The weight variable accounts for differential probabilities of selection and non-response.
Incorporating the weight variable in the analyses ensures that the means or percentages are estimated correctly.

Stratum (variable name: stratum)
Values: There are 16 strata in NYTS. The number of PSUs per stratum ranged from 2 to 12.
Rationale: Stratification was used to increase precision and allow sufficient sample sizes for racial minorities.
In NYTS, there are 16 strata (2*2*4) based on the following criteria: (1) two strata of urban/rural location
(urban vs. nonurban); (2) two strata of racial/ethnic minority enrollment, predominantly Hispanic vs.
predominantly non-Hispanic black; and (3) four density distribution groupings (substrata) for each of:
non-Hispanic black urban, non-Hispanic black nonurban, Hispanic urban, and Hispanic nonurban PSUs.
The stratum and PSU variables together are used to correctly estimate variance. These two variables ensure that
the 95% confidence intervals are estimated correctly.

Primary sampling unit (PSU) (variable name: psu)
Values: There are 82 PSUs in NYTS across all strata. Within each stratum, PSUs were selected probabilistically.
Rationale: NYTS used cluster sampling (three-stage sampling process) because this approach is logistically
advantageous and efficient. The PSUs in NYTS were counties (entire counties, mergers of smaller counties, or
parts of large counties). After selection of the PSUs, schools were next selected at random within each selected
PSU, and classes selected randomly from each school.
While cluster sampling is efficient, its downside is that observations tend to be highly correlated within
clusters. Accounting for the PSU variable helps to adjust for this intra-cluster correlation.

*Describe complex survey design variables in dataset


. summarize finwgt

Variable Obs Mean Std. Dev. Min Max

finwgt 17,872 1518.329 1244.704 30.73072 6505.084

. tab stratum

stratum Freq. Percent Cum.

BR1 1,865 10.44 10.44


BR2 968 5.42 15.85
BR3 1,087 6.08 21.93
BR4 1,021 5.71 27.65
BU1 1,205 6.74 34.39
BU2 930 5.20 39.59
BU3 1,318 7.37 46.97
BU4 711 3.98 50.95
HR1 2,493 13.95 64.89
HR2 249 1.39 66.29
HR3 707 3.96 70.24
HR4 323 1.81 72.05
HU1 1,546 8.65 80.70
HU2 1,654 9.25 89.96
HU3 697 3.90 93.86
HU4 1,098 6.14 100.00

Total 17,872 100.00

. tab psu

. bys stratum: tab psu

-> stratum = BR1

psu Freq. Percent Cum.

174258 318 17.05 17.05


301900 188 10.08 27.13
302266 309 16.57 43.70
373427 463 24.83 68.53
515683 68 3.65 72.17
515792 208 11.15 83.32
559816 137 7.35 90.67
772237 174 9.33 100.00

Total 1,865 100.00

-> stratum = BR2

psu Freq. Percent Cum.

142976 252 26.03 26.03


344106 265 27.38 53.41
374076 293 30.27 83.68
644424 158 16.32 100.00

Total 968 100.00

-> stratum = BR3

psu Freq. Percent Cum.

14595 354 32.57 32.57


173094 160 14.72 47.29
188113 288 26.49 73.78
729632 285 26.22 100.00

Total 1,087 100.00

-> stratum = BR4

psu Freq. Percent Cum.

186117 341 33.40 33.40


314736 370 36.24 69.64
400362 229 22.43 92.07
401050 81 7.93 100.00

Total 1,021 100.00

-> stratum = BU1

psu Freq. Percent Cum.

258883 140 11.62 11.62


343581 213 17.68 29.29
387897 203 16.85 46.14
418326 193 16.02 62.16
516100 172 14.27 76.43
559447 155 12.86 89.29
602173 129 10.71 100.00

Total 1,205 100.00



-> stratum = BU2

psu Freq. Percent Cum.

171594 116 12.47 12.47


259591 274 29.46 41.94
515682 257 27.63 69.57
558771 283 30.43 100.00

Total 930 100.00

-> stratum = BU3

psu Freq. Percent Cum.

188307 421 31.94 31.94


374456 336 25.49 57.44
487115 182 13.81 71.24
530582 379 28.76 100.00

Total 1,318 100.00

-> stratum = BU4

psu Freq. Percent Cum.

188207 320 45.01 45.01


344317 60 8.44 53.45
350305 75 10.55 63.99
602417 169 23.77 87.76
602425 87 12.24 100.00

Total 711 100.00

-> stratum = HR1

psu Freq. Percent Cum.

115262 278 11.15 11.15


129751 170 6.82 17.97
245033 148 5.94 23.91
259810 167 6.70 30.61
274380 189 7.58 38.19
373245 164 6.58 44.77
516335 189 7.58 52.35
585762 72 2.89 55.23
600815 531 21.30 76.53
602062 206 8.26 84.80
672182 205 8.22 93.02
758663 174 6.98 100.00

Total 2,493 100.00

-> stratum = HR2

psu Freq. Percent Cum.

686767 32 12.85 12.85


692818 217 87.15 100.00

Total 249 100.00



-> stratum = HR3

psu Freq. Percent Cum.

86452 296 41.87 41.87


501078 240 33.95 75.81
757723 171 24.19 100.00

Total 707 100.00

-> stratum = HR4

psu Freq. Percent Cum.

686434 138 42.72 42.72


690913 185 57.28 100.00

Total 323 100.00

-> stratum = HU1

psu Freq. Percent Cum.

129060 256 16.56 16.56


487123 37 2.39 18.95
515990 364 23.54 42.50
586736 178 11.51 54.01
674270 174 11.25 65.27
701295 168 10.87 76.13
730705 184 11.90 88.03
758362 185 11.97 100.00

Total 1,546 100.00

-> stratum = HU2

psu Freq. Percent Cum.

86972 268 16.20 16.20


87144 188 11.37 27.57
174186 249 15.05 42.62
243848 134 8.10 50.73
343702 394 23.82 74.55
692424 421 25.45 100.00

Total 1,654 100.00

-> stratum = HU3

psu Freq. Percent Cum.

58385 165 23.67 23.67


87858 93 13.34 37.02
88152 129 18.51 55.52
689716 276 39.60 95.12
693010 34 4.88 100.00

Total 697 100.00

-> stratum = HU4

psu Freq. Percent Cum.

86889 240 21.86 21.86


87174 314 28.60 50.46
87568 318 28.96 79.42
173711 226 20.58 100.00

Total 1,098 100.00



/*Dichotomizing tobacco use variables as 0, 100

We shall recode all the individual tobacco products from 2 = no, 1 = yes; to 0 = no; 100 =
yes. Recoding as 0-100 allows us to compute percentages directly (instead of proportions
which would have been produced from a 0-1 recode). Naming convention for the newly
created variables will be as follows: the old variable name plus “_2” at the end (e.g.,
ccigt will become ccigt_2). As a precautionary measure, NEVER alter or delete your
original variables as you might need them again! We will recode all the variables in a
single loop.*/
foreach var of varlist ccigt ccigar cslt celcigt chookah cpipe csnus cdissolv ///
    cbidis {
    recode `var' (2=0)(1 = 100), gen(`var'_2)
}
tab1 ccigt_2-cbidis_2

. tab1 ccigt_2- cbidis_2

-> tabulation of ccigt_2

RECODE of
ccigt
(CCIGT) Freq. Percent Cum.

0 16,461 94.42 94.42


100 973 5.58 100.00

Total 17,434 100.00

-> tabulation of ccigar_2

RECODE of
ccigar
(CCIGAR) Freq. Percent Cum.

0 16,439 94.39 94.39


100 977 5.61 100.00

Total 17,416 100.00

-> tabulation of cslt_2

RECODE of
cslt (CSLT) Freq. Percent Cum.

0 16,795 97.05 97.05


100 510 2.95 100.00

Total 17,305 100.00

-> tabulation of celcigt_2

RECODE of
celcigt
(CELCIGT) Freq. Percent Cum.

0 16,210 92.26 92.26


100 1,360 7.74 100.00

Total 17,570 100.00


-> tabulation of chookah_2

RECODE of
chookah
(CHOOKAH) Freq. Percent Cum.

0 16,880 97.08 97.08


100 508 2.92 100.00

Total 17,388 100.00


-> tabulation of cpipe_2

RECODE of
cpipe
(CPIPE) Freq. Percent Cum.

0 17,191 99.21 99.21


100 137 0.79 100.00

Total 17,328 100.00

-> tabulation of csnus_2

RECODE of
csnus
(CSNUS) Freq. Percent Cum.

0 17,079 98.56 98.56


100 249 1.44 100.00

Total 17,328 100.00

-> tabulation of cdissolv_2

RECODE of
cdissolv
(CDISSOLV) Freq. Percent Cum.

0 17,223 99.39 99.39


100 105 0.61 100.00

Total 17,328 100.00

-> tabulation of cbidis_2

RECODE of
cbidis
(CBIDIS) Freq. Percent Cum.

0 17,223 99.39 99.39


100 105 0.61 100.00

Total 17,328 100.00

/*create composite variable for any smokeless tobacco (i.e., dissolvable/snus/ chewing
tobacco, snuff, or dip). For command below, the ‘missing’ option allows a tally to be
generated for an individual as long as they have information present for at least one
variable; individuals are only assigned a missing value if they are missing information
on all variables assessed*/
egen csmokeless = rowtotal(csnus_2 cdissolv_2 cslt_2), missing
recode csmokeless (100/max = 100)
tab csmokeless
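
The sketch below contrasts the two ways of handling missing components described in tip 9. The data and the names a, b, c, any_lenient, any_strict, and n_missing are made up for illustration. The first approach is the one used in this guide; the second is the listwise alternative.

```stata
* Made-up data: three product variables already coded 0/100
clear
input id a b c
1 0 0 100
2 0 . .
3 . . .
4 0 0 0
end

* Approach used in this guide: missing only if ALL components are missing
egen any_lenient = rowtotal(a b c), missing
recode any_lenient (100/max = 100)

* Stricter alternative: require complete information on every component
egen n_missing = rowmiss(a b c)
egen any_strict = rowtotal(a b c) if n_missing == 0
recode any_strict (100/max = 100)

list
```

Respondent 2 is classified as a non-user under the first approach but is set to missing under the second, which is the misclassification versus sample-size trade-off the tip describes.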

. tab csmokeless

csmokeless Freq. Percent Cum.

0 17,000 96.01 96.01


100 706 3.99 100.00

Total 17,706 100.00

/* create composite variable for any tobacco use*/


egen ctobany = rowtotal(ccigt_2 ccigar_2 chookah_2 cpipe_2 cbidis_2 celcigt_2 ///
    csmokeless), missing
recode ctobany (100/max = 100)
tab ctobany

. tab ctobany

ctobany Freq. Percent Cum.

0 15,312 85.96 85.96


100 2,501 14.04 100.00

Total 17,813 100.00

/* create composite variable for any combustible tobacco use*/


egen ctobcomb = rowtotal(ccigt_2 ccigar_2 chookah_2 cpipe_2 cbidis_2), missing
recode ctobcomb (100/max = 100)
tab ctobcomb

. tab ctobcomb

ctobcomb Freq. Percent Cum.

0 16,084 90.36 90.36


100 1,715 9.64 100.00

Total 17,799 100.00

/* create composite variable for use of 2+ tobacco products. First, we will create a
tally for total number of products used by each individual, then we will dichotomize it
into 0-1 vs 2+ */
egen ctob2 = rowtotal(ccigt_2 ccigar_2 celcigt_2 chookah_2 cpipe_2 csmokeless ///
    cbidis_2), missing
/*numbers are in the hundreds because we recoded as 0, 100*/
recode ctob2 (0 100 = 0)(200/700 = 100)
tab ctob2

. tab ctob2

ctob2 Freq. Percent Cum.

0 16,650 93.47 93.47


100 1,163 6.53 100.00

Total 17,813 100.00

*Assign value labels to sex


label define sex 1 "female" 2 "male"
label values sex sex
tab sex

. tab sex

SEX Freq. Percent Cum.

female 8,815 49.81 49.81


male 8,881 50.19 100.00

Total 17,696 100.00

*Recode race and assign value labels


recode race_m (1=1 "white")(2=2 "black")(3=3 "hispanic")(4/7 = 4 "other"), ///
    gen(race4)
tab race4
. tab race4

RECODE of
race_m
(RACE_M) Freq. Percent Cum.

white 7,532 44.09 44.09


black 2,983 17.46 61.55
hispanic 4,614 27.01 88.56
other 1,955 11.44 100.00

Total 17,084 100.00

STEP 4: Set data to survey mode using the weight, PSU, and stratum variables. Within the survey mode,
estimation commands must be prefixed with svy: (e.g., svy: mean) for the design settings to be applied.

svyset [pweight = finwgt], strata(stratum) psu(psu)
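
After svyset, it is worth confirming that Stata recorded the design you intended. This is a minimal sketch on made-up data; the values and the outcome y are hypothetical. The svyset line uses the form documented in current Stata manuals, with the PSU variable placed before the weight, which is assumed to be equivalent to the psu() option used above.

```stata
* Made-up design: 2 strata, 2 PSUs per stratum, 2 students per PSU
clear
input stratum psu finwgt y
1 1 10 100
1 1 12 0
1 2 15 0
1 2 11 100
2 3 20 0
2 3 18 0
2 4 25 100
2 4 22 0
end

svyset psu [pweight = finwgt], strata(stratum)

* svyset with no arguments displays the current survey settings
svyset

* svydescribe lists the number of PSUs and observations in each stratum
svydescribe

* Weighted percentage with a design-based confidence interval
svy: mean y
```

On the real data, svydescribe gives the same PSU-per-stratum summary as the by-stratum tabulations shown earlier, in a single table.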



TIPS ON SETTING DATA TO SURVEY MODE

1. Some surveys (e.g., certain telephone-based surveys that use random-digit dialing) may not involve a multi-stage
selection process and hence PSUs may not be available. In such cases, setting the data to survey mode will require
using only the weight and stratum variables:

svyset [pweight = weight_var], strata(stratum_var)

2. Similarly, some surveys may involve a multi-stage selection process but may not involve stratification. In such cases,
the PSU variable will be present, but there will be no stratum variable. Setting the data to survey
mode will then require using only the weight variable and the PSU variable.

svyset [pweight = weight_var], psu(psu_var)

3. The weight variable is used to accurately estimate means and percentages. The PSU and strata variables are used to
accurately estimate measures of variance (e.g., 95% confidence intervals).

4. Since each survey year is individually weighted to represent the population for that year, when appending data from
multiple years to increase sample size for a point estimate, the weights have to be adjusted by dividing by the number
of years pooled. The newly adjusted weight variable would then be used to set the data to survey mode.

5. For the weight, stratum, or PSU variables, it is either all or none (all individuals have the variable, or no individual
does). There cannot be missing values for any of these variables. When appending data from multiple years, inspect
carefully to ensure that information is complete for all observations in the dataset. Otherwise, results may be
erroneous and invalid.

6. Certain analytic techniques (e.g., inverse probability weighting) create weights that are different from the survey
weights that came with the dataset. Since only one set of weights can be used in analyses, compute new weights by
multiplying the survey weights within the dataset with the weights generated from the statistical procedure. These
newly created weights can then be used to set the data to survey mode.

7. Occasionally (e.g., with restrictive inclusion criteria for analyses), Stata may produce output that contains
percentages but omits the standard errors and confidence intervals. This occurs because some strata have only one
PSU, in which case the variance within that stratum cannot be estimated. The default option in Stata for single
units (i.e., strata with only one PSU) is to report missing standard errors (and hence missing confidence
intervals). To work around this, you can change the default so that strata with a single PSU are centered at the
grand mean across all strata, instead of the stratum-specific mean, when the variance is computed. Restricting
analyses with the subpop() option rather than dropping observations also helps, because all PSUs stay in the
dataset. Specify the singleunit option when you are setting the data to survey mode as follows:

svyset [pweight = weight_var], psu(psu_var) strata(strata_var) singleunit(centered)

8. Running a new svyset command overrides any previous svyset command(s). To correct an error made during
setting the data to survey mode, simply re-run the svyset command.
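
Tips 7 and 8 can be seen together in a small made-up design in which stratum 3 has only one PSU. All values and the variable names w and y are hypothetical. The first svy: mean should report a missing standard error. Re-running svyset with singleunit(centered) replaces the earlier settings, and the second svy: mean then reports one.

```stata
* Made-up design: strata 1 and 2 have two PSUs each, stratum 3 has one
clear
input stratum psu w y
1 1 10 100
1 2 10 0
2 3 10 0
2 4 10 100
3 5 10 100
end

* Default handling of single-PSU strata: standard error is set to missing
svyset psu [pweight = w], strata(stratum)
svy: mean y

* Re-running svyset overrides the previous settings
svyset psu [pweight = w], strata(stratum) singleunit(centered)
svy: mean y
```

The point estimate is the same in both runs; only the variance calculation changes.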

STEP 5: Compute overall prevalence estimates for all the outcomes for middle and high school students separately.

For reference, the results from the MMWR are shown below. The estimates of interest are shown in red.

/*To generate estimates for both middle and high school students (hsms) in the
same command, we can use the code below. Note that the double quotes around
"`l'" are needed only because hsms is a string variable. If hsms were a numeric
variable, we would simply write `l' */
levelsof hsms, local(levels)
foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): mean `var'
    }
}
/*To simplify the command above, the following analyses will compute results
separately for high and middle school students*/
*high school students*
foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
    cbidis_2 ctobany ctob2 ctobcomb {
    svy, subpop(if hsms == "HS"): mean `var'
}

. foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 cbidis_2 ct
> obany ctob2 ctobcomb{
2. svy, subpop(if hsms == "HS"): mean `var'
3. }
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,683


Number of PSUs = 82 Population size = 26,895,773
Subpop. no. obs = 9,983
Subpop. size = 14,763,179
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

celcigt_2 11.6729 1.069494 9.53759 13.80822

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,614


Number of PSUs = 82 Population size = 26,806,114
Subpop. no. obs = 9,914
Subpop. size = 14,673,519
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

ccigt_2 7.657248 .6089528 6.441434 8.873061

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,601


Number of PSUs = 82 Population size = 26,793,703
Subpop. no. obs = 9,901
Subpop. size = 14,661,109
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

ccigar_2 7.744072 .6218093 6.502589 8.985554

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,776


Number of PSUs = 82 Population size = 27,029,964
Subpop. no. obs = 10,076
Subpop. size = 14,897,370
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

csmokeless 5.455253 .7194613 4.018802 6.891704



(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,602


Number of PSUs = 82 Population size = 26,798,331
Subpop. no. obs = 9,902
Subpop. size = 14,665,737
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

chookah_2 3.434495 .3592209 2.717287 4.151703

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,566


Number of PSUs = 82 Population size = 26,761,022
Subpop. no. obs = 9,866
Subpop. size = 14,628,428
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

cpipe_2 .8692908 .0990611 .671509 1.067073

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,566


Number of PSUs = 82 Population size = 26,761,022
Subpop. no. obs = 9,866
Subpop. size = 14,628,428
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

cbidis_2 .7258421 .1068328 .5125436 .9391406

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,834


Number of PSUs = 82 Population size = 27,094,356
Subpop. no. obs = 10,134
Subpop. size = 14,961,761
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

ctobany 19.60851 1.301991 17.009 22.20802



(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,834


Number of PSUs = 82 Population size = 27,094,356
Subpop. no. obs = 10,134
Subpop. size = 14,961,761
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

ctob2 9.301875 .787803 7.728975 10.87477

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,824


Number of PSUs = 82 Population size = 27,083,594
Subpop. no. obs = 10,124
Subpop. size = 14,950,999
Design df = 66

Linearized
Mean Std. Err. [95% Conf. Interval]

ctobcomb 12.98673 .9043501 11.18113 14.79232

*For middle school students (RESULTS NOT PRESENTED)*
foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
    cbidis_2 ctobany ctob2 ctobcomb {
    svy, subpop(if hsms == "MS"): mean `var'
}
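
As a check on the high school output above, the confidence limits can be reproduced by hand from the mean, the linearized standard error, and the design degrees of freedom (number of PSUs minus number of strata). The sketch below copies the celcigt_2 values from the output; the scalar names are hypothetical and chosen only for this illustration.

```stata
* Values copied from the celcigt_2 output for high school students
scalar est_mean  = 11.6729       // weighted percentage
scalar est_se    = 1.069494      // linearized standard error
scalar design_df = 82 - 16       // number of PSUs minus number of strata

* Two-sided 95% critical value from the t distribution with the design df
scalar tcrit = invttail(design_df, 0.025)
scalar ci_lb = est_mean - tcrit*est_se
scalar ci_ub = est_mean + tcrit*est_se
display "95% CI: " %8.5f ci_lb " to " %8.5f ci_ub
```

The limits agree with the interval in the output (9.53759 to 13.80822) up to rounding of the copied inputs.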

TIPS ON SUMMARY STATISTICS WITH 95% CONFIDENCE INTERVALS

1. All estimation commands must be prefixed with svy: to account for the complex survey design and yield valid results. Both the
tabulate (or tab) and mean commands can generate point prevalence estimates with 95% confidence intervals. To use mean,
however, you must first recode the variable as 0/1 or 0/100 (where 1 or 100 represents a case, e.g., a smoker, and 0
represents a non-case, e.g., a non-smoker). The mean of a 0/1 variable equals the proportion of cases (e.g., the proportion
of respondents who smoke); the mean of a 0/100 variable equals the percentage.

svy: tab var, ci


svy: mean var

2. The mean command is preferable for the simplicity of its output: it generates a single point estimate (the % that smoke)
rather than two complementary percentages for smokers and nonsmokers. The mean command also reports the standard error,
which can be used to compute the relative standard error (RSE = standard error/mean).

3. Percentages and counts generated are only estimates of the true population parameters; the number of decimal places should
reflect this and not convey an unreasonable degree of precision. Round percentages to one decimal place and population
counts to the nearest 100,000.

4. 95% confidence intervals do not have to be provided when a complete census of the study population is taken. Similarly,
parametrically computed confidence intervals are not scientifically justifiable for non-probability samples because there are
no associated sampling errors (no randomness in selection); there is thus no mathematical basis for computing standard
errors for such samples. If 95% confidence intervals are desired, they could be computed using non-parametric approaches
such as bootstrapping; the 0.025 and 0.975 quantiles of the bootstrap estimates yield the bootstrapped 95% confidence interval.
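
Tips 1 and 2 can be illustrated with a minimal sketch on made-up data. The variable names (stratum, psu, wt, smoke01, smoke100) and all values are assumptions for illustration, not NYTS variables. The block shows that the mean of a 0/100 variable is the weighted percentage, and computes the RSE from the stored estimate and standard error.

```stata
* Toy data, assumed for illustration only
clear
input stratum psu wt smoke01
1 1 10 1
1 1 10 0
1 2 12 0
1 2 12 0
2 3  8 1
2 3  8 0
2 4  9 0
2 4  9 1
end
generate smoke100 = 100*smoke01
svyset psu [pweight = wt], strata(stratum)

* tab reports two complementary rows (non-smokers and smokers)
svy: tab smoke01, ci percent

* mean reports a single row: the percentage who smoke
svy: mean smoke100
scalar pct = _b[smoke100]
scalar rse = 100*_se[smoke100]/_b[smoke100]
display "Prevalence: " %4.1f pct "%    RSE: " %4.1f rse "%"
```

With these toy weights the prevalence is 27/78, or about 34.6%, which is what both commands should report for smokers.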

STEP 6: Compute weighted population counts for all outcomes for middle and high school students separately.

For reference, the results from the MMWR are shown below. The estimates of interest are shown in orange.

/* We can use a single loop statement to generate results for all the outcomes at
once for middle and high school students simultaneously. The output from the code
below produces results for both high school and middle school students. HOWEVER,
THE RESULTS PRESENTED BELOW ARE ONLY FOR HIGH SCHOOL STUDENTS. The numbers beside
0 represent the weighted counts of non-users of the specified tobacco product.
The numbers beside 100 are the weighted counts of users of the specified tobacco
product. For example, for current electronic cigarette use (celcigt), 1723292
high school students (~1.7 million) reported current use. Note that the double
quotes around "`l'" are needed only because hsms is a string variable. If hsms
were a numeric variable, we would simply write `l' */

levelsof hsms, local(levels)

foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): tab `var', count format(%15.2g)
    }
}
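
An alternative route to the same weighted counts, offered here only as a sketch, is svy: total applied to a 0/1 indicator of use, which also returns a standard error and confidence interval for the count. The data and the variable name user below are made up for illustration. The last line applies the rounding rule from the summary-statistics tips to the e-cigarette count quoted above.

```stata
* Toy data, assumed for illustration only (user = 1 for a current user)
clear
input stratum psu wt user str2 hsms
1 1 10 1 "HS"
1 1 10 0 "MS"
1 2 12 0 "HS"
1 2 12 1 "MS"
2 3  8 1 "HS"
2 3  8 0 "HS"
2 4  9 0 "MS"
2 4  9 1 "HS"
end
svyset psu [pweight = wt], strata(stratum)

* Weighted number of high school users, with SE and 95% CI
svy, subpop(if hsms == "HS"): total user
scalar n_users = _b[user]
display "Weighted count of HS users: " n_users

* Rounding a population count to the nearest 100,000
display %12.0fc round(1723292, 100000)
```

In the toy data the weighted count of high school users is the sum of their weights (10 + 8 + 9 = 27).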

. levelsof hsms, local(levels)


`"HS"' `"MS"'

. foreach l of local levels {


2. foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 cbidis_2
> ctobany ctob2 ctobcomb{
3. svy, subpop(if hsms == "`l'"): tab `var', count format(%15.2g)
4. }
5. }
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,683


Number of PSUs = 82 Population size = 26,895,773
Subpop. no. obs = 9,983
Subpop. size = 14,763,179
Design df = 66

RECODE of
celcigt
(CELCIGT) count

0 13039887
100 1723292

Total 14763179

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,614


Number of PSUs = 82 Population size = 26,806,114
Subpop. no. obs = 9,914
Subpop. size = 14,673,519
Design df = 66

RECODE of
ccigt
(CCIGT) count

0 13549932
100 1123588

Total 14673519

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,601


Number of PSUs = 82 Population size = 26,793,703
Subpop. no. obs = 9,901
Subpop. size = 14,661,109
Design df = 66

RECODE of
ccigar
(CCIGAR) count

0 13525742
100 1135367

Total 14661109

Key: count = weighted count



(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,776


Number of PSUs = 82 Population size = 27,029,964
Subpop. no. obs = 10,076
Subpop. size = 14,897,370
Design df = 66

csmokeles
s count

0 14084680
100 812689

Total 14897370

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,602


Number of PSUs = 82 Population size = 26,798,331
Subpop. no. obs = 9,902
Subpop. size = 14,665,737
Design df = 66

RECODE of
chookah
(CHOOKAH) count

0 14162043
100 503694

Total 14665737

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,566


Number of PSUs = 82 Population size = 26,761,022
Subpop. no. obs = 9,866
Subpop. size = 14,628,428
Design df = 66

RECODE of
cpipe
(CPIPE) count

0 14501264
100 127164

Total 14628428

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,566


Number of PSUs = 82 Population size = 26,761,022
Subpop. no. obs = 9,866
Subpop. size = 14,628,428
Design df = 66

RECODE of
cbidis
(CBIDIS) count

0 14522248
100 106179

Total 14628428

Key: count = weighted count



(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,834


Number of PSUs = 82 Population size = 27,094,356
Subpop. no. obs = 10,134
Subpop. size = 14,961,761
Design df = 66

ctobany count

0 12027983
100 2933779

Total 14961761

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,834


Number of PSUs = 82 Population size = 27,094,356
Subpop. no. obs = 10,134
Subpop. size = 14,961,761
Design df = 66

ctob2 count

0 13570037
100 1391724

Total 14961761

Key: count = weighted count


(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,824


Number of PSUs = 82 Population size = 27,083,594
Subpop. no. obs = 10,124
Subpop. size = 14,950,999
Design df = 66

ctobcomb count

0 13009354
100 1941645

Total 14950999

Key: count = weighted count



STEP 7: Generate subpopulation estimates.

TIPS FOR GENERATING SUBPOPULATION ESTIMATES

1. Deleting cases from a survey dataset can be problematic because it can lead to incorrect estimation of the standard
errors. For example, if you wanted to analyze the smoking prevalence among only high school students, and you
dropped all observations for middle school students, the standard errors of the estimates could be incorrectly
estimated. In calculating subpopulation estimates, only the cases defined by the subpopulation are used in the
calculation of the point estimate; however, all cases in the dataset should be used in the calculation of the standard
errors. For this reason, you should not use drop, keep, or by when subsetting subgroups with complex survey data.
The appropriate Stata options for subgroup analyses are subpop() and over().

2. Suppression rules are used when dealing with subpopulation estimates to ensure that only precise estimates are
presented. Common suppression rules are: relative standard errors >30% or cell sample sizes < 30 persons.
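
The difference between subpop() and deleting cases can be seen in a minimal sketch on made-up data (all names and values below are assumptions for illustration). Two of the six toy PSUs contain only middle school students, so dropping those students also removes PSUs from the design information that Stata uses for the variance and the design degrees of freedom.

```stata
* Toy data, assumed for illustration only: PSUs 3 and 6 contain no HS students
clear
input stratum psu wt y str2 hsms
1 1 10 100 "HS"
1 1 10   0 "HS"
1 2 12   0 "HS"
1 2 12   0 "MS"
1 3 11 100 "MS"
1 3 11   0 "MS"
2 4  8 100 "HS"
2 4  8   0 "HS"
2 5  9   0 "HS"
2 5  9 100 "MS"
2 6  7   0 "MS"
2 6  7 100 "MS"
end
svyset psu [pweight = wt], strata(stratum)

* Recommended: restrict the estimate, keep the full design
svy, subpop(if hsms == "HS"): mean y
scalar b_sub   = _b[y]
scalar psu_sub = e(N_psu)

* Not recommended: delete the other cases first
preserve
keep if hsms == "HS"
svy: mean y
scalar b_keep   = _b[y]
scalar psu_keep = e(N_psu)
restore

display "Point estimates: " b_sub " vs " b_keep
display "PSUs used:       " psu_sub " vs " psu_keep
```

The point estimates agree, but the second run sees only four of the six PSUs, so its standard error and degrees of freedom rest on an incomplete picture of the design.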

For simplicity, we will generate stratified prevalence estimates for sex and race/ethnicity separately. For each
variable, results are analyzed for all products and school levels within the same code. The desired estimates
are shown below:

/*Generating stratified estimates by sex.

The code below runs the analyses for both middle and high school students. However, the
results shown below are only for high school students for illustrative purposes. Note
also that the double quotes around "`l'" are needed only because hsms is a string
variable. If hsms were a numeric variable, we would simply write `l' */

levelsof hsms, local(levels)

foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): mean `var', over(sex)
    }
}

. levelsof hsms, local(levels)


`"HS"' `"MS"'

. foreach l of local levels {


2. foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 cbidis_2
> ctobany ctob2 ctobcomb{
3. svy, subpop(if hsms == "`l'"): mean `var', over(sex)
4. }
5. }
(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,607


Number of PSUs = 82 Population size = 26,777,777
Subpop. no. obs = 9,907
Subpop. size = 14,645,182
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

celcigt_2
female 9.933178 1.033186 7.870355 11.996
male 13.30177 1.245029 10.81599 15.78755

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,544


Number of PSUs = 82 Population size = 26,702,219
Subpop. no. obs = 9,844
Subpop. size = 14,569,625
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

ccigt_2
female 7.570606 .7749495 6.02337 9.117843
male 7.574926 .6676701 6.241879 8.907972

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,525


Number of PSUs = 82 Population size = 26,673,970
Subpop. no. obs = 9,825
Subpop. size = 14,541,375
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

ccigar_2
female 6.290991 .7172439 4.858967 7.723014
male 8.979454 .776574 7.428974 10.52993

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,695


Number of PSUs = 82 Population size = 26,906,991
Subpop. no. obs = 9,995
Subpop. size = 14,774,396
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

csmokeless
female 3.097841 .434743 2.229849 3.965834
male 7.643461 1.064272 5.518573 9.768349

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,526


Number of PSUs = 82 Population size = 26,678,435
Subpop. no. obs = 9,826
Subpop. size = 14,545,841
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

chookah_2
female 3.280372 .3791087 2.523457 4.037287
male 3.407643 .4659882 2.477268 4.338019

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,492


Number of PSUs = 82 Population size = 26,645,141
Subpop. no. obs = 9,792
Subpop. size = 14,512,546
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

cpipe_2
female .5403658 .111872 .3170062 .7637254
male 1.049335 .1511733 .7475076 1.351162

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,492


Number of PSUs = 82 Population size = 26,645,141
Subpop. no. obs = 9,792
Subpop. size = 14,512,546
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

cbidis_2
female .6124759 .1033142 .4062025 .8187494
male .6833788 .1648692 .3542069 1.012551

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,751


Number of PSUs = 82 Population size = 26,968,422
Subpop. no. obs = 10,051
Subpop. size = 14,835,827
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

ctobany
female 17.57929 1.205825 15.17178 19.9868
male 21.39717 1.563038 18.27647 24.51788

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,751


Number of PSUs = 82 Population size = 26,968,422
Subpop. no. obs = 10,051
Subpop. size = 14,835,827
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

ctob2
female 7.702139 .7875678 6.129709 9.274569
male 10.67469 .9431419 8.791646 12.55773

(running mean on estimation sample)

Survey: Mean estimation

Number of strata = 16 Number of obs = 17,743


Number of PSUs = 82 Population size = 26,959,408
Subpop. no. obs = 10,043
Subpop. size = 14,826,814
Design df = 66

female: sex = female


male: sex = male

Linearized
Over Mean Std. Err. [95% Conf. Interval]

ctobcomb
female 12.24056 .9370825 10.36962 14.11151
male 13.4728 1.054905 11.36662 15.57899

/*Generating stratified estimates by race. The code below runs the analyses for
both middle and high school students. Results NOT shown for race for brevity. The
double quotes around "`l'" in the code below are needed only because hsms is a
string variable. If hsms were a numeric variable, we would simply write `l' */

levelsof hsms, local(levels)

foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): mean `var', over(race4)
    }
}

[RESULTS NOT SHOWN]



STEP 8: Perform statistical testing of subgroup estimates.

TIPS ON STATISTICAL TESTING OF GROUP DIFFERENCES

1. Computed 95% confidence intervals are a measure of the degree of precision of an estimate and should not be used
in lieu of a formal comparison of two estimates. Non-overlap of 95% confidence intervals always indicates that two
estimates differ statistically; however, the presence of an overlap does not preclude statistical significance. A formal
statistical test should therefore always be performed (e.g., a chi-squared test). The type of test will depend on the
variable type (e.g., categorical or continuous) and the underlying assumptions regarding distributions of the data
(parametric or non-parametric). Non-parametric tests are those that do not assume a particular form for the underlying
distribution. The table below shows some appropriate tests for bivariate testing based on variable type and
assumptions regarding underlying distributions.

SCENARIO | PARAMETRIC | NON-PARAMETRIC
Categorical with categorical, independent | Chi-square, logistic regression | Fisher’s exact test
Categorical with categorical, correlated | Conditional logistic regression, GEE |
Categorical with continuous, independent | ANOVA, Z statistic, regression (linear or logistic), t-test for independent samples | Kruskal-Wallis test (non-parametric alternative to ANOVA)
Categorical with continuous, correlated | Repeated-measures ANOVA, mixed-effects models, GEE, paired t-test | Sign test, Wilcoxon signed-rank test
Continuous with continuous | Pearson’s correlation | Spearman’s correlation
Count with categorical | Poisson regression | Mann-Whitney U test
Nested comparisons (e.g., nested multi-year estimates) | Nested Z test: $Z = \dfrac{X_1 - X_2}{\sqrt{SE_1^2 + SE_2^2 - 2P \cdot SE_1^2}}$ |

/* Chi-squared test for sex and racial differences in the use of the different
tobacco products, among middle and high school students separately (RESULTS SHOWN
FOR ONLY HIGH SCHOOL STUDENTS)*/

*sex*
levelsof hsms, local(levels)
foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): tab `var' sex, pearson
    }
}
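
When many outcomes are tested in a loop, it can help to pull the test result out of Stata's stored results rather than reading it off the screen. The sketch below uses made-up data (all names and values are assumptions for illustration) and assumes that the design-based F statistic and its p-value are stored as e(F_Pear) and e(p_Pear) after svy: tab, as described in the Stata survey documentation.

```stata
* Toy data, assumed for illustration only (sex: 1 = female, 2 = male)
clear
input stratum psu wt y sex
1 1 10 100 2
1 1 10   0 1
1 1 10   0 2
1 2 12   0 1
1 2 12 100 2
1 2 12   0 1
2 3  8 100 1
2 3  8   0 2
2 3  8 100 2
2 4  9   0 1
2 4  9   0 1
2 4  9 100 2
end
label define sexlbl 1 "female" 2 "male"
label values sex sexlbl
svyset psu [pweight = wt], strata(stratum)

* Design-based Pearson test of association between y and sex
svy: tab y sex, pearson
scalar f_stat = e(F_Pear)
scalar p_val  = e(p_Pear)
display "Design-based F = " %7.4f f_stat "    P = " %6.4f p_val
```

The same stored results should be available after each svy, subpop(): tab call in the loop above, so they could be saved or displayed inside the loop.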

(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,607


Number of PSUs = 82 Population size = 26,777,777
Subpop. no. obs = 9,907
Subpop. size = 14,645,182
Design df = 66

RECODE of
celcigt SEX
(CELCIGT) female male Total

0 .4441 .4395 .8836


100 .049 .0674 .1164

Total .4931 .5069 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 48.5521
Design-based F(1, 66) = 15.5947 P = 0.0002
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,544


Number of PSUs = 82 Population size = 26,702,219
Subpop. no. obs = 9,844
Subpop. size = 14,569,625
Design df = 66

RECODE of
ccigt SEX
(CCIGT) female male Total

0 .455 .4693 .9243


100 .0373 .0385 .0757

Total .4922 .5078 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 0.0001
Design-based F(1, 66) = 0.0000 P = 0.9957
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,525


Number of PSUs = 82 Population size = 26,673,970
Subpop. no. obs = 9,825
Subpop. size = 14,541,375
Design df = 66

RECODE of
ccigar SEX
(CCIGAR) female male Total

0 .4622 .4613 .9235


100 .031 .0455 .0765

Total .4932 .5068 1

Key: cell proportion



Pearson:
Uncorrected chi2(1) = 44.7972
Design-based F(1, 66) = 10.4239 P = 0.0019
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,695


Number of PSUs = 82 Population size = 26,906,991
Subpop. no. obs = 9,995
Subpop. size = 14,774,396
Design df = 66

csmokeles SEX
s female male Total

0 .4768 .4691 .9459


100 .0152 .0388 .0541

Total .4921 .5079 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 178.6821
Design-based F(1, 66) = 74.0356 P = 0.0000
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,526


Number of PSUs = 82 Population size = 26,678,435
Subpop. no. obs = 9,826
Subpop. size = 14,545,841
Design df = 66

RECODE of
chookah SEX
(CHOOKAH) female male Total

0 .4775 .4891 .9666


100 .0162 .0173 .0334

Total .4937 .5063 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 0.2195
Design-based F(1, 66) = 0.0733 P = 0.7874
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,492


Number of PSUs = 82 Population size = 26,645,141
Subpop. no. obs = 9,792
Subpop. size = 14,512,546
Design df = 66

RECODE of
cpipe SEX
(CPIPE) female male Total

0 .492 .5 .992
100 .0027 .0053 .008

Total .4947 .5053 1

Key: cell proportion



Pearson:
Uncorrected chi2(1) = 14.3164
Design-based F(1, 66) = 6.7252 P = 0.0117
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,492


Number of PSUs = 82 Population size = 26,645,141
Subpop. no. obs = 9,792
Subpop. size = 14,512,546
Design df = 66

RECODE of
cbidis SEX
(CBIDIS) female male Total

0 .4917 .5018 .9935


100 .003 .0035 .0065

Total .4947 .5053 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 0.3413
Design-based F(1, 66) = 0.1404 P = 0.7090
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,751


Number of PSUs = 82 Population size = 26,968,422
Subpop. no. obs = 10,051
Subpop. size = 14,835,827
Design df = 66

SEX
ctobany female male Total

0 .4045 .4003 .8048


100 .0863 .109 .1952

Total .4908 .5092 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 41.1560
Design-based F(1, 66) = 13.7333 P = 0.0004
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,751


Number of PSUs = 82 Population size = 26,968,422
Subpop. no. obs = 10,051
Subpop. size = 14,835,827
Design df = 66

SEX
ctob2 female male Total

0 .453 .4549 .9078


100 .0378 .0544 .0922

Total .4908 .5092 1

Key: cell proportion



Pearson:
Uncorrected chi2(1) = 46.8518
Design-based F(1, 66) = 16.0548 P = 0.0002
(running tabulate on estimation sample)

Number of strata = 16 Number of obs = 17,743


Number of PSUs = 82 Population size = 26,959,408
Subpop. no. obs = 10,043
Subpop. size = 14,826,814
Design df = 66

SEX
ctobcomb female male Total

0 .4308 .4405 .8713


100 .0601 .0686 .1287

Total .4909 .5091 1

Key: cell proportion

Pearson:
Uncorrected chi2(1) = 6.0052
Design-based F(1, 66) = 1.9494 P = 0.1673

*race*
levelsof hsms, local(levels)
foreach l of local levels {
    foreach var of varlist celcigt_2 ccigt_2 ccigar_2 csmokeless chookah_2 cpipe_2 ///
        cbidis_2 ctobany ctob2 ctobcomb {
        svy, subpop(if hsms == "`l'"): tab `var' race4, pearson
    }
}

[RESULTS NOT SHOWN]

STEP 9: Check! Check! Check!

• Check and re-check your code to ensure there are no bugs and all variables have been recoded
correctly.
• Check to make sure the results in your spreadsheets or tables are the same as those in your Stata
console.
• Check to see that imprecise estimates are not reported. For subgroup analyses, cells with fewer than
30 people may not provide precise estimates. Consider combining similar categories to increase cell
sample size. Relative standard error (RSE) cut-offs in the range of 30% to 50% have been used
in the scientific literature, with prevalence estimates whose RSE exceeds the cut-off considered
statistically unreliable. Such estimates should ideally be suppressed. The RSE is calculated by dividing
the standard error by the estimate (mean or percentage).
• Check to ensure proper statistical tests have been conducted. Ninety-five percent confidence intervals
are merely an eyeball test and should not be used as a definitive statistical test to compare two
prevalence estimates. The absence of an overlap ALWAYS indicates a statistically significant
difference between the two estimates being compared. However, the presence of an overlap does NOT
always preclude significance.
• Check the numbers and percentages for correctness in tables and figures, and that they correspond
with information in the text. Ensure tables and figures are able to stand alone with the appropriate
descriptive title and footnotes.
• Check to ensure the description of the methods provides sufficient information so the results could be
duplicated by someone with access to the same data and information. This includes providing within
the manuscript detailed descriptions of analytical and/or statistical approaches used with clear
definitions of variables used.
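
The precision check can be sketched with the cbidis_2 estimate for high school students from the Step 5 output. The cut-offs of 30% for the RSE and 30 for the unweighted sample size are the common rules named earlier in this guide; the scalar names are hypothetical, and the subpopulation sample size is used here as a stand-in for the cell size.

```stata
* Values copied from the cbidis_2 output for high school students
scalar est    = 0.7258421     // weighted percentage
scalar est_se = 0.1068328     // linearized standard error
scalar n_unw  = 9866          // unweighted subpopulation sample size

* RSE = standard error divided by the estimate, expressed as a percentage
scalar rse = 100*est_se/est

* Flag the estimate if either suppression rule is met
scalar suppress = (rse > 30) | (n_unw < 30)
display "RSE = " %5.1f rse "%    suppress = " suppress
```

Here the RSE is about 14.7%, so the estimate would be reported under a 30% rule.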

STEP 10: Code right, then write right!

• When reporting sample sizes, use the unweighted numbers, NOT the weighted population counts. The
unweighted numbers are the persons who actually completed the survey. For example, in the 2017
National Youth Tobacco Survey, a total of 17,872 students in middle and high school participated in the
survey, and the total weighted population count was 27.1 million. The number to be reported as the
sample size is 17,872, NOT 27.1 million.
• Report the response rate for the survey.
• It is generally not enough to report only the p-value. Much valuable information, such as the effect
size or the consistency of a finding, cannot be revealed solely by a p-value. Presenting
information on both the point estimates and the 95% confidence intervals is preferable because it
provides these estimates of magnitude of effect and consistency.
• When reporting percentages, use weighted, NOT unweighted, percentages. Otherwise, results may not
be valid because the unweighted results are from a sample whose distributions (e.g., age, sex, race)
may be very different from the target population.
• Inferences from the weighted analyses should be made to the target population rather than the
sampled population. For example, the weighted prevalence of current e-cigarette use among high school
students was 11.7% in the 2017 NYTS. Appropriate language to report this result would be “11.7%
of U.S. high school students reported current e-cigarette use”, not “11.7% of sampled high school
students who participated in the survey reported current e-cigarette use”.
• Typically, percentages are expressed to one decimal place, measures of association (e.g., odds ratios,
prevalence ratios, etc.) to two decimal places, and p-values to three decimal places.
• Do not report a p-value as 0 (e.g., 0.0000). Rather, express it as < 0.0001.
• Provide the percentage of respondents with missing data for key outcomes.
• Describe any sensitivity analyses and rationale.
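
The first point can be sketched in code: after a svy estimation command, the unweighted and weighted sizes sit side by side in the stored results. The data below are made up for illustration, and the sketch assumes that e(N_sub) holds the unweighted number of subpopulation observations and e(N_subpop) the weighted subpopulation size, matching the "Subpop. no. obs" and "Subpop. size" lines in the output shown earlier.

```stata
* Toy data, assumed for illustration only
clear
input stratum psu wt y str2 hsms
1 1 10 100 "HS"
1 1 10   0 "MS"
1 2 12   0 "HS"
1 2 12 100 "MS"
2 3  8 100 "HS"
2 3  8   0 "HS"
2 4  9   0 "MS"
2 4  9 100 "HS"
end
svyset psu [pweight = wt], strata(stratum)

svy, subpop(if hsms == "HS"): mean y
scalar n_report = e(N_sub)       // unweighted: report this as the sample size
scalar n_weight = e(N_subpop)    // weighted: the population represented
display "Unweighted sample size:    " n_report
display "Weighted population count: " n_weight
```

In the toy data, five high school students represent a weighted population of 47.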

Suggested Citation: Step-by-Step Guide To Analyses of Complex Survey Data in Stata. Available at
www.zatumcorp.com. Accessed MM/DD/YYYY.
For comments or questions, please email at info@zatumcorp.com
