
Step-by-Step Guide To Analyses of Complex Survey Data in R
This analysis guide will replicate findings (2017 data) from the MMWR titled:
“Tobacco Product Use Among Middle and High School Students — United States, 2011–2017,” available
at https://www.cdc.gov/mmwr/volumes/67/wr/mm6722a3.htm. This publication was a secondary analysis of
data from the National Youth Tobacco Survey, a nationally representative survey of U.S. students in middle
and high school. The survey uses a multi-staged sampling procedure to generate a nationally representative
sample. Analyses of these data must use the supplied weights and other variance variables in the dataset for
the results to be valid. This guide provides simple, step-by-step guidance for conducting analyses of
complex survey data in R, using this research article as a case study. The data can be downloaded at
https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html and are available for download
in three different formats: SAS (.sas7bdat), Access (.mdb), and Excel (.xlsx).

Below is the table from the report we wish to replicate:



TIPS ON COMMON R OPERATORS

!      Not (for example, !y$var %in% NA means the variable var within the dataset y is not missing)
=      Assignment operator; also used to supply named arguments to functions (e.g., na.rm = T). Not a test of equality.
==     Tests equality. Used within a subset or conditional statement.
>      Greater than
>=     Greater than or equal to
<      Less than
<=     Less than or equal to
NA     Not available (i.e., missing). NA is very "infectious" and every operation conducted with NA returns
       NA, e.g., 5 + NA = NA; 5 - NA = NA; 5/NA = NA; 5*NA = NA, etc. NA differs from NaN (not a number,
       e.g., 0/0 = NaN) and Inf (infinite, e.g., 5/0 = Inf).
&      And
|      Or
:      To (i.e., range). Commonly used with the recode function, e.g., y$var2 =
       car::recode(y$var, "1:10 = 1; 11:hi = 2"). Within a recode specification, the highest and lowest
       values are represented by the keywords hi and lo respectively.
#      Comment
c      Concatenate or combine: use this function to group variables or categories meant to receive the same
       action or operation, e.g., y$var2 = car::recode(y$var, "c(1,2,3,4,5,6,7,8,9,10) = 1; 11:hi = 2")
%in%   Within. For example, y$var %in% c(1:10) states a condition assessing whether var contains any of the
       values 1 to 10.
$      Indexing operator. This comes between a dataframe name and a variable name. R can hold
       many data frames, so this helps R know which dataframe you are referring to. For example, y$var
       means the variable "var" that is in the dataframe y.
[      Square bracket. Used as an indexing operator; can be used in place of $ above. For example, y$var is
       equivalent to y[["var"]] (the single-bracket form, y["var"], returns a one-column dataframe).
(      Round bracket. Used to pass an object (and other arguments) to a function. Everything in R is either a
       verb (a function, which does an action) or a noun (an object, which receives an action). Use round
       brackets to separate a function from its arguments, e.g., mean(y$var, na.rm = T); here the function is
       mean and the object is y$var.
{      Curly bracket. Used to group statements, e.g., the bodies of loops and functions.
%>%    Piping operator (from the magrittr/tidyverse packages)
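
To see several of these operators in action, below is a minimal sketch on a toy data frame (all object names are hypothetical; the last line assumes the car package is installed):

toy = data.frame(var = c(1, 5, 11, NA))
toy$var %in% c(1:10)        # TRUE TRUE FALSE FALSE
!toy$var %in% NA            # TRUE TRUE TRUE FALSE (i.e., not missing)
5 + NA                      # NA ("infectious" missingness)
mean(toy$var, na.rm = T)    # 5.666667; the function mean acts on the object toy$var
toy$var2 = car::recode(toy$var, "1:10 = 1; 11:hi = 2")   # yields 1, 1, 2, NA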

STEP 1: Download and save the data.

 Download the Excel file from the internet at
https://www.cdc.gov/tobacco/data_statistics/surveys/nyts/data/index.html.
 Extract the contents of the downloaded file to a folder somewhere on your computer (e.g., your
desktop).

STEP 2: Read the Excel file into R with the read_excel function.

Note that “file.choose()” which appears many times in the box below allows you to manually select where
the dataset is located on your computer. A pop-up box will appear, and you can navigate to select the file
without having to type a file path.

TIPS ON IMPORTING DATA INTO R FROM OTHER PROGRAMS

1. FROM SAS TRANSPORT FILE (.XPT):


library(foreign)
y <- read.xport(file.choose())

2. FROM EXCEL (XLSX):


library(readxl)
library(tidyverse)
y <- read_excel(file.choose())

3. FROM ACCESS (.MDB):


You can simply copy the entire table in Access and use the read.delim function in
R to import it. First, click the node at the top-left corner of the Access
datasheet to select all records, then press Ctrl+C to copy (a progress bar
appears at the bottom right). Then, run the following command in R (the
"clipboard" connection works on Windows):
y <- read.delim("clipboard")

4. FROM SAS (SAS7BDAT):


library(sas7bdat)
y <- read.sas7bdat(file.choose())

5. FROM SPSS
library(memisc)
y <- as.data.set(spss.system.file('C:\\Users\\Zatum\\file2016.sav'))

library(foreign)
y <- read.spss(file.choose(), to.data.frame = TRUE)

6. FROM STATA
library(haven)
y <- read_dta(file.choose())

library(foreign)
y <- read.dta(file.choose())

library(readstata13)
y <- read.dta13(file.choose())

7. FROM CSV
y <- read.csv(file.choose())

 After reading the data into R, check the dimensions and other properties of the data.
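
For example, to check the dimensions (assuming the data frame was read in as y, as in the tips above):

> dim(y)
[1] 17872   373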

The output above (rows by columns) tells us that there are 17,872 observations in the dataset, and 373
variables. To see all variables in the dataset, type colnames(y)

TIPS ON DATA INSPECTION IN R

Variables can be factor, numeric, character/string, or logical. R has a series of functions to determine the structure
of a variable. Take the variable finwgt.
To see its structure, type

> str(y$finwgt)
num [1:17872] 1234 1234 1234 1234 1234 ...

R also has several logical functions which test whether a variable meets a criterion.
> is.numeric(y$finwgt)
[1] TRUE

> is.factor(y$finwgt)
[1] FALSE

> is.character(y$finwgt)
[1] FALSE

STEP 3: Recode variables.



TIPS ON RECODING VARIABLES IN R


1. Install the car package for easy recode. The apply group of functions (lapply, sapply, apply,
mapply) are also very helpful and will be used frequently in this guide.

2. To compute percentages using the svymean function, you MUST recode your outcome so that cases are assigned
a value of 1 (or 100) and non-cases are assigned a value of 0. In several surveys, responses of "yes" are coded as 1,
while responses of "no" are coded as 2. You may therefore need to recode 2 to 0. Assigning cases and non-cases
values of 1 and 0 respectively produces results as proportions (e.g., 0.56). Assigning cases and non-cases values of
100 and 0 respectively produces results as percentages (e.g., 56%).

3. When recoding or collapsing outcome variables (e.g., creating a binary variable), responses of “don’t know”, “not
sure” should be excluded because of potential for misclassification; otherwise, justification should be provided for
collapsing them with another category. For outcomes measured on a Likert scale, computing the mean of the raw
responses is not recommended given truncation as well as the fact that the ensuing results have no meaningful
interpretation. The variable(s) could instead be dichotomized based on a priori determined study objectives, e.g.,
“strongly agree”/”agree” vs other responses.

4. Skip patterns: it is very important to recognize that some surveys use skip patterns, meaning that only individuals
eligible for a given question answered it, based on their responses to one or more preceding filter questions. To ensure
the accurate denominator is being assessed, both the filter question(s) and the final question may need to be
incorporated as appropriate. For example, in several adult surveys, smoking status is determined by first asking
respondents if they have smoked at least 100 cigarettes in their lifetime. Those who answer yes are then asked if they
smoke now. Those who answer no to the first question are not asked the second question at all (i.e., they skip to the
next question). In creating a variable describing current smoking among all participants with these questions, a value
of 1 will be assigned to those who answered "yes" to both questions (i.e., have smoked 100+ cigarettes AND smoke
now). A value of 0 will be assigned to those who answered yes to the first question but no to the second question (i.e.,
have smoked 100+ cigarettes but do not smoke now). A value of 0 will also be assigned to those who answered no to
the first question and were skipped from answering the second question. A sketch of this recode appears below.
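
Below is a minimal sketch of this recode, using hypothetical variable names (smk100: 1 = yes, 2 = no to "smoked at least 100 cigarettes in your lifetime"; smknow: 1 = yes, 2 = no, NA = skipped):

y$cursmoke = ifelse(y$smk100 %in% 2, 0,                          # never smoked 100+ cigarettes
             ifelse(y$smk100 %in% 1 & y$smknow %in% 1, 1,        # 100+ cigarettes AND smokes now
             ifelse(y$smk100 %in% 1 & y$smknow %in% 2, 0, NA)))  # 100+ cigarettes but not now

(%in% is used instead of == so that missing values do not propagate through the conditions.)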

5. When creating a composite variable from two or more variables (e.g., the measure of “any tobacco use” in the 2017
NYTS denoting use of at least one of seven different tobacco product types), missing values can be handled in one of
two ways. The first (and more stringent approach) would be to analyze only individuals with complete information on
all variables of interest (listwise deletion). Under this approach, even individuals with information on all but one
variable would be excluded. The second, more lenient approach would be to exclude individuals only if they
were missing information on all variables of interest. Thus, an individual with information present for only one
variable would still be included in the analyses. The first approach may lead to loss of sample size and precision; also,
the sheer magnitude of those excluded potentially increases the magnitude of selection bias if missingness was not at
random. The second approach however increases the likelihood for misclassification bias. For example, if an
individual only had information for one tobacco product (say cigars), for which he reported being a non-user, he
would be classified as a non-tobacco user, even if he used other forms of tobacco (for which data are missing).
Absence of evidence is not evidence of absence! Information should be provided on how missing data were dealt
with. Note that this MMWR used the second approach.

This article examined current use prevalence of 10 tobacco products: cigarettes, cigars, smokeless tobacco,
electronic cigarettes, hookah, pipe tobacco, bidis, any tobacco product, 2+ tobacco products, and any
combustible tobacco use. Results were stratified by school level (middle and high school), sex (male and
female) and race (white, black, Hispanic, other race).

OUTCOME VARIABLES ASSESSED IN THE 2017 NYTS MMWR


Tobacco type | Variable name | Status in downloaded dataset | Categories in dataset | Categories desired for analyses
Cigarettes | ccigt | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Cigars | ccigar | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Chewing tobacco, snuff, or dip | cslt | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Snus | csnus | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Dissolvable tobacco products | cdissolv | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Electronic cigarettes | celcigt | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Hookah | chookah | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Pipe tobacco | cpipe | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Bidis | cbidis | Present | 1 = yes; 2 = no | 100 = yes; 0 = no
Any tobacco product 1 | ctobany | Not present; to be derived | - | 100 = yes; 0 = no
2+ tobacco products 2 | ctob2 | Not present; to be derived | - | 100 = yes; 0 = no
Any smokeless tobacco use 3 | csmokeless | Not present; to be derived | - | 100 = yes; 0 = no
Any combustible tobacco use 4 | ctobcomb | Not present; to be derived | - | 100 = yes; 0 = no
1 Any tobacco use: any of e-cigarettes, cigarettes, cigars, smokeless tobacco, hookah, pipe tobacco, bidis
2 Use of 2+ tobacco products: two or more of any of the following: e-cigarettes, cigarettes, cigars, smokeless tobacco,
hookah, pipe tobacco, bidis
3 Any smokeless tobacco use: chewing tobacco, snuff, dip, snus, or dissolvable tobacco products
4 Any combustible tobacco product use: cigarettes, cigars, hookah, pipe tobacco, or bidis

#Tabulate tobacco use variables in dataset

lapply(y[,c("CCIGT", "CCIGAR", "CSLT", "CELCIGT", "CHOOKAH", "CPIPE", "CSNUS", "CDISSOLV"


, "CBIDIS")], function(x) table(x))
$CCIGT
x
1 2
973 16461

$CCIGAR
x
1 2
977 16439

$CSLT
x
1 2
510 16795

$CELCIGT
x
1 2
1360 16210

$CHOOKAH
x
1 2
508 16880

$CPIPE
x
1 2
137 17191

$CSNUS
x
1 2
249 17079

$CDISSOLV
x
1 2
105 17223

$CBIDIS
x
1 2
105 17223

#What does lapply do? Here, it takes the subset of columns of y that are inside
#the c() function, and to each of these variables, it applies the table function.

DEMOGRAPHIC VARIABLES ASSESSED IN THE 2017 NYTS MMWR


Demographic variable | Variable name | Categories in dataset | Categories desired for analyses
Race/ethnicity | race_m | 1 = White, non-Hispanic; 2 = Black, non-Hispanic; 3 = Hispanic; 4 = Asian, non-Hispanic; 5 = AI/AN, non-Hispanic; 6 = NHOPI, non-Hispanic; 7 = Multi-race, non-Hispanic | White; Black; Hispanic; Other (categories 4-7)
School level | hsms | MS = Middle School; HS = High School | Middle; High
Sex | sex | 1 = Female; 2 = Male | Female; Male

#Tabulate demographic variables in dataset

lapply(y[,c("RACE_M", "SEX", "hsms")], function(x) table(x))


$RACE_M
x
1 2 3 4 5 6 7
7532 2983 4614 728 230 96 901

$SEX
x
1 2
8815 8881

$hsms
x
HS MS
10172 7700

COMPLEX SURVEY DESIGN VARIABLES ASSESSED IN THE 2017 NYTS MMWR

Weight variable (finwgt). Values: range 30.7 to 6505.1; mean = 1518.3. Rationale: the weight variable
accounts for differential probabilities of selection and non-response. Incorporating the weight variable
in the analyses ensures that the means or percentages are estimated correctly.

Stratum (stratum). Values: there are 16 strata in NYTS; the number of PSUs per stratum ranged from 2 to
12. Rationale: stratification was used to increase precision and allow sufficient sample sizes for racial
minorities. In NYTS, there are 16 strata (2*2*4) based on the following criteria: (1) two strata of
urban/rural location (urban vs. nonurban); (2) two strata of racial/ethnic minority enrollment
(predominantly Hispanic vs. predominantly non-Hispanic Black); and (3) four density distribution
groupings (substrata) for each of: non-Hispanic Black urban, non-Hispanic Black nonurban, Hispanic urban,
and Hispanic nonurban PSUs. The stratum and PSU variables together are used to correctly estimate
variance; these two variables ensure that the 95% confidence intervals are estimated correctly.

Primary sampling unit (psu). Values: there are 82 PSUs in NYTS across all strata; within each stratum,
PSUs were selected probabilistically. Rationale: NYTS used cluster sampling (a three-stage sampling
process) because this approach is logistically advantageous and efficient. The PSUs in NYTS were counties
(entire counties, mergers of smaller counties, or parts of large counties). After selection of the PSUs,
schools were selected at random within each selected PSU, and classes were selected randomly from each
school. While cluster sampling is efficient, its downside is that observations tend to be highly
correlated within clusters. Accounting for the PSU variable helps to adjust for this intra-cluster
correlation.

#Describe complex survey design variables in dataset

summary(y$finwgt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.73 630.24 1219.92 1518.33 1951.53 6505.08

table(y$stratum)

BR1 BR2 BR3 BR4 BU1 BU2 BU3 BU4 HR1 HR2 HR3 HR4 HU1 HU2 HU3 HU4
1865 968 1087 1021 1205 930 1318 711 2493 249 707 323 1546 1654 697 1098

> lapply(split(y, y$stratum), function(x) table(x$psu))


$BR1

174258 301900 302266 373427 515683 515792 559816 772237


318 188 309 463 68 208 137 174

$BR2

142976 344106 374076 644424


252 265 293 158

$BR3

14595 173094 188113 729632


354 160 288 285

$BR4

186117 314736 400362 401050


341 370 229 81

$BU1

258883 343581 387897 418326 516100 559447 602173


140 213 203 193 172 155 129

$BU2

171594 259591 515682 558771


116 274 257 283

$BU3

188307 374456 487115 530582


421 336 182 379

$BU4

188207 344317 350305 602417 602425


320 60 75 169 87

$HR1

115262 129751 245033 259810 274380 373245 516335 585762 600815 602062 672182 758663
278 170 148 167 189 164 189 72 531 206 205 174

$HR2

686767 692818
32 217

$HR3

86452 501078 757723



296 240 171

$HR4

686434 690913
138 185

$HU1

129060 487123 515990 586736 674270 701295 730705 758362


256 37 364 178 174 168 184 185

$HU2

86972 87144 174186 243848 343702 692424


268 188 249 134 394 421

$HU3

58385 87858 88152 689716 693010


165 93 129 276 34

$HU4

86889 87174 87568 173711


240 314 318 226

#Dichotomizing tobacco use variables as 0, 100


#We shall recode all the individual tobacco products from 1 = yes, 2 = no to 100 = yes,
0 = no. Recoding as 0/100 allows us to compute percentages directly (instead of the
proportions which a 0/1 recode would have produced). Naming convention for the newly
created variables: the old variable name plus "_2" at the end (e.g., ccigt becomes
ccigt_2). As a precautionary measure, NEVER alter or delete your original variables, as
you might need them again!

#Recoding different tobacco products

tobrec = paste(colnames(y[,c("CCIGT", "CCIGAR", "CSLT", "CELCIGT", "CHOOKAH", "CPIPE",
                             "CSNUS", "CDISSOLV", "CBIDIS")]), "_2", sep = "")

y[tobrec] = lapply(y[,c("CCIGT", "CCIGAR", "CSLT", "CELCIGT", "CHOOKAH", "CPIPE", "CSNUS",
                        "CDISSOLV", "CBIDIS")], function(x) ifelse(x == 2, 0, ifelse(x == 1, 100, NA)))

lapply(y[,colnames(y) %in% tobrec], function(x) table(x))


$CCIGT_2
x
0 100
16461 973

$CCIGAR_2
x
0 100
16439 977

$CSLT_2
x
0 100
16795 510

$CELCIGT_2
x
0 100
16210 1360

$CHOOKAH_2
x
0 100
16880 508

$CPIPE_2
x
0 100
17191 137

$CSNUS_2
x
0 100
17079 249

$CDISSOLV_2
x
0 100
17223 105

$CBIDIS_2
x
0 100
17223 105

#create composite variable for any smokeless tobacco (i.e., dissolvable tobacco products,
snus, or chewing tobacco/snuff/dip)

y$csmokeless = apply(y[,c("CSLT", "CSNUS", "CDISSOLV")], 1, function(x)
  ifelse(all(is.na(x)), NA, ifelse(any(x == 1, na.rm = T), 100, 0)))

table(y$csmokeless)

0 100
17000 706

#Note that the apply function is different from lapply. The "1" supplied as the second
argument (just after the data subset) tells R to perform operations on rows ("2" is for
column operations). In plain language, the code above says: "For each individual row,
examine the three variables provided ("CSLT", "CSNUS", "CDISSOLV"); if an individual has
missing information on all three variables, assign them a value of NA. Otherwise, if they
have a value of 1 on any of the three variables, assign them a value of 100. For everyone
who meets neither of the two criteria just mentioned, assign them a value of 0."
# create composite variable for any tobacco use
y$ctobany = apply(y[,c("CCIGT", "CCIGAR", "CSLT", "CELCIGT", "CHOOKAH", "CPIPE", "CSNUS",
                       "CDISSOLV", "CBIDIS")], 1, function(x)
  ifelse(all(is.na(x)), NA, ifelse(any(x == 1, na.rm = T), 100, 0)))

table(y$ctobany)

0 100
15312 2501

# create composite variable for any combustible tobacco use

y$ctobcomb = apply(y[,c("CCIGT", "CCIGAR", "CHOOKAH", "CPIPE", "CBIDIS")], 1, function(x)
  ifelse(all(is.na(x)), NA, ifelse(any(x == 1, na.rm = T), 100, 0)))

table(y$ctobcomb)

0 100
16084 1715

# create composite variable for use of 2+ tobacco products.

y$ctob2 = apply(y[,c("CCIGT_2", "CCIGAR_2", "CELCIGT_2", "CHOOKAH_2", "CPIPE_2", "CBIDIS_2",
                     "csmokeless")], 1, function(x)
  ifelse(all(is.na(x)), NA, ifelse(sum(x > 0, na.rm = T) >= 2, 100, 0)))

table(y$ctob2)

0 100
16650 1163
#Assign value labels to sex
y$sex = factor(y$SEX,
levels=c(1:2),
labels =c("female", "male"))

table(y$sex)

female male
8815 8881

#Recode race and assign value labels


#To recode, first install and load the car package. DO NOT RECODE VARIABLES AS
CHARACTER IF THEY WILL BE USED AS STRATIFICATION VARIABLES WITHIN SURVEY
ANALYSES. The svyby function (discussed later) does not tolerate character
variables very well. Recode instead as numeric or factor. To illustrate the
differences, race is recoded below in the three formats (character, numeric, and
factor).
#character
library(car)
y$race4.c = recode(y$RACE_M, "1 = '1White'; 2 = '2Black'; 3 = '3Hispanic'; 4:7 = '4Other'
; else = NA")

table(y$race4.c)

1White 2Black 3Hispanic 4Other


7532 2983 4614 1955

# R arranges the levels of a character variable in alphabetical order, in which


case ‘White’ would come last. To trick R into preserving the order we want, we
inserted numbers in front of the value labels so that R maintains ascending
order.

#numeric
y$race4.n = recode(y$RACE_M, "1 = 1; 2 = 2; 3 = 3; 4:7 = 4; else = NA")

table(y$race4.n)

1 2 3 4
7532 2983 4614 1955

#factor
y$race4.f = factor(y$race4.n,
                   levels = c(1:4),
                   labels = c("White", "Black", "Hispanic", "Other"))

> table(y$race4.f)

White Black Hispanic Other


7532 2983 4614 1955

#While race4.f (factor variable) and race4.c (character variable) look alike,
they aren't the same. Notice how "White" comes first in race4.f without having to
trick R by assigning numbers in front. This is because behind each string label
is a real number, which is the basis for ordering.

STEP 4: Install the survey package in R. Set data to survey mode using the weight, PSU, and

stratum variables.
library(survey)
s = svydesign(data = y, id = ~ y$psu, strata = ~ y$stratum, weights = ~ y$finwgt, nest=T)
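
To verify the design, you can print a summary of the object (output not shown here):

summary(s)   # displays the sampling design and the distribution of sampling probabilities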

TIPS ON SETTING DATA TO SURVEY MODE

1. Some surveys (e.g., certain telephone-based surveys that use random-digit dialing) may not involve a multi-stage
selection process and hence PSUs may not be available. In such cases, setting the data to survey mode will require
using only the weight and stratum variables:

s = svydesign(data = y, id = ~ 1, strata = ~ y$stratum, weights = ~ y$finwgt, nest=T)

2. Similarly, some surveys may involve a multi-stage selection process but may not involve stratification. In such
cases, the PSU variable will be present, but the stratum variable will be missing. In such cases, setting the data to
survey mode will require using only the weight and PSU variables:

s = svydesign(data = y, id = ~ y$psu, strata = NULL, weights = ~ y$finwgt, nest=T)

3. The weight variable is used to accurately estimate means and percentages. The PSU and strata variables are used
to accurately estimate measures of variance (e.g., 95% confidence intervals).

4. Since each survey year is individually weighted to represent the population for that year, when appending data
from multiple years to increase sample size for a point estimate, the weights have to be adjusted by dividing by the
number of years pooled (see the sketch after these tips). The newly adjusted weight variable would then be used to
set the data to survey mode.

5. For the weight, stratum, or PSU variables, it is either all or none (all individuals have the variable, or no individual
does). There cannot be missing values for any of these variables. When appending data from multiple years, inspect
carefully to ensure that information is complete for all observations in the dataset. Otherwise, results may be
erroneous and invalid.

6. Certain analytic techniques (e.g., inverse probability weighting) create weights that are different from the
survey weights that came with the dataset. Since only one set of weights can be used in analyses, compute new
weights by multiplying the survey weights within the dataset with the weights generated from the statistical
procedure. These newly created weights can then be used to set the data to survey mode.

7. Since the dataframe (in this case, y) and the created survey object (in this case, s) are distinct from each other,
making changes in the dataframe (e.g, creating new variables) requires creating a new survey object using the
svydesign function above. Note that all complex survey analytical procedures are based off the survey object, not
the parent dataframe.
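
As an illustration of tip 4, below is a minimal sketch assuming two survey years have already been appended into a single data frame named y2 (a hypothetical name):

y2$finwgt_adj = y2$finwgt / 2   # divide the original weights by the number of years pooled (here, 2)
s2 = svydesign(data = y2, id = ~ y2$psu, strata = ~ y2$stratum, weights = ~ y2$finwgt_adj, nest = T)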

STEP 5: Compute overall prevalence estimates for all the outcomes for middle and high school

students separately. For reference, the results from the MMWR are shown below. The estimates of interest are
shown in red.

#For clarity, a simpler (but slightly longer) code is used below. Means and
confidence intervals are computed separately and then put together using the Map
function. (The R survey package does not automatically compute the 95% confidence
intervals; you have to generate them yourself.) Following the stratification
approach in the MMWR report, the analyses are subset to high school and middle
school students separately using the subset function (RESULTS SHOWN ONLY FOR HIGH
SCHOOL STUDENTS).

means = list() #create an empty list to store the percentages


varlist = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
"CBIDIS_2", "ctobany", "ctob2", "ctobcomb")

for (var in varlist) {
  means[[var]] = svymean(as.formula(paste0("~", var[[1]])), subset(s, hsms %in% "HS"),
                         na.rm = T)
}

conf = list() #create an empty list to store the confidence intervals


varlist = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
"CBIDIS_2", "ctobany", "ctob2", "ctobcomb")

for (var in varlist) {
  conf[[var]] = confint(svymean(as.formula(paste0("~", var[[1]])), subset(s, hsms %in% "HS"),
                                na.rm = T))
}

Map(cbind, means, conf)


$CELCIGT_2
2.5 % 97.5 %
CELCIGT_2 11.6729 9.576734 13.76907

$CCIGT_2
2.5 % 97.5 %
CCIGT_2 7.657248 6.463722 8.850773

$CCIGAR_2
2.5 % 97.5 %
CCIGAR_2 7.744072 6.525348 8.962796

$csmokeless
2.5 % 97.5 %
csmokeless 5.455253 4.045135 6.865371

$CHOOKAH_2
2.5 % 97.5 %
CHOOKAH_2 3.434495 2.730435 4.138555

$CPIPE_2
2.5 % 97.5 %
CPIPE_2 0.8692909 0.6751347 1.063447

$CBIDIS_2
2.5 % 97.5 %
CBIDIS_2 0.7258421 0.5164538 0.9352305

$ctobany
2.5 % 97.5 %
ctobany 19.60851 17.05665 22.16037

$ctob2
2.5 % 97.5 %
ctob2 9.301875 7.757809 10.84594

$ctobcomb
2.5 % 97.5 %
ctobcomb 12.98673 11.21423 14.75922

#To get results for middle school students, just run the exact same command
above, changing only the category within the subset function. Currently, it is
subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")

TIPS ON SUMMARY STATISTICS WITH 95% CONFIDENCE INTERVALS

1. Use the svymean function to generate overall estimates. Use the svyby function to generate stratified estimates.
Use the svytable function to generate weighted population counts. Use the svychisq function to compare
estimates.

2. Using regular functions that do not account for the complex survey design (e.g., mean instead of svymean) may
yield invalid results

3. Percentages and counts generated are only estimates of the true population parameter; number of decimal places
should reflect this and not display an unreasonable degree of precision. Round percentages to the nearest 1 decimal
place, population counts to the nearest 100,000.

4. 95% confidence intervals do not have to be provided when a complete census of the study population is taken.
Similarly, parametrically-computed confidence intervals are not scientifically justifiable for non-probability samples
because there are no associated sampling errors (no randomness in selection); there is thus no mathematical basis for
computing standard errors for such samples. 95% confidence intervals in those cases can be computed using
bootstrapping; the quantiles at 0.025 and 0.975 yield the bootstrapped 95% confidence interval (see the sketch below).
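
For tip 4, below is a minimal sketch of a bootstrapped 95% confidence interval for a mean, assuming x is a numeric vector from a non-probability sample (a hypothetical object):

set.seed(123)   # for reproducibility
boots = replicate(2000, mean(sample(x, replace = T), na.rm = T))
quantile(boots, c(0.025, 0.975))   # the 0.025 and 0.975 quantiles give the bootstrapped 95% CI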

STEP 6: Compute weighted population counts for all outcomes for middle and high school

students separately. For reference, the results from the MMWR are shown below. The estimates of interest are
shown in orange.

# We will use the svytable function to compute the weighted population counts for
high school and middle school students separately. (RESULTS PRESENTED BELOW ARE
ONLY FOR HIGH SCHOOL STUDENTS). The numbers under the 0 column represent the
weighted counts of non-users of the specified tobacco product. The numbers under
the 100 column are the weighted counts of users of the specified tobacco product.
For example, for current electronic cigarette use (CELCIGT_2), 1,723,292 high
school students (~1.7 million) reported current use.

wcount = list() #create an empty list to store the weighted counts

varlist = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
            "CBIDIS_2", "ctobany", "ctob2", "ctobcomb")

for (var in varlist) {
  wcount[[var]] = svytable(as.formula(paste0("~", var[[1]])), subset(s, hsms %in% "HS"))
}

wcount
$CELCIGT_2
CELCIGT_2
0 100
13039887 1723292

$CCIGT_2
CCIGT_2
0 100
13549932 1123588

$CCIGAR_2
CCIGAR_2
0 100
13525742 1135367

$csmokeless
csmokeless
0 100
14084680.4 812689.2

$CHOOKAH_2
CHOOKAH_2
0 100
14162043 503694

$CPIPE_2
CPIPE_2
0 100
14501264.1 127163.6

$CBIDIS_2
CBIDIS_2
0 100
14522248.4 106179.3

$ctobany
ctobany
0 100
12027983 2933779

$ctob2
ctob2
0 100
13570037 1391724

$ctobcomb
ctobcomb
0 100
13009354 1941645

#To get results for middle school students, just run the exact same command
above, changing only the category within the subset function. Currently, it is
subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")

STEP 7: Generate subpopulation estimates.

TIPS FOR GENERATING SUBPOPULATION ESTIMATES

1. Deleting cases from a survey data set can be problematic since it can lead to wrong estimation of the standard
errors. For example, if you wanted to analyze the smoking prevalence among only high school students, and you
dropped all observations for middle school students, this would be inappropriate because the standard errors of the
estimates would be incorrectly estimated. In calculating subpopulation estimates, only the cases defined by the
subpopulation are to be used in the calculation of the estimate, however all cases in the dataset should be used in the
calculation of the standard errors.

2. The svyby function can be used for subgroup analyses. The stratification variable should be stored in numeric
or factor format (don't recode the variables as character strings!).

3. Suppression rules are used when dealing with subpopulation estimates to ensure that only precise estimates are
presented. Common suppression rules are: relative standard errors >30% or cell sample sizes < 30 persons.
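
Tip 1 is the reason this guide always subsets the survey object s rather than the data frame y. A minimal sketch of the contrast:

# Correct: subset the survey design object; the full design still informs the standard errors
svymean(~CELCIGT_2, subset(s, hsms %in% "HS"), na.rm = T)

# Problematic: dropping rows first and then rebuilding the design may misestimate standard errors
# yhs = y[y$hsms %in% "HS", ]
# shs = svydesign(data = yhs, id = ~ yhs$psu, strata = ~ yhs$stratum, weights = ~ yhs$finwgt, nest = T)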

For simplicity, we will generate stratified prevalence estimates for sex and race/ethnicity separately. For each
variable, results are analyzed for all products and school levels simultaneously. The desired estimates are
shown below:

# We will use the svyby function to compute sex and race-stratified prevalence
estimates among high school and middle school students separately. (RESULTS
PRESENTED BELOW ARE ONLY FOR HIGH SCHOOL STUDENTS). We are introducing the
function round here (estimates rounded to 1 decimal place).

> ###########stratified estimates by sex among high school students


#generate means

means.hs = list()
varlist = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
"CBIDIS_2", "ctobany", "ctob2", "ctobcomb")

for (var in varlist) {
  means.hs[[var]] = round(svyby(as.formula(paste0("~", var[[1]])), ~as.numeric(sex),
                                subset(s, hsms %in% "HS"), svymean, na.rm = T), 1)
}

#generate confidence intervals


conf.hs = list()
varlist = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
"CBIDIS_2", "ctobany", "ctob2", "ctobcomb")

for (var in varlist) {
  conf.hs[[var]] = round(confint(svyby(as.formula(paste0("~", var[[1]])), ~as.numeric(sex),
                                       subset(s, hsms %in% "HS"), svymean, na.rm = T)), 1)
}

Map(cbind, means.hs, conf.hs)


$CELCIGT_2
as.numeric(sex) CELCIGT_2 se 2.5 % 97.5 %
1 1 9.9 1.0 7.9 12.0
2 2 13.3 1.2 10.9 15.7

$CCIGT_2
as.numeric(sex) CCIGT_2 se 2.5 % 97.5 %
1 1 7.6 0.8 6.1 9.1
2 2 7.6 0.7 6.3 8.9

$CCIGAR_2
as.numeric(sex) CCIGAR_2 se 2.5 % 97.5 %
1 1 6.3 0.7 4.9 7.7
2 2 9.0 0.8 7.5 10.5

$csmokeless
as.numeric(sex) csmokeless se 2.5 % 97.5 %
1 1 3.1 0.4 2.2 3.9
2 2 7.6 1.1 5.6 9.7

$CHOOKAH_2
as.numeric(sex) CHOOKAH_2 se 2.5 % 97.5 %
1 1 3.3 0.4 2.5 4.0
2 2 3.4 0.5 2.5 4.3

$CPIPE_2
as.numeric(sex) CPIPE_2 se 2.5 % 97.5 %
1 1 0.5 0.1 0.3 0.8
2 2 1.0 0.2 0.8 1.3

$CBIDIS_2
as.numeric(sex) CBIDIS_2 se 2.5 % 97.5 %
1 1 0.6 0.1 0.4 0.8
2 2 0.7 0.2 0.4 1.0

$ctobany
as.numeric(sex) ctobany se 2.5 % 97.5 %
1 1 17.6 1.2 15.2 19.9
2 2 21.4 1.6 18.3 24.5

$ctob2
as.numeric(sex) ctob2 se 2.5 % 97.5 %
1 1 7.7 0.8 6.2 9.2
2 2 10.7 0.9 8.8 12.5

$ctobcomb
as.numeric(sex) ctobcomb se 2.5 % 97.5 %
1 1 12.2 0.9 10.4 14.1
2 2 13.5 1.1 11.4 15.5

#To get results for middle school students, just run the exact same command
above, changing only the category within the subset function. Currently, it is
subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")

[RESULTS NOT SHOWN]

#Generating stratified estimates by race

#To stratify results by race, just run the exact same command above, changing
only the second argument from sex to race. Currently, it is ~as.numeric(sex).
Change it to ~as.numeric(race4.f)

[RESULTS NOT SHOWN]

STEP 8: Perform statistical testing of subgroup estimates.



TIPS ON STATISTICAL TESTING OF GROUP DIFFERENCES

1. Computed 95% confidence intervals are a measure of the degree of precision of an estimate and should not be used
in lieu of a formal comparison of two estimates. Non-overlap of 95% confidence intervals always indicates that two
estimates differ statistically; however, the presence of an overlap does not preclude statistical significance. A formal
statistical test should therefore always be performed (e.g., a chi-squared test). The type of test will depend on the
variable type (e.g., categorical or continuous) and the underlying assumptions regarding distributions of the data
(parametric or non-parametric). Non-parametric tests are those that make no assumptions regarding parameters or
distributions. The table below shows some appropriate tests for bivariate testing based on variable type and
assumptions regarding underlying distributions.

SCENARIO | PARAMETRIC | NON-PARAMETRIC
Categorical with categorical, independent | Chi-square, logistic regression | Fisher's exact test
Categorical with categorical, correlated | Conditional logistic regression, GEE |
Categorical with continuous, independent | ANOVA, Z statistic, regression (linear or logistic), t-test for independent samples | Kruskal-Wallis test (non-parametric alternative to ANOVA)
Categorical with continuous, correlated | Repeated-measures ANOVA, mixed-effects models, GEE, paired t-test | Sign test; Wilcoxon signed-rank test
Continuous with continuous | Pearson's correlation | Spearman's correlation
Count with categorical | Poisson regression | Mann-Whitney U test
Nested comparisons (e.g., nested multi-year estimates) | Nested Z test (formula below) |

Z = (X1 - X2) / sqrt(SE1^2 + SE2^2 - 2*P*SE1*SE2)

where X1 and X2 are the two estimates being compared, SE1 and SE2 are their standard errors, and P is the
correlation between the two estimates.
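
A minimal sketch of the nested Z test in R, with hypothetical inputs:

x1 = 19.6; se1 = 1.30   # estimate and standard error for group/year 1 (hypothetical)
x2 = 17.0; se2 = 1.20   # estimate and standard error for group/year 2 (hypothetical)
P  = 0.5                # assumed correlation between the two estimates (hypothetical)
z  = (x1 - x2) / sqrt(se1^2 + se2^2 - 2 * P * se1 * se2)
2 * pnorm(-abs(z))      # two-sided p-value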

# Use the svychisq function to compare estimates. Chi-squared tests for gender and
racial differences in the use of the different tobacco products, among middle and
high school students separately (RESULTS SHOWN FOR ONLY HIGH SCHOOL STUDENTS).

chi.hs = list()
outcomes = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
             "CBIDIS_2", "ctobany", "ctob2", "ctobcomb")
predictors = c("sex", "race4.f")

for (p in predictors) {
  chi.hs[[p]] = lapply(outcomes, function(x) svychisq(as.formula(paste("~", x, "+", p[[1]])),
                                                      subset(s, hsms %in% "HS")))
}

chi.hs
$sex
$sex[[1]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 15.595, ndf = 1, ddf = 53, p-value = 0.000233

$sex[[2]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 2.9471e-05, ndf = 1, ddf = 53, p-value = 0.9957

$sex[[3]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 10.424, ndf = 1, ddf = 53, p-value = 0.002137

$sex[[4]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 74.036, ndf = 1, ddf = 53, p-value = 1.227e-11

$sex[[5]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 0.073306, ndf = 1, ddf = 53, p-value = 0.7876

$sex[[6]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 6.7252, ndf = 1, ddf = 53, p-value = 0.01226

$sex[[7]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 0.14044, ndf = 1, ddf = 53, p-value = 0.7093

$sex[[8]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 13.733, ndf = 1, ddf = 53, p-value = 0.0005044

$sex[[9]]

Pearson's X^2: Rao & Scott adjustment



data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 16.055, ndf = 1, ddf = 53, p-value = 0.0001933

$sex[[10]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 1.9494, ndf = 1, ddf = 53, p-value = 0.1685

$race4.f
$race4.f[[1]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 10.379, ndf = 2.1937, ddf = 116.2700, p-value = 4.021e-05

$race4.f[[2]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 8.9039, ndf = 2.6127, ddf = 138.4700, p-value = 4.961e-05

$race4.f[[3]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 2.0323, ndf = 2.691, ddf = 142.620, p-value = 0.1187

$race4.f[[4]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 13.346, ndf = 2.8453, ddf = 150.8000, p-value = 1.614e-07

$race4.f[[5]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 4.7787, ndf = 2.4707, ddf = 130.9500, p-value = 0.005959

$race4.f[[6]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 3.7582, ndf = 2.3621, ddf = 125.1900, p-value = 0.01993

$race4.f[[7]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 2.9015, ndf = 2.4564, ddf = 130.1900, p-value = 0.04763

$race4.f[[8]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 5.8917, ndf = 2.2729, ddf = 120.4700, p-value = 0.002429

$race4.f[[9]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 8.7209, ndf = 2.5662, ddf = 136.0100, p-value = 6.859e-05

$race4.f[[10]]

Pearson's X^2: Rao & Scott adjustment

data: svychisq(as.formula(paste("~", x, "+", p[[1]])), subset(s, hsms %in% "HS"))


F = 2.323, ndf = 2.474, ddf = 131.120, p-value = 0.08983

#To get results for middle school students, just run the exact same command
above, changing only the category within the subset function. Currently, it is
subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")

STEP 9: Check! Check! Check!

 Check and re-check your code to ensure there are no bugs and all variables have been recoded
correctly.
 Check to make sure the results in your spreadsheets or tables are the same as those in your R
console.
 Check to see that imprecise estimates are not reported. For subgroup analyses, cells with fewer than
30 people may not provide precise estimates. Consider combining similar categories to increase cell
sample size. Relative Standard Error (RSE) cut-offs in the range of 30% to 50% have been used in the
scientific literature, with estimates whose RSEs exceed the chosen cut-off considered statistically
unreliable; such estimates should ideally be suppressed. RSEs are calculated by dividing the standard
error by the estimate (mean or percentage), usually expressed as a percentage (see the sketch at the
end of this step).
 Check to ensure proper statistical tests have been conducted. Ninety-five percent confidence intervals
are merely an eyeball test and should not be used as a definitive statistical test to compare two
prevalence estimates. The absence of an overlap ALWAYS indicates a statistically significant
difference between the two estimates being compared. However, the presence of an overlap does NOT
always preclude significance.
 Check the numbers and percentages for correctness in tables and figures, and that they correspond
with information in the text. Ensure tables and figures are able to stand alone with the appropriate
descriptive title and footnotes.
 Check to ensure the description of the methods provides sufficient information so the results could be
duplicated by someone with access to the same data and information. This includes providing within
the manuscript detailed descriptions of the analytical and/or statistical approaches used, with clear
definitions of the variables used.
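
As referenced in the precision check above, below is a minimal sketch of an RSE calculation using this guide's survey object s (the SE and coef extractors come with the survey package):

m = svymean(~CELCIGT_2, subset(s, hsms %in% "HS"), na.rm = T)
rse = 100 * SE(m) / coef(m)   # relative standard error, in percent
rse > 30                      # TRUE would flag the estimate as potentially unreliable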

STEP 10: Code right, then write right!

 When reporting sample sizes, use the unweighted numbers, NOT the weighted population counts. The
unweighted numbers are the persons who actually completed the survey. For example, in the 2017
National Youth Tobacco Survey, a total of 17,872 middle and high school students participated in the
survey, and the total weighted population count was 27.1 million. The number to report as the sample
size is 17,872, NOT 27.1 million.
 Report the response rate for the survey.
 It is generally not enough to report only the p-value. A p-value alone cannot convey valuable
information such as the effect size or the consistency of a finding. Presenting both the point
estimates and the 95% confidence intervals is preferable because it conveys both the magnitude of
the effect and the consistency of the finding.
 When reporting percentages, use weighted NOT unweighted percentages. Otherwise, results may not
be valid because the unweighted results are from a sample whose distributions (e.g., age, sex, race)
may be very different from the target population.
 Inferences from the weighted analyses should be made to the target population rather than the
sampled population. For example, weighted prevalence of current e-cigarette use among high school
students was 11.7% from the 2017 NYTS. Appropriate language to report this result would be “11.7%
of U.S. high school students reported current e-cigarette use”, not “11.7% of sampled high school
students who participated in the survey reported current e-cigarette use”.
 Typically, percentages are expressed to one decimal place, measures of association (e.g., odds ratios,
prevalence ratios) to two decimal places, and p-values to three decimal places (see the sketch after
these bullets).
 Do not report a p-value as 0 (e.g., 0.0000). Rather, express it as < 0.0001.
 Provide the percentage of respondents with missing data for key outcomes.
 Describe any sensitivity analyses and rationale.
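
A minimal sketch of these reporting conventions in R (input values hypothetical):

sprintf("%.1f", 11.6729)              # percentage to one decimal place: "11.7"
sprintf("%.2f", 1.3456)               # measure of association to two decimal places: "1.35"
format.pval(1.933e-05, eps = 1e-04)   # very small p-value prints as "<1e-04", i.e., < 0.0001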

Suggested Citation: Step-by-Step Guide To Analyses of Complex Survey Data in R. Available at
www.zatumcorp.com. Accessed MM/DD/YYYY.

For comments or questions, please email info@zatumcorp.com.
