Data Analyses R Manual NYTS
www.zatumcorp.com
! Not (negation). For example, !y$var %in% NA means the variable var within the dataset y is not missing
= Used for assignment. Also denotes mathematical equality.
== Tests for equality; used within a subset or conditional statement.
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
NA Not available (i.e., missing). NA is very “infectious” and every operation conducted with NA returns
NA, e.g., 5 + NA = NA; 5-NA = NA; 5/NA = NA; 5*NA = NA, etc. NA differs from NaN (not a number,
e.g., 0/0 = NaN) or Inf (infinite, e.g., 5/0 = Inf).
& And
| Or
: To (i.e., range). Commonly used with the recode function, e.g., y$var2 =
car::recode(y$var, "1:10 = 1; 11:hi = 2"). Within a recode specification, the keywords hi and
lo represent the highest and lowest values of the variable, respectively
# Comment
c Concatenate or combine: use this function to group variables or categories meant to receive the same
action or operation. E.g., y$var2 = car::recode(y$var, "c(1,2,3,4,5,6,7,8,9,10)
= 1; 11:hi = 2")
%in% Within. For example, y$var %in% c(1:10) tests, for each observation, whether the value of var is among
the values 1 to 10.
$ Indexing operator. This comes between the name of a dataframe and the name of a variable within it. R can
hold many data frames, so this helps R know which dataframe you are referring to. For example, y$var
means the variable "var" that is in the dataframe y.
[ Square bracket. Used as an indexing operator. Can be used in place of $ above. For example, y$var is
equivalent to y[["var"]]
( Round bracket. Used to pass an object (and other arguments) to a function. Everything in R is either a verb (a
function, which does an action) or a noun (an object, which receives an action). Use round brackets to supply an
object to a function. E.g., mean(y$var, na.rm = T); here the function is mean and the object is y$var.
{ Curly bracket. Used to group expressions, e.g., the bodies of loops and function definitions
%>% Piping operator (from the magrittr package, also loaded by dplyr): passes the object on its left as the
first argument of the function on its right
STEP 2: Read the Excel file into R with the read_excel function.
Note that file.choose(), which appears many times in the box below, allows you to manually select where
the dataset is located on your computer. A pop-up box will appear, and you can navigate to select the file
without having to type a file path.
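A minimal sketch of this step, assuming the readxl package is installed:

```r
# Read an Excel file chosen interactively; read_excel handles both .xls and .xlsx
library(readxl)
y <- read_excel(file.choose())
```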
5. FROM SPSS
library(memisc)
y <- as.data.set(spss.system.file('C:\\Users\\Zatum\\file2016.sav'))
library(foreign)
y <- read.spss(file.choose())
6. FROM STATA
library(haven)
y <- read_dta(file.choose())
library(foreign)
y <- read.dta(file.choose())
library(readstata13)
y <- read.dta13(file.choose())
7. FROM CSV
y <- read.csv(file.choose())
After reading the data into R, check the dimensions and other properties of the data.
The output above (rows by columns) tells us that there are 17,872 observations in the dataset, and 373
variables. To see all variables in the dataset, type colnames(y)
Variables can be factor, numeric, character/string, or logical. R has a series of functions to determine the structure
of a variable. Take the variable finwgt.
To see its structure, type
> str(y$finwgt)
num [1:17872] 1234 1234 1234 1234 1234 ...
R also has several logical functions which test whether a variable meets a criterion.
> is.numeric(y$finwgt)
[1] TRUE
2. To compute percentages using the svymean function, you MUST recode your outcome so that cases are assigned
a value of 1 (or 100) and non-cases are assigned a value of 0. In several surveys, responses of "yes" are coded as 1,
while responses of "no" are coded as 2. You may therefore need to recode 2 to 0. Assigning cases a value of 1 and
non-cases a value of 0 produces results as proportions (e.g., 0.56). Assigning cases a value of 100 and non-cases a
value of 0 produces results as percentages (e.g., 56%).
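For example, a yes/no item coded 1/2 can be recoded so that svymean returns percentages (the variable names here are illustrative):

```r
library(car)
# 1 (yes) becomes 100, 2 (no) becomes 0; anything else becomes missing
y$CCIGT_2 <- car::recode(y$CCIGT, "1 = 100; 2 = 0; else = NA")
```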
3. When recoding or collapsing outcome variables (e.g., creating a binary variable), responses of “don’t know”, “not
sure” should be excluded because of potential for misclassification; otherwise, justification should be provided for
collapsing them with another category. For outcomes measured on a Likert scale, computing the mean of the raw
responses is not recommended given truncation as well as the fact that the ensuing results have no meaningful
interpretation. The variable(s) could instead be dichotomized based on a priori determined study objectives, e.g.,
“strongly agree”/”agree” vs other responses.
4. Skip patterns: it is very important to recognize that some surveys use skip patterns, meaning that only individuals
eligible for a given question answered it based on their responses to one or more preceding filter questions. To ensure
the accurate denominator is being assessed, both the filter question(s) and the final question may need to be
incorporated as appropriate. For example, in several adult surveys, smoking status is determined by first asking
respondents if they have smoked up to 100 cigarettes in their lifetime. Those who answer yes are then asked if they
smoke now. Those who answer no to the first question are not asked the second question at all (i.e., skipped to the
next question). In creating a variable describing current smoking among all participants with these questions, a value
of 1 will be assigned to those who answered “yes” to both questions (i.e., have smoked 100+ cigarettes AND smoke
now). A value of 0 will be assigned to those who answered yes to the first question but no to the second question (i.e.,
have smoked 100+ cigarettes but do not smoke now). A value of 0 will also be assigned to those who answered no to
the first question and were skipped from answering the second question.
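A sketch of this logic, with hypothetical variable names (smk100 for the 100-cigarettes filter question and smknow for the current-smoking question, both coded 1 = yes, 2 = no):

```r
# Current smoking among all participants, honoring the skip pattern
y$csmoke <- ifelse(y$smk100 == 1 & y$smknow == 1, 1,   # yes to both: current smoker
            ifelse(y$smk100 == 1 & y$smknow == 2, 0,   # smoked 100+, but not now
            ifelse(y$smk100 == 2, 0, NA)))             # never reached 100: non-smoker
```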
5. When creating a composite variable from two or more variables (e.g., the measure of “any tobacco use” in the 2017
NYTS denoting use of at least one of seven different tobacco product types), missing values can be handled in one of
two ways. The first (and more stringent approach) would be to analyze only individuals with complete information on
all variables of interest (listwise deletion). Under this approach, even individuals with information on all but one
variable would be excluded. The second, more inclusive, approach would be to exclude individuals only if they
were missing information on all variables of interest. Thus, an individual with information present for only one
variable would still be included in the analyses. The first approach may lead to loss of sample size and precision; also,
the sheer magnitude of those excluded potentially increases the magnitude of selection bias if missingness was not at
random. The second approach however increases the likelihood for misclassification bias. For example, if an
individual only had information for one tobacco product (say cigars), for which he reported being a non-user, he
would be classified as a non-tobacco user, even if he used other forms of tobacco (for which data are missing).
Absence of evidence is not evidence of absence! Information should be provided on how missing data were dealt
with. Note that this MMWR used the second approach.
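A sketch of the first (listwise-deletion) approach, using the 2017 NYTS column names that appear elsewhere in this manual:

```r
vars <- c("CCIGT", "CCIGAR", "CSLT", "CELCIGT", "CHOOKAH",
          "CPIPE", "CSNUS", "CDISSOLV", "CBIDIS")
# NA if ANY component is missing; 100 if any product is used; 0 otherwise
y$ctobany_cc <- apply(y[, vars], 1,
  function(x) ifelse(any(is.na(x)), NA, ifelse(any(x == 1), 100, 0)))
```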
This article examined current use prevalence of 10 tobacco products: cigarettes, cigars, smokeless tobacco,
electronic cigarettes, hookah, pipe tobacco, bidis, any tobacco product, 2+ tobacco products, and any
combustible tobacco use. Results were stratified by school level (middle and high school), sex (male and
female) and race (white, black, Hispanic, other race).
$CCIGAR
x
1 2
977 16439
$CSLT
x
1 2
510 16795
$CELCIGT
x
1 2
1360 16210
$CHOOKAH
x
1 2
508 16880
$CPIPE
x
1 2
137 17191
$CSNUS
x
1 2
249 17079
$CDISSOLV
x
1 2
105 17223
$CBIDIS
x
1 2
105 17223
#What does lapply do? Here, it takes the subset of columns of y that are inside
#the c() function, and to each of these variables, it applies the table function.
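A call of this form produces the tables shown above:

```r
# Tabulate each current-use variable in one pass
lapply(y[, c("CCIGAR", "CSLT", "CELCIGT", "CHOOKAH",
             "CPIPE", "CSNUS", "CDISSOLV", "CBIDIS")], table)
```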
Race: 7 = Multi-race, non-Hispanic
School level (hsms): MS = Middle School; HS = High School
Sex (sex): 1 = Female; 2 = Male
$SEX
x
1 2
8815 8881
$hsms
x
HS MS
10172 7700
summary(y$finwgt)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.73 630.24 1219.92 1518.33 1951.53 6505.08
table(y$stratum)
BR1 BR2 BR3 BR4 BU1 BU2 BU3 BU4 HR1 HR2 HR3 HR4 HU1 HU2 HU3 HU4
1865 968 1087 1021 1205 930 1318 711 2493 249 707 323 1546 1654 697 1098
# (Output abbreviated: for each stratum, $BR1 through $HU4, a table of PSU
# identifiers and their unweighted respondent counts; e.g., stratum HR2 contains
# PSUs 686767 and 692818, with 32 and 217 respondents, respectively.)
$CCIGAR_2
x
0 100
16439 977
$CSLT_2
x
0 100
16795 510
$CELCIGT_2
x
0 100
16210 1360
$CHOOKAH_2
x
0 100
16880 508
$CPIPE_2
x
0 100
17191 137
$CSNUS_2
x
0 100
17079 249
$CDISSOLV_2
x
0 100
17223 105
$CBIDIS_2
x
0 100
17223 105
# create composite variable for any smokeless tobacco (i.e., dissolvable, snus,
# chewing tobacco, snuff, or dip)
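One way to build this composite, following the apply pattern explained below (a sketch, not the verbatim original command):

```r
# 100 if any of chew/snuff/dip, snus, or dissolvable tobacco is used;
# NA only if all three components are missing; 0 otherwise
y$csmokeless <- apply(y[, c("CSLT", "CSNUS", "CDISSOLV")], 1,
  function(x) ifelse(all(is.na(x)), NA, ifelse(any(x == 1, na.rm = TRUE), 100, 0)))
```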
table(y$csmokeless)
0 100
17000 706
# Note that the apply function is different from lapply. The "1" that appears just
# after the square bracket tells R to perform operations on rows ("2" is for column
# operations). In plain language, the code above says: "For each individual row,
# examine the three variables provided ("CSLT", "CSNUS", "CDISSOLV") and if an
# individual has missing information on all three variables, assign them a value of
# NA. Otherwise, if they have a value of 1 on any of the three variables, assign
# them a value of 100. For all other people who meet neither of the two criteria
# just mentioned, assign them a value of 0."
# create composite variable for any tobacco use
y$ctobany = apply(y[, c("CCIGT", "CCIGAR", "CSLT", "CELCIGT", "CHOOKAH",
                        "CPIPE", "CSNUS", "CDISSOLV", "CBIDIS")], 1,
  function(x) ifelse(all(is.na(x)), NA, ifelse(any(x == 1, na.rm = T), 100, 0)))
table(y$ctobany)
0 100
15312 2501
table(y$ctobcomb)
0 100
16084 1715
table(y$ctob2)
0 100
16650 1163
#Assign value labels to sex
y$sex = factor(y$SEX,
levels=c(1:2),
labels =c("female", "male"))
table(y$sex)
female male
8815 8881
table(y$race4.c)
#numeric
y$race4.n = recode(y$RACE_M, "1 = 1; 2 = 2; 3 = 3; 4:7 = 4; else = NA")
table(y$race4.n)
1 2 3 4
7532 2983 4614 1955
#factor
y$race4.f = factor(y$race4.n,
levels = c(1:4),
labels = c("White", "Black", "Hispanic", "Other race"))
> table(y$race4.f)
# While race4.f (factor variable) and race4.c (character variable) look alike,
# they aren't the same. Notice how "White" comes first in race4.f without having to
# trick R by assigning numbers in front. This is because, behind the string label,
# there is a real number which is the basis for ordering.
STEP 4: Install the survey package in R. Set data to survey mode using the weight, PSU, and
stratum variables.
library(survey)
s = svydesign(data = y, id = ~ y$psu, strata = ~ y$stratum, weights = ~ y$finwgt, nest=T)
1. Some surveys (e.g., certain telephone-based surveys that use random-digit dialing) may not involve a multi-stage
selection process and hence PSUs may not be available. In such cases, setting the data to survey mode will require
using only the weight and stratum variables:
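In that case, id = ~1 tells svydesign there are no clusters; a sketch:

```r
# No PSU variable: specify id = ~1 (no clustering), keep strata and weights
s = svydesign(data = y, id = ~ 1, strata = ~ y$stratum, weights = ~ y$finwgt)
```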
2. Similarly, some surveys may involve a multi-stage selection process but may not involve stratification. In such
cases, the PSU variable will be present, but the stratum variable will be missing. In such cases, setting the data to
survey mode will require using only the weight and PSU variables:
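A sketch:

```r
# No stratum variable: omit the strata argument, keep PSU and weights
s = svydesign(data = y, id = ~ y$psu, weights = ~ y$finwgt)
```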
3. The weight variable is used to accurately estimate means and percentages. The PSU and strata variables are used
to accurately estimate measures of variance (e.g., 95% confidence intervals).
4. Since each survey year is individually weighted to represent the population for that year, when appending data
from multiple years to increase sample size for a point estimate, the weights have to be adjusted by dividing by the
number of years pooled. The newly adjusted weight variable would then be used to set the data to survey mode.
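For example, if three survey years were appended (a hypothetical sketch; finwgt_adj is an assumed name):

```r
# Divide the original weight by the number of pooled years, then re-set survey mode
y$finwgt_adj <- y$finwgt / 3
s = svydesign(data = y, id = ~ y$psu, strata = ~ y$stratum,
              weights = ~ y$finwgt_adj, nest = T)
```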
5. For the weight, stratum, or PSU variables, it is either all or none (all individuals have the variable, or no individual
does). There cannot be missing values for any of these variables. When appending data from multiple years, inspect
carefully to ensure that information is complete for all observations in the dataset. Otherwise, results may be
erroneous and invalid.
6. Certain analytic techniques (e.g., inverse probability weighting) create weights that are different from the
survey weights that came with the dataset. Since only one set of weights can be used in analyses, compute new
weights by multiplying the survey weights within the dataset with the weights generated from the statistical
procedure. These newly created weights can then be used to set the data to survey mode.
7. Since the dataframe (in this case, y) and the created survey object (in this case, s) are distinct from each other,
making changes in the dataframe (e.g., creating new variables) requires creating a new survey object using the
svydesign function above. Note that all complex survey analytical procedures are based on the survey object, not
the parent dataframe.
STEP 5: Compute overall prevalence estimates for all the outcomes for middle and high school
students separately. For reference, the results from the MMWR are shown below. The estimates of interest are
shown in red.
# For clarity, a simpler (but slightly longer) code is used below. Means and
# confidence intervals are computed separately and then put together using the Map
# function (the R survey package does not automatically compute 95% confidence
# intervals; you have to generate them yourself). Following the stratification
# approach in the MMWR report, the analyses are subset to high school and middle
# school students separately using the subset function (RESULTS SHOWN ONLY FOR HIGH
# SCHOOL STUDENTS).
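A sketch of the approach just described; the exact original commands are not reproduced in this excerpt, so the object names below are assumptions:

```r
hs <- subset(s, hsms %in% "HS")
outcomes <- c("CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
              "CBIDIS_2", "ctobany", "ctob2", "ctobcomb")
# Means (percentages) and 95% CIs computed separately, then combined with Map
means <- lapply(outcomes, function(v) svymean(as.formula(paste0("~", v)), hs, na.rm = TRUE))
cis   <- lapply(means, confint)
means.hs <- Map(function(m, ci) cbind(coef(m), ci), means, cis)
names(means.hs) <- outcomes
means.hs
```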
$CCIGT_2
2.5 % 97.5 %
CCIGT_2 7.657248 6.463722 8.850773
$CCIGAR_2
2.5 % 97.5 %
CCIGAR_2 7.744072 6.525348 8.962796
$csmokeless
2.5 % 97.5 %
csmokeless 5.455253 4.045135 6.865371
$CHOOKAH_2
2.5 % 97.5 %
CHOOKAH_2 3.434495 2.730435 4.138555
$CPIPE_2
2.5 % 97.5 %
CPIPE_2 0.8692909 0.6751347 1.063447
$CBIDIS_2
2.5 % 97.5 %
CBIDIS_2 0.7258421 0.5164538 0.9352305
$ctobany
2.5 % 97.5 %
ctobany 19.60851 17.05665 22.16037
$ctob2
2.5 % 97.5 %
ctob2 9.301875 7.757809 10.84594
$ctobcomb
2.5 % 97.5 %
ctobcomb 12.98673 11.21423 14.75922
#To get results for middle school students, just run the exact same command
#above, changing only the category within the subset function. Currently, it is
#subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")
1. Use the svymean function to generate overall estimates. Use the svyby function to generate stratified estimates.
Use the svytable function to generate weighted population counts. Use the svychisq function to compare
estimates.
2. Using regular functions that do not account for the complex survey design (e.g., mean instead of svymean) may
yield invalid results.
3. Percentages and counts generated are only estimates of the true population parameter; number of decimal places
should reflect this and not display an unreasonable degree of precision. Round percentages to the nearest 1 decimal
place, population counts to the nearest 100,000.
4. 95% confidence intervals do not have to be provided when a complete census of the study population is taken.
Similarly, parametrically-computed confidence intervals are not scientifically justifiable for non-probability samples
because there are no associated sampling errors (no randomness in selection); there is thus no mathematical basis for
computing standard errors for such samples. 95% confidence intervals in those cases can be computed using
bootstrapping; the quantiles at 0.025 and 0.975 yield the bootstrapped 95% confidence intervals.
STEP 6: Compute weighted population counts for all outcomes for middle and high school
students separately. For reference, the results from the MMWR are shown below. The estimates of interest are
shown in orange.
# We will use the svytable function to compute the weighted population counts for
# high school and middle school students separately. (RESULTS PRESENTED BELOW ARE
# ONLY FOR HIGH SCHOOL STUDENTS). The numbers below the 0 heading represent the
# weighted counts of non-users of the specified tobacco product. The numbers below
# the 100 heading are the weighted counts of users of the specified tobacco
# product. For example, for current electronic cigarette use (CELCIGT_2),
# 1,723,292 high school students (~1.7 million) reported current use.
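A sketch of these svytable calls (an assumed reconstruction):

```r
outcomes <- c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2",
              "CPIPE_2", "CBIDIS_2", "ctobany", "ctob2", "ctobcomb")
# Weighted counts of non-users (0) and users (100) among high school students
wcount <- lapply(outcomes, function(v)
  svytable(as.formula(paste0("~", v)), subset(s, hsms %in% "HS")))
names(wcount) <- outcomes
```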
wcount
$CELCIGT_2
CELCIGT_2
0 100
13039887 1723292
$CCIGT_2
CCIGT_2
0 100
13549932 1123588
$CCIGAR_2
CCIGAR_2
0 100
13525742 1135367
$csmokeless
csmokeless
0 100
14084680.4 812689.2
$CHOOKAH_2
CHOOKAH_2
0 100
14162043 503694
$CPIPE_2
CPIPE_2
0 100
14501264.1 127163.6
$CBIDIS_2
CBIDIS_2
0 100
14522248.4 106179.3
$ctobany
ctobany
0 100
12027983 2933779
$ctob2
ctob2
0 100
13570037 1391724
$ctobcomb
ctobcomb
0 100
13009354 1941645
#To get results for middle school students, just run the exact same command
#above, changing only the category within the subset function. Currently, it is
#subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")
1. Deleting cases from a survey data set can be problematic since it can lead to wrong estimation of the standard
errors. For example, if you wanted to analyze the smoking prevalence among only high school students, and you
dropped all observations for middle school students, this would be inappropriate because the standard errors of the
estimates would be incorrectly estimated. In calculating subpopulation estimates, only the cases defined by the
subpopulation are used in the calculation of the estimate; however, all cases in the dataset should be used in the
calculation of the standard errors.
2. The svyby function can be used for subgroup analyses. The stratification variable should be stored in numeric
format (don't recode the variables as string!)
3. Suppression rules are used when dealing with subpopulation estimates to ensure that only precise estimates are
presented. Common suppression rules are: relative standard errors >30% or cell sample sizes < 30 persons.
For simplicity, we will generate stratified prevalence estimates for sex and race/ethnicity separately. For each
variable, results are analyzed for all products and school levels simultaneously. The desired estimates are
shown below:
# We will use the svyby function to compute sex- and race-stratified prevalence
# estimates among high school and middle school students separately. (RESULTS
# PRESENTED BELOW ARE ONLY FOR HIGH SCHOOL STUDENTS). We are introducing the
# round function here (estimates rounded to 1 decimal place).
means.hs = list()
varlist = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2", "CPIPE_2",
"CBIDIS_2", "ctobany", "ctob2", "ctobcomb")
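The loop that fills means.hs might look like this (a hedged reconstruction; the race-stratified run would swap ~as.numeric(sex) for the race variable):

```r
for (v in varlist) {
  # Sex-stratified percentage and SE for each outcome among high school students
  fit <- svyby(as.formula(paste0("~", v)), ~ as.numeric(sex),
               subset(s, hsms %in% "HS"), svymean, na.rm = TRUE)
  means.hs[[v]] <- round(cbind(fit, confint(fit)), 1)
}
means.hs
```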
$CCIGT_2
as.numeric(sex) CCIGT_2 se 2.5 % 97.5 %
1 1 7.6 0.8 6.1 9.1
$CCIGAR_2
as.numeric(sex) CCIGAR_2 se 2.5 % 97.5 %
1 1 6.3 0.7 4.9 7.7
2 2 9.0 0.8 7.5 10.5
$csmokeless
as.numeric(sex) csmokeless se 2.5 % 97.5 %
1 1 3.1 0.4 2.2 3.9
2 2 7.6 1.1 5.6 9.7
$CHOOKAH_2
as.numeric(sex) CHOOKAH_2 se 2.5 % 97.5 %
1 1 3.3 0.4 2.5 4.0
2 2 3.4 0.5 2.5 4.3
$CPIPE_2
as.numeric(sex) CPIPE_2 se 2.5 % 97.5 %
1 1 0.5 0.1 0.3 0.8
2 2 1.0 0.2 0.8 1.3
$CBIDIS_2
as.numeric(sex) CBIDIS_2 se 2.5 % 97.5 %
1 1 0.6 0.1 0.4 0.8
2 2 0.7 0.2 0.4 1.0
$ctobany
as.numeric(sex) ctobany se 2.5 % 97.5 %
1 1 17.6 1.2 15.2 19.9
2 2 21.4 1.6 18.3 24.5
$ctob2
as.numeric(sex) ctob2 se 2.5 % 97.5 %
1 1 7.7 0.8 6.2 9.2
2 2 10.7 0.9 8.8 12.5
$ctobcomb
as.numeric(sex) ctobcomb se 2.5 % 97.5 %
1 1 12.2 0.9 10.4 14.1
2 2 13.5 1.1 11.4 15.5
#To get results for middle school students, just run the exact same command
#above, changing only the category within the subset function. Currently, it is
#subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")
1. Computed 95% Confidence intervals are a measure of the degree of precision of an estimate and should not be used
in lieu of a formal comparison of two estimates. Non-overlap of 95% confidence intervals always indicates that two
estimates differ statistically; however, the presence of an overlap does not preclude statistical significance. A formal
statistical test should therefore always be performed (e.g., a chi-squared test). The type of test will depend on the
variable type (e.g., categorical or continuous) and the underlying assumptions regarding distributions of the data
(parametric or non-parametric). Non-parametric tests are those that make no assumptions regarding parameters or
distributions. The table below shows some appropriate tests for bivariate testing based on variable type and
assumptions regarding underlying distributions.
Categorical with continuous, independent: ANOVA, Z statistic, regression (linear or logistic), or t-test for
independent samples (parametric); Kruskal-Wallis test (non-parametric alternative to ANOVA).
Categorical with continuous, correlated: repeated-measures ANOVA, mixed-effects models, GEE, or paired t-test
(parametric); sign test or Wilcoxon signed-rank test (non-parametric).
Continuous with continuous: Pearson's correlation (parametric); Spearman's correlation (non-parametric).
Count with categorical: Poisson regression (parametric); Mann-Whitney U test (non-parametric).
Nested comparisons (e.g., nested multi-year estimates): nested Z test:
Z = (X1 − X2) / sqrt(SE1^2 + SE2^2 − 2P × SE1 × SE2)
where X1 and X2 are the two estimates, SE1 and SE2 are their standard errors, and P is the correlation
between the two (nested) estimates.
# Use the svychisq function to compare estimates. Chi-squared test for gender and
# racial differences in the use of the different tobacco products, among middle and
# high school students separately (RESULTS SHOWN FOR ONLY HIGH SCHOOL STUDENTS).
chi.hs = list()
outcomes = c("CELCIGT_2", "CCIGT_2", "CCIGAR_2", "csmokeless", "CHOOKAH_2",
             "CPIPE_2", "CBIDIS_2", "ctobany", "ctob2", "ctobcomb")
predictors = c("sex", "race4.f")
for (p in predictors) {
  chi.hs[[p]] = lapply(outcomes, function(x)
    svychisq(as.formula(paste("~", x, "+", p)), subset(s, hsms %in% "HS")))
}
chi.hs
# (Output abbreviated: chi.hs contains ten chi-squared test results for sex,
# $sex[[1]] through $sex[[10]], and ten for race, $race4.f[[1]] through
# $race4.f[[10]], one per outcome in the order listed in outcomes.)
#To get results for middle school students, just run the exact same command
#above, changing only the category within the subset function. Currently, it is
#subset(s, hsms %in% "HS"). Change it to subset(s, hsms %in% "MS")
Check and re-check your code to ensure there are no bugs and all variables have been recoded
correctly.
Check to make sure the results in your spreadsheets or tables are the same as those in your R
console.
Check to see that imprecise estimates are not reported. For subgroup analyses, cells with fewer than
30 people may not provide precise estimates. Consider combining similar categories to increase cell
sample size. Relative Standard Errors (RSEs) in the range of 30% to 50% have been used acceptably
in the scientific literature (with prevalence estimates above the cut-off being statistically unreliable).
Estimates above the threshold should ideally be suppressed. RSEs are calculated by dividing the
standard error by the estimate (mean or percentage).
Check to ensure proper statistical tests have been conducted. Ninety-five percent confidence intervals
are merely an eyeball test and should not be used as a definitive statistical test to compare two
prevalence estimates. The absence of an overlap ALWAYS indicates a statistically significant
difference between the two estimates being compared. However, the presence of an overlap does NOT
always preclude significance.
Check the numbers and percentages for correctness in tables and figures, and that they correspond
with information in the text. Ensure tables and figures are able to stand alone with the appropriate
descriptive title and footnotes.
Check to ensure the description of the methods provides sufficient information so the results could be
duplicated by someone with access to the same data and information. This includes providing within
the manuscript detailed descriptions of analytical and/or statistical approaches used with clear
definitions of variables used.
When reporting sample sizes, use the unweighted numbers, NOT the weighted population counts. The
unweighted numbers are the persons who actually completed the survey. For example in the 2017
National Youth Tobacco Survey, a total of 17,872 students in middle and high school participated in the
survey, and the total weighted population count was 27.1 million. The number to be reported as the
sample size is the 17,872 number, NOT the 27.1 million number.
Report the response rate for the survey.
It is generally not enough to report only the p-value. There is valuable information that cannot
be revealed solely by a p-value, such as the effect size or the consistency of a finding. Presenting
information on both the point estimates and the 95% confidence intervals is preferable because it
provides these estimates of magnitude of effect and consistency.
When reporting percentages, use weighted NOT unweighted percentages. Otherwise, results may not
be valid because the unweighted results are from a sample whose distributions (e.g., age, sex, race)
may be very different from the target population.
Inferences from the weighted analyses should be made to the target population rather than the
sampled population. For example, weighted prevalence of current e-cigarette use among high school
students was 11.7% from the 2017 NYTS. Appropriate language to report this result would be “11.7%
of U.S. high school students reported current e-cigarette use”, not “11.7% of sampled high school
students who participated in the survey reported current e-cigarette use”.
Typically, percentages are expressed to one decimal place, measures of association (e.g., odds ratios,
prevalence ratios, etc.) to two decimal places, and p-values to three decimal places.
Do not report a p-value as 0 (e.g., 0.0000). Rather, express it as < 0.0001.
Provide the percentage of respondents with missing data for key outcomes.
Describe any sensitivity analyses and rationale.