Week 01

ISyE6421: Biostatistics
• Lectures: TR 2:00-3:15pm, IC#109

• Instructor: Dr. Yajun Mei
(Pronounced as “YA-JUNE MAY”)
Email: <ymei@isye.gatech.edu>
Office Hours: after class. Please use Piazza for
class-related questions.
• TA: TBD
My academic pathway
• Undergraduate: Math, Peking Univ., BS in 1996

• Worked as a computer programmer in a Chinese bank,
1996-1998
• Graduate: PhD in Math with a minor in EE, Caltech,
1998-2003 (advisor: Dr. Gary Lorden)
• Postdoc in biostatistics: Fred Hutchinson Cancer
Research Center, Seattle, 2003-Sep 2005
(supervisor: Dr. Sarah Holte)
• New Research Fellow: SAMSI & Duke, Fall 2005
• Joined ISyE at Georgia Tech in Jan 2006; currently a full
professor.
Lecture 1
Agenda
• Course organization
• Introduction to Biostatistics
• Biostatistical Design of Medical Studies
More about the Course
• This is an introductory graduate course in Biostatistics.

• Different from a mathematical / statistical course:
This course is well described by a problem solving and
practical approach, with a strong emphasis on conceptual
understanding
• Recommended books. Notes/slides provided.

• Van Belle, Heagerty, Fisher, and Lumley, “Biostatistics,
a methodology for the health sciences”
• Rosner, “Fundamentals of Biostatistics”
• Pinheiro & Bates, “Mixed-Effects Models in S and S-PLUS”
(More e-books and books on biostatistics and/or R can be found online.)
Organization of the Course
Topics:
• Biostatistical Design of Medical Studies
• Software: R (or SAS, BUGS, MATLAB)
• Basic statistical inference
• Categorical Data: OR, RR, Mantel-Haenszel
• Continuous Data: Parametric/Nonparametric
• Review of Linear and Logistic Regression
• Multiple Comparisons
• False Discovery Rate
• Survival analysis: censoring, Kaplan-Meier, Cox.
• Applications (genetics, -omics, ChIP-Seq)
• Longitudinal data analysis
• Sample size calculation for studies
• Brief Introduction to Causal Inference
Course Organization (cont.)
Grading:
• Homework: 20%
• Class participation: 10% (peer comments)
• Midterm: 30% (Thursday, March 10)

• You are allowed to bring 4 double-sided cheat sheets
or 8 one-sided cheat sheets.
• Calculators are allowed.
• Class project on bio- or health-related topics:
40% (a team of 1-3 students)
➢ Progress report: 2% (due March 17)
➢ Oral presentation: 18% (file due 9am on April 19)
➢ Final written report: 20% (file due April 26)
This Lecture
• Introduction to Biostatistics
• Biostatistical Design of Medical Studies
What is Statistics?
Examples:
• Parents of a child with a genetic defect consider
whether or not they should have another child
• To choose the best therapy, a physician must
compare the prognosis, or future course, of a
patient under several therapies
• Does smoking cause cancer?
Key elements: uncertainty, variation, inference.
Some Definitions of Statistics:
• “May be regarded as mathematics applied to
observational data … as the study of (1) population; (2)
variation; (3) methods of the reduction of data.” (Fisher)
• “= Uncertainty and Behavior.” (Savage)
• “the art of learning from data.” (Ross)
Why Biostatistics?
• Biostatistics: a field of science that applies
statistical methods to medical and biological
problems
• Why focus on biostatistics (not statistics)?
• Some statistical methods are used more heavily in
biostatistics than in other fields (e.g., survival
analysis)
• To show how to apply statistical methods, examples are
drawn from the biological, medical, and health care
areas
Goal of this course
Some Biostatistical Problems:
1. Is a new drug more effective in treating an illness than a
previously used drug?
2. Does the use of a seat belt decrease the chance of death
in a car accident?
3. How does the frequency of laboratory tests influence the
quality of medical care?
4. Are growth curves for boys and girls significantly
different?
Objective: you’re expected to be able to
• understand specified statistical concepts and procedures
• identify procedures appropriate (and inappropriate) to a
given situation
• carry out appropriate specified statistical procedures
Lecture 1
Biostatistical Design of Medical Studies:
• Introduce some of the principles of
biostatistical design. Many of the ideas are
expanded later.
Reminder: statistics is not an end in itself but a
tool to be used in investigating the world
around us.
Types of Studies: I
• Observational Study: collects data from an
existing situation. The data collection does not
intentionally interfere with the running of the
system
(Remark: the act of observation may introduce change
into a system)
• Experiment: a study in which an investigator
deliberately sets one or more factors to a
specific level
(Remark: experiments lead to stronger scientific
inferences than do observational studies, which are
always open to misinterpretation due to a lack of
knowledge in a given field)
Three types of Experiments
• Laboratory experiment: an experiment that takes
place in an environment (called a laboratory) where
experimental manipulation is facilitated
• Comparative experiment: an experiment that
compares two or more techniques, treatments, or levels
of a variable.
• Crossover experiment: an experiment in which the same
experimental unit receives more than one treatment or is
investigated under more than one experimental condition.
The different treatments are given during nonoverlapping
time periods.
Remark: A clinical study is one that takes place in the
setting of clinical medicine.
Types of Studies: II
Suppose that we want to study the
characteristics of cases of a disease. There are
two possible approaches:
• Longitudinal study: collects information on
study units over a specified time period
(repeated measurements per subject).
• Cross-sectional study: collects data on study
units at a fixed time
(drawbacks: cases of long duration are overrepresented,
and between-subject variation is large).
Types of Studies: III
We now turn to the concepts of prospective
studies and retrospective studies,
usually involving human populations:
• Cohort / Prospective Study
• Case-Control / Retrospective Study
Cohort Study / Prospective Study
Cohort: (from www.m-w.com)
1. one of 10 divisions of an ancient Roman
legion
2. a group of warriors or soldiers
3. band, group
4. a group of individuals having a statistical
factor (as age or class membership) in
common in a demographic study
Cohort Study / Prospective Study
• A cohort of people is a group of people whose
membership is clearly defined.
▪ All students enrolling at GT for spring 2008
▪ All smokers in the US as of Jan 1, 2008, where a person is
defined to be a smoker if (s)he smoked one or more
cigarettes during the preceding calendar year
• An Endpoint is a clearly defined outcome or
event associated with an experiment or study
unit.
• Presence of a particular disease
• A Cohort (Prospective) study is one in which a
cohort of people is followed for the occurrence
or non-occurrence of specified endpoints or
events or measurements.
Cohort Study: Example
• Important: in a cohort study, information about the risk
factor (exposure) is determined prior to the
observation of disease status.
• A prospective study can be conducted from existing data:
a historical prospective study.
Example: For each of 49000 newborn infants, Apgar
scores were recorded at 1 and 5 minutes. For those with a low score (<8
at 5 min), Apgar scores were also recorded at 10, 15, and 20 minutes. All
children were followed to the age of 7 years, when a psychological and
developmental assessment was performed.
• Result: low Apgar scores are a risk factor for
the development of cerebral palsy (CP).
Cohort Studies: Advantage
Advantage:
• Strongest observational design for establishing
cause-and-effect relationships
• Very efficient for the study of rare exposures
• Clear temporal relationship between exposure
and disease
• Can yield information on multiple exposures, or
on multiple outcomes of a particular exposure
• May yield information on the incidence of disease
Cohort Studies: disadvantage
Disadvantage:
• Time consuming
• Often require a large sample size
• Expensive
• Not efficient for the study of rare diseases
• Losses to follow-up may diminish validity
• Changes over time in diagnostic methods may
lead to biased results
Case-Control / Retrospective Studies
• A retrospective study is one in which people
having a particular outcome or endpoint are
identified and studied.
(Note: the cases are a group of people who have a particular disease or meet given criteria;
their past habits and medical history are compared with those of a control group.)
• A case-control study selects all cases, usually
of a disease, that meet fixed criteria. A group,
called controls, that serves as a comparison for
the cases is also selected. The cases and
controls are compared with respect to various
characteristics.
• The study attempts to relate their prior health habits to their
current disease status.
Case-Control : Example
• Cases: all patients with eosinophilia-myalgia
syndrome (EMS) in Minneapolis-St. Paul
• Controls: randomly selected telephone numbers
in the same area
• Subjects were interviewed and asked about potential
risk factors and use of L-tryptophan
• 61 of 63 (97%) case subjects had used L-tryptophan,
but only 101 of 5188 (about 2%) control
subjects had.
• Result: after the recall of L-tryptophan in 1990, the
number of EMS cases in the US decreased from
1500 in 1988/1989 to near 0.
Case-Control Studies: Advantage
Advantage:
• Efficient for the study of rare disease
• Efficient for the study of chronic disease
• Tend to require a smaller sample size than
cohort studies
• Less expensive than cohort studies
• May be completed more rapidly than cohort
studies
Case-Control Studies: Disadvantage
Disadvantage:
• Risk of disease cannot be estimated directly
• Not efficient for the study of rare exposures
• More susceptible to selection bias than cohort
studies
• Information on exposure may be less accurate
than that available in cohort studies
Cohort vs. Case-Control
              Disease +   Disease -
Exposure +    n11         n12
Exposure -    n21         n22
(Row totals are fixed in a cohort study.)
• Cohort (Prospective) Study: The totals for (row)
“exposure +” and “exposure –” are fixed, and column
totals will vary depending on the association.
(Note: in a cohort study you fix a group with certain characteristics and observe it over a period of time to see whether
the disease occurs. It is prospective, i.e., looking forward: who was exposed is known at the beginning of the study,
but who develops the disease is not.)
• Case-Control (Retrospective) Study: The totals
for (column) “disease +” and “disease –” are fixed, and
row totals are random.
(Note: the study is retrospective, so the columns are fixed: we know the outcome and look back to check whether a certain
kind of exposure is associated with it, so the exposure counts can vary.)
Summary: various types of studies
• Observational study vs. experiment
• Longitudinal vs. cross-sectional
• Cohort vs. Case-control
Remark: A clinical study is one that takes place
in the setting of clinical medicine.
Steps necessary to perform a study
1. A question or problem area of interest is considered.
This does not involve biostatistics.
2. A study is to be designed to answer the question. Must
consider at least the following:
a. Identify the data to be collected (variables to be
measured, sample size)
b. An appropriate analytical model needs to be
developed for describing and processing the data
c. What inferences does one hope to make from the
study? What conclusions might one draw from the
study? To what population(s) is the conclusion
applicable?
3. The study is carried out and the data are collected
4. The data are analyzed and conclusions and inferences
are drawn
5. The results are used, e.g., publication, plan new study.
Ethical Issues
Some points relevant to experimentation with humans:
1. Investigators are responsible for the conduct of an ethical
study to the extent that they may be expected to know what
is involved in the study
2. In proposed studies involving humans or animals, there
should be review by people not concerned or connected with
the study or the investigators
3. People participating in an experiment should understand and
sign an informed consent form
4. Subjects should be free to withdraw at any time, or to refuse
initial participation, without being penalized or jeopardized
with respect to current and future care and activities
5. When possible, animal studies should be done prior to
human experimentation.
• Data collection: design of forms (what data are to
be collected, clarity of questions, pretesting of forms
and pilot studies, layout and appearance).
• Data editing and verification (validity checks,
consistency checks, missing forms)
• Data handling
• Amount of data collected: sample size
• Inference from a study
Remark: in many studies, 15% of the expenses
have gone to data handling, processing, and
analysis.
Summary
Biostatistical design of medical studies:
• Various types of studies
• Steps necessary to perform a study
• Ethics
• Data handling, processing, and analysis
Next Topic: 2x2 Contingency Table (cohort vs.
case-control)
Topic: 2x2 Table
• 2x2 Contingency Table
(cohort vs. case-control)
• Technical Notation:
➢ Relative Risk (RR): can be estimated in
cohort studies, but not in case-control
studies
➢ Odds Ratio (OR): can be estimated in both
cohort and case-control studies
Recall: Cohort vs. Case-Control
              Disease +   Disease -
Exposure +    n11         n12
Exposure -    n21         n22
• Cohort (Prospective) Study: The totals for (row)
“exposure +” and “exposure –” are fixed, and column
totals will vary depending on the association.
• Case-Control (Retrospective) Study: The totals
for (column) “disease +” and “disease –” are fixed, and
row totals are random.
Comparing Two Proportions
              Disease +   Disease -   Total
Exposure +    n11         n12         n1
Exposure -    n21         n22         n2
Disease in exposed population ~ b(n1, p1),
Disease in unexposed population ~ b(n2, p2)
Want to test the null hypothesis H0: p1 = p2
Example: ABO Hemolytic
• Bucher et al. (1976) studied the occurrence
of hemolytic disease in newborns resulting from ABO
incompatibility between the parents (i.e., the
father has antigens that the mother lacks)
• The authors reviewed 7464 consecutive
infants born at North Carolina Hospital
from Oct 1965 to March 1973
• One problem considered in the paper is
racial differences in the incidence of ABO
hemolytic disease.
Example
                 ABO Hemolytic Disease
                 Yes      No       Total
Black infant     43       3541     3584
White infant     17       3814     3831
(Race is the exposure.)
H0: no racial differences in disease rates
Four Methods:
• Small sample: Fisher’s exact test
• Large sample: Three tests
I. Fisher’s exact test
1. It is for small samples.
2. Given the row and column totals, what
is the distribution of n11?
3. It is based on the hypergeometric distribution.
The hypergeometric distribution is a discrete probability distribution that describes the probability of k
successes (random draws for which the object drawn has a specified feature) in n draws, without
replacement, from a finite population of size N that contains exactly K objects with that
feature, wherein each draw is either a success or a failure.
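To make the link between the hypergeometric distribution and Fisher's exact test concrete, here is a minimal sketch in R using the ABO hemolytic disease counts from the example. It assumes the usual two-sided convention (sum the probabilities of all tables that are no more likely than the observed one); the variable names are illustrative, not from the lecture.

```r
# 2x2 table from the ABO example: rows = exposure (race), columns = disease (yes, no)
tab <- matrix(c(43, 17, 3541, 3814), nrow = 2)

n1    <- sum(tab[1, ])   # row 1 total (3584)
n_yes <- sum(tab[, 1])   # column 1 total: number with disease (60)
n_no  <- sum(tab[, 2])   # column 2 total: number without disease (7355)

# With all margins fixed, n11 is hypergeometric under H0:
# P(n11 = k) = choose(n_yes, k) * choose(n_no, n1 - k) / choose(n_yes + n_no, n1)
k_support <- max(0, n1 - n_no):min(n1, n_yes)
probs <- dhyper(k_support, m = n_yes, n = n_no, k = n1)

# Two-sided p-value: total probability of tables no more likely than the observed one
# (the small relative tolerance guards against floating-point ties)
p_obs <- dhyper(tab[1, 1], m = n_yes, n = n_no, k = n1)
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_two_sided
```

The result can be compared with `fisher.test(tab)$p.value`, which should agree up to floating-point error.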
Fisher’s exact test: Example
                 Yes      No       Total
Black infant     43       3541     3584
White infant     17       3814     3831
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)
fisher.test(data1)
p-value = 0.0003712
Reject H0 at the 5% level.
That is, we accept the alternative hypothesis: there is a racial difference in disease
rates. The true odds ratio is not equal to 1, i.e., the odds of disease differ between
the two races.
II. Large Sample Test A
“Exposure +” ~ b(n1, p1),
“Exposure –” ~ b(n2, p2)
(Note: this is a two-sample test for proportions.)
Null Hypothesis H0: p1 = p2
Key idea: compare the sample proportions $\hat{p}_1 = n_{11}/n_1$ and $\hat{p}_2 = n_{21}/n_2$,
where the overall disease rate is
$\hat{p} = \dfrac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \dfrac{n_{11} + n_{21}}{n_1 + n_2}$
Large Sample Test A: Example
(ABO hemolytic disease data: 43 of 3584 Black infants vs. 17 of 3831 White infants.)
Since T > z_{0.025} = 1.96, reject H0 at the 5% level
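A minimal sketch in R, assuming Test A is the standard pooled two-sample z-test (the form suggested by the overall disease rate), applied to the ABO hemolytic disease counts given earlier. The name `T_stat` and the exact form of the statistic are assumptions, not taken from the lecture.

```r
n11 <- 43; n1 <- 3584   # diseased / total, exposure + (Black infants)
n21 <- 17; n2 <- 3831   # diseased / total, exposure - (White infants)

p1_hat <- n11 / n1
p2_hat <- n21 / n2
p_hat  <- (n11 + n21) / (n1 + n2)   # overall (pooled) disease rate

# Pooled two-sample z statistic (assumed form of T)
T_stat  <- (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
p_value <- 2 * pnorm(-abs(T_stat))  # two-sided p-value

# Reject H0 at the 5% level if |T| exceeds z_{0.025} = 1.96
reject_H0 <- abs(T_stat) > qnorm(0.975)
```

Under this assumption, `T_stat^2` equals the uncorrected Pearson chi-square statistic for the same table, which is one way to cross-check the calculation.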
III. Large Sample Test B
Under H0
Large Sample Test B: Example
(ABO hemolytic disease data: 43 of 3584 Black infants vs. 17 of 3831 White infants.)
Since |T*| > z_{0.025} = 1.96, reject H0 at the 5% level
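IV: Chi-Square Test of independence
Observed frequency table
              Disease +     Disease -     Total
Exposure +    n11           n12           n1
Exposure -    n21           n22           n2
Total         S=n11+n21     F=n12+n22     n
Expected frequency table
              Disease +     Disease -     Total
Exposure +    n1S/n         n1F/n         n1
Exposure -    n2S/n         n2F/n         n2
Total         S=n11+n21     F=n12+n22     n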
IV: Chi-Square Test of independence
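Assuming the statistic is the usual Pearson form (which is what the R computation in the example sums term by term), it can be written with $O_{ij}$ for the observed counts and $E_{ij}$ for the expected counts. The symbols $O_{ij}$ and $E_{ij}$ are notation introduced here for illustration.

```latex
T^{**} \;=\; \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} \;=\; \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n}
```

Under H0 and for large samples, $T^{**}$ is approximately chi-square distributed with 1 degree of freedom, so H0 is rejected at the 5% level when $T^{**} > 3.84$.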
Chi-Square Test of independence: Example
                 Yes      No       Total
Black infant     43       3541     3584
White infant     17       3814     3831
Total            60       7355     7415
> x <- c(43, 3541, 17, 3814); n <- 7415
> y <- c(60*3584, 7355*3584, 60*3831, 7355*3831) / n
> sum((x - y)^2 / y)
[1] 13.18662
Since T** > z_{0.025}^2 = 1.96^2 = 3.84, reject H0 at the 5% level
Chi-Square Test: R code
> data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)
> chisq.test(data1)
Pearson's Chi-squared test with Yates' continuity correction
data: data1
X-squared = 12.2615, df = 1, p-value = 0.0004624
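The value reported by `chisq.test` (12.2615) differs from the hand-computed 13.18662 because R applies Yates' continuity correction to 2x2 tables by default. A short sketch showing both versions on the same data; the variable names `with_yates` and `no_yates` are illustrative.

```r
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)

# Default: Yates' continuity correction is applied to 2x2 tables
with_yates <- chisq.test(data1)

# correct = FALSE gives the uncorrected Pearson statistic
no_yates <- chisq.test(data1, correct = FALSE)

unname(with_yates$statistic)  # about 12.26, the value in the output above
unname(no_yates$statistic)    # about 13.19, matching the hand computation
```

Either version leads to rejecting H0 at the 5% level for these data.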
Summary: Tests for Two Independent Binomial RVs
• Small sample: Fisher’s exact test
• Large sample: Three tests
Measures of Effects for Binomial RVs
• While the previous tests allow us to
determine whether there is an association between
two binary variables, they do not provide a
measure of the strength of the association
• Want to estimate the magnitude of the effect
(or summarize the association)
▪ Risk Difference
▪ Relative risk
▪ Odds ratio
1. Risk Difference
• Let
p1 = probability of developing disease
for exposed individuals;
p2 = probability of developing disease
for unexposed individuals
• Risk Difference = p1 – p2
• Relative risk, or Risk Ratio, RR= p1/p2

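2. Relative risk
• Relative risk, or Risk Ratio, RR = p1/p2
• Point estimate: $\widehat{RR} = \hat{p}_1 / \hat{p}_2 = \dfrac{n_{11}/n_1}{n_{21}/n_2}$
• Interval estimation: $\log(\widehat{RR})$ is approximately normally
distributed with mean $\log(RR)$ and variance $\dfrac{n_{12}}{n_{11} n_1} + \dfrac{n_{22}}{n_{21} n_2}$.
Thus a 100%(1-α) CI for RR is
[exp(c-), exp(c+)], where
$c_{\pm} = \log(\widehat{RR}) \pm z_{\alpha/2} \sqrt{\dfrac{n_{12}}{n_{11} n_1} + \dfrac{n_{22}}{n_{21} n_2}}$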
Relative risk
• Disadvantage: RR is constrained by p2, since p1 <= 1
• For example, if p2 = 0.5, then
RR = p1 / p2 <= 1/0.5 = 2;
Similarly, if p2 = 0.8, then
RR = p1 / p2 <= 1/0.8 = 1.25.
3. Odds Ratio
odds = probability of success / probability of failure
odds ratio = odds for the exposed group / odds for the unexposed group
• If the probability of a success = p, then the
odds in favor of success = p / q, where q = 1 - p
• The odds ratio:
$OR = \dfrac{p_1/(1-p_1)}{p_2/(1-p_2)}$
and is estimated by
$\widehat{OR} = \dfrac{n_{11}\, n_{22}}{n_{12}\, n_{21}}$
• The disease-odds ratio is the odds in favor of
disease for the exposed group divided by the
odds in favor of disease for the unexposed group.
Example (Continued)
(ABO hemolytic disease data: 43 of 3584 Black infants and 17 of 3831 White infants had the disease.)
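As an illustration of the three effect measures, the sketch below computes the risk difference, relative risk, and odds ratio for the ABO hemolytic disease counts. The helper `effect_measures` is hypothetical (not from the lecture), and the confidence interval for RR assumes the standard large-sample (delta-method) standard error on the log scale.

```r
# Effect measures for a 2x2 table with counts
#   n11 = exposed & diseased,   n12 = exposed & not diseased
#   n21 = unexposed & diseased, n22 = unexposed & not diseased
effect_measures <- function(n11, n12, n21, n22, conf_level = 0.95) {
  n1 <- n11 + n12
  n2 <- n21 + n22
  p1 <- n11 / n1                       # estimated risk, exposed
  p2 <- n21 / n2                       # estimated risk, unexposed
  z  <- qnorm(1 - (1 - conf_level) / 2)

  rr <- p1 / p2
  # Large-sample standard error of log(RR) (assumed delta-method form)
  se_log_rr <- sqrt(n12 / (n11 * n1) + n22 / (n21 * n2))
  or <- (n11 * n22) / (n12 * n21)

  list(
    risk_difference = p1 - p2,
    relative_risk   = rr,
    rr_ci           = exp(log(rr) + c(-1, 1) * z * se_log_rr),
    odds_ratio      = or
  )
}

# ABO example: Black infants (exposure +) vs. White infants (exposure -)
abo <- effect_measures(43, 3541, 17, 3814)
abo$risk_difference
abo$relative_risk
abo$odds_ratio
```

Because the disease is rare in both groups here, the relative risk and the odds ratio come out close to each other.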
Hypothetical Case-Control Study
Sample:
              Disease +   Disease -
Exposed +     a           b
Exposed -     c           d
Population:
              Disease +   Disease -
Exposed +     A           B
Exposed -     C           D
(Note: this is a case-control study, so the outcome is fixed first: the column totals are fixed, and the row proportions can vary as we trace the cases back.)
Assume we sample random fractions f1, f2 from the
disease + and disease – groups, respectively,
i.e., a = f1 A, c = f1 C, b = f2 B, d = f2 D
Sample RR and Population RR
a = f1 A, c = f1 C, b = f2 B, d = f2 D
Population (actual) relative risk:
$RR = \dfrac{A/(A+B)}{C/(C+D)}$
where A/(A+B) is the probability that an exposed person has the disease and
C/(C+D) is the probability that an unexposed person has the disease.
Sample RR and Population RR
a = f1 A, c = f1 C, b = f2 B, d = f2 D
Relative risk estimated from the case-control sample:
$\widehat{RR} = \dfrac{a/(a+b)}{c/(c+d)} = \dfrac{f_1 A/(f_1 A + f_2 B)}{f_1 C/(f_1 C + f_2 D)}$
Hypothetical Case-Control Study
• The sample RR and the population RR are the same only if f1 = f2, that is, if the
sampling fractions of subjects with disease and
without disease are the same
• This is unlikely in a case-control study since the
usual sampling strategy is to oversample
subjects with disease
Sample OR and Population OR
a = f1 A, c = f1 C, b = f2 B, d = f2 D
Population odds ratio:
$OR = \dfrac{A/B}{C/D} = \dfrac{AD}{BC}$
Sample odds ratio:
$\widehat{OR} = \dfrac{ad}{bc} = \dfrac{(f_1 A)(f_2 D)}{(f_2 B)(f_1 C)} = \dfrac{AD}{BC}$
Hypothetical Case-Control Study: OR
• Thus the odds ratio estimated from our sample is
an unbiased estimate of the odds ratio of the
reference population
• If the disease is rare, i.e., the p_i are very small,
then RR ≈ OR.
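A small numeric check of these two points, using made-up population counts and sampling fractions (all numbers below are hypothetical and chosen only for illustration): the sample odds ratio matches the population odds ratio, while the sample relative risk does not match the population relative risk when f1 and f2 differ.

```r
# Hypothetical population counts (illustrative only, not from the lecture)
A <- 200; B <- 9800    # exposed:   diseased, not diseased
C <- 100; D <- 19900   # unexposed: diseased, not diseased

# Hypothetical sampling fractions: all cases, 1% of non-cases
f1 <- 1.0; f2 <- 0.01
a <- f1 * A; c_ <- f1 * C   # sampled cases
b <- f2 * B; d  <- f2 * D   # sampled controls

rr <- function(a, b, c, d) (a / (a + b)) / (c / (c + d))
or <- function(a, b, c, d) (a * d) / (b * c)

pop_rr <- rr(A, B, C, D); sample_rr <- rr(a, b, c_, d)
pop_or <- or(A, B, C, D); sample_or <- or(a, b, c_, d)

c(pop_rr = pop_rr, sample_rr = sample_rr)  # differ because f1 != f2
c(pop_or = pop_or, sample_or = sample_or)  # agree: f1 and f2 cancel
```

In this hypothetical population the disease is rare in both exposure groups, so the population RR and OR are also close to each other.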
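Estimation of OR
              Disease +   Disease -
Exposed +     a           b
Exposed -     c           d
$\widehat{OR} = \dfrac{ad}{bc}$
$\log(\widehat{OR})$ is approximately normally distributed in large samples (by the CLT), with standard error
$\sqrt{1/a + 1/b + 1/c + 1/d}$, so one can find a confidence interval for log(OR) and exponentiate it.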
Summary: RR and OR
• If there is no association, then RR = OR = 1
(i.e., p1 = p2, hence RR = p1/p2 = 1; also OR = 1 by its formula)
• If RR or OR is greater than 1, the exposed group
has an increased risk of disease
• If RR or OR is less than 1, the unexposed group
has an increased risk of disease
RR or OR?
              Disease +   Disease -
Exposure +    n11         n12
Exposure -    n21         n22

Type of Study                  Column totals   Row totals   Can one estimate RR?          Can one estimate OR?
Cohort (Prospective)           Random          Fixed        Yes                           Yes
Case-Control (Retrospective)   Fixed           Random       No (the estimate is biased)   Yes
Example: Smoking-Perinatal Mortality
• Meyer et al. (1976). All births in 10 Ontario (Canada)
teaching hospitals during 1960-1961. We want to study the
association of perinatal events and maternal smoking
during pregnancy. Data are:
Maternal       Perinatal Mortality
Smoking        Yes       No         Total
Yes            619       20,443     21,062
No             634       26,682     27,316
Total          1,253     47,125     48,378
Smoking-Perinatal Mortality: OR
(Note: this gives only the association between the two variables or groups.)
• We estimate that smoking during pregnancy is
associated with a risk of perinatal
mortality that is about 1.27 times as large.
• Note: We have not concluded that smoking causes
the mortality, only that there is an association.
Smoking-Perinatal Mortality: Tests
• Might there really be no association, with the
estimated RR or OR differing from 1 merely by
chance?
• Test the hypothesis of no association by using
Fisher’s exact test (for small samples) or the
chi-squared test (for large samples).
> x <- matrix(c(619, 634, 20443, 26682), nrow = 2)
> chisq.test(x)
X-squared = 17.7562, df = 1, p-value = 2.511e-05
> fisher.test(x)
p-value = 2.462e-05
Thus the association is statistically significant at the
5% level
Smoking-Perinatal Mortality: CI
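• Now that there is a statistically significant
association, what can one say about the
accuracy of the estimate of OR?
• CI for OR
• First, the 95% CI for log(OR):
$\log(\widehat{OR}) \pm 1.96 \sqrt{\frac{1}{619} + \frac{1}{20443} + \frac{1}{634} + \frac{1}{26682}}$,
or 0.2390 ± 0.1122, or (0.1268, 0.3512)
• Second, the 95% CI for OR is
$(e^{0.1268}, e^{0.3512}) \approx (1.14, 1.42)$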
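The interval can be reproduced in R from the table counts. This is a minimal sketch assuming the standard large-sample standard error of log(OR), sqrt(1/a + 1/b + 1/c + 1/d); the variable names are illustrative.

```r
# Counts from the smoking / perinatal mortality table
a  <- 619; b <- 20443   # smokers:     deaths, survivors
c_ <- 634; d <- 26682   # non-smokers: deaths, survivors

or_hat    <- (a * d) / (b * c_)                    # estimated odds ratio
se_log_or <- sqrt(1 / a + 1 / b + 1 / c_ + 1 / d)  # assumed large-sample SE of log(OR)

z <- qnorm(0.975)                                  # about 1.96
ci_log_or <- log(or_hat) + c(-1, 1) * z * se_log_or
ci_or     <- exp(ci_log_or)                        # back-transform to the OR scale

or_hat
ci_or
```

Note that the code uses the unrounded odds ratio, so `log(or_hat)` is about 0.242 rather than log(1.27) = 0.2390, and the resulting interval differs slightly in the third decimal place from the one computed with the rounded value.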
Summary
• Relative Risk (RR): can be estimated in
cohort studies, but not in case-control studies
• Odds Ratio (OR): can be estimated in both
cohort and case-control studies