Week 01

ISyE6421: Biostatistics

ISYE6421

• Lectures: TR 2:00-3:15pm, IC#109


• Instructor: Dr. Yajun Mei
(Pronounced as “YA-JUNE MAY”)
Email: <ymei@isye.gatech.edu>
Office Hours: after class. Please use Piazza for
class-related questions.

• TA: TBD
My academic pathway

• Undergraduate: Math, Peking Univ., BS in 1996


• Worked as a computer programmer at a Chinese bank,
1996-1998
• Graduate: PhD in Math with a minor in EE, Caltech,
1998-2003 (advisor: Dr. Gary Lorden)
• Postdoc in biostatistics: Fred Hutchinson Cancer
Research Center, Seattle, 2003-Sep 2005
(supervisor: Dr. Sarah Holte)
• New Research Fellow: SAMSI & Duke, Fall 2005
• Joined ISyE at GT in Jan 2006; currently a full
professor.
Lecture 1

Agenda
• Course organization

• Introduction to Biostatistics

• Biostatistical Design of Medical


Studies
More about the Course

• This is an introductory graduate course in Biostatistics.


• Different from a mathematical / statistical course:
this course takes a practical, problem-solving approach,
with a strong emphasis on conceptual understanding

• Recommended books. Notes/slides provided.


• Van Belle, Heagerty, Fisher, and Lumley, “Biostatistics,
a methodology for the health sciences”
• Rosner, “Fundamentals of Biostatistics”
• Pinheiro & Bates, “Mixed-Effects Models in S and S-PLUS”
(more e-books or books on biostatistics and/or R can be found online).
Organization of the Course
Topics:
• Biostatistical Design of Medical Studies
• Software: R (or SAS, BUGS, MATLAB)
• Basic statistical inference
• Categorical Data: OR, RR, Mantel-Haenszel
• Continuous Data: Parametric/Nonparametric
• Review of Linear and Logistic Regression
• Multiple Comparisons
• False Discovery Rate
• Survival analysis: censoring, Kaplan-Meier, Cox.
• Applications (genetics, -omics, ChIP-Seq)
• Longitudinal data analysis
• Sample size calculation for studies
• Brief Introduction to Causal Inference
Course Organization (cont.)
Grading:
• Homework: 20%

• Class participation: 10% (peer comments)

• Midterm: 30% (Thursday, March 10)


• You are allowed to bring 4 double-sided cheat sheets
or 8 one-sided cheat sheets.
• Calculators are allowed.

• Class project on bio or health related topics:


40% (a team of 1-3 students)
➢ Progress report: 2% (due March 17)
➢ Oral presentation: 18% (file due 9am on April 19)
➢ Final written report: 20% (file due April 26)
This Lecture

• Introduction to Biostatistics

• Biostatistical Design of Medical


Studies

What is Statistics?
Examples:
• Parents of a child with a genetic defect consider
whether or not they should have another child
• To choose the best therapy, a physician must
compare the prognosis, or future course, of a
patient under several therapies
• Does smoking cause cancer?
Key elements: uncertainty, variation, inference.
Some Definitions of Statistics:
• “May be regarded as mathematics applied to
observational data … as the study of (1) population; (2)
variation; (3) methods of the reduction of data.” (Fisher)
• “= Uncertainty and Behavior.” (Savage)
• “the art of learning from data.” (Ross)

Why Biostatistics?

• Biostatistics: a field of science that applies


statistical methods to medical and biological
problems

• Why focus on biostatistics (not statistics)?
• Some statistical methods are used more heavily in
biostatistics than in other fields (e.g., survival
analysis)
• To show how to apply stat methods, examples are
drawn from the biological, medical and health care
areas

Goal of this course
Some Biostatistical Problems:
1. Is a new drug more effective in treating an illness than a
previously used drug?
2. Does the use of a seat belt decrease the chance of death
in a car accident?
3. How does the frequency of laboratory tests influence the
quality of medical care?
4. Are growth curves for boys and girls significantly
different?

Objective: you’re expected to be able to


• understand specified stat concepts and procedures
• identify procedures appropriate (and inappropriate) to a
given situation
• carry out appropriate specified stat procedures

Lecture 1

Biostatistical Design of Medical Studies:

• Introduce some of the principles of
biostatistical design. Many of the ideas are
expanded later.

Reminder: statistics is not an end in itself but a


tool to be used in investigating the world
around us.

Types of Studies: I
• Observational Study: collects data from an
existing situation. The data collection does not
intentionally interfere with the running of the
system
(Remark: the act of observation may introduce change
into a system)
• Experiment: a study in which an investigator
deliberately sets one or more factors to a
specific level
(Remark: experiments lead to stronger scientific
inferences than do observational studies, which are
always open to misinterpretation due to a lack of
knowledge in a given field)

Three types of Experiments
• Laboratory experiment: an experiment that takes
place in an environment (called a laboratory) where
experimental manipulation is facilitated

• Comparative experiment: an experiment that


compares two or more techniques, treatments, or levels
of a variable.

• Crossover experiment: an experiment in which the same
experimental unit receives more than one treatment or is
investigated under more than one experimental condition. The
different treatments are given during nonoverlapping time periods.

Remark: A clinical study is one that takes place in the


setting of clinical medicine.

Types of Studies: II

Suppose that we want to study the


characteristics of cases of a disease. There are
two possible approaches:
• Longitudinal study: collects information on
study units over a specified time period.
(repeated measurements per subject)

• Cross-sectional study: collects data on study


units at a fixed time
(drawback: subjects with long disease duration are
over-represented, and between-subject variation is large)
Types of Studies: III

We now turn to the concepts of Prospective
Studies and Retrospective Studies,
usually involving human populations

• Cohort /Prospective Study


• Case-Control /Retrospective Study

Cohort Study / Prospective Study

Cohort: (from www.m-w.com)

1. one of 10 divisions of an ancient Roman


legion
2. a group of warriors or soldiers
3. Band, Group
4. a group of individuals having a statistical
factor (as age or class membership) in
common in a demographic study

Cohort Study / Prospective Study

• A cohort of people is a group of people whose


membership is clearly defined.
▪ All students enrolling at GT for spring 2008
▪ All smokers in the US as of Jan 1, 2008, where a person is
defined to be a smoker if (s)he smoked one or more
cigarettes during the preceding calendar year
• An Endpoint is a clearly defined outcome or
event associated with an experiment or study
unit.
• Presence of a particular disease
• A Cohort (Prospective) study is one in which a
cohort of people is followed for the occurrence
or non-occurrence of specified endpoints or
events or measurements.

Cohort Study: Example
• Important: in a cohort study, information about the risk
factor (exposure) is determined prior to the
observation of disease status.
• A prospective study can also be conducted from existing data
(a historical prospective study)

Example: For each of 49,000 newborn infants, record Apgar
scores at 1 and 5 minutes. For those with a low score (<8
at 5 min), also record Apgar scores at 10, 15, and 20 minutes. All
children are followed to the age of 7 years, when a psychological
and developmental assessment is performed.

• Result: low Apgar scores are a risk factor for


the development of Cerebral Palsy (CP).

Cohort Studies: Advantage

Advantage:
• Strongest observational design for establishing
cause-and-effect relationships
• Very efficient for study of rare exposure
• Clear temporal relationship between exposure
and disease
• can yield information on multiple exposures, or
on multiple outcomes of a particular exposure
• May yield information on incidence of disease

Cohort Studies: disadvantage

Disadvantage:
• Time consuming
• Often require a large sample size
• Expensive
• Not efficient for the study of rare disease
• Losses to follow-up may diminish validity
• Change over time in diagnostic methods may
lead to biased results

Case-Control / Retrospective Studies

• A retrospective study is one in which people
having a particular outcome or endpoint are
identified and studied.
(Note: the cases are a group of people who have a particular disease or meet
given criteria; their past habits and medical history are compared with those
of a control group.)

• A Case-control study selects all cases, usually


of a disease, that meet fixed criteria. A group,
called controls, that serve as a comparison for
the cases is also selected. The cases and
controls are compared with respect to various
characteristics.
• Attempt to relate their prior health habits to their
current disease status

Case-Control : Example

• Cases: all patients with eosinophilia-myalgia
syndrome (EMS) in Minneapolis-St. Paul
• Controls: randomly selected telephone numbers
in the same area
• Subjects were interviewed and asked about potential
risk factors and use of L-tryptophan
• 61 of 63 (97%) case subjects had used L-tryptophan,
but only 101 of 5188 (about 2%) control
subjects had used it.
• Result: after the recall of L-tryptophan in 1990, the
number of EMS cases in the US decreased from
1500 in 1988/1989 to near 0.
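
The exposure proportions quoted in this example can be checked with a few lines of R. This is only an illustrative sketch: the variable names are made up here, and the exposure odds ratio is computed from the counts on the slide, not quoted from the original study.

```r
# Counts from the slide: L-tryptophan use among cases and controls
cases_exposed    <- 61;  cases_total    <- 63
controls_exposed <- 101; controls_total <- 5188

p_cases    <- cases_exposed / cases_total        # proportion of cases exposed
p_controls <- controls_exposed / controls_total  # proportion of controls exposed

# Exposure odds ratio: odds of exposure among cases / odds among controls
odds_cases    <- cases_exposed / (cases_total - cases_exposed)
odds_controls <- controls_exposed / (controls_total - controls_exposed)
or_hat <- odds_cases / odds_controls

round(c(p_cases = p_cases, p_controls = p_controls), 3)
or_hat
```

Because the cases and controls were sampled separately, only the odds ratio (not the risk of EMS itself) can be read off these counts.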

Case-Control Studies: Advantage

Advantage:
• Efficient for the study of rare disease
• Efficient for the study of chronic disease
• Tend to require a smaller sample size than
cohort studies
• Less expensive than cohort studies
• May be completed more rapidly than cohort
studies

Case-Control Studies: Disadvantage

Disadvantage:
• Risk of disease cannot be estimated directly
• Not efficient for the study of rare exposures
• More susceptible to selection bias than cohort
studies
• Information on exposure may be less accurate
than that available in cohort studies

Cohort vs. Case-Control
             Disease +   Disease -
Exposure +     n11         n12       (rows fixed in a cohort study)
Exposure -     n21         n22

• Cohort (Prospective) Study: The totals for (row)
“exposure +” and “exposure –” are fixed, and column
totals will vary depending on the association.
(Note: in a cohort study you fix a group with certain characteristics and observe it over a period of time to see
whether disease occurs. It is prospective, i.e., looking forward: who was exposed is known at the beginning of the
study, but who gets the disease is not.)

• Case-Control (Retrospective) Study: The totals
for (column) “disease +” and “disease –” are fixed, and
row totals are random.
(Note: the study is retrospective, so the columns are fixed: the end result is known, and we look back to check
whether a certain kind of exposure may have caused it, so the exposure totals can vary.)
Summary: various types of studies

• Observational study vs. experiment

• Longitudinal vs. cross-sectional

• Cohort vs. Case-control

Remark: A clinical study is one that takes place


in the setting of clinical medicine.

Steps necessary to perform a study

1. A question or problem area of interest is considered.


This does not involve biostatistics.
2. A study is to be designed to answer the question. Must
consider at least the following:
a. Identify the data to be collected (variables to be
measured, sample size)
b. An appropriate analytical model needs to be
developed for describing and processing the data
c. What inferences does one hope to make from the
study? What conclusions might one draw from the
study? To what population(s) is the conclusion
applicable?
3. The study is carried out and the data are collected
4. The data are analyzed and conclusions and inferences
are drawn
5. The results are used, e.g., publication, plan new study.

Ethical Issues

Some points relevant to experimentation with humans:


1. Investigators are responsible for the conduct of an ethical
study to the extent that they may be expected to know what
is involved in the study
2. In proposed studies involving humans or animals, there
should be review by people not concerned or connected with
the study or the investigators
3. People participating in an experiment should understand and
sign an informed consent form
4. Subjects should be free to withdraw at any time, or to refuse
initial participation, without being penalized or jeopardized
with respect to current and future care and activities
5. When possible, animal studies should be done prior to
human experimentation.

Data Handling, Processing and Analysis
• Data collection: Design of forms (what data are to
be collected, clarity of questions, pretesting of forms
and pilot studies, layout and appearance).
• Data editing and verification (validity check,
consistency check, missing forms)
• Data handling
• Amount of Data collected: sample size
• Inference from a study

Remark: in many studies, 15% of the expense
has been in data handling, processing and
analysis.
Summary

Biostatistical design of medical studies:

• Various Types of Studies


• Steps necessary to perform a study
• Ethics
• Data handling, processing and analysis

Next Topic: 2x2 Contingency Table (cohort vs.


case-control)

Topic: 2x2 Table

• 2x2 Contingency Table


(cohort vs. case-control)

• Technical Notation:
➢ Relative Risk (RR): can be estimated in
cohort studies, but not in case-control
studies
➢ Odds Ratio (OR): can be estimated in both
cohort and case-control studies

Recall: Cohort vs. Case-Control
             Disease +   Disease -
Exposure +     n11         n12
Exposure -     n21         n22

• Cohort (Prospective) Study: The totals for (row)


“exposure +” and “exposure –” are fixed, and column
totals will vary depending on the association.

• Case-Control (retrospective) Study: The totals


for (column) “disease +” and “disease –” are fixed, and
row totals are random.

Comparing Two Proportions

             Disease +   Disease -   Total
Exposure +     n11         n12        n1
Exposure -     n21         n22        n2

Number with disease among the exposed: n11 ~ b(n1, p1),
Number with disease among the unexposed: n21 ~ b(n2, p2)
Want to test the null hypothesis H0: p1 = p2
Example: ABO Hemolytic

• Bucher et al. (1976) studied the occurrence
of hemolytic disease in newborns from ABO
incompatibility between parents (i.e., the
father has antigens that the mother lacks)
• The authors reviewed 7464 consecutive
infants born at North Carolina Hospital
from Oct 1965 to March 1973
• One problem considered in the paper is the
racial differences in the incidence of ABO
hemolytic disease.

Example
                 ABO Hemolytic Disease
                   Yes      No     Total
Black infant        43    3541     3584
White infant        17    3814     3831
(Note: race is the exposure.)

H0: no racial differences in disease rates

Four Methods:
• Small sample: Fisher’s exact test
• Large sample: Three tests
I. Fisher’s exact test

Disease + Disease -

Exposure + n11 n12 n1


Exposure - n21 n22 n2

1. It is for small samples

2. Given the row and column totals, what
is the distribution of n11?

3. Based on the hypergeometric distribution
(Note: the hypergeometric distribution is a discrete probability distribution that describes the probability of k
successes (random draws for which the object drawn has a specified feature) in n draws, without
replacement, from a finite population of size N that contains exactly K objects with that
feature, wherein each draw is either a success or a failure.)
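
To make the link between Fisher's exact test and the hypergeometric distribution concrete, here is a minimal R sketch using the ABO counts from the example. It assumes the usual two-sided rule of summing the probabilities of all tables that are no more likely than the observed one; the variable names are illustrative only.

```r
# ABO hemolytic disease data (rows: Black, White; columns: disease Yes, No)
tab <- matrix(c(43, 17, 3541, 3814), nrow = 2)

n1 <- sum(tab[1, ])   # exposed row total
n2 <- sum(tab[2, ])   # unexposed row total
S  <- sum(tab[, 1])   # total number with disease

# With all margins fixed, n11 is hypergeometric under H0:
# S diseased subjects are "drawn" from n1 exposed and n2 unexposed.
support <- max(0, S - n2):min(S, n1)
probs   <- dhyper(support, m = n1, n = n2, k = S)
p_obs   <- dhyper(tab[1, 1], m = n1, n = n2, k = S)

# Two-sided p-value: total probability of tables no more likely than observed
# (the small relative tolerance guards against floating-point ties)
p_two_sided <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_two_sided
fisher.test(tab)$p.value   # should agree with the manual calculation
```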
Fisher’s exact test: Example
                 ABO Hemolytic Disease
                   Yes      No     Total
Black infant        43    3541     3584
White infant        17    3814     3831

H0: no racial differences in disease rates

data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)  # filled column by column
fisher.test(data1)
p-value = 0.0003712

Reject H0 at 5% level
(Note: rejecting the null hypothesis means there is evidence of a racial difference in disease
rates, i.e., the true odds ratio is not equal to 1: the odds of disease differ between the
two races.)
II. Large Sample Test A

Disease + Disease -

Exposure + n11 n12 n1


Exposure - n21 n22 n2

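“Exposure +” ~ b(n1, p1), “exposure –” ~ b(n2, p2)
(Note: this is a two-sample test for proportions; see online references for details.)

Null Hypothesis H0: p1 = p2

Key idea: compare the sample proportions \hat{p}_1 = n_{11}/n_1 and \hat{p}_2 = n_{21}/n_2,
where the overall disease rate is

\[ \hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{n_{11} + n_{21}}{n_1 + n_2} \]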
Large Sample Test A: Example
                 ABO Hemolytic Disease
                   Yes      No     Total
Black infant        43    3541     3584
White infant        17    3814     3831

H0: no racial differences in disease rates

Since T > z0.025 = 1.96, reject H0 at 5% level
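
The value of T is not recoverable from the slide. As a hedged sketch, the R code below assumes T is the standard pooled two-proportion z statistic, i.e., the difference in sample proportions divided by its estimated standard error under H0, built from the overall disease rate. The variable names are illustrative.

```r
n11 <- 43; n1 <- 3584   # Black infants: number with disease, total
n21 <- 17; n2 <- 3831   # White infants: number with disease, total

p1_hat <- n11 / n1
p2_hat <- n21 / n2
p_hat  <- (n11 + n21) / (n1 + n2)   # pooled (overall) disease rate

# Assumed form of the statistic: difference in proportions over its
# estimated standard error under H0: p1 = p2
T_stat <- (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))

T_stat
T_stat > qnorm(0.975)      # TRUE means reject H0 at the 5% level
2 * pnorm(-abs(T_stat))    # two-sided p-value
```

Under this assumption, the square of T equals the uncorrected Pearson chi-square statistic for the same 2x2 table.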


III. Large Sample Test B

Disease + Disease -

Exposure + n11 n12 n1


Exposure - n21 n22 n2

Under H0

Large Sample Test B: Example

                 ABO Hemolytic Disease
                   Yes      No     Total
Black infant        43    3541     3584
White infant        17    3814     3831

H0: no racial differences in disease rates

Since |T*| > z0.025 = 1.96, reject H0 at 5% level
IV: Chi-Square Test of independence

             Disease +    Disease -    Total
Exposure +     n11          n12         n1
Exposure -     n21          n22         n2
Total        S=n11+n21    F=n12+n22     n

Expected Frequency table

             Disease +    Disease -    Total
Exposure +    n1S/n        n1F/n        n1
Exposure -    n2S/n        n2F/n        n2
Total        S=n11+n21    F=n12+n22     n
IV: Chi-Square Test of independence
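
The formula on this slide did not survive extraction. Assuming it is the usual Pearson statistic, built from the observed counts and the expected frequency table (with S and F the column totals), it would read as follows. The symbol E_{ij} is introduced here for illustration.

```latex
T^{**} \;=\; \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(n_{ij} - E_{ij})^{2}}{E_{ij}},
\qquad E_{i1} = \frac{n_i S}{n}, \quad E_{i2} = \frac{n_i F}{n}.
```

Under H0, T** is approximately chi-square distributed with 1 degree of freedom, so H0 is rejected at the 5% level when T** exceeds 3.84.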

Chi-Square Test of independence: Example

                 ABO Hemolytic Disease
                   Yes      No     Total
Black infant        43    3541     3584
White infant        17    3814     3831
Total               60    7355     7415

H0: no racial differences in disease rates
> x <- c(43, 3541, 17, 3814); n <- 7415   # observed counts, row by row
> y <- c(60*3584, 7355*3584, 60*3831, 7355*3831)/n   # expected counts under H0
> sum((x - y)^2 / y)
[1] 13.18662
Since T** = 13.19 > (z0.025)^2 = 3.84 (the 5% critical value of chi-square with 1 df), reject H0 at 5% level
Chi-Square Test: R code

> data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)
> chisq.test(data1)

        Pearson's Chi-squared test with Yates' continuity correction

data:  data1
X-squared = 12.2615, df = 1, p-value = 0.0004624
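
The value 12.2615 differs from the hand-computed 13.18662 because `chisq.test` applies Yates' continuity correction to 2x2 tables by default. A short sketch showing both versions side by side (the object names `res_plain` and `res_yates` are illustrative):

```r
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)

res_plain <- chisq.test(data1, correct = FALSE)  # no continuity correction
res_yates <- chisq.test(data1)                   # default: Yates' correction

res_plain$statistic   # matches the hand calculation of sum((x - y)^2 / y)
res_yates$statistic   # matches the corrected value printed by chisq.test
res_plain$expected    # expected frequency table under H0
```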

Summary: Tests of Two ind. Bin RV

• Small sample: Fisher’s exact test


• Large sample: Three tests

Measures of Effects for Bin RV

• While the previous tests allow us to
determine whether an association exists between
two binary variables, they do not provide a
measure of the strength of the association

• Want to estimate the magnitude of the effect


(or summarize the association)
▪ Risk Difference
▪ Relative risk
▪ Odds ratio

1. Risk Difference

• Let
p1 = probability of developing disease
for exposed individuals;
p2 = probability of developing disease
for unexposed individuals

• Risk Difference = p1 – p2

• Relative risk, or Risk Ratio, RR= p1/p2


2. Relative risk

• Relative risk, or Risk Ratio, RR= p1/p2


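• Point estimate: the ratio of the sample proportions, (n11/n1) / (n21/n2)
• Interval estimation: the log of the estimated RR is approximately
normally distributed

Thus the 100(1−α)% CI for RR is
[exp(c−), exp(c+)], where c− and c+ are the lower and upper limits of the
100(1−α)% CI for log(RR)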
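
The standard error used on the slide was lost in extraction. The sketch below assumes the standard large-sample (delta-method) standard error of the log relative risk and applies it to the ABO data from the earlier example; the variable names are illustrative.

```r
# ABO hemolytic disease data from the earlier example
n11 <- 43; n12 <- 3541; n1 <- n11 + n12   # exposed group (Black infants)
n21 <- 17; n22 <- 3814; n2 <- n21 + n22   # unexposed group (White infants)

rr_hat <- (n11 / n1) / (n21 / n2)         # point estimate of RR

# Assumed standard error of log(rr_hat) (standard delta-method formula)
se_log_rr <- sqrt(n12 / (n11 * n1) + n22 / (n21 * n2))

alpha   <- 0.05
z       <- qnorm(1 - alpha / 2)
c_minus <- log(rr_hat) - z * se_log_rr    # lower limit on the log scale
c_plus  <- log(rr_hat) + z * se_log_rr    # upper limit on the log scale

rr_hat
exp(c(c_minus, c_plus))                   # 100(1 - alpha)% CI for RR
```

Working on the log scale keeps the interval inside (0, ∞) and makes it asymmetric around the point estimate, which suits a ratio.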

Relative risk

• Disadvantage: RR is constrained by p2, since p1 ≤ 1

• For example, if p2 = 0.5, then
RR = p1 / p2 ≤ 1/0.5 = 2;

Similarly, if p2 = 0.8, then
RR = p1 / p2 ≤ 1/0.8 = 1.25.
3. Odds Ratio
(Note: odds = prob. of success / prob. of failure;
odds ratio = odds of exposed group / odds of unexposed group)

• If the probability of a success = p, then the
odds in favor of success = p / q, where q = 1 − p

• The odds ratio:

\[ OR = \frac{p_1 / (1 - p_1)}{p_2 / (1 - p_2)} \]

and is estimated by

\[ \widehat{OR} = \frac{n_{11}\, n_{22}}{n_{12}\, n_{21}} \]

• The disease-odds ratio is the odds in favor of
disease for the exposed group divided by the
odds in favor of disease for the unexposed group.
Example (Continued)
                 ABO Hemolytic Disease
                   Yes      No     Total
Black infant        43    3541     3584
White infant        17    3814     3831
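
The worked numbers for this slide were lost in extraction. As an illustrative sketch (not the slide's original calculation), the three effect measures can be computed from the table in R; the object names are made up here.

```r
tab <- matrix(c(43, 17, 3541, 3814), nrow = 2,
              dimnames = list(race = c("Black", "White"),
                              disease = c("Yes", "No")))

p1 <- tab["Black", "Yes"] / sum(tab["Black", ])   # risk in exposed group
p2 <- tab["White", "Yes"] / sum(tab["White", ])   # risk in unexposed group

risk_diff  <- p1 - p2
rel_risk   <- p1 / p2
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

round(c(RD = risk_diff, RR = rel_risk, OR = odds_ratio), 4)
```

Because the disease is rare in both groups, the relative risk and the odds ratio come out close to each other.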

Hypothetical Case-Control Study

                         Disease +   Disease -
Sample      Exposed +       a           b
            Exposed -       c           d

                         Disease +   Disease -
Population  Exposed +       A           B
            Exposed -       C           D
(Note: this is a case-control study, so the outcome is fixed first: the column sums are fixed, and the row proportions can vary as we trace back the cases.)

Assume we sample random fractions f1, f2 from the
disease + and disease – groups, respectively,
i.e., a = f1 A, c = f1 C, b = f2 B, d = f2 D
Sample RR and Population RR
                         Disease +   Disease -
Sample      Exposed +       a           b
            Exposed -       c           d

                         Disease +   Disease -
Population  Exposed +       A           B
            Exposed -       C           D

a = f1 A, c = f1 C, b = f2 B, d = f2 D

• Sample proportion with disease among the exposed:
  a/(a+b) = f1 A / (f1 A + f2 B)
• Sample proportion with disease among the unexposed:
  c/(c+d) = f1 C / (f1 C + f2 D)
• Actual (population) relative risk:
  RR = [A/(A+B)] / [C/(C+D)]
• Relative risk estimated from the case-control sample:
  [a/(a+b)] / [c/(c+d)] = [A/(f1 A + f2 B)] / [C/(f1 C + f2 D)]
Hypothetical Case-Control Study

• The sample RR and the population RR are the same only if
f1 = f2, that is, if the sampling fractions of subjects with
and without disease are the same
• This is unlikely in a case-control study since the
usual sampling strategy is to oversample
subjects with disease
Sample OR and Population OR
                         Disease +   Disease -
Sample      Exposed +       a           b
            Exposed -       c           d

                         Disease +   Disease -
Population  Exposed +       A           B
            Exposed -       C           D

a = f1 A, c = f1 C, b = f2 B, d = f2 D

• Population odds ratio: OR = (A/B) / (C/D) = AD / (BC)
• Sample odds ratio: (a/b) / (c/d) = ad / (bc)
  = (f1 A · f2 D) / (f2 B · f1 C) = AD / (BC)
Hypothetical Case-Control Study: OR

• Thus the odds ratio estimated from our sample is an
unbiased estimate of the odds ratio in the
reference population
• If the disease is rare, i.e., the pi are very small,
then RR ≈ OR.
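
A small numeric check can make these two points concrete. The population counts and sampling fractions below are hypothetical, chosen only for illustration: all diseased subjects are sampled (f1 = 1) but only 2% of the non-diseased (f2 = 0.02).

```r
# Hypothetical population counts (assumed for illustration only)
A <- 100; B <- 9900   # exposed:   disease +, disease -
C <- 50;  D <- 9950   # unexposed: disease +, disease -

f1 <- 1      # sampling fraction for the disease + group (cases)
f2 <- 0.02   # sampling fraction for the disease - group (controls)

a <- f1 * A; cc <- f1 * C   # sampled cases (cc avoids masking base::c)
b <- f2 * B; d  <- f2 * D   # sampled controls

pop_rr    <- (A / (A + B)) / (C / (C + D))
sample_rr <- (a / (a + b)) / (cc / (cc + d))

pop_or    <- (A * D) / (B * C)
sample_or <- (a * d) / (b * cc)

c(pop_rr = pop_rr, sample_rr = sample_rr)   # differ when f1 != f2
c(pop_or = pop_or, sample_or = sample_or)   # agree for any f1, f2
```

The sampling fractions cancel in the odds ratio but not in the relative risk, and with a rare disease the population OR stays close to the population RR.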

Estimation of OR
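                         Disease +   Disease -
Sample      Exposed +       a           b
            Exposed -       c           d

The log of the estimated OR is approximately normally distributed
(by the CLT), so a confidence interval for log(OR) can be
computed and then exponentiated to give a confidence interval for OR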
Summary: RR and OR
             Disease +   Disease -   Total
Exposure +     n11         n12        n1
Exposure -     n21         n22        n2

• If no association, then RR = OR = 1
(Note: no association means p1 = p2, hence RR = p1/p2 = 1; also OR = 1 by its formula.)
• If RR or OR is greater than 1, the exposed group
has an increased risk of disease
• If RR or OR is less than 1, the unexposed group
has an increased risk of disease
RR or OR?
             Disease +   Disease -
Exposure +     n11         n12
Exposure -     n21         n22

                               Totals for        Can one estimate the:
Type of Study                  Column   Row      Relative Risk?                Odds Ratio?
Cohort (Prospective)           Random   Fixed    Yes                           Yes
Case-Control (Retrospective)   Fixed    Random   No (the estimate is biased)   Yes
Example: Smoking-Perinatal Mortality

• Meyer et al. (1976). All births in 10 Ontario (Canada)
teaching hospitals during 1960-1961. The goal is to study the
association of perinatal events and maternal smoking
during pregnancy. Data are:

Maternal           Perinatal Mortality
Smoking          Yes        No       Total
Yes              619     20,443     21,062
No               634     26,682     27,316
Total          1,253     47,125     48,378
Smoking-Perinatal Mortality: OR

(Note: this gives only the association between the two variables or groups.)

• Estimated odds ratio: (619 × 26,682) / (20,443 × 634) ≈ 1.27
• We estimate that smoking during pregnancy is
associated with an increased risk of perinatal
mortality that is about 1.27 times as large.
• Note: We have not concluded that smoking causes
the mortality, only that there is an association.
Smoking-Perinatal Mortality: Tests

• Might there really be no association, with the
estimated RR or OR differing from 1 merely by
chance?
• Test the hypothesis of no association by using
Fisher’s exact test (for small samples) or the
chi-squared test (for large samples).
> x <- matrix(c(619, 634, 20443, 26682), nrow = 2)
> chisq.test(x)
X-squared = 17.7562, df = 1, p-value = 2.511e-05
> fisher.test(x)
p-value = 2.462e-05

Thus the association is statistically significant at
the 5% level
Smoking-Perinatal Mortality: CI

• Now that there is a statistically significant
association, what can one say about the
accuracy of the estimate of OR?
• CI for OR
• First, 95% CI for log(OR):
log(1.27) ± 1.96 × sqrt(1/619 + 1/20443 + 1/634 + 1/26682),
or 0.2390 ± 0.1122 or (0.1268, 0.3512)
• Second, the 95% CI for OR is
(exp(0.1268), exp(0.3512)) = (1.14, 1.42)
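
The interval can be reproduced in R. This sketch assumes the usual large-sample (Woolf) standard error for log(OR), the square root of the sum of reciprocal cell counts, which matches the slide's half-width of 0.1122. It uses the unrounded odds ratio, so the centre of the log-scale interval is about 0.242, slightly different from the slide's 0.2390, which appears to come from log(1.27).

```r
# Smoking and perinatal mortality (rows: smoker Yes/No; columns: death Yes/No)
x <- matrix(c(619, 634, 20443, 26682), nrow = 2)

or_hat    <- (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])  # sample odds ratio
se_log_or <- sqrt(sum(1 / x))                           # assumed Woolf SE of log(OR)

half_width <- qnorm(0.975) * se_log_or
ci_log_or  <- log(or_hat) + c(-1, 1) * half_width       # 95% CI for log(OR)
ci_or      <- exp(ci_log_or)                            # 95% CI for OR

or_hat
half_width
ci_or
```

Since the whole interval lies above 1, it agrees with the significant test results on the previous slide.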

Summary

• Relative Risk (RR): can be estimated in


cohort studies, but not in case-control studies

• Odds Ratio (OR): can be estimated in both


cohort and case-control studies
