Week 2: Counting Data
• Categorical data in cohort and case-control studies
• Statistical methods for 2x2 tables, i.e., comparing two proportions
(Fisher’s exact test, 3 large-sample tests, Relative Risk (RR), Odds Ratio (OR))
• Combinations of 2x2 tables (confounders, Mantel-Haenszel method)
Recall: Cohort vs. Case-Control
              Disease +   Disease -
Exposure +    n11         n12
Exposure -    n21         n22
• Cohort (prospective) study: The row totals (“exposure +” and “exposure –”) are fixed, and the column totals vary depending on the association.
• Case-control (retrospective) study: The column totals (“disease +” and “disease –”) are fixed, and the row totals are random.
Comparing Two Proportions
              Disease +   Disease -   Total
Exposure +    n11         n12         n1
Exposure -    n21         n22         n2
Number of diseased in the exposed population: n11 ~ b(n1, p1)
Number of diseased in the unexposed population: n21 ~ b(n2, p2)
We want to test the null hypothesis H0: p1 = p2
• There are four statistical methods available: Fisher’s exact test and 3 large-sample tests.
Example: ABO Hemolytic
• Bucher et al. (1976) studied the occurrence of hemolytic disease in newborns from ABO incompatibility between parents (i.e., the father has antigens that the mother lacks)
• The authors reviewed 7464 consecutive infants born at North Carolina Hospital from Oct 1965 to March 1973
• One problem considered in the paper is the racial difference in the incidence of ABO hemolytic disease.
Example
                ABO Hemolytic Disease
                Yes     No      Total
Black infant    43      3541    3584
White infant    17      3814    3831
H0: no racial differences in disease rates
Four methods:
• Small sample: Fisher’s exact test
• Large sample: three tests
I. Fisher’s exact test
1. It is intended for small samples
2. Given the row and column totals, what is the distribution of n11?
3. It is based on the hypergeometric distribution
Fisher’s exact test: Example
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)  # rows: Black, White; columns: disease Yes, No
fisher.test(data1)
p-value = 0.0003712
Reject H0 at the 5% level
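To make the link between Fisher's exact test and the hypergeometric distribution concrete, the following sketch recomputes the two-sided p-value directly from dhyper() and compares it with fisher.test(). The variable names (m, n, k, support, p_manual) are illustrative choices, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is an assumption about how the p-value is defined; it is the rule R documents for 2x2 tables.

```r
# 2x2 table: rows = Black, White infants; columns = disease Yes, No
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)

m <- sum(data1[, 1])   # column 1 total (infants with disease)
n <- sum(data1[, 2])   # column 2 total (infants without disease)
k <- sum(data1[1, ])   # row 1 total (Black infants)
x <- data1[1, 1]       # observed n11

# All values n11 can take once the margins are fixed, with their probabilities
support <- max(0, k - n):min(k, m)
prob <- dhyper(support, m, n, k)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_obs <- dhyper(x, m, n, k)
p_manual <- sum(prob[prob <= p_obs * (1 + 1e-7)])

p_fisher <- fisher.test(data1)$p.value
print(c(manual = p_manual, fisher.test = p_fisher))
```

The two numbers should agree, which shows that the test needs nothing beyond the conditional distribution of n11 given the margins.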
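II. Large Sample Test A
“Exposure +” ~ b(n1, p1), “exposure –” ~ b(n2, p2)
Null hypothesis H0: p1 = p2
Key idea: under H0,
$T = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \approx N(0, 1)$,
where $\hat{p}_1 = n_{11}/n_1$, $\hat{p}_2 = n_{21}/n_2$, and the overall disease rate is
$\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{n_{11} + n_{21}}{n_1 + n_2}$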
Large Sample Test A: Example
Since |T| > z0.025 = 1.96, reject H0 at the 5% level
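The worked numbers for this example did not survive, so here is a minimal sketch of the calculation, assuming Test A is the usual two-proportion z-test with a pooled standard error (the pooled rate is the "overall disease rate" named on the slide). All variable names are illustrative.

```r
# ABO hemolytic disease counts
n11 <- 43; n1 <- 3584   # Black infants: diseased, total
n21 <- 17; n2 <- 3831   # White infants: diseased, total

p1_hat <- n11 / n1
p2_hat <- n21 / n2
p_hat  <- (n11 + n21) / (n1 + n2)   # overall (pooled) disease rate

# Standard error of p1_hat - p2_hat under H0: p1 = p2
se_pooled <- sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))

T_stat  <- (p1_hat - p2_hat) / se_pooled
p_value <- 2 * pnorm(-abs(T_stat))   # two-sided p-value

print(c(T = T_stat, T_squared = T_stat^2, p_value = p_value))
```

Under this assumption T squared equals the uncorrected chi-square statistic 13.18662 reported for the same data in the chi-square example.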
III. Large Sample Test B
Under H0
Large Sample Test B: Example
Since |T*| > z0.025 = 1.96, reject H0 at the 5% level
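IV: Chi-Square Test of independence
Observed frequency table
              Disease +       Disease -       Total
Exposure +    n11             n12             n1
Exposure -    n21             n22             n2
Total         S = n11+n21     F = n12+n22     n
Expected frequency table
              Disease +       Disease -       Total
Exposure +    n1 S/n          n1 F/n          n1
Exposure -    n2 S/n          n2 F/n          n2
Total         S = n11+n21     F = n12+n22     n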
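IV: Chi-Square Test of independence
Test statistic: $T^{**} = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ are the observed and $E_{ij}$ the expected frequencies; under H0, $T^{**}$ is approximately $\chi^2$ distributed with 1 df.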
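Chi-Square Test of independence: Example
                ABO Hemolytic Disease
                Yes     No      Total
Black infant    43      3541    3584
White infant    17      3814    3831
Total           60      7355    7415
> x <- c(43, 3541, 17, 3814); n <- 7415                  # observed counts, row by row
> y <- c(60*3584, 7355*3584, 60*3831, 7355*3831)/n       # expected counts under H0
> sum((x - y)^2 / y)
[1] 13.18662
Since T** > (z0.025)^2 = 1.96^2 = 3.84, reject H0 at the 5% level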
Chi-Square Test: R code
> data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)
> chisq.test(data1)
Pearson's Chi-squared test with Yates' continuity correction
data: data1
X-squared = 12.2615, df = 1, p-value = 0.0004624
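The hand calculation gives 13.18662 while chisq.test() prints 12.2615. The difference is the Yates continuity correction, which chisq.test() applies to 2x2 tables by default. A short sketch showing both values (variable names are illustrative):

```r
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)

# correct = FALSE switches off the Yates continuity correction
uncorrected <- chisq.test(data1, correct = FALSE)
corrected   <- chisq.test(data1)          # default: correct = TRUE

x2_plain <- unname(uncorrected$statistic) # matches the hand calculation
x2_yates <- unname(corrected$statistic)   # matches the printed R output

print(uncorrected$expected)               # expected frequency table under H0
print(c(plain = x2_plain, yates = x2_yates))
```

Both versions lead to the same decision here because the sample is large.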
Summary: Tests for Two Independent Binomial Random Variables
• Small sample: Fisher’s exact test
• Large sample: three tests
Measures of Effects for Binary Random Variables
• While the previous tests allow us to determine whether an association exists between two binary variables, they do not provide a measure of the strength of the association
• We want to estimate the magnitude of the effect (or summarize the association)
▪ Risk difference
▪ Relative risk
▪ Odds ratio
1. Risk Difference
• Let
p1 = probability of developing disease for exposed individuals;
p2 = probability of developing disease for unexposed individuals
• Risk Difference = p1 – p2
• Relative risk, or Risk Ratio, RR = p1/p2
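2. Relative risk
• Relative risk, or Risk Ratio, RR = p1/p2
• Point estimate: $\widehat{RR} = \hat{p}_1 / \hat{p}_2 = \frac{n_{11}/n_1}{n_{21}/n_2}$
• Interval estimation: $\log(\widehat{RR})$ is approximately normally distributed with mean $\log(RR)$ and variance estimated by $\frac{n_{12}}{n_{11} n_1} + \frac{n_{22}}{n_{21} n_2}$
Thus a 100%(1-α) CI for RR is [exp(c-), exp(c+)], where $c_{\pm} = \log(\widehat{RR}) \pm z_{\alpha/2} \sqrt{\frac{n_{12}}{n_{11} n_1} + \frac{n_{22}}{n_{21} n_2}}$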
Relative risk
• Disadvantage: the range of RR is constrained by p2
• For example, if p2 = 0.5, then RR = p1 / p2 <= 1/0.5 = 2;
similarly, if p2 = 0.8, then RR = p1 / p2 <= 1/0.8 = 1.25.
3. Odds Ratio
• If the probability of a success = p, then the odds in favor of success = p / q, where q = 1-p
• The odds ratio: $OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}$, and is estimated by $\widehat{OR} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
• The disease-odds ratio is the odds in favor of disease for the exposed group divided by the odds in favor of disease for the unexposed group.
Example (Continued)
                ABO Hemolytic Disease
                Yes     No      Total
Black infant    43      3541    3584
White infant    17      3814    3831
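The numerical results of this continued example were lost, so the sketch below recomputes the relative risk and odds ratio for the ABO data. It assumes the standard large-sample standard error for log(RR), sqrt(n12/(n11 n1) + n22/(n21 n2)); variable names are illustrative.

```r
# ABO hemolytic disease: rows = Black, White infants; columns = disease Yes, No
n11 <- 43; n12 <- 3541
n21 <- 17; n22 <- 3814
n1 <- n11 + n12
n2 <- n21 + n22

rr_hat <- (n11 / n1) / (n21 / n2)      # relative risk
or_hat <- (n11 * n22) / (n12 * n21)    # odds ratio

# 95% CI for RR, built on the log scale and transformed back
se_log_rr <- sqrt(n12 / (n11 * n1) + n22 / (n21 * n2))
rr_ci <- exp(log(rr_hat) + c(-1, 1) * qnorm(0.975) * se_log_rr)

print(c(RR = rr_hat, OR = or_hat))
print(rr_ci)
```

Because the disease is rare in both groups, RR and OR come out close to each other (roughly 2.70 and 2.72).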
Hypothetical Case-Control Study
Sample:
              Disease +   Disease -
Exposed +     a           b
Exposed -     c           d
Population:
              Disease +   Disease -
Exposed +     A           B
Exposed -     C           D
Assume we sample random fractions f1 and f2 from the disease + and disease – groups, respectively,
i.e., a = f1 A, c = f1 C, b = f2 B, d = f2 D
Sample RR and Population RR
With a = f1 A, c = f1 C, b = f2 B, d = f2 D:
Sample RR: $\frac{a/(a+b)}{c/(c+d)} = \frac{f_1 A/(f_1 A + f_2 B)}{f_1 C/(f_1 C + f_2 D)} = \frac{A (f_1 C + f_2 D)}{C (f_1 A + f_2 B)}$
Population RR: $\frac{A/(A+B)}{C/(C+D)} = \frac{A (C + D)}{C (A + B)}$
Hypothetical Case-Control Study
• The sample RR and the population RR are the same only if f1 = f2, that is, if the sampling fractions of subjects with disease and without disease are the same
• This is unlikely in a case-control study, since the usual sampling strategy is to oversample subjects with disease
Sample OR and Population OR
With a = f1 A, c = f1 C, b = f2 B, d = f2 D:
Sample OR: $\frac{ad}{bc} = \frac{(f_1 A)(f_2 D)}{(f_2 B)(f_1 C)} = \frac{AD}{BC}$
Population OR: $\frac{AD}{BC}$
Hypothetical Case-Control Study: OR
• The sampling fractions cancel, so the odds ratio estimated from our sample is a valid estimate of the odds ratio in the reference population
• If the disease is rare, i.e., the pi are very small, then RR ≈ OR.
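A small numerical check of this argument, using made-up population counts and sampling fractions (A, B, C, D, f1 and f2 below are hypothetical values chosen only for illustration, not data from the lecture):

```r
# Hypothetical population counts (illustrative only)
A <- 100; B <- 900   # exposed:   disease +, disease -
C <- 50;  D <- 950   # unexposed: disease +, disease -

# Hypothetical sampling fractions: cases are oversampled
f1 <- 0.5   # fraction sampled from the disease + group
f2 <- 0.1   # fraction sampled from the disease - group

a_s <- f1 * A; c_s <- f1 * C   # sampled cases
b_s <- f2 * B; d_s <- f2 * D   # sampled controls

rr_pop  <- (A / (A + B)) / (C / (C + D))
rr_samp <- (a_s / (a_s + b_s)) / (c_s / (c_s + d_s))

or_pop  <- (A * D) / (B * C)
or_samp <- (a_s * d_s) / (b_s * c_s)

print(c(RR_population = rr_pop, RR_sample = rr_samp))
print(c(OR_population = or_pop, OR_sample = or_samp))
```

With these numbers the sample RR differs from the population RR, while the sample OR reproduces the population OR.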
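Estimation of OR
              Disease +   Disease -
Exposed +     a           b
Exposed -     c           d
• Point estimate: $\widehat{OR} = \frac{ad}{bc}$
• $\log(\widehat{OR})$ is approximately normal with standard error $\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$, so a 95% CI for OR is $\exp\left(\log(\widehat{OR}) \pm 1.96 \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right)$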
Summary: RR and OR
• If there is no association, then RR = OR = 1
• If RR or OR is greater than 1, the exposed group has an increased risk of disease
• If RR or OR is less than 1, the unexposed group has an increased risk of disease
RR or OR?
              Disease +   Disease -
Exposure +    n11         n12
Exposure -    n21         n22

                                Totals for              Can one estimate the:
Type of Study                   Column      Row         Relative Risk?   Odds Ratio?
Cohort (Prospective)            Random      Fixed       Yes              Yes
Case-Control (Retrospective)    Fixed       Random      No               Yes
Example: Smoking-Perinatal Mortality
• Meyer et al. (1976): all births in 10 Ontario (Canada) teaching hospitals during 1960-1961. We want to study the association between perinatal events and maternal smoking during pregnancy. The data are:
Maternal      Perinatal Mortality
Smoking       Yes       No        Total
Yes           619       20,443    21,062
No            634       26,682    27,316
Total         1,253     47,125    48,378
Smoking-Perinatal Mortality: OR
$\widehat{OR} = \frac{619 \times 26682}{20443 \times 634} = 1.27$
• We estimate that smoking during pregnancy is associated with an increased risk of perinatal mortality: the odds are 1.27 times as large.
• Note: We have not concluded that smoking causes the mortality, only that there is an association.
Smoking-Perinatal Mortality: Tests
• Might there really be no association, with the estimated RR or OR differing from 1 merely by chance?
• Test the hypothesis of no association by using Fisher’s exact test (for small samples) or the chi-squared test (for large samples).
> x <- matrix(c(619, 634, 20443, 26682), nrow = 2)
> chisq.test(x)
X-squared = 17.7562, df = 1, p-value = 2.511e-05
> fisher.test(x)
p-value = 2.462e-05
Thus the association is statistically significant at the 5% level
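Smoking-Perinatal Mortality: CI
• Now that there is a statistically significant association, what can one say about the accuracy of the estimate of the OR?
• CI for OR
• First, the 95% CI for log(OR): $\log(\widehat{OR}) \pm 1.96 \sqrt{\frac{1}{619} + \frac{1}{20443} + \frac{1}{634} + \frac{1}{26682}}$,
or 0.2390 ± 0.1122, or (0.1268, 0.3512)
• Second, the 95% CI for OR is $(e^{0.1268}, e^{0.3512}) = (1.14, 1.42)$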
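The interval can be reproduced with a few lines of R. This is a sketch that assumes the usual standard error of log(OR), the square root of the sum of reciprocal cell counts; variable names are illustrative. Note that the slide's centre 0.2390 appears to be log(1.27), i.e. the log of the rounded OR; working from the unrounded OR shifts the centre slightly (to about 0.242) but leaves the half-width 0.1122 unchanged.

```r
# Rows: maternal smoking Yes, No; columns: perinatal mortality Yes, No
x <- matrix(c(619, 634, 20443, 26682), nrow = 2)

or_hat <- (x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])

# Standard error of log(OR): sqrt(1/a + 1/b + 1/c + 1/d)
se_log_or  <- sqrt(sum(1 / x))
half_width <- qnorm(0.975) * se_log_or

ci_log <- log(or_hat) + c(-1, 1) * half_width   # 95% CI for log(OR)
ci_or  <- exp(ci_log)                           # 95% CI for OR

print(c(OR = or_hat, log_OR = log(or_hat), half_width = half_width))
print(ci_or)
```

The interval for OR excludes 1, in agreement with the significant chi-squared and Fisher tests.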
Summary
• Relative Risk (RR): can be estimated in cohort studies, but not in case-control studies
• Odds Ratio (OR): can be estimated in both cohort and case-control studies
New Topic: Mantel-Haenszel Method
Combination of 2 x 2 Tables
• The study of association is made in separate
subgroups of the data, where the subgroups
are defined by the third variable, which is
associated with both disease and exposure.
• How to combine the information across tables
to make a single, unifying statement?
• Need to “adjust” for the effect of
“confounding” variables
• Answer: Mantel-Haenszel Method
Example 1
• Consider a study investigating the

relationship between smoking and
aortic stenosis, a narrowing or stricture
of the aorta that impedes the flow of
blood to the body
• Gender is associated with both variables
• Begin analysis by examining the effects
among males and females separately.
Smoking and Aortic Stenosis
Males                            Females
Aortic      Smoker               Aortic      Smoker
Stenosis    Yes    No            Stenosis    Yes    No
Yes         37     25            Yes         14     29
No          24     20            No          19     47
• In both groups, the odds of developing aortic stenosis are higher among smokers than among nonsmokers.
• It is possible that both tables are estimating the same population value
Sum of the Two Tables
Males                            Females
Yes         37     25            Yes         14     29
No          24     20            No          19     47
Pooled:
Aortic      Smoker
Stenosis    Yes    No
Yes         51     54
No          43     67
Smoking and Aortic Stenosis
• Odds of developing aortic stenosis among smokers relative to nonsmokers (regardless of gender): $\widehat{OR} = \frac{51 \times 67}{54 \times 43} = 1.47$
• If the effect of gender is ignored, the strength of the association between smoking and aortic stenosis appears greater than it is for either males or females alone.
• This is an example of Simpson’s paradox, which occurs when a confounder is present
Note: Simpson’s paradox here means that the odds ratio increases when two subgroups are pooled, where the subgroups are defined by a confounding variable.
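The three odds ratios behind this comparison can be checked directly. A minimal sketch (the helper odds_ratio() is a hypothetical convenience function, not part of the lecture):

```r
# Cross-product odds ratio of a 2x2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Rows: aortic stenosis Yes, No; columns: smoker Yes, No (filled column by column)
males   <- matrix(c(37, 24, 25, 20), nrow = 2)
females <- matrix(c(14, 19, 29, 47), nrow = 2)
pooled  <- males + females   # cell-by-cell sum of the two tables

or_m      <- odds_ratio(males)
or_f      <- odds_ratio(females)
or_pooled <- odds_ratio(pooled)

print(round(c(males = or_m, females = or_f, pooled = or_pooled), 2))
```

The pooled OR (about 1.47) is larger than both stratum-specific ORs (about 1.23 and 1.19), which is the pattern described above.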
Example 2: Lung Cancer & Drinking
                   Lung Cancer
Drinking Status    Yes    No      Total
Heavy drinker      33     1667    1700
Nondrinker         27     2273    2300
Total              60     3940    4000
• Heavy drinking seems to be a risk factor for lung cancer
Lung cancer & drinking after controlling for smoking
All subjects:
Drinking      Lung Cancer
Status        Yes    No
Heavy         33     1667
Nondrinker    27     2273

Smokers                              Nonsmokers
Drinking      Lung Cancer            Drinking      Lung Cancer
Status        Yes    No              Status        Yes    No
Heavy         24     776             Heavy         9      891
Nondrinker    6      194             Nondrinker    21     2079
Lung cancer & drinking after controlling for smoking
• Conclusion: After controlling for the confounding variable smoking, we find no relationship between lung cancer and drinking status
• This illustrates that combining tables with no association may produce a pooled table that does show an association
Note: the odds ratio in the combined table is higher than in either of the two separate tables.
• On the other hand, there can be an association within each table that disappears in the pooled data set
Example 3: Confounder “hides” association
Subgroup 1                       Subgroup 2
Exposure    Disease              Exposure    Disease
            +      -                         +      -
+           60     100           +           50     10
-           10     50            -           100    60
Pooled:
Exposure    Disease
            +      -
+           110    110
-           110    110
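Examples 2 and 3 can be verified numerically. The sketch below computes the stratum-specific and pooled odds ratios for both examples (odds_ratio() is a hypothetical helper written for this illustration):

```r
# Cross-product odds ratio of a 2x2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Example 2. Rows: heavy drinker, nondrinker; columns: lung cancer Yes, No
smokers    <- matrix(c(24, 6, 776, 194), nrow = 2)
nonsmokers <- matrix(c(9, 21, 891, 2079), nrow = 2)
ex2 <- c(smokers    = odds_ratio(smokers),
         nonsmokers = odds_ratio(nonsmokers),
         pooled     = odds_ratio(smokers + nonsmokers))

# Example 3. Rows: exposure +, -; columns: disease +, -
sub1 <- matrix(c(60, 10, 100, 50), nrow = 2)
sub2 <- matrix(c(50, 100, 10, 60), nrow = 2)
ex3 <- c(subgroup1 = odds_ratio(sub1),
         subgroup2 = odds_ratio(sub2),
         pooled    = odds_ratio(sub1 + sub2))

print(round(ex2, 3))   # no association within strata, association after pooling
print(round(ex3, 3))   # association within strata, none after pooling
```

In Example 2 both strata have OR = 1 while the pooled OR is about 1.67; in Example 3 both strata have OR = 3 while the pooled OR is exactly 1.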
Control Confounder?
• When is it reasonable to control for a confounder when exploring the relationship between an exposure and disease?
• A confounder is in the causal pathway between exposure and disease if (1) the exposure is causally related to the confounder and (2) the confounder is causally related to disease.
• It is inappropriate to include a (third) variable as a confounder if it is in the causal pathway
• The decision about which confounders are in the causal pathway is made on the basis of biological rather than statistical considerations.
Note: a third variable that is related to both exposure and disease should be controlled as a confounder, unless it lies on the causal pathway from exposure to disease.
Confounder
• Positive confounder is a confounder that either
1. is positively related to both exposure and disease, or
2. is negatively related to both exposure and disease
• Negative confounder is a confounder that either
1. is positively related to disease and negatively related to exposure, or
2. is negatively related to disease and positively related to exposure
• If a positive (negative) confounder exists, the “individual” ORs are lower (greater) than the “pooled” OR.
Note: if the confounder is related to exposure and disease in the same direction (both positive or both negative), it is a positive confounder, and the pooled OR is higher than the individual ORs; for a negative confounder the reverse holds.
What type of Confounder is Gender?
Males                            Females
Aortic      Smoker               Aortic      Smoker
Stenosis    Yes    No            Stenosis    Yes    No
Yes         37     25            Yes         14     29
No          24     20            No          19     47

Gender      Smoker               Gender      Aortic Stenosis
            Yes    No                        Yes    No
Male        61     45            Male        62     44
Female      33     76            Female      43     66
Positive confounder! (Males are more likely than females both to smoke and to have aortic stenosis.)

Pooled:
Aortic      Smoker
Stenosis    Yes    No
Yes         51     54
No          43     67
Positive confounder, since the “pooled” OR is greater!
Note: when the odds ratios differ across groups, the groups should not be combined; treat the two groups separately.
Mantel-Haenszel Method
• Combine the information in a number of 2x2 tables
• Example: Rosenberg, et al. (1998) studied the relationship between the consumption of caffeinated coffee and nonfatal myocardial infarction among adult males under age 55.
Two samples: smokers and nonsmokers
Coffee and Myocardial Infarction
Smokers                              Nonsmokers
Myocardial    Coffee                 Myocardial    Coffee
Infarction    Yes     No             Infarction    Yes     No
Yes           1011    81             Yes           383     66
No            390     77             No            365     123
• In both groups, the odds of suffering a myocardial infarction are greater among coffee drinkers than among non-coffee drinkers.
• Can we make a single overall statement about the relation?
Smokers                              Nonsmokers
Yes           1011    81             Yes           383     66
No            390     77             No            365     123
Pooled:
Myocardial    Coffee
Infarction    Yes     No
Yes           1394    147
No            755     200
• If smoking is a confounder, then we cannot simply sum the two tables. (The data suggest smoking is indeed a confounder.)
Step 1: Test of Homogeneity
• Before combining the information, we must first verify that the population odds ratios are constant across the different strata (subgroups)
• If they are not, it is not beneficial to compute a single summary value for the overall OR. It would be better to treat the data in the different tables as coming from different populations and to report a different OR for each subgroup.
• To see whether ORi is uniform across the g tables, we want to test
H0: OR1 = OR2 = ... = ORg
H1: not all ORs are the same
Note: H0 says that every subgroup has the same OR, so the subgroups can be summarized by a single common OR; H1 says the ORs differ, in which case we do not combine the groups.
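Test of Homogeneity
              Exposure +   Exposure -
Disease +     ai           bi
Disease -     ci           di
• Logarithm of the estimated OR is $y_i = \log(\widehat{OR}_i) = \log\left(\frac{a_i d_i}{b_i c_i}\right)$
• Weight for Table i is $w_i = \left(\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right)^{-1}$
(add 0.5 to each value if some values are 0)
• Weighted average is $Y = \frac{\sum_{i=1}^{g} w_i y_i}{\sum_{i=1}^{g} w_i}$
• Test statistic: $X^2 = \sum_{i=1}^{g} w_i (y_i - Y)^2$
Note: if the value of the test statistic is small, we do not reject the null hypothesis that all strata have the same OR.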
Test of Homogeneity
• Test statistic: $X^2 = \sum_{i=1}^{g} w_i (y_i - Y)^2$
• Why weight?
• The purpose is to weight strata with lower variance (usually the strata with more subjects) more heavily, so that they count more in the average log OR
• Under H0, X2 will be small since the log ORs will be relatively close to each other and to the average log OR.
• Under H0, $X^2 \sim \chi^2$ with df = g-1.
Note: each subgroup has its own log OR and its own weight; each log OR is compared with the weighted average over all subgroups, and the weighted squared differences are summed.
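Test of Homogeneity: Example
Smokers                              Nonsmokers
Myocardial    Coffee                 Myocardial    Coffee
Infarction    Yes     No             Infarction    Yes     No
Yes           1011    81             Yes           383     66
No            390     77             No            365     123
Note: log is the natural logarithm (base e), also in the midterm.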
• The test statistic is X2 = 0.896
• Under H0, X2 is chi-squared distributed with g-1 = 2-1 = 1 df (there are g = 2 strata)
• P-value = $P(\chi^2_1 \ge 0.896) = 0.344$
• We cannot reject H0; the data do not indicate that the population ORs relating coffee consumption and nonfatal myocardial infarction differ for smokers and nonsmokers
• So we assume the ORs for the two subgroups are estimating the same quantity, and use the Mantel-Haenszel method to combine this information
Note: a $\chi^2_{g-1}$ variable is distributed as a sum of g-1 squared independent standard normal variables, $z_1^2 + \dots + z_{g-1}^2$.
Note: not rejecting H0 means the data show no difference among the subgroup ORs, so the subgroups can be combined using the MH method.
R code for the p-value: 1 - pchisq(Xvalue, df)
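The homogeneity test for the coffee data can be sketched in R as follows, assuming the weights are the inverse variances of the log odds ratios, w_i = 1/(1/a_i + 1/b_i + 1/c_i + 1/d_i); this choice reproduces the weights 34.62 and 34.93 quoted later for this example. Variable names are illustrative. Working with unrounded logs gives a statistic of about 0.93 rather than the slide's 0.896, presumably because the slide rounds intermediate values; the conclusion is the same.

```r
# Cell counts per stratum: a = MI & coffee, b = MI & no coffee,
#                          c = no MI & coffee, d = no MI & no coffee
smokers    <- c(a = 1011, b = 81, c = 390, d = 77)
nonsmokers <- c(a = 383,  b = 66, c = 365, d = 123)
strata <- rbind(smokers, nonsmokers)

# Log odds ratio and weight for each stratum
y <- log(strata[, "a"] * strata[, "d"] / (strata[, "b"] * strata[, "c"]))
w <- 1 / rowSums(1 / strata)

Y  <- sum(w * y) / sum(w)      # weighted average log OR
X2 <- sum(w * (y - Y)^2)       # homogeneity statistic
df <- nrow(strata) - 1
p_value <- 1 - pchisq(X2, df)

print(round(c(w, X2 = X2, p_value = p_value), 3))
```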
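Step 2: Summary Odds Ratio
              Exposure +   Exposure -
Disease +     ai           bi
Disease -     ci           di
• The estimated summary OR is $\widehat{OR} = \frac{\sum_{i=1}^{g} a_i d_i / n_i}{\sum_{i=1}^{g} b_i c_i / n_i}$,
where ni = ai+bi+ci+di = total # of observations in the i-th table
• The weighted average of log(OR): $Y = \frac{\sum_{i=1}^{g} w_i y_i}{\sum_{i=1}^{g} w_i}$, with $y_i$ the logarithm of the estimated OR and $w_i$ the weight for table i, as in the test of homogeneity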
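Summary Odds Ratio: Example
Smokers                              Nonsmokers
Myocardial    Coffee                 Myocardial    Coffee
Infarction    Yes     No             Infarction    Yes     No
Yes           1011    81             Yes           383     66
No            390     77             No            365     123
$\widehat{OR} = \frac{1011 \times 77/1559 + 383 \times 123/937}{81 \times 390/1559 + 66 \times 365/937} = 2.18$
• Once differences in smoking status have been taken into account, OR = 2.18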
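A minimal sketch of the summary odds ratio calculation, assuming the Mantel-Haenszel form sum(a_i d_i / n_i) / sum(b_i c_i / n_i). The array layout and variable names are choices made for this illustration.

```r
# 2 x 2 x 2 array: rows = MI Yes/No, columns = coffee Yes/No, layers = strata
tab <- array(c(1011, 390, 81, 77,     # smokers (filled column by column)
               383, 365, 66, 123),    # nonsmokers
             dim = c(2, 2, 2),
             dimnames = list(MI = c("Yes", "No"),
                             Coffee = c("Yes", "No"),
                             Stratum = c("Smokers", "Nonsmokers")))

a_i <- tab[1, 1, ]; b_i <- tab[1, 2, ]
c_i <- tab[2, 1, ]; d_i <- tab[2, 2, ]
n_i <- apply(tab, 3, sum)             # total observations per stratum

or_mh <- sum(a_i * d_i / n_i) / sum(b_i * c_i / n_i)
print(round(or_mh, 2))
```

The result rounds to 2.18, the value reported for this example.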
Note: the next slide is additional information only.
Confidence Interval--- 1
• Estimated summary OR is
• Its estimated standard error is
where
• A 95% CI for summary OR is
62
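Confidence Interval--- 2
• The weighted average of log(OR) is $Y = \frac{\sum_{i=1}^{g} w_i y_i}{\sum_{i=1}^{g} w_i}$
• The estimated standard error of Y is $\widehat{se}(Y) = \frac{1}{\sqrt{\sum_{i=1}^{g} w_i}}$
• A 95% CI for log(OR) is $Y \pm 1.96\, \widehat{se}(Y)$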
CI of Summary Odds Ratio
• In the example, $\widehat{se}(Y) = \frac{1}{\sqrt{w_1 + w_2}} = \frac{1}{\sqrt{34.62 + 34.93}} = 0.120$
• A 95% CI for log(OR) is $Y \pm 1.96 \times 0.120$
• The 95% CI for the summary odds ratio is (1.73, 2.78)
• After adjusting for the effects of smoking, we are 95% confident that men who drink caffeinated coffee have odds of experiencing nonfatal myocardial infarction that are between 1.73 and 2.78 times greater than the odds for men who do not drink coffee.
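The interval can be recomputed from the raw counts. This sketch assumes inverse-variance weights w_i = 1/(1/a_i + 1/b_i + 1/c_i + 1/d_i), which reproduce the weights 34.62 and 34.93 used above; variable names are illustrative.

```r
# Cell counts per stratum: a = MI & coffee, b = MI & no coffee,
#                          c = no MI & coffee, d = no MI & no coffee
strata <- rbind(smokers    = c(a = 1011, b = 81, c = 390, d = 77),
                nonsmokers = c(a = 383,  b = 66, c = 365, d = 123))

y <- log(strata[, "a"] * strata[, "d"] / (strata[, "b"] * strata[, "c"]))
w <- 1 / rowSums(1 / strata)

Y    <- sum(w * y) / sum(w)    # weighted average of the log odds ratios
se_Y <- 1 / sqrt(sum(w))       # its estimated standard error

ci_log <- Y + c(-1, 1) * qnorm(0.975) * se_Y   # 95% CI for log(OR)
ci_or  <- exp(ci_log)                          # 95% CI for the summary OR

print(round(c(Y = Y, se = se_Y), 3))
print(round(ci_or, 2))
```

The back-transformed interval agrees with (1.73, 2.78) and excludes 1.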
Step 3: Test of Association
• To test whether the summary odds ratio is equal to 1
• One way is to simply refer to the 95% CI for the summary odds ratio to see whether it contains the value 1
(this requires log(OR) to be approximately normal)
Note: if the CI contains 1 on the original scale (equivalently, 0 on the log scale), there is no evidence of an association.
• An alternative method to test H0: OR = 1 is to use the chi-square test.
Test of Association
              Exposure +   Exposure -   Total
Disease +     ai           bi           ai+bi
Disease -     ci           di           ci+di
Total         ai+ci        bi+di        ni
• ai = observed value for “exposure + and disease +”
• Ei = expected value = (ai+bi)(ai+ci)/ni
• Variance: $\sigma_i^2 = \frac{(a_i+b_i)(c_i+d_i)(a_i+c_i)(b_i+d_i)}{n_i^2 (n_i - 1)}$; the standard deviation of ai is $\sigma_i$
• Test statistic: $X^2 = \frac{\left[\sum_{i=1}^{g} a_i - \sum_{i=1}^{g} E_i\right]^2}{\sum_{i=1}^{g} \sigma_i^2} \sim \chi^2_1$ (under H0)
• Use this test only if
• Which row or column is designated as first is arbitrary. The test statistic and p-values are the same regardless of the order of rows/columns.
Note: the rows, columns and +/- labels may be rotated (for example on an exam), so check carefully which cell plays the role of ai before calculating.
Test of Association: Example
Smokers                              Nonsmokers
Myocardial    Coffee                 Myocardial    Coffee
Infarction    Yes     No             Infarction    Yes     No
Yes           1011    81             Yes           383     66
No            390     77             No            365     123
• The test statistic is X2 = 43.68
• Under H0, X2 is $\chi^2$ distributed with df = 1, and P-value = $P(\chi^2_1 > 43.68) < 0.001$ (in R: pvalue = 1 - pchisq(43.68, 1))
• Thus we reject H0 of no association between exposure and disease. So after adjusting for differences in smoking status, men who drink caffeinated coffee face a higher risk of nonfatal myocardial infarction
• This is a single study examining the effects of coffee consumption on human health; other studies have reported conflicting results. At the present time it appears that, in moderate amounts, coffee is a fairly safe beverage for otherwise healthy individuals.
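The test of association can be sketched as below, assuming the statistic has no continuity correction and uses the hypergeometric variance with n_i - 1 in the denominator. Computed from unrounded values it comes out near 43.6, slightly below the slide's 43.68 (which presumably reflects rounding of intermediate results). The last lines cross-check against R's built-in mantelhaen.test() with correct = FALSE.

```r
# 2 x 2 x 2 array: rows = MI Yes/No, columns = coffee Yes/No, layers = strata
tab <- array(c(1011, 390, 81, 77,     # smokers
               383, 365, 66, 123),    # nonsmokers
             dim = c(2, 2, 2))

a_i  <- tab[1, 1, ]
row1 <- tab[1, 1, ] + tab[1, 2, ]     # a_i + b_i
row2 <- tab[2, 1, ] + tab[2, 2, ]     # c_i + d_i
col1 <- tab[1, 1, ] + tab[2, 1, ]     # a_i + c_i
col2 <- tab[1, 2, ] + tab[2, 2, ]     # b_i + d_i
n_i  <- row1 + row2

E_i   <- row1 * col1 / n_i                                 # expected a_i under H0
var_i <- row1 * row2 * col1 * col2 / (n_i^2 * (n_i - 1))   # variance of a_i

X2 <- (sum(a_i) - sum(E_i))^2 / sum(var_i)
p_value <- 1 - pchisq(X2, df = 1)

# Cross-check with the built-in Mantel-Haenszel test (no continuity correction)
mh <- mantelhaen.test(tab, correct = FALSE)
print(c(by_hand = X2, mantelhaen = unname(mh$statistic), p_value = p_value))
```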
Summary
• Confounder:
• Positive Confounder
• Negative Confounder
• Mantel-Haenszel Method
1. Test of Homogeneity
2. Summary Odds Ratio
3. Test of Association