
Week 2: Counting Data

• Categorical data in cohort and case-control studies

• Statistical methods for 2x2 tables, i.e., comparing two proportions
  (Fisher’s exact test, 3 large-sample tests,
  Relative Risk (RR), Odds Ratio (OR))

• Combinations of 2x2 tables
  (Confounder, Mantel-Haenszel Method)

1
Recall: Cohort vs. Case-Control
              Disease +   Disease -
Exposure +    n11         n12
Exposure -    n21         n22

• Cohort (prospective) study: the row totals (“exposure +” and “exposure –”) are fixed, and the column totals vary depending on the association.

• Case-control (retrospective) study: the column totals (“disease +” and “disease –”) are fixed, and the row totals are random.

2
Comparing Two Proportions

Disease + Disease - Total

Exposure + n11 n12 n1


Exposure - n21 n22 n2

Disease in exposed population ~ b(n1, p1),


Disease in unexposed ~ b(n2, p2)
Want to test null Hypothesis H0: p1 = p2

• There are four statistical methods available:


Fisher’s exact test and 3 large-sample tests.

3
Example: ABO Hemolytic

• Bucher et al. (1976) studied the occurrence of hemolytic disease in newborns resulting from ABO incompatibility between parents (i.e., the father has antigens that the mother lacks)
• The authors reviewed 7464 consecutive infants born at North Carolina Hospital from October 1965 to March 1973
• One problem considered in the paper is racial differences in the incidence of ABO hemolytic disease.

4
Example
ABO Hemolytic Disease
Total
Yes No
Black infant 43 3541 3584
White infant 17 3814 3831

H0: no racial differences in disease rates

Four Methods:
• Small sample: Fisher’s exact test
• Large sample: Three tests
5
I. Fisher’s exact test

Disease + Disease -

Exposure + n11 n12 n1


Exposure - n21 n22 n2

1. It is for small samples

2. Given the row and column totals, what


is the distribution of n11?

3. based on hyper-geometric distribution

6
Fisher’s exact test: Example
ABO Hemolytic Disease
Total
Yes No
Black infant 43 3541 3584
White infant 17 3814 3831

H0: no racial differences in disease rates

data1 <- matrix(c(43, 17, 3541, 3814), nr=2)


fisher.test(data1)
p-value = 0.0003712

Reject H0 at 5% level
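The hypergeometric idea behind Fisher's exact test can be sketched directly in R. This is an illustration under the usual two-sided convention (sum the probabilities of all tables no more likely than the observed one); the names `tab`, `support`, `probs` and `p_manual` are introduced here and are not from the slides.

```r
# Observed 2x2 table (rows: Black, White infants; columns: disease Yes, No)
tab <- matrix(c(43, 17, 3541, 3814), nrow = 2)

n1 <- sum(tab[1, ])   # row 1 total
n2 <- sum(tab[2, ])   # row 2 total
s  <- sum(tab[, 1])   # column 1 total (all diseased infants)

# With all margins fixed, n11 is hypergeometric under H0:
# s "diseased" labels are spread over n1 + n2 infants; count those landing in row 1
support <- max(0, s - n2):min(s, n1)
probs   <- dhyper(support, m = n1, n = n2, k = s)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_obs    <- dhyper(tab[1, 1], m = n1, n = n2, k = s)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_manual                  # hand-computed p-value
fisher.test(tab)$p.value  # built-in result, for comparison
```

Both numbers should agree with the p-value reported on the slide.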
7
II. Large Sample Test A

Disease + Disease -

Exposure + n11 n12 n1


Exposure - n21 n22 n2

“Exposure +” ~ b(n1, p1),


“exposure –” ~ b(n2, p2)
Null Hypothesis H0: p1 = p2
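Key idea: under H0, compare the two sample proportions using a pooled standard error,

$$T = \frac{\hat p_1 - \hat p_2}{\sqrt{\hat p\,(1-\hat p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \approx N(0,1),$$

where the overall disease rate is

$$\hat p = \frac{n_1\hat p_1 + n_2\hat p_2}{n_1+n_2} = \frac{n_{11}+n_{21}}{n_1+n_2}.$$

8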
Large Sample Test A: Example
ABO Hemolytic Disease
Total
Yes No
Black infant 43 3541 3584
White infant 17 3814 3831

H0: no racial differences in disease rates

Since |T| > Z0.025=1.96, reject H0 at 5% level
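The value of T is not shown in the extracted slide. A minimal sketch, assuming Test A is the two-proportion z-test with the pooled disease rate in the standard error (variable names are illustrative):

```r
n11 <- 43; n1 <- 3584   # Black infants: cases, total
n21 <- 17; n2 <- 3831   # White infants: cases, total

p1 <- n11 / n1
p2 <- n21 / n2
p_pool <- (n11 + n21) / (n1 + n2)   # overall disease rate under H0

# Pooled two-proportion z statistic (assumed form of Test A)
T_stat <- (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

T_stat
abs(T_stat) > qnorm(0.975)   # TRUE means reject H0 at the 5% level
2 * pnorm(-abs(T_stat))      # two-sided p-value
```

Under this assumption, T squared equals the uncorrected Pearson chi-square statistic for the same table, which gives a quick consistency check against the chi-square test.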


9
III. Large Sample Test B

Disease + Disease -

Exposure + n11 n12 n1


Exposure - n21 n22 n2

Under H0

10
Large Sample Test B: Example

ABO Hemolytic Disease


Total
Yes No
Black infant 43 3541 3584
White infant 17 3814 3831

H0: no racial differences in disease rates

Since |T*|> Z0.025=1.96, reject H0 at 5% level


11
IV: Chi-Square Test of independence

Disease + Disease - Total


Exposure + n11 n12 n1
Exposure - n21 n22 n2
Total S=n11+n21 F =n12+n22 n

Expected Frequency table

Disease + Disease - Total


Exposure + n1S/n n1F/n n1
Exposure - n2S/n n2F/n n2
Total S=n11+n21 F =n12+n22 n

12
IV: Chi-Square Test of independence
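
The formula on this slide did not survive extraction. A hedged reconstruction, consistent with the R computation `sum((x - y)^2 / y)` in the example, with observed counts $n_{ij}$ and expected frequencies $E_{ij}$ (the symbol $E_{ij}$ is notation introduced here):

```latex
T^{**} = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(n_{ij} - E_{ij})^2}{E_{ij}},
\qquad E_{i1} = \frac{n_i S}{n}, \quad E_{i2} = \frac{n_i F}{n},
\qquad T^{**} \approx \chi^2_1 \ \text{under } H_0 .
```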

13
Chi-Square Test of independence: Example

ABO Hemolytic Disease


Total
Yes No
Black infant 43 3541 3584
White infant 17 3814 3831
Total 60 7355 7415
H0: no racial differences in disease rates
> x <- c(43, 3541, 17, 3814); n <- 7415              # observed counts, row by row
> y <- c(60*3584, 7355*3584, 60*3831, 7355*3831)/n   # expected counts under H0
> sum((x - y)^2 / y)
[1] 13.18662
Since T** = 13.19 > χ²(1 df, 0.05) = (Z0.025)² = 3.84, reject H0 at 5% level
14
Chi-Square Test: R code

> data1 <- matrix(c(43, 17, 3541, 3814), nr=2)


> chisq.test(data1)

Pearson's Chi-squared test with Yates'


continuity correction

data: data1
X-squared = 12.2615, df = 1, p-value =
0.0004624
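
The value 12.2615 differs from the hand-computed 13.18662 because `chisq.test` applies Yates' continuity correction to 2x2 tables by default. A short sketch showing both versions (the object name `res` is illustrative):

```r
data1 <- matrix(c(43, 17, 3541, 3814), nrow = 2)

# Default: Yates' continuity correction is applied for 2x2 tables
chisq.test(data1)$statistic

# Without the correction: the plain Pearson statistic sum((O - E)^2 / E)
res <- chisq.test(data1, correct = FALSE)
res$statistic
res$expected   # expected frequency table under independence
```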

15
Summary: Tests of Two ind. Bin RV

• Small sample: Fisher’s exact test


• Large sample: Three tests

16
Measures of Effects for Bin RV

• While the previous tests allow us to determine whether an association exists between two binary variables, they do not provide a measure of the strength of the association

• Want to estimate the magnitude of the effect


(or summarize the association)
▪ Risk Difference
▪ Relative risk
▪ Odds ratio

17
1. Risk Difference

• Let
p1 = probability of developing disease
for exposed individuals;
p2 = probability of developing disease
for unexposed individuals

• Risk Difference = p1 – p2

• Relative risk, or Risk Ratio, RR= p1/p2


18
2. Relative risk

• Relative risk, or Risk Ratio, RR= p1/p2


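• Point estimate: $\widehat{RR} = \hat p_1 / \hat p_2 = \dfrac{n_{11}/n_1}{n_{21}/n_2}$
• Interval estimation: $\log(\widehat{RR})$ is approximately normally distributed with
  $\widehat{\mathrm{Var}}[\log(\widehat{RR})] \approx \dfrac{1}{n_{11}} - \dfrac{1}{n_1} + \dfrac{1}{n_{21}} - \dfrac{1}{n_2}$

Thus a 100%(1-α) CI for RR is [exp(c-), exp(c+)], where
  $c_{\pm} = \log(\widehat{RR}) \pm z_{\alpha/2}\sqrt{\dfrac{1}{n_{11}} - \dfrac{1}{n_1} + \dfrac{1}{n_{21}} - \dfrac{1}{n_2}}$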

19
Relative risk

• Disadvantage: RR is constrained by p2 (since p1 <= 1)

• For example, if p2 = 0.5, then


RR = p1 / p2 <= 1/0.5 = 2;

Similarly, if p2 = 0.8, then


RR = p1 / p2 <= 1/0.8 = 1.25.

20
3. Odds Ratio

• If the probability of a success is p, then the odds in favor of success = p / q, where q = 1 - p

• The odds ratio:
  $OR = \dfrac{p_1/q_1}{p_2/q_2} = \dfrac{p_1 q_2}{p_2 q_1}$
  and is estimated by
  $\widehat{OR} = \dfrac{n_{11}\, n_{22}}{n_{12}\, n_{21}}$

• The disease-odds ratio is the odds in favor of disease for the exposed group divided by the odds in favor of disease for the unexposed group.

21
Example (Continued)
ABO Hemolytic Disease
Total
Yes No
Black infant 43 3541 3584
White infant 17 3814 3831
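
The computations for this slide were lost in extraction. A minimal sketch of the three effect measures for the ABO table; the log-scale variance used for the RR interval is the standard large-sample formula and is an assumption about what the course uses, and all variable names are illustrative:

```r
n11 <- 43; n12 <- 3541   # Black infants: disease yes / no
n21 <- 17; n22 <- 3814   # White infants: disease yes / no
n1 <- n11 + n12
n2 <- n21 + n22

p1 <- n11 / n1
p2 <- n21 / n2

rd     <- p1 - p2                       # risk difference
rr     <- p1 / p2                       # relative risk
or_hat <- (n11 * n22) / (n12 * n21)     # odds ratio

# Large-sample 95% CI for RR on the log scale (assumed variance formula)
se_log_rr <- sqrt(1 / n11 - 1 / n1 + 1 / n21 - 1 / n2)
ci_rr <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

round(c(RD = rd, RR = rr, OR = or_hat), 4)
round(ci_rr, 2)
```

Because the disease is rare in both groups, RR and OR come out close to each other.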

22
Hypothetical Case-Control Study

Disease + Disease -
Sample Exposed + a b
Exposed - c d

Disease + Disease -
Population Exposed + A B
Exposed - C D

Assume we sample random fraction f1, f2 from


disease + and disease – groups, respectively.
i.e., a =f1 A, c = f1 C, b = f2 B, d = f2 D

23
Sample RR and Population RR
Disease + Disease -
Sample Exposed + a b
Exposed - c d

Disease + Disease -
Population Exposed + A B
Exposed - C D

a =f1 A, c = f1 C, b = f2 B, d = f2 D
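
The comparison this slide presumably displayed can be written out from the sampling relations above (a reconstruction; the subscripts "sample" and "population" are labels introduced here):

```latex
RR_{\text{sample}} = \frac{a/(a+b)}{c/(c+d)}
  = \frac{f_1 A/(f_1 A + f_2 B)}{f_1 C/(f_1 C + f_2 D)}
  = \frac{A\,(f_1 C + f_2 D)}{C\,(f_1 A + f_2 B)},
\qquad
RR_{\text{population}} = \frac{A/(A+B)}{C/(C+D)} = \frac{A\,(C+D)}{C\,(A+B)} .
```

The sampling fractions cancel only when $f_1 = f_2$.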

24
Hypothetical Case-Control Study

• The sample RR and the population RR are the same only if f1 = f2, that is, if the sampling fractions of subjects with disease and without disease are the same
• This is unlikely in a case-control study, since the usual sampling strategy is to oversample subjects with disease

26
Sample OR and Population OR
Disease + Disease -
Sample Exposed + a b
Exposed - c d

Disease + Disease -
Population Exposed + A B
Exposed - C D

a =f1 A, c = f1 C, b = f2 B, d = f2 D
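
Substituting the sampling relations into the sample odds ratio shows why the sampling fractions drop out (a reconstruction of the lost formula; the subscripts are labels introduced here):

```latex
OR_{\text{sample}} = \frac{a\,d}{b\,c}
  = \frac{(f_1 A)(f_2 D)}{(f_2 B)(f_1 C)}
  = \frac{A\,D}{B\,C}
  = OR_{\text{population}} .
```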

27
Hypothetical Case-Control Study: OR

• Thus the odds ratio estimated from our sample is an unbiased estimate of the odds ratio in the reference population
• If the disease is rare, i.e., the pi are very small, then RR ≈ OR.

29
Estimation of OR
Disease + Disease -
Sample Exposed + a b
Exposed - c d
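
The formulas on this slide were lost. A hedged reconstruction using the standard log-scale (Woolf) interval, which reproduces the half-width 0.1122 quoted later for the smoking example:

```latex
\widehat{OR} = \frac{a\,d}{b\,c}, \qquad
\widehat{se}\left[\log \widehat{OR}\right] = \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}, \qquad
95\%\ \text{CI for } OR:\ \exp\left(\log \widehat{OR} \pm 1.96\,\widehat{se}\left[\log \widehat{OR}\right]\right).
```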

30
Summary: RR and OR
Disease + Disease - Total

Exposure + n11 n12 n1


Exposure - n21 n22 n2

• If no association, then RR=OR=1


• If RR or OR are greater than 1, exposed group
has an increased risk of disease
• If RR or OR are less than 1, unexposed group
has increased risk of disease
31
RR or OR?
Disease + Disease -
Exposure + n11 n12
Exposure - n21 n22

                   Totals for           Can one estimate the:
Type of Study      Column    Row        Relative Risk?   Odds Ratio?
Cohort             Random    Fixed      Yes              Yes
(Prospective)
Case-Control       Fixed     Random     No               Yes
(Retrospective)

32
Example: Smoking-Perinatal Mortality

• Meyer et al. (1976). All births in 10 Ontario (Canada) teaching hospitals during 1960-1961. The aim is to study the association between perinatal events and maternal smoking during pregnancy. Data are:

Maternal Perinatal Mortality


Smoking
Yes No Total
Yes 619 20,443 21,062
No 634 26,682 27,316
Total 1,253 47,125 48,378

33
Smoking-Perinatal Mortality: OR

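• OR = (619 × 26,682) / (634 × 20,443) ≈ 1.27
• We estimate that smoking during pregnancy is associated with an increased risk of perinatal mortality: the odds of perinatal mortality are about 1.27 times as large for mothers who smoked.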
• Note: We have not concluded that smoking causes
the mortality, only that there is an association.

34
Smoking-Perinatal Mortality: Tests

• Might there really be no association, with the estimated RR or OR differing from 1 merely by chance?
• Test the hypothesis of no association by using Fisher’s exact test (for small samples) or the chi-squared test (for large samples).
> x <- matrix(c(619,634,20443,26682), nr=2)
> chisq.test(x)
X-squared = 17.7562, df = 1, p-value = 2.511e-05
> fisher.test(x)
p-value = 2.462e-05

Thus the association is statistically significant at


5% level

35
Smoking-Perinatal Mortality: CI

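• Now that there is a statistically significant association, what can one say about the accuracy of the estimate of OR?
• CI for OR
• First, the 95% CI for log(OR) is
  $\log(\widehat{OR}) \pm 1.96\sqrt{\tfrac{1}{a}+\tfrac{1}{b}+\tfrac{1}{c}+\tfrac{1}{d}}$
  or 0.2390 ± 0.1122 or (0.1268, 0.3512)

• Second, the 95% CI for OR is (exp(0.1268), exp(0.3512)) ≈ (1.14, 1.42)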
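
The interval can be reproduced in R. This sketch assumes the log-scale standard error sqrt(1/a + 1/b + 1/c + 1/d); the names `cc`, `or_hat`, `half` and `ci_or` are illustrative. The centre differs slightly from 0.2390 because the slide appears to take the log of the rounded value 1.27.

```r
a  <- 619; b <- 20443    # smokers: perinatal deaths, survivors
cc <- 634; d <- 26682    # nonsmokers: perinatal deaths, survivors (cc avoids masking c())

or_hat <- (a * d) / (b * cc)                       # estimated odds ratio
se_log <- sqrt(1 / a + 1 / b + 1 / cc + 1 / d)     # se of log(OR)
half   <- qnorm(0.975) * se_log                    # half-width on the log scale

ci_log <- log(or_hat) + c(-1, 1) * half
ci_or  <- exp(ci_log)

round(or_hat, 2)
round(half, 4)
round(ci_or, 2)
```

The interval excludes 1, in line with the significant chi-squared and Fisher tests.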

36
Summary

• Relative Risk (RR): can be estimated in


cohort studies, but not in case-control studies

• Odds Ratio (OR): can be estimated in both


cohort and case-control studies

37
New Topic: Mantel-Haenszel Method

Combination of 2 x 2 Tables
• The study of association is made in separate
subgroups of the data, where the subgroups
are defined by the third variable, which is
associated with both disease and exposure.
• How to combine the information across tables
to make a single, unifying statement?
• Need to “adjust” for the effect of
“confounding” variables
• Answer: Mantel-Haenszel Method

38
Example 1

• Consider a study investigating the


relationship between smoking and
aortic stenosis, a narrowing or stricture
of the aorta that impedes the flow of
blood to the body
• Gender is associated with both variables
• Begin analysis by examining the effects
among males and females separately.

39
Smoking and Aortic Stenosis
Males Females
Aortic Smoker Aortic Smoker
Stenosis Yes No Stenosis Yes No
Yes 37 25 Yes 14 29
No 24 20 No 19 47

• In both groups, the odds of developing Aortic Stenosis


are higher among smokers than among nonsmokers.
• Possible they are estimating the same population value

40
Sum Two tables
Males Females
Aortic Smoker Aortic Smoker
Stenosis Yes No Stenosis Yes No
Yes 37 25 Yes 14 29
No 24 20 No 19 47

Aortic Smoker
Stenosis
Yes No
Yes 51 54
No 43 67

41
Smoking and Aortic Stenosis

• Odds of developing aortic stenosis among smokers relative to nonsmokers (regardless of gender):
  OR = (51 × 67) / (54 × 43) ≈ 1.47
• If the effect of gender is ignored, the strength of the association between smoking and aortic stenosis appears greater than it is for either males or females alone.
• This is an example of Simpson’s paradox, which occurs when a confounder is present
Note: here Simpson's paradox shows up as an odds ratio that increases when the two subgroups, which are separated by a confounder variable, are pooled.
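
The stratum-specific and pooled odds ratios can be checked with a few lines of R (the helper `odds_ratio` and the object names are illustrative, not from the slides):

```r
# Cross-product odds ratio for a 2x2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Rows: aortic stenosis Yes/No; columns: smoker Yes/No
males   <- matrix(c(37, 24, 25, 20), nrow = 2)
females <- matrix(c(14, 19, 29, 47), nrow = 2)
pooled  <- males + females   # summing the tables ignores gender

ors <- sapply(list(males = males, females = females, pooled = pooled), odds_ratio)
round(ors, 2)
```

The pooled value exceeds both stratum values, which is the pattern described above.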

42
Example 2: lung cancer & Drinking

Lung Cancer
Drinking Status Yes No Total
Heavy drinker 33 1667 1700
Nondrinker 27 2273 2300
Total 60 3940 4000

• Heavy drinking seems to be a risk factor of


lung cancer
43
Lung cancer & drinking after controlling smoking

Drinking Lung Cancer


Status Yes No
Heavy 33 1667
Nondrinker 27 2273

Smokers Nonsmokers
Drinking Lung Cancer Drinking Lung Cancer
Status Yes No Status Yes No
Heavy 24 776 Heavy 9 891
Nondrinker 6 194 Nondrinker 21 2079

44
Lung cancer & drinking after smoking

• Conclusion: After controlling for the confounding variable smoking, we find no relationship between lung cancer and drinking status

• This illustrates that combining tables with no association may produce a combined table that shows an association
Note: the odds ratio is higher in the combined table than in the two segregated tables.

• On the other hand, there can be an association within each table that disappears in the pooled data set

45
Example 3: Confounder “hide” association

Subgroup 1 Subgroup 2
Exposure Disease Exposure Disease
+ - + -
+ 60 100 + 50 10
- 10 50 - 100 60

Exposure Disease
+ -
+ 110 110
- 110 110
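
Examples 2 and 3 can be verified numerically. A small sketch (the helper `odds_ratio` and the object names are illustrative):

```r
# Cross-product odds ratio for a 2x2 table
odds_ratio <- function(tab) (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Example 2: rows heavy drinker / nondrinker; columns lung cancer Yes / No
smokers    <- matrix(c(24, 6, 776, 194), nrow = 2)
nonsmokers <- matrix(c(9, 21, 891, 2079), nrow = 2)
ex2 <- c(smokers    = odds_ratio(smokers),
         nonsmokers = odds_ratio(nonsmokers),
         pooled     = odds_ratio(smokers + nonsmokers))

# Example 3: rows exposure + / -; columns disease + / -
sub1 <- matrix(c(60, 10, 100, 50), nrow = 2)
sub2 <- matrix(c(50, 100, 10, 60), nrow = 2)
ex3 <- c(sub1   = odds_ratio(sub1),
         sub2   = odds_ratio(sub2),
         pooled = odds_ratio(sub1 + sub2))

round(ex2, 2)   # no association within strata, association after pooling
round(ex3, 2)   # association within strata, none after pooling
```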
46
Control Confounder?

• When is it reasonable to control for a confounder when exploring the relationship between an exposure and disease?
• A confounder is in the causal pathway between exposure and disease if (1) the exposure is causally related to the confounder and (2) the confounder is causally related to disease.
• It is inappropriate to include a (third) variable as a confounder if it is in the causal pathway
• The decision about which confounder is in the causal pathway is made on the basis of biological rather than statistical considerations.
Note: a variable should not be controlled for as a confounder when the exposure causes it and it in turn causes the disease, i.e., when it lies on the causal pathway.

47
Confounder
• Positive Confounder is a confounder that either
1. is positively related to both exposure and disease, or
2. is negatively related to both exposure and disease
• Negative Confounder is a confounder that either
1. is positively related to disease and negatively related to exposure, or
2. is negatively related to disease and positively related to exposure
• If a positive (negative) confounder exists, the “individual” ORs are lower (greater) than the “pooled” OR.
Note: if a confounder affects exposure and disease in the same direction (both positive or both negative), it is a positive confounder, and the pooled odds ratio is higher than the individual ones; vice versa for a negative confounder.
48
What type of Confounder is Gender?

Males Females
Aortic Smoker Aortic Smoker
Stenosis Yes No Stenosis Yes No
Yes 37 25 Yes 14 29
No 24 20 No 19 47

49
What type of Confounder is Gender?

Males Females
Aortic Smoker Aortic Smoker
Stenosis Yes No Stenosis Yes No
Yes 37 25 Yes 14 29
No 24 20 No 19 47

Gender    Smoker          Gender    Aortic Stenosis
          Yes    No                 Yes    No
Male      61     45       Male      62     44
Female    33     76       Female    43     66

Positive Confounder!
50
What type of Confounder is Gender?

Males Females
Aortic Smoker Aortic Smoker
Stenosis Yes No Stenosis Yes No
Yes 37 25 Yes 14 29
No 24 20 No 19 47

Aortic Smoker
Stenosis
Yes No
Yes 51 54
No 43 67
Positive Confounder since “pooled” OR is greater!
Note: if the odds ratios differ across subgroups, the groups should not be combined; treat the two groups separately.
51
Mantel-Haenszel Method

• Combine the information in a number of


2x2 tables

• Example: Rosenberg, et al. (1998) studied the relationship between the consumption of caffeinated coffee and nonfatal myocardial infarction among adult males under age 55.
  Two samples: smokers and nonsmokers

52
Coffee and Myocardial Infarction

Smokers Nonsmokers
Myocardial Coffee Myocardial Coffee
Infarction Yes No Infarction Yes No
Yes 1011 81 Yes 383 66
No 390 77 No 365 123

• In both groups, the odds of suffering a myocardial infarction


are greater among coffee drinkers than among non-coffee
drinkers.
• Can we make a single overall statement about the relation?

53
Coffee and Myocardial Infarction

Smokers Nonsmokers
Myocardial Coffee Myocardial Coffee
Infarction Yes No Infarction Yes No
Yes 1011 81 Yes 383 66
No 390 77 No 365 123

Myocardial Coffee
Infarction Yes No
Yes 1394 147
No 755 200

• If smoking is a confounder, then we cannot simply sum two


tables. (Data suggests smoking is indeed a confounder)
54
Mantel-Haenszel Method
Step 1: Test of Homogeneity
• Before combining the information, we must first verify that the population odds ratios are constant across the different strata (subgroups)
• If they are not, it is not beneficial to compute a single summary value for the overall OR. It would be better to treat the data in different tables as coming from different populations and report a different OR for each subgroup.
• To see whether ORi is uniform across g tables, we want to test

  H0: OR1 = OR2 = ... = ORg

  H1: not all ORs are the same

Note: H0 means the individual OR for each subgroup equals the combined (total) OR, i.e., no confounder variable separates these subgroups. H1 means the opposite, in which case we do not combine the groups.
55
Test of Homogeneity
Exposure + Exposure -
Disease + ai bi
Disease - ci di

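• Logarithm of estimated OR is $y_i = \log(\widehat{OR}_i) = \log\left(\dfrac{a_i d_i}{b_i c_i}\right)$

• Weight for Table i is $w_i = \dfrac{1}{\frac{1}{a_i}+\frac{1}{b_i}+\frac{1}{c_i}+\frac{1}{d_i}}$
  (add 0.5 to each value if some values are 0)

• Weighted average is $Y = \dfrac{\sum_{i=1}^{g} w_i y_i}{\sum_{i=1}^{g} w_i}$

• Test statistic: $X^2 = \sum_{i=1}^{g} w_i (y_i - Y)^2$

Note: if the value of the test statistic is small, we do not reject the null hypothesis that all subgroups have the same OR.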
56
Test of Homogeneity
• Test statistic: $X^2 = \sum_{i=1}^{g} w_i (y_i - Y)^2$, where $y_i$ is the log OR of table i, $w_i$ its weight, and $Y$ the weighted average log OR

• Why weight?

• The purpose is to weight strata with lower variance (usually corresponding to strata with more subjects) more heavily
Note: the odds ratio from a subgroup with more subjects should get more weight in the average OR calculation.

• Under H0, X2 will be small since the log ORs will be relatively close to each other and to the average log OR.
• Under H0, X2 ~ χ² with df = g-1.
Note: each subgroup has its own OR and its own weight; each log OR is compared against the weighted average over all subgroups, and the weighted sum of the squared differences is calculated.
57
Test of Homogeneity: Example

Smokers Nonsmokers
Myocardial Coffee Myocardial Coffee
Infarction Yes No Infarction Yes No
Yes 1011 81 Yes 383 66
No 390 77 No 365 123
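
The worked numbers for this example were lost in extraction. A minimal sketch of the homogeneity test, assuming y_i = log(a_i d_i / (b_i c_i)) and w_i = 1 / (1/a_i + 1/b_i + 1/c_i + 1/d_i), which reproduces the weights 34.62 and 34.93 quoted later; object names are illustrative:

```r
# Rows: myocardial infarction Yes/No; columns: coffee Yes/No
smokers    <- matrix(c(1011, 390, 81, 77), nrow = 2)
nonsmokers <- matrix(c(383, 365, 66, 123), nrow = 2)
strata <- list(smokers = smokers, nonsmokers = nonsmokers)

# Stratum log odds ratios and inverse-variance weights
y <- sapply(strata, function(tab) log(tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1])))
w <- sapply(strata, function(tab) 1 / sum(1 / tab))

Y  <- sum(w * y) / sum(w)      # weighted average log OR
X2 <- sum(w * (y - Y)^2)       # homogeneity statistic
p  <- 1 - pchisq(X2, df = length(strata) - 1)

round(w, 2)
round(c(Y = Y, X2 = X2, p = p), 3)
```

With unrounded intermediates the statistic comes out near 0.93 rather than the 0.896 quoted on the next slide, which appears to use rounded log odds ratios; the conclusion (do not reject H0) is the same.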

Note: log here is the natural logarithm (base e); this also applies on the midterm.

58
Coffee and Myocardial Infarction

• The test statistic X2 = 0.896
• Under H0, X2 is chi-squared distributed with g-1 = 2-1 = 1 df (g = 2 strata, so one degree of freedom)
• P-value = P(χ² ≥ 0.896) = 0.344
• We cannot reject H0; the data do not indicate that the population ORs relating coffee consumption and nonfatal myocardial infarction differ for smokers and nonsmokers
• So we assume the ORs for the two subgroups are estimating the same quantity, and use the Mantel-Haenszel method to combine this information
Note: a χ² variable with g-1 df is a sum of g-1 squared standard normals, X2 = z1^2 + ... + z(g-1)^2.
Note: not rejecting H0 means the data show no difference between the subgroup ORs, and hence the subgroups can be combined using the MH method.
Note: R code for the p-value: 1 - pchisq(Xvalue, df)
59
Mantel-Haenszel Method
Exposure + Exposure -
Disease + ai bi
Disease - ci di

Step 2: Summary Odds Ratio


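• The estimated summary OR is
  $\widehat{OR} = \dfrac{\sum_{i=1}^{g} a_i d_i / n_i}{\sum_{i=1}^{g} b_i c_i / n_i}$
  where ni = ai+bi+ci+di = total # of observations in i-th table

• The weighted average of log(OR): $Y = \dfrac{\sum_{i=1}^{g} w_i y_i}{\sum_{i=1}^{g} w_i}$, with $y_i$ and $w_i$ the stratum log odds ratios and weights from the test of homogeneity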

60
Summary Odds Ratio: Example

Smokers Nonsmokers
Myocardial Coffee Myocardial Coffee
Infarction Yes No Infarction Yes No
Yes 1011 81 Yes 383 66
No 390 77 No 365 123

• Once differences in smoking status have been


taken into account, OR = 2.18
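
The summary value can be reproduced in R. This sketch computes the Mantel-Haenszel estimate by hand and cross-checks it against `mantelhaen.test` from the stats package; the names `arr`, `n_i` and `or_mh` are illustrative:

```r
# 2 x 2 x 2 array: rows MI Yes/No, columns coffee Yes/No, third index = stratum
arr <- array(c(1011, 390, 81, 77,      # smokers
               383, 365, 66, 123),     # nonsmokers
             dim = c(2, 2, 2))

n_i <- apply(arr, 3, sum)              # stratum totals n_i

# Mantel-Haenszel summary odds ratio: sum(a_i d_i / n_i) / sum(b_i c_i / n_i)
or_mh <- sum(arr[1, 1, ] * arr[2, 2, ] / n_i) /
         sum(arr[1, 2, ] * arr[2, 1, ] / n_i)
round(or_mh, 2)

# Cross-check against the common odds ratio reported by the built-in test
unname(mantelhaen.test(arr)$estimate)
```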
61
Note: this slide is additional information only.

Confidence Interval--- 1

• Estimated summary OR is

• Its estimated standard error is

where

• A 95% CI for summary OR is

62
Confidence Interval--- 2

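• The weighted average of log(OR) is $Y = \dfrac{\sum_i w_i y_i}{\sum_i w_i}$

• The estimated standard error of Y is $\widehat{se}(Y) = \dfrac{1}{\sqrt{\sum_i w_i}}$

• A 95% CI for log(OR) is $Y \pm 1.96\,\widehat{se}(Y)$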

63
CI of Summary Odds Ratio

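• In the example,
  $\widehat{se}(Y) = \dfrac{1}{\sqrt{w_1 + w_2}} = \dfrac{1}{\sqrt{34.62 + 34.93}} = 0.120$

• A 95% CI for log(OR) is 0.786 ± 1.96(0.120), or (0.551, 1.021)

• The 95% CI for the summary odds ratio is (exp(0.551), exp(1.021)) = (1.73, 2.78)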

• After adjusting for the effects of smoking, we are 95%


confident that men who drink caffeinated coffee have
odds of experiencing nonfatal myocardial infarction that
are between 1.73 and 2.78 times greater than the odds
for men who do not drink coffee.
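
A short R sketch of this interval, assuming the inverse-variance weights w_i = 1 / (1/a_i + 1/b_i + 1/c_i + 1/d_i) that give 34.62 and 34.93 above (object names are illustrative):

```r
smokers    <- matrix(c(1011, 390, 81, 77), nrow = 2)
nonsmokers <- matrix(c(383, 365, 66, 123), nrow = 2)
strata <- list(smokers, nonsmokers)

# Stratum log odds ratios and their weights
y <- sapply(strata, function(tab) log(tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1])))
w <- sapply(strata, function(tab) 1 / sum(1 / tab))

Y    <- sum(w * y) / sum(w)                 # weighted average log OR
se_Y <- 1 / sqrt(sum(w))                    # its estimated standard error
ci   <- exp(Y + c(-1, 1) * 1.96 * se_Y)     # 95% CI back on the OR scale

round(se_Y, 3)
round(ci, 2)
```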

64
Mantel-Haenszel Method
Step 3: Test of Association
• To test whether the summary odds ratio is equal to 1
Note: OR = 1 on the original scale (0 on the log scale) means there is no association.

• One way is to simply refer to the 95% CI for


the summary odds ratio to see whether it
contains the value 1.
(need log(OR) is approximately normal)

• An alternative method to test H0: OR =1 is to


use the chi-square test.

65
Test of Association
Exposure + Exposure - Total
Disease + ai bi ai+bi
Disease - ci di ci+di
Total ai + ci bi+di ni

• ai = Observed value for “exposure + and disease +”


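• Ei = expected value = (ai+bi)(ai+ci)/ni
• variance: $\sigma_i^2 = \dfrac{(a_i+b_i)(c_i+d_i)(a_i+c_i)(b_i+d_i)}{n_i^2\,(n_i-1)}$
  Standard deviation of ai is $\sigma_i$
• Test statistic:
  $X^2 = \dfrac{\left[\sum_i a_i - \sum_i E_i\right]^2}{\sum_i \sigma_i^2} \sim \chi^2_1$ (under H0)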

66
Test of Association
Exposure + Exposure - Total
Disease + ai bi ai+bi
Disease - ci di ci+di
Total ai + ci bi+di ni

• Test statistic:

• Use this test only if

• Which row or column is designated as first is arbitrary. The test statistic and p-values are the same regardless of the order of rows/columns.
Note: the labels and the +/- ordering may be rotated in an exam, so check how the rows and columns are arranged before calculating.
67
Test of Association: Example
Smokers Nonsmokers
Myocardial Coffee Myocardial Coffee
Infarction Yes No Infarction Yes No
Yes 1011 81 Yes 383 66
No 390 77 No 365 123
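
The calculation for this example was lost in extraction. A minimal sketch of the Mantel-Haenszel test of association without continuity correction, using E_i = (a_i+b_i)(a_i+c_i)/n_i and the hypergeometric variance; the names are illustrative, and `mantelhaen.test` from the stats package serves as a cross-check:

```r
# 2 x 2 x 2 array: rows MI Yes/No, columns coffee Yes/No, third index = stratum
arr <- array(c(1011, 390, 81, 77, 383, 365, 66, 123), dim = c(2, 2, 2))

a_i  <- arr[1, 1, ]
row1 <- arr[1, 1, ] + arr[1, 2, ]   # a_i + b_i
row2 <- arr[2, 1, ] + arr[2, 2, ]   # c_i + d_i
col1 <- arr[1, 1, ] + arr[2, 1, ]   # a_i + c_i
col2 <- arr[1, 2, ] + arr[2, 2, ]   # b_i + d_i
n_i  <- row1 + row2

E_i <- row1 * col1 / n_i                               # expected a_i under H0
V_i <- row1 * row2 * col1 * col2 / (n_i^2 * (n_i - 1)) # variance of a_i under H0

X2 <- (sum(a_i) - sum(E_i))^2 / sum(V_i)   # no continuity correction
p  <- 1 - pchisq(X2, df = 1)

# Cross-check with the built-in test (continuity correction switched off)
mh <- mantelhaen.test(arr, correct = FALSE)
c(manual = X2, builtin = unname(mh$statistic))
```

With unrounded intermediates the statistic is about 43.58; the 43.68 on the next slide appears to come from rounding the expected values and variances first. Either way the p-value is far below 0.001.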

68
Coffee and Myocardial Infarction
• The test statistic
  X2 = 43.68

• Under H0, X2 is χ² distributed with df=1, and
  P-value = P(χ² > 43.68) < 0.001
  Note: in R, pvalue = 1 - pchisq(43.68, 1)
• Thus we reject H0 of no association between exposure and disease. So after adjusting for differences in smoking status, men who drink caffeinated coffee face a higher risk of nonfatal myocardial infarction
• This is a single study examining the effects of
coffee consumption on human health; other
studies have reported conflicting results. At
present time it appears that, in moderate
amounts, coffee is a fairly safe beverage for
otherwise healthy individuals.

69
Summary

• Confounder:
• Positive Confounder
• Negative Confounder
• Mantel-Haenszel Method
1. Test of Homogeneity
2. Summary Odds Ratio
3. Test of Association

70
