
Analysis of nominal data case study: blood type and cancer

Dr Alberto Corrias

Department of Biomedical Engineering. National University of Singapore

BN2102 Bioengineering Data Analysis Analysis of nominal data case study: blood type and cancer 1 / 42
ABO blood types


Diagram by InvictaHOG, in the public domain.

ABO blood types

Link to video: https://youtu.be/L06TJTMVkBo

ABO blood types by country


Image credit: Rick Wicklin (SAS)
ABO blood types and cancer: a long history

Our case study

Two main studies


Wolpin BM et al. J Natl Cancer Inst; 101(6): 424–431. 2009
Dandona M et al. J Natl Cancer Inst; 102(2): 135–137. 2010

Data collection (1996-2005)

Blood type and pancreatic cancer


Blood type Healthy Pancreatic cancer Total
O 46,238 91 46,329
A 38,667 118 38,785
B 8,464 33 8,497
AB 14,152 56 14,208
Total 107,521 298 107,819

4X2 contingency table (4 rows and 2 columns)

Our objective

What we want to test


H0 : incidence of pancreatic cancer is the same in all blood types.
H1 : incidence of pancreatic cancer is NOT the same in all blood
types.

Another way of expressing it


H0 : the distribution of blood types among pancreatic cancer
patients is the same as for healthy individuals.
H1 : the distribution of blood types among pancreatic cancer
patients is NOT the same as for healthy individuals.

To address the issue we will conduct a χ2 test (see theory).

Building the "expected" table (if H0 was true)

Blood type and pancreatic cancer


Blood type Healthy Pancreatic cancer Total
O 46,238 91 46,329
A 38,667 118 38,785
B 8,464 33 8,497
AB 14,152 56 14,208
Total 107,521 298 107,819

1 The overall incidence of pancreatic cancer is 298/107,819 = 0.002764 (0.2764%).
2 If H0 was true, 0.2764% of the individuals in each blood group would be expected to have pancreatic cancer†
   O group: 46,329 ∗ 0.002764 = 128.05
   A group: 38,785 ∗ 0.002764 = 107.20
   B group: 8,497 ∗ 0.002764 = 23.49
   AB group: 14,208 ∗ 0.002764 = 39.27

† Values are rounded to 2 decimal places.

Building the "expected" table (i.e., if H0 was true)
Blood type and pancreatic cancer
Blood type Healthy Pancreatic cancer Total
O 46,238 91 46,329
A 38,667 118 38,785
B 8,464 33 8,497
AB 14,152 56 14,208
Total 107,521 298 107,819

Expected incidence
Blood type   Healthy                        Pancreatic cancer   Total
O            46,329 − 128.05 = 46,200.95    128.05              46,329
A            38,785 − 107.20 = 38,677.80    107.20              38,785
B            8,497 − 23.49 = 8,473.51       23.49               8,497
AB           14,208 − 39.27 = 14,168.73     39.27               14,208
Total        107,521                        298                 107,819
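
The construction of the expected table can be sketched in a few lines of Python. This is only an illustration using the observed counts from the slides; the function name `expected_table` and the dictionary layout are assumptions made for this example, not part of the lecture material.

```python
# Observed counts from the lecture table: (healthy, cancer) per blood type.
observed = {
    "O": (46238, 91),
    "A": (38667, 118),
    "B": (8464, 33),
    "AB": (14152, 56),
}

def expected_table(table):
    """Expected counts under H0: row total * column total / grand total."""
    row_totals = {k: sum(v) for k, v in table.items()}
    n_cols = len(next(iter(table.values())))
    col_totals = [sum(v[j] for v in table.values()) for j in range(n_cols)]
    grand_total = sum(col_totals)
    return {
        k: tuple(row_totals[k] * c / grand_total for c in col_totals)
        for k in table
    }

expected = expected_table(observed)
for blood_type, (healthy, cancer) in expected.items():
    print(f"{blood_type:>2}: healthy {healthy:10.2f}  cancer {cancer:7.2f}")
```

Because the sketch uses the unrounded incidence 298/107,819, the last digit can differ from values obtained with the rounded rate 0.002764 (for example 23.48 rather than 23.49 for type B).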
About the "expected" table

When describing the table of expected values, it is often said that this table is not real, only virtual. Why? Because...
1 It can contain negative numbers
2 It can contain non-integer numbers
3 It only contains integer numbers

Answer: because it can contain numbers that are not integers. In our example, it is impossible to have 128.05 individuals! The table is only useful for calculation purposes.

Building the "expected" table: practical tip

Blood type and pancreatic cancer
Type    Healthy   Cancer   Total
O       46,238    91       Trow1
A       38,667    118      Trow2
B       8,464     33       Trow3
AB      14,152    56       Trow4
Total   Tcol1     Tcol2    Ttotal

In practice, you do not need to compute the percentage. Each element in the expected table follows the rule

$$E_{\text{row},\text{col}} = \frac{T_{\text{row}}\, T_{\text{col}}}{T_{\text{total}}}$$

Expected incidence
Type    Healthy   Cancer   Total
O       E1,1      E1,2     Trow1
A       E2,1      E2,2     Trow2
B       E3,1      E3,2     Trow3
AB      E4,1      E4,2     Trow4
Total   Tcol1     Tcol2    Ttotal
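
As a quick check that the rule agrees with the percentage method, here is the type O / cancer cell worked out with the totals from the observed table (the cell index follows the E1,2 labelling used above):

$$E_{1,2} = \frac{T_{\text{row}\,1}\, T_{\text{col}\,2}}{T_{\text{total}}} = \frac{46{,}329 \times 298}{107{,}819} = 46{,}329 \times \frac{298}{107{,}819} \approx 128.05$$

The last form is the group size multiplied by the overall incidence, which is exactly the percentage calculation.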

Building the χ2 statistic

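$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected \# in table cell})^2}{\text{Expected \# in table cell}} = \sum \frac{(O - E)^2}{E}$$

In our case there are 8 cells (4 rows, 2 columns)

$$\begin{aligned}
\chi^2 = {} & \frac{(O_{1,1} - E_{1,1})^2}{E_{1,1}} + \frac{(O_{2,1} - E_{2,1})^2}{E_{2,1}} + \frac{(O_{3,1} - E_{3,1})^2}{E_{3,1}} + \frac{(O_{4,1} - E_{4,1})^2}{E_{4,1}} \\
& + \frac{(O_{1,2} - E_{1,2})^2}{E_{1,2}} + \frac{(O_{2,2} - E_{2,2})^2}{E_{2,2}} + \frac{(O_{3,2} - E_{3,2})^2}{E_{3,2}} + \frac{(O_{4,2} - E_{4,2})^2}{E_{4,2}} = 22.9
\end{aligned}$$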
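
The sum over the 8 cells can be sketched as follows. This is a minimal illustration with the observed counts from the slides; the helper name `chi_square` is an assumption for this example.

```python
# Observed (healthy, cancer) counts for blood types O, A, B, AB.
observed = [(46238, 91), (38667, 118), (8464, 33), (14152, 56)]

def chi_square(table):
    """Sum of (O - E)^2 / E over all cells, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

chi2 = chi_square(observed)
print(f"chi2 = {chi2:.2f}")
```

Almost all of the statistic comes from the cancer column, because the expected counts there are small compared with the healthy column.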

About the χ2 statistic
Imagine we collect the following data
Observed incidence
Type Healthy Cancer Total
O 100 50 150
A 100 50 150
B 100 50 150
AB 100 50 150
Total 400 200 600

What is the value of the χ2 statistic?


0
1
600

Answer: 0. Explanation: the expected table is identical to the
observed table.

Finding the χ2crit

If we choose α = 0.05, the χ2 table gives χ2crit = 7.8.
Note that the number of degrees of freedom (dof) is (4-1)(2-1) = 3.
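
Instead of reading the table, the critical value can be approximated numerically. The sketch below uses only the Python standard library and relies on the closed-form upper-tail area of the χ2 distribution with 3 dof; the function names are assumptions for this example, and the bisection search is one simple way to invert the tail area, not the method used in the lecture.

```python
import math

def chi2_sf_3dof(x):
    """Upper-tail area P(X > x) for a chi-square variable with 3 dof."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def chi2_crit_3dof(alpha, lo=0.0, hi=100.0):
    """Find x such that P(X > x) = alpha by bisection (the tail area decreases with x)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf_3dof(mid) > alpha:
            lo = mid  # tail still too large: move right
        else:
            hi = mid  # tail too small: move left
    return (lo + hi) / 2

crit = chi2_crit_3dof(0.05)
print(f"chi2_crit = {crit:.3f}")
```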

Performing the χ2 test

[Figure: χ2 distribution with 3 degrees of freedom. The no-rejection region lies to the left of χ2crit and the rejection region to its right; the purple area to the right of the computed χ2 is the p-value.]

We found χ2 = 22.9 and χ2crit = 7.8, hence χ2 > χ2crit and we reject the NULL hypothesis. As a consequence, we already know that p < 0.05†.

† The exact p value can be computed using any statistical software.
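
As one possible way to obtain the exact p value without a statistics package, the upper-tail area for 3 dof has a closed form that can be evaluated with the Python standard library. The function name below is an assumption for this sketch, and the formula is specific to 3 degrees of freedom.

```python
import math

def chi2_sf_3dof(x):
    """Upper-tail area P(X > x) for a chi-square variable with 3 dof."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p_value = chi2_sf_3dof(22.9)
print(f"p = {p_value:.1e}")
```

The result is far below 0.05, consistent with the rejection of the NULL hypothesis.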
Rejection of the NULL hypothesis: statistical meaning

Statistical meaning
If, in fact, the distribution of blood types among pancreatic cancer patients is the same as for healthy individuals, then the probability of randomly drawing samples from the population that lead to a χ2 value at least as large as the one we computed is less than 5% (the exact value of the probability is the p value).

Statistical meaning (another equivalent way)

If, in fact, the distribution of blood types among pancreatic cancer patients is the same as for healthy individuals, the probability that random sampling alone produces differences at least as large as those we observed is less than 5% (the exact value of the probability is the p value).

Rejection of the NULL hypothesis: how it is reported

As reported in the real paper


Dandona et al. JNCI: 102(2):135-137. 2010
Of rows and columns...
Consider the following contingency tables
Table 1
         X      Y
Type 1   160    100
Type 2   240    50
Type 3   1000   185
Type 4   580    98

Table 2
     Type 1   Type 2   Type 3   Type 4
X    160      240      1000     580
Y    100      50       185      98

The value of the χ2 statistic for Table 1 is 84.4. The value of the χ2 statistic for Table 2 is...
1 > 84.4
2 < 84.4
3 = 84.4
4 It is impossible to tell

Answer: 84.4! Transposing the contingency table does not change the result.
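
The claim can be checked numerically. The sketch below computes the statistic for Table 1 and for its transpose (Table 2); the helper name `chi_square` is an assumption for this example.

```python
def chi_square(table):
    """Sum of (O - E)^2 / E over all cells, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

table_1 = [(160, 100), (240, 50), (1000, 185), (580, 98)]
table_2 = list(zip(*table_1))  # transpose: rows become columns

chi2_1 = chi_square(table_1)
chi2_2 = chi_square(table_2)
print(f"Table 1: {chi2_1:.1f}, Table 2: {chi2_2:.1f}")
```

Transposing only swaps the roles of row totals and column totals in the product, so every cell keeps the same expected value.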

Pairwise comparisons

After rejecting the NULL hypothesis, we may be interested in knowing which group displayed a higher incidence of cancer. With 4 blood types, we can have 6 comparisons
1 Type O versus Type A
2 Type O versus Type B
3 Type O versus Type AB
4 Type A versus Type B
5 Type A versus Type AB
6 Type B versus Type AB
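
The list of comparisons is simply every unordered pair of blood types. A short sketch that enumerates them with the standard library (the variable names are assumptions for this example):

```python
from itertools import combinations

blood_types = ["O", "A", "B", "AB"]

# Every unordered pair of distinct blood types: 4 choose 2 = 6 pairs.
pairs = list(combinations(blood_types, 2))
for first, second in pairs:
    print(f"Type {first} versus Type {second}")
print(f"{len(pairs)} comparisons")
```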

Pairwise comparisons: Type O versus Type A

Original Data
Type    Healthy   Cancer   Total
O       46,238    91       46,329
A       38,667    118      38,785
B       8,464     33       8,497
AB      14,152    56       14,208
Total   107,521   298      107,819

Type O Vs Type A
Type    Healthy   Cancer   Total
O       46,238    91       46,329
A       38,667    118      38,785
Total   84,905    209      85,114

The expected table for the Type O versus Type A case is calculated (as before) by applying the formula

$$E_{\text{row},\text{col}} = \frac{T_{\text{row}}\, T_{\text{col}}}{T_{\text{total}}}$$

Type O versus Type A: building the expected table

Type O Vs Type A
Type    Healthy   Cancer   Total
O       46,238    91       46,329
A       38,667    118      38,785
Total   84,905    209      85,114

Type O Vs Type A (EXPECTED)
Type    Healthy   Cancer   Total
O       X         Y        46,329
A       Z         W        38,785
Total   84,905    209      85,114

$$X = \frac{46{,}329 \times 84{,}905}{85{,}114} = 46{,}215.24$$

Note: for 2X2 tables, the other 3 numbers can also be computed by subtraction.

$$Y = 46{,}329 - 46{,}215.24 = 113.76$$
$$Z = 84{,}905 - 46{,}215.24 = 38{,}689.76$$
$$W = 209 - 113.76 = 95.24$$

Type O versus Type A: computing the χ2 statistic

Type O Vs Type A (OBSERVED)
Type    Healthy     Cancer   Total
O       46,238      91       46,329
A       38,667      118      38,785
Total   84,905      209      85,114

Type O Vs Type A (EXPECTED)
Type    Healthy     Cancer   Total
O       46,215.24   113.76   46,329
A       38,689.76   95.24    38,785
Total   84,905      209      85,114

This is now a 2X2 case and we need to apply the Yates correction

$$\chi^2 = \frac{(|O_{1,1} - E_{1,1}| - 0.5)^2}{E_{1,1}} + \frac{(|O_{1,2} - E_{1,2}| - 0.5)^2}{E_{1,2}} + \frac{(|O_{2,1} - E_{2,1}| - 0.5)^2}{E_{2,1}} + \frac{(|O_{2,2} - E_{2,2}| - 0.5)^2}{E_{2,2}}$$

$$\chi^2 = \frac{(|46{,}238 - 46{,}215.24| - 0.5)^2}{46{,}215.24} + \frac{(|91 - 113.76| - 0.5)^2}{113.76} + \dots = 9.58$$
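
The Yates-corrected sum for the Type O versus Type A table can be sketched as below. The function name `yates_chi_square` is an assumption for this example; it applies the formula exactly as written on the slide.

```python
def yates_chi_square(table):
    """Chi-square for a 2x2 table with the Yates correction:
    sum over cells of (|O - E| - 0.5)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (abs(obs - exp) - 0.5) ** 2 / exp
    return stat

# (healthy, cancer) counts for type O and type A.
o_vs_a = [(46238, 91), (38667, 118)]
chi2 = yates_chi_square(o_vs_a)
print(f"chi2 = {chi2:.2f}")
```

In a 2X2 table |O − E| is the same in all four cells (22.76 here), so the correction lowers every term by the same amount in the numerator.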

All pairwise comparisons

We need to do 6 pairwise comparisons, therefore we must take into account the issue of compounding errors†
1 Type O versus Type A: χ2 = 9.58
2 Type O versus Type B: χ2 = 10.9
3 Type O versus Type AB: χ2 = 16.7
4 Type A versus Type B: χ2 = 1.3
5 Type A versus Type AB: χ2 = 2.3
6 Type B versus Type AB: χ2 = 0.0018
Assuming we chose a 95% confidence level, we need to compare our computed χ2 statistics with χ2crit−BONF

† See ANOVA lecture: every pairwise test carries the chance of committing a type I error.
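
The six statistics can be reproduced with a short loop over all pairs. This is a sketch using the observed counts from the slides; the names `observed`, `yates_chi_square` and `results` are assumptions for this example.

```python
from itertools import combinations

# Observed (healthy, cancer) counts per blood type.
observed = {"O": (46238, 91), "A": (38667, 118), "B": (8464, 33), "AB": (14152, 56)}

def yates_chi_square(table):
    """2x2 chi-square with the Yates correction: sum of (|O - E| - 0.5)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (abs(obs - exp) - 0.5) ** 2 / exp
    return stat

# One 2x2 table per unordered pair of blood types.
results = {
    (a, b): yates_chi_square([observed[a], observed[b]])
    for a, b in combinations(observed, 2)
}
for (a, b), chi2 in results.items():
    print(f"Type {a} versus Type {b}: chi2 = {chi2:.4g}")
```

The sketch applies the correction literally, even when |O − E| is below 0.5 (as in B versus AB). Some implementations cap the adjustment at |O − E| and would report 0 for that comparison; the decision is unaffected.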
Determining χ2crit−BONF (e.g., for a case of 95% confidence)

We have 4 blood types to compare: 6 possible pairwise comparisons.
Each pairwise comparison is from a 2X2 contingency table: the number of degrees of freedom is (2 − 1)(2 − 1) = 1.

[Figure: χ2 distribution with 1 degree of freedom. The blue area to the right of χ2crit is 0.05; the green area to the right of χ2crit−BONF is 0.05/6. The rejection region lies to the right of the critical value.]
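
For 1 degree of freedom the critical values can be computed with the Python standard library, because a χ2 variable with 1 dof is the square of a standard normal variable. The sketch below is an illustration under that fact; the function name `chi2_crit_1dof` is an assumption for this example.

```python
from statistics import NormalDist

def chi2_crit_1dof(alpha):
    """Critical value for 1 dof: P(Z^2 > c) = alpha means P(|Z| > sqrt(c)) = alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2

alpha = 0.05
n_comparisons = 6

crit = chi2_crit_1dof(alpha)                       # uncorrected critical value
crit_bonf = chi2_crit_1dof(alpha / n_comparisons)  # Bonferroni: tail area alpha / 6
print(f"chi2_crit = {crit:.2f}, chi2_crit_BONF = {crit_bonf:.2f}")
```

Dividing α by the number of comparisons shrinks the tail area, which pushes the critical value to the right and makes each individual test harder to reject.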

Performing 6 pairwise χ2 tests

Applying Bonferroni correction


              O vs A   O vs B   O vs AB   A vs B         A vs AB        B vs AB
χ2            9.58     10.9     16.7      1.3            2.3            0.0018
χ2crit−BONF   6.96     6.96     6.96      6.96           6.96           6.96
Result        Reject   Reject   Reject    Don't reject   Don't reject   Don't reject

We reject the NULL hypothesis in the O versus A, B and AB comparisons. We conclude that the incidence of pancreatic cancer among people with blood type O is different from that of people with other blood types.

Statistical statements

The comparison of cancer incidence between blood type A and B yielded a χ2 = 1.3 against a χ2crit−BONF = 6.96. Consider the statement "We conclude that there is no difference in cancer incidence between blood type A and B". Is this statement correct? Why?

Answer: No. We never conclude that the NULL hypothesis is true. A better statement would be "Our data failed to support any difference in cancer incidence between blood types A and B".

Examining the rejection cases

Observed incidence
Type    Healthy     Cancer   Total
O       46,238      91       46,329
A       38,667      118      38,785
B       8,464       33       8,497
AB      14,152      56       14,208
Total   107,521     298      107,819

Expected incidence
Type    Healthy     Cancer   Total
O       46,200.95   128.05   46,329
A       38,677.80   107.20   38,785
B       8,473.52    23.48    8,497
AB      14,168.73   39.27    14,208
Total   107,521     298      107,819

We observe two aspects
1 Comparisons with O type yielded statistically significant differences (from previous slide).
2 For type O, the observed number of cancer cases is less than the expected, while for all other types the observed number is more than the expected.

Combining the two aspects, we can conclude that the incidence of pancreatic cancer is lower for people with blood type O.

How it is reported

In the real paper

A similar case study: cancer and ethnicity

Youtube link https://www.youtube.com/watch?v=XSDU3d2j4gY

Calculation question (3 points)

Consider the following contingency table relative to cancer incidence in different US ethnicities

                 Healthy   Cancer   Total
White            2500      195
Black                      220      3720
Asian American   540                560
Total

Using a 95% confidence level
1 Provide the value of the χ2 statistic and determine whether there is any difference among the 3 races in terms of cancer incidence

Calculation question: solution
                 Healthy    Cancer   Total
White            2500       195      2,695
Black            3500       220      3720
Asian American   540        20       560
Total            6,540      435      6,975

Computing the expected table:

$$E_{1,1} = \frac{2695 \times 6540}{6975} = 2526.92, \qquad E_{2,1} = \frac{3720 \times 6540}{6975} = 3488, \ \dots$$

Expected
                 Healthy    Cancer   Total
White            2,526.92   168.08   2,695
Black            3488       232      3720
Asian American   525.08     34.92    560
Total            6,540      435      6,975

Calculation question: solution (cont’d)

$$\chi^2 = \frac{(2500 - 2526.92)^2}{2526.92} + \frac{(3500 - 3488)^2}{3488} + \dots = 12.1$$
χ2crit = 5.99 (from the table, 0.95 and 2 dof). Hence we reject the
NULL hypothesis and conclude that the race distribution among
healthy individuals is different from that of cancer patients, or,
alternatively the incidence of cancer among the 3 races is not the
same.
There are 3 pairwise comparisons (not required by the question)
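
The solution can be cross-checked with a few lines of Python. This is a sketch with the completed table from the solution; the helper name `chi_square` is an assumption, and the closed form used for the critical value holds only for 2 degrees of freedom.

```python
import math

# Completed table: (healthy, cancer) for White, Black, Asian American.
observed = [(2500, 195), (3500, 220), (540, 20)]

def chi_square(table):
    """Sum of (O - E)^2 / E over all cells, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

chi2 = chi_square(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For 2 dof the upper-tail area is exp(-x / 2), so the critical value is -2 ln(alpha).
crit = -2 * math.log(0.05)
print(f"chi2 = {chi2:.1f}, dof = {dof}, chi2_crit = {crit:.2f}, reject H0: {chi2 > crit}")
```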

Calculation question: solution (not required by the
question)

Data table (White vs Black)
           Healthy   Cancer   Total
White      2500      195      2,695
Black      3500      220      3720
Total      6000      415      6415

Expected table (White vs Black)
           Healthy   Cancer   Total
White      2520.65   174.35   2,695
Black      3479.35   240.65   3720
Total      6000      415      6415

Data table (White vs Asian Am)
           Healthy   Cancer   Total
White      2500      195      2,695
Asian Am   540       20       560
Total      3040      215      3255

Expected table (White vs Asian Am)
           Healthy   Cancer   Total
White      2516.99   178.01   2,695
Asian Am   523.01    36.99    560
Total      3040      215      3255

Data table (Black vs Asian Am)
           Healthy   Cancer   Total
Black      3500      220      3720
Asian Am   540       20       560
Total      4040      240      4280

Expected table (Black vs Asian Am)
           Healthy   Cancer   Total
Black      3511.4    208.6    3720
Asian Am   528.6     31.4     560
Total      4040      240      4280

Calculation question: solution (cont’d)
Computation of the χ2 statistic should have been made with the Yates correction. For example,

$$\chi^2_{\text{white vs black}} = \frac{(|2500 - 2520.65| - 0.5)^2}{2520.65} + \frac{(|195 - 174.35| - 0.5)^2}{174.35} + \dots$$

Results are (not required by the question)
χ2white vs black = 4.30
χ2white vs asian am = 9.51
χ2asian am vs black = 4.61

Although not required by the question, it is interesting to note that, at 95% confidence level, χ2crit−BONF = 5.73, hence the only significant difference that can be confirmed is between white and Asian Americans, with Asian Americans having lower cancer incidence.
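
These three statistics and the Bonferroni critical value can be reproduced with the sketch below. It uses only the Python standard library; the names `observed`, `yates_chi_square`, `results` and `crit_bonf` are assumptions for this example, and the critical value relies on the fact that a χ2 variable with 1 dof is a squared standard normal.

```python
from itertools import combinations
from statistics import NormalDist

# Completed table: (healthy, cancer) counts per group.
observed = {"White": (2500, 195), "Black": (3500, 220), "Asian American": (540, 20)}

def yates_chi_square(table):
    """2x2 chi-square with the Yates correction: sum of (|O - E| - 0.5)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (abs(obs - exp) - 0.5) ** 2 / exp
    return stat

results = {
    (a, b): yates_chi_square([observed[a], observed[b]])
    for a, b in combinations(observed, 2)
}

# Bonferroni: each of the 3 tests uses a tail area of 0.05 / 3 (1 dof).
alpha_bonf = 0.05 / len(results)
crit_bonf = NormalDist().inv_cdf(1 - alpha_bonf / 2) ** 2

for (a, b), chi2 in results.items():
    verdict = "reject" if chi2 > crit_bonf else "don't reject"
    print(f"{a} vs {b}: chi2 = {chi2:.2f} ({verdict}, crit = {crit_bonf:.2f})")
```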
A slightly different problem

Our case study


We had two samples of subjects
1 The sample of healthy subjects
2 The sample of patients with pancreatic cancer
and we compared the proportions of the different blood types.

What is often done in practice


If national statistics of blood types are available, the "healthy"
sample is sometimes not included in the study. The pancreatic cancer
patients are classified according to their blood types. The problem
becomes whether the distribution of blood types among pancreatic
cancer patients differs from the national distribution of blood types.

Testing against a known distribution
Our case study so far
Blood type Healthy Pancreatic cancer Total
O 46,238 91 46,329
A 38,667 118 38,785
B 8,464 33 8,497
AB 14,152 56 14,208
Total 107,521 298 107,819
Testing against known distribution
Blood type   National average (%)   Pancreatic cancer patients
O            43%                    91
A            36%                    118
B            7.9%                   33
AB           13.1%                  56
Total        100%                   298

Testing against a known distribution: building the χ2
statistic

The key idea is to use the national statistics data as "Expected" values.

$$\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(91 - 0.43 \times 298)^2}{0.43 \times 298} + \frac{(118 - 0.36 \times 298)^2}{0.36 \times 298} + \frac{(33 - 0.079 \times 298)^2}{0.079 \times 298} + \frac{(56 - 0.131 \times 298)^2}{0.131 \times 298}$$
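
The same sum can be evaluated with a few lines of Python. This is a sketch using the national percentages and patient counts from the table; the variable names are assumptions for this example. With the rounded percentages shown, the sum evaluates to roughly 23; hand calculations may differ slightly depending on how the percentages are rounded.

```python
# National shares of each blood type and observed pancreatic cancer patients.
national_share = {"O": 0.43, "A": 0.36, "B": 0.079, "AB": 0.131}
cancer_patients = {"O": 91, "A": 118, "B": 33, "AB": 56}

n_patients = sum(cancer_patients.values())

# Expected count per blood type if patients followed the national distribution.
expected = {k: share * n_patients for k, share in national_share.items()}

chi2 = sum((cancer_patients[k] - expected[k]) ** 2 / expected[k] for k in expected)
dof = len(expected) - 1  # number of categories minus one
print(f"chi2 = {chi2:.1f} with {dof} dof")
```

Here the expected values come from an external distribution rather than from the margins of a contingency table, so only the patient counts enter the sum.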

Testing against a known distribution

The test is performed exactly as before. Evaluating the sum with the values in the table gives χ2 ≈ 23.0. The value of χ2crit is still 7.8 (the number of degrees of freedom is 4 − 1 = 3).
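[Figure: χ2 distribution with 3 degrees of freedom. The no-rejection region lies to the left of χ2crit and the rejection region to its right; the purple area to the right of the computed χ2 is the p-value.]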


We reject the NULL hypothesis and conclude that the distribution of blood types among pancreatic cancer patients differs from the distribution according to national statistics.†

† One can also proceed with pairwise comparisons as we did earlier in this class; we will not cover them for this case, though.
Lecture Summary

1 Blood types and cancer

2 Contingency table and expected table

3 The χ2 statistic and the χ2 test

4 Pairwise comparisons after the χ2 test

5 Comparing against a known distribution
