BloodTypesCancerCaseStudy

Analysis of nominal data case study: blood type and cancer
Dr Alberto Corrias
Department of Biomedical Engineering. National University of Singapore
BN2102 Bioengineering Data Analysis
ABO blood types
† Diagram by InvictaHOG, in the Public Domain
ABO blood types
Link to video: https://youtu.be/L06TJTMVkBo
ABO blood types by country
† Image credit: Rick Wicklin (SAS)
ABO blood types and cancer: a long history
Our case study
Two main studies

Wolpin BM et al J Natl Cancer Inst; 101(6): 424–431. 2009
Dandona M et al. J Natl Cancer Inst; 102(2): 135-137. 2010
Data collection (1996-2005)
Blood type and pancreatic cancer

Blood type   Healthy   Pancreatic cancer     Total
O             46,238                  91    46,329
A             38,667                 118    38,785
B              8,464                  33     8,497
AB            14,152                  56    14,208
Total        107,521                 298   107,819
4X2 contingency table (4 rows and 2 columns)
Our objective
What we want to test

H0 : incidence of pancreatic cancer is the same in all blood types.
H1 : incidence of pancreatic cancer is NOT the same in all blood
types.
Another way of expressing it

H0 : the distribution of blood types among pancreatic cancer
patients is the same as for healthy individuals.
H1 : the distribution of blood types among pancreatic cancer
patients is NOT the same as for healthy individuals.
To address the issue we will conduct a χ2 test (see theory).
Building the "expected" table (if H0 were true)

Blood type   Healthy   Pancreatic cancer     Total
O             46,238                  91    46,329
A             38,667                 118    38,785
B              8,464                  33     8,497
AB            14,152                  56    14,208
Total        107,521                 298   107,819

1 Total incidence of pancreatic cancer is 298/107,819 = 0.002764
(0.2764%)
2 Under H0, 0.2764% of the individuals in each blood type should be
cases†
O group: 46,329 × 0.002764 = 128.05
A group: 38,785 × 0.002764 = 107.20
B group: 8,497 × 0.002764 = 23.49
AB group: 14,208 × 0.002764 = 39.27
† Rounded to 2 decimal places
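The two steps above can be sketched in a few lines of Python. This is only an illustration: the dictionary names are hypothetical, and the unrounded incidence is used, so individual values may differ from the slide figures by about 0.01.

```python
# Totals per blood type (healthy + cancer) and observed cancer cases,
# copied from the contingency table above. Names are illustrative only.
totals = {"O": 46_329, "A": 38_785, "B": 8_497, "AB": 14_208}
cancer = {"O": 91, "A": 118, "B": 33, "AB": 56}

# Step 1: overall incidence of pancreatic cancer across all blood types.
incidence = sum(cancer.values()) / sum(totals.values())

# Step 2: under H0 every blood type shares this same incidence.
expected_cancer = {bt: n * incidence for bt, n in totals.items()}

print(f"overall incidence = {incidence:.6f}")
for bt, e in expected_cancer.items():
    print(f"{bt:>2}: observed {cancer[bt]:>3}, expected {e:7.2f}")
```

By construction the expected cases add up to the observed total of 298.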
Building the "expected" table (i.e., if H0 were true)
Observed incidence
Blood type   Healthy   Pancreatic cancer     Total
O             46,238                  91    46,329
A             38,667                 118    38,785
B              8,464                  33     8,497
AB            14,152                  56    14,208
Total        107,521                 298   107,819
Expected incidence
Blood type   Healthy                          Pancreatic cancer     Total
O            46,329 − 128.05 = 46,200.95                 128.05    46,329
A            38,785 − 107.20 = 38,677.80                 107.20    38,785
B             8,497 −  23.49 =  8,473.51                  23.49     8,497
AB           14,208 −  39.27 = 14,168.73                  39.27    14,208
Total        107,521                                        298   107,819
About the "expected" table
When describing the table of expected values, it is often said that
this table is not real, only virtual. Why? Because...
1 It can contain negative numbers
2 It can contain non-integer numbers
3 It only contains integer numbers
Answer: because it can contain numbers that are not integers. In
our example, it is impossible to have 128.05 individuals! The
table is only useful for calculation purposes.
Building the "expected" table: practical tip

In practice, you do not need to compute the percentage. Each
element in the expected table follows the rule

\[ E_{row,col} = \frac{T_{row}\, T_{col}}{T_{total}} \]

Observed incidence
Type    Healthy   Cancer   Total
O        46,238       91   Trow 1
A        38,667      118   Trow 2
B         8,464       33   Trow 3
AB       14,152       56   Trow 4
Total    Tcol 1   Tcol 2   Ttotal

Expected incidence
Type    Healthy   Cancer   Total
O       E1,1      E1,2     Trow 1
A       E2,1      E2,2     Trow 2
B       E3,1      E3,2     Trow 3
AB      E4,1      E4,2     Trow 4
Total   Tcol 1    Tcol 2   Ttotal
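As a minimal sketch of the rule E = Trow × Tcol / Ttotal, the function below builds the whole expected table from the observed counts. The function name `expected_table` and the list-of-rows layout are assumptions made for this illustration.

```python
def expected_table(observed):
    """Return E[r][c] = T_row * T_col / T_total for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


# Observed counts (Healthy, Cancer) for blood types O, A, B, AB.
observed = [
    [46_238, 91],
    [38_667, 118],
    [8_464, 33],
    [14_152, 56],
]

expected = expected_table(observed)
for label, row in zip(["O", "A", "B", "AB"], expected):
    print(label, [round(e, 2) for e in row])
```

The expected table keeps the same row and column totals as the observed one, which is a quick sanity check on any hand calculation.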
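Building the χ² statistic

\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected \# in table cell})^2}{\text{Expected \# in table cell}} = \sum \frac{(O - E)^2}{E} \]

In our case there are 8 cells (4 rows, 2 columns)

\[
\begin{aligned}
\chi^2 ={}& \frac{(O_{1,1} - E_{1,1})^2}{E_{1,1}} + \frac{(O_{2,1} - E_{2,1})^2}{E_{2,1}} + \frac{(O_{3,1} - E_{3,1})^2}{E_{3,1}} + \frac{(O_{4,1} - E_{4,1})^2}{E_{4,1}} \\
&+ \frac{(O_{1,2} - E_{1,2})^2}{E_{1,2}} + \frac{(O_{2,2} - E_{2,2})^2}{E_{2,2}} + \frac{(O_{3,2} - E_{3,2})^2}{E_{3,2}} + \frac{(O_{4,2} - E_{4,2})^2}{E_{4,2}} = 22.9
\end{aligned}
\]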
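The sum over the 8 cells can be checked numerically. The sketch below is one possible way to do it in plain Python; the function name `chi_square` is hypothetical, and expected values are computed with the row/column-total rule rather than read from the slide.

```python
def chi_square(observed):
    """Sum of (O - E)^2 / E over all cells, with E = T_row * T_col / T_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for row, rt in zip(observed, row_totals):
        for o, ct in zip(row, col_totals):
            e = rt * ct / total          # expected count for this cell
            stat += (o - e) ** 2 / e     # contribution of this cell
    return stat


# Observed counts (Healthy, Cancer) for blood types O, A, B, AB.
observed = [[46_238, 91], [38_667, 118], [8_464, 33], [14_152, 56]]
stat = chi_square(observed)
print(f"chi-square = {stat:.1f}")
```

Almost all of the statistic comes from the four cancer cells: the healthy counts are so large that their relative deviations are tiny.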
About the χ² statistic
Imagine we collect the following data
Observed incidence
Type    Healthy   Cancer   Total
O           100       50     150
A           100       50     150
B           100       50     150
AB          100       50     150
Total       400      200     600
What is the value of the χ² statistic?
1 0
2 1
3 600
4 ∞
Answer: 0. Explanation: the expected table is identical to the
observed table.
Finding χ²crit
If we choose α = 0.05, we use the χ² table to identify
χ²crit = 7.8.
Note: the number of degrees of freedom (dof) is (4−1)(2−1) = 3
Performing the χ² test
[Figure: probability density of χ²(3). χ²crit separates the
no-rejection region (left) from the rejection region (right); the
purple area to the right of the computed χ² is the p-value.]

We found χ² = 22.9 and χ²crit = 7.8, hence χ² > χ²crit and we reject
the NULL hypothesis. As a consequence, we already know p < 0.05†.
† The exact p value can be computed using any statistical software
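For readers without a statistics package at hand, the critical value and the p value for 3 degrees of freedom can be sketched with the Python standard library alone. The closed-form tail area used here holds only for 3 dof, and the function names and the bisection search are illustrative choices, not part of the original lecture.

```python
import math


def chi2_sf_3dof(x):
    """Upper-tail area P(X > x) for a chi-square variable with 3 dof (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def critical_value_3dof(alpha, lo=0.0, hi=100.0):
    """Find x whose upper-tail area equals alpha by bisection (the area decreases with x)."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_sf_3dof(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


dof = (4 - 1) * (2 - 1)            # (rows - 1) * (columns - 1)
chi2_crit = critical_value_3dof(0.05)
p_value = chi2_sf_3dof(22.9)       # statistic found for the blood type table

print(f"dof = {dof}, chi2_crit = {chi2_crit:.2f}, p = {p_value:.1e}")
```

The p value comes out far below 0.05, consistent with the rejection obtained by comparing χ² with χ²crit.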
Rejection of the NULL hypothesis: statistical meaning
Statistical meaning
If, in fact, the distribution of blood types among pancreatic
cancer patients is the same as for healthy individuals, then the
probability of randomly drawing samples from the population that lead
to a χ² value at least as large as the one we computed is less than 5%
(the exact value of the probability is the p value).
Statistical meaning (another equivalent way)

If, in fact, the distribution of blood types among pancreatic
cancer patients is the same as for healthy individuals, the probability
that random sampling alone produces differences at least as large as
those we observed is less than 5% (the exact value of the probability
is the p value).
Rejection of the NULL hypothesis: how it is reported
As reported in the real paper
† Dandona et al. JNCI: 102(2):135-137. 2010
Of rows and columns...
Consider the following contingency tables
Table 1
              X      Y
Type 1      160    100
Type 2      240     50
Type 3    1,000    185
Type 4      580     98

Table 2
     Type 1   Type 2   Type 3   Type 4
X       160      240    1,000      580
Y       100       50      185       98

The value of the χ² statistic for Table 1 is 84.4. The value of the
χ² statistic for Table 2 is...
1 > 84.4
2 < 84.4
3 = 84.4
4 It is impossible to tell
Answer: 84.4! Transposing the contingency table does not change the
result.
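The invariance under transposition is easy to confirm numerically. The following sketch, with a hypothetical `chi_square` helper, computes the statistic for Table 1 and for its transpose (Table 2).

```python
def chi_square(observed):
    """Sum of (O - E)^2 / E with E = T_row * T_col / T_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return sum(
        (o - rt * ct / total) ** 2 / (rt * ct / total)
        for row, rt in zip(observed, row_totals)
        for o, ct in zip(row, col_totals)
    )


table_1 = [[160, 100], [240, 50], [1000, 185], [580, 98]]   # 4 rows, 2 columns
table_2 = [list(col) for col in zip(*table_1)]              # transpose: 2 rows, 4 columns

chi2_1 = chi_square(table_1)
chi2_2 = chi_square(table_2)
print(f"Table 1: {chi2_1:.1f}, Table 2: {chi2_2:.1f}")
```

Swapping rows and columns only swaps the roles of Trow and Tcol in each expected value, so every cell contributes exactly the same term.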
Pairwise comparisons
After rejecting the NULL hypothesis, we may be interested in
knowing which group displayed higher incidence of cancer. We can
have 6 comparisons
1 Type O versus Type A
2 Type O versus Type B
3 Type O versus Type AB
4 Type A versus Type B
5 Type A versus Type AB
6 Type B versus Type AB
Pairwise comparisons: Type O versus Type A
Original Data
Type    Healthy   Cancer     Total
O        46,238       91    46,329
A        38,667      118    38,785
B         8,464       33     8,497
AB       14,152       56    14,208
Total   107,521      298   107,819

Type O Vs Type A
Type    Healthy   Cancer     Total
O        46,238       91    46,329
A        38,667      118    38,785
Total    84,905      209    85,114

The expected table for the Type O versus Type A case is calculated
(as before) by applying the formula

\[ E_{row,col} = \frac{T_{row}\, T_{col}}{T_{total}} \]
Type O versus Type A: building the expected table
Type O Vs Type A
Type    Healthy   Cancer     Total
O        46,238       91    46,329
A        38,667      118    38,785
Total    84,905      209    85,114

Type O Vs Type A (EXPECTED)
Type    Healthy   Cancer     Total
O       X         Y         46,329
A       Z         W         38,785
Total   84,905    209       85,114

\[ X = \frac{46{,}329 \times 84{,}905}{85{,}114} = 46{,}215.24 \]

Note: for 2X2 tables, the other 3 numbers can also be computed by
subtraction.
Y = 46,329 − 46,215.24 = 113.76
Z = 84,905 − 46,215.24 = 38,689.76
W = 209 − 113.76 = 95.24
Type O versus Type A: computing the χ² statistic
Type O Vs Type A (OBSERVED)
Type    Healthy   Cancer     Total
O        46,238       91    46,329
A        38,667      118    38,785
Total    84,905      209    85,114

Type O Vs Type A (EXPECTED)
Type    Healthy      Cancer     Total
O       46,215.24    113.76    46,329
A       38,689.76     95.24    38,785
Total   84,905       209       85,114

This is now a 2X2 case and we need to apply the Yates correction

\[
\chi^2 = \frac{(|O_{1,1} - E_{1,1}| - 0.5)^2}{E_{1,1}} + \frac{(|O_{1,2} - E_{1,2}| - 0.5)^2}{E_{1,2}} + \frac{(|O_{2,1} - E_{2,1}| - 0.5)^2}{E_{2,1}} + \frac{(|O_{2,2} - E_{2,2}| - 0.5)^2}{E_{2,2}}
\]

\[
\chi^2 = \frac{(|46{,}238 - 46{,}215.24| - 0.5)^2}{46{,}215.24} + \frac{(|91 - 113.76| - 0.5)^2}{113.76} + \dots = 9.58
\]
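The Yates-corrected sum above can be reproduced with a short sketch. The helper name `yates_chi_square` is hypothetical, and the code implements the formula exactly as written on this slide, subtracting 0.5 from every |O − E|.

```python
def yates_chi_square(table):
    """Chi-square for a 2x2 table with the Yates term: sum of (|O - E| - 0.5)^2 / E."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for row, rt in zip(table, row_totals):
        for o, ct in zip(row, col_totals):
            e = rt * ct / total                   # expected count
            stat += (abs(o - e) - 0.5) ** 2 / e   # corrected contribution
    return stat


# Observed counts (Healthy, Cancer) for Type O and Type A.
o_vs_a = [[46_238, 91], [38_667, 118]]
stat_o_vs_a = yates_chi_square(o_vs_a)
print(f"O vs A: chi-square = {stat_o_vs_a:.2f}")
```

In a 2X2 table all four |O − E| values are equal (22.76 here), which is why the other three expected counts can be found by subtraction.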
All pairwise comparisons
We need to do 6 pairwise comparisons, therefore we must take into
account the issue of compounding errors†
1 Type O versus Type A: χ² = 9.58
2 Type O versus Type B: χ² = 10.9
3 Type O versus Type AB: χ² = 16.7
4 Type A versus Type B: χ² = 1.3
5 Type A versus Type AB: χ² = 2.3
6 Type B versus Type AB: χ² = 0.0018
Assuming we chose a 95% confidence level, we need to compare our
computed χ² statistics with χ²crit−BONF
† See ANOVA lecture: every pairwise test carries the chance of
committing a type I error.
Determining χ²crit−BONF (e.g., for a case of 95% confidence)
We have 4 blood types to compare: 6 possible pairwise
comparisons
Each pairwise comparison is from a 2X2 contingency table:
number of degrees of freedom is (2 − 1)(2 − 1) = 1
[Figure: probability density of χ²(1). The blue area to the right
of χ²crit is 0.05; the green area to the right of χ²crit−BONF is
0.05/6.]
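With 1 degree of freedom, a χ² variable is the square of a standard normal variable, so both critical values can be sketched with the standard library only. The function name `chi2_crit_1dof` is an assumption of this illustration; a χ² table or any statistical software gives the same numbers.

```python
from statistics import NormalDist


def chi2_crit_1dof(alpha):
    """Critical value of chi-square with 1 dof: square of the two-sided normal quantile."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * z


alpha = 0.05
n_types = 4
n_comparisons = n_types * (n_types - 1) // 2     # 6 pairwise comparisons

crit = chi2_crit_1dof(alpha)                     # area 0.05 to the right
crit_bonf = chi2_crit_1dof(alpha / n_comparisons)  # area 0.05/6 to the right

print(f"chi2_crit = {crit:.2f}, chi2_crit_BONF = {crit_bonf:.2f}")
```

Dividing α by the number of comparisons pushes the threshold to the right, so each individual test has to clear a higher bar.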
Performing 6 pairwise χ² tests
Applying Bonferroni correction

               O vs A   O vs B   O vs AB   A vs B    A vs AB   B vs AB
χ²             9.58     10.9     16.7      1.3       2.3       0.0018
χ²crit−BONF    6.96     6.96     6.96      6.96      6.96      6.96
Result         Reject   Reject   Reject    Don't     Don't     Don't
                                           reject    reject    reject

We reject the NULL hypothesis in the O versus A, B and AB
comparisons. We conclude that the incidence of pancreatic cancer
among people with blood type O is different from that of people with
other blood types.
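The whole table of six tests can be reproduced with the sketch below. All names are hypothetical, and the Yates term follows the formula used in this lecture; some statistical packages limit the correction so that it never exceeds |O − E|, which would change only the B vs AB value.

```python
from itertools import combinations
from statistics import NormalDist

# Observed counts (Healthy, Cancer) per blood type.
counts = {"O": (46_238, 91), "A": (38_667, 118), "B": (8_464, 33), "AB": (14_152, 56)}


def yates_chi_square(table):
    """Sum of (|O - E| - 0.5)^2 / E over the four cells of a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (abs(o - rt * ct / total) - 0.5) ** 2 / (rt * ct / total)
        for row, rt in zip(table, row_totals)
        for o, ct in zip(row, col_totals)
    )


pairs = list(combinations(counts, 2))      # the 6 pairwise comparisons
alpha_bonf = 0.05 / len(pairs)             # Bonferroni-corrected alpha
crit_bonf = NormalDist().inv_cdf(1 - alpha_bonf / 2) ** 2   # 1 dof

results = {}
for a, b in pairs:
    results[(a, b)] = yates_chi_square([counts[a], counts[b]])
    verdict = "Reject" if results[(a, b)] > crit_bonf else "Don't reject"
    print(f"{a:>2} vs {b:<2}: chi-square = {results[(a, b)]:8.4f}  {verdict}")

rejected = [pair for pair, stat in results.items() if stat > crit_bonf]
```

Only the three comparisons involving type O exceed the corrected threshold.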
Statistical statements
The comparison of cancer incidence between blood types A and B
yielded χ² = 1.3 against χ²crit−BONF = 6.96. Consider the
statement "We conclude that there is no difference in cancer
incidence between blood types A and B". Is this statement
correct? Why?
Answer: No. We never conclude that the NULL hypothesis is
true. A better statement would be "Our data failed to support
any difference in cancer incidence between blood types A and B".
Examining the rejection cases
Observed incidence
Type    Healthy   Cancer     Total
O        46,238       91    46,329
A        38,667      118    38,785
B         8,464       33     8,497
AB       14,152       56    14,208
Total   107,521      298   107,819

Expected incidence
Type    Healthy       Cancer     Total
O       46,200.95     128.05    46,329
A       38,677.80     107.20    38,785
B        8,473.52      23.48     8,497
AB      14,168.73      39.27    14,208
Total   107,521       298      107,819

We observe two aspects
1 (from previous slide) Comparisons with O type yielded
statistically significant differences
2 For type O, the observed number of cancer cases is less than
the expected one, while for all other types it is more than the
expected one.
Combining the two aspects, we can conclude that the incidence of
pancreatic cancer is less for people with blood type O.
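The direction of each deviation can be read off programmatically. This is a small sketch with hypothetical variable names; the incidence per 100,000 is an extra quantity added here only to make the comparison between groups easier to see.

```python
# Observed counts (Healthy, Cancer) per blood type.
counts = {"O": (46_238, 91), "A": (38_667, 118), "B": (8_464, 33), "AB": (14_152, 56)}

total_people = sum(h + c for h, c in counts.values())
total_cases = sum(c for _, c in counts.values())

below_expected = []
for bt, (healthy, cases) in counts.items():
    n = healthy + cases
    expected_cases = n * total_cases / total_people   # T_row * T_col / T_total
    if cases < expected_cases:
        below_expected.append(bt)
    per_100k = 1e5 * cases / n
    print(f"{bt:>2}: observed {cases:>3}, expected {expected_cases:7.2f}, "
          f"incidence {per_100k:6.1f} per 100,000")
```

Type O is the only group whose observed cases fall below the expected count.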
How it is reported
In the real paper
A similar case study: cancer and ethnicity
Youtube link https://www.youtube.com/watch?v=XSDU3d2j4gY
Calculation question (3 points)
Consider the following contingency table relative to cancer incidence
in different US ethnicities
                  Healthy   Cancer   Total
White               2,500      195
Black                          220   3,720
Asian American        540              560
Total
Using a 95% confidence level
1 Provide the value of the χ² statistic and determine whether there
is any difference among the 3 races in terms of cancer incidence
Calculation question: solution
Observed
                  Healthy   Cancer   Total
White               2,500      195   2,695
Black               3,500      220   3,720
Asian American        540       20     560
Total               6,540      435   6,975
Computing the expected table:

\[ E_{1,1} = \frac{2695 \times 6540}{6975} = 2526.92, \quad E_{2,1} = \frac{3720 \times 6540}{6975} = 3488, \dots \]

Expected
                  Healthy    Cancer   Total
White             2,526.92   168.08   2,695
Black             3,488      232      3,720
Asian American      525.08    34.92     560
Total             6,540      435      6,975
Calculation question: solution (cont'd)

\[ \chi^2 = \frac{(2500 - 2526.92)^2}{2526.92} + \frac{(3500 - 3488)^2}{3488} + \dots = 12.1 \]

χ²crit = 5.99 (from the table, 0.95 and 2 dof). Hence we reject the
NULL hypothesis and conclude that the race distribution among
healthy individuals is different from that of cancer patients, or,
alternatively, the incidence of cancer among the 3 races is not the
same.
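The solution can be verified with a short sketch. For 2 degrees of freedom the upper-tail area of χ² is exp(−x/2), so the critical value has the closed form −2 ln(α); that shortcut and the helper name `chi_square` are choices made for this illustration.

```python
import math


def chi_square(observed):
    """Sum of (O - E)^2 / E with E = T_row * T_col / T_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return sum(
        (o - rt * ct / total) ** 2 / (rt * ct / total)
        for row, rt in zip(observed, row_totals)
        for o, ct in zip(row, col_totals)
    )


# Observed counts (Healthy, Cancer) for White, Black, Asian American.
observed = [[2500, 195], [3500, 220], [540, 20]]

stat = chi_square(observed)
dof = (3 - 1) * (2 - 1)
chi2_crit = -2 * math.log(0.05)     # valid for 2 dof only
p_value = math.exp(-stat / 2)       # valid for 2 dof only

print(f"chi-square = {stat:.1f}, dof = {dof}, crit = {chi2_crit:.2f}, p = {p_value:.4f}")
```

The statistic exceeds the critical value, in agreement with the rejection stated in the solution.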
There are 3 pairwise comparisons (not required by the question)
Calculation question: solution (not required by the
question)
Data tables                              Expected tables
           Healthy   Cancer   Total                 Healthy    Cancer   Total

White        2,500      195   2,695      White      2,520.65   174.35   2,695
Black        3,500      220   3,720      Black      3,479.35   240.65   3,720
Total        6,000      415   6,415      Total      6,000      415      6,415

White        2,500      195   2,695      White      2,516.99   178.01   2,695
Asian Am       540       20     560      Asian Am     523.01    36.99     560
Total        3,040      215   3,255      Total      3,040      215      3,255

Black        3,500      220   3,720      Black      3,511.4    208.6    3,720
Asian Am       540       20     560      Asian Am     528.6     31.4      560
Total        4,040      240   4,280      Total      4,040      240      4,280
Calculation question: solution (cont'd)
Because each pairwise table is 2X2, the χ² statistic must be computed
with the Yates correction. For example,

\[ \chi^2_{\text{White vs Black}} = \frac{(|2500 - 2520.65| - 0.5)^2}{2520.65} + \frac{(|195 - 174.35| - 0.5)^2}{174.35} + \dots \]

Results are (not required by the question)
χ² (White vs Black) = 4.30
χ² (White vs Asian Am) = 9.51
χ² (Asian Am vs Black) = 4.61
Although not required by the question, it is interesting to note that,
at 95% confidence level, χ²crit−BONF = 5.73, hence the only significant
difference that can be confirmed is between White and Asian
Americans, with Asian Americans having lower cancer incidence.
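The three pairwise results and the Bonferroni threshold can be checked with the sketch below. Names are hypothetical, and the Yates term is applied as in the formula above, with 0.5 subtracted from each |O − E|.

```python
from itertools import combinations
from statistics import NormalDist

# Observed counts (Healthy, Cancer) per group.
counts = {"White": (2500, 195), "Black": (3500, 220), "Asian American": (540, 20)}


def yates_chi_square(table):
    """Sum of (|O - E| - 0.5)^2 / E over the four cells of a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum(
        (abs(o - rt * ct / total) - 0.5) ** 2 / (rt * ct / total)
        for row, rt in zip(table, row_totals)
        for o, ct in zip(row, col_totals)
    )


pairs = list(combinations(counts, 2))                          # 3 comparisons
crit_bonf = NormalDist().inv_cdf(1 - 0.05 / len(pairs) / 2) ** 2   # 1 dof

results = {pair: yates_chi_square([counts[pair[0]], counts[pair[1]]]) for pair in pairs}
significant = [pair for pair, stat in results.items() if stat > crit_bonf]

for (a, b), stat in results.items():
    print(f"{a} vs {b}: chi-square = {stat:.2f}")
print(f"chi2_crit_BONF = {crit_bonf:.2f}")
```

Two of the three statistics would pass the uncorrected threshold of 3.84 but not the Bonferroni one, which shows why the correction matters here.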
A slightly different problem
Our case study

We had two samples of subjects
1 The sample of healthy subjects
2 The sample of patients with pancreatic cancer
and we compared the proportions of the different blood types.
What is often done in practice

If national statistics of blood types are available, the "healthy"
sample is sometimes not included in the study. The pancreatic cancer
patients are classified according to their blood types. The problem
becomes whether the distribution of blood types among pancreatic
cancer patients differs from the national distribution of blood types.
Testing against a known distribution
Our case study so far
Blood type   Healthy   Pancreatic cancer     Total
O             46,238                  91    46,329
A             38,667                 118    38,785
B              8,464                  33     8,497
AB            14,152                  56    14,208
Total        107,521                 298   107,819

Testing against known distribution
Blood type   National average (%)   Pancreatic cancer patients
O                 43%                        91
A                 36%                       118
B                  7.9%                      33
AB                13.1%                      56
Total            100%                       298
Testing against a known distribution: building the χ²
statistic
The key idea is to use the national statistics data as "Expected"
values.

\[
\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(91 - 0.43 \times 298)^2}{0.43 \times 298} + \frac{(118 - 0.36 \times 298)^2}{0.36 \times 298} + \frac{(33 - 0.079 \times 298)^2}{0.079 \times 298} + \frac{(56 - 0.131 \times 298)^2}{0.131 \times 298}
\]
Testing against a known distribution
Performing the test (exactly the same as before). We found
χ² = 22.7. The value of χ²crit is still 7.8.
[Figure: probability density of χ²(3); the purple area to the
right of the computed χ² is the p-value.]

We reject the NULL hypothesis and conclude that the distribution of
blood types among pancreatic cancer patients differs from the
distribution according to national statistics.†
† One can also proceed with pairwise comparisons as we did earlier
in this class: we will not cover it for this case though.
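The test against the national distribution can be sketched as follows, using the national percentages as the source of the expected counts. Variable names are hypothetical. With the rounded percentages shown in the table, this sketch evaluates to roughly 23.0, slightly above the 22.7 quoted above; the gap may reflect unrounded percentages in the original calculation, and the conclusion is unchanged.

```python
# National blood type distribution (fractions) and observed cancer patients.
national = {"O": 0.43, "A": 0.36, "B": 0.079, "AB": 0.131}
patients = {"O": 91, "A": 118, "B": 33, "AB": 56}

n_patients = sum(patients.values())   # 298

# Expected count for each blood type if patients followed the national distribution.
expected = {bt: frac * n_patients for bt, frac in national.items()}

stat = sum((patients[bt] - expected[bt]) ** 2 / expected[bt] for bt in patients)
dof = len(patients) - 1               # 4 categories -> 3 dof
chi2_crit = 7.815                     # chi-square table value for alpha = 0.05, 3 dof

print(f"chi-square = {stat:.1f}, dof = {dof}, reject H0: {stat > chi2_crit}")
```

Unlike the contingency-table case, the expected counts here come from outside the sample, so only the patient counts carry sampling variability.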
Lecture Summary
1 Blood types and cancer
2 Contingency table and expected table
3 The χ2 statistic and the χ2 test
4 Pairwise comparisons after the χ2 test
5 Comparing against a known distribution
