Basic Biostatistics - Wakgari Module 17-21
Basic Biostatistics
Module 18-20
Wakgari Deressa (PhD)
December 2016
Course Contents
18. Introduction to measures of association
(categorical variables using chi-square
distribution)
19. Fisher's Exact test (0.5)
20. Introduction to logistic regression
Module 18
Introduction to
Measures of Association
• For the most part, we have been
applying the techniques of hypothesis
testing (H0) to quantitative data:
– Correlation
– Linear regression
• Simple linear regression
• Multiple linear regression
• Categorical variables are very common in
health or social sciences
– Nominal
– Ordinal
• When both exposure and outcome variables
have only two possible values the data can
be displayed in a 2x2 table.
• Classification of a sample of subjects with
respect to exposure-outcome relationship
using a Contingency Table (two rows and two
columns = 2x2)
                       Outcome
Exposure     Yes         No          Total
Yes          a           b           a+b = n1
No           c           d           c+d = n2
Total        a+c = m1    b+d = m2    N
n1, n2: row marginal totals; m1, m2: column marginal totals; N: grand total
Test of Association for rxc
contingency tables
rxc contingency table
r = number of rows (no. of categories of an
attribute A)
c = number of columns (no. of categories of an
attribute B)
                        Variable B
Variable A   B1    B2    B3    ...   Bc    Total
A1           O11   O12   O13   ...   O1c   R1
A2           O21   O22   O23   ...   O2c   R2
A3           O31   O32   O33   ...   O3c   R3
.            .     .     .     ...   .     .
.            .     .     .     ...   .     .
.            .     .     .     ...   .     .
Total        C1    C2    C3    ...   Cc    n
Oij – Observed frequency of ith row and jth
column
Ri – Marginal total of the ith row
Cj – Marginal total of the jth column
n – Grand total
Chi-Squared Test
• In this section, we describe how to use a
chi-squared ($\chi^2$) test to examine whether there is
an association between the row variable and
the column variable, that is, whether the
distribution of individuals among the
categories of one variable is independent of
their distribution among the categories of the
other.
• It is used to test hypotheses where the data are
in the form of frequencies.
Chi-Squared Test
• If we have a variable X which has a standard
normal distribution, then $X^2$ has a chi-squared
distribution.
• Clearly $X^2$ cannot take negative values, and
its distribution is highly skewed.
• This distribution of $X^2$ has one degree of
freedom, and is the simplest case of a more
general ‘family’ of chi-squared distributions.
Chi-Squared Test
• If we have several independent variables, each
of which has a standard normal distribution,
say $X_1, X_2, X_3, \ldots, X_k$, then the sum of the
squares of all the X's, $\sum X_i^2$, has a chi-squared
distribution with k degrees of freedom.
• The chi-squared distribution with one df is the
distribution of the square of a standard normal
variable, so the 5% cut-off point for $X^2$ is the
square of the 5% two-tailed cut-off for the normal
distribution, $1.96^2 = 3.84$.
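The relation between the normal and chi-squared cut-offs can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `chi2_cdf_1df` is introduced here for illustration and is not part of the course material.

```python
from math import erf, sqrt
from statistics import NormalDist

# Two-tailed 5% cut-off of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

# Squaring it gives the 5% cut-off of chi-squared with 1 df (about 3.84).
chi2_cutoff = z ** 2

def chi2_cdf_1df(x):
    """CDF of chi-squared with 1 df: P(X^2 <= x) = P(|X| <= sqrt(x)) = erf(sqrt(x / 2))."""
    return erf(sqrt(x / 2))

# Area to the right of the cut-off should be 0.05.
upper_tail = 1 - chi2_cdf_1df(chi2_cutoff)

print(round(z, 2), round(chi2_cutoff, 2), round(upper_tail, 3))
```

The closed-form CDF used here holds only for one degree of freedom; other degrees of freedom need tables or a statistics library.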
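The Chi-Square Distribution
Properties of $\chi^2$-Curves
1. Total area under a $\chi^2$-curve equals 1.
The test statistic:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where
$$E_{ij} = \frac{i\text{th row total} \times j\text{th column total}}{\text{grand total}} = \frac{R_i C_j}{n}$$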
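The expected-frequency formula and the test statistic can be sketched in a few lines of Python. This is an illustrative sketch, not the course's own implementation: the function names and the small table of counts are hypothetical.

```python
def expected_counts(observed):
    """Return E[i][j] = R_i * C_j / n for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Sum of (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))

# Hypothetical 2x3 table of counts, for illustration only.
table = [[10, 20, 30],
         [20, 20, 20]]

print(expected_counts(table))
print(round(chi_squared(table), 3))
```

Because the expected counts are built from the margins, each row and column of the expected table sums to the same total as the observed table.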
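[Figure: $\chi^2$ density curve with 10 df. The acceptance region (area 0.95) lies to the left of the critical value $\chi^2_{0.05,10} = 18.307$; the rejection region (area 0.05) lies to the right.]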
Chi-Squared Test
Perform the calculation of $\chi^2_{cal}$
Give the conclusion:
if $\chi^2_{cal} > \chi^2_{\alpha}$, reject H0 (P < α)
if $\chi^2_{cal} \le \chi^2_{\alpha}$, do not reject H0 (P ≥ α)
Guidelines for Interpreting Chi-Squared Test
Example
• A village survey investigated 124 households and
recorded their water supply. By reviewing the
village’s health center morbidity records for a
period of three months prior to the survey, it was
possible to identify household members with a
history of diarrhoeal episodes. Is there any
association between diarrhoeal episodes and
source of water in the village?
Solution
• H0: There is no association between
diarrhoeal episodes and source of
water in the village
H1: There is an association
                          Source of water (three categories)          Total
No diarrhoeal episodes    39          14          12                  65
Diarrhoeal episodes       49           6           4                  59
Total                     88          20          16                 124
Chi-Squared Test
$\chi^2$ test with d.f. = (2-1)×(3-1) = 2
α = 0.05
$\chi^2_{0.05,\,2} = 5.991$
Chi-Squared Test
Calculation of the expected values
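$$E_{11} = \frac{65 \times 88}{124} = 46.13 \qquad E_{12} = \frac{65 \times 20}{124} = 10.48 \qquad E_{13} = \frac{65 \times 16}{124} = 8.39$$
$$E_{21} = \frac{59 \times 88}{124} = 41.87 \qquad E_{22} = \frac{59 \times 20}{124} = 9.52 \qquad E_{23} = \frac{59 \times 16}{124} = 7.61$$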
Chi-Squared Test
Calculation of the test statistic
$$\chi^2_{cal} = \frac{(39 - 46.13)^2}{46.13} + \frac{(14 - 10.48)^2}{10.48} + \frac{(12 - 8.39)^2}{8.39} + \frac{(49 - 41.87)^2}{41.87} + \frac{(6 - 9.52)^2}{9.52} + \frac{(4 - 7.61)^2}{7.61}$$
$$= 1.102 + 1.179 + 1.556 + 1.214 + 1.299 + 1.715 = 8.06$$
Conclusion
Since $\chi^2_{cal} > \chi^2_{0.05,\,2}$ (8.06 > 5.991), we reject H0
and accept H1, i.e., the data indicate an
association between diarrhoeal episodes and
source of water in the village.
The observed $\chi^2$ value of 8.06 is greater than
the critical value of 5.991 for α = 0.05 at 2 df,
so p < 0.05.
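The arithmetic of this worked example can be reproduced with a short standard-library sketch. The observed counts are those used in the calculation above. The closed-form critical value and p-value rely on the fact that, for exactly 2 df, the chi-squared upper tail equals exp(-x/2); this shortcut is an addition for illustration and does not apply to other degrees of freedom.

```python
from math import exp, log

# Observed counts from the example: rows are households without / with
# diarrhoeal episodes, columns are the three water-source categories.
observed = [[39, 14, 12],
            [49, 6, 4]]

row_totals = [sum(row) for row in observed]        # [65, 59]
col_totals = [sum(col) for col in zip(*observed)]  # [88, 20, 16]
n = sum(row_totals)                                # 124

chi2_cal = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n      # expected count E_ij
        chi2_cal += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2-1) x (3-1) = 2

# Valid for df = 2 only: upper-tail area is exp(-x / 2).
alpha = 0.05
critical = -2 * log(alpha)
p_value = exp(-chi2_cal / 2)

print(f"chi2 = {chi2_cal:.2f}, df = {df}, critical = {critical:.3f}, p = {p_value:.4f}")
```

The computed statistic agrees with the hand calculation (about 8.06), and the p-value of roughly 0.018 is consistent with rejecting H0 at the 5% level.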
2x2 Contingency table
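                 Variable A
Variable B    A1     A2     Total
B1            a      b      a+b
B2            c      d      c+d
Total         a+c    b+d    n
$$\chi^2 = \frac{n(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}$$
With Yates' continuity correction:
$$\chi^2_{Yates} = \sum \frac{(|O - E| - 0.5)^2}{E} = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$$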
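The shortcut formula for a 2x2 table, with and without Yates' continuity correction, can be sketched as follows. The function name `chi2_2x2` is hypothetical, and clamping the corrected difference at zero is an assumed convention rather than something stated in the slides. The counts in the usage lines are borrowed from the coronary disease example (21, 6, 22, 51) that appears later in this module.

```python
def chi2_2x2(a, b, c, d, yates=False):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]] via the shortcut formula."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: subtract n/2, never letting the result go below 0
        # (assumed convention for very small differences).
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

uncorrected = chi2_2x2(21, 6, 22, 51)
corrected = chi2_2x2(21, 6, 22, 51, yates=True)
print(round(uncorrected, 2), round(corrected, 2))
```

The corrected statistic is always the smaller of the two, which makes the test slightly more conservative.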
• Linear regression?
[Figure: Dot-plot of the data from the table. Vertical axis: signs of coronary disease (Yes/No); horizontal axis: age in years (0 to 100).]
Logistic regression
Table 2: Prevalence (%) of signs of CD by age group
                                     Diseased
Age group (years)    No. in group    No.    %
20 - 29              5               0      0
30 - 39              6               1      17
40 - 49              7               2      29
50 - 59              7               4      57
60 - 69              5               4      80
70 - 79              2               2      100
80 - 89              1               1      100
[Figure: Dot-plot of the data from the above table. Vertical axis: diseased % (0 to 100); horizontal axis: age group (0 to 8).]
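[Figure: Best-fit logistic curve. Vertical axis: probability Y (0 to 1); horizontal axis: Z, from -∞ to +∞ (ticks at -10, -5, 0, 5, 10).]
$$P(y|x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$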
The sigmoidal shape means that the probability of disease is low until a certain
exposure threshold is reached, after which the risk increases rapidly until all
but the most resilient subjects have become ill as a result of exposure.
• However, the values of the probabilities
calculated remain between 0 and 1, whereas
the linear regression may predict values
above 1 or less than 0.
• The linear regression model is not
appropriate when Y is a dichotomous
variable because the expected value (or
mean) of Y is the probability that Y=1 and,
therefore, is limited to the range 0 through 1,
inclusive.
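The contrast between the two models can be illustrated numerically. In this sketch all four coefficients are hypothetical values chosen only to show the shapes; they are not fitted to the data in the slides.

```python
from math import exp

def logistic(x, alpha, beta):
    """P(y|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return exp(z) / (1 + exp(z))

def linear(x, intercept, slope):
    """Straight-line prediction, which is not bounded to [0, 1]."""
    return intercept + slope * x

# Hypothetical coefficients, for illustration only.
ALPHA, BETA = -5.0, 0.1
INTERCEPT, SLOPE = -0.25, 0.015

ages = (0, 25, 50, 75, 100)
for age in ages:
    print(age, round(logistic(age, ALPHA, BETA), 3), round(linear(age, INTERCEPT, SLOPE), 3))
```

With these values the straight line gives a "probability" below 0 at age 0 and above 1 at age 100, while the logistic curve stays strictly between 0 and 1 throughout.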
Logistic Model
• In logistic regression we model log odds of
disease or outcome of interest, and the
models are referred to as logistic models.
$$\ln\left[\frac{P(y|x)}{1 - P(y|x)}\right] = \alpha + \beta x$$
The left-hand side is the logit of P(y|x).
• Therefore, a logistic model for the association
between a binary exposure and disease:
log odds = ln[p/(1-p)] = α + β × exposure
(Where “odds” is short for odds of disease)
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta x$$
$$\text{Logit}(p) = \alpha + \beta x$$
$$\frac{P}{1-P} = e^{\alpha + \beta x}$$
• If we let
– Exposure 1=exposed and 0=unexposed
– Disease 1=disease and 0=disease free
log odds = α + β*Exposure
Risk Analysis Using a 2x2 Table
Y=1=Outcome Present=Success
Y=0=Outcome Absent=Failure
Interpretation of coefficient
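                 Disease Y
Exposure X    yes            no
yes           P(y|x=1)       1 - P(y|x=1)
no            P(y|x=0)       1 - P(y|x=0)
$$\frac{P}{1-P} = e^{\alpha + \beta x}$$
$$Odds_{d|e} = e^{\alpha + \beta} \qquad Odds_{d|\bar{e}} = e^{\alpha}$$
$$OR = \frac{e^{\alpha + \beta}}{e^{\alpha}} = e^{\beta} \qquad \ln(OR) = \beta$$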
Interpretation of coefficient
• β = increase in the log odds of disease (the
logarithm of the OR) for a one-unit increase in x
• Interval estimation
$$95\%\ CI = e^{\beta \pm 1.96\,SE_{\beta}}$$
Example
• Risk of developing coronary heart disease
(CD) by age (<55 and 55+ years)
                 CD
Age (years)   Present (1)   Absent (0)
55+ (1)       21            6
<55 (0)       22            51

              Coefficient   SE       Coeff/SE
Age           2.094         0.529    3.96
$$OR = e^{2.094} = 8.1$$
Wald test $= 3.96^2$ with 1 df (p < 0.05)
$$95\%\ CI = e^{2.094 \pm 1.96 \times 0.529} = (2.9, 22.9)$$
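These figures can be reproduced directly from the four cell counts. The sketch below assumes the coefficient comes from a logistic model whose only predictor is the binary age indicator, in which case β equals the log of the cross-product odds ratio and its standard error is given by the usual large-sample formula sqrt(1/a + 1/b + 1/c + 1/d). The variable names are illustrative.

```python
from math import exp, log, sqrt

# Counts from the table: rows are age 55+ and <55, columns are CD present / absent.
a, b = 21, 6    # 55+: present, absent
c, d = 22, 51   # <55: present, absent

odds_ratio = (a * d) / (b * c)            # cross-product ratio
beta = log(odds_ratio)                    # logistic coefficient for age
se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the log OR

z = beta / se                             # Coeff/SE
wald = z ** 2                             # Wald chi-squared, 1 df
ci = (exp(beta - 1.96 * se), exp(beta + 1.96 * se))

print(f"OR = {odds_ratio:.1f}, beta = {beta:.3f}, SE = {se:.3f}, z = {z:.2f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

The Wald statistic of about 15.7 is well above 3.84, the 5% critical value for 1 df, which matches the reported p < 0.05.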
Multiple Logistic Regression
• More than one independent variable
– Dichotomous, ordinal, nominal, continuous …
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_i x_i$$
• Interpretation of $\beta_i$
– Increase in log-odds for a one-unit increase in $x_i$ with all the
other $x_i$s held constant
– Measures association between $x_i$ and log-odds adjusted for
all other $x_i$
Effect modification
• Effect modification
– Can be modelled by including interaction terms
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \times x_2$$
Example
P = probability of cardiac arrest
Exc: 1 = lack of exercise, 0 = exercise
Smk: 1 = smokers, 0 = non-smokers
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1\,Exc + \beta_2\,Smk$$
$$= 0.7102 + 1.0047\,Exc + 0.7005\,Smk$$
(SE 0.2614) (SE 0.2664)
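The coefficients in this example can be turned into adjusted odds ratios with 95% confidence intervals using the interval formula given earlier. This is a sketch under one stated assumption: the two standard errors are taken to belong to Exc and Smk in the order they are printed. The function name `odds_ratio_ci` is hypothetical.

```python
from math import exp

# (coefficient, standard error) pairs as printed in the example.
# Assumption: SE 0.2614 belongs to Exc and SE 0.2664 to Smk.
coefs = {"Exc": (1.0047, 0.2614), "Smk": (0.7005, 0.2664)}

def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) = (e^beta, e^(beta - z*se), e^(beta + z*se))."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)

results = {name: odds_ratio_ci(beta, se) for name, (beta, se) in coefs.items()}
for name, (or_, lower, upper) in results.items():
    print(f"{name}: OR = {or_:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under that assumption the adjusted odds ratios are about 2.7 for lack of exercise and about 2.0 for smoking, and both intervals exclude 1.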