
Department of Preventive Medicine

School of Public Health, CHS, AAU

Basic Biostatistics
Module 18-20
Wakgari Deressa (PhD)

December 2016
Course Contents
18. Introduction to measures of association
(categorical variables using chi-square
distribution)
19. Fisher's Exact test (0.5)
20. Introduction to logistic regression
Module 18
Introduction to
Measures of Association
• For the most part, we have been
applying hypothesis-testing techniques to
quantitative data:
– Correlation
– Linear regression
• Simple linear regression
• Multiple linear regression
• Categorical variables are very common in
health or social sciences
– Nominal
– Ordinal
• When both exposure and outcome variables
have only two possible values the data can
be displayed in a 2x2 table.
• Classification of a sample of subjects with
respect to exposure-outcome relationship
using a Contingency Table (two rows and two
columns = 2x2)
                       Outcome
Exposure      Yes        No         Total
Yes           a          b          a+b = n1
No            c          d          c+d = n2
Total         a+c = m1   b+d = m2   N (grand total)

(n1 and n2 are the row marginal totals; m1 and m2 are the column marginal totals.)
Test of Association for rxc
contingency tables
rxc contingency table
r = number of rows (no. of categories of an
attribute A)
c = number of columns (no. of categories of an
attribute B)
                          Variable B
Variable A   B1    B2    B3    ...   Bc    Total
A1           O11   O12   O13   ...   O1c   R1
A2           O21   O22   O23   ...   O2c   R2
A3           O31   O32   O33   ...   O3c   R3
...          ...   ...   ...   ...   ...   ...
Ar           Or1   Or2   Or3   ...   Orc   Rr
Total        C1    C2    C3    ...   Cc    n
Oij – Observed frequency of ith row and jth
column
Ri – Marginal total of the ith row
Cj – Marginal total of the jth column
n – Grand total
Chi-Squared Test
• In this section, we describe how to use a chi-
squared (χ²) test to examine whether there is
an association between the row variable and
the column variable, or whether the
distribution of individuals among the
categories of one variable is independent of
their distribution among the categories of the
other.
• It is used to test hypotheses where the data
are in the form of frequencies.
Chi-Squared Test
• If we have a variable X which has a standard
normal distribution, then X² has a chi-squared
distribution.
• Clearly X² can have only positive values, and
its distribution is highly skewed.
• This distribution of X² has one degree of
freedom, and is the simplest case of a more
general 'family' of chi-squared distributions.
Chi-Squared Test
• If we have several independent variables, each
of which has a standard normal distribution,
say X1, X2, X3, …, Xk, then the sum of the
squares of all the X's, ΣXi², has a chi-squared
distribution with k degrees of freedom.
• The chi-squared distribution with one df is the
square of a standard normal distribution, so
the 5% cut-off point for X² is the square of the
5% two-tailed cut-off for the normal
distribution: 1.96² = 3.84
The Chi-Square Distribution
Properties of χ²-curves
1. The total area under a χ²-curve equals 1.
2. A χ²-curve starts at 0 on the horizontal axis
and extends indefinitely to the right,
approaching, but never touching, the
horizontal axis.
3. A χ²-curve is right-skewed.
4. As the number of degrees of freedom
becomes larger, χ²-curves look increasingly
like normal curves.
Steps to the Chi-Squared Statistic
1. Identify the hypothesis you are testing
2. Present the observed data
3. Calculate the expected cell frequencies
4. Compute difference between observed and
expected frequencies for each cell
5. Compute (Difference Squared)/Expected
6. Calculate the Chi-square value
7. Determine the degrees of freedom.
8. Make statistical decision about Ho and give
conclusion.
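The computational steps above (2 through 7) can be sketched in a few lines of Python; the observed table here is made up purely for illustration:

```python
# Chi-squared test of association, following the steps above.
# The observed 2x2 table is hypothetical, for illustration only.
observed = [
    [10, 20],   # row 1
    [30, 40],   # row 2
]

# Step 3: expected cell frequencies E_ij = (row total x column total) / grand total
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Steps 4-6: chi-squared statistic = sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 7: degrees of freedom = (r - 1)(c - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 3), df)   # about 0.794 on 1 df
```

Step 8 (the decision) then compares `chi2` against the tabulated critical value for `df` degrees of freedom at the chosen α.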
Chi-Squared Test
 Hypothesis to be tested
H0 : There is no association between the
row and column variables
H1 : There is an association
or
H0 : The row and column variables are
independent
H1 : The two variables are dependent
Chi-Squared Test
 Test Statistic: χ²-test with d.f. = (r-1)×(c-1)

χ² = Σ over i,j of (Oij - Eij)² / Eij

Where
Eij = (ith row total × jth column total) / grand total = (Ri × Cj) / n

• Choose the level of significance (e.g. α=5%)


Chi-Squared Test
 Determine the critical value
(tabulated value of χ²)
[Figure: χ² distribution with 10 d.f.; acceptance region (area 0.95) to the left of the critical value 18.307, rejection region (area 0.05) to the right]
Chi-Squared Test
 Perform the calculation: χ²cal

 Give conclusion:
– if χ²cal > χ²α, reject H0 (P < α)
– if χ²cal ≤ χ²α, fail to reject H0 (P ≥ α)
Guidelines for Interpreting Chi-
Squared Test
Example
• A village survey investigated 124 households and
recorded their water supply. By reviewing the
village’s health center morbidity records for a
period of three months prior to the survey, it was
possible to identify household members with a
history of diarrhoeal episodes. Is there any
association between diarrhoeal episodes and
source of water in the village?
Solution
• H0: There is no association between
diarrhoeal episodes and source of
water in the village
H1: There is an association

 Data: See the following table


Chi-Squared Test
No. of HHs according to water supply

Status                    River   Well   Tap   Total
No diarrhoeal episodes    39      14     12    65
Diarrhoeal episodes       49      6      4     59
Total                     88      20     16    124
Chi-Squared Test
 c2 – test with d.f.=(2-1)x(3-1)=2
  = 0.05
 c20.05, 2 = 5.991
Chi-Squared Test
 Calculation of the expected values
E11 = 65×88/124 = 46.13    E12 = 65×20/124 = 10.48    E13 = 65×16/124 = 8.39
E21 = 59×88/124 = 41.87    E22 = 59×20/124 = 9.52     E23 = 59×16/124 = 7.61
Chi-Squared Test
Calculation of the test statistic

χ²cal = (39-46.13)²/46.13 + (14-10.48)²/10.48 + (12-8.39)²/8.39
      + (49-41.87)²/41.87 + (6-9.52)²/9.52 + (4-7.61)²/7.61
      = 1.102 + 1.179 + 1.556 + 1.214 + 1.299 + 1.715 = 8.06

 Conclusion
Since χ²cal > χ²0.05,2 (8.06 > 5.991), we reject H0
and accept H1, i.e., the data indicate an
association between diarrhoeal episodes and
source of water in the village.
The observed χ² value of 8.06 exceeds the critical
value of 5.991 at α = 0.05 with 2 d.f., so P < 0.05.
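As a check, the village example can be reproduced in a few lines of Python; for 2 degrees of freedom the chi-squared tail probability has the closed form P = e^(-χ²/2), so the p-value needs only the standard library (small differences from the slide's 8.06 are rounding):

```python
import math

# The village survey table from the example above:
# rows = diarrhoeal status (no episodes, episodes), columns = River, Well, Tap.
observed = [[39, 14, 12],
            [49, 6, 4]]

row_totals = [sum(r) for r in observed]          # [65, 59]
col_totals = [sum(c) for c in zip(*observed)]    # [88, 20, 16]
n = sum(row_totals)                              # 124

expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

chi2 = sum((o - e) ** 2 / e
           for orow, erow in zip(observed, expected)
           for o, e in zip(orow, erow))

# With df = 2, the chi-squared upper-tail probability is exactly exp(-x/2).
p_value = math.exp(-chi2 / 2)
print(round(chi2, 2), round(p_value, 3))   # about 8.07 and 0.018
```

Since the p-value is below 0.05, this agrees with the rejection of H0 above.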
2x2 Contingency table
              Variable A
Variable B    A1      A2      Total
B1            a       b       a+b
B2            c       d       c+d
Total         a+c     b+d     n
2x2 Contingency table
• For a 2x2 table there is a short-cut formula to
calculate χ²:

χ² = n(ad - bc)² / [(a+c)(b+d)(a+b)(c+d)]
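The short-cut formula translates directly to code (a minimal sketch; the example counts are made up):

```python
def chi2_2x2(a, b, c, d):
    """Short-cut chi-squared statistic for a 2x2 table laid out as
    [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# Illustrative counts only:
print(round(chi2_2x2(10, 20, 30, 40), 3))
```

For the same counts this matches the general (O - E)²/E computation cell by cell, as it must.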
Assumptions of the c2 - test
 No expected frequency should be less than 1,
and no more than 20% of the expected
frequencies should be less than 5. If this does
not hold:
– row or column variable categories can sometimes
be combined to make the expected frequencies
larger, or
– use Yates' correction:

χ²Yates = Σ (|O - E| - ½)² / E

• For a 2x2 table:

χ²Yates = n(|ad - bc| - n/2)² / [(a+c)(b+d)(a+b)(c+d)]
Yate’s Correction X2 =3.28, p-value=0.07
Module 19

Fisher’s Exact test


Fisher’s Exact Test
• For a 2x2 table, when the total number of
observations is less than 20, or when it is
greater than 20 and the smallest of the four
expected frequencies is less than 5, use
Fisher's Exact test.
Fisher’s Exact Test
• The H0 tested with both the chi-square test and
Fisher's exact test is that the observed frequencies,
or frequencies more extreme, could occur by
chance, given the fixed values of the row and
column totals.
• Therefore, for Fisher's exact test, the probabilities
of frequencies more extreme than those observed
must also be calculated, and the probabilities of all
the more extreme sets are added to the probability
of the observed set.
Fisher’s Exact Test
• The exact probability of the occurrence of the
observed frequencies, given the assumption
of independence and size of the marginal
frequencies (row and column totals), is

(a  b)!(c  d )!(a  c)!(b  d )!


P
a!b!c!d!n!
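This probability translates directly into Python with `math.factorial` (a sketch; the counts 3, 1, 2, 2 are from the treatment example used later in this module):

```python
from math import factorial

def table_prob(a, b, c, d):
    """Exact (hypergeometric) probability of a 2x2 table [[a, b], [c, d]]
    with all marginal totals fixed."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(c + d)
                 * factorial(a + c) * factorial(b + d))
    denominator = (factorial(a) * factorial(b) * factorial(c)
                   * factorial(d) * factorial(n))
    return numerator / denominator

# Observed table of the treatment example later in this module:
print(round(table_prob(3, 1, 2, 2), 3))   # 0.429
```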
Summary how to do the Fisher’s exact test
• Compute the probability associated with the
observed data
• Identify the cell in the contingency table with the
smallest frequency
• Reduce the smallest element in the table by 1, then
compute the elements for the other three cells so
that the row and column sums remain constant
• Compute the probabilities associated with the new
table
Summary how to do the Fisher’s exact test
• Repeat this process until the smallest element is
zero.
• List the remaining tables by repeating this process
for the other three elements. List each pattern of
observations only once.
• Compute the probabilities associated with each of
these tables.
• Add all the probabilities together that are equal to
or smaller than the probability associated with the
observed data.
Summary how to do the Fisher’s exact test
• This probability is the two-tail probability of
observing a pattern in the data as extreme or
more extreme than the observed.
• Note that when either the two rows or two
columns have the same sums, the two-tail
probability is simply twice the one-tail
probability.
• Some books say that the two-tail value of P is
always twice the one-tail value. This is not
correct unless the row or column sums are
equal.
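The enumeration procedure just described can be sketched in Python: generate every table with the same margins, and sum the probabilities that are equal to or smaller than the probability of the observed table (tested here on the treatment example that follows):

```python
from math import factorial

def table_prob(a, b, c, d):
    """Exact probability of a 2x2 table [[a, b], [c, d]] with fixed margins."""
    n = a + b + c + d
    return (factorial(a + b) * factorial(c + d) * factorial(a + c)
            * factorial(b + d)) / (factorial(a) * factorial(b)
            * factorial(c) * factorial(d) * factorial(n))

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value: sum the probabilities of every table
    with the same margins whose probability is <= that of the observed one."""
    r1, r2 = a + b, c + d          # row totals
    c1 = a + c                     # first column total
    p_obs = table_prob(a, b, c, d)
    p_total = 0.0
    # Enumerate every feasible value of the top-left cell.
    for x in range(max(0, c1 - r2), min(r1, c1) + 1):
        p = table_prob(x, r1 - x, c1 - x, r2 - (c1 - x))
        if p <= p_obs + 1e-12:     # tolerance for float comparison
            p_total += p
    return p_total

print(round(fisher_two_sided(3, 1, 2, 2), 3))   # 1.0, as in the example below
```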
Example
• Artificial data to illustrate Fisher's Exact Test.

              Survived   Died   Total
Treatment A   3          1      4
Treatment B   2          2      4
Total         5          3      8

The probability of obtaining the observed frequencies is
P = (4! 4! 5! 3!) / (3! 1! 2! 2! 8!) = 0.429
• The potential values for O11 are (1, 2, 3, 4)
when the row and column totals are fixed
• P(1)= 0.071
• P(2)= 0.429
• P(3)= 0.429 (Observed Table)
• P(4)= 0.071
Example
• In this example, there is one more extreme table:

              Survived   Died   Total
Treatment A   4          0      4
Treatment B   1          3      4
Total         5          3      8

• This table has a probability of occurring of
P = (4! 4! 5! 3!) / (4! 0! 1! 3! 8!) = 0.071
Example
• Thus, the one-tailed Fisher’s exact test yields a
P-value of P = 0.429 + 0.071=0.50.
• Since the two rows have the same sum, the
two-tailed probability is twice the one tailed
probability (2X0.50=1.00).
Example
• Note that the other tail probability can also be
calculated by listing all the remaining possible
patterns in the data that would give the same
row and column totals. (These tables are found
by putting the values 2 and 3 in the "Died" cell
for treatment group A; any other value would
make the "Died" total greater than 3.)
Example
              Survived   Died   Total
Treatment A   2          2      4
Treatment B   3          1      4
Total         5          3      8

P = (4! 4! 5! 3!) / (2! 2! 3! 1! 8!) = 0.429

              Survived   Died   Total
Treatment A   1          3      4
Treatment B   4          0      4
Total         5          3      8

P = (4! 4! 5! 3!) / (1! 3! 4! 0! 8!) = 0.071
Computation of p-values
• Depends on whether a one-sided or a two-sided
test is being used
• H0: p1 = p2 vs. H1: p1 ≠ p2 (two-sided)
P-value = 2 × min[p(1)+p(2)+...+p(a), p(a)+p(a+1)+...+p(k), 0.5]
• H0: p1 ≥ p2 vs. H1: p1 < p2 (one-sided)
P-value = p(1)+p(2)+...+p(a)
• H0: p1 ≤ p2 vs. H1: p1 > p2 (one-sided)
P-value = p(a)+p(a+1)+...+p(k)
• Based on the p-value, we either reject or not reject Ho.
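These rules can be sketched as a small helper; `probs` holds the exact probabilities p(1)…p(k) for each possible cell count, and `a` is the 1-based position of the observed count (the values below are from the treatment example):

```python
def exact_p_values(probs, a):
    """One- and two-sided exact p-values from the list of table probabilities
    p(1)..p(k), where a is the 1-based position of the observed count."""
    lower = sum(probs[:a])        # p(1) + ... + p(a)
    upper = sum(probs[a - 1:])    # p(a) + ... + p(k)
    two_sided = 2 * min(lower, upper, 0.5)
    return two_sided, lower, upper

# Probabilities for O11 = 1, 2, 3, 4 in the treatment example; observed O11 = 3.
p = [0.071, 0.429, 0.429, 0.071]
two, lower, upper = exact_p_values(p, 3)
print(round(two, 2), round(lower, 3), round(upper, 3))   # 1.0 0.929 0.5
```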
Example
• Evaluate the statistical significance of the data in
the above table using a two-sided alternative.
• P-value = 2 × min[p(1)+p(2)+...+p(a), p(a)+p(a+1)+...+p(k), 0.5]
• Left-hand tail area = p(1)+p(2)+p(3) = 0.929
• Right-hand tail area = p(3)+p(4) = 0.500 (the minimum)
• P-value (two-tailed) = 2 × min(0.929, 0.5, 0.5) = 1.00
Example
• Thus, the total probability of obtaining a
pattern of observations as extreme or more
extreme than that observed is P = 1.00, and
we conclude there is no association between
the type of treatment and survival status.
Module 20
Introduction to Logistic
Regression
Logistic regression
• The type of regression method most commonly
used for the analysis of binary outcome variables
• The basic principle of logistic regression is much
the same as for ordinary multiple regression.
• The main difference is that instead of developing
a model that uses a combination of the values of
a group of explanatory variables to predict the
value of a dependent variable, we predict a
transformation of the dependent variable.
Logistic regression
• This type of variable is called a binomial (or
binary) variable and the regression is called
binary logistic regression.
• Applications of logistic regression have also been
extended to cases where the dependent variable
has more than two categories, known as
multinomial logistic regression.
• When multiple classes of the dependent variable
can be ranked, then ordinal logistic regression is
preferred to multinomial logistic regression.
Binary Logistic regression
• Can be used to:
 Compare a binary outcome variable
between two exposure groups
 Compare more than two exposure groups
 Examine the effect of an ordered or
continuous exposure variable
Binary Logistic regression
• Multivariable analysis allows for the efficient
estimation of measures of association while
controlling for a number of confounding factors
simultaneously.
• Logistic regression involves the construction of a
mathematical model to describe the association
between exposure and disease and other variables
that may confound or modify the effect of exposure
Binary Logistic regression
• Modeling is a technique of fitting the data to
particular statistical equations
• In this model the outcome (disease) is a
function of
 Exposure variables,
 Confounders, and
 Interaction terms (effect modifiers).
Binary Logistic regression
• If the model includes only the outcome
variable and one exposure variable, the results
should equal the odds ratio that can be
calculated from the 2x2 table.
• When other variables are included, the odds
ratio calculated is adjusted for all the other
variables.
Binary Logistic regression
• Logistic regression can be applied to case-control,
cohort, cross-sectional, and experimental data.
• All types of variables (categorical and continuous)
can be included in a logistic regression model,
although categorical (coded) variables make results
easier to interpret because the number of possible
responses has been limited (i.e., all possible
responses for continuous variables can be
numerous).
Binary Logistic regression
Table 1: Age and signs of coronary heart disease (CD)

Age  CD    Age  CD    Age  CD
22   0     40   0     54   0
23   0     41   1     55   1
24   0     46   0     58   1
27   0     47   0     60   1
28   0     48   0     60   0
30   0     49   1     62   1
30   0     49   0     65   1
32   0     50   1     67   1
33   0     51   0     71   1
35   1     51   1     77   1
38   0     52   0     81   1
How can we analyse these data?
• Compare mean age of diseased and non-
diseased
– Non-diseased: 38.6 years
– Diseased: 58.7 years (p<0.0001)

• Linear regression?
Dot-plot: Data from Table 1
[Figure: dot-plot of signs of coronary disease (No/Yes, vertical axis) against age in years (0-100, horizontal axis)]
Logistic regression
Table 2: Prevalence (%) of signs of CD by age group

Age group   # in group   # diseased   % diseased
20-29       5            0            0
30-39       6            1            17
40-49       7            2            29
50-59       7            4            57
60-69       5            4            80
70-79       2            2            100
80-89       1            1            100
Dot-plot: Data from the above Table
[Figure: % diseased (0-100, vertical axis) plotted against age group (horizontal axis); prevalence rises steadily with age]
Best-fit curve

P(y|x) = e^(α+βx) / (1 + e^(α+βx))

[Figure: sigmoid curve of the probability P (0 to 1, vertical axis) against z = α + βx (horizontal axis, -∞ to +∞)]

The sigmoidal shape means that the probability of disease is low until a certain
exposure threshold is reached, after which the risk increases rapidly until all
but the most resilient subjects have become ill as a result of exposure.
• However, the values of the probabilities
calculated remain between 0 and 1, whereas
the linear regression may predict values
above 1 or less than 0.
• The linear regression model is not
appropriate when Y is a dichotomous
variable because the expected value (or
mean) of Y is the probability that Y=1 and,
therefore, is limited to the range 0 through 1,
inclusive.
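The contrast can be seen numerically. This sketch uses arbitrary illustrative values of α and β (not estimates from any dataset): the linear predictor escapes the [0, 1] range, while the logistic transform always stays strictly inside (0, 1):

```python
import math

# alpha and beta are arbitrary illustrative values, not fitted estimates.
alpha, beta = -5.0, 0.1

ages = [0, 25, 50, 75, 100]
linear_preds = [alpha + beta * x for x in ages]                 # unbounded
probs = [math.exp(z) / (1 + math.exp(z)) for z in linear_preds] # in (0, 1)

print([round(z, 2) for z in linear_preds])   # ranges from -5.0 to 5.0
print([round(p, 3) for p in probs])          # all strictly between 0 and 1
```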
Logistic Model
• In logistic regression we model log odds of
disease or outcome of interest, and the
models are referred to as logistic models.

• If we let P = Pr(Y=1), then the ratio P/(1-P) can
take on values between 0 and +∞.
• In logistic regression, we express the
association with predictors by the Odds Ratio (OR)

Odds of disease = Prob(disease) / Prob(no disease)

Odds(Y=1) = Pr(Y=1) / [1 - Pr(Y=1)]

Pr(Y=1) = Odds(Y=1) / [1 + Odds(Y=1)]
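These two conversions can be written as one-line helpers (a minimal sketch):

```python
def odds_from_prob(p):
    """Odds(Y=1) = Pr(Y=1) / (1 - Pr(Y=1))."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Pr(Y=1) = Odds(Y=1) / (1 + Odds(Y=1))."""
    return odds / (1 + odds)

print(odds_from_prob(0.75))   # 3.0
print(prob_from_odds(3.0))    # 0.75
```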


Logistic transformation

P(y|x) = e^(α+βx) / (1 + e^(α+βx))

ln[ P(y|x) / (1 - P(y|x)) ] = α + βx

The left-hand side is the logit of P(y|x).
• Therefore, a logistic model for the association
between a binary exposure and disease:
log odds = ln[p/(1-p)] = α + β × exposure
(where "odds" is short for odds of disease)

ln[P/(1-P)] = α + βx
Logit(P) = α + βx
P/(1-P) = e^(α+βx)

• If we let
– Exposure: 1 = exposed and 0 = unexposed
– Disease: 1 = disease and 0 = disease free
then log odds = α + β × Exposure
Risk Analysis Using a 2x2 Table

Y = 1 = Outcome Present = Success
Y = 0 = Outcome Absent = Failure

Interpretation of coefficient β

                 Disease Y
Exposure X       yes              no
yes              P(y|x=1)         1 - P(y|x=1)
no               P(y|x=0)         1 - P(y|x=0)

Odds(disease | exposed) = e^(α+β)
Odds(disease | unexposed) = e^α
OR = e^(α+β) / e^α = e^β
ln(OR) = β
Interpretation of coefficient 
• β = increase in the logarithm of the OR for a
one-unit increase in x

• Interval estimation:
95% CI = e^(β ± 1.96 × SE(β))
Example
• Risk of developing coronary heart disease
(CD) by age (<55 and 55+ years)

            CD Present (1)   CD Absent (0)
55+ (1)     21               6
<55 (0)     22               51

Odds of disease among exposed = 21/6
Odds of disease among unexposed = 22/51
Odds ratio = (21/6) / (22/51) = 8.1
• Logistic Regression Model
ln[P/(1-P)] = α + β1 × Age = -0.841 + 2.094 × Age

            Coefficient   SE      Coeff/SE
Age         2.094         0.529   3.96
Constant    -0.841        0.255   -3.30

OR = e^2.094 = 8.1
Wald test = 3.96² = 15.7 with 1 d.f. (p < 0.05)
95% CI = e^(2.094 ± 1.96 × 0.529) = (2.9, 22.9)
Multiple Logistic Regression
• More than one independent variable
– Dichotomous, ordinal, nominal, continuous …

ln[P/(1-P)] = α + β1x1 + β2x2 + ... + βixi

• Interpretation of βi
– Increase in log-odds for a one-unit increase in xi, with all the
other xi held constant
– Measures the association between xi and the log-odds, adjusted
for all other xi
Effect modification
• Effect modification
– Can be modelled by including interaction terms

ln[P/(1-P)] = α + β1x1 + β2x2 + β3(x1 × x2)
Example
P: probability of cardiac arrest
Exc: 1 = lack of exercise, 0 = exercise
Smk: 1 = smoker, 0 = non-smoker

ln[P/(1-P)] = α + β1 Exc + β2 Smk
            = 0.7102 + 1.0047 Exc + 0.7005 Smk
                       (SE 0.2614)  (SE 0.2664)

OR for lack of exercise = e^1.0047 = 2.73 (adjusted for smoking)
95% CI = e^(1.0047 ± 1.96 × 0.2614) = (1.64, 4.56)
• Interaction between smoking and exercise?

ln[P/(1-P)] = α + β1 Exc + β2 Smk + β3 (Smk × Exc)

• Product term β3 = -0.4604 (SE 0.5332)
Wald test = 0.75 (1 d.f.)

-2log(L) = 342.092 with the interaction term
         = 342.836 without the interaction term

 LR statistic = 0.74 (1 d.f.), p = 0.39
 No evidence of any interaction
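The likelihood-ratio statistic and its p-value follow directly from the two -2log(L) values above; for 1 degree of freedom the chi-squared tail probability equals erfc(√(x/2)), so only the standard library is needed (a sketch):

```python
import math

# -2log(L) values from the interaction example above.
neg2ll_with = 342.092      # model including the Smk x Exc term
neg2ll_without = 342.836   # model without the interaction term

lr_stat = neg2ll_without - neg2ll_with   # about 0.74, on 1 df

# For a chi-squared variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2))
print(round(lr_stat, 2), round(p_value, 2))   # 0.74 0.39
```

The p-value of about 0.39 matches the slide's conclusion of no evidence of interaction.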
Logistic Regression Model Building Strategy
1. Identify the variables to be considered
through analysis of univariate, bivariate, and
stratified analysis.
2. Evaluate the interaction.
3. Evaluate confounding.
4. Choose the final model.
Summary
• For the small-sample case, Fisher’s exact test is used
to compare binomial proportions in two independent
samples.
• A chi-square test for r × c contingency tables was
developed, which is a direct generalization of the
2x2 contingency-table test.
Summary
• We examined multiple logistic regression, a
technique similar to multiple linear regression
when the outcome variable is binary.
• Using this technique allows one to control for
many confounding variables simultaneously.
• We discussed how to convert regression
coefficients into odds ratios.
The END
