Bio Statslectures

Biostatistics (MATH11230)
Vanda Inácio
University of Edinburgh
Semester 1, 2021/2022
Vanda Inácio (UoE) Biostatistics (MATH11230) 1 / 60

Introduction and general concepts
What is Epidemiology?
,→ The first part of the syllabus relates to epidemiological concepts and techniques.
,→ But what is epidemiology?
,→ Epidemiology is the study of the distribution and determinants of disease in human
populations (Woodward, 2014, p 1).
,→ Epidemiological information is used to plan and evaluate strategies to prevent illness and
also to guide the management of patients who have already developed the disease.

,→ According to Woodward (2014, p 2) the term derives from the term ‘epidemic’, which
appears to have been derived from epidemeion, a word used by Hippocrates when
describing a disease that was ‘visiting the people’.
,→ According to the same author, modern use of the term retains the restriction to human
populations but has broadened the scope to include any type of disease, transient or not.
,→ Epidemiologists therefore study long duration or chronic diseases (e.g., cancer or asthma),
as well as infectious diseases (e.g., COVID-19).
,→ Most epidemiological research, however, addresses chronic diseases because these
account for a large proportion of deaths in today’s world.

,→ The following is verbatim from Woodward (2014, p 2).
,→ Epidemiology is usually regarded as a branch of medicine that deals with populations
rather than individuals.
,→ Whereas the hospital clinician considers the best treatment and advice to give to each
individual patient so as to enable them to get better, the epidemiologist considers what
advice to give to the general population in order to lessen the overall burden of disease.
,→ However, due to its dealings with aggregations (of people), epidemiology is also an applied
branch of statistics.
,→ Advances in epidemiological research have generally been achieved through interaction of
the disciplines of medicine and statistics.

Measures of disease occurrence
Prevalence and incidence
,→ The fundamental observations in epidemiology are measures of the occurrence of
disease.
,→ We will assume that disease outcomes are binary: disease present or disease
absent (or, more generally, health outcome of interest present or absent).
,→ Although fine levels of a disease outcome variable would enhance the understanding of the
associations between disease and exposure to risk factors, quantifying a continuous level
of disease may involve invasive methods and therefore may be impractical or unethical.

,→ The prevalence of a disease is the proportion of a defined population at risk for the
disease that is affected by it at a specified point on the time scale.
,→ The prevalence at time, say t, is estimated as the ratio of the number of existing cases at
time t to the size of the population at risk at time t.
,→ The incidence proportion is the proportion of a defined population, all of whom are at risk
for the disease at the beginning of a specified time interval, who become new cases of the
disease before the end of the interval.
,→ Because the incidence proportion includes all individuals who have contracted the disease
over the entire interval, it is sometimes also referred to as the cumulative incidence.

,→ To be ‘at risk’ means that an individual has been previously unaffected by the disease, or
that susceptibility has been regained after previously contracting the disease and
recovering (e.g., as with the common cold, to which no one becomes fully immune).
,→ An incidence proportion can only be meaningfully interpreted when the time interval is
specified.

,→ Toy example from Jewell (2003, p 12)

,→ The prevalence at time t is either 4/100 or 4/99 depending on whether we consider case
4 to be at risk for the disease at time t or not, respectively.
,→ The incidence proportion over the time interval [t0 , t1 ] is 4/98 because cases 1 and 4 are
not at risk for the disease at the beginning of the interval (they are already cases).
,→ Note that the prevalence and incidence proportion do not have units. They are simply
proportions, sometimes expressed as percentages, that must lie in [0, 1].
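
The ratios quoted for the toy example can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are illustrative (they do not come from Jewell, 2003), and the counts are the ones stated in the bullets above.

```python
from fractions import Fraction

def prevalence(existing_cases, population_at_risk):
    """Existing cases at a time point divided by the population at risk at that time."""
    return Fraction(existing_cases, population_at_risk)

def incidence_proportion(new_cases, at_risk_at_start):
    """New cases during an interval divided by those at risk when the interval starts."""
    return Fraction(new_cases, at_risk_at_start)

# Counts quoted in the toy example: 4 existing cases at time t, with a
# denominator of 100 or 99 depending on whether case 4 counts as at risk,
# and 4 new cases among the 98 people at risk at the start of [t0, t1].
prev_100 = prevalence(4, 100)
prev_99 = prevalence(4, 99)
inc_prop = incidence_proportion(4, 98)

for name, value in [("prevalence (n=100)", prev_100),
                    ("prevalence (n=99)", prev_99),
                    ("incidence proportion", inc_prop)]:
    print(f"{name}: {value} = {float(value):.4f}")
```

Both measures are unitless and fall between 0 and 1, as the final bullet above notes.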

,→ Some comments about the prevalence and incidence proportion as scientific measures of
disease occurrence are in order.
,→ Clearly, disease occurrence affects prevalence.
,→ The higher the incidence of disease, the more people will have it.
,→ But prevalence is also affected by the duration of the disease.
,→ Diseases of long duration tend to have a higher prevalence than short-term illnesses
(which end quickly through either recovery or death), even if the total number of affected
individuals is about the same.
,→ That is, a population might have a low disease prevalence when: (1) the disease is rare, or
(2) it occurs with higher frequency, but affected individuals stay diseased for only a short
period of time (either because they recover or die).

,→ Further, because the duration of the disease, and hence its prevalence, may be affected by
medical treatment or other factors that are unrelated to those that caused the disease in
the first place, the prevalence is not as appropriate as the incidence proportion for
investigating the causes of the disease (known in epidemiological language as the etiology
of the disease).

,→ The following example taken from Jewell (2003, p 13) illustrates the dangers of using
prevalence data when trying to establish a causal association between an exposure and
initiation of disease.
,→ The data are from the Framingham Heart Study (Friedman et al., 1966) and relate
cholesterol levels to coronary heart disease (CHD) for men aged between 30
and 59 years.

,→ In this example, incidence refers to a group of men, initially free of CHD, whose cholesterol
levels were measured at the beginning of a 10 year follow-up period during which one
observed (and counted!) the number of men who developed CHD.
,→ Cholesterol levels were divided into four quartiles and ‘high’ and ‘low’ refer to the highest
and lowest quartiles, respectively.
,→ It is clear from the incidence data that there is a considerably larger proportion of CHD
cases in the high cholesterol group as compared with the low cholesterol group.
,→ However, this is not evident at all from the prevalence data (prevalence of CHD is not that
different for the two cholesterol groups), where both cholesterol and CHD measurements
were taken at the end of the 10 year follow-up period.

,→ This difference may arise if high cholesterol levels are associated only with those CHD
cases who died rapidly (i.e., died before the end of the 10 year period) and thus were not
included in the prevalence calculation.
,→ Another explanation might be that some surviving CHD patients modified their cholesterol
levels (e.g., through medication), so that their cholesterol levels were lower at the end of
the follow-up period.

,→ The prevalence is nonetheless extremely useful for measuring the burden of long-lasting
diseases on a population, especially if those contracting the disease require specific
medical attention.
,→ For example, the prevalent number of people in a population with end-stage renal disease
predicts the need in that population for dialysis treatment.

,→ The incidence proportion is also not free of drawbacks.
,→ The main drawback is a practical one and relates to the fact that subjects need to be
followed up over a period of time.
,→ First, there are competing risks that may lead to some subjects dying before the
observation period ends, making it impossible to ascertain whether they would have
developed the disease of interest if they had not died early from a cause other than the
outcome of the study.
,→ Secondly, it is also challenging to follow individuals for long periods of time, and subjects
can become lost to follow-up (e.g., because they move away or choose not to participate
further in the study), which also means that we do not know whether they would have
developed the disease had they remained in the study.

,→ The difficulty in computing the incidence proportion in studies in which there are competing
risks and losses to follow-up is what to use in the denominator.
,→ If the denominator consists of the number of subjects who were initially being
followed, then we would be underestimating the incidence proportion that would have been
observed if there had been no competing risks or losses to follow-up.
,→ Generally, the incidence proportion is more useful when the follow-up time is relatively
short and therefore one expects few losses to follow-up.
,→ However, when measuring the incidence of a rare disease, individuals need to be followed
for long periods.

,→ To address the problem of competing risks and losses to follow-up, epidemiologists often
resort to the incidence rate as a measure of disease occurrence.
,→ The numerator of the incidence rate is the same as in the incidence proportion (number of
new cases over a defined interval). It is in the denominator that these two measures differ.
,→ In the incidence rate the denominator is given by the total amount of time at risk for the
disease accumulated by the entire population over the same interval.
,→ The units of the incidence rate are thus $(\text{time})^{-1}$.
,→ It is worth mentioning that a mortality rate is an incidence rate in which the outcome
under study is death.

,→ The following example is a toy one from Jewell (2003, p 14).

,→ In the following calculations we will first assume that the disease under study is chronic
and that, as such, there is no recovery.
,→ Prevalence
,→ At t = 0: 0/5.
,→ At t = 5: 1/2.
,→ Incidence proportion over the time interval [0, 5]: 3/5 = 0.6.
,→ Incidence rate over the time interval [0, 5]: 3/(5 + 1 + 4 + 3 + 1) = 3/14 = 0.21 cases per
year.

,→ Now suppose instead that the disease is acute, so that individuals who recover
immediately return to being at risk. The denominator of the incidence rate would then be
5 + 1 + 5 + 3 + 4.5 = 18.5, leading to an incidence rate of 3/18.5 = 0.16 cases per year.
,→ Unlike the prevalence and incidence proportion, the incidence rate does not (necessarily)
lie in [0, 1].
,→ It indeed has a lower bound of zero, but it can theoretically become as great as infinity.
,→ Because the denominator of an incidence rate is measured in time units, we can imagine
that the time units can be smaller, making the rate larger.
,→ Indeed, the value of the incidence rate depends on what time unit is chosen.
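
The two incidence rate calculations above (chronic and acute versions of the toy example) can be checked numerically. The sketch below is illustrative only: the helper name is an assumption, and the per-subject times at risk are the ones quoted in the slides.

```python
def incidence_rate(new_cases, times_at_risk):
    """New cases divided by the total time at risk accumulated by all subjects."""
    return new_cases / sum(times_at_risk)

# Years at risk for the five subjects, as quoted in the example above.
chronic_years = [5, 1, 4, 3, 1]    # chronic disease: no recovery
acute_years = [5, 1, 5, 3, 4.5]    # acute disease: recovered subjects are at risk again

rate_chronic = incidence_rate(3, chronic_years)   # 3 / 14
rate_acute = incidence_rate(3, acute_years)       # 3 / 18.5

print(round(rate_chronic, 2), round(rate_acute, 2))  # 0.21 0.16 (cases per year)
```

Only the denominator changes between the two scenarios; the number of new cases stays at 3.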

,→ As an example, suppose that we measure an incidence rate in a population as 47 cases
occurring in 158 months.
,→ This would lead to 47/158 = 0.3 cases per month.
,→ We could restate this same incidence rate using cases per year instead of cases per
month, thus leading to 47/13.17 = 3.57 cases per year.
,→ The above two expressions measure the same incidence rate; the only difference is the
time unit chosen to express the denominator.
,→ The different time units affect the numerical values of the incidence rate.
,→ The situation is much the same as expressing speed in different units of time or
distance!
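
The change of time unit described above is a plain rescaling of the denominator. A short sketch, using the 47 cases and 158 months from the example (variable names are illustrative):

```python
cases = 47
months_at_risk = 158

rate_per_month = cases / months_at_risk         # cases per month
rate_per_year = cases / (months_at_risk / 12)   # same rate, expressed per year

print(round(rate_per_month, 2), round(rate_per_year, 2))  # 0.3 3.57
```

The yearly figure is exactly 12 times the monthly one, which is why the numerical value of an incidence rate is meaningless without its time unit.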

,→ In summary, we have learned thus far that:
,→ Prevalence concerns existing cases of a disease (or other health outcome) at a point
in time.
,→ Incidence concerns new cases of disease (or other health outcome) over a period of
follow-up.
,→ Prevalence is a useful measure for assessing the health status of a population and
in the planning of health services.
,→ Incidence measures are useful for identifying risk factors and assessing disease
etiology.

Prevalence and incidence: exercise
,→ Triple antiviral therapy has dramatically improved survival among patients with human
immunodeficiency virus (HIV) disease. If the incidence of HIV were to remain constant,
what is the expected impact of widespread triple antiviral therapy on the prevalence of HIV
in the population?
1 Increase.
2 Decrease.
3 Stay the same.

Prevalence and incidence: solution of the exercise
,→ The correct answer is option 1 (increase).
,→ If the incidence of disease remains constant, but the duration of the disease increases,
then prevalence will increase.
,→ Prevalence provides a snapshot of the amount of disease that is present at a given time.
The widespread use of antiviral therapies for HIV has dramatically increased the lifespan of
people who contract this disease.
,→ As a result, the proportion of people living with HIV disease in a given snapshot of time has
increased.

Prevalence and incidence: exercise
,→ The incidence of a disease is five times greater in men compared with women, yet there is
no difference in disease prevalence by sex. What is the best explanation for this finding?
1 Men receive more intensive medical care for the disease.
2 The mortality rate is greater among women.
3 The disease is less aggressive among women.
4 Women are older than men when they are diagnosed with the disease.

Prevalence and incidence: solution of the exercise
,→ The correct answer is option 1.
,→ The incidence of the disease is five times greater in men compared to women. For the
prevalences to be equal, the duration of the disease in men must be shorter than in
women.
,→ This could be due to men dying from the disease more rapidly (for example, the disease
may be more aggressive in men), or to men recovering more quickly (for example, men may
receive more intensive treatment for the disease).

Measures of disease-exposure association
,→ We will now be looking at measures that allow us to study the association between a risk
factor or exposure and the occurrence of disease.
,→ For simplicity, we will start with a risk factor that can only take two values (e.g., exposed
and unexposed).
,→ Let us denote the disease outcome (also binary: present or absent) by D and the risk
factor by E.

Relative risk
,→ The relative risk (RR), also known as risk ratio, for an outcome D associated with a binary
risk factor E is given by
$$
\mathrm{RR} = \frac{\Pr(D \mid E)}{\Pr(D \mid \bar{E})} = \frac{\Pr(D \mid E)}{\Pr(D \mid \text{not } E)}.
$$
,→ The relative risk is a non-negative number.
,→ A RR = 1, the so-called null value, implies that Pr(D | E) = Pr(D | not E), which is
equivalent to saying that D and E are independent (no association between the exposure
to the risk factor E and the outcome D).
,→ A RR > 1 indicates that there is a greater risk or probability of D when exposed than when
not exposed (positive association between the exposure to the risk factor E and the
outcome D).
,→ When RR < 1 there is a lower risk or probability of D when exposed than when not
exposed (negative association between the exposure to the risk factor E and the outcome
D).

Relative risk
,→ We shall note that the relative risk has an implicit upper bound.
,→ This is because the maximum possible value for a risk, and in particular for Pr(D | E), is
one, and therefore the relative risk must be less than or equal to 1/ Pr(D | not E).
,→ This restriction on the range of the relative risk is only problematic if the disease outcome
we are studying is a common one, as in this case Pr(D | not E) may become large and the
upper bound correspondingly small.
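
To see how the bound 1/Pr(D | not E) tightens as the outcome becomes more common, here is a small sketch. The baseline risks are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def max_relative_risk(risk_unexposed):
    """Largest attainable RR given Pr(D | not E), because Pr(D | E) cannot exceed 1."""
    return 1.0 / risk_unexposed

# Hypothetical baseline risks: a rare, a moderately common, and a very common outcome.
for p0 in (0.001, 0.1, 0.5):
    print(f"Pr(D | not E) = {p0}: RR can be at most {max_relative_risk(p0):.0f}")
```

With a baseline risk of 0.5 the relative risk can never exceed 2, whereas for a rare outcome the bound is so large that it is of no practical concern.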

Relative risk
,→ We shall note that the relative risk is not symmetric in the role of the two factors D and E.
,→ That is, the relative risk for E associated with D is a different measure of association than
the relative risk for D associated with E, i.e.,
$$
\frac{\Pr(E \mid D)}{\Pr(E \mid \text{not } D)} \neq \frac{\Pr(D \mid E)}{\Pr(D \mid \text{not } E)}.
$$

Relative risk
,→ Let us illustrate the relative risk calculation with data from the table below (Jewell, 2003, p
22), which lists the vital status of all births in the United States in 1991, one year after date
of birth, and categorized by the marital status of the mother at the birth and the birth weight
of the infant.
,→ Low birth weight was defined as a birth weight of less than 2.5 kg.

Relative risk
,→ Following the definition, the relative risk for infant mortality in the USA in 1991, associated
with the mother being unmarried, is
$$
\frac{\Pr(\text{death} \mid \text{unmarried mother})}{\Pr(\text{death} \mid \text{married mother})} = \frac{16712/1213854}{18784/2897205} = 2.12.
$$
,→ This result indicates that there is a positive association between infant mortality and
mother being unmarried.
,→ The risk of an infant death with an unmarried mother is about twice the risk of an infant
death with a married mother.

Relative risk
,→ In turn, the relative risk for infant mortality in the USA in 1991, associated with a low birth
weight is given by
$$
\frac{\Pr(\text{death} \mid \text{low birth weight})}{\Pr(\text{death} \mid \text{normal birth weight})} = \frac{21054/292323}{14442/3818736} = 19.0.
$$
,→ Infant mortality is thus much more strongly associated with birth weight than with marital
status.
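
The two relative risks above can be recomputed directly from the quoted counts. A minimal sketch; the function name and argument order are illustrative choices, not part of the source.

```python
def relative_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """RR = Pr(D | E) / Pr(D | not E), estimated from group counts."""
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    return risk_exposed / risk_unexposed

# Counts quoted above (US births, 1991): deaths and total births per group.
rr_marital = relative_risk(16712, 1213854, 18784, 2897205)
rr_birth_weight = relative_risk(21054, 292323, 14442, 3818736)

print(round(rr_marital, 2), round(rr_birth_weight, 1))  # 2.12 19.0
```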

Odds ratio
,→ The relative risk measures the risk of the outcome D through Pr(D).
,→ The odds of disease are simply given by
$$
\frac{\Pr(D)}{\Pr(\text{not } D)} = \frac{\Pr(D)}{1 - \Pr(D)}.
$$
,→ The odds of D provide the same information as Pr(D) since knowing one of the quantities
immediately determines the other.
,→ Just as the relative risk measures association by comparing the probability of the outcome
D in the exposed and unexposed groups, the odds ratio measures association by
comparing the odds of disease in the exposed and unexposed groups.
,→ The odds ratio for D associated with E is defined by
$$
\mathrm{OR} = \frac{\Pr(D \mid E) / \Pr(\text{not } D \mid E)}{\Pr(D \mid \text{not } E) / \Pr(\text{not } D \mid \text{not } E)}.
$$

Odds ratio
,→ The odds ratio is a number between 0 and ∞.
,→ Because there is no upper limit for the odds of D, the odds ratio, in contrast to the
relative risk, has no implicit upper bound.
,→ As with the relative risk, OR = 1 is the null value, and it corresponds to the case where the
odds of the outcome D are the same in both groups (exposed and unexposed), and it is
again equivalent to no association (independence) between D and E.
,→ When OR > 1, there is a greater risk of D in the exposed group.
,→ The reverse is true when OR < 1, that is, there is a lower risk of D in the exposed group.

Odds ratio
,→ Let us revisit the example about infant mortality in the USA in 1991.
,→ Let us first compute the odds of infant mortality in the unmarried mother (exposed) group
$$
\frac{16712/1213854}{1197142/1213854} = \frac{16712}{1197142}.
$$
,→ Similarly, the odds of infant mortality in the married mother (unexposed) group are
$$
\frac{18784/2897205}{2878421/2897205} = \frac{18784}{2878421}.
$$
,→ The odds ratio for infant mortality associated with an unmarried mother is
$$
\mathrm{OR} = \frac{16712/1197142}{18784/2878421} = 2.14.
$$

Odds ratio
,→ We will now redo the calculation with low birth weight as the risk
factor/exposure.
,→ The odds of death in the low birth weight group are
$$
\frac{21054/292323}{271269/292323} = \frac{21054}{271269}.
$$
,→ Analogously, the odds of death in the normal birth weight group are
$$
\frac{14442/3818736}{3804294/3818736} = \frac{14442}{3804294}.
$$
,→ Finally, the odds ratio for infant mortality associated with low birth weight is
$$
\mathrm{OR} = \frac{21054/271269}{14442/3804294} = 20.4.
$$
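
As a check on the odds ratio arithmetic above, the following sketch recomputes both values from the numbers of deaths and survivors in each group. The function name is illustrative.

```python
def odds_ratio(cases_exposed, noncases_exposed, cases_unexposed, noncases_unexposed):
    """OR = (odds of D among exposed) / (odds of D among unexposed)."""
    odds_exposed = cases_exposed / noncases_exposed
    odds_unexposed = cases_unexposed / noncases_unexposed
    return odds_exposed / odds_unexposed

# Deaths and survivors per group, as quoted above (US births, 1991).
or_marital = odds_ratio(16712, 1197142, 18784, 2878421)
or_birth_weight = odds_ratio(21054, 271269, 14442, 3804294)

print(round(or_marital, 2), round(or_birth_weight, 1))  # 2.14 20.4
```

Note that the group totals cancel, so the odds in each group need only the counts of deaths and survivors.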

Odds ratio
,→ So, let us compare what we have obtained for this example so far in terms of relative risks
and odds ratios.
Exposure/risk factor      RR      OR
Marital status            2.12    2.14
Low birth weight          19.0    20.4
,→ The RRs and ORs for both exposures are very similar.
,→ But is this always the case?
,→ Let us look formally at the relationship between the RR and the OR in the next slide.

Odds ratio
,→ By its definition, the odds ratio for D associated with E is
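$$
\begin{aligned}
\mathrm{OR} &= \frac{\Pr(D \mid E)}{\Pr(\text{not } D \mid E)} \times \frac{\Pr(\text{not } D \mid \text{not } E)}{\Pr(D \mid \text{not } E)} \\
&= \frac{\Pr(D \mid E)}{\Pr(D \mid \text{not } E)} \times \frac{\Pr(\text{not } D \mid \text{not } E)}{\Pr(\text{not } D \mid E)} \\
&= \mathrm{RR} \times \frac{\Pr(\text{not } D \mid \text{not } E)}{\Pr(\text{not } D \mid E)}.
\end{aligned}
$$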

Odds ratio
,→ Let us suppose first that RR > 1. By definition of relative risk, this means that
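$$
\Pr(D \mid E) > \Pr(D \mid \text{not } E),
$$
which trivially implies that
$$
1 - \Pr(D \mid E) < 1 - \Pr(D \mid \text{not } E),
$$
and which, in turn, can be equivalently written as
$$
\Pr(\text{not } D \mid E) < \Pr(\text{not } D \mid \text{not } E),
$$
thus implying that
$$
\frac{\Pr(\text{not } D \mid \text{not } E)}{\Pr(\text{not } D \mid E)} > 1.
$$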

Odds ratio
,→ The OR is thus the product of the RR (which by assumption is greater than one) and a
quantity that is also greater than one, and therefore we must conclude that OR > RR when
RR > 1.
,→ When RR < 1, a similar reasoning leads to the conclusion that OR < RR.
,→ We thus arrive at the conclusion that the odds ratio is always farther away from one than
the relative risk (except when the relative risk is one, in which case the odds ratio will also
be one).
,→ How much farther away the OR is from 1 compared to the RR will depend on both Pr(D | E)
and Pr(D | not E), which are also the two probabilities involved in the computation of the
risk ratio.
,→ When the risk of disease in both the exposed and unexposed groups is low, then
Pr(not D | E) ≈ 1 and Pr(not D | not E) ≈ 1 and so OR ≈ RR.
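
A quick numerical illustration of the relationship between the RR and the OR. The risks below are hypothetical values, chosen so that the relative risk is 2 in every row while the outcome becomes progressively more common; the function name is an assumption.

```python
def rr_and_or(risk_exposed, risk_unexposed):
    """Return (RR, OR) for given Pr(D | E) and Pr(D | not E)."""
    rr = risk_exposed / risk_unexposed
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return rr, odds_exposed / odds_unexposed

# Hypothetical pairs (Pr(D | E), Pr(D | not E)): rare, moderate, common outcome.
for p1, p0 in [(0.002, 0.001), (0.2, 0.1), (0.6, 0.3)]:
    rr, or_value = rr_and_or(p1, p0)
    # Identity from the derivation above: OR = RR * Pr(not D | not E) / Pr(not D | E)
    via_identity = rr * (1 - p0) / (1 - p1)
    print(f"p1={p1}, p0={p0}: RR={rr:.3f}, OR={or_value:.3f}, RR x ratio={via_identity:.3f}")
```

For the rare outcome the OR is about 2.002, essentially equal to the RR, while for the common outcome it grows to 3.5.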

Odds ratio
,→ We have seen before that the relative risk is not symmetric in the roles of D and E, that is,
the relative risk for D associated with E does not (need to) coincide with the relative risk for
E associated with D.
,→ The odds ratio enjoys the property of being symmetric with respect to the roles of D and E,
that is, reversing the roles of D and E makes no difference in its computation.
,→ As we will see in one of the next lectures, this property will be key when estimating the
association between D and E in certain study designs (case-control studies).
,→ In what follows, let $\mathrm{OR}_{D \mid E}$ denote the odds ratio for D associated with E and
$\mathrm{OR}_{E \mid D}$ denote the odds ratio for E associated with D.

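Odds ratio
$$
\begin{aligned}
\mathrm{OR}_{D \mid E}
&= \frac{\Pr(D \mid E) / \Pr(\text{not } D \mid E)}{\Pr(D \mid \text{not } E) / \Pr(\text{not } D \mid \text{not } E)} \\
&= \frac{[\Pr(D \,\&\, E)/\Pr(E)] \,/\, [\Pr(\text{not } D \,\&\, E)/\Pr(E)]}{[\Pr(D \,\&\, \text{not } E)/\Pr(\text{not } E)] \,/\, [\Pr(\text{not } D \,\&\, \text{not } E)/\Pr(\text{not } E)]} \\
&= \frac{\Pr(D \,\&\, E) / \Pr(\text{not } D \,\&\, E)}{\Pr(D \,\&\, \text{not } E) / \Pr(\text{not } D \,\&\, \text{not } E)} \\
&= \frac{\Pr(D \,\&\, E) / \Pr(D \,\&\, \text{not } E)}{\Pr(\text{not } D \,\&\, E) / \Pr(\text{not } D \,\&\, \text{not } E)} \\
&= \frac{[\Pr(D \,\&\, E)/\Pr(D)] \,/\, [\Pr(D \,\&\, \text{not } E)/\Pr(D)]}{[\Pr(\text{not } D \,\&\, E)/\Pr(\text{not } D)] \,/\, [\Pr(\text{not } D \,\&\, \text{not } E)/\Pr(\text{not } D)]} \\
&= \frac{\Pr(E \mid D) / \Pr(\text{not } E \mid D)}{\Pr(E \mid \text{not } D) / \Pr(\text{not } E \mid \text{not } D)} \\
&= \mathrm{OR}_{E \mid D}.
\end{aligned}
$$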
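
The symmetry of the odds ratio, and the lack of it for the relative risk, can be verified on the marital status counts used earlier. This is a sketch only; the cell labels a, b, c, d are an illustrative convention, not notation from the slides.

```python
# 2x2 counts from the marital status example (US births, 1991).
a, b = 16712, 1197142    # unmarried mother: deaths, survivors
c, d = 18784, 2878421    # married mother:   deaths, survivors

# Odds ratio for D (death) associated with E (unmarried), and with roles reversed.
or_d_given_e = (a / b) / (c / d)
or_e_given_d = (a / c) / (b / d)

# Relative risk for D associated with E, and for E associated with D.
rr_d_given_e = (a / (a + b)) / (c / (c + d))
rr_e_given_d = (a / (a + c)) / (b / (b + d))

print(round(or_d_given_e, 2), round(or_e_given_d, 2))  # 2.14 2.14
print(round(rr_d_given_e, 2), round(rr_e_given_d, 2))  # 2.12 1.6
```

Both odds ratios reduce to the cross-product ad/(bc), which is why reversing the roles of D and E leaves the value unchanged.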

Odds ratio
,→ Another advantage of the odds ratio over the relative risk is its insensitivity to whether a
study is summarised with respect to D present or to D absent (e.g., death or survival).
,→ One odds ratio is simply the reciprocal of the other. The same is not true for the relative
risk.
,→ Kahn and Sempos (1989) illustrated this point with the following data comparing mortality
experience in two communities.
              Number dying   Number surviving   Total
Community A        2                98           100
Community B        1                99           100
Total              3               197           200
,→ Community here acts as the risk factor/exposure.

Odds ratio
,→ Using these data to compare community A with community B, the relative risk of dying is
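$$
\frac{\Pr(\text{dying} \mid A)}{\Pr(\text{dying} \mid B)} = \frac{2/100}{1/100} = 2.
$$
,→ The same comparison of odds leads to the odds ratio
$$
\frac{\Pr(\text{dying} \mid A) / \Pr(\text{not dying} \mid A)}{\Pr(\text{dying} \mid B) / \Pr(\text{not dying} \mid B)} = \frac{2/98}{1/99} \approx 2.
$$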

Odds ratio
,→ Now, we will use the same data to compare the two communities but this time with respect
to surviving instead of dying.
,→ The relative risk for surviving is
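$$
\frac{\Pr(\text{surviving} \mid A)}{\Pr(\text{surviving} \mid B)} = \frac{98/100}{99/100} \approx 1.
$$
,→ In turn, the odds ratio for surviving is
$$
\frac{\Pr(\text{surviving} \mid A) / \Pr(\text{not surviving} \mid A)}{\Pr(\text{surviving} \mid B) / \Pr(\text{not surviving} \mid B)} = \frac{98/2}{99/1} \approx 1/2.
$$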

Odds ratio
,→ The use of the relative risk led to different results depending on whether the study stressed
death or survival.
,→ The odds ratio does not suffer from such a drawback: community A has about twice the
odds of community B with respect to dying and about half the odds of community B with
respect to surviving.
,→ This indifference as to whether stress is placed on counting the events or the nonevents is
clearly a desirable property of the odds ratio as a measure of association between a
disease outcome and exposure to a risk factor.
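
The Kahn and Sempos comparison can be reproduced in a few lines. A minimal sketch with an illustrative helper name; the counts are those in the table above.

```python
def risk_ratio_and_odds_ratio(events_a, n_a, events_b, n_b):
    """Return (RR, OR) comparing community A with community B for a given event."""
    risk_a, risk_b = events_a / n_a, events_b / n_b
    rr = risk_a / risk_b
    odds_ratio = (risk_a / (1 - risk_a)) / (risk_b / (1 - risk_b))
    return rr, odds_ratio

# Event = dying (2 of 100 in A, 1 of 100 in B), then event = surviving.
rr_dying, or_dying = risk_ratio_and_odds_ratio(2, 100, 1, 100)
rr_surviving, or_surviving = risk_ratio_and_odds_ratio(98, 100, 99, 100)

print(round(rr_dying, 2), round(rr_surviving, 2))   # 2.0 0.99
print(round(or_dying, 2), round(or_surviving, 2))   # 2.02 0.49
```

The two odds ratios are exact reciprocals of each other, whereas the two relative risks (2 and about 0.99) are not.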

Relative risk or odds ratio?
,→ As we have seen, both the relative risk and the odds ratio measure comparative chance of
disease (in the exposed and unexposed groups).
,→ The relative risk is easier to understand and less subject to misinterpretation.
,→ However, in some contexts, the odds ratio is all that we can estimate (the situation in
case–control studies as we will see later) or it is the most convenient measure to calculate
(in logistic regression analysis; also to be covered later in the course).
,→ As we have seen, the odds ratio will be a good approximation to the relative risk whenever
the disease in question is rare!

Excess risk
,→ The following exposition follows closely Jewell (2003, Chapter 4).
,→ Both the relative risk and the odds ratio measure, in relative terms, how the risk differs
between the exposed and unexposed groups.
,→ The excess risk (ER), also known as risk difference, is defined as the difference between
the risk of disease in an exposed population and the risk of disease in an unexposed
population, that is,
ER = Pr(D | E) − Pr(D | not E).
,→ The ER focuses on the absolute effect of the exposure, or the excess risk of disease in
those who have the risk factor compared with those who do not.

Excess risk
,→ The ER can also be interpreted as the difference in the number of cases of D in
populations where either everyone is exposed or unexposed.
,→ To see that, let us consider a population of size N in which all individuals are exposed.
Then, Pr(D) = Pr(D | E) and so the number of cases of D would be given by
N × Pr(D | E).
,→ Analogously, if in the same population there was no exposure to the risk factor, then
Pr(D) = Pr(D | not E) and the number of cases would be N × Pr(D | not E).
,→ The difference in the two caseloads is then N × Pr(D | E) − N × Pr(D | not E).
,→ Expressing N × Pr(D | E) − N × Pr(D | not E) as a fraction of the population size yields
Pr(D | E) − Pr(D | not E).
,→ Therefore, the excess risk can be interpreted as the excess number of cases, as a fraction
of the population size, when the individuals from the population are all exposed as
compared to them all being unexposed.

Excess risk
,→ The ER is a number between −1 and 1.
,→ ER = 0 is the null value since this is equivalent to Pr(D | E) = Pr(D | not E), that is, there
is no association between D and E.
,→ If ER > 0 then there is a greater risk of disease when exposed than when not exposed to
the risk factor. The opposite is true when ER < 0.

Excess risk
,→ Let us calculate the ER for the example of infant mortality in the USA in 1991 associated
with low birth weight.
,→ For this specific example, we have
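$$
\begin{aligned}
\mathrm{ER} &= \Pr(\text{infant death} \mid \text{low birth weight}) - \Pr(\text{infant death} \mid \text{normal birth weight}) \\
&= \frac{21054}{292323} - \frac{14442}{3818736} \\
&= 0.0682.
\end{aligned}
$$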
,→ A direct interpretation of this result is that we would expect infant mortality to increase
by about 7 percentage points if all births exhibited low birth weight as compared to all
births being of normal birth weight (more on this causal role of birth weight and infant
deaths will be discussed in the next slides).
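
The excess risk for low birth weight can be recomputed from the same counts used for the relative risk. A minimal sketch; the function name is illustrative, and the "per 1,000 births" rescaling is added here only to make the size of the difference easier to read.

```python
def excess_risk(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """ER = Pr(D | E) - Pr(D | not E), estimated from group counts."""
    return cases_exposed / n_exposed - cases_unexposed / n_unexposed

# Deaths and total births for low and normal birth weight (US births, 1991).
er_birth_weight = excess_risk(21054, 292323, 14442, 3818736)

print(round(er_birth_weight, 4))          # 0.0682
print(round(1000 * er_birth_weight, 1))   # 68.2 extra deaths per 1,000 births
```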

Attributable risk
,→ So far in our discussion nothing prevents an individual from becoming diseased without
exposure to the risk factor of interest, that is, Pr(D | not E) can be greater than zero.
,→ As a result, not all disease is due to exposure to the risk factor and so it is fair to ask how
much of the disease in the population can be explained by the presence of the risk factor.
,→ The attributable risk (AR) is a measure of association that answers such a question.
,→ The attributable risk is defined as the fraction of all D cases in the population that can be
attributed to exposure to the risk factor of interest.

Attributable risk
,→ As we did for the excess risk, let us suppose that the population of interest has N
individuals.
,→ Then, there are N × Pr(D) individuals with D in the population.
,→ Assuming that exposure to the risk factor E is eradicated, the risk of D in the population is
then given by Pr(D | not E) = Pr(D) and there would be N × Pr(D | not E) cases of D.
,→ The fraction of cases that would be removed, or thereby explained as due to E, is
$$
\mathrm{AR} = \frac{N \Pr(D) - N \Pr(D \mid \text{not } E)}{N \Pr(D)} = \frac{\Pr(D) - \Pr(D \mid \text{not } E)}{\Pr(D)}.
$$

Attributable risk
,→ Let us do some work with the AR expression.

,→ By the law of total probability we know that
Pr(D) = Pr(D | E) Pr(E) + Pr(D | not E) Pr(not E).
,→ Replacing it in the numerator of the AR expression (we will replace it in the denominator
later, to keep the expression simple for now), we have
$$
\begin{aligned}
\mathrm{AR} &= \frac{\Pr(D \mid E)\Pr(E) + \Pr(D \mid \text{not } E)\Pr(\text{not } E) - \Pr(D \mid \text{not } E)}{\Pr(D)} \\
&= \frac{\Pr(E)\,[\Pr(D \mid E) - \Pr(D \mid \text{not } E)]}{\Pr(D)} \\
&\overset{*}{=} \frac{\Pr(E)(\mathrm{RR} - 1)}{\Pr(E)\,\mathrm{RR} + 1 - \Pr(E)} \\
&= \frac{\Pr(E)(\mathrm{RR} - 1)}{1 + \Pr(E)(\mathrm{RR} - 1)}.
\end{aligned}
$$
,→ The second equality uses Pr(not E) = 1 − Pr(E). In the step marked with ∗ we have
replaced Pr(D) in the denominator by its corresponding expression using the law of total
probability and then divided both the numerator and the denominator by Pr(D | not E).

Attributable risk
,→ The null value that implies no association between D and E is AR = 0 (when RR = 1).
,→ When exposure to E increases the risk of D, one has 0 < AR ≤ 1.
,→ On the other hand, AR < 0 when there is a negative association between D and E.

Attributable risk
,→ We will revisit the infant mortality (in the USA in 1991) to illustrate the computation of the
AR.
,→ When the risk factor is the mother’s marital status (E: unmarried mother) we have already
seen that RR = 2.12. In addition, Pr(unmarried mother) = 1213854/4111059 = 0.2952.
,→ The AR associated with marital status is thus
0.2952 × (2.12 − 1)/[1 + 0.2952 × (2.12 − 1)] = 0.25.
,→ In a similar fashion, when the risk factor is low birth weight, we have also already
concluded that RR = 19.0. Moreover, Pr(low birth weight) = 292323/4111059 = 0.0711.
,→ Therefore, the AR associated with a low birth weight infant is
0.0711 × (19.0 − 1)/[1 + 0.0711 × (19.0 − 1)] = 0.56.
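
The two AR values can be cross-checked in two ways: from the closed-form expression in Pr(E) and RR, and directly from the definition [Pr(D) − Pr(D | not E)]/Pr(D) using the raw counts. The sketch below does both; the function names are illustrative assumptions.

```python
def attributable_risk(p_exposed, rr):
    """AR = Pr(E)(RR - 1) / [1 + Pr(E)(RR - 1)]."""
    return p_exposed * (rr - 1) / (1 + p_exposed * (rr - 1))

def attributable_risk_from_counts(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """AR = [Pr(D) - Pr(D | not E)] / Pr(D), estimated from group counts."""
    p_d = (cases_exposed + cases_unexposed) / (n_exposed + n_unexposed)
    p_d_unexposed = cases_unexposed / n_unexposed
    return (p_d - p_d_unexposed) / p_d

# Closed form, using the rounded values quoted in the slides.
ar_marital = attributable_risk(0.2952, 2.12)
ar_birth_weight = attributable_risk(0.0711, 19.0)

# Definition, using the raw counts (US births, 1991).
ar_marital_counts = attributable_risk_from_counts(16712, 1213854, 18784, 2897205)
ar_birth_weight_counts = attributable_risk_from_counts(21054, 292323, 14442, 3818736)

print(round(ar_marital, 2), round(ar_marital_counts, 2))            # 0.25 0.25
print(round(ar_birth_weight, 2), round(ar_birth_weight_counts, 2))  # 0.56 0.56
```

The two routes agree to two decimal places; the small differences beyond that come from using rounded inputs in the closed form.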

Attributable risk
,→ As stated in Jewell (2003, p 41) the naive interpretation of the results in terms of the AR in
the previous slide would be that infant mortality could be reduced by 25% if all mothers
were married or by 56% if low birth weights could be eradicated.
,→ While it seems plausible that such a substantial fraction of infant mortality could be
mitigated by intervention programs aimed at eliminating the risk of a low birth weight child,
it is hard to believe that 25% of infant deaths could be avoided simply through a program to
have single pregnant women marry before they give birth.
,→ This suggests that marital status does not, in fact, cause infant mortality. The apparent
association, as captured by the RR, OR, and AR is likely due to other factors that are
related both to marital status and infant mortality. We will deal with this in a few weeks in
this course.

Extra material: vaccine efficacy (and its connection to
the relative risk...)
,→ As can be read here:
https://www.thelancet.com/journals/lanmic/article/PIIS2666-5247(21)00069-0/fulltext
vaccine efficacy is generally reported as a relative risk reduction (RRR), which is defined
as 1 − RR.
,→ Here the exposure is beneficial and the exposed group consists of those who were
vaccinated. The unexposed group then consists of those who were not vaccinated. The
outcome D is becoming infected with the virus SARS-CoV-2.
,→ This blog entry nicely illustrates the calculation of the efficacy of the Pfizer and Moderna
vaccines (for COVID-19):
https://towardsdatascience.com/pfizer-and-moderna-vaccine-efficacy-calculated-from-data-9566897173c
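
As a small illustration of the relative risk reduction 1 − RR, the sketch below uses hypothetical trial counts. They are invented for illustration and are not taken from the Pfizer or Moderna trials or from the linked articles; the function name is likewise an assumption.

```python
def vaccine_efficacy(cases_vaccinated, n_vaccinated, cases_unvaccinated, n_unvaccinated):
    """Relative risk reduction, 1 - RR, with the vaccinated as the exposed group."""
    rr = (cases_vaccinated / n_vaccinated) / (cases_unvaccinated / n_unvaccinated)
    return 1 - rr

# Hypothetical counts, for illustration only: 10 infections among 20,000
# vaccinated participants and 100 among 20,000 unvaccinated participants.
efficacy = vaccine_efficacy(10, 20000, 100, 20000)

print(round(efficacy, 2))  # 0.9, i.e. RR = 0.1
```

An efficacy of 0.9 therefore corresponds to a relative risk of 0.1: the risk of infection among the vaccinated is one tenth of the risk among the unvaccinated.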
