Bio Statslectures
Vanda Inácio
University of Edinburgh
Semester 1, 2021/2022
,→ The first part of the syllabus relates to epidemiological concepts and techniques.
,→ Epidemiological information is used to plan and evaluate strategies to prevent illness and
also as a way to guide the management of patients who already developed the disease.
,→ According to Woodward (2014, p 2), the term ‘epidemiology’ derives from ‘epidemic’, which
appears to have been derived from epidemeion, a word used by Hippocrates when
describing a disease that was ‘visiting the people’.
,→ According to the same author, modern use of the term retains the restriction to human
populations but has broadened the scope to include any type of disease, transient or not.
,→ Epidemiologists therefore study long duration or chronic diseases (e.g., cancer or asthma),
as well as infectious diseases (e.g., COVID-19).
,→ Whereas the hospital clinician considers the best treatment and advice to give to each
individual patient so as to enable them to get better, the epidemiologist considers what
advice to give to the general population in order to lessen the overall burden of disease.
,→ However, due to its dealings with aggregations (of people), epidemiology is also an applied
branch of statistics.
,→ We will assume that disease outcomes are binary: disease present or disease
absent (or, more generally, health outcome of interest present or absent).
,→ Although fine levels of a disease outcome variable would enhance the understanding of the
associations between disease and exposure to risk factors, quantifying a continuous level
of disease may involve invasive methods and therefore may be impractical or unethical.
,→ The prevalence of a disease is the proportion of a defined population at risk for the
disease that is affected by it at a specified point on the time scale.
,→ The prevalence at time, say t, is estimated as the ratio of the number of existing cases at
time t to the size of the population at risk at time t.
,→ The incidence proportion is the proportion of a defined population, all of whom are at risk
for the disease at the beginning of a specified time interval, who become new cases of the
disease before the end of the interval.
,→ Because the incidence proportion includes all individuals who have contracted the disease
over the entire interval, it is sometimes also referred to as the cumulative incidence.
,→ To be ‘at risk’ means that an individual has been previously unaffected by the disease, or
that susceptibility has been regained after previously contracting the disease and
recovering (e.g., as with the common cold, to which no one becomes fully immune).
,→ An incidence proportion can only be meaningfully interpreted when the time interval is
specified.
,→ The prevalence at time t is either 4/100 or 4/99 depending on whether we consider case
4 to be at risk for the disease at time t or not, respectively.
,→ The incidence proportion over the time interval [t0 , t1 ] is 4/98 because cases 1 and 4 are
not at risk for the disease at the beginning of the interval (they are already cases).
,→ Note that the prevalence and incidence proportion do not have units. They are simply
proportions, sometimes expressed as percentages, that must lie in [0, 1].
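,→ The two definitions can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the counts are the ones quoted in the example above (a population of 100, two of whom are already cases at t0).

```python
def prevalence(existing_cases: int, population_at_risk: int) -> float:
    """Existing cases at time t divided by the population at risk at time t."""
    return existing_cases / population_at_risk


def incidence_proportion(new_cases: int, at_risk_at_start: int) -> float:
    """New cases during the interval divided by those at risk at its start."""
    return new_cases / at_risk_at_start


# Prevalence at time t, counting case 4 as at risk (100) or not (99).
prev_case4_at_risk = prevalence(4, 100)
prev_case4_not_at_risk = prevalence(4, 99)

# Cases 1 and 4 are already cases at t0, so only 98 people are at risk.
inc_prop = incidence_proportion(4, 100 - 2)

print(f"Prevalence at t: {prev_case4_at_risk:.4f} or {prev_case4_not_at_risk:.4f}")
print(f"Incidence proportion over [t0, t1]: {inc_prop:.4f}")
```

Both functions return a unitless number between 0 and 1; only the choice of numerator and denominator distinguishes them.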
,→ Some comments about the prevalence and incidence proportion as scientific measures of
disease occurrence are in order.
,→ The higher the incidence of disease, the more people will have it.
,→ Diseases of long duration tend to have a higher prevalence than short-term illnesses
(which end quickly through either recovery or death), even if the total number of
affected individuals is about the same.
,→ That is, a population might have a low disease prevalence when: (1) the disease is rare, or
(2) it occurs with higher frequency, but affected individuals stay diseased for only a short
period of time (either because they recover or die).
,→ Further, because the duration of the disease, and hence its prevalence, may be affected by
medical treatment or other factors that are unrelated to those that caused the disease in
the first place, the prevalence is not as appropriate as the incidence proportion to investigate
the causes of the disease (known in the epidemiological language as the etiology of the
disease).
,→ The following example taken from Jewell (2003, p 13) illustrates the dangers of using
prevalence data when trying to establish a causal association between an exposure and
initiation of disease.
,→ The data are from the Framingham Heart Study (Friedman et al., 1966) and relate
cholesterol levels and coronary heart disease (CHD) for men whose ages are between 30
and 59 years old.
,→ In this example, incidence refers to a group of men, initially free of CHD, whose cholesterol
levels were measured at the beginning of a 10 year follow-up period during which one
observed (and counted!) the number of men who developed CHD.
,→ Cholesterol levels were divided into four quartiles and ‘high’ and ‘low’ refer to the highest
and lowest quartiles, respectively.
,→ It is clear from the incidence data that there is a considerably larger proportion of CHD
cases in the high cholesterol group as compared with the low cholesterol group.
,→ However, this is not evident at all from the prevalence data (prevalence of CHD is not that
different for the two cholesterol groups), where both cholesterol and CHD measurements
were taken at the end of the 10 year follow-up period.
,→ This difference may arise if high cholesterol levels are associated only with those CHD
cases who died rapidly (i.e., died before the end of the 10 year period) and thus were not
included in the prevalence calculation.
,→ Another explanation might be that some surviving CHD patients modified their cholesterol
levels (e.g., through medication), so that their cholesterol levels were lower at the end of
the follow-up period.
,→ The prevalence is nonetheless extremely useful for measuring burden of long lasting
diseases on a population, especially if those contracting the disease require specific
medical attention.
,→ For example, the prevalent number of people in a population with end-stage renal disease
predicts the need in that population for dialysis treatment.
,→ The main drawback of the incidence proportion is a practical one and relates to the fact
that subjects need to be followed up over a period of time.
,→ First, there are competing risks that may lead to some subjects dying before the
observation period ends, making it impossible to ascertain whether they would have
developed the disease of interest if they had not died early from a cause other than the
outcome of the study.
,→ Secondly, it is also challenging to follow individuals for long periods of time, and subjects
can become lost to follow-up (e.g., because they move away or choose not to participate
further in the study), which also means that we do not know whether they would have
developed the disease had they remained in the study.
,→ The difficulty in computing the incidence proportion in studies in which there are competing
risks and losses to follow-up is what to use in the denominator.
,→ If the denominator consists of the number of subjects who were initially being
followed, then we would be underestimating the incidence proportion that would have been
observed if there had been no competing risks or losses to follow-up.
,→ Generally, the incidence proportion is more useful when the follow-up time is relatively
short and therefore one expects few losses to follow-up.
,→ However, when measuring the incidence of a rare disease, individuals need to be followed
for long periods.
,→ To address the problem of competing risks and losses to follow-up, epidemiologists often
resort to the incidence rate as a measure of disease occurrence.
,→ The numerator of the incidence rate is the same as in the incidence proportion (number of
new cases over a defined interval). It is in the denominator that these two measures differ.
,→ In the incidence rate the denominator is given by the total amount of time at risk for the
disease accumulated by the entire population over the same interval.
,→ It is worth mentioning that a mortality rate is an incidence rate in which the outcome
under study is death.
,→ In the following calculations we will first assume that the disease under study is chronic
and that, as such, there is no recovery.
,→ Prevalence
,→ At t = 0: 0/5.
,→ At t = 5: 1/2.
,→ Incidence proportion over the time interval [0, 5]: 3/5 = 0.6.
,→ Incidence rate over the time interval [0, 5]: 3/(5 + 1 + 4 + 3 + 1) = 3/14 = 0.21 cases per
year.
,→ Now, under the assumption that the disease is acute, so that individuals who
recover immediately return to being at risk, the denominator of the incidence rate
would be 5 + 1 + 5 + 3 + 4.5 = 18.5, leading to an incidence rate of 3/18.5 = 0.16 cases per year.
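,→ As a rough sketch, the incidence rate calculation above amounts to dividing the number of new cases by the summed time at risk. The function name is hypothetical; the per-subject times at risk (in years) are those listed in the two denominators above.

```python
def incidence_rate(new_cases: int, time_at_risk: list[float]) -> float:
    """New cases divided by the total time at risk accumulated by all subjects."""
    return new_cases / sum(time_at_risk)


# Time at risk (years) for each of the five subjects over [0, 5].
chronic_time = [5, 1, 4, 3, 1]    # no recovery: time at risk stops at onset
acute_time = [5, 1, 5, 3, 4.5]    # recovered subjects return to being at risk

rate_chronic = incidence_rate(3, chronic_time)
rate_acute = incidence_rate(3, acute_time)

print(f"Chronic disease: {rate_chronic:.2f} cases per year")  # 3/14
print(f"Acute disease:   {rate_acute:.2f} cases per year")    # 3/18.5
```

The numerator is the same in both scenarios; only the accumulated time at risk changes.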
,→ Unlike the prevalence and incidence proportion, the incidence rate does not (necessarily)
lie on [0, 1].
,→ It indeed has a lower bound of zero, but it can theoretically become as great as infinity.
,→ Because the denominator of an incidence rate is measured in time units, we can imagine
that the time units can be smaller, making the rate larger.
,→ Indeed, the value of the incidence rate depends on what time unit is chosen.
,→ We could restate this same incidence rate using cases per year instead of cases per
month, thus leading to 47/13.17 = 3.57 cases per year.
,→ The above two expressions measure the same incidence rate; the only difference is the
time unit chosen to express the denominator.
,→ The different time units affect the numerical values of the incidence rate.
,→ The situation is much like expressing speed in different units of time or
distance!
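,→ A minimal sketch of the change of time unit, using the 47 cases and 13.17 years quoted above. The per-month figure is obtained here purely by converting years to months (12 months per year); it is a derived illustration, not a value taken from the slides.

```python
MONTHS_PER_YEAR = 12

cases = 47
time_at_risk_years = 13.17

# Same numerator, same amount of time at risk, two different time units.
rate_per_year = cases / time_at_risk_years
rate_per_month = cases / (time_at_risk_years * MONTHS_PER_YEAR)

print(f"{rate_per_year:.2f} cases per year")
print(f"{rate_per_month:.3f} cases per month")
```

The two numbers differ by a factor of 12 but describe the same rate, just as 60 km/h and 1 km/min describe the same speed.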
,→ Prevalence concerns existing cases of a disease (or other health outcome) at a point
in time.
,→ Incidence concerns new cases of disease (or other health outcome) over a period of
follow-up.
,→ Prevalence is a useful measure for assessing the health status of a population and
in the planning of health services.
,→ Incidence measures are useful for identifying risk factors and assessing disease
etiology.
,→ Triple antiviral therapy has dramatically improved survival among patients with human
immunodeficiency virus (HIV) disease. If the incidence of HIV were to remain constant,
what is the expected impact of widespread triple antiviral therapy on the prevalence of HIV
in the population?
1 Increase.
2 Decrease.
,→ If the incidence of disease remains constant, but the duration of the disease increases,
then prevalence will increase.
,→ Prevalence provides a snapshot of the amount of disease that is present at a given time.
The widespread use of antiviral therapies for HIV has dramatically increased the lifespan of
people who contract this disease.
,→ As a result, the proportion of people living with HIV disease in a given snapshot of time has
increased.
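,→ The argument above can be illustrated with a toy simulation. Everything in it is an assumption made for illustration only: a closed population of constant size, a constant incidence rate among those at risk, and cases leaving the diseased state (through recovery or death, with replacement by a susceptible individual) at a rate of one over the mean disease duration. The numerical values are hypothetical.

```python
def long_run_prevalence(incidence_per_year: float,
                        mean_duration_years: float,
                        years: int = 500,
                        steps_per_year: int = 100) -> float:
    """Iterate a simple two-state model until prevalence settles down."""
    dt = 1.0 / steps_per_year
    prevalence = 0.0
    for _ in range(years * steps_per_year):
        # New cases arise only among the fraction still at risk.
        new_cases = incidence_per_year * (1.0 - prevalence) * dt
        # Existing cases leave the diseased state at rate 1 / duration.
        exits = prevalence / mean_duration_years * dt
        prevalence += new_cases - exits
    return prevalence


incidence = 0.01  # hypothetical: 1 new case per 100 person-years at risk
prev_short = long_run_prevalence(incidence, mean_duration_years=2)
prev_long = long_run_prevalence(incidence, mean_duration_years=10)

print(f"Mean duration  2 years: prevalence {prev_short:.3f}")
print(f"Mean duration 10 years: prevalence {prev_long:.3f}")
```

With the incidence held fixed, the longer duration yields the higher long-run prevalence, which is the pattern described for HIV under improved survival.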
,→ The incidence of a disease is five times greater in men compared with women, yet there is
no difference in disease prevalence by sex. What is the best explanation for this finding?
4 Women are older than men when they are diagnosed with the disease.
,→ The incidence of the disease is five times greater in men compared to women. For the
prevalences to be equal, the duration of the disease in men must be shorter than in
women.
,→ This could be due to men dying from the disease more rapidly, for example, the disease
may be more aggressive in men, or could be due to men recovering more quickly, for
example, men may receive more intensive treatment for the disease.
,→ We will now be looking at measures that allow us to study the association between a risk
factor or exposure and the occurrence of disease.
,→ For simplicity, we will start with a risk factor that can only take two values (e.g., exposed
and unexposed).
,→ Let us denote the disease outcome (also binary: present or absent) by D and the risk
factor by E.
,→ The relative risk (RR), also known as risk ratio, for an outcome D associated with a binary
risk factor E is given by
\[
\mathrm{RR} = \frac{\Pr(D \mid E)}{\Pr(D \mid \bar{E})} = \frac{\Pr(D \mid E)}{\Pr(D \mid \text{not } E)}.
\]
,→ A RR = 1, the so-called null value, implies that Pr(D | E) = Pr(D | not E), which is
equivalent to saying that D and E are independent (no association between the exposure
to the risk factor E and the outcome D).
,→ A RR > 1 indicates that there is a greater risk or probability of D when exposed than when
not exposed (positive association between the exposure to the risk factor E and the
outcome D).
,→ When RR < 1 there is a lower risk or probability of D when exposed than when not
exposed (negative association between the exposure to the risk factor E and the outcome
D).
,→ We shall note that the relative risk has an implicit upper bound.
,→ This is because the maximum possible value for a risk and, in particular for Pr(D | E) is
one, and therefore the relative risk must be less than or equal to 1/ Pr(D | not E).
,→ This restriction on the range of the relative risk is only problematic if the disease outcome
we are studying is a common one, as in this case Pr(D | not E) may become large.
,→ We shall note that the relative risk is not symmetric in the role of the two factors D and E.
,→ That is, the relative risk for E associated with D is a different measure of association than
the relative risk for D associated with E, i.e.,
\[
\frac{\Pr(E \mid D)}{\Pr(E \mid \text{not } D)} \neq \frac{\Pr(D \mid E)}{\Pr(D \mid \text{not } E)}.
\]
,→ Let us illustrate the relative risk calculation with data from the table below (Jewell, 2003, p
22), which lists the vital status of all births in the United States in 1991, one year after date
of birth, and categorized by the marital status of the mother at the birth and the birth weight
of the infant.
,→ Low birth weight was defined as a birth where the newborn’s weight is less than 2.5kg.
,→ Following the definition, the relative risk for infant mortality in the USA in 1991, associated
with the mother being unmarried, is
\[
\mathrm{RR} = \frac{16712/1213854}{18784/2897205} = 2.12.
\]
,→ This result indicates that there is a positive association between infant mortality and
mother being unmarried.
,→ The risk of an infant death with an unmarried mother is about twice the risk of an infant
death with a married mother.
,→ In turn, the relative risk for infant mortality in the USA in 1991, associated with a low birth
weight is given by
\[
\mathrm{RR} = \frac{21054/292323}{14442/3818736} = 19.0.
\]
,→ There is thus a much greater effect of birth weight on infant mortality than of marital status.
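,→ The two relative risks can be checked with a short calculation. The function name is hypothetical; the counts are the 1991 US birth data from Jewell (2003) as they appear in these slides (deaths and total births in each group).

```python
def relative_risk(cases_exposed: int, total_exposed: int,
                  cases_unexposed: int, total_unexposed: int) -> float:
    """Pr(D | E) / Pr(D | not E), estimated from group counts."""
    risk_exposed = cases_exposed / total_exposed
    risk_unexposed = cases_unexposed / total_unexposed
    return risk_exposed / risk_unexposed


# Exposure: unmarried mother (exposed) versus married mother (unexposed).
rr_marital = relative_risk(16712, 1213854, 18784, 2897205)

# Exposure: low birth weight (exposed) versus normal birth weight (unexposed).
rr_birth_weight = relative_risk(21054, 292323, 14442, 3818736)

print(f"RR, marital status:   {rr_marital:.2f}")
print(f"RR, low birth weight: {rr_birth_weight:.1f}")
```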
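,→ The relative risk measures the risk of the outcome D through Pr(D). An alternative is to
measure it through the odds of D:
\[
\text{odds of } D = \frac{\Pr(D)}{\Pr(\text{not } D)} = \frac{\Pr(D)}{1 - \Pr(D)}.
\]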
,→ The odds of D provide the same information as Pr(D) since knowing one of the quantities
immediately determines the other.
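,→ The one-to-one correspondence between Pr(D) and the odds of D can be sketched as a pair of conversion functions. The function names and the example probabilities are hypothetical.

```python
def odds_from_probability(p: float) -> float:
    """Odds = p / (1 - p); defined for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("probability must lie in [0, 1)")
    return p / (1 - p)


def probability_from_odds(odds: float) -> float:
    """Inverse mapping: p = odds / (1 + odds); defined for odds >= 0."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


for p in (0.01, 0.2, 0.5, 0.9):
    odds = odds_from_probability(p)
    print(f"Pr(D) = {p:.2f}  ->  odds = {odds:.4f}  ->  Pr(D) = "
          f"{probability_from_odds(odds):.2f}")
```

Note that the odds are close to the probability when Pr(D) is small and grow without bound as Pr(D) approaches one.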
,→ Just as the relative risk measures association by comparing the probability of the outcome
D in the exposed and unexposed groups, the odds ratio measures association by
comparing the odds of disease in the exposed and unexposed groups.
,→ Because there is no upper limit on the odds of D, the odds ratio, in contrast to the
relative risk, has no implicit upper bound.
,→ As with the relative risk, OR = 1 is the null value, and it corresponds to the case where the
odds of the outcome D are the same in both groups (exposed and unexposed), and it is
again equivalent to no association (independence) between D and E.
,→ The reverse is true when OR < 1, that is, there is a lower risk of D in the exposed group.
,→ Let us revisit the example about infant mortality in the USA in 1991.
,→ Let us first compute the odds of infant mortality in the unmarried mother (exposed) group:
\[
\frac{16712/1213854}{1197142/1213854} = \frac{16712}{1197142}.
\]
,→ Similarly, the odds of infant mortality in the married mother (unexposed) group are
\[
\frac{18784/2897205}{2878421/2897205} = \frac{18784}{2878421}.
\]
,→ The odds ratio for infant mortality associated with an unmarried mother is
\[
\mathrm{OR} = \frac{16712/1197142}{18784/2878421} = 2.14.
\]
,→ We will now redo the calculation, this time with low birth weight as the risk
factor/exposure. The odds of death in the low birth weight group are
\[
\frac{21054/292323}{271269/292323} = \frac{21054}{271269}.
\]
,→ Analogously, the odds of death in the normal birth weight group are
\[
\frac{14442/3818736}{3804294/3818736} = \frac{14442}{3804294}.
\]
,→ Finally, the odds ratio for infant mortality associated with low birth weight is
\[
\mathrm{OR} = \frac{21054/271269}{14442/3804294} = 20.4.
\]
,→ So, let us compare what we have obtained for this example so far in terms of relative risks
and odds ratios.
Exposure/risk factor    RR      OR
Marital status          2.12    2.14
Low birth weight        19.0    20.4
,→ The RRs and ORs for both exposures are very similar.
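,→ The odds ratios in this comparison can be reproduced from the counts of deaths and survivors in each group. The function name is hypothetical; the counts are those used in the odds calculations above.

```python
def odds_ratio(cases_exposed: int, noncases_exposed: int,
               cases_unexposed: int, noncases_unexposed: int) -> float:
    """Odds of D among the exposed divided by odds of D among the unexposed."""
    odds_exposed = cases_exposed / noncases_exposed
    odds_unexposed = cases_unexposed / noncases_unexposed
    return odds_exposed / odds_unexposed


# Deaths and survivors: unmarried (exposed) versus married (unexposed) mothers.
or_marital = odds_ratio(16712, 1197142, 18784, 2878421)

# Deaths and survivors: low (exposed) versus normal (unexposed) birth weight.
or_birth_weight = odds_ratio(21054, 271269, 14442, 3804294)

print(f"OR, marital status:   {or_marital:.2f}")
print(f"OR, low birth weight: {or_birth_weight:.1f}")
```

Unlike the relative risk, the denominators here are the survivors in each group rather than all births.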
,→ Let us look formally at the relationship between the RR and the OR in the next slide.
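,→ Let us suppose first that RR > 1. By definition of relative risk, this means that
Pr(D | E) > Pr(D | not E), and hence Pr(not D | E) < Pr(not D | not E). Writing the odds
ratio in terms of the relative risk,
\[
\mathrm{OR} = \frac{\Pr(D \mid E)/\Pr(\text{not } D \mid E)}{\Pr(D \mid \text{not } E)/\Pr(\text{not } D \mid \text{not } E)}
= \mathrm{RR} \times \frac{\Pr(\text{not } D \mid \text{not } E)}{\Pr(\text{not } D \mid E)}.
\]
,→ The OR is thus the product of the RR (which by assumption is greater than one) and a
quantity that is also greater than one, and therefore we must conclude that OR > RR when
RR > 1.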
,→ When RR < 1, a similar reasoning leads to the conclusion that OR < RR.
,→ We thus arrive at the conclusion that the odds ratio is always farther away from one than
the relative risk (except when the relative risk is one, in which case the odds ratio will also
be one).
,→ How much farther away the OR is from 1 compared to the RR will depend on both Pr(D | E) and
Pr(D | not E), which are also the two probabilities involved in the computation of the risk
ratio.
,→ When the risk of disease in both the exposed and unexposed groups is low, then
Pr(not D | E) ≈ 1 and Pr(not D | not E) ≈ 1 and so OR ≈ RR.
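,→ A small numerical sketch of this relationship, using the identity OR = RR × Pr(not D | not E) / Pr(not D | E). The function name and the two pairs of risks are hypothetical, chosen so that RR = 2 in both scenarios.

```python
def rr_and_or(risk_exposed: float, risk_unexposed: float) -> tuple[float, float]:
    """Return (RR, OR) for given risks Pr(D | E) and Pr(D | not E)."""
    rr = risk_exposed / risk_unexposed
    # OR equals RR times the ratio of the non-disease probabilities.
    odds_ratio = rr * (1 - risk_unexposed) / (1 - risk_exposed)
    return rr, odds_ratio


# Rare disease: both risks are small, so OR is close to RR.
rr_rare, or_rare = rr_and_or(0.002, 0.001)

# Common disease: same RR, but OR is pushed much farther away from one.
rr_common, or_common = rr_and_or(0.6, 0.3)

print(f"Rare disease:   RR = {rr_rare:.3f}, OR = {or_rare:.3f}")
print(f"Common disease: RR = {rr_common:.3f}, OR = {or_common:.3f}")
```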
,→ We have seen before that the relative risk is not symmetric in the roles of D and E, that is,
the relative risk for D associated with E does not (need to) coincide with the relative risk for
E associated with D.
,→ The odds ratio enjoys the property of being symmetric with respect to the roles of D and E,
that is, reversing the roles of D and E makes no difference in its computation.
,→ As we will see in one of the next lectures, this property will be key when estimating the
association between D and E in certain study designs (case-control studies).
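,→ In what follows, let OR_{D|E} denote the odds ratio for D associated with E and OR_{E|D} denote
the odds ratio for E associated with D.
\begin{align*}
\mathrm{OR}_{D \mid E}
&= \frac{\Pr(D \mid E)}{\Pr(\text{not } D \mid E)} \bigg/ \frac{\Pr(D \mid \text{not } E)}{\Pr(\text{not } D \mid \text{not } E)} \\
&= \frac{\Pr(D \,\&\, E)/\Pr(E)}{\Pr(\text{not } D \,\&\, E)/\Pr(E)} \bigg/ \frac{\Pr(D \,\&\, \text{not } E)/\Pr(\text{not } E)}{\Pr(\text{not } D \,\&\, \text{not } E)/\Pr(\text{not } E)} \\
&= \frac{\Pr(D \,\&\, E)}{\Pr(\text{not } D \,\&\, E)} \bigg/ \frac{\Pr(D \,\&\, \text{not } E)}{\Pr(\text{not } D \,\&\, \text{not } E)} \\
&= \frac{\Pr(D \,\&\, E)}{\Pr(D \,\&\, \text{not } E)} \bigg/ \frac{\Pr(\text{not } D \,\&\, E)}{\Pr(\text{not } D \,\&\, \text{not } E)} \\
&= \frac{\Pr(D \,\&\, E)/\Pr(D)}{\Pr(D \,\&\, \text{not } E)/\Pr(D)} \bigg/ \frac{\Pr(\text{not } D \,\&\, E)/\Pr(\text{not } D)}{\Pr(\text{not } D \,\&\, \text{not } E)/\Pr(\text{not } D)} \\
&= \frac{\Pr(E \mid D)}{\Pr(\text{not } E \mid D)} \bigg/ \frac{\Pr(E \mid \text{not } D)}{\Pr(\text{not } E \mid \text{not } D)} \\
&= \mathrm{OR}_{E \mid D}.
\end{align*}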
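,→ The symmetry can also be checked numerically. The sketch below uses the marital status counts from the infant mortality example (D: infant death, E: unmarried mother); the variable names are hypothetical.

```python
import math

# Cell counts of the 2x2 table.
d_and_e = 16712            # deaths, unmarried mother
not_d_and_e = 1197142      # survivors, unmarried mother
d_and_not_e = 18784        # deaths, married mother
not_d_and_not_e = 2878421  # survivors, married mother

# Odds ratio for D associated with E, and for E associated with D.
or_d_given_e = (d_and_e / not_d_and_e) / (d_and_not_e / not_d_and_not_e)
or_e_given_d = (d_and_e / d_and_not_e) / (not_d_and_e / not_d_and_not_e)

# Relative risk for D associated with E, and for E associated with D.
rr_d_given_e = (d_and_e / (d_and_e + not_d_and_e)) / (
    d_and_not_e / (d_and_not_e + not_d_and_not_e))
rr_e_given_d = (d_and_e / (d_and_e + d_and_not_e)) / (
    not_d_and_e / (not_d_and_e + not_d_and_not_e))

print(f"OR for D given E: {or_d_given_e:.4f}, OR for E given D: {or_e_given_d:.4f}")
print(f"RR for D given E: {rr_d_given_e:.4f}, RR for E given D: {rr_e_given_d:.4f}")
print("Odds ratios agree:", math.isclose(or_d_given_e, or_e_given_d))
```

The two odds ratios coincide (up to floating-point rounding), whereas the two relative risks do not.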
,→ Another advantage of the odds ratio over the relative risk is its insensitivity to whether a
study is summarised with respect to D present or to D absent (e.g., death or survival).
,→ One odds ratio is simply the reciprocal of the other. The same is not true for the relative
risk.
,→ Kahn and Sempos (1989) illustrated this point with the following data comparing mortality
experience in two communities.
,→ Using these data to compare community A with community B, the relative risk of dying is
\[
\frac{\Pr(\text{dying} \mid A)}{\Pr(\text{dying} \mid B)} = \frac{2/100}{1/100} = 2.
\]
,→ Now, we will use the same data to compare the two communities but this time with respect
to surviving instead of dying.
\[
\frac{\Pr(\text{surviving} \mid A)}{\Pr(\text{surviving} \mid B)} = \frac{98/100}{99/100} \approx 1.
\]
,→ The use of the relative risk led to different results depending on whether the study stressed
death or survival.
,→ The odds ratio does not suffer from such a drawback: community A has about twice the
odds of community B with respect to dying and about half the odds of community B with
respect to surviving.
,→ This indifference as to whether stress is placed on counting the events or the nonevents is
clearly a desirable property of the odds ratio as a measure of association between a
disease outcome and exposure to a risk factor.
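,→ The contrast can be made concrete with the two-community data used above (2 deaths per 100 in community A, 1 death per 100 in community B). The variable names are hypothetical.

```python
import math

deaths_a, n_a = 2, 100
deaths_b, n_b = 1, 100
survivors_a = n_a - deaths_a
survivors_b = n_b - deaths_b

# Relative risks, summarising by death and then by survival.
rr_dying = (deaths_a / n_a) / (deaths_b / n_b)
rr_surviving = (survivors_a / n_a) / (survivors_b / n_b)

# Odds ratios, summarising by death and then by survival.
or_dying = (deaths_a / survivors_a) / (deaths_b / survivors_b)
or_surviving = (survivors_a / deaths_a) / (survivors_b / deaths_b)

print(f"RR dying = {rr_dying:.3f}, RR surviving = {rr_surviving:.3f}")
print(f"OR dying = {or_dying:.3f}, OR surviving = {or_surviving:.3f}")
print("OR surviving is the reciprocal of OR dying:",
      math.isclose(or_dying * or_surviving, 1.0))
```

The two odds ratios multiply to one, so either summary conveys the same information; the two relative risks do not share this property.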
,→ As we have seen, both the relative risk and the odds ratio measure comparative chance of
disease (in the exposed and unexposed groups).
,→ However, in some contexts, the odds ratio is all that we can estimate (the situation in
case–control studies as we will see later) or it is the most convenient measure to calculate
(in logistic regression analysis; also to be covered later in the course).
,→ As we have seen, the odds ratio will be a good approximation to the relative risk whenever
the disease in question is rare!
,→ Both the relative risk and the odds ratio are relative measures of risk differences between
the exposed and unexposed groups.
,→ The excess risk (ER), also known as risk difference, is defined as the difference between
the risk of disease in an exposed population and the risk of disease in an unexposed
population, that is,
ER = Pr(D | E) − Pr(D | not E).
,→ The ER focuses on the absolute effect of the exposure, or the excess risk of disease in
those who have the risk factor compared with those who do not.
,→ To see that, let us consider a population of size N in which all individuals are exposed.
Then, Pr(D) = Pr(D | E) and so the number of cases of D would be given by
N × Pr(D | E).
,→ Analogously, if in the same population there was no exposure to the risk factor, then
Pr(D) = Pr(D | not E) and the number of cases would be N × Pr(D | not E).
,→ The difference in the two caseloads is then N × Pr(D | E) − N × Pr(D | not E).
,→ Therefore, the excess risk can be interpreted as the excess number of cases, as a fraction
of the population size, when the individuals from the population are all exposed as
compared to them all being unexposed.
,→ ER = 0 is the null value since this is equivalent to Pr(D | E) = Pr(D | not E), that is, there
is no association between D and E.
,→ If ER > 0 then there is a greater risk of disease when exposed than when not exposed to
the risk factor. The opposite is true when ER < 0.
,→ Let us calculate the ER for the example of infant mortality in the USA in 1991 associated
with low birth weight.
\begin{align*}
\mathrm{ER} &= \Pr(\text{infant death} \mid \text{low birth weight}) - \Pr(\text{infant death} \mid \text{normal birth weight}) \\
&= \frac{21054}{292323} - \frac{14442}{3818736} \\
&= 0.0682.
\end{align*}
,→ Direct interpretation of this result means that we would expect the infant mortality
risk to increase by about 7 percentage points (0.0682) if all births exhibited low birth
weight as compared to all births being of normal birth weight (more on this causal role of
birth weight and infant deaths will be discussed in the next slides).
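,→ The excess risk calculation can be sketched as follows; the function name is hypothetical and the counts are those of the low birth weight example above.

```python
def excess_risk(cases_exposed: int, total_exposed: int,
                cases_unexposed: int, total_unexposed: int) -> float:
    """Pr(D | E) - Pr(D | not E), estimated from group counts."""
    return cases_exposed / total_exposed - cases_unexposed / total_unexposed


# Exposed: low birth weight; unexposed: normal birth weight.
er_birth_weight = excess_risk(21054, 292323, 14442, 3818736)

print(f"ER, low birth weight: {er_birth_weight:.4f}")
# Expressed per 1,000 births, under the 'all exposed versus all unexposed' reading.
print(f"Excess deaths per 1,000 births: {1000 * er_birth_weight:.1f}")
```

Being a difference of two proportions, the excess risk is an absolute measure and lies between −1 and 1.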
,→ As a result, not all disease is due to exposure to the risk factor and so it is fair to ask how
much of the disease in the population can be explained by the presence of the risk factor.
,→ The attributable risk (AR) is a measure of association that answers such a question.
,→ The attributable risk is defined as the fraction of all D cases in the population that can be
attributed to exposure to the risk factor of interest.
,→ As we did for the excess risk, let us suppose that the population of interest has N
individuals.
,→ Assuming that exposure to the risk factor E is eradicated, the risk of D in the population is
then given by Pr(D | not E) = Pr(D) and there would be N × Pr(D | not E) cases of D.
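,→ Compared with the N × Pr(D) cases that occur in the presence of the risk factor, the fraction
of cases attributable to the exposure is therefore
\[
\mathrm{AR} = \frac{N \Pr(D) - N \Pr(D \mid \text{not } E)}{N \Pr(D)}
= \frac{\Pr(D) - \Pr(D \mid \text{not } E)}{\Pr(D)}
\overset{*}{=} \frac{\Pr(E)(\mathrm{RR} - 1)}{1 + \Pr(E)(\mathrm{RR} - 1)}.
\]
,→ In the step marked with ∗ we have replaced Pr(D) by its corresponding
expression using the law of total probability, Pr(D) = Pr(D | E) Pr(E) + Pr(D | not E) Pr(not E),
and then divided both the numerator and the denominator by Pr(D | not E).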
Vanda Inácio (UoE) Biostatistics (MATH11230)
Measures of disease-exposure association
Attributable risk
,→ The null value that implies no association between D and E is AR = 0 (when RR = 1).
,→ On the other hand, AR < 0 when there is a negative association between D and E.
,→ We will revisit the infant mortality (in the USA in 1991) to illustrate the computation of the
AR.
,→ When the risk factor is the mother’s marital status (E: unmarried mother) we have already
seen that RR = 2.12. In addition, Pr(unmarried mother) = 1213854/4111059 = 0.2952.
,→ In a similar fashion, when the risk factor is low-birth weight, we have also already
concluded that RR = 19.0. Moreover, Pr(low birth weight) = 292323/4111059 = 0.0711.
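,→ A minimal sketch of the two attributable risk calculations, assuming the usual expression of the AR in terms of the exposure probability and the relative risk, AR = Pr(E)(RR − 1) / (1 + Pr(E)(RR − 1)). The function name is hypothetical; the inputs are the exposure proportions and relative risks quoted above.

```python
def attributable_risk(prob_exposed: float, relative_risk: float) -> float:
    """AR = Pr(E)(RR - 1) / (1 + Pr(E)(RR - 1))."""
    excess = prob_exposed * (relative_risk - 1)
    return excess / (1 + excess)


total_births = 4111059

# Risk factor: unmarried mother.
ar_marital = attributable_risk(1213854 / total_births, 2.12)

# Risk factor: low birth weight.
ar_birth_weight = attributable_risk(292323 / total_births, 19.0)

print(f"AR, marital status:   {ar_marital:.2f}")
print(f"AR, low birth weight: {ar_birth_weight:.2f}")
```

The AR depends on how common the exposure is as well as on the RR, which is why a rare exposure with a large RR and a common exposure with a modest RR can both account for a sizeable fraction of cases.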
,→ As stated in Jewell (2003, p 41) the naive interpretation of the results in terms of the AR in
the previous slide would be that infant mortality could be reduced by 25% if all mothers
were married or by 56% if low birth weights could be eradicated.
,→ While it seems plausible that such a substantial fraction of infant mortality could be
mitigated by intervention programs aimed at eliminating the risk of a low birth weight child,
it is hard to believe that 25% of infant deaths could be avoided simply through a program to
have single pregnant women marry before they give birth.
,→ This suggests that marital status does not, in fact, cause infant mortality. The apparent
association, as captured by the RR, OR, and AR is likely due to other factors that are
related both to marital status and infant mortality. We will deal with this in a few weeks in
this course.
,→ Here the exposure is beneficial and the exposed group consists of those who were
vaccinated. The unexposed group then consists of those who were not vaccinated. The
outcome D is becoming infected with the SARS-CoV-2 virus.
,→ This blog entry nicely illustrates the calculation of the efficacy of the Pfizer and Moderna
vaccines for COVID-19:
https://towardsdatascience.com/pfizer-and-moderna-vaccine-efficacy-calculated-from-data-9566897173c