Common Pitfalls in the Interpretation of COVID-19 Data and Statistics
1007/s10272-020-0893-1
Andreas Backhaus
Policymakers, experts and the general public heavily rely on the data that are being reported in the context of the coronavirus pandemic. Daily data releases on confirmed COVID-19 cases and deaths provide information on the course of the pandemic. The same data are also essential for the estimation of indicators such as the reproduction rate and for the evaluation of policy interventions that seek to slow down the pandemic.

Together with the proliferation of data, however, a number of pitfalls have arisen with regard to the interpretation of the data and the conclusions that can be drawn from them. The aim of this paper is to highlight the most common among these pitfalls given that they have the potential to intentionally or unintentionally mislead the public debate and thereby the course of future policy actions.

The list of pitfalls presented is non-exhaustive. In fact, as the supply of data has increased since the beginning of the pandemic, new pitfalls have emerged in parallel, while others have decreased in relevance; a tendency that seems likely to continue into the future. Beyond explaining some of the current pitfalls, this paper will serve as a more general caveat regarding the interpretation of data in the context of the SARS-CoV-2 pandemic.
© The Author(s) 2020. Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Open Access funding provided by ZBW – Leibniz Information Centre for Economics.
Andreas Backhaus, Federal Institute for Population Research, Wiesbaden, Germany.
Intereconomics 2020 | 3, Forum

A primer on case fatality rates, infection fatality rates and mortality rates

In the public debate, one can encounter at least three concepts that measure the deadliness of SARS-CoV-2: the case fatality rate (CFR), the infection fatality rate (IFR) and the mortality rate (MR). Unfortunately, these three concepts are sometimes used interchangeably, which creates confusion as they differ from each other by definition.

In its simplest form, the case fatality rate divides the total number of confirmed deaths by COVID-19 by the total number of confirmed cases of infections with SARS-CoV-2, neglecting, for simplicity, adjustments for future deaths among current cases. However, the number of confirmed cases is believed to severely underestimate the true number of infections. This is due to the asymptomatic course of the infection in many individuals and the lack of testing capacities. Hence, the CFR presumably represents an upper bound on the true lethality of SARS-CoV-2, as its denominator does not take the undetected infections into account.

The infection fatality rate seeks to represent the lethality more accurately by incorporating the number of undetected infections or at least an estimate thereof into its calculation. Consequently, the IFR divides the total number of confirmed deaths by COVID-19 by the total number of infections with SARS-CoV-2. Due to its larger denominator but identical numerator, the IFR is lower than the CFR. The IFR represents a crucial parameter in epidemiological simulation models, such as that presented by Ferguson et al. (2020), as it determines the number of expected fatalities given the simulated spread of the disease among the population.

The methodological challenge regarding the IFR is, of course, to find a credible estimate of the undetected cases of infection. An early estimate of the IFR was provided on the basis of data collected in the course of the SARS-CoV-2 outbreak on the Diamond Princess cruise ship in February 2020. Mizumoto et al. (2020) estimate that 17.9% (95% confidence interval: 15.5-20.2) of the cases were asymptomatic. Russell et al. (2020), after adjusting for age, estimate that the IFR among the Diamond Princess cases is 1.3% (95% confidence interval: 0.38-3.6) when considering all cases, but 6.4% (95% confidence interval: 2.6-13) when considering only cases of patients that are 70 years and older. The serological studies that are currently being conducted in several countries and localities serve to provide more estimates of the true number of infections with SARS-CoV-2 that have occurred over the past few months.¹

Finally, the (crude) mortality rate (or death rate) of SARS-CoV-2 is computed by dividing the total number of confirmed deaths by COVID-19 that have occurred in a given location during a certain period of time by the total population present in the same location during the same time period. Therefore, the MR can in principle be computed by dividing a country’s COVID-19 death count by its current population. Given that the coronavirus has never infected a country’s or location’s entire population, the MR will hence be lower than the IFR (and the CFR). However, the computation of the MR is not particularly informative when the pandemic has only been going on for a few months. For example, the global COVID-19 death count increased more than fivefold between 1 April and 1 May 2020 (Our World in Data, 2020), rendering any MR computed around 1 April essentially meaningless. Thus, the MR is more appropriately used as a retrospective measure of the damage done in terms of lives lost after a pandemic has run its course.

Comparability of case fatality rates between countries

In contrast to the IFR, case fatality rates have been available for many countries relatively early on during the pandemic due to their simplicity. Recall that the computation of the CFR only requires the total number of confirmed deaths by COVID-19 and the total number of confirmed cases of infections with SARS-CoV-2. As a consequence, CFRs have frequently been compared between countries. For example, the CFR of Italy has at virtually every point during the coronavirus pandemic exceeded the CFR of South Korea. A naive interpretation of this persistent difference could be that the virus has somehow been deadlier in Italy than in South Korea for unknown reasons. Such an interpretation overlooks that it must first be ensured that the CFRs of different countries are comparable. CFRs are comparable only if the confirmed cases that enter the calculation of the CFRs are sufficiently similar in terms of characteristics that are associated with fatalities.

Age is among the most important of such characteristics given the overwhelming evidence that the likelihood of survival is substantially lower for patients at higher ages (Docherty et al., 2020; Dowd et al., 2020). Italy and South Korea are among those countries that have published demographic characteristics of their confirmed cases comparatively early and consistently over the course of the pandemic. Figure 1 compares the confirmed cases by age group in Italy and South Korea. On 19 March 2020, South Korea exhibited a CFR of 1.1%, while Italy’s CFR stood at 8.6%. Using data from both countries and from the same date, a simple depiction of the distribution of the confirmed cases across ten-year age groups reveals that the CFRs of the two countries are not comparable: the cases in Italy are concentrated in the high-age and hence high-risk groups, as 38% of all confirmed Italian cases are at least 70 years old. By contrast, the confirmed cases in South Korea are distributed more evenly across age groups except for a spike in the young age group (20-29). Only 10% of the Korean cases are at least 70 years old. Consequently, the confirmed cases that enter the

¹ For a non-exhaustive overview of the serological studies and the associated complications, see e.g. Joseph and Branswell (2020).
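The three measures defined in the primer share a numerator and differ only in their denominators. The following minimal sketch makes that explicit. The function name and all input numbers are hypothetical and were invented for illustration; they are not COVID-19 data.

```python
def fatality_measures(deaths, confirmed_cases, estimated_infections, population):
    """Return CFR, IFR and MR as fractions, following the definitions in the text."""
    # The ordering CFR >= IFR >= MR relies on this ordering of the denominators.
    if not (0 < confirmed_cases <= estimated_infections <= population):
        raise ValueError("expected 0 < confirmed cases <= infections <= population")
    return {
        "CFR": deaths / confirmed_cases,       # deaths per confirmed case
        "IFR": deaths / estimated_infections,  # deaths per (estimated) infection
        "MR": deaths / population,             # deaths per inhabitant
    }


# HYPOTHETICAL inputs, chosen only to show how the denominators differ.
rates = fatality_measures(
    deaths=1_000,
    confirmed_cases=20_000,
    estimated_infections=100_000,
    population=10_000_000,
)
for name, value in rates.items():
    print(f"{name}: {value:.4%}")
```

With the same death count, the measure shrinks as the denominator grows from confirmed cases to all infections to the whole population, which is why the three terms cannot be used interchangeably.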
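The comparability argument for Italy and South Korea can be illustrated with a small sketch that treats the crude CFR as a weighted average of age-specific CFRs. The shares of cases aged 70 and older (38% and 10%) are the figures quoted in the text. The age-specific CFRs are hypothetical values, assumed identical in both countries, and the two-group split is a simplification of the ten-year age groups.

```python
def crude_cfr(age_shares, age_specific_cfr):
    """Crude CFR as the case-share-weighted average of age-specific CFRs."""
    if abs(sum(age_shares.values()) - 1.0) > 1e-9:
        raise ValueError("age shares must sum to 1")
    return sum(share * age_specific_cfr[group] for group, share in age_shares.items())


# Shares of confirmed cases by age, using the 70+ shares quoted in the text.
italy_shares = {"under_70": 0.62, "70_plus": 0.38}
korea_shares = {"under_70": 0.90, "70_plus": 0.10}

# HYPOTHETICAL age-specific CFRs, deliberately identical for both countries.
assumed_cfr = {"under_70": 0.01, "70_plus": 0.15}

italy = crude_cfr(italy_shares, assumed_cfr)
korea = crude_cfr(korea_shares, assumed_cfr)
print(f"Italy-like age mix: {italy:.1%}")
print(f"Korea-like age mix: {korea:.1%}")
```

Under these assumed rates the crude CFR is about 6.3% for the Italy-like age mix and about 2.4% for the Korea-like mix, although the virus is equally lethal within each age group by construction. The sketch is not meant to reproduce the reported CFRs of 8.6% and 1.1%; it only shows that the age composition of confirmed cases alone can generate a large gap.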
[Figure: plot with a date axis running from March to April (tick labels from 17 March to 30 April); plot content and caption not recoverable]
Statistics Sweden further provides a vivid depiction of the effects of the various data revisions on the total reported death count per day in Sweden during the months of March and April (see Figure 2): several days before its respective release date, each data series drops abruptly and indicates an unreasonably low death count. Every subsequent data release then substantially revises the death count upwards, with additional but less significant revisions in even later releases. For example, the data release from 6 April reports a total daily death count of 157 for 1 April. However, the data release from the following week revises this initial death count for 1 April upwards by almost 100% to 308. The subsequent releases settle the total death count at 324. Hence, it is important to keep in mind that very recent data are often incomplete and subject to substantial revisions. They are therefore not adequate for immediate use in policy evaluation.

Sample selection bias

Most often, the data collected and analysed in the context of the coronavirus pandemic do not represent random samples of the underlying population. The same ap-

Importantly, sample selection bias is not related to sample size. Hence, increasing the sample size by simply collecting more data will not eliminate the selection problem if the underlying mechanism that governs the selection into the sample is not addressed.

Endogeneity of policy interventions

It would certainly be worthwhile to evaluate the effectiveness of the various lockdown strategies implemented by governments across the globe in response to the coronavirus pandemic. For that purpose, it might be tempting to rank countries according to the stringency of their respective lockdown strategies and then to simply compare this ranking to a country ranking of the COVID-19 death toll, which would represent the outcome variable that the lockdowns were supposed to affect.

However, such a comparison and equally every regression analysis following the same intuition would suffer from an endogeneity problem. This econometric term is best understood by asking the question: Why have some countries with a high COVID-19 death toll, such as Italy
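The statement that sample selection bias is not related to sample size can be illustrated with a small simulation. It is a sketch under invented assumptions: the true prevalence and the two testing probabilities are hypothetical, and the selection mechanism (infected people are more likely to be tested) is one simple example of non-random selection, not a description of any actual testing regime.

```python
import random


def estimate_prevalence(n_tested, true_prevalence, test_prob_infected,
                        test_prob_healthy, rng):
    """Share of positives among tested people when testing depends on infection status."""
    positives = 0
    tested = 0
    while tested < n_tested:
        infected = rng.random() < true_prevalence
        # Selection mechanism: infected people enter the sample more often.
        p_test = test_prob_infected if infected else test_prob_healthy
        if rng.random() < p_test:
            tested += 1
            positives += infected
    return positives / tested


# HYPOTHETICAL parameters, chosen only for illustration.
TRUE_PREVALENCE = 0.05
rng = random.Random(0)  # fixed seed for reproducibility
for n in (100, 10_000, 100_000):
    estimate = estimate_prevalence(n, TRUE_PREVALENCE, 0.5, 0.05, rng)
    print(f"n = {n:>7}: estimated prevalence {estimate:.1%} "
          f"(true value {TRUE_PREVALENCE:.0%})")
```

With these parameters the expected share of positives among the tested is 0.025 / (0.025 + 0.0475), or about 34.5%, far above the assumed true prevalence of 5%. Larger samples make the estimate more stable around that biased value; they do not move it towards the true prevalence.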