
DOI: 10.1111/joes.12598

ARTICLE

Selective and (mis)leading economics journals: Meta-research evidence

Zohid Askarov1, Anthony Doucouliagos2, Hristos Doucouliagos3, T. D. Stanley4

1 Department of Economics, Westminster International University in Tashkent, Tashkent, Uzbekistan
2 Coles Group, Melbourne, Australia
3 Department of Economics and Deakin Laboratory for the Meta-Analysis of Research, Deakin University, Melbourne, Australia
4 Deakin Laboratory for the Meta-Analysis of Research, Department of Economics, Deakin University, Melbourne, Australia

Correspondence
Hristos Doucouliagos, Department of Economics, Deakin University, Melbourne, Australia.
Email: douc@deakin.edu.au

Abstract
We assess statistical power and excess statistical significance among 31 leading economics general interest and field journals using 22,281 parameter estimates from 368 distinct areas of economics research. Median statistical power in leading economics journals is very low (only 7%), and excess statistical significance is quite high (19%). Power this low and excess significance this high raise serious doubts about the credibility of economics research. We find that 26% of all reported results have undergone some process of selection for statistical significance and 56% of statistically significant results were selected to be statistically significant. Selection bias is greater at the top five journals, where 66% of statistically significant results were selected to be statistically significant. A large majority of empirical evidence reported in leading economics journals is potentially misleading. Results reported to be statistically significant are about as likely to be misleading as not (falsely positive), and statistically nonsignificant results are much more likely to be misleading (falsely negative). We also compare observational to experimental research and find that the quality of experimental economic evidence is notably higher.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits
use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or
adaptations are made.
© 2023 The Authors. Journal of Economic Surveys published by John Wiley & Sons Ltd.


KEYWORDS
credibility, economic journals, excess statistical significance,
experimental economics, statistical power

1 INTRODUCTION

Science at its core is empirical, testing its theories against reproducible experiments and observa-
tions; this is what distinguishes it from other types of human endeavor (Popper, 1959). Experiment,
observation, and, in general, empirical investigation is science’s raison d’être.1 Nevertheless, there
is mounting concern about the reliability, reproducibility, and replicability of reported research.
Evidence is accumulating that empirical economics is affected by low statistical power, excessive
heterogeneity, specification searching, and the selection of estimates confirming favored hypothe-
ses (Abadie, 2020; Brodeur et al., 2016; Camerer et al., 2016; Christensen & Miguel, 2018; Ioannidis
et al., 2017; List et al., 2001). Low statistical power coupled with methodological flexibility entices
some researchers to “torture the data (to) confess” (Coase, 1995, p. 27), resulting in excess statisti-
cal significance and an exaggerated (or biased) research record. The research record becomes less
credible when it is notably biased or exaggerated.
Are the empirical results reported in the most prominent economics journals credible? In this
article we assess the credibility of results reported in leading economics journals by investigat-
ing whether the typical statistically significant result provides misleading information. We use
a unique dataset of 167,753 economic estimates from 368 research areas, of which 22,281 esti-
mates were published in the 31 leading general interest and field economics journals (Heckman &
Moktan, 2020). Using these data, we calculate statistical power and excess statistical significance
(ESS) and assess the degree to which reported results have gone through a process of selection for
statistical significance.
Statistical power is an objective measure of the empirical contribution of a reported statistically
significant finding. Social scientists have widely recognized that adequate power is a prerequisite
of reliable evidence and scientific credibility (Cohen, 1969). “Studies with low statistical power
produce inherently ambiguous results because they often fail to replicate” (Psychonomic Society,
2012, p. 1). Low statistical power is one of the sources of the failures to replicate.2 Excess statistical
significance (ESS) is the difference between the observed proportion of findings reported to be
statistically significant and the proportion that is expected to be statistically significant assuming
no selection for statistical significance (Ioannidis & Trikalinos, 2007; Stanley et al., 2021). ESS is
a necessary condition for publication selection bias. In addition to power and ESS, we present
a new measure of the incidence of publication selection bias: the proportion of results reported
to be statistically significant that have gone through some process of preferentially selecting for
statistical significance. We validate these measures through simulations reported in the Online
Supplement.
We survey statistical power and ESS to assess whether economics journals report credible sci-
entific evidence and thereby contribute to the growing ‘meta-research’ literature investigating
economics research and the associated journals (Blanco-Perez & Brodeur, 2020; Brodeur et al.,
2016, 2020; Card & DellaVigna, 2020; Heckman & Moktan, 2020; Ioannidis et al., 2017). Specifi-
cally, Ioannidis’ et al. (2017) survey of economics research finds that power is typically quite low
(<18%) among 159 areas of economics research and the typical reported effect size is exaggerated by

a factor of two or more (Ioannidis et al., 2017). We double the size of their survey and disaggregate
to journal-specific dimensions of statistical and scientific quality. We find that median statisti-
cal power in leading economics journals is only 7%, much lower than the 80% standard. Excess
statistical significance is quite high at 19%. Further, we find that 26% of all reported results have
been selected for statistical significance and 56% of statistically significant results were selected
to be statistically significant. Across the top 5 journals, median power is still lower (5%), ESS is
higher (25%), 35% of reported estimates have undergone some process of selection for statistical
significance, and 66% of statistically significant results were selected to be statistically significant.
However, we also find that experimental economics has much higher power (78%), thereby exper-
imental results are more credible and likely to be less exaggerated with ESS at 10%. These findings
do not necessarily generalize beyond the 368 research areas included in our survey. Nevertheless,
they do raise serious doubts about the credibility of published economics research.
The next section discusses how we calculate statistical power and excess statistical significance.
Section 3 describes the data. Section 4 presents our findings, while Section 5 discusses these find-
ings and the limitations of our methods. Section 6 concludes. The Online Supplement provides
further information on the data and robustness checks.

2 POWER AND EXCESS STATISTICAL SIGNIFICANCE

Statistical power is the probability that researchers will find a statistically significant effect; thus,
high power makes it easier to reach statistical significance. Adequate statistical power is central to
credible empirical investigations and is widely acknowledged to be 80% or higher (Cohen, 1969).
That is, the probability of a type II error (1 - power) should be no larger than four times the proba-
bility of the conventional 5% type I error. By definition, underpowered studies are unlikely to find
what they seek.3 Low power makes it difficult for studies to reveal an underlying effect, where it
exists, leading to a type II error or to greater efforts to exaggerate effects to be statistically signifi-
cant. Absent adequate power, incentives to engage in specification searches, p-hacking, and other
questionable research practices that artificially inflate the significance and magnitude of a find-
ing may increase. With low power, some researchers may try harder to find statistical significance,
resulting in exaggerated effects whereby empirical effects are reported to be larger than they truly
are (Ioannidis et al., 2017). This exaggeration will be accompanied by erroneous statistical significance, thus providing ‘falsely positive’ evidence. Worse still, exaggerated effects and statistical
significance can give the false impression of meaningful policy effectiveness where there is none,
wasting resources and thereby casting suspicion upon science.

2.1 Calculating statistical power

We follow Ioannidis et al. (2017), Stanley et al. (2018), and Stanley et al. (2021) by calculating sta-
tistical power retrospectively. Other surveys of power calculate statistical power, hypothetically,
from a study’s reported sample size and by assuming an arbitrary, often optimistic, effect size
(e.g., Cohen, 1962; Fraley & Vazire, 2014; Sedlmeier & Gigerenzer, 1989). Hypothetical power cal-
culations are minimally informed by the relevant research records and will be approximately
representative only in the rare case that the chosen effect sizes just happen to be close to the
population mean effect. Retrospective power, in contrast, begins with a comprehensive record of
the relevant area of research, using it to estimate the mean effect.

We do not use each individual estimate of effect to calculate power. Such post hoc power analysis
is widely known to be circular, uninformative, and biased (Ioannidis et al., 2017; Stanley et al., 2018;
Yuan & Maxwell, 2005). Across the disciplines, it has been widely reported that small studies (i.e.,
those with low power) systematically report larger effects, thereby exaggerating power (Fanelli
et al., 2017; Pereira et al., 2012). In economics, the typical effect is exaggerated by a factor of at
least two (Bartoš et al., 2023b; Ioannidis et al., 2017), making post hoc (or observed) power highly
inflated. In contrast, using a weighted average of all estimates in an area of research predictably
reduces this exaggeration when present without adding any offsetting bias or exaggeration if there
is no selective reporting (Stanley & Doucouliagos, 2015, 2017).4
Retrospective statistical power for estimate i, published in journal j, and research area m is:

$$Power_{ijm} = 1 - N\left(1.96 - \frac{|\delta|}{SE_{ijm}}\right), \qquad (1)$$

where $N(\cdot)$ represents the cumulative standard normal probability, $\delta$ is the mean effect, $SE_{ijm}$ is the standard error of each estimate, and 1.96 denotes the critical value using the conventional 0.05 level of significance. $\delta$ is the mean of the distribution of reported effects (e.g., elasticities).
To estimate 𝛿 for each research area, we draw upon the full distribution of effects and their
standard errors for each area of research (the data are discussed in Section 3 below). Conventional
meta-analysis combines all comparable estimates to estimate the mean effect. Weighted averages
of these estimates are less biased than simple averages (Stanley & Doucouliagos, 2012; Schmidt &
Hunter, 2015). We begin with a conservative meta-analysis estimate of the population mean effect,
the unrestricted weighted least squares (UWLS) (Ioannidis et al., 2017; Stanley et al., 2018). To
estimate UWLS, for research area m, we regress the reported effect size on a constant:

$$e_{ijm} = \beta_0 + \varepsilon_{ijm}, \qquad (2)$$

where $e_{ijm}$ denotes an estimated effect (e.g., an elasticity) for the ith estimate in the jth journal. $\hat{\beta}_0$ is UWLS, estimated using weighted least squares with weights $w_i = 1/SE_{ijm}^{2}$; $\varepsilon_{ijm}$ is the error term.
Because we observe widely different variances among the reported economic estimates, some
adjustment for heteroskedasticity is necessary. In this application, UWLS has desirable statistical
properties that go beyond WLS’s usual application (Stanley & Doucouliagos, 2015, 2017; Stanley
et al., 2023). Using UWLS as an estimate of the mean effect, 𝛿, we then calculate power as:
$$Power_{ijm} = 1 - N\left(1.96 - \frac{|WLS_m|}{SE_{ijm}}\right), \qquad (3)$$

where WLSm designates the UWLS weighted average for research area m, that is, 𝛽̂0 from Equation
(2) (Ioannidis et al., 2017; Stanley et al., 2018). This formula assumes that researchers are using the
conventional 0.05 level of significance (or type I error) and a two-tail test, making 1.96 the critical
value, beyond which lies the rejection region.5 Power is calculated for each of the 167,753 estimates
across the 368 research areas and then averaged for each journal, separately.
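As a concrete illustration, the following Python sketch (ours, not the authors' code; the toy effects and standard errors are invented) computes the UWLS mean for a single research area as the inverse-variance weighted mean, which is what weighted least squares on a constant with weights $1/SE_{ijm}^2$ reduces to, and then applies Equation (3) to each estimate:

```python
import numpy as np
from scipy.stats import norm

# Toy data for one research area m: reported effects and their standard errors.
effects = np.array([0.05, 0.12, -0.02, 0.30, 0.08, 0.04])
ses = np.array([0.03, 0.10, 0.06, 0.15, 0.02, 0.05])

# UWLS: weighted least squares of the effects on a constant with weights 1/SE^2,
# which reduces to the inverse-variance weighted mean.
weights = 1.0 / ses**2
uwls = np.sum(weights * effects) / np.sum(weights)

# Retrospective power of each estimate (Equation 3):
# Power_ijm = 1 - N(1.96 - |WLS_m| / SE_ijm), for a two-tailed 5% test.
power = 1.0 - norm.cdf(1.96 - np.abs(uwls) / ses)

print(f"UWLS mean effect: {uwls:.3f}")
print(f"Median retrospective power: {np.median(power):.3f}")
```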
The UWLS estimate of the mean effect is ‘conservative’ in the sense that it will err on the side
of overestimating power (and underestimating excess statistical significance), thereby favorably
assessing the quality of empirical economics (Ioannidis et al., 2017). Like all other conven-
tional meta-analysis methods, UWLS will be biased, exaggerating the mean effect, when there

is some systematic selection for statistically significant findings (often called publication bias or
p-hacking). However, simulations show that UWLS reliably reduces the bias of random-effects,
which is the most frequently used meta-analysis estimator (Stanley & Doucouliagos, 2015, 2017).
Specifically, UWLS is less biased than random-effects meta-analyses when the evidence base is
affected by publication selection bias.6
Selection for statistical significance inflates those estimates selected. UWLS does not attempt
to correct or remove these exaggerated estimates. Hence, UWLS will produce upwardly biased
estimates of the mean effect, 𝛿, when there is publication selection bias. Consequently, UWLS
will, on average, lead to inflated estimates of statistical power, and this exaggeration is known to
be larger than those methods that explicitly attempt to correct publication selection bias (Bartoš
et al., 2023a; Ioannidis et al., 2017; Stanley & Doucouliagos, 2017). UWLS is calculated separately
for each of the 368 research areas, using all of the estimates published or reported in any jour-
nal, working paper or book, not only those found in the leading journals. In summary, we use
all reported estimates to calculate a weighted average that is conservative in the sense that it will
sometimes overestimate statistical power and the associated indicators of scientific and statistical
quality.7

2.2 Excess statistical significance

Among all reported research in our survey, 84,849 estimates (or 50.6%) are statistically significant
(absolute t-statistic > 1.96) and 49.4% are statistically nonsignificant. Of the 22,281 estimates pub-
lished in leading economics journals, 52.7% are statistically significant, while 10,539 (or 47.3%) are
statistically nonsignificant. Are all of the observed t-statistics reported exactly and fully as they
are first estimated or are they artificially inflated/exaggerated through selection, searching, or
manipulation? Ioannidis and Trikalinos (2007) and Stanley et al. (2021) develop a test for selec-
tion for statistical significance (including: publication selection bias, reporting bias, specification
searching, and p-hacking) by assessing excess statistical significance (ESS). ESS is the difference
between the observed proportion of findings reported to be statistically significant (Pss) and the
proportion that is expected to be statistically significant assuming that there were no selection for statistical significance, Esig (Askarov et al., 2023; Stanley et al., 2021). That is, ESS = Pss − Esig.
ESS is a useful indicator of the intensity of publication selection bias. The larger the proportion
that are selected to be statistically significant, the larger is ESS.8 Below we show how ESS entails
an estimate of the proportion of all reported results that have undergone some process of selection
for statistical significance, Psss, whether it is through specification searching, p-hacking, fraud,
or any number of questionable research practices. ESS may be interpreted as the proportion of
all evidence that is falsely reported as statistically significant. Here, we define ‘false positive’ as
unsound evidence that some effect is statistically significant, in other words provides misleading
evidence of an economic effect. ESS is an estimate of the rate of these ‘false positives.’9 ESS and its
calculation also entail a new statistical test of the presence of publication selection bias (PSST) that
has been shown to be better than the alternative tests of publication bias (Stanley et al., 2021)—see
Section 2.3 below.
Expected statistical significance (Esig) goes beyond basic power calculations to adjust fully for
observed heterogeneity. Heterogeneity is estimated as the observed variance among all reported
effects in each area of research that cannot be attributed to their reported random sampling
variances alone. The Online Supplement, Table S2 reports $\hat{\tau}_m^2$, which is a measure of the between-estimate or heterogeneity variance, for each of the 368 research areas.10 Researchers who conduct

individual research studies do not know the heterogeneity variance and therefore cannot make
allowance for it. Thus, meta-analyses are needed to estimate and accommodate heterogeneity
fully. Heterogeneity is typically quite large in empirical economics, nearly 20 times larger than
reported sampling error variances (Ioannidis et al., 2017; Stanley & Doucouliagos, 2019). 𝐸𝑠𝑖𝑔𝑖𝑗𝑚 is
adjusted for the observed heterogeneity in each of these 368 areas of research, m, and total variance
calculated for each reported effect, i, as the sum of $\hat{\tau}_m^2$ and $SE_{ijm}^2$. Where there is excess heterogeneity (i.e., $\hat{\tau}_m^2 > 0$), the distribution of estimated effect sizes is expected to be correspondingly wider, often much wider, than what would arise due to sampling error alone, and the expected proportion that are reported to be statistically significant (Esig) will be larger than statistical power.11
For each estimate within each of the 368 research areas, we calculate excess statistical significance as $ESS_{ijm} = Sig_{ijm} - Esig_{ijm}$, where $Sig_{ijm}$ (0/1) denotes whether a reported estimate is statistically significant or not, defined by the conventional 5% level. When an estimate's t-value is at least 1.96 (or, equivalently, if the estimate is larger than the critical value, $c = 1.96 \times SE_{ijm}$), it is reported to be statistically significant at the 5% level.
𝐸𝑠𝑖𝑔𝑖𝑗𝑚 is the expected probability that an effect would be statistically significant in the pre-
ferred direction given the estimated mean and heterogeneity variance from each of these 368
meta-analyses.12 Specifically, to calculate $Esig_{ijm}$, we use UWLS to estimate 368 population means, as before, random-effects to estimate 368 heterogeneity variances ($\hat{\tau}_m^2$), and each of the 167,753 reported sampling variances, $SE_{ijm}^2$.13 $Esig_{ijm}$ is then calculated as $1 - N(Z_{ijm})$, where $N(Z_{ijm})$ denotes the cumulative standard normal probability, and

$$Z_{ijm} = \frac{1.96 \times SE_{ijm} - |WLS_m|}{\sqrt{SE_{ijm}^2 + \hat{\tau}_m^2}}. \qquad (4)$$

Traditionally, power calculations only consider random sampling error variance, and we report
statistical power in this traditional sense, below, in Tables 1 and 2—recall Section 2.1, above.
However, when calculating truly excess statistical significance (i.e., statistical significance beyond
what would be expected from an unbiased selection of reported findings across the full distribu-
tion of effects), we need to make an adjustment for the observed heterogeneity variance because
researchers have more choices about where to look and how to produce estimated effects than
through resampling alone. In calculating expected statistical significance (𝐸𝑠𝑖𝑔𝑖𝑗𝑚 ), we are assum-
ing that there is no selection for statistical significance. Comparison with the observed frequency
of statistical significance will then show if there is notable excess statistical significance.
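A minimal sketch of these calculations, assuming the UWLS mean and the random-effects heterogeneity variance have already been estimated for the research area (the function name and the numerical inputs are ours, for illustration only):

```python
import numpy as np
from scipy.stats import norm

def esig_and_ess(effects, ses, uwls, tau2):
    """Expected (Esig) and excess (ESS) statistical significance for one
    research area, following Equation (4). `uwls` is the estimated mean
    effect and `tau2` the heterogeneity variance from random-effects."""
    effects = np.asarray(effects, dtype=float)
    ses = np.asarray(ses, dtype=float)
    # Expected probability that an estimate is significant in the preferred direction.
    z = (1.96 * ses - abs(uwls)) / np.sqrt(ses**2 + tau2)
    esig = 1.0 - norm.cdf(z)
    # Observed significance indicator at the 5% level (|t| >= 1.96).
    sig = (np.abs(effects) / ses >= 1.96).astype(float)
    # Per-estimate excess significance; its average over estimates is ESS.
    return esig, sig - esig

# Illustrative values only.
esig, ess = esig_and_ess([0.08, 0.02, 0.15], [0.03, 0.04, 0.05], uwls=0.05, tau2=0.002)
print(np.round(esig, 3), np.round(ess.mean(), 3))
```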
Stanley et al. (2021) show that ‘true’ ESS > 0 is a necessary condition of the presence of publi-
cation selection bias; that is, if there is any selection for statistical significance, then E(ESS) > 0.
ESS is the proportion of results reported to be falsely ‘positive’, where ‘positive’ is short for being
reported to be statistically significant. Thus, ESS > 0 is an indicator of publication selection bias,
albeit a ‘conservative’ one because of the way that it is calculated. It is important to realize that our
estimates of excess statistical significance, as calculated above, will be ‘conservative’ in the sense
that ESS will more often be estimated to be smaller than the ‘true’ ESS. Just as power is system-
atically overestimated because UWLS is upwardly biased when there is publication selection bias,
Esig will also tend to overestimate the expected number that should be statistically significant in
the absence of publication selection bias. If Esig is overestimated, ESS is underestimated because
Pss is directly observed.
To validate these claims, we conduct simulation studies calibrated on the observable char-
acteristics seen in this extensive database of economics research. See section A of the Online
Supplement for further details about the simulation design and code. When there is no selection

for statistical significance, average ESS = −0.0025, using the above methods and after calibrat-
ing our simulation design closely upon the relevant research dimensions seen widely across
economics research (Ioannidis et al., 2017). A separate simulation experiment of 100,000 meta-
analyses shows that there is virtually no chance that the high levels of ESS that we see in leading
journals could have come from research that was not selected to be statistically significant, and
this is verified by statistical testing. In this simulation of 100,000 meta-analyses, each meta-
analysis of 700 estimates forces a different proportion of reported results to go through a process
of selection for statistical significance, and this proportion selected for statistical significance is
determined by a random draw from a uniform (0, .7) distribution.14 Also, each of these areas of
research has a different, random mean effect, and random heterogeneity with a different random
variance. Across these 100,000 simulated areas of research, we find that ESS reliably underes-
timates the ‘true’ ESS by 22%–see Figure S1 in the Online Supplement. The meta-regression of observed ESS, $ESS_m$, as a predictor of ‘true’ ESS, $ESS_m^t$ (i.e., ESS calculated from known population parameters), across these 100,000 simulated meta-analyses is: $ESS_m^t = 1.221 \cdot ESS_m$; $R^2 = 96.0\%$ (t = 3,128.0; p << .0001). Thus, we can be confident that ESS calculated over hundreds or thousands of estimates will be an underestimate, by 22%.15 When ESS is multiplied by 1.221, it has a .98 criterion validity in the assessment of the ‘true’ ESS.
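For instance (our arithmetic, using a round observed value close to those reported below), an observed ESS of 19% would imply a corrected, ‘true’ ESS of roughly:

$$ESS^{t} \approx 1.221 \times ESS_m = 1.221 \times 0.19 \approx 0.23.$$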
The connection of ESS to selection for statistical significance serves as a basis for the estimation
of the incidence of this selection. Estimates of 𝐸𝑠𝑖𝑔 and 𝐸𝑆𝑆 entail an estimate of the proportion of
the observed research record that has undergone some process of being selected to be statistically
significant, P(SSS). Obviously, not all reported research has been selected for statistical signifi-
cance because only 50.6% are reported to be so. Even among those that are statistically significant,
many will not be the fruit of any pressure or effort to report statistically significant findings.
What proportion of those findings reported to be statistically significant have been selected to
be so? Does this vary by journal, and are higher-ranked journals also more selective with respect to statistical significance, just as they are about perceived quality? Simple probability calculus provides
the answer. The proportion of the research record that has been selected to be statistically signif-
icant, P(SSS), equals ESS/(1 − Esig).16 Thus, with estimates of ESS and Esig, we estimate P(SSS) as Psss. Psss is an estimate of the proportion of all results that were reported to be statistically signifi-
cant by going through some process of preferentially selecting for statistical significance whether
it is through specification searching, the suppression of nonsignificant results, any number of
questionable research practices, p-hacking, or fraud. As discussed above, both ESS and Esig are
imperfectly estimated, and because ESS is systematically underestimated, the ‘true’ proportion of
reported results that have been selected to be statistically significant will also be systematically
underestimated unless it is further adjusted.
We corroborate this underestimation of P(SSS) through the same simulation experiment dis-
cussed above and in section A of the Online Supplement. Across 100,000 simulated areas of
economics research, each with a random and different mean, heterogeneity variance, and a ran-
dom uniform [0, .7] true proportion of selection for statistical significance that is fixed and known
in each simulated meta-analysis, we find: $P(SSS) = 1.118 \cdot \frac{ESS_m}{1 - Esig_m}$; $R^2 = 97.7\%$ (t = 4,297.7; p << .0001), where $P(SSS)$ is the actual probability (fixed for that specific simulation) that an observed estimate was selected for statistical significance. Thus, our estimate of $P(SSS)$, $Psss = \frac{ESS}{1 - Esig}$, may be reliably corrected for this systematic underestimation by multiplying it by 1.12, giving Psss*. That is, $Psss^* = 1.12 \cdot Psss$. These simulations show that Psss*, adjusted in this way, has a .988 criterion validity in the measurement of the ‘true’ proportion that were selected to be statistically significant, $P(SSS)$—see Figure S2 in the Online Supplement. Nevertheless, in our

survey we opt to report the more conservative estimates of publication selection bias, Psss, in
Tables 1 and 2 below. In this way, we consistently err on the side of a favorable evaluation of
economics research. That is, our estimates of power are known to be biased upward while our
reported estimates of ESS and the proportion selected to be statistically significant (PSSS ) are
known to be underestimates. For the sake of robustness, we report both PSSS and Psss* in the
Discussion section, below.
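A minimal sketch of this calculation (ours; the numerical inputs are illustrative and only approximate the figures reported in Table 2 for the 31 leading journals):

```python
def selection_incidence(pss, esig, adjust=True):
    """Proportion of reported results selected for statistical significance:
    Psss = ESS / (1 - Esig), optionally scaled by the simulation-based 1.12
    correction (Psss*) described in the text."""
    ess = pss - esig                 # excess statistical significance
    psss = ess / (1.0 - esig)        # conservative incidence of selection
    return 1.12 * psss if adjust else psss

# Roughly the values reported for the 31 leading journals (illustrative).
print(round(selection_incidence(0.456, 0.268, adjust=False), 3))  # ~0.26 (Psss)
print(round(selection_incidence(0.456, 0.268, adjust=True), 3))   # ~0.29 (Psss*)
```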

2.3 The proportion of statistical significance test—PSST

Recently, a test for the presence of selection for statistical significance (i.e., publication selection
bias, specification searching, p-hacking, etc.) has been shown to be more powerful than the alter-
natives (Stanley et al., 2021). The proportion of statistical significance test (PSST) is a simple test
of a proportion, 𝑃𝑠𝑠. 𝑃𝑠𝑠 is the observed proportion that are statistically significant and an esti-
mate of the population proportion, πss . 𝐸𝑠𝑖𝑔 serves as the theoretical proportion that should be
statistically significant were there no selection for statistical significance. Thus, PSST tests: H0 :
πss ≤ 𝐸𝑠𝑖𝑔 using the calculated test statistic:

$$Z_{PSST} = \frac{Pss - Esig}{\sqrt{Esig\,(1 - Esig)/k}}, \qquad (5)$$

where $Pss$ is the observed proportion that are statistically significant in k reported results and $Esig = \sum Esig_i / k$. $Z_{PSST}$ has, approximately, a standard normal distribution under the null
hypothesis. Note that the numerator equals ESS. With as few as 80 estimates, PSST has adequate
power to detect the incidence of selection that we see below at the top 5 (Stanley et al., 2021).
For areas of research or journals with hundreds of estimates, often more, PSST is a very power-
ful test, as we see below. Neither PSST nor ESS depend on any assumption about a correlation of
standard errors (SE) with reported effects. Thus, they are not subject to dismissal as ‘small-study effects,’ a critique common among medical researchers when evaluating regression-based tests
for publication bias. Nor does the possibility that some researchers might ‘game’ SE to achieve a
statistically significant result invalidate a finding of significant PSST or large ESS (Irsova et al.,
2023).17
Simulations show that PSST does not have inflated type I errors and is more powerful than
alternative tests for publication bias (Stanley et al., 2021). Because 𝐸𝑠𝑖𝑔𝑖𝑗𝑚 is a proportion, it can
be meaningfully compared and aggregated across different areas of research or journals (Askarov
et al., 2023; Ioannidis, 2011). Below we employ PSST to test whether there is significant selection
for statistical significance for specific economic journals and grouping of journals, and we use
ESS and Psss to reflect and measure the severity of this selection.
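A sketch of the test statistic in Equation (5), assuming the per-estimate Esig values have already been computed (the function and inputs are ours, for illustration only):

```python
import numpy as np
from scipy.stats import norm

def psst(sig, esig):
    """Proportion of statistical significance test (Equation 5).
    `sig` holds 0/1 significance flags (|t| >= 1.96); `esig` holds the
    per-estimate expected probabilities of significance, Esig_ijm."""
    sig = np.asarray(sig, dtype=float)
    esig = np.asarray(esig, dtype=float)
    k = sig.size
    pss = sig.mean()                 # observed proportion significant
    esig_bar = esig.mean()           # expected proportion significant
    z = (pss - esig_bar) / np.sqrt(esig_bar * (1.0 - esig_bar) / k)
    p_value = 1.0 - norm.cdf(z)      # one-sided test of H0: pi_ss <= Esig
    return z, p_value

# Illustrative inputs: 200 estimates, 35% expected but ~50% observed significance.
rng = np.random.default_rng(0)
z, p = psst(rng.binomial(1, 0.50, size=200), np.full(200, 0.35))
print(round(z, 2), round(p, 4))
```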

3 DATA

Our data come from 368 meta-analyses that collect all reported and comparable effect sizes (e.g.,
elasticities and correlations) for a specific research area. Experts in each of these 368 areas of
research decided which estimates were representative of the economic phenomenon that they

were studying, not us. Thus, the mean of each of these distributions is representative of the
empirical estimates of that economic phenomenon or effect.
We identified meta-studies using search engines (Econlit, Scopus, and Google Scholar), pub-
lisher sites (e.g., Science Direct, Sage, and Wiley), and webpages of researchers known to publish
meta-analyses. We also searched all volumes of individual journals that are known to publish
meta-analyses, for example, Journal of Economic Surveys, World Development, Public Choice, Euro-
pean Journal of Political Economy, Oxford Economic Papers, and Ecological Economics. Our search
for meta-analyses was not limited to economics; we include several meta-analyses published
in industrial relations, business research, political science, international relations, and psychol-
ogy but which relate to economics issues and contain estimates published in at least one of
the 31 leading economics journals. We used the following search terms: ‘meta-analysis’, ‘meta-
regression’, ‘research synthesis’, ‘systematic review’, ‘quantitative review’, ‘economics’, ‘economics
research’, ‘applied economics’, and ‘econometrics’. We also used field search terms such as ‘microe-
conomics’, ‘macroeconomics’, ‘experimental economics’, ‘industrial relations’, ‘labor economics’,
and ‘international economics’. The search for data ended July 31st, 2021.
Some studies report the meta-analysis data as part of the study or as an online appendix. Where
meta-analysis data were unavailable, we contacted authors via email. We had a 74% response
rate from the 109 contacted authors. To be included in our survey, a meta-analysis had to pro-
vide estimates of the effect size and its standard error; otherwise, statistical power could not be
calculated. We thus excluded several studies where standard errors were unavailable. We also
excluded meta-studies that did not report effect sizes but rather whether the reported result was
statistically significant or not; this too makes it impossible to calculate statistical power. Where
a research area has received more than one meta-analysis or systematic review, we include the
most recent and comprehensive study.18 Finally, we include only meta-studies that contained
at least five primary studies to ensure the reliability of estimates of the mean and the hetero-
geneity variance. We include both published and working-paper meta-analyses. The 368 research
areas span most Journal of Economic Literature codes. Fifty-nine research areas relate to experi-
mental economics research (either field or laboratory based) and the remainder primarily involve
observational research. All meta-analyses are referenced in the Online Supplement where we also
provide further details on the search process, and the email survey.
We cannot guarantee that our sample is representative of all the empirical evidence reported by
these journals. However, it is broadly representative of the evidence that has been meta-analyzed.
One concern is whether there are systematic differences between areas meta-analyzed and those
that are not. For example, perhaps research areas with widely different results are more likely to
be meta-analyzed. A related concern is whether the mean effect is smaller in those areas that have
been meta-analyzed and whether publication selection bias is a function of the size of the mean
effect. Nevertheless, relative evaluations are not affected because journals are compared on the
same areas of research and means.
These 368 meta-analyses include 167,753 estimates of which 22,281 were published in 31 leading
economics general interest or field journals. Although we focus on estimates published in these
journals, our statistical calculations are made for all 167,753.19 We follow Heckman and Moktan’s
(2020) classification of journals, including the top five, the non-top five general interest journals,
and 21 ‘tier A’ field journals to focus on the research that economists regard as the best. Table 1
below first lists the journals and then the number of estimates from each journal in Column (1).
Details on the number of research areas and years covered are provided in the Online Supplement,
Table S1.

FIGURE 1 Marriage-wage premium.
[Colour figure can be viewed at wileyonlinelibrary.com]
Notes: UWLS = 0.044; SE = 0.036; tau = 0.05.

To illustrate the calculation of ESS and its components, consider one of these 368 meta-analyses,
the marriage wage premium. De Linde Leonard and Stanley (2015) identified 661 estimates of the
higher wage that married men receive relative to their single peers. For this area of research,
UWLS = 0.044, $\hat{\tau}_m^2 = 0.0025$, and the median standard error is 0.036. A marriage-wage premium
with median power (i.e., SEi = 0.036) will be reported to be statistically significant when it is larger
than the critical value, 𝑐 = 1.96 ∗ 𝑆𝐸𝑖 = 0.07056, which means that the estimated marriage-wage
premium needs to be about 7% to obtain a significant finding. 𝑐 = 0.07056 defines the rejection
region for Ho : μ = 0 in terms of marriage-wage premiums—see Figure 1 and Equation (4). How-
ever, based on our meta-analysis of 661 marriage-wage premiums, we have reason to believe that
the mean marriage-wage premium is not zero. When the population mean marriage-wage pre-
mium is set equal to UWLS, the associated normal curve is centered at 0.044 rather than at 0, and the standard deviation of this normal curve is $\sqrt{SE_i^2 + \hat{\tau}_m^2} = 0.0616$ – see Figure 1. From Equation
(4), we can now calculate the relevant z-value for the distribution of estimated marriage-wage pre-
miums as 0.43. Taking 1 − N(.43) gives 𝐸𝑠𝑖𝑔𝑖𝑗𝑚 = 0.3336, which is the expected probability that the
next randomly produced estimated marriage-wage premium with the median standard error will
be statistically significant. When the corresponding estimate is reported to be statistically signifi-
cant, 𝐸𝑆𝑆𝑖 = 0.6664. Overall, 59.5% of the marriage-wage premiums are reported to be statistically
positive. When averaged across these 661 estimates, ESS = 0.26 or 26%.
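The marriage-wage premium figures quoted above can be reproduced in a few lines (a sketch of ours using only the summary statistics reported in the text):

```python
import numpy as np
from scipy.stats import norm

# Summary statistics for the marriage-wage premium meta-analysis (from the text).
uwls, tau2, se_median = 0.044, 0.0025, 0.036

c = 1.96 * se_median                                 # critical premium, ~0.0706
z = (c - abs(uwls)) / np.sqrt(se_median**2 + tau2)   # Equation (4), ~0.43
esig = 1.0 - norm.cdf(z)                             # expected P(significant), ~0.33

# A significant estimate with this SE contributes ESS_i = 1 - Esig, ~0.67.
print(round(c, 4), round(z, 2), round(esig, 3))
```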
In order to understand and compare excess statistical significance at the leading journals, we
calculate 𝐸𝑆𝑆𝑖𝑗𝑚 for each estimate in each of the 368 meta-analyses. After calculating 𝐸𝑆𝑆𝑖𝑗𝑚
for each of these 167,753 estimates, they are averaged across each leading economics journal,
and the remaining 𝐸𝑆𝑆𝑖𝑗𝑚 from non-leading economic journals are averaged as ‘other reported
research’—see Table 1 below.

FIGURE 2 Productivity of public capital elasticities, all estimates.


Notes: Black circles denote estimates published in leading economics journals. Gray circles denote estimates reported in all other
outlets. n = 533.

To illustrate the calculation of retrospective statistical power, we use another one of the 368
meta-analyses, the 578 elasticity estimates of the productivity of public infrastructure capital (Bom
& Ligthart, 2014). Figure 2 illustrates the distribution of the elasticities comparing those reported
in leading journals (in black) versus all other outlets (in gray). The simple average elasticity of all
estimates is 0.188, while the UWLS weighted average elasticity is 0.046.20 We identified 45 esti-
mates as outliers or leverage points.21 Removing these observations gives a UWLS mean elasticity
of 0.070. This estimate of the mean elasticity is then substituted into Equations (3) and (4) to
obtain power and ESS calculated separately for each of the 533 estimates. Only 89 estimates have
at least 80% power.
Of the 533 estimates, 54 were published in four of the leading economics journals. Figure 3
focuses on the estimates in the leading journals, showing six estimates with adequate power (black
circles) versus those that lack adequate power (hollow circles). Average and median power are
0.378 and 0.295, respectively, for these 54 estimates. Average (and median) power by journal is:
0.299 (0.231) for Journal of Monetary Economics; 0.407 (0.414) for Journal of Public Economics;
0.399 (0.271) for Review of Economics and Statistics; and 0.410 (0.481) for Public Choice.
We repeat this process separately for each of the 368 research areas and then calculate average
and median power across all 368 research areas for each of the 31 journals. Different meta-analyses
use different formulas and metrics for measuring empirical effects. However, the effect size met-
ric is always exactly the same within each of these 368 meta-analyses.22 Because we do not, in any
way, average or aggregate effects across meta-analyses, different effect size measures pose no prob-
lem for our survey. Instead, we average power, t-values, the percent of estimates that are reported
to be statistically significant, and ESS. All of these statistics are measured in the same units of
measurement across these 368 meta-analyses and by the same formulas. Therefore, these sum-
mary statistics have the same meaning across different research topics and journals and can be
meaningfully compared and aggregated.

[Figure 3: scatter of statistical power (y-axis) against elasticity (x-axis).]

FIGURE 3 Power of productivity of public capital estimates, leading economics journals.


Notes: Only estimates published in leading economics journals included. Black circles denote estimates with power ≥80%. The
dashed vertical line denotes the UWLS estimate of the mean elasticity (𝛿̂ = 0.07). n = 54.

4 FINDINGS

Table 1, Columns (2) and (3) display average and median power, estimated using UWLS. Median
power is more representative of typical economic research results because average power is often
influenced by the outsized effect of only a relatively few estimates. Median power is highest among
estimates published in the AEJ: Macroeconomics, the AEJ: Economic Policy, and the Journal of
Health Economics. For all other journals, typical power is much lower than the long-established
and widely-accepted 80% threshold for adequate power (Cohen, 1969) and also substantially lower
than the standard of 50% median retrospective power needed for credible meta-analyses (Stanley
et al., 2022).
Column (4) reports the average absolute value of the t-statistic. Despite low power, the aver-
ages of the reported t-statistics are quite statistically significant. This seeming inconsistency is
especially notable at the top five and non-top five general interest journals and is consistent
with a sizable exaggeration of the ‘significance’ of the typical findings reported in these journals.
Nevertheless, these journals also report many statistically nonsignificant results.
Column (5) presents the proportion of results that are reported to be statistically significant
(Pss),23 and Column (6) displays the proportion that are excessively statistically significant (ESS).
Pss varies rather tightly around the median (50.8%; interquartile range—IQR = [43.2%; 56.5%]),
but it is larger, often much larger, than what we would expect to find were there no selection for
statistical significance. Median excess statistical significance (ESS) across all 31 leading journals
is 19.7% and IQR = [13.7%; 26.6%], which is quite high and provides extremely strong evidence of
significant selection for statistical significance for nearly all of these journals. Recall that PSST is
a test for the presence of selection for statistical significance, and it is distributed as a standard
normal under the null hypothesis of no selection. Its z-value, ZPSST , is displayed in Column (7),
and is statistically significant at the .05 level for all journals with more than 80 estimates, with one
TABLE 1 Power and excess statistical significance of leading economics journals.

| Journal | Number of estimates (1) | Average power (2) | Median power (3) | Average \|t-value\| (4) | Pss (5) | ESS (6) | ZPSST (7) | Psss (8) | Psss/Pss (9) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Top five journals | | | | | | | | | |
| American Economic Review | 2404 | 0.270 | 0.054 | 4.060 | 0.504 | 0.230 | 25.29 | 0.317 | 0.629 |
| Journal of Political Economy | 874 | 0.244 | 0.051 | 3.533 | 0.537 | 0.285 | 19.44 | 0.381 | 0.709 |
| Quarterly Journal of Economics | 589 | 0.246 | 0.049 | 3.598 | 0.567 | 0.309 | 17.16 | 0.416 | 0.734 |
| Review of Economic Studies | 209 | 0.264 | 0.056 | 3.307 | 0.455 | 0.181 | 5.87 | 0.249 | 0.548 |
| Econometrica | 169 | 0.295 | 0.056 | 5.965 | 0.568 | 0.296 | 8.69 | 0.406 | 0.714 |
| Non-top five general interest | | | | | | | | | |
| Review of Economics and Statistics | 2642 | 0.348 | 0.101 | 4.780 | 0.481 | 0.142 | 15.36 | 0.214 | 0.445 |
| European Economic Review | 1539 | 0.353 | 0.156 | 3.617 | 0.539 | 0.197 | 16.31 | 0.300 | 0.556 |
| Economic Journal | 988 | 0.266 | 0.093 | 3.223 | 0.427 | 0.151 | 10.62 | 0.209 | 0.489 |
| Journal of the European Economic Association | 136 | 0.495 | 0.290 | 11.401 | 0.676 | 0.258 | 6.09 | 0.443 | 0.656 |
| International Economic Review | 71 | 0.475 | 0.310 | 4.784 | 0.775 | 0.356 | 6.08 | 0.612 | 0.790 |
| Tier A field journals | | | | | | | | | |
| Journal of Development Economics | 2829 | 0.141 | 0.045 | 1.963 | 0.350 | 0.156 | 20.90 | 0.193 | 0.552 |
| Journal of Public Economics | 1356 | 0.188 | 0.070 | 2.720 | 0.406 | 0.176 | 15.37 | 0.228 | 0.562 |
| Journal of Finance | 1193 | 0.117 | 0.027 | 2.260 | 0.258 | 0.123 | 12.43 | 0.142 | 0.551 |
| Journal of Financial Economics | 1001 | 0.177 | 0.031 | 2.514 | 0.309 | 0.121 | 9.80 | 0.149 | 0.482 |
| Journal of Monetary Economics | 944 | 0.131 | 0.031 | 2.705 | 0.326 | 0.157 | 12.82 | 0.189 | 0.578 |
| Journal of Money, Credit, and Banking | 834 | 0.233 | 0.096 | 3.063 | 0.508 | 0.265 | 17.80 | 0.350 | 0.688 |
| Public Choice | 830 | 0.195 | 0.074 | 2.070 | 0.360 | 0.100 | 6.54 | 0.135 | 0.374 |
| Journal of Human Resources | 607 | 0.396 | 0.245 | 2.991 | 0.473 | 0.080 | 4.06 | 0.132 | 0.280 |
| Journal of Labor Economics | 570 | 0.260 | 0.212 | 3.214 | 0.554 | 0.251 | 13.01 | 0.360 | 0.649 |
| Health Economics | 534 | 0.316 | 0.214 | 3.236 | 0.601 | 0.264 | 12.91 | 0.398 | 0.663 |
| Journal of Economic Growth | 470 | 0.163 | 0.055 | 2.322 | 0.455 | 0.258 | 14.02 | 0.321 | 0.705 |
| Journal of Business and Economic Statistics | 300 | 0.189 | 0.027 | 2.876 | 0.560 | 0.401 | 18.99 | 0.477 | 0.851 |
| American Economic Journal: Macroeconomics | 281 | 0.562 | 0.997 | 6.952 | 0.790 | 0.274 | 9.20 | 0.566 | 0.717 |
| Journal of Health Economics | 193 | 0.708 | 0.913 | 9.046 | 0.705 | 0.049 | 1.43 | 0.142 | 0.202 |
| American Economic Journal: Economic Policy | 168 | 0.754 | 1.000 | 12.098 | 0.786 | 0.179 | 4.74 | 0.455 | 0.579 |
| Journal of Econometrics | 167 | 0.232 | 0.094 | 3.406 | 0.563 | 0.317 | 9.51 | 0.420 | 0.747 |
| Journal of Industrial Economics | 145 | 0.294 | 0.179 | 2.528 | 0.538 | 0.241 | 6.34 | 0.342 | 0.636 |
| American Economic Journal: Applied Economics | 105 | 0.182 | 0.084 | 2.611 | 0.486 | 0.266 | 6.60 | 0.341 | 0.703 |
| Rand Journal of Economics | 95 | 0.479 | 0.298 | 3.439 | 0.537 | 0.132 | 2.63 | 0.222 | 0.414 |
| Games and Economic Behavior | 32 | 0.359 | 0.088 | 4.136 | 0.438 | 0.131 | 1.61 | 0.189 | 0.431 |
| Journal of Economic Theory | 6 | 0.082 | 0.025 | 0.987 | 0.167 | 0.078 | 0.67 | 0.086 | 0.515 |
| Other reported research | 145,472 | 0.277 | 0.093 | 3.319 | 0.422 | 0.120 | 99.39 | 0.172 | 0.406 |

Note: Pss = proportion reported as statistically significant, ESS = proportion that are excess statistically significant, and Psss estimates the proportion of all results selected for their statistical significance. PSST provides evidence of selection for statistical significance when larger than 1.645 (α = .05). The last row reports these statistics for all other reported research results.


exception. The Journal of Health Economics is the exception that proves the rule with ZPSST = 1.43
(p > .05). At the top five, typical ZPSST is over 15, varying from 5.87 (Review of Economic Studies)
to 25.29 (AER). These values of ESS and their associated ZPSST provide clear evidence of selection
for statistical significance, corroborating the seeming inconsistency of the large, reported t-values
and low power. What does this tell us about the incidence of selective reporting for statistical
significance?
The magnitude of ESS relative to the complement of Esig provides an estimate of the proportion of all findings that went through some process of selection for statistical significance, $Psss = ESS/(1 - Esig)$, Column (8), Table 1.24 Column (9) expresses Psss as a proportion of those results
reported to be statistically significant (Psss/Pss). Psss/Pss estimates the probability that a result,
which is reported to be statistically significant, was selected to be so.25 Across these journals,
the median incidence of selection, Psss, is 31.7% (IQR = [19.1%; 40.2%]). That is, almost one-third
of all results published in leading economics journals have undergone a process of selection for
statistical significance.
Table 2 presents averages for these 31 leading economics journals versus ‘other’: non-leading
journals, working papers, books, and reports. Table 2 also compares observational research
to experimental research and to mixed research (areas with observational and experimental
research). Average power is 25% across leading economics journals but typical (median) power
is only 7%. This is unacceptably low by any standard and lower than for the research reported in ‘other’, where median power is 9% and average power is 28%. Overall, the statistical power of economic
research reported outside leading journals is two percentage points larger and yet the proportion
that are statistically significant is three percentage points smaller. This discrepancy is reflected by
the lower rate of falsely positive evidence (ESS) for ‘other’ research.
Outside of these 31 leading journals, falsely positive evidence represents only 12% of those
reported, compared with 19% for leading journals and 25% at the top five—see Column (5) Table 2.
Across all subgroups of journals and types of evidence, there is strong evidence of selection for
statistical significance—see ZPSST , Column (6) Table 2. These proportions are especially notable
when one compares them to the proportion of evidence reported to be ‘positive’ (i.e., statistically
significant). At the top five, falsely positive evidence represents nearly half of the positive evidence
(49%)—see Column (7) Table 2. Thus, evidence of an effect is almost as likely to be an error as not
when it is published by a top five journal. In contrast, 28% of evidence reported outside leading
journals is falsely positive evidence and, hence, there is a clear majority (72%) of positive evidence
that is genuine.
Smaller excess statistical significance reflects a lower incidence of selective reporting. We esti-
mate that the proportion of findings that were selected to be statistically significant is 35% at the
top five, 26% at the 31 leading journals, and 17% if reported elsewhere—Column (8) Table 2. These
are lower bound estimates.26 The higher credibility of nonleading journal evidence is especially
evident when we compare these incidences of selective reporting to the proportion of evidence
reported as positive (Psss/Pss)—see Column (9) Table 2. Two-thirds (66.3%) of positive evidence
reported at the top five had been selected to be positive; this drops to 56% across all 31 leading
journals, and it falls further to 41% for ‘other.’
The ‘good news’ is that experimental evidence is more reliable and less likely to be falsely
positive. Experimental studies published in leading journals have much less excess statistical sig-
nificance than observational research (ESS is 9.7% vs. 19.1%) and a lower incidence of selective
reporting (Psss is 21% vs. 25.8%). Also, the typical experimental study published at leading jour-
nals is nearly adequately powered (78%) and much more powerful than experimental findings

TABLE 2 Median power, statistical significance, and excess statistical significance.

| Subgroup | Number of estimates (1) | Average power (2) | Median power (3) | Pss (4) | ESS (5) | ZPSST (6) | ESS/Pss (7) | Psss (8) | Psss/Pss (9) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Top five | 4,245 | 0.262 | 0.051 | 0.520 | 0.253 | 37.18 | 0.486 | 0.345 | 0.663 |
| 31 leading journals | 22,281 | 0.254 | 0.073 | 0.456 | 0.188 | 63.23 | 0.412 | 0.256 | 0.563 |
| Other research | 145,472 | 0.277 | 0.093 | 0.422 | 0.120 | 99.39 | 0.283 | 0.172 | 0.406 |
| 31 leading observational | 20,238 | 0.247 | 0.071 | 0.452 | 0.191 | 61.76 | 0.422 | 0.258 | 0.571 |
| Other observational | 135,301 | 0.272 | 0.092 | 0.417 | 0.118 | 94.93 | 0.283 | 0.169 | 0.404 |
| 31 leading experimental | 699 | 0.621 | 0.777 | 0.634 | 0.097 | 5.15 | 0.153 | 0.210 | 0.331 |
| Other experimental | 4,367 | 0.498 | 0.421 | 0.545 | 0.072 | 21.59 | 0.133 | 0.118 | 0.216 |
| 31 leading mixed | 1,344 | 0.163 | 0.062 | 0.420 | 0.189 | 16.40 | 0.449 | 0.245 | 0.584 |
| Other mixed | 5,804 | 0.225 | 0.061 | 0.441 | 0.191 | 33.48 | 0.432 | 0.254 | 0.576 |

Note: Pss = proportion reported as statistically significant, ESS = proportion that are excess statistically significant, and Psss estimates the proportion of all results selected for their statistical significance. PSST provides evidence of selection for statistical significance when larger than 1.645 (α = .05).


reported at the non-leading journals (42%). Nonetheless, other experimental evidence is still less
falsely positive: ESS = 7.2% vs. 9.7%.
Our estimate of power in economics is lower than that of Ioannidis et al. (2017). This is due to three differences between our data and theirs. First, in several cases we include
estimates from updated meta-analyses (see the Online Supplement for details). Second, our sur-
vey focuses on the top 31 journals in economics. Hence, we exclude several research areas that
do not include any estimates published in these journals. Third, we include many more research
areas and observations. Specifically, our survey includes 167,753 estimates from 368 research areas,
whereas Ioannidis et al. (2017) included 64,076 estimates from 159 research areas.27

5 DISCUSSION

It is widely accepted that statistical power needs to be 80% to produce trustworthy evidence
(Cohen, 1969). This level of power is often required in the experimental sciences and by funding
agencies. In general, research published in leading economics journals fails to reach this level, at
least for the 368 research areas included in our survey. Only a small percent of the overall evidence
is adequately powered, or nearly so. Power is notably higher in experimental research and in three
of 31 leading economics journals. When median retrospective power is less than 50%, “the research
base may not be sufficiently informative for reliable summary” (Stanley et al., 2022, p. 101), which
describes the top five and all but three of the 31 leading economics journals for the research areas
included in our survey. With typical power this low (7%), the probability that the representative
study published in leading economics journals can find what it is seeking is only two percentage
points larger than the probability of falsely identifying an effect that does not exist (recall that α,
the probability of a type I error, is set at 5%). And yet, 46% of reported results claim to have found an
effect by being statistically significant. Clearly, there is strong selection for statistical significance somewhere
along the process that leads to publication in leading journals. Thus, when published in leading
journals, much of the ‘positive,’ statistically significant, evidence is not credible.
We corroborate strong selection for statistical significance through multiple measures and tests.
First, the proportion of statistical significance test (PSST) provides widespread and very strong evidence
of selection for statistical significance across all groupings of journals and types of
research—see Table 2. Simulations do not find inflated type I errors for PSST (Stanley et al., 2021),
but they do find that PSST can have high power with as few as 80 estimates. With the thousands
of estimates that we have here, selection by only a small minority of researchers will be detected
by PSST.28 Hence, we also seek to measure the magnitude and reach of selective reporting, not
merely its statistical significance.
Second, we estimate the proportion of empirical evidence that is falsely reported to be positive
and statistically significant, in other words, excessively statistically significant (ESS) or 'falsely
positive' evidence. ESS is the proportion of all results reported to be statistically significant beyond
what could be expected absent any selection for statistical significance, given what we can observe
about the total distribution of economic effects and their standard errors in 368 areas of research
containing 167,753 estimated effects. At the top five journals, 25% of reported empirical research
is falsely positive, but this decreases to 19% across all 31 leading journals; recall the ESS estimates
in Column 5 of Table 2. Nevertheless, 19% falsely positive is a large proportion of the reported
evidence base. The z-values associated with these proportions are 37.18 and 63.23, respectively,
again strongly confirming the existence of some publication selection bias. When we compare
ESS relative to the proportion reported as ‘positive’ (i.e., statistically significant), Pss, the propor-
tion of falsely positive evidence or, in other words, ‘misleading’ evidence is nearly half (49%) at
the top five and 41% across all 31 leading journals. It is important to recall that we know this to
be an underestimate of this proportion because our estimate of the mean effect, UWLS, is known
to be exaggerated when there is some selection for statistical significance. Simulations reported
in section A of the Online Supplement demonstrate that our estimate of ESS underestimates its
true value by 22%, on average, and that multiplying ESS by 1.221 makes it less biased and highly
correlated (r = .98) with the true proportion that are excessively reported to be statistically signif-
icant. After adjusting for ESS’s likely underestimate, 59.3% of the ‘positive’ statistically significant
evidence reported at the top five is misleading, a clear majority, and a slim majority (50.3%) of
the positive evidence reported across 31 leading journals is misleading. When empirical scientific
evidence is as likely to be misleading as not, it goes without saying that it is not scientifically
credible.
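
To make this accounting concrete, the following minimal sketch illustrates, for a single area of research, how the quantities behind Table 2 fit together under the normal approximation used above: the UWLS proxy for the 'true' effect, the observed proportion significant (Pss), the expected proportion significant absent selection (Esig, i.e., mean retrospective power), ESS = Pss − Esig, and Psss = ESS/(1 − Esig). It is illustrative only; it is not the code deposited with this article, and the function and variable names are ours for exposition.

    import numpy as np
    from scipy import stats

    def ess_accounting(estimates, standard_errors, alpha=0.05):
        """Illustrative sketch of the accounting identities described in the text."""
        est = np.asarray(estimates, dtype=float)
        se = np.asarray(standard_errors, dtype=float)
        z_crit = stats.norm.ppf(1 - alpha / 2)             # approximately 1.96
        uwls = np.sum(est / se**2) / np.sum(1 / se**2)     # precision-weighted mean (UWLS)
        sign = np.sign(uwls)
        # Observed share significant in the direction of UWLS
        pss = float(np.mean(sign * est / se > z_crit))
        # Expected share significant absent selection = mean retrospective power (Esig)
        esig = float(np.mean(1 - stats.norm.cdf(z_crit - sign * uwls / se)))
        ess = pss - esig                                   # excess statistical significance
        psss = max(ess, 0.0) / (1 - esig)                  # share selected to be significant
        return {"UWLS": uwls, "Pss": pss, "Esig": esig, "ESS": ess, "Psss": psss}

The correction factors discussed above (1.221 for ESS and 1.12 for Psss) would then be applied to these raw proportions.
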
As unfavorable as the above assessment may seem, it greatly underestimates the overall pro-
portion of misleading evidence reported by leading economics journals because it puts to the side
the 54.4% of all results reported to be statistically nonsignificant. For this nonsignificant evidence,
type II errors are the relevant threat to their credibility. The probability of a type II error, often rep-
resented as β, equals 1-statistical power. We estimate β to be 93% for the typical result published in
a leading economics journal. If we interpret a failure to reject the hypothesis of a null economic
effect as ‘inconclusive,’ then inconclusive evidence reported at leading journals has, on average,
a 93% chance of being mistaken.29 Looking at the full spectrum of reported findings, results that
are reported to be statistically significant are about as likely to be misleading as not (49%) and
the remaining statistically nonsignificant results are much more likely (93%) to be misleading or
uninformative. Taken together, this implies that a large majority of empirical evidence reported
in leading economics journals is potentially misleading or uninformative, hence not credible.
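
For the 31 leading journals taken together, a rough back-of-envelope weighting of these two error rates (54.4% nonsignificant with roughly a 93% chance of being mistaken or uninformative, and 45.6% significant of which about 41% is falsely positive) gives 0.544 × 0.93 + 0.456 × 0.41 ≈ 0.69; by this crude reckoning, roughly seven in ten reported results are misleading or uninformative.
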
Third, our rich research database allows us to directly estimate the proportion of all reported
results that have undergone some process of selection for statistical significance (Psss). The
calculation of ESS entails Psss: Psss = ESS / (1 − Esig). For the top five journals, we estimate that 35% of all
reported results have undergone some process of selection for statistical significance (Psss), 26%
across all 31 leading journals, and notably less, 17%, if reported elsewhere.30 The pattern is similar
if we focus on Psss*, which corrects Psss’ underestimation by multiplying it by 1.12.31 We then find
that 39% of all reported results in the top five have undergone some process of selection for statistical
significance, 29% across all 31 leading journals, and 19% in other outlets. Clearly, there is an
ordered hierarchy in the publishing of economics research: nonleading publications and unpublished
reports are the least selective, leading journals in general come next, and the top five publish the most
intensively selected research, on average.32 These proportions of selected results are much higher
still when seen relative to positive results. For the top five, 66% of positive results were selected to
be positive; across all 31 leading journals the share is lower but still 56%.33 Leading journals have long
been open about being highly selective, as seen in their published low acceptance rates. As many
have long suspected, we now have some evidence that this selectivity extends to the reporting of
positive, statistically significant, evidence.
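
As an arithmetic check, these quantities are linked by the formula above. Taking the top five row of Table 2 (ESS = 0.253, Psss = 0.345, Pss = 0.520), the implied Esig is 1 − 0.253/0.345 ≈ 0.27, and:

    Psss = ESS / (1 − Esig) = 0.253 / (1 − 0.267) ≈ 0.345
    Psss / Pss = 0.345 / 0.520 ≈ 0.66

which reproduces the 66% of positive results selected to be positive cited above.
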
Economic theory predicts that behavior follows incentives. It is widely known that academic
promotion and tenure decisions are driven by publication, especially when published at leading
and top five journals (Heckman & Moktan, 2020). If leading journals are seen to prefer large,
dramatic, and statistically significant results, researchers will have incentives to select and report
such findings. In particular, the “winner’s curse” has been offered to explain publication selection
bias:
The current system of publication . . . provides a distorted view of the reality of
scientific data . . . The "winner's curse," a more general statement of publication
bias, suggests that the small proportion of results chosen for publication (by leading
journals) are unrepresentative of scientists’ repeated samplings of the real world. . .
(T)he more extreme, spectacular results may be preferentially published (Young et al.,
2008, p. 1418, parentheses added).

Without statistical significance, evidence will not be seen as ‘spectacular.’ Thus, the pattern of
selection for statistical significance that we find at leading economics journals is consistent with
such publication preferences and this winner’s curse.
When selection is as severe as we see in the top economics journals (35% selection in the top
five or 39% if we use Psss*), the research record will be highly exaggerated in size and significance.
To evaluate the exaggeration that a 40% selection for statistical significance is likely to induce, we
conducted another simulation using the same design as before (with random population means
and heterogeneity) but where 40% of the reported results are selected, across random sampling
error and random heterogeneity, to be statistically significant. We found that this level of selection
causes the average reported result to be exaggerated by a factor of 2.77, and over half of the results
reported to be statistically significant (52%) are statistically significant only because of this selection.
Exaggeration of economic parameters this severe can have grave consequences for policy, causing
resources to be directed to projects that have little benefit or away from actions that have much
lower costs.
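
A deliberately simplified sketch of this kind of selection simulation is given below. It is not the design used for the results just reported, which draws random population means and calibrates heterogeneity to the data; all parameter values and names here are illustrative assumptions only.

    import numpy as np

    rng = np.random.default_rng(1)

    def toy_selection_sim(n=20_000, mu=0.1, tau=0.1, p_select=0.40, z=1.96):
        """Toy illustration of directional selection for statistical significance."""
        se = rng.uniform(0.05, 0.5, n)                 # study precision varies
        theta = mu + rng.normal(0.0, tau, n)           # heterogeneous 'true' effects
        unselected = theta + rng.normal(0.0, se)       # estimates absent any selection
        reported = unselected.copy()
        chosen = rng.random(n) < p_select              # 40% of results are selected
        for i in np.where(chosen)[0]:
            draw = unselected[i]
            while draw / se[i] <= z:                   # redraw heterogeneity and sampling error
                draw = mu + rng.normal(0.0, tau) + rng.normal(0.0, se[i])
            reported[i] = draw
        return {
            "exaggeration": reported.mean() / mu,              # reported mean vs true mean
            "Pss_with_selection": float(np.mean(reported / se > z)),
            "Pss_without_selection": float(np.mean(unselected / se > z)),
        }

    print(toy_selection_sim())

Even in this toy setting, the inflation of the mean reported effect and the jump in the share of significant results are immediately visible, although the specific factor of 2.77 quoted above comes from the fuller simulation design described in the text.
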
On a positive note, we know that small changes in the editorial policies at leading journals
can improve the credibility of the research they choose to publish, and they have done so. The
same data that are used here have also been employed to evaluate the effect of the adoption of
mandatory data-sharing policies at leading economics journals. Askarov et al. (2023) found that
merely mandating that researchers share their data is associated with a decrease in excess statisti-
cal significance, estimated as here, and mandatory data-sharing causes average reported t-values
to decrease. Similarly, Blanco-Perez and Brodeur (2020) found that journal editorials welcoming
nonsignificant findings reduce reported statistical significance and publication
bias at these journals and, presumably, their excess statistical significance. Corroborat-
ing Brodeur et al.'s (2020) finding that the use of stronger experimental methods can mitigate
publication bias, we find that experimental studies are less selected for statistical significance and
much less excessively statistically significant—Table 2.
Long-standing theoretical, scientific, and statistical arguments establish that experimental
evidence is, ceteris paribus, generally more credible than observational evidence.
Our survey clearly corroborates this general assessment. We find that the average
economic experiment has much higher quality than the typical observational study.34 Table 2
discloses that the typical statistical power of experimental evidence is more than 10 times higher
than the power of observational studies published in the leading journals. Reflecting this higher
power, the ESS of experimental evidence is only half as large as that of observational results. Thus, not all
evidence published in leading economics journals is highly selected to be statistically significant.
However, this experimental evidence represents only 3% of what is published in these leading jour-
nals. Nonetheless, our test of selection for statistical significance, PSST, is sufficiently powerful to
find clear evidence of publication bias even for experimental evidence (Z = 5.15; p < 0.001)—see
Table 2. Interestingly, the typical power of experimental evidence published by leading jour-
nals is much larger than that of experimental evidence reported elsewhere—78% versus 42%. Thus, it
appears that leading journals can select for power if they choose to.35 However, even here, we
have clear evidence that those experimental results that are published by the leading journals
have been more highly selected for statistical significance; ESS = {9.7%; 7.2%} and Psss = {21%;
12%}, respectively, for leading journals and ‘other’ reported research. Again, our results are
consistent with the belief that stronger experimental designs reduce publication bias and
p-hacking (Brodeur et al., 2020).36

6 SUMMARY

We survey statistical power and excess statistical significance in over 22 thousand test statistics
published in 31 leading economics journals. The typical result published in leading economics
journals has very low statistical power (7%), only two percentage points larger than the nominal
level of type I errors. Median power is lower in the top five journals; only 5%. At the same time,
46% of these test statistics are reported as statistically significant (52% in the top five). This suggests
that the profession is producing and reporting an excess of statistically significant results. Indeed,
we find that excess statistical significance is 19% in the leading economics journals (25% in the
top five). Comparing this excess significance to the proportion reported as statistically significant,
we find that 41% of the evidence in the 31 leading economics journals is falsely positive and 49%
is falsely positive in the top five. Moreover, we calculate that 56% of all statistically significant
results published in leading journals have gone through some process of preferentially selecting
for statistical significance. This proportion is 66% in the top five journals.
Low power and high excess statistical significance cast doubt on the contributions of many
published findings to empirical knowledge of economic phenomena. Whether reported to be sta-
tistically significant, or not, many published results are likely to be misleading. At least half of the
‘positive’ evidence reported in the top five journals is likely to be misleading and at least two-thirds
of this ‘positive’ evidence was selected to be positive.
Although some judgement about the choice of methods is unavoidable, the overall results and
their central patterns are robust to sensible alternatives. From the outset, our intention has been
to be conservative in our choice of methods and approaches by erring on the side of the credibility
of economic evidence and the quality of the research published in leading journals. Nevertheless,
we caution that while the findings apply to the 368 research areas included in our survey, they do
not necessarily generalize to economics, broadly, nor necessarily to all the empirical research pub-
lished in these 31 journals. Power and statistical significance are only two dimensions of research
quality, and articles published in scientific journals contribute in more ways than the reporting of
empirical investigations. For example, studies contribute new methods and new theoretical and
policy insights that are not part of our assessment. Moreover, our survey also establishes that the
quality of experimental economic evidence is notably higher and frequently scientifically credible.
In this article we assessed only two empirical dimensions of the credibility of economics
research. Journals, especially the ‘top five’, publish studies whose contributions often extend
beyond providing parameter estimates. Nevertheless, our results flag the importance of the analy-
sis of power and the rates of false positive or misleading evidence in economics. One implication of
our findings is the need for further investigations into the incidence of selection for statistical sig-
nificance and the rate of misleading evidence reported in economics. Another implication is that
evidence-based policy and the further development of economic theories need to be grounded
on a broader, more credible, evidence base. In this regard, research synthesis methods, such
as meta-analysis, can be beneficial as they increase statistical power by pooling the results of
many studies (Schmidt & Hunter, 2015). Fortunately, notable efforts to increase credibility are
ongoing, such as: mandatory data-sharing, the increased use of experimental methods and meta-
analysis, expanded transparency, and the pre-registration of experimental trials (e.g., Angrist
& Pischke, 2010; Christensen & Miguel, 2018; Blanco-Perez & Brodeur, 2020; Askarov et al.,
2023).

AC K N OW L E D G M E N T S
Several scholars made their data available for our analysis. We are especially grateful to: Anar
K. Ahmadov, Francesco Aiello, Edward Anderson, Sebastian Beer, Germa Bel, Troy Broderstad,
Penelope Buckley, Antoine Cazals, Sefa Awaworyi Churchill, Maria Cippolina, Marco Colagrossi,
Liesbeth Colen, Thomas Conlon, Sandy Dall’Erba, Binyam Afewerk Demena, Irina Dolgopolova,
Robert JR Elliott, François Fall, Erik Fernau, Nino Fonseca, Lynn J. Frewer, Craig Gallet, Jerome
Geyer-Klingeberg, Adem Gök, David Guerreiro, Carla Haelermans, Christopher Hansen, Tomas
Havranek, Jost Heckemeyer, Philipp Heimberger, Wuyang Hu, Kaixing Huang, Nick Huntington-
Klein, Taisuke Imai, Ichiro Iwasaki, Huriya Jabbar, Zameelah Khan Jaffur, Mohammed Jawad,
Manu Jose, Tim Kaiser, Carl Koopmans, Patrice Laroche, Jing Li, Yulin Lin, Ludwig List, Cather-
ine Liston-Heyes, Georgios Magkonis, Luis A. De los Santos-Montero, Carina Neisser, Pedro
Cunha Neves, Thi Mai Lan Nguyen, Robin Nunkoo, Geoff Pugh, William O’Brochta, Edward
Oczkowski, Robert Reed, Jhon James Mora Rodríguez, Fabio Santeramo, Andreas Schneck,
Kibrom T. Sibhatu, Todd Sorensen, Mark Stevens, Janina I. Steinert, Mehmet Ugur, Paola Vesco,
Tracey Wang, Wei Yang, Mustafa Yeter, and Patrick Zwerschke. No funding was received to con-
duct this research. The authors declare no relevant or material financial interests that relate to the
research described in this paper.
Open access publishing facilitated by Deakin University, as part of the Wiley - Deakin
University agreement via the Council of Australian University Librarians.

D A T A AVA I L A B I L I T Y S T A T E M E N T
All data and code used in this survey are available through GitHub: https://github.com/
anthonydouc/Selective-Leading-Economics-Journals

ORCID
Hristos Doucouliagos https://orcid.org/0000-0001-5269-3556

ENDNOTES
1. There are, of course, many other important ways in which one can contribute to science—new theories,
criticizing old theories, developing new methods of observation, experiment, measurement and analysis, the
development of theories, and policy insights, to mention a few. Our focus in this article is exclusively on empir-
ical contributions. Sir Karl Popper (1959/34) repeatedly discussed the importance of ‘reproducible’ experiments
and tests of theories for science; however, in context, he is clearly referring to independent ‘replications’ by other
teams of researchers as the term ‘replication’ is commonly used today.
2. Widely-publicized failures to replicate highly-regarded findings from psychology and economic experiments
have led to a ‘replication crisis’ in the social sciences (Open Science Collaboration, 2015; Camerer et al., 2016;
Camerer et al., 2018; Klein et al., 2018).
3. The alternative hypothesis should represent the result for which researchers are seeking empirical support. Oth-
erwise, the conventional levels of type I and type II errors are misaligned. Even when researchers may be seeking
evidence in support of the null hypothesis, ‘accepting’ the null hypothesis only carries statistical or scientific
weight when statistical power is high. Cohen (1990), for example, argues that evidence in support of the null
hypothesis only has credence if a test accepts a trivial effect when the power to find a nontrivial effect is very
high (at least 95%).
4. By taking a weighted average of the reported evidence, large and often imprecisely estimated effect sizes are
downweighed. Hence, the UWLS meta-average reduces some of the artificial inflation of reported effect sizes.
5. Typically, t-tests are reported in the primary literature; however, the t-distribution rapidly converges to the normal
as sample size increases. The vast majority of the research collected here used a sample size large enough to
make the difference between 1.96 and the associated critical t-value practically negligible (median n = 350) for
our calculations of power and ESS.
6. In practice and in simulations, random-effects have been shown to produce highly exaggerated estimates of the
mean effect (Stanley & Doucouliagos, 2012, 2014; Kvarven et al., 2020). The Online Supplement reports power
using the random-effects weighted average.
7. The Online Supplement presents estimates of power using several alternate measures of the 'true effect'. In
particular, we estimate UWLS using only studies reported up to the year prior to an article being submitted for
review. This allows the UWLS meta-average to change over time, increasing in some research areas and declining
in others. Further, we also use the PET-PEESE publication bias correction, conditional estimator (Stanley
& Doucouliagos, 2012), again using only those effect sizes reported up to the year prior to an article being submitted
for review at a journal. This procedure accommodates two of the main sources of heterogeneity in empirical
economics: time-varying effect sizes and publication selection.
8. ESS = Psss · (1 − Esig), where Psss is the proportion of estimates selected to be statistically significant.
9. 'Falsely positive' evidence differs from 'false positives' when used as a synonym for type I errors, in that falsely
positive evidence makes no assumption about the ‘true’ population mean effect. Rather, falsely positive evidence
only means that the proclamation that this evidence is statistically significant is likely to be in error.
10. τ̂²_m is calculated from the random-effects meta-average model. This model assumes that each observed estimate
is equal to the population mean effect + unobservable random heterogeneity + random sampling error. The
individual true effect is the sum of the mean and random heterogeneity; however, neither of these is observable.
11. This is true as long as power < 50%, which is the case for the typical empirical estimate reported in all but
3 of the 31 leading economics journals but not for all areas of economics research.
12. Selective publication bias and the associated exaggeration of the size and significance of economic effects only
become notably problematic when selection is directional (Ioannidis et al., 2017). Thus, both 𝐸𝑠𝑖𝑔𝑖𝑗𝑚 and 𝑆𝐼𝐺𝑖𝑗𝑚
are calculated directionally, which we operationalize, empirically, by the sign of UWLS. When two-tailed excess
statistical significance is calculated for the leading journals, ESS is even larger.
13. Because there is high heterogeneity among reported economic results, we assume the conventional random-
effects model and estimate the heterogeneity variance using its maximum likelihood estimate. However, it is
widely known that the random effects estimate of the population mean effect is highly biased and much more
so than UWLS when there is publication selection bias (Bartoš et al., 2023a; Henmi & Copas, 2010; Stanley &
Doucouliagos, 2012, 2014, 2015). Thus, we use the UWLS estimate of the mean. This mixing of methods was
first proposed by Henmi and Copas (2010) to improve random-effects confidence intervals and to make random
effects more robust to publication bias.
14. 700 is selected because this is the average number, approximately, of estimates per leading journal in our research
database.
15. We use this meta-regression equation merely to determine the average correction factor needed to ensure an
unbiased empirical estimate of the true proportion that are excessively statistically significant. A properly formed
regression model would reverse the variables. However, doing so does not change the correlation, which is the
basis for the validity of using ESS^t_m = 1.221 · ESS_m as a less biased estimate of the true proportion that are
excessively statistically significant.
16. The law of total probability states that the marginal probability of any event, P(A), can be expressed as the sum
of joint probabilities that partition the sample space: P(A) = Σ_n P(B_n) · P(A|B_n), where B_n are pairwise disjoint
events whose union is the entire sample space. For this application, replacing A by SS (the event that a
particular estimate is reported to be statistically significant), B_1 by SSS (evidence is selected to be statistically
significant), and B_2 by not-SSS (not selected to be statistically significant) gives P(SS) = P(SSS) · P(SS|SSS) +
P(not-SSS) · P(SS|not-SSS). Several of these probabilities are known or can be easily estimated. By definition, every
reported estimate that has been selected to be statistically significant will be reported to be statistically signifi-
cant; hence P(SS|SSS) = 1. Similarly, P(SS|not-SSS) = Esig by our definition of Esig, which meta-analysis allows
us to estimate. Lastly, P(SS) is easily estimated as the observed proportion of estimates reported to be statis-
tically significant, Pss. Thus, Pss = P(SSS) + Esig · (1 − P(SSS)) and Pss − Esig = P(SSS) − Esig · P(SSS). Or,
ESS = P(SSS) · (1 − Esig), giving P(SSS) = ESS / (1 − Esig) for all Esig ≠ 1. If ESS < 0 for an area of research, we set
P(SSS) = 0.
17. For example, suppose that a given SE is fraudulently or mistakenly reported to be one-half its true size to achieve
statistical significance for a particular estimate. ESS and PSST would be correspondingly underestimated because
this fraudulent, or mistaken, reporting of SE is a form of selection for statistical significance. Their underesti-
mation would be the result of an overestimated Esig because a falsely small SE overestimates the true power,
making the probability of finding a statistically significant finding higher in the absence of selection. As we
fully acknowledged above and our simulations confirm, our methods represent conservative assessments (or
underestimates) of the extent and reach of selection for statistical significance. If SEs are also ‘gamed’ to achieve
statistical significance this would serve to amplify the conservative nature of our methods and PSST.
18. Consequently, our sample differs from Ioannidis et al. (2017) in several ways; see the Online Supplement for
details.
19. To ensure minimal effects from typos and other recoding errors, outliers and leverage points were identified and
removed when they had studentized residuals in excess of 2.5 or DFBETAs greater than 2/sqrt(n) in absolute
values. See Table S2, Online Supplement for more details. 167,753 is the number of estimates after these are
removed, of which 22,281 were published in the leading economics journals. Our findings are similar with and
without this correction; see Table S6 in the Online Supplement.
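For illustration only, a minimal sketch of such a screen, written here with statsmodels rather than the code actually used (the function name and interface are ours), might look as follows:

    import numpy as np
    import statsmodels.api as sm

    def flag_outliers(y, X):
        """Flag observations with |studentized residual| > 2.5 or |DFBETA| > 2/sqrt(n)."""
        exog = sm.add_constant(np.asarray(X, dtype=float))
        fit = sm.OLS(np.asarray(y, dtype=float), exog).fit()
        infl = fit.get_influence()
        n = len(fit.resid)
        big_resid = np.abs(infl.resid_studentized_external) > 2.5
        big_dfbeta = np.abs(infl.dfbetas).max(axis=1) > 2 / np.sqrt(n)
        return big_resid | big_dfbeta              # True marks a flagged observation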
20. UWLS of the productivity estimates of public infrastructure capital published in top economics journals is 0.092
(95% CI = 0.052–0.132). The UWLS estimate for all other reported research is 0.069 (95% CI = 0.061–0.076).
21. As detailed in the Online Supplement and above, we use studentized residuals in excess of 2.5 or DFBETAs greater
than 2/sqrt(n) in absolute values to identify and remove outliers and leverage points.
22. Hence, the effect size metric does not affect estimated power or ESS.
23. Pss denotes the proportion of estimates reported to be statistically significant in the same direction as the UWLS.
The proportion reported to be statistically significant in either direction is larger.
24. Psss is larger than ESS because some tests are sufficiently powerful to find a statistically significant result even
without selection for statistical significance. Whenever a colleague or student tells us that they got the ‘right’
result on their first try, it is an illustration of the difference between Psss and ESS.
25. It is easy to show that (Psss/Pss) estimates the probability that a result was selected to be statistically significant
if reported to be statistically significant; that is, P(SSS|SS). By Bayes theorem, we know that: P(SSS|SS) = P(SSS) ∙
P(SS|SSS)/ P(SS). P(SS|SSS) = 1, by definition, and P(SS) is estimated by Pss. Thus, (Psss/Pss) estimates P(SSS|SS).
26. Recall that the UWLS estimate of the mean overestimates power and underestimates ESS. Moreover, the meta-
analyses data combine estimates from studies whose focus was the effect of a particular variable (e.g., the effect
of aid on growth) with studies that include a variable merely as a control (e.g., aid is included as a control in
regressions of the effect of ICT on growth). Selection bias is likely to be much larger in the former group of
studies.
27. Ioannidis et al. (2017) calculate the median of the median power of the 156 research areas included in their survey.
In contrast, our aggregation is different in that we take the median power of the 22,281 estimates published in
the leading 31 journals. That is, we aggregate the individual estimates of power rather than aggregating median
power by research area. Median power is 13% if we aggregate the same way as Ioannidis et al. (i.e., take the median
of the medians).
28. For example, we estimate that only 12% of experimental evidence reported outside leading journals were selected
for their statistical significance. Yet, PSST is highly statistically significant (Z = 21.59; p < .0001), because there
are over 4000 of these estimates.
29. By classical, Neyman-Pearson, hypothesis testing, the failure to reject the null hypothesis is not evidence that
the null is true, but rather that the test result is ‘inconclusive’ or insufficiently powered to be informative. Cohen
(1990) and Abadie (2020) argue that failures to reject the null hypotheses can be highly informative. However,
they assume that power is very high, which our survey finds to be atypical. We also agree that a nonsignificant
finding would be much more informative if the null hypothesis were not a point but rather set to the interval less
than or equal to the minimally scientific effect size (Stanley & Doucouliagos, 2019). Whether we view nonsignif-
icant findings as merely inconclusive or as highly unreliably and inconclusive (due to their high probability of a
type II error) matters little. Either way, such evidence is either highly misleading or uninformative; hence, not
credible. If 54% of the research base is ‘inconclusive’ or highly prone to error and the remaining 46% is almost as
likely to be misleading as not, then, overall, typical evidence published in leading journals is neither credible nor
reliably valid.
30. Recall that 'other' research includes unpublished reports, conference papers, working papers, and theses.
31. Our simulations demonstrate that a correction for Psss's systematic underestimation, Psss* = 1.12 · Psss, pro-
vides a highly reliable estimate of the proportion of reported results that have been subjected to a process of
selection for statistical significance—see section A of the Online Supplement.
32. Our research evidence does not allow us to identify the exact pathway of selection. Are reviewers and editors
at these journals preferentially selecting statistically significant results or do authors perceive that this might be
the case and engage in selection for statistically significant results before they submit their research to leading
journals? We cannot say. Perhaps, all along the way, some researchers are inclined to selectively report the evi-
dence they prefer. However, we can say that the evidence reported at the top 5 and at leading journals has been
selected for statistical significance more intensively than those reported elsewhere.
33. These percentages are higher if we correct Psss's underestimation (Psss*/Pss). We then find that 74% and 63% of
positive results in the top five journals and all 31 leading journals, respectively, were selected to be positive.
34. Observational studies are widely regarded to be more prone to be falsely positive (e.g., Ioannidis, 2005). Brodeur
et al. (2020) compare p-hacking and publication bias in 25 leading economics journals and their results confirm
“an unspoken hierarchy in the profession, which typically regards RCT (randomized controlled trials) as a gold
standard.”
35. Perhaps, this apparent selection for statistical power is possible because power and study quality are more closely
associated with the sample size of experimental studies and sample size is routinely reported. For observational
research, sample size, even when known, will be a poor proxy for research quality and less correlated with power
as power and quality depend on dozens of nuances about econometric methods and models in addition to sample
size.
36. Sample sizes in experimental research are comparable between leading and non-leading journals. For those
research areas where we can collect data on sample size, we find that median sample sizes are slightly smaller
in experimental research in leading economics journals; 168 compared to 198 for all other journals. They are
also slightly lower in observational research: 339 in the leading economics journals compared to 447 for all other
journals.

REFERENCES
Abadie, A. (2020). Statistical nonsignificance in empirical economics. AER: Insights, 2(2), 193–208.
Angrist, J. D., & Pischke, J.-S. (2010). The credibility revolution in empirical economics: How better research design
is taking the con out of econometrics. Journal of Economic Perspectives, 24(2), 3–30.
Askarov, Z., Doucouliagos, A., Doucouliagos, H., & Stanley, T. D. (2023). The significance of data-sharing policy.
Journal of the European Economic Association, 21(3), 1191–1226.
Bartoš, F., Maier, M., Wagenmakers, E. J., Doucouliagos, H., & Stanley, T. D. (2023a). Robust Bayesian meta analysis:
Model averaging across complementary publication bias adjustment methods. Research Synthesis Methods, 14,
99–116.
Bartoš, F., Maier, M., Wagenmakers, E. J., Nippold, F., Doucouliagos, H., Ioannidis, J. P. A., Otte, W. M., Sladekova,
M., Deresssa, T. K., Bruns, S. B., Fanelli, D., & Stanley, T. D. (2023b). Footprint of publication selection bias on
meta-analyses in medicine, environmental sciences, psychology, and economics. arXiv:2208.12334.
Blanco-Perez, C., & Brodeur, A. (2020). Publication bias and editorial statement on negative findings. Economic
Journal, 130, 1226–1247.
Bom, P. R. D., & Ligthart, J. E. (2014). What have we learned from three decades of research on the productivity of
public capital? Journal of Economic Surveys, 28(5), 889–916.
Brodeur, A., Cook, N., & Heyes, A. (2020). Methods matter: P-hacking and publication bias in causal analysis in
economics. American Economic Review, 110(11), 3634–3660.
Brodeur, A., Lé, M., Sangnier, M., & Zylberberg, Y. (2016). Star wars: The empirics strike back. American Economic
Journal: Applied Economics, 8(1), 1–32.
Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., Kirchler, M., Almenberg, J., Altmejd,
A., Chan, T., Heikensten, E., Holzmeister, F., Imai, T., Isaksson, S., Nave, G., Pfeiffer, T., Razen, M., & Wu, H.
(2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436.
Camerer, C. F., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., Kirchler, M., Nave, G., Nosek,
B. A., Pfeiffer, T., Altmejd, A., Buttrick, N., Chan, T., Chen, Y., Forsell, E., Gampa, A., Heikensten, E., Hummer,
L., Imai, T., . . . Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science
between 2010 and 2015. Nature Human Behavior, 2(9), 637–644.
Card, D., & DellaVigna, S. (2020). What do editors maximize? Evidence from four leading economics journals.
Review of Economics and Statistics, 102(1), 195–217. https://doi.org/10.1162/rest_a_00839
Christensen, G., & Miguel, E. (2018). Transparency, reproducibility, and the credibility of economics research.
Journal of Economic Literature, 56(3), 920–980.
Coase, R. H. (1995). Essays on economics and economists. University of Chicago Press.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of
Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1969). Statistical power analysis in the behavioral sciences. Academic Press.
Cohen, J. (1990). Things I learned (so far). American Psychologist, 45, 1304–1312.
De Linde Leonard, M., & Stanley, T. D. (2015). Married with children: What remains when observable biases are
removed from the reported male marriage wage premium. Labour Economics, 33, 72–80.
Fanelli, D., Costas, R., & Ioannidis, J. P. A. (2017). Meta-assessment of bias in science. Proceedings of the National
Academy of Sciences, 114(14), 3714–3719.
Fraley, R. C., & Vazire, S. (2014). The N-pact factor: Evaluating the quality of empirical journals with respect to
sample size and statistical power. PLoS One, 9, e109019.
Heckman, J. J., & Moktan, S. (2020). Publishing and promotion in economics: The tyranny of the top five. Journal
of Economic Literature, 58(2), 419–470.
Henmi, M., & Copas, J. B. (2010). Confidence intervals for random effects meta-analysis and robustness to
publication bias. Statistics in Medicine, 29, 2969–2983.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.
Ioannidis, J. P. A. (2011). Excess significance bias in the literature on brain volume abnormalities. Archive of General
Psychiatry, 68, 773–780.
Ioannidis, J. P. A., Stanley, T. D., & Doucouliagos, C. (2017). The power of bias in economics research. Economic
Journal, 127, F236–F265.
Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials,
4(3), 245–253.
Irsova, Z., Bom, P. R. D., Havranek, T., & Rachinger, H. (2023). Spurious precision in meta-analysis. https://www.
econstor.eu/handle/10419/268683
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams, R. B., Alper, S., Aveyard, M., Axt, J. R., Babalola,
M. T., Bahník, Š., Batra, R., Berkics, M., Bernstein, M. J., Berry, D. R., Bialobrzeska, O., Binan, E. D., Bocian, K.,
Brandt, M. J., Busching, R., . . . Nosek, B. A. (2018). Many Labs 2: Investigating variation in replicability across
sample and setting. Advances in Methods and Practices in Psychological Science, 1(4), 443–490.
Kvarven, A., Strømland, E., & Johannesson, M. (2020). Comparing meta-analyses and preregistered multiple-
laboratory replication projects. Nature: Human Behavior, 4, 423–434.
List, J. A., Bailey, C. D., Euzent, P. J., & Martin, T. L. (2001). Academic economists behaving badly? A survey on
three areas of unethical behavior. Economic Inquiry, 39, 162–170.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251),
aac4716.
Pereira, T. V., Horwitz, H. I., & Ioannidis, J. P. A. (2012). Empirical evaluation of very large treatment effects of
medical interventions. Journal of the American Medical Association, 308(16), 1676–1684.
Popper, K. R. (1959). The logic of scientific discovery. Basic Books.
Popper, K. R. (1972). Objective knowledge. Clarendon Press.
Psychonomic Society. (2012). New statistical guidelines for journals of the psychonomic society. http://www.
psychonomic.org/page/statisticalguideline
Schmidt, F. L., & Hunter, J. E. (2015). Methods of meta-analysis: Correcting error and bias in research findings. 3rd
edition. Sage.
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies?
Psychological Bulletin, 105, 309–316.
Stanley, T. D., Carter, E., & Doucouliagos, H. ( C. ) (2018). What meta-analyses reveal about the replicability of
psychological research. Psychological Bulletin, 144, 1325–1346.
Stanley, T. D., & Doucouliagos, H. C. (2012). Meta-regression analysis in economics and business. Routledge.
Stanley, T. D., & Doucouliagos, H. ( C. ) (2014). Meta-regression approximations to reduce publication selection
bias. Research Synthesis Methods, 5, 60–78.
Stanley, T. D., & Doucouliagos, H. ( C. ) (2015). Neither fixed nor random: Weighted least squares meta-analysis.
Statistics in Medicine, 34, 2116–2127.
Stanley, T. D., & Doucouliagos, H. ( C. ) (2017). Neither fixed nor random: Weighted least squares meta-regression.
Research Synthesis Methods, 8, 19–42.
Stanley, T. D., & Doucouliagos, H. C. (2019). Practical significance, meta-analysis and the credibility of economics.
IZA Discussion Paper, No. 12458.
Stanley, T. D., Doucouliagos, H. ( C. )., & Ioannidis, J. P. A. (2022). Retrospective median power, false positive
meta-analysis and large-scale replication. Research Synthesis Methods, 13, 88–108.
Stanley, T. D., Doucouliagos, H. ( C. )., Ioannidis, J. P. A., & Carter, E. (2021). Detecting publication selection bias
through excess statistical significance. Research Synthesis Methods, 12, 776–795.
Stanley, T. D., Ioannidis, J. P. A., Maier, M., Doucouliagos, H. ( C. ), Otte, W. M., & Bartoš, F. (2023).
Unrestricted weighted least squares represent medical research better than random effects in 67,308 Cochrane
meta-analyses. Journal of Clinical Epidemiology, 157, 53–58.
Young, N. S., Ioannidis, J. P. A., & Al-Ubaydli, O. (2008). Why current publication practices may distort science.
PLoS Medicine, 5, 134–145.
Yuan, K., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and
Behavioral Statistics, 30, 141–167.

S U P P O RT I N G I N F O R M AT I O N
Additional supporting information can be found online in the Supporting Information section at
the end of this article.

How to cite this article: Askarov, Z., Doucouliagos, A., Doucouliagos, H., & Stanley, T.
D. (2023). Selective and (mis)leading economics journals: Meta-research evidence. Journal
of Economic Surveys, 1–26. https://doi.org/10.1111/joes.12598
