
ORIGINAL ARTICLE

(wileyonlinelibrary.com) doi: 10.1002/leap.1265 Received: 4 September 2019 | Accepted: 16 October 2019

bioRxiv: Trends and analysis of five years of preprints


Kent R. Anderson

Founder, Caldera Publishing Solutions, Westborough, MA, USA
ORCID: 0000-0002-5458-6735
E-mail: kent@caldera-publishing.com

Abstract
bioRxiv was founded on the premise that publicly posting preprints would allow authors to receive feedback and submit improved papers to journals. This paper analyses a number of trends against this stated purpose, namely, the timing of preprint postings relative to submission to accepting journals; trends in the rate of unpublished preprints over time; trends in the timing of publication of preprints by accepting journals; and trends in the concentration of published, reviewed preprints by publisher. Findings show that a steady c.30% of preprints remain unpublished and that the majority is posted onto bioRxiv close to or after submission – therefore giving no time for feedback to help improve the articles. Four publishers (Elsevier, Nature, PLOS, and Oxford University Press) account for the publication of 47% of bioRxiv preprints. Taken together, it appears that bioRxiv is not accomplishing its stated goals and that authors may be using the platform more to establish priority, as a marketing enhancement of papers, and as functional Green OA, rather than as a community-driven source of prepublication review.

INTRODUCTION

The preprint server bioRxiv was developed at Cold Spring Harbor Labs (CSHL) and its Press (CSHLP) and has become the largest such platform in the biological sciences since its launch in November 2013, hosting more than 40,000 preprints to date. Its preprint base has since been used to catalyse the launch of a related preprint initiative, medRxiv, which was launched in 2019 through a collaboration of CSHLP, the BMJ Publishing Group, and Yale University.

Preprint servers have been established under various auspices, ranging from temporary documents to bridge the time gap between submission and publication (Goldschmidt-Clermont, 1965) to fixed papers intended to fundamentally supplant formal journals (Brooks, 2009). There is little agreement about their function, from the durability of preprints themselves in the record to their function for authors. Numerous concerns about preprint management have been articulated (Kling, 2005), especially in the biological and medical sciences.

To evaluate bioRxiv, we can be more specific. The stated purpose of bioRxiv is to serve as a useful way station as papers move along to journals:

    By posting preprints on bioRxiv, authors are able to make their findings immediately available to the scientific community and receive feedback on draft manuscripts before they are submitted to journals.

This paper examines data acquired from a number of sources, including bioRxiv web pages and published article pages. Overall, more than 37,000 preprints from bioRxiv were analysed for this paper, and more than 1,200 published, peer reviewed articles with associated preprints were included in the final data set. Data were gathered using the Web Scraper plugin for Google Chrome (https://www.webscraper.io/), and data were downloaded via CSV and then converted to Microsoft Excel. All calculations were performed in Excel. Data included the date of preprint posting, the digital object identifier (DOI) of the preprint,

Learned Publishing 2019 www.learned-publishing.org © 2019 The Author(s). 1


Learned Publishing © 2019 ALPSP.

and any associated journal publication emanating from the preprint. A sub-analysis conducted later used the same tool to assess the use and reach of Twitter relative to bioRxiv preprints. Publishers and journals associated with published papers emanating from preprints were determined via DOI lookup. These data correspond strongly with a prior analysis published earlier this year by researchers from the University of Minnesota (Abdill & Blekhman, 2019). However, my analysis differs, and I have added other data sets to test pre-publication posting behaviour and other factors that those authors did not test.

Key points
• Of the preprints related to papers published in Nature journals, 57% were posted on bioRxiv after the paper was submitted to the journal that would ultimately publish it.
• Approximately 30% of bioRxiv preprints remain unpublished, similar to the rate reported for arXiv.org.
• Almost half of bioRxiv preprints that are ultimately published are found in journals from four publishers: Elsevier, Nature, PLOS, and Oxford University Press.
• Authors posting preprints on bioRxiv rarely publish in megajournals: only 3–9% of published articles appear in the four main megajournals (PLOS One, Nature Communications, Scientific Reports, and Science Advances).

To evaluate bioRxiv’s success relative to its stated purpose, I have interrogated the data along several related dimensions:

• The timing of preprint posting relative to publication in accepting journals. If the premise of bioRxiv is proving successful, most preprints should be posted well ahead of submission to the final journal accepting the paper in order for community feedback to accumulate, so authors can improve the paper prior to submission.
• The rate of unpublished preprints remaining on the bioRxiv servers. If bioRxiv is helping move manuscripts through an improvement process and into journals, the rate of preprints remaining unpublished should consistently trend towards zero, and this rate of decrease should be accelerating as the service proves more mainstream and successful.

The null hypothesis is therefore that bioRxiv is doing what it purports – helping authors improve papers prior to submission to journals. One sign of success would be a high and increasing percentage of papers successfully passing peer review and resulting in journal publication. This outcome would satisfy all stakeholders. By extension, if the null hypothesis is contradicted by the data, it would be reasonable to assume that bioRxiv is serving purposes other than those stated.

TIMING OF PREPRINT POSTING RELATIVE TO PUBLICATION IN ACCEPTING (FINAL) JOURNALS

To evaluate whether preprints are being mainly used to gather community feedback in order to improve papers prior to submission, I constructed a sample of more than 1,200 papers published from 2016 to 2018 across a variety of Nature journals. Journals from Nature were selected as they cover a wide variety of subject areas and reliably include publication event data, such as date of submission, date of acceptance, and date of publication. In addition, Nature journals were generally the most active for authors posting preprints in bioRxiv during the timeframe of this

FIGURE 1 This scattergram shows 1,220 papers from Nature journals. The 0 date is the date the paper was submitted to the journal. The distance above this point is the number of days prior to submission the preprint was posted. The distance below this point is the number of days after submission when the preprint was posted. As you can see, the trend is for more preprints to be posted after submission, fewer to be posted prior to submission, and the times to shift towards being closer to submission if done before or further beyond if done after.
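
The offsets plotted in Figure 1 amount to simple date arithmetic. The article states that all calculations were performed in Excel, so the Python sketch below is only an illustration: the function names and the sample dates are assumptions made for this example, not part of the original analysis. It follows the figure's sign convention (positive values are days before submission, negative values are days after) and the 10-day window used in the article's timing breakdown.

```python
from datetime import date


def days_before_submission(posted: date, submitted: date) -> int:
    """Signed offset using the Figure 1 convention.

    Positive: the preprint was posted that many days before submission.
    Negative: the preprint was posted that many days after submission.
    """
    return (submitted - posted).days


def timing_bucket(offset_days: int, window: int = 10) -> str:
    """Classify an offset into the three groups described in the article."""
    if offset_days > window:
        return f"more than {window} days before submission"
    if offset_days < -window:
        return f"more than {window} days after submission"
    return f"within {window} days of submission"


# Hypothetical dates, not taken from the article's data set.
example_offset = days_before_submission(
    posted=date(2018, 3, 20), submitted=date(2018, 2, 14)
)
example_bucket = timing_bucket(example_offset)
```

Applied to each matched preprint and article pair, the share of rows falling into each bucket would give the kind of percentages reported in this section.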


analysis, accounting for 14% of all published preprints (Elsevier was second, with 13%). Papers were matched across bioRxiv and the Nature journals by digital object identifier (DOI).

The data show that 57% of the preprints related to papers published in Nature journals were posted on bioRxiv after the paper was submitted. A total of 5% of papers were posted after acceptance, while 0.2% were posted after publication.

The time afforded to posting before submission was generally brief. Only 29% of published papers were posted as a preprint more than 10 days before the paper was submitted to the journal that published it, while 26% of preprints were posted within 10 days before or after submission to that journal. The remainder were posted more than 10 days after submission. Figure 1 shows the data for each paper from 1 to 1,220 (X-axis). Negative values denote days after submission, meaning the preprint was posted after submission.

The trend in the data was towards a longer period occurring between submission and posting, doubling from 20 days post-submission for 2016 preprints to 40 days post-submission for 2018 preprints. The overall effect of the data suggests fewer postings prior to submission and more postings after (the data are from 2016 to 2018, reading left to right, with the spread of papers distributed evenly from 0 to 1,220). This suggests that authors are increasingly using the platform for purposes other than receiving pre-submission feedback.

Overall and on average, the preprints related to Nature journal articles were posted slightly more than a month (34 days) after being submitted to the journal that would ultimately publish the work. This timing may have some significance. In general, for many journals, 30 days is enough time for a paper to clear initial editorial review and, in some cases, to have cleared initial peer review.

Given these data, it appears that, instead of a pre-submission system, bioRxiv may be viewed as a pre-publication system by the authors using it. This shift in author utilization of bioRxiv may be accelerating as well.

There are limitations to these data. We cannot tell from these data if any papers were reviewed and rejected elsewhere prior to the ultimately successful submission, review, and publication process. This was a sampling of articles submitted to a specific publisher. The sample size was limited by the availability of complete submission, acceptance, and publication data across many publishers.

With the majority of preprints in this sample being posted after submission to the journal that would ultimately publish them, and usually after some editorial signal has been sent, if not complete reviews, it seems bioRxiv is being used by authors for purposes other than garnering community feedback. Given the potential for posted preprints to reflect peer review or editorial review feedback, and the increasing time between submission and posting, these practices may have implications for publisher policies around Green OA, preprints, and copyright transfer agreements.

TRENDS IN UNREVIEWED PREPRINTS ON BIORXIV

(A note on terminology: I view preprints as published in the practical sense. However, where they have not been accepted and published by a peer reviewed journal, what remains on bioRxiv is the unreviewed assertions of the authors, which here are called the ‘unreviewed preprint’. This term is used to differentiate these articles from those that have been reviewed and subsequently published in peer reviewed journals.)

Data from more than 37,000 preprints posted on bioRxiv between November 2013 and 31 December 2018 show that thousands of unreviewed preprints remain on bioRxiv servers. I consider these to be abandoned manuscripts – either abandoned by the authors, who never intended to publish them in a peer reviewed journal or who gave up after multiple unsuccessful attempts, or abandoned by the scientific community via the peer review and editorial review mechanisms the community uses to accept or reject research findings. However defined, the rate of unreviewed or abandoned manuscripts is steady – 2 years after posting, approximately a third of preprints have not resulted in a peer reviewed publication.

FIGURE 2 Percentage of preprints not published in peer-reviewed format.
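
The percentages behind Figure 2 can be checked with a few lines of arithmetic. The sketch below is an illustration rather than the author's method (the article's calculations were done in Excel): the function and variable names are hypothetical, and the average is taken as a simple unweighted mean of the yearly rates, which is an assumption. The 2013 counts and the 2014–2017 rates are the figures reported in this section.

```python
def unpublished_rate(posted: int, never_published: int) -> float:
    """Percentage of posted preprints with no peer reviewed publication."""
    if posted <= 0:
        raise ValueError("posted must be a positive count")
    return 100.0 * never_published / posted


# 2013 counts as reported in the article: 77 posted, 31 never published.
rate_2013 = unpublished_rate(posted=77, never_published=31)

# Unpublished percentages reported for the four complete years.
reported_rates = {2014: 31.9, 2015: 32.2, 2016: 31.3, 2017: 32.4}

# Assumption: an unweighted mean across years, not weighted by volume.
mean_complete_years = sum(reported_rates.values()) / len(reported_rates)

print(f"2013 unpublished rate: {rate_2013:.1f}%")
print(f"2014-2017 mean unpublished rate: {mean_complete_years:.2f}%")
```

Under these assumptions the 2013 rate rounds to the reported 40.3%, and the four-year mean rounds to the reported 32%.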


FIGURE 3 Preprints posted and not published in a peer-reviewed form.

As of 1 August 2019, for the years 2013–2019, 41.9% of bioRxiv preprints had not been published in a peer reviewed journal. The averages at the heart of the data are quite consistent (Fig. 2):

• 2013 – 40.3% (service started in November, 77 preprints posted, 31 never published)
• 2014 – 31.9%
• 2015 – 32.2%
• 2016 – 31.3%
• 2017 – 32.4%
• 2018 – 50.0%

Focusing on the four complete years, when the service had a full year for preprint posting and publication events can be assumed to have had sufficient time to occur (2014–2017), the average rate of abandoned manuscripts is 32%, or roughly 1 in 3. These data are not dissimilar to publication rates calculated from arXiv preprints (Lariviere et al., 2014).

The rapid increase in the raw number of preprints means that the steady percentage of abandoned preprints represents a growing number of preprints posted on bioRxiv that have never been accepted in a peer reviewed journal (Fig. 3):

• 2013 – 31 preprints never published
• 2014 – 252
• 2015 – 509
• 2016 – 1,293
• 2017 – 3,339
• 2018 – 10,213

This rapid increase in preprints posted, combined with the historical norms of unpublished or abandoned preprints, means that the volume of unreviewed and unpublished manuscripts on bioRxiv is growing, strongly indicating that the purpose of bioRxiv as a platform allowing authors to share manuscripts for community-based improvements prior to publication in a peer reviewed journal is not being fulfilled.

PUBLISHERS OF PAPERS WITH ASSOCIATED PREPRINTS

Since 2013, Elsevier and Nature have published the largest percentage of papers initially posted on bioRxiv. These papers were published in journals with a variety of business models, including subscription journals. Over time, in fact, the percentage of papers published by pure OA publishers (e.g. PLOS, BioMed Central, eLife) has decreased.

As of 1 August 2019, Nature and Elsevier accounted for 27% of the published papers based on preprints from 2013 to 2018. Of the two, Elsevier is growing faster. Along with PLOS and Oxford University Press (OUP), four publishers published 47% of the papers based on bioRxiv preprints over the years mentioned above:

• Nature – 14%
• Elsevier – 13%
• PLOS – 11%
• OUP – 9%

In the market overall (personal communication, 2019), Elsevier accounts for 18% of the papers published (so its 13% share of these papers falls short of its overall market share), while SpringerNature accounts for ~12% (which means the Nature subset is strongly over-performing in the market for these papers). Authors associated with bioRxiv preprints may submit papers to Nature journals more often because of its mix of journals – there are many prominent biology journals at Nature, as well as the flagship. Despite this, Elsevier’s share appears to be overtaking Nature’s – up to 1 August 2019, for 2018 preprints, Elsevier had published 14% of the published preprints to Nature’s 13%.

Two of the Top 4 are non-profit publishers. Overall, the mix of the most active publishers of papers emanating from bioRxiv preprints skews towards the non-profit space. This may have to do with how biology and medical publishers skew overall, with some of the most robust independent journals coming from


biology and medicine. Non-profit biomedical publishers are usually affiliated with a society and have some of the most respected journal brands – high-impact, central to communities, aspirational – which may also explain the observation.

A total of ~100 publishers or journals have published papers that first existed as bioRxiv preprints. This number seems to be stabilizing, probably indicating the scope of interest for authors of viable bioRxiv preprints. However, the papers are not evenly distributed. The share of bioRxiv preprints published with the 12 most active publishers of bioRxiv preprints totals 72% of the papers resulting from preprints. These 12 are (in order; the first three, Nature, Elsevier, and PLOS, each have a share of 10% or more):

1. Nature
2. Elsevier
3. PLOS
4. OUP
5. BioMed Central
6. eLife
7. Wiley
8. PNAS
9. Genetics Society of America
10. Society for Neuroscience
11. Frontiers
12. PeerJ

The total share of papers published from bioRxiv preprints for this dozen has been declining slowly: 80% of papers based on 2015 preprints were published by these 12, and this had dropped to 70% for 2018 preprints as of 1 August 2019.

Megajournals do not appear to appeal to authors posting preprints on bioRxiv. For the preprint years 2015–2017, bioRxiv preprints were published as peer reviewed papers in four of the major megajournals (PLOS One, Nature Communications, Scientific Reports, and Science Advances) only 3–9% of the time. These percentages declined over time.

OTHER OBSERVATIONS

Further analysis shows that publication events generally occur more frequently in the final calendar quarter of each year, with the lowest percentage published in the first calendar quarter (16%) and the highest percentage (30%) published in the last calendar quarter. Given the high probability of calendar quarters aligning with fiscal year quarters, these data suggest that publishers accelerate publication of papers as the year ends. These trends were independent of the overall growth of the number of preprints on bioRxiv (Fig. 4).

The publishers most consistently publishing both more papers and a higher percentage of papers in Q4 were Nature, Elsevier, and OUP. This behaviour suggests that some calendar-year incentive may drive the pace of paper acceptances and publication practices.

Papers emanating from preprints were published in journals from exclusively OA publishers (PLOS, eLife, PeerJ, BioMed Central, and Frontiers) and OA megajournals 51% of the time in 2014. This percentage dropped to 24% by 2017.

As the share of Nature and Elsevier journals publishing preprints increased, there was an observable tendency for society journals contracting for services with these large publishers to publish articles using the Gold OA business model. Nature and Elsevier’s proprietary titles appeared to do this far less commonly.

Twitter amplification of preprints is the most commonly associated activity on the bioRxiv platform. However, its efficacy has been declining. From 2014 to 2017, a tweet about a bioRxiv preprint on average reached 76,181 users. By the first half of 2019, the average reach had declined to 53,967, a 29.2% decline. The number of tweets per preprint also declined, from an average of

FIGURE 4 Percentage of papers from preprints published (by quarter).
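
A quarterly breakdown like the one in Figure 4 can be sketched by assigning each publication date to a calendar quarter and computing the share of events per quarter. The example below is a minimal illustration under stated assumptions: the function names are hypothetical, the sample dates are invented for the example, and the article itself reports only the resulting percentages, with the underlying calculations performed in Excel.

```python
from collections import Counter
from datetime import date


def calendar_quarter(d: date) -> int:
    """Return the calendar quarter (1 to 4) that a date falls in."""
    return (d.month - 1) // 3 + 1


def quarterly_shares(publication_dates: list[date]) -> dict[int, float]:
    """Percentage of publication events falling in each calendar quarter."""
    if not publication_dates:
        raise ValueError("at least one publication date is required")
    counts = Counter(calendar_quarter(d) for d in publication_dates)
    total = len(publication_dates)
    # Counter returns 0 for quarters with no events.
    return {q: 100.0 * counts[q] / total for q in (1, 2, 3, 4)}


# Hypothetical publication dates, for illustration only.
sample_dates = [
    date(2017, 2, 10),
    date(2017, 5, 3),
    date(2017, 8, 21),
    date(2017, 10, 9),
    date(2017, 12, 18),
]
shares = quarterly_shares(sample_dates)
```

Computing the shares separately for each year would show whether a fourth-quarter peak persists independently of overall growth in preprint volume.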


18 in the 2014–2017 time period to 13 in the first half of 2019. The major contributing factor appears to be that Twitter users with followers numbering in the hundreds of thousands boosted preprints in the early days but have since moved on to other things and no longer tweet about bioRxiv or preprints.

DISCUSSION

Overall, the null hypothesis – that bioRxiv helps authors improve manuscripts prior to publication in peer reviewed journals – is not supported by the data. Most authors who publish peer reviewed articles based on preprints post the related preprint after submission to the accepting journal, and roughly a quarter post within 10 days of submission, before or after. Taken together, these data indicate that bioRxiv is not generally used as a source of pre-submission feedback. Trends also indicate that the rate of unpublished and ultimately abandoned preprints remains steady, indicating that bioRxiv is not scaling any effective and productive feedback mechanism for authors, which one would reasonably expect would lead to a decreasing percentage of unpublished preprints.

The practice of posting preprints close to or after submission, and potentially after receiving a positive sign indicating likely publication, is not unexpected. Good authors are cautious. They do not want to be scooped or embarrassed. Therefore, it makes sense they would only post a paper that has received some signal about its fate, and perhaps some review and positive feedback, maybe even a letter noting that only minor changes remain, bolstering their confidence that their paper is in decent shape and bound for publication.

There are also data suggesting that papers posted as preprints generate more citations (Davis & Fromerth, 2007). An unknown proportion of authors may be placing preprints in bioRxiv to realize this benefit. Depositing preprints also helps authors establish priority. Finally, there is a distinct aspect of social media promotion detectable on bioRxiv, with most ‘comments’ linked from the tool the platform uses coming from article promotion on Twitter and other social media platforms. In a more competitive ‘publish or perish’ and funding environment, authors may be using any and all methods they have to promote themselves and their work. While rational, author and paper promotion is not a stated purpose of bioRxiv.

The observation that a steady percentage of bioRxiv preprints are abandoned is more problematic. Operational assumptions – such as assigning permanent DOIs to preprints – were made based on the notion that preprints would be tied to permanent, validated journal articles. If the assumption had been otherwise, a URL would be perfectly functional until and unless a peer reviewed article was published based on the existing preprint. Because this assumption was made, bioRxiv also lacks a policy to retire or deprecate an inactive preprint after a reasonable period, which the data indicate would be 2–3 years after posting.

Overall, four publishers – Nature, Elsevier, PLOS, and OUP – were responsible for publishing 47% of the papers emanating from preprints posted to bioRxiv. The level of concentration among these four is slowly decreasing over time. This may be due to more society publishers establishing cascade systems and launching Gold OA publications. The lack of interest in megajournals among authors posting bioRxiv preprints and successfully pursuing publication may indicate that publication brand is an important element of submission decisions. The erosion of pure OA publishers as outlets for papers emanating from bioRxiv preprints appears to confirm other findings that business model is not a primary factor in authors’ choice of publication venue.

Preprint servers are an experiment and need to be measured and assessed as such. To date, bioRxiv looks as if it adds little to the experience of most authors, the fate of most papers, and the overall health of the scholarly publishing ecosystem. The risks and costs of leaving thousands of unreviewed and abandoned preprints available to the general public, scientists, and practitioners are not clear and may be sizable, especially as unrefereed preprints are ported into medRxiv, a preprint server specifically focused on medicine.

These data suggest that bioRxiv is being utilized by authors more as a pre-publication, post-acceptance platform for article promotion, career advancement, and protection of priority.

ACKNOWLEDGEMENTS
The author acknowledges the contributions of Eric Anderson, which included the acquisition of key data elements about publishers’ utilization of preprints. The author also thanks Phil Davis for discussions about similar findings for arXiv and for providing useful references for the same.

REFERENCES
Abdill, R. J., & Blekhman, R. (2019). Meta-research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife, 8, e45133. https://doi.org/10.7554/eLife.45133
Brooks, T. C. (2009). Organizing a research community with SPIRES: Where repositories, scientists, and publishers meet. Information Services & Use, 29, 91–96. Retrieved from https://inspirehep.net/info/general/project/ape09.pdf
Davis, P. M., & Fromerth, M. J. (2007). Does the arXiv lead to higher citations and reduced publisher downloads for mathematics articles? Scientometrics, 71, 203–215. https://doi.org/10.1007/s11192-007-1661-8
Goldschmidt-Clermont, L. (1965). Communication patterns in high-energy physics. Retrieved from http://eprints.rclis.org/4253/
Kling, R. (2005). The internet and unrefereed scholarly publishing. Annual Review of Information Science and Technology, 38(1), 591–631. https://doi.org/10.1002/aris.1440380113
Lariviere, V. et al. (2014). arXiv e-prints and the journal of record: An analysis of roles and relationships. Journal of the Association for Information Science and Technology, 65(6), 1157–1169. https://doi.org/10.1002/asi.23044
