Professional Documents
Culture Documents
International Authorship and Collaboration Across Biorxiv Preprints
International Authorship and Collaboration Across Biorxiv Preprints
International Authorship and Collaboration Across Biorxiv Preprints
META-RESEARCH
(a)
Total preprints
26000
10000
1000
100
25
(b) (c)
United States
Other
United Kingdom
Unknown
Germany
France
China
Canada
Australia
0 8,000 16,000 24,000 0 8,000 16,000 24,000 32,000
Preprints, senior author Preprints, any author
Australia
(d) Canada
China
Proportion of total preprints, senior author
France
Germany
Other
United Kingdom
United States
Unknown
Figure 1. Preprints per country. (a) A heat map indicating the number of preprints per country, based on the institutional affiliation of the senior
author. The color coding uses a log scale. (b) The total preprints attributed to the seven most prolific countries. The x-axis indicates total preprints
listing a senior author from a country; the y-axis indicates the country. The ‘Other’ category includes preprints from all countries not listed in the plot. (c)
Similar to panel b, but showing the total preprints listing at least one author from the country in any position, not just the senior position. (d) Proportion
Figure 1 continued on next page
Figure 1 continued
of total senior-author preprints from each country (y-axis) over time (x-axis), starting in November 2013 and continuing through December 2019. Each
colored segment indicates the proportion of total preprints attributed to a single country (using same color scheme as panels (b and c), as of the end
of the month indicated on the x-axis.
The online version of this article includes the following source data and figure supplement(s) for figure 1:
Source data 1. Preprints per country.
Source data 2. Preprint counting methods at the country level.
Figure supplement 1. Preprint-level collaboration.
Figure supplement 2. Preprints with no country assignment.
author’s institutional affiliation to summarize 1200 preprints each (Table 1). Brazil, with 704
country-level participation and outcomes. manuscripts, has the 15th-most preprints and is
the first South American country on the list, fol-
lowed by Argentina (163 preprints) in 32nd
Results place. South Africa (182 preprints) is the first
Country level bioRxiv participation over African country on the list, in 29th place, fol-
time lowed by Ethiopia (57 preprints) in 42nd place
We retrieved author data for 67,885 preprints (Figure 1—source data 1). It is noticeable that
for which the most recent version was posted South Africa and Ethiopia both have high opt-in
before 2020. First, we attributed each preprint rates for a program operated by PLOS journals
to a single country, using the affiliation of the that enabled submissions to be sent directly to
last individual in the author list, considered by bioRxiv (PLOS, 2019). We found similar results
convention in the life sciences to be the ‘senior when we looked at which countries were most
author’ who supervised the work (see Methods). highly represented based on authorship at any
North America, Europe and Australia dominate position (Table 1). Overall, US authors appear
the top spots (Figure 1a): 26,598 manuscripts on the most bioRxiv preprints – 34,676 manu-
(39.2%) have a last author from the United scripts (51.1%) include at least one US author
States (US), followed by 7151 manuscripts (Figure 1c).
(10.5%) from the United Kingdom (UK; Over time, the country-level proportions on
Figure 1b), though China (4.1%), Japan (1.9%) bioRxiv have remained remarkably stable
and India (1.8%) are the sources of more than (Figure 1d), even as the number of preprints
All 11 countries with more than 1000 preprints attributed to a senior author affiliated with that country. The percen-
tages in the ‘Preprints, any author’ column sum to more than 100% because preprints may be counted for more than
one country. A full list of countries is provided in Figure 1—source data 1.
(a) (b)
USA
USA
UK
bioRxiv adoption
Israel
2.0
10,000 Switzerland
1.5
UK France
1.0
0.5 Canada
Germany Netherlands
France
Denmark
Senior−author preprints
China
Canada Germany
Switzerland Australia Japan Ethiopia
Netherlands India (24 other countries...)
1,000
Spain China
Israel Sweden
Brazil Italy
Denmark South Korea
Norway
Belgium Saudi Arabia
Finland
South Korea
Singapore Austria Poland
New Zealand Russia
Portugal Mexico Taiwan Greece
Ireland South Africa Poland Thailand
Argentina
Czechia Iran
Hungary Chile Russia
100 Saudi Arabia Turkey
Colombia Malaysia
Greece
Ethiopia Malaysia
Bangladesh Iran
Philippines Qatar Thailand
Turkey
10,000 100,000 1,000,000 0.0 0.5 1.0 1.5 2.0
Citable documents bioRxiv adoption
Figure 2. BioRxiv adoption per country. (a) Correlation between two scientific output metrics. Each point is a country; the x-axis (log scale) indicates
the total citable documents attributed to that country from 2014 to 2019, and the y-axis (also log scale) indicates total senior-author preprints attributed
to that country overall. The red line demarcates a ‘bioRxiv adoption’ score of 1.0, which indicates that a country’s share of bioRxiv preprints is identical
to its share of general scholarly outputs. Countries to the left of this line have a bioRxiv adoption score greater than 1.0. A score of 2.0 would indicate
that its share of preprints is twice as high as its share of other scholarly outputs (See Discussion for more about this measurement.) (b) The countries
with the 10 highest and 10 lowest bioRxiv adoption scores. The x-axis indicates each country’s adoption score, and the y-axis lists each country in order.
All panels include only countries with at least 50 preprints.
The online version of this article includes the following source data for figure 2:
Source data 1. Country productivity and bioRxiv adoption.
grew exponentially. For example, at the end of Preprint adoption relative to overall
2015 Germany accounted for 4.7% of bioRxiv’s scientific output
2460 manuscripts, and at the end of 2019 it was We noted that some patterns may be obscured
responsible for 5.4% of 67,885 preprints. How- by countries that had hundreds or thousands of
ever, the proportion of preprints from countries times as many preprints as other countries, so
outside the top seven contributing countries is we re-evaluated these ranks after adjusting for
growing slowly (Figure 1d): from 19.4% at the overall scientific output (Figure 2a). The cor-
end of 2015 to 23.1% at the end of 2019, by rected measurement, which we call ‘bioRxiv
which time bioRxiv was hosting preprints from adoption,’ is the proportion of preprints from
senior authors affiliated with 136 countries. each country divided by the proportion of world-
wide research outputs from that country (see
Methods). The US posted 26,598 preprints and
published about 3.5 million citable documents,
Vietnam Canada
Colombia
Tanzania France
Slovakia Croatia
Ecuador Germany
Indonesia
Egypt Netherlands
Thailand
Sweden
Croatia Estonia Switzerland
Greece
Egypt
Greece
Kenya United
Kingdom
Ecuador Iceland
Estonia Indonesia
Bangladesh Kenya
Iceland Peru
Turkey Slovakia
Tanzania
Peru
Colombia
Thailand United
Canada
States
Germany
Turkey
France
Figure 3. Contributor countries. (a) Bar plot indicating the international senior author rate (y-axis) by country (x-axis) – that is, of all international
preprints with a contributor from that country, the percentage of them that include a senior author from that country. All 17 contributor countries are
listed in red, with the five countries with the highest senior-author rates (in grey) for comparison. (b) A bar plot with the same y-axis as panel (a). The
x-axis indicates the international collaboration rate, or the proportion of preprints with a contributor from that country that also include at least one
author from another country. (c) is a bar plot indicating the total international preprints featuring at least one author from that country (the median
value per country is 19). (d) On the left are the 17 contributor countries. On the right are the countries that appear in the senior author position of
preprints that were co-authored with contributor countries. (Supervising countries with 25 or fewer preprints with contributor countries were excluded
from the figure.) The width of the ribbons connecting contributor countries to senior-author countries indicates the number of preprints supervised by
the senior-author country that included at least one author from the contributor country. Statistically significant links were found between four
combinations of supervising countries and contributors: Australia and Bangladesh (Fisher’s exact test, q = 1.01 10 11); the UK and Thailand
(q = 9.54 10 4); the UK and Greece (q = 6.85 10 3); and Australia and Vietnam (q = 0.049). All p-values reflect multiple-test correction using the
Benjamini–Hochberg procedure.
The online version of this article includes the following source data and figure supplement(s) for figure 3:
Source data 1. Combinations of senior authors with collaborator countries.
Source data 2. Links between contributor countries and the senior-author countries they write with.
Source data 3. International collaboration.
Figure supplement 1. Map of contributor countries.
Figure supplement 2. International collaboration correlations.
Figure supplement 3. Correlation between three measurements of international collaboration.
for a bioRxiv adoption score of 2.31 (Figure 2b). 15,820 citable documents published between
Nine of the 12 countries with adoption scores 2014 and 2019 (Figure 2—source data 1).
above 1.0 were from North America and Europe, By comparison, some countries are present
but Israel has the third-highest score (1.67) on bioRxiv at much lower frequencies than
based on its 640 preprints. Ethiopia has the would be expected, given their overall participa-
10th-highest bioRxiv adoption (1.08): though tion in scientific publishing (Figure 2b): Turkey
only 57 preprints list a senior author with an affil- published 249,086 citable documents from 2014
iation in Ethiopia, the country had a total of through 2019 but was the senior author on only
Czechia Italy
60%
South Korea Russia
Mexico UNKNOWN
Iran South Korea
Chile Brazil
China India
50%
Russia Chile
Brazil Taiwan
Taiwan Iran
Argentina China
0 100 200 300 400 125 175 225 275 0% 20% 40% 60% 80%
Downloads per preprint Downloads per preprint (first 6 mos.) Publication rate
Figure 4. Preprint outcomes. All panels include countries with at least 100 senior-author preprints. (a) A box plot indicating the number of downloads
per preprint for each country. The dark line in the middle of the box indicates the median, and the ends of each box indicate the first and third
quartiles, respectively. ‘Whiskers’ and outliers were omitted from this plot for clarity. The red line indicates the overall median. (b) A plot showing the
relationship (Spearman’s = 0.485, p=0.00274) between total preprints and downloads. Each point represents a single country. The x-axis indicates the
total number of senior-author preprints attributed to the country. The y-axis indicates the median number of downloads for those preprints. (c) A plot
showing the relationship (Spearman’s = 0.777, p=2.442 10 8) between downloads and publication rate. Each point represents a single country. The
x-axis indicates the median number of downloads for all preprints listing a senior author affiliated with that country. The y-axis indicates the proportion
of preprints posted before 2019 that have been published. (d) A bar plot indicating the proportion of preprints posted before 2019 that are now
flagged as ‘published’ on the bioRxiv website. The x-axis (and color scale) indicates the proportion, and the y-axis lists each country. The red line
indicates the overall publication rate.
The online version of this article includes the following source data for figure 4:
Source data 1. Published pre-2019 preprints by country.
Source data 2. Publication rates and DOI usage.
C h zi l y o m
Preprints Cell
BrarmanKingds
Genetic Epidemiology
Geited tate
403 JCI Insight
Japnadaa
Science
United S
Castrali
Indance
148 mSphere
an
Fr ina
Ital ia
mSystems
Au y
Un
Journal of Vision
55 Amer. Jour. of Human Genetics
PLOS ONE Cell Systems
20 Scientific Reports The Plant Cell
Nature Communications Physical Chemistry B
NeuroImage Genetics in Medicine
Molecular Cell
Molecular Biology and Evolution Jour. of Biological Chemistry
Nucleic Acids Research Nature Biotechnology
Bioinformatics Neuron
q−value eLife mBio
Biochemistry PNAS
0.05 JCI Insight The Plant Journal
Genetics in Medicine Molecular Bio. of the Cell
0.01 Genetic Epidemiology Journal of Structural Biology
0.001 Physical Chemistry B Nature Neuroscience
G3
Journal of Vision ACS Synthetic Biology
The Plant Cell Physical Biology
Plant Direct Genetics
Nature Biotechnology Journal of Neuroscience
mSystems Nature Structural & Mol. Bio.
ACS Synthetic Biology Nature
Cell Systems Genes & Development
Journal
Nature Neuroscience The ISME Journal
Journal of Biological Chemistry Nature Methods
RNA
mSphere Genome Research
Neuron PLOS Genetics
Amer. Jour. of Human Genetics
Journal
Developmental Biology
Molecular Cell Journal of Bacteriology
PLOS Pathogens Systematic Biology
Molecular Bio. of the Cell Structure
Nature Journal of Virology
Science PLOS Pathogens
eLife
Cell Journal of Neurophysiology
Nature Methods Infection and Immunity
mBio Current Biology
Cell Reports Cell Reports
Genome Research Nature Genetics
Journal of Neuroscience Developmental Cell
G3 Journal of The Royal Society Interface
Genetics PLOS ONE
PLOS Genetics Molecular Systems Biology
Human Brain Mapping
PNAS Cerebral Cortex
Human Molecular Genetics Scientific Reports
Biology Open Molecular Ecology Resources
Royal Society Open Science GigaScience
Molecular Psychiatry NeuroImage
International Journal of Epidemiology Frontiers in Microbiology
Translational Psychiatry Ecology and Evolution
Journal of The Royal Society Interface Antimicrobial Agents and Chemotherapy
Journal of Cell Science Frontiers in Immunology
Royal Society Open Science
Nature Genetics Methods in Ecology and Evolution
Development Journal of Cell Science
PLOS Biology Journal of Theoretical Biology
Genome Biology Biology Open
−50% −25% 0% 25% 50% 75% 100%
Overrepresentation of United States
Figure 5. Overrepresentation of US preprints. (a) A heat map indicating all disproportionately strong (q < 0.05) links between countries and journals,
for journals that have published at least 15 preprints from that country. Columns each represent a single country, and rows each represent a single
journal. Colors indicate the raw number of preprints published, and the size of each square indicates the statistical significance of that link—larger
squares represent smaller q-values. See Figure 5—source data 1 for the results of each statistical test. (b) A bar plot indicating the degree to which US
preprints are over- or under-represented in a journal’s published bioRxiv preprints. The y-axis lists all the journals that published at least 15 preprints
with a US senior author. The x-axis indicates the overrepresentation of US preprints compared to the expected number: for example, a value of ‘0%’
would indicate the journal published the same proportion of US preprints as all journals combined. A value of ‘100%’ would indicate the journal
published twice as many U. preprints as expected, based on the overall representation of the US among published preprints. Journals for which the
difference in representation was less than 15% in either direction are not displayed. The red bars indicate which of these relationships were significant
using the Benjamini–Hochberg-adjusted results from c2 tests shown in panel A.
The online version of this article includes the following source data for figure 5:
Source data 1. Journal–country links.
r = 0.949, p=8.73 10 38), a pattern that has in each preprint (Figure 1—figure supplement
also been observed, at a less dramatic rate, in 1), we found that 24,927 preprints (36.7%)
published literature (Adams et al., 2005; included authors from two or more countries;
Wuchty et al., 2007; Bordons et al., 2013). 3041 preprints (4.5%) were from four or more
Examining the number of countries represented countries, and one preprint, ‘Fine-mapping of
150 breast cancer risk regions identifies 178 high least one contributor from another country (Fig-
confidence target genes,’ listed 343 authors ure 3—figure supplement 2b). This indicates
from 38 countries, the most countries listed on that countries with mostly international preprints
any single preprint. The mean number of coun- (Figure 3b) also tended to have fewer interna-
tries represented per preprint is 1.612, which tional collaborations (Figure 3c) than other
has remained fairly stable since 2014 despite countries. Third, we found a negative correlation
steadily growing author lists overall (Figure 1— between international collaboration rate and the
figure supplement 1). proportion of international preprints for which a
We then looked at countries appearing on at country appears as senior author (Figure 3—fig-
least 50 international preprints to examine basic ure supplement 2c), demonstrating that coun-
patterns in international collaboration. We found tries that appear mostly on international
that a number of countries with comparatively preprints (Figure 3b) are less likely to appear as
low output contributed almost exclusively to senior author of those preprints (Figure 3). Simi-
international collaborations: for example, of the lar patterns have been observed in previous
76 preprints that had an author with an affiliation studies: González-Alcaide et al., 2017 found
in Vietnam, 73 (96%) had an author from another countries ranked lower on the Human Develop-
country. Upon closer examination, we found ment Index participated more frequently in
these countries were part of a larger group, international collaborations, and a review of
which we call ‘contributor countries,’ that (1) oncology papers found that researchers from
appear mostly on preprints with authors from low- and middle-income countries collaborated
other countries, but (2) seldom as the senior on randomized control trials, but rarely as senior
author. For this analysis, we defined a contribu- author (Wong et al., 2014).
tor country as one that has contributed to at After generating a list of preprints with
least 50 international preprints but appears in authors from contributor countries, we examined
the senior author position of less than 20% of which countries appeared most frequently in the
them. (We excluded countries with less than 50 senior author position of those preprints
preprints to minimize the effect of dynamics that (Figure 3d). Among the 2133 preprints with an
could be explained by countries with just one or author from a contributor country, 494 (23.2%)
two labs that frequently worked with interna- had a senior author listing an affiliation in the US
tional collaborators.) 17 countries met these cri- (Figure 3—source data 1). The UK was listed as
teria (Figure 3—figure supplement 1): for senior author on the next-most preprints with
example, of the 84 international preprints that contributor countries, at 318 (14.9%), followed
had an author with an affiliation in Uganda, only by Germany (4.2%) and France (3.1%). Given the
5 (6%) had an author from Uganda in the senior large differences in preprint authorship between
author position. This figure was also less than countries, we tested which of these senior-
12% for Vietnam, Tanzania, Slovakia and Indone- author relationships was disproportionately
sia: by comparison, the figure for the US was large. Using Fisher’s exact test (see Methods),
48.7% (Figure 3a). we found four links between contributor coun-
In addition to a high percentage of interna- tries and senior-author countries that were sig-
tional collaborations and a low percentage of nificant (Figure 3—source data 2). The
senior-author preprints, another characteristic of strongest link is between Bangladesh and Aus-
contributor countries is a comparatively low tralia: of the 82 international preprints with an
number of preprints overall. To define this sub- author from Bangladesh, 22 list a senior author
set of countries more clearly, we examined with an affiliation in Australia. Authors in Viet-
whether there was a relationship between any of nam appear with disproportionate frequency on
the three factors we identified across all coun- preprints with a senior author in Australia as well
tries with at least 50 international preprints. We (9 of 67 preprints). The other two links are to
found consistent patterns for all three (see Meth- senior authors in the UK, with contributing
ods). First, countries with fewer international col- authors from Thailand (50 of 187 preprints) and
laborations also tend to appear as senior author Greece (41 of 155 preprints).
on a smaller proportion of those preprints (Fig-
ure 3—figure supplement 2a). Second, we also Differences in preprint downloads and
observed a negative correlation between total publication rates
international collaborations and international After quantifying which countries were posting
collaboration rate – that is, the proportion of preprints, we also examined whether there were
preprints a country contributes to that include at differences in preprint outcomes between
approach to preprints and open access authors from the most-downloaded countries
(Humberto and Babini, 2020; ALLEA, 2018; (mostly in western Europe) have broader social-
Becerril-Garcı´a, 2019; Mukunth, 2019), and fur- media reach than authors in low-download coun-
ther research is required to determine what tries such as Argentina and Taiwan? What role
drives certain countries to preprint servers— does language play in outcomes, and why do
what incentives are present for biologists in Fin- countries that get more downloads also tend to
land but not Greece, for example. There is also have higher publication rates? We also found
more to be done regarding the trade-offs of some journals had particularly strong affinities
using a more distributed set of repositories that for preprints from some countries over others:
are specific to disciplines or countries (e.g. INA- even when accounting for differing publication
Rxiv in Indonesia or PaleorXiv for paleontology), rates across countries, we found dozens of jour-
which could also influence the observed levels of nal–country links that disproportionately favored
bioRxiv adoption. Our results make it clear that the US and UK. While it’s possible this finding is
those reading bioRxiv (or soliciting submissions coincidental, it demonstrates that journals can
from the platform) are reviewing a biased sam- embrace preprints while still perpetuating some
ple of worldwide scholarship. of the imbalances that preprints could be theo-
There are two findings that may be particu- retically alleviating.
larly informative about the state of open science Our study has several limitations. First, bio-
in biology. First, we present evidence of contrib- Rxiv is not the only preprint server hosting biol-
utor countries—countries from which authors ogy preprints. For example, arXiv’s Quantitative
appear almost exclusively in non-senior roles on Biology category held 18,024 preprints at the
preprints led by authors from more prolific coun- end of 2019 (https://arxiv.org/help/stats/2019_
tries (Figure 3). While there are many reasons by_area/index), and repositories such as Indone-
these dynamics could arise, it is worth noting sia’s INA-Rxiv (https://osf.io/preprints/inarxiv/)
that the current corpus of bioRxiv preprints con- hold multidisciplinary collections of country-spe-
tains the same familiar disparities observed in cific preprints. We chose to focus on bioRxiv for
published literature (Mammides et al., 2016; several reasons: primarily, bioRxiv is the preprint
Burgman et al., 2015; Wong et al., 2014; Gon- server most broadly integrated into the tradi-
zález-Alcaide et al., 2017). Critically, we found tional publishing system (Barsh et al., 2016;
the three characteristics of contributor countries Vence, 2017; eLife, 2020). In addition, bioRxiv
(low international collaboration count, high inter- currently holds the largest collection of biology
national collaboration rate, low international preprints, with metadata available in a format
senior author rate) are strongly correlated with we were already equipped to ingest (Abdill and
each other (Figure 3). When looking at interna- Blekhman, 2019c). Analyzing data from only a
tional collaboration using pairwise combinations single repository also avoids the issue of differ-
of these three measurements, countries fall ent websites holding metadata that is mis-
along tidy gradients (Figure 3—figure supple- matched or collected in different ways.
ment 3)—which means not only that they can be Comparing publication rates between reposito-
used to delineate properties of contributor ries would also be difficult, particularly because
countries, but that if a country fits even one of bioRxiv is one of the few with an automated
these criteria, they are more likely to fit the method for detecting when a preprint has been
other two as well. published. Second, this ‘worldwide’ analysis of
Second, we found numerous country-level dif- preprints is explicitly biased toward English-lan-
ferences in preprint outcomes, including a posi- guage publishing. BioRxiv accepts submissions
tive correlation at the country level between only in English, and the primary motivation for
downloads per preprint and publication rate this work was the attention being paid to bio-
(Figure 4c). This raises an important consider- Rxiv by organizations based mostly in the US
ation that when evaluating the role of preprints, and western Europe. In addition, bibliometrics
some benefits may be realized by authors in databases such as Scopus and Web of Science
some countries more consistently than others. If have well-documented biases in favor of English-
one of the goals of preprinting one’s work is to language publications (Mongeon and Paul-Hus,
solicit feedback from the community 2016; Archambault et al., 2006; de Moya-Ane-
(Sarabipour et al., 2019; Sever et al., 2019), gón et al., 2007), which could have an effect on
what are the implications of the average Brazil- observed publication rates and the bioRxiv
ian preprint receiving 37% fewer downloads adoption scores that depend on scientific output
than the average Dutch preprint? Do preprint derived from Scopus.
There were also 4985 preprints (7.3%) for separate entry for each author of each preprint,
which we were not able to confidently assign a and stored name, email address, affiliation,
country of origin. An evaluation of these (see ORCID identifier, and the date of the most
Methods) showed that the most prolific coun- recent version of the preprint that has been
tries were also underrepresented in the indexed in the Rxivist database. While the origi-
‘unknown’ category, compared to the 148 other nal web crawler performs author consolidation
countries with at least one author. While it is during the paper index process (i.e. ‘Does this
impractical to draw country-specific conclusions new paper have any authors we already recog-
from this, it suggests that the preprint counts for nize?’), this new tool creates a new entry for
countries with comparatively low participation each preprint; we make no connections for
may be slightly higher than reported, an issue authors across preprints in this analysis, and infer
that may be exacerbated in more granular analy- author country separately for every author of
ses, such as at the institutional level. Country- every paper. It is also important to note that for
level differences in metrics such as downloads longitudinal analyses of preprint trends, each
and publication rate may also be confounded preprint is associated with the date on its most
with field-level differences: on average, geno- recent version, which means a paper first posted
mics preprints are downloaded twice as many in 2015, but then revised in 2017, would be
times as microbiology preprints (Abdill and listed in 2017. The final version of the preprint
Blekhman, 2019b), for example, so countries metadata was collected in the final weeks of Jan-
with a disproportionate number of preprints in a uary 2020—because preprints were filtered
particular field could receive more downloads using the most recent known date, those posted
due to choice of topic, rather than country of before 2020, but revised in the first month of
origin. Further study is required to determine 2020, were not included in the analysis. In addi-
whether these two factors are related and in tion, 95 preprints were excluded because the
which direction. bioRxiv website repeatedly returned errors when
In summary, we find country-level participa- we tried to collect the metadata, leaving a total
tion on bioRxiv differs significantly from existing of 67,885 preprints in the analysis. Of these,
patterns in scientific publishing. Preprint out- there were 2409 manuscripts (3.6%) for which
comes reflect particularly large differences we were unable to scrape affiliation data for at
between countries: comparatively wealthy coun- least one author, including 137 preprints with no
tries in Europe and North America post more affiliation information for any author.
preprints, which are downloaded more fre- bioRxiv maintains an application program-
quently, published more consistently, and matic interface (API) that provides machine-
favored by the largest and most well-known readable data about their holdings. However,
journals in biology. While there are many poten- the information it exposes about authors and
tial explanations for these dynamics, the quanti- their affiliations is not as complete as the infor-
fication of these patterns may help stakeholders mation available from the website itself, and
make more informed decisions about how they only the corresponding author’s institutional
read, write and publish preprints in the future. affiliation is included (https://api.biorxiv.org/).
Therefore, we used the more complete data in
the Rxivist database (Abdill and Blekhman,
Methods 2019b), which includes affiliations for all authors.
All data on published preprints was pulled
Ethical statement directly from bioRxiv. However, it is also possi-
This study was submitted to the University of ble, if not likely, that the publication of many
Minnesota Institutional Review Board (study preprints goes undetected by its system.
#00008793), which determined the work did not Fraser et al., 2020 developed a method of
qualify as human subjects research and did not searching for published preprints in Scopus and
require IRB oversight. Crossref databases and found most had already
been picked up by bioRxiv’s detection process,
Preprint metadata though bioRxiv states that preprints published
We used existing data from the Rxivist web with new titles or authors can go undetected
crawler (Abdill and Blekhman, 2019c) to build a (https://www.biorxiv.org/about-biorxiv), and pre-
list of URLs for every preprint on bioRxiv.org. liminary data suggests this may affect thousands
We then used this list as the input for a new tool of preprints (Abdill and Blekhman, 2019b).
that collects author data: we recorded a How these effects differ by country of origin
remains unclear—perhaps authors from some authors are affiliated with an institution in the
countries are more likely to have their titles UK, the UK would receive 0.6 ‘credits’ for that
changed by journal editors, for example—but preprint. We found the complete-normalized
bias at the country level may also be more pro- preprint counts to be almost identical to the
nounced for other reasons. The assignment of counts distributed based on straight counting
Digital Object Identifiers (DOIs) to papers pro- (Pearson’s r = 0.9971, p=4.48 10 197). While
vides a useful proxy for participation in the there are numerous proposals for proportioning
‘western’ publishing system. Each published bio- differing levels of recognition to authors at dif-
Rxiv preprint is listed with the DOI of its pub- ferent positions in the author list (e.g.
lished version, but DOI assignment is not yet Hagen, 2013; Kim and Diesner, 2015), the
universally adopted. Boudry and Chartron, close link between the complete-normalized
2017 examined papers from 2015 indexed by count and the count based on senior authorship
PubMed and found DOI assignment varied indicates that senior authors are at least an accu-
widely based on the country of the publisher. rate proxy for the overall number of individual
96% of publications in Germany had a DOI, for authors, at the country level.
example, plus 98% of UK publications and more When computing the average authors per
than 99% of Brazilian publications. However, paper, the harmonic mean is used to capture the
only 31% of papers published in China had average ‘contribution’ of an author, as in
DOIs, and just 2% (33 out of 1582) of papers Glänzel and Schubert, 2005—in short, this
published in Russia. There are 45 countries that shows that authors were responsible for about
overlap between our analysis and that of one-third of a preprint in 2014, but less than
Boudry and Chartron, 2017. Of these, we
one-fourth of a preprint as of 2019.
found a modest correlation (Spearman’s
rho = 0.295, p=0.0489) between a country’s pre- Data collection and management
print publication rate and the rate at which pub-
All bioRxiv metadata was collected in a relational
lishers in that country assigned DOIs (Figure 4—
PostgreSQL database (https://www.postgresql.
source data 2). This indicates that countries with
org). The main table, ‘article_authors,’ recorded
higher rates of DOI issuance (for publications
one entry for each author of each preprint, with
dating back to 1955) also tend to have higher
the author-level metadata described above.
observed rates of preprint publication.
Another table associated each unique affiliation
string with an inferred institution (see Institu-
Attribution of preprints
tional affiliation assignment below), with other
Throughout the analysis, we define the ‘senior
tables linking institutions to countries and pre-
author’ for each preprint as the author appear-
prints to publications. (See the repository storing
ing last in the author list, a longstanding practice
the database snapshot for a full description of
in biomedical literature (Riesenberg, 1990;
the database schema.) Analysis was performed
Buehring et al., 2007) corroborated by a 2003
by querying the database for different combina-
study, which found that 91% of publications indi-
cated a corresponding author that was in the tions of data and outputting them into CSV files
first- or last-author position (Mattsson et al., for analysis in R (R Core Team, 2019). For exam-
2011). Among the 59,562 preprints for which ple, data on ‘authors per preprint’ was collected
the country was known for the first and last by associating all the unique preprints in the
author, 7965 (13.4%) preprints included a first ‘article_authors’ table with a count of the num-
author associated with a different country than ber of entries in the table for that preprint. Simi-
the senior author. lar consolidation was done at many other levels
When examining international collaboration, as well—for example, since each author is asso-
we also considered whether more nuanced ciated with an affiliation string, and each affilia-
methods of distributing credit would be more tion string is associated with an institution, and
informative. Our primary approach—assigning each institution is associated with a country, we
each preprint to the one country appearing in can build queries to evaluate properties of pre-
the senior author position—is considered prints grouped by country.
straight counting (Gauffriau et al., 2008). We
repeated the process using complete-normal- Contributor countries
ized counting (Figure 1—source data 2), which The analysis described in the ‘Collaboration’ sec-
splits a single credit among all authors of a pre- tion measured correlations between three coun-
print. So, for a preprint with 10 authors, if six try-level descriptors, calculated for all countries
that contributed to more than 50 international the Research Organization Registry (see below)
preprints: is that they use different criteria for the inclusion
of separate states. In most cases, SCImago pro-
i. International collaborations. The total
number of international preprints includ- vides more specific distinctions than ROR: for
ing at least one author from that country. example, Puerto Rico is listed separately from
ii. International collaboration rate. Of all the US in the SCImago dataset, but not in the
preprints listing an author from that ROR dataset. We did not alter these distinc-
country, the proportion of them that tions—as a result, nations with disputed or com-
includes at least one author from another plex borders may have slightly inflated bioRxiv
country. adoption scores. For example, preprints attrib-
iii. International senior-author rate. Of all
uted to institutions in Hong Kong are counted in
the international collaborations associ-
the total for China, but the 108,197 citable
ated with a country, the proportion of
them for which that country was listed as documents from Hong Kong in the SCImago
the senior author. dataset are not included in the China total.
which authors do not specify any affiliation infor- There were also corrections made to existing
mation at all, including in the PDF manuscript institutional assignments, which are used to
itself. We are also limited by the content of the make the country-level inferences about author
ROR system (https://ror.org/about/): Though location. It appears the ROR API struggles with
there are tens of thousands of institutions in the institutions that are commonly expressed as
dataset and its basis, the Global Research Identi- acronyms—affiliation strings including ‘MIT,’ for
fier Database (https://www.grid.ac/stats), has example, was sometimes incorrectly coded not
extensive coverage around the world, the trans- as ‘Massachusetts Institute of Technology’ in the
lation of affiliation strings is likely more effective US, but as ‘Manukau Institute of Technology’ in
for regions that have more extensive coverage. New Zealand, even when other clues within the
affiliation string indicated it was the former.
Country level accuracy of ROR assignments Other affiliation strings were more broadly opa-
Across 67,885 total preprints, we indexed que— ‘Centre for Research in Agricultural Geno-
488,660 total author entries, one for each author mics (CRAG) CSIC-IRTA-UAB-UB,’ for example.
of each preprint. These entries each included A full list of manual edits is included in the ‘man-
one of 136,456 distinct affiliation strings, which ual_edits.sql’ file.
we processed using the ROR API before making In total, 12,487 institutional assignments were
manual corrections. corrected, affecting 52,037 author entries
We first focused on assigning countries to (10.6%). Prior to the corrections, an evaluation
preprints that were in the ‘unknown’ category. of the ROR assignments in a random sample
We started by manually adding institutional (n = 488) found the country-level accuracy was
92.2 ± 2.4%, at a 95% confidence interval. After
assignments to ‘unknown’ affiliation strings that
an initial round of corrections, the country-level
were associated with 10 or more authors. We
accuracy improved to 96.5 ± 1.6%. (These sam-
then used sub-strings within affiliation strings to
ples were sized to evaluate errors in the assign-
find matches to existing institutions, and finally
ment of institutions rather than countries, which,
generated a list of individual words that
once institution-level analysis was removed from
appeared most frequently in ‘unknown’ affilia-
the study, became irrelevant.) After another
tion strings. We searched this list for words indi-
round of corrections that assigned countries to
cating an affiliation that was at least as specific
14,690 authors in the ‘unknown’ category, we
as a country (e.g. ‘Italian,’ ‘Boston,’ ‘Guang-
pulled another random sample of corrected
dong’) and associated any affiliation strings that
affiliations—using a 95% confidence interval, the
included that word with an institution in the cor-
sample size required to detect 96.5% assign-
responding country. Finally, we evaluated any
ment accuracy with a 2% margin of error was cal-
authors still in the ‘unknown’ category by search- culated to be 325 (Naing et al., 2006). Manually
ing for the presence of a country-specific top- evaluating the country assignments of this sam-
level domain in their email addresses—for exam- ple showed the country-level accuracy of the
ple, uncategorized authors with an email corrected affiliations was 95.7 ± 2.2%.
address ending in ‘.nl’ were assigned to the Though preprints that were assigned a coun-
Netherlands. Generic domains such as ‘.com’ try could be categorized with high accuracy, we
were not categorized, with the exception of ‘. also sought to characterize the preprints that
edu,’ which was assigned to the US. While these remained in the ‘unknown’ category after correc-
corrections would have negatively impacted the tions, to evaluate whether there was a bias in
institution-level accuracy, it was a more practical which preprints were categorized at all. Among
approach to generate country-level the successfully classified preprints, the distribu-
observations. tion across countries is heavily skewed—the 27
There were also corrections made that placed most prolific countries (15%) account for 95.3%
more affiliations into the ‘unknown’ category— of categorized preprints. Accordingly, character-
there is an ROR institution called ‘Computer Sci- izing the prevalence of individual countries
ence Department,’ for example, that contained would require an impractically large sample
spurious assignments. Prior to correction, 23,158 made up of a large portion of all uncategorized
(17%) distinct affiliation strings were categorized preprints. Instead, we split the countries into
as ‘unknown,’ associated with 71,947 authors. two groups: the first contained the 27 most pro-
After manual corrections, there were 20,099 lific countries. The second group contained the
unknown affiliation strings associated with remaining 148 countries, which account for the
51,855 authors. remaining 2960 preprints (4.7%). We used this
Nuñez MA, Barlow J, Cadotte M, Lucas K, Newton E, Sarabipour S, Debat HJ, Emmott E, Burgess SJ,
Pettorelli N, Stephens PA. 2019. Assessing the uneven Schwessinger B, Hensel Z. 2019. On the value of
global distribution of readership, submissions and preprints: an early career researcher perspective. PLOS
publications in applied ecology: obvious problems Biology 17:e3000151. DOI: https://doi.org/10.1371/
without obvious solutions. Journal of Applied Ecology journal.pbio.3000151, PMID: 30789895
56:4–9. DOI: https://doi.org/10.1111/1365-2664.13319 Šavrič B, Patterson T, Jenny B. 2019. The equal earth
Okike K, Kocher MS, Mehlman CT, Heckman JD, map projection. International Journal of Geographical
Bhandari M. 2008. Nonscientific factors associated Information Science 33:454–465. DOI: https://doi.org/
with acceptance for publication in the Journal of Bone 10.1080/13658816.2018.1504949
and Joint Surgery (American Volume). The Journal of Schwarz GJ, Kennicutt RC. 2004. Demographic and
Bone and Joint Surgery-American Volume 90:2432– citation trends in astrophysical journal papers and
2437. DOI: https://doi.org/10.2106/JBJS.G.01687, preprints. arXiv. http://arxiv.org/abs/astro-ph/
PMID: 18978412 0411275.
Penfold NC, Polka JK. 2020. Technical and social Sever R, Roeder T, Hindle S, Sussman L, Black K-J,
issues influencing the adoption of preprints in the life Argentine J, Manos W, Inglis JR. 2019. bioRxiv: the
sciences. PLOS Genetics 16:e1008565. DOI: https:// preprint server for biology. bioRxiv. DOI: https://doi.
doi.org/10.1371/journal.pgen.1008565, PMID: 32310 org/10.1101/833400
942 South A. 2017. Rnaturalearth: world map data from
PLOS. 2019. Trends in preprints. https://plos.org/ natural earth. R package version 0.1.0. The R
blog/announcement/trends-in-preprints/ [Accessed Foundation. https://CRAN.R-project.org/package=
July 14, 2020]. rnaturalearth [Accessed July 14, 2020].
R Core Team. 2019. R: a language and environment Vence T. 2017. Journals seek out preprints. The
for statistical computing (Version 3.6.2). http://r- Scientist. https://www.the-scientist.com/news-opinion/
project.org [Accessed July 14, 2020]. journals-seek-out-preprints-32183 [Accessed July 14,
Research Organization Registry. 2019. ROR API. 2020].
Github. a3b153c. https://github.com/ror-community/ Wickham H. 2016. Ggplot2: Elegant Graphics for Data
ror-api Analysis. Springer. DOI: https://doi.org/10.1007/978-0-
Riesenberg D. 1990. The order of authorship: who’s 387-98141-3
on first? JAMA 264:1857. DOI: https://doi.org/10. Wong JC, Fernandes KA, Amin S, Lwin Z,
1001/jama.1990.03450140079039 Krzyzanowska MK. 2014. Involvement of low- and
Ross JS, Gross CP, Desai MM, Hong Y, Grant AO, middle-income countries in randomized controlled trial
Daniels SR, Hachinski VC, Gibbons RJ, Gardner TJ, publications in oncology. Globalization and Health 10:
Krumholz HM. 2006. Effect of blinded peer review on 83. DOI: https://doi.org/10.1186/s12992-014-0083-7,
abstract acceptance. JAMA 295:1675–1680. PMID: 25498958
DOI: https://doi.org/10.1001/jama.295.14.1675, Woodruff A, Brewer C. 2017. Colorbrewer. Github. 2.
PMID: 16609089 0. https://github.com/axismaps/colorbrewer
Saposnik G, Ovbiagele B, Raptis S, Fisher M, Johnston Wuchty S, Jones BF, Uzzi B. 2007. The increasing
SC. 2014. Effect of english proficiency and research dominance of teams in production of knowledge.
funding on acceptance of submitted articles to Stroke Science 316:1036–1039. DOI: https://doi.org/10.1126/
journal. Stroke 45:1862–1868. DOI: https://doi.org/10. science.1136099, PMID: 17431139
1161/STROKEAHA.114.005413, PMID: 24876265