Database Resources of The National Center For Biotechnology Information
Received September 15, 2020; Revised September 25, 2020; Editorial Decision September 28, 2020; Accepted October 08, 2020
* To whom correspondence should be addressed. Tel: +1 301 496 2475, Fax: +1 301 480 9241; Email: sayers@ncbi.nlm.nih.gov
Genomes
Nucleotide 429 731 711 DNA and RNA sequences
BioSample 14 628 076 descriptions of biological source materials
SRA 11 807 161 high-throughput DNA and RNA sequence read archive
Taxonomy 2 401 136 taxonomic classification and nomenclature catalog
Genes
GEO Profiles 128 414 055 gene expression and molecular abundance profiles
Gene 28 377 759 collected information about gene loci
GEO datasets 4 002 373 functional genomics studies
PopSet 350 627 sequence sets from phylogenetic and population studies
HomoloGene 141 268 homologous gene sets for selected organisms
Genetics
SNP 720 643 623 short genetic variations
dbVar 6 030 887 genome structural variation studies
ClinVar 845 008 human variations of clinical significance
MedGen 335 277 medical genetics literature and links
GTR 76 814 genetic testing registry
dbGaP 1 397 genotype/phenotype interaction studies
Proteins
Protein 874 272 642 protein sequences
Identical protein groups 329 946 078 protein sequences grouped by identity
Protein clusters 1 137 329 sequence similarity-based protein clusters
Structure 167 650 experimentally-determined biomolecular structures
Sparcle 149 462 conserved domain architectures
Conserved domains 59 951 conserved protein domains
Chemicals
PubChem substance 285 048 146 deposited substance and chemical information
PubChem compound 111 325 418 chemical information with structures, information and links
PubChem BioAssay 1 229 071 bioactivity screening studies
BioSystems 983 968 molecular pathways with links to genes, proteins and chemicals
RECENT DEVELOPMENTS
Literature updates
PubMed. After previewing an updated version of PubMed in 2019, we activated this updated version as the default system in May 2020. Among the numerous enhancements is a responsive layout that offers better support for accessing PubMed content on increasingly popular small-screen devices such as mobile phones and tablets. The interface is compatible with any screen size and provides a fresh, consistent look and feel throughout the application, no matter how one accesses it. Search results can now be sorted using ‘Best Match’ ordering that employs a machine learning algorithm to help users find relevant citations quickly (7). Search results also include ‘snippets’, highlighted text fragments from the article abstract that are selected based on their relatedness to the query. These snippets give users additional information to help them decide if an article is useful. Additional improvements to the interface make it easier to discover related content such as similar articles, references and citations.
Since 2012 PubMed has allowed users to search for a distinct author, and not merely for an author name, through an automatic author name disambiguation algorithm (8). We recently enhanced this algorithm by leveraging the significant growth of ORCID use in PubMed articles. Users can search PubMed directly with ORCIDs using the following syntax: 0000-0001-6166-3199[auid]. Additionally, to further strengthen how PubMed handles synonyms, we developed and implemented an updated algorithm to obtain word synonym pairs that improve PubMed indexing and retrieval (9).
The updated version of PubMed takes advantage of several new technologies to improve the user experience. The underlying document data indexed in the updated version is a merger of content from PubMed, Bookshelf and PubMed Central (PMC). This combined dataset allows us to display relevant information not previously available in a PubMed
D12 Nucleic Acids Research, 2021, Vol. 49, Database issue
[Figure 1: bar chart. Vertical axis: Annual Growth (0.00% to 90.00%); horizontal axis: Entrez databases.]
Figure 1. Annual growth rates of the number of records in each Entrez database as of 9 September 2020.
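The ORCID search syntax described in the PubMed section (an ORCID followed by the [auid] field tag) can also be assembled programmatically. The following is a minimal sketch, not an NCBI-provided client: it assumes that the same field tag is accepted by the E-utilities ESearch endpoint, and the helper names orcid_term and esearch_url are hypothetical names introduced only for this illustration. The code builds the request URL and does not contact any server.

```python
import re
from urllib.parse import urlencode

# E-utilities ESearch endpoint. Using it for [auid] queries is an assumption
# of this sketch; the article itself only describes the PubMed search box.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# An ORCID iD is four hyphen-separated groups of four characters;
# the final character may be the checksum letter X.
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_term(orcid: str) -> str:
    """Return a PubMed query term such as '0000-0001-6166-3199[auid]'."""
    if not ORCID_PATTERN.fullmatch(orcid):
        raise ValueError(f"not a well-formed ORCID iD: {orcid!r}")
    return f"{orcid}[auid]"


def esearch_url(term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for a PubMed query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # The ORCID below is the example given in the article text.
    print(esearch_url(orcid_term("0000-0001-6166-3199")))
```

Keeping the query term and the URL construction in separate helpers means the same term string can be pasted into the PubMed search box directly, which is the usage the article describes.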
record, such as reference citations from PMC. While legacy PubMed limited the number of variants for a wildcard (‘*’) search, PubMed is now capable of unlimited wildcard searches thanks to Solr (https://lucene.apache.org/solr/), the open-source enterprise search system that PubMed now uses for document indexing. Users will find that PubMed now has greater scalability and reliability, provided not only by Solr, but also by the MongoDB storage solution and the modern cloud architecture that together ensure both redundancy between data centers and also trustworthy backup environments. When visiting PubMed, users will enjoy a modern web experience using the latest web technologies and standards, all provided by the Django web framework.
PubMed Central (PMC). PMC continued to expand access to biomedical and life science literature over the past year, with the corpus now including more than 6 million journal articles and author manuscripts. Reflecting the NLM’s ongoing commitment to public access to research results supported by the NIH and other funding partners, we released a new NIH Manuscript Submission (NIHMS) system in January 2020. The new NIHMS system streamlines the author manuscript submission process to PMC and offers more transparent options to aid authors and investigators in avoiding processing delays and ensuring timely compliance with NIH policy. In the first 6 months after release, we received nearly 40 000 successful submissions (https://www.nihms.nih.gov/about/statistics/).
PMC launched the Public Health Emergency COVID-19 Initiative (https://www.ncbi.nlm.nih.gov/pmc/about/covid-19/) in March 2020, which helped enable the creation of the COVID-19 Open Research Dataset (CORD-19) hosted by the Allen Institute. This initiative resulted from a call made by the national science and technology advisors of a dozen countries, including the US, to support ongoing public health emergency response efforts. Specifically, this call asked publishers and societies to agree voluntarily to make their publications and supporting data related to COVID-19 and the novel coronavirus immediately accessible in PMC and other appropriate public repositories. As of August, this initiative included more than 50 publishers and has added or updated the licenses on 80 000 articles in PMC to support secondary re-use and analysis.
On 9 June 2020, NLM launched a pilot project to test the viability of allowing users to obtain from PMC preprints resulting from NIH-funded research (https://www.ncbi.nlm.nih.gov/pmc/about/nihpreprints/). The primary goal of this NIH Preprint Pilot is to explore approaches to increase the discoverability of early NIH research results. Following standard NLM practice, citations for these preprint records are available in PubMed to increase the discoverability of this content. To ensure transparency, large banners clearly identify these preprint records in PMC and PubMed. The banners explain that the papers have not been peer reviewed, and link to information about the pilot for additional context. The pilot will run for a minimum of 12 months and will focus on preprints that relate to the current COVID-19 pandemic. Lessons we learn during that time will inform future NLM efforts involving preprints.
Bookshelf. The NCBI Bookshelf provides free online access to over 8500 books and documents in life science and healthcare from over 150 content providers. In the past year, we migrated the content management of two large full-text toxicology databases into its archive. The first, LactMed, is an NLM database of over 1500 peer-reviewed summaries
containing information on drugs and other chemicals to which breastfeeding mothers may be exposed. It also includes information on the levels of such substances in breast milk and infant blood, and the possible adverse effects in the nursing infant. The second, LiverTox, is an NIDDK database of over 1100 peer-reviewed documents containing information on the diagnosis, cause, frequency, clinical patterns and management of liver injury attributable to both prescription and nonprescription medications and also selected herbal and dietary supplements. Migrating these full-text databases into the Bookshelf increases their discoverability in two ways: the citations for each full-text LactMed and LiverTox summary record are now available
Multiple Sequence Alignment Viewer (MSAV). SV is available as a standalone application (https://www.ncbi.nlm.nih.gov/projects/sviewer/) and also appears as a fully-functional embed on many NCBI pages, including Gene and SNP records and the NCBI Variation Viewer. SV is also the graphical viewer for Nucleotide and Protein records and the BLAST (10) and Primer-BLAST (11) results pages. MSAV (https://www.ncbi.nlm.nih.gov/projects/msaviewer/) displays multiple sequence alignments created by NCBI BLAST, COBALT and NCBI Virus pages, and also displays custom alignments from researchers.
In the last year, we have further enhanced the sequence and data download capabilities within SV (https://www.
Sequence Read Archive. NCBI maintains the NIH Sequence Read Archive (SRA), an archival database designed to support storage, retrieval and analysis of next-generation nucleotide sequence data. This archive now includes 8.8 Petabytes of publicly available data and another 4.6 Petabytes of controlled-access dbGaP data. While the archive holds tremendous promise for biomedical research, the sheer size of this dataset makes it difficult to store, retrieve and analyze. NCBI remains committed to expanding SRA and improving access to the archive based on FAIR data principles.
As part of the NIH Science and Technology Research Infrastructure for Discovery, Experimentation and Sustainability Initiative (https://datascience.nih.gov/strides/), NCBI is now maintaining the entire SRA on two commercial cloud platforms, Amazon Web Services (AWS) and Google Cloud Platform (GCP). The cloud environment offers many advantages for the transfer and analysis of data (12), and the availability of SRA on cloud platforms should facilitate large-scale computational operations and collaborations. This idea has been supported by early pilot experiments in which thousands of metagenomic samples were analyzed for organismal content using cloud-adapted workflows and software (13). A further example is the set of COVID-focused datasets (including source and normalized SRA file formats) recently added to the AWS Public Dataset Program. These datasets provide researchers easy, no-cost access to more than 13 000 SRA runs that include Coronaviridae content identified by a k-mer-based approach using the SRA Taxonomy Analysis Tool.
We have made several additional updates that allow more effective use of cloud-hosted SRA data. By providing SRA run metadata and BioSample data on GCP BigQuery and AWS Athena, we give researchers more ways to identify sequence sets of interest. In addition to maintaining originally submitted source and SRA formatted data in the cloud, we are introducing normalized data formats that bin base quality scores to binary read assessments or align reads to references. These formats reduce file size and can expedite certain computations. We have also updated the SRA data location service and toolkit (https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit) to retrieve data from cloud locations in the desired data format.
Pathogens. The NCBI Pathogen Detection Project (https://www.ncbi.nlm.nih.gov/pathogens/) helps public health scientists investigate foodborne disease outbreaks by integrating pathogen genomic sequences obtained from cultured bacterial isolates and quickly clustering and identifying related sequences (14). As of August 2020, over 600 000 pathogen isolates covering 31 bacterial taxa and one emerging fungal pathogen, Candida auris, are actively being analyzed. We make these analysis results available in the Isolates Browser on a daily basis (https://www.ncbi.nlm.nih.gov/pathogens/isolates). As part of our effort to improve the compliance of the pipeline results (15) with FAIR principles, we annotate the generated assemblies using PGAP (16), submit them to GenBank and incorporate them into the NCBI Assembly resource. These assemblies link back to the Isolates Browser and also to the SNP Tree Viewer if that particular isolate is a member of a clonally related cluster. The Isolates Browser allows users to subset isolates and then download the assembled sequence and/or annotation of the submitted assemblies.
to the ClinVar Submission Portal so that all of the required fields in a file submission are validated before the file is submitted to ClinVar. As of August 2020, submitters also have the option to stop a submission and correct errors immediately, or to submit all records that pass validation and receive a report of any failures to correct later.
In January 2020, we added a new feature to ClinVar that allows a user to ‘follow’ a particular variant and be notified if the overall clinical interpretation in ClinVar changes, for example from a pathogenic category to a non-pathogenic one. This feature makes it easier for a laboratory to become aware of variants that may need to be re-evaluated, and for clinicians to know when they should contact their clinical
of this feature, users will need a recent version of the BLAST+ package (2.9.0 release or later). The older BLAST database version (v4) has been deprecated.
The new BLAST databases are available on the NCBI FTP site as well as on GCP and AWS cloud providers, and all three sites now offer the same 23 databases. These databases range from a Betacoronavirus collection (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&BLAST_SPEC=Betacoronavirus) to RefSeq representative genomes for eukaryotes, prokaryotes and viruses. Also included are the default NCBI nucleotide collection (nt), the non-redundant protein (nr) database, and genome assemblies for human
between proteins and ligands or other proteins, visualization of electrostatic potentials as computed by Delphi (21) and visualization of membrane bilayer location relative to membrane protein structures as provided by OPM (22). Finally, a Jupyter notebook widget version of iCn3D (called icn3dpy) enables users to view 3D structures in a Jupyter environment. iCn3D is available at https://github.com/ncbi/icn3d, and novel features are demonstrated on the gallery page at https://www.ncbi.nlm.nih.gov/Structure/icn3d/icn3d.html#gallery.
Chemical updates
Twitter and LinkedIn) and the several mailing lists and RSS feeds that provide updates on services and databases. Links to these resources are in the NCBI page footer and on NCBI Insights.
ACKNOWLEDGEMENTS
The authors would like to thank all of the NCBI staff who through their dedicated efforts continue to allow NCBI to provide our full collection of services to the community.
FUNDING
et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res., 41, W29–W33.
11. Ye,J., Coulouris,G., Zaretskaya,I., Cutcutache,I., Rozen,S. and Madden,T.L. (2012) Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics, 13, 134.
12. Langmead,B. and Nellore,A. (2018) Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet., 19, 208–219.
13. Connor,R., Brister,R., Buchmann,J.P., Deboutte,W., Edwards,R., Marti-Carreras,J., Tisza,M., Zalunin,V., Andrade-Martinez,J., Cantu,A. et al. (2019) NCBI’s virus discovery Hackathon: Engaging research communities to identify cloud infrastructure requirements. Genes (Basel), 10, 714.
14. NCBI Resource Coordinators. (2017) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 45, D12–D17.
15. Sayers,E.W., Agarwala,R., Bolton,E.E., Brister,J.R., Canese,K.,