A Comprehensive Review of Database Resources in CH
iq.unesp.br/ecletica
| Vol. 45 | n. 3 | 2020 |
Corresponding author: Syed Sauban Ghani, Phone: +96658392-0235, Email address: Syed_SG@jic.edu.sa
It is essential to understand how the data are organized and interconnected in order to search any database effectively. Databases are broadly divided into two major groups, full-text and structured, based on the category of data they contain [4]. A full-text database is generally a set of documents for which indexes are created to facilitate fast searching. This type of database is commonly run by publishers of magazines and books, by patent offices for patents, or by academic institutions. The largest database of this type in the world is operated by Google; its documents are the web pages and other accessible files uploaded to the internet [5]. Structured databases, on the other hand, normally consist of a set of tables containing records (rows), all of which have the same structure, defined by a set of fields (columns) [6]. Each record is assigned a unique "ID" or "Number", called an identifier or primary key, by which it is easily referenced. The Chemical Abstracts Service Registry Number (CAS RN) is an example of the primary key for a structure in the REGISTRY database [7]. Structured databases are generally classified by content into two large groups: bibliographic and factographic. Bibliographic databases usually do not hold the full text of a document; instead, they record information about individual publications, patents, and similar documents [8]. Typical fields in bibliographic records are author, article title, journal name, volume, issue, year of publication, and pages. The Digital Object Identifier (DOI) is a comparatively new field, which describes the unique location of a document on the Web [9]. Bibliographic databases are secondary sources that interpret, analyze, and summarize primary source information to increase usability and speed of delivery, much as an online encyclopedia does. Factographic databases, in contrast, consist of specific information extracted from primary documents; in chemistry, these are details about chemical reactions and chemical compounds, such as toxicological, spectral, physical, or chemical characterization [10]. To be regarded as an ideal database system, a database must allow the user to search for records by all field values and to build search queries with logical operators. Additionally, thorough knowledge of the database structure, the syntax of the search language, and specific IT skills are prerequisites for this search method. A method frequently adopted by database creators is the inclusion of "forms" that already carry the names of the usual fields, such as "bibliographic data" or "physical parameters", to give a user-friendly interface.

Search by chemical identifier is probably the simplest kind of keyword search, with the difference that chemical identifiers are a little more difficult to define. However, freely available programs (e.g. ChemSketch or MarvinSketch) can generate these identifiers if the user is able to enter the structure of the compound [11]. The most common search for information on chemical compounds or their chemical products is the search for structures or substructures. Less commonly, structured databases are also searched by the chemical, physical, or biological properties of the compounds. Many considerations are involved in the construction and searching of chemical databases. Chemical structures differ significantly from other entities commonly stored in databases, such as text, and the search modes therefore differ significantly too, although some parallels can be drawn. Different databases exist because each has its own function, and none of them is entirely a subset of any other. The next step for any chemical database is to use a specialized structure editor to create a search query that specifies the chemical structure or substructure of the compound sought. JME Editor, a Java applet, was the most widely used structure editor of this kind, but over the last couple of years or more this technology has started to be phased out [12]. Consequently, the creators of chemical databases opt for JavaScript-based editors, which are recognizably the technology of the future. The best of the current structure editors is Marvin JS, which is widely used in Reaxys. The most notable commercially available program for drawing chemical formulas is ChemDraw, which is marketed by CambridgeSoft. The most recent version of this editor permits the user to search for the drawn structures directly in SciFinder [13].

The most popular and widely used commercial chemical database resources are the web applications Scopus, Reaxys, SciFinder, and Web of Science (WoS), in which the above search technologies are available and which are most closely related to this article. The chemical databases are nowadays searched to give novel
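The description of structured databases given earlier (records with a fixed set of fields, a unique primary key such as the CAS RN, and queries that combine field values with logical operators) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module; the table name, column names, and the helper functions lookup and search_alcohols_below are hypothetical, and the boiling points are rounded values included only for illustration. It does not reflect the schema of REGISTRY or of any other database discussed here.

```python
import sqlite3

# Hypothetical schema for illustration only; no real chemical database is modelled.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE compound (
        cas_rn          TEXT PRIMARY KEY,  -- unique identifier of each record
        name            TEXT NOT NULL,
        formula         TEXT NOT NULL,
        boiling_point_c REAL               -- rounded, illustrative values
    )
    """
)
conn.executemany(
    "INSERT INTO compound VALUES (?, ?, ?, ?)",
    [
        ("7732-18-5", "water", "H2O", 100.0),
        ("64-17-5", "ethanol", "C2H6O", 78.4),
        ("67-56-1", "methanol", "CH4O", 64.7),
    ],
)


def lookup(cas_rn):
    """Fetch a single record by its primary key; returns None if absent."""
    cursor = conn.execute(
        "SELECT name, formula FROM compound WHERE cas_rn = ?", (cas_rn,)
    )
    return cursor.fetchone()


def search_alcohols_below(temp_c):
    """Combine two field conditions with the logical operators AND and NOT."""
    cursor = conn.execute(
        "SELECT name FROM compound "
        "WHERE name LIKE '%ol' AND NOT (boiling_point_c >= ?) "
        "ORDER BY name",
        (temp_c,),
    )
    return [row[0] for row in cursor]


print(lookup("64-17-5"))
print(search_alcohols_below(70.0))
```

The primary key makes the first query an exact, unambiguous lookup, whereas the second shows how a field-based query narrows a result set by combining conditions, which is the kind of search a form-based interface typically builds for the user.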
The open Web comprises a rich pool of various chemical data sources if the user knows where to look. For many years the emerging chemical databases were dominated by a handful of established players, but the field has now practically opened up to a variety of innovative newcomers. Although some of the original databases are no longer active, it is inspiring to see that several of them continue to run and even flourish. It is of course likely that many more services will be created and that some of them will become irrelevant in the coming years. The Internet now offers a varied range of free online chemistry databases, and this list is being continuously updated with new information and new entries. The following list summarizes some of the databases that are freely available to users.

2.2. Crystallography Open Database (COD)

Crystallography Open Database (COD) is an open-access collection of crystal structures of organic, inorganic, and metal-organic compounds and minerals, excluding biopolymers. This database is specifically designed to store information about the structure of molecules and crystals [18]. All data on this site have been placed in the public domain by the contributors. The COD can provide a link to a Crystallographic Information File (CIF) if one is available somewhere on the internet. The Crystallography Open Database has more than 360,000 entries from various contributors and contains CIFs as prescribed by the International Union of Crystallography [19]. The COD website, http://www.crystallography.net, provides facilities for all registered users to deposit published or unpublished crystallographic structures as personal communications or pre-publication depositions. Having such a
setup enables extension of the COD database by several users simultaneously. It also increases the chances for growth of the COD database and may be considered one step towards creating a worldwide Internet-based collaborative platform committed to the collection and curation of structural knowledge. Each structure deposited into the COD receives a unique seven-digit number, called the COD number, which identifies a particular instance of a structure determination. In general, COD does not accept duplicate structures.

2.3. PubMed

PubMed is a freely accessible web interface (available since 1997) designed to search for records located primarily in the MEDLINE database of references and abstracts. It comprises more than 28 million citations for biomedical literature from MEDLINE, life science journals, and online books [20]. Citations may include links to full-text content from PubMed Central and publisher web sites. In addition to MEDLINE, PubMed also provides access to older references from the print version of Index Medicus dating back to 1951 or earlier. This bibliographic database indexes journal articles and other primary sources related to medicine. It also contains information about publications in the fields of medicinal chemistry and biochemistry. The PubMed identifier (PMID) is the primary key used in PubMed to identify a record in this database. The tools provided in PubMed facilitate saving searches, filtering search results, saving sets of references retrieved as part of a PubMed search, configuring display formats of search results, and an extensive range of further options. PubMed also highlights records with recent increases in activity.

2.4. ZINC

ZINC is a free database of commercially available compounds for virtual screening. It has brought virtual screening libraries to a broad range of structural biologists and medicinal chemists. It contains more than 35 million available compounds in ready-to-dock, 3D formats [21]. Structure-based virtual screening, which ZINC supports, has had numerous significant successes in recent years and is now a common technique in the initial stage of drug discovery at many pharmaceutical companies. ZINC can be downloaded from the website http://zinc.docking.org. It is currently built from the catalogues of ten major compound vendors and is provided in several common file formats, including SMILES, mol2, 3D SDF, and DOCK flexibase format, and the number of molecules in ZINC is continuously growing. The database organizes data relationally to meet the objectives of efficient loading, incremental updates, querying, and data subsetting, and this design makes these operations fast and efficient. Exporting subsets directly from the database can be slow; this problem has been resolved by exporting the molecule subsets into ready-to-download compressed files, with database-intensive work scheduled in batch mode. This bypasses the relational database entirely, so subsets download quickly once they are ready. The web-based interface is fast, supports moderately complex queries, and lets users search ZINC by several criteria. The ZINC server also enables users to upload and process their own molecules, since users often need to dock molecules, such as positive and negative controls, that are not part of the existing database [22]. ZINC is therefore useful for virtual screening by experts and non-specialists alike and helps more researchers to attempt computational ligand discovery.

2.5. ChemSpider

ChemSpider is an open-access chemical structure database that provides rapid text and structure search access to over 67 million structures from hundreds of data sources. ChemSpider is one of the chemistry community's primary online public compound databases. It serves data to tens of thousands of chemists every day and lays the foundation for many important international projects to integrate chemistry and biology data, facilitate drug discovery efforts, and help to identify new chemicals from under the ocean [23]. It is not just a search engine built on terabytes of chemistry data but also a crowdsourcing community for chemists who contribute their data, skills, and knowledge to the enhancement and curation of the database. ChemSpider can therefore be said to resemble Wikipedia in encouraging participation and contributions from the
scientific community. ChemSpider links to open- and closed-access chemistry journals, environmental data, PubChem, Chemical Entities of Biological Interest (ChEBI), chemical vendors, Wikipedia, the Kyoto Encyclopedia of Genes and Genomes (KEGG), and a few patent databases [24]. These links allow a ChemSpider user to collect information of interest, such as where to buy a chemical, chemical toxicity, metabolism data, and so on. Amassing this level of related information through a general search engine like Google or Bing is a time-consuming process. Additional features have been added to each of the chemical structures within the database, such as structure identifiers like SMILES, InChI, IUPAC and Index Names, as well as many physico-chemical properties [25]. ChemSpider also offers access to a series of property prediction algorithms. The user can access this database at http://www.chemspider.com/. The ChemSpider homepage as it appears on the desktop is shown in Fig. 1. The provider of this service has been the Royal Society of Chemistry (RSC) since 2009, which adds value to its other useful services.

For a known compound, ChemSpider provides extensive information, such as all its possible names and identifiers (both standard and nonstandard), experimental and calculated physicochemical properties, toxicity and biological activity data, spectra (NMR, IR, MS, UV-vis), publications, patents, etc. The information available depends on what has been gathered from the original sources, to which links are provided. The role of ChemSpider is to gather information about all the compounds available on the web in one central location, make it easy to search, and standardize their structures and names. It also improves the quality of chemical sources through automated structure checking and manual curation by collaborating experts, and provides a platform for data input and storage. Additionally, it aims to make all data easy to access through a web interface optimized for mobile devices, mobile applications, and web services for data capture. It also integrates data into RSC publications through links and uses validated chemical names to search in Google Scholar, PubMed, and RSC books, journals, and databases.
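Structure identifiers such as InChI are line notations that encode a structure as text, which is what makes them searchable like keywords. As a rough illustration, the sketch below splits a standard InChI string into its slash-separated layers. The function name split_inchi is hypothetical, and the parser is deliberately simplified: it assumes the second segment is the formula layer and that every later layer starts with a one-letter prefix, which does not hold for every valid InChI. It is not how ChemSpider or any InChI library processes identifiers.

```python
def split_inchi(inchi):
    """Split an InChI string into layers (simplified, illustrative parser)."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}        # e.g. "1S" for a standard InChI
    if len(parts) > 1:
        layers["formula"] = parts[1]      # assumed to be the formula layer
    for part in parts[2:]:
        if part:
            # The first character names the layer: "c" = connectivity,
            # "h" = hydrogen atoms, and so on.
            layers[part[0]] = part[1:]
    return layers


# Ethanol as an example identifier.
print(split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Because the layers are ordered from general (formula) to specific (connectivity, hydrogens, and further layers not shown here), two records can be compared at different levels of detail, which is one reason such identifiers are useful for linking the same compound across data sources.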
2.6. Google Scholar

Google Scholar is a freely available web-based search engine, available since 2004, that indexes the full text or metadata of scholarly literature across a range of publishing formats and disciplines. The indexed resources include online journals, conference papers, books, dissertations, theses, patents, and other significant literature. It is estimated to contain approximately 389 million documents, comprising articles, citations, and patents, which made it the world's largest academic search engine in 2018 [26]. Google Scholar has become indispensable for research and research dissemination, providing a systematized and instant process that users can build on through a kind of digital snowballing for literature retrieval. Google's strength can be attributed to its sophisticated natural language processing. In addition to searching, Google Scholar users can create a personal profile with a list of their own publications and can generate citation statistics and h-indexes like those of Web of Science. Nevertheless, if a user wishes to run a structured query on the field values of the bibliographic record or to find documents that have not been published, it is preferable to resort to paid databases. The difficulties faced by Google Scholar users are that they do not know when it is updated, that it includes old articles, and that no suggestions are provided for limiting searches.

3. Results and Discussion

The three major commercial web database applications that are widely used in the field of chemistry are Scopus, Web of Science (WoS), and SciFinder. All three databases offer extensive search options and are remarkably similar in their chemical content as well as in their search mode, search effectiveness, and interface. They have periodically undergone significant overhauls and compete intensely with one another. This competition has led to improvements in the services they offer, which is advantageous for users. As these databases are expensive, it is not feasible to subscribe to all of them; scientific libraries must therefore decide which citation database will meet the needs of their users most effectively. Despite the similarities between them, there are still differences that are worth a detailed analysis.

3.1. Scopus

Scopus, launched by Elsevier in 2004, is the largest abstract and citation database of peer-reviewed literature. It covers more than 49 million records, including scientific journals, conference proceedings, and books [27]. Scopus offers a comprehensive overview of global research output in science, engineering, medicine, the humanities, and the social sciences. It is the leading searchable citation and abstract source for literature searching and is continuously expanded and updated. Scopus offers smart tools with sorting and refining features to track, analyze, and visualize research across more than 27 million citations and abstracts dating back to the 1960s [28]. Researchers across the globe report that the use of Scopus has a positive influence on their research, as it is easy to use, saves time, and provides quality results. The content on Scopus comes from over 5,000 publishers and is reviewed by an independent Content Selection and Advisory Board (CSAB) before being selected for indexing. The metadata provided by publishers includes the following: author names, affiliations, document title, volume, issue, pages, year, electronic identifier (EID), source title, citation count, document type, and digital object identifier (DOI). This metadata is integrated into different websites and platforms, which allows more precise searching and retrieval of scientific information. Scopus records the International Standard Serial Number (ISSN) for journals, conference series, and book series, and the International Standard Book Number (ISBN) for one-time conference or book publications [29]. An overview of the working pattern of Scopus is given in Fig. 2.

However, the coverage of a journal by Scopus may be discontinued for a certain period; that is, coverage has gaps for some journals and is only partial for others. Several studies in the literature describe the main features of Scopus in detail and compare the databases with the aim of assessing the number of citations obtained by a particular set of documents in each of them. Studies have analyzed the set of journals covered by each database as well as their interface accessibility and usability compared with Scopus, from the point of view of the number of items included and of testing the breadth of coverage. The rankings from Scopus and WoS match at the top and the bottom but deviate considerably in the middle positions [30]. If the user is familiar with search devices such as drop-down boxes and check boxes, Scopus is easy to navigate, even for a beginner. Scopus has some notable features: it allows the user to go both forwards and backwards in time by linking to both citing and cited documents; it can link to the publisher's web site to view the document; its citation matching is so accurate that 99% of citing references and citing articles match exactly; and it works in all common web browsers, such as Chrome, Internet Explorer, and Mozilla Firefox.
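Identifiers such as the ISSN that accompany journal records carry a built-in check digit, which lets a database or a user catch mistyped values before searching. The sketch below validates an ISSN with the standard modulus-11 rule (weights 8 down to 2 on the first seven digits, with "X" standing for a check value of 10). The function name issn_is_valid is hypothetical, the example ISSNs are included only for illustration, and nothing here is taken from the way Scopus itself processes identifiers.

```python
def issn_is_valid(issn):
    """Return True if the ISSN has a correct modulus-11 check digit."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False

    # Weight the first seven digits 8, 7, ..., 2 and sum the products.
    total = sum(int(digit) * weight
                for digit, weight in zip(compact[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return compact[7] == expected


print(issn_is_valid("0028-0836"))  # well-formed ISSN
print(issn_is_valid("0028-0835"))  # same digits with a wrong check digit
```

A check of this kind only confirms that the identifier is well formed; whether the serial is actually covered by a given database still has to be established by searching that database.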
Chemicus and Current Chemical Reactions permit the creation of structure drawings, consequently allowing users to locate chemical compounds and reactions.

A key feature of the Data Citation Index (DCI) on WoS is the ability to search directly through millions of records from hundreds of evaluated data repositories in the Sciences, Social Sciences, and Humanities [34]. Each DCI record links directly to the repository so that users can quickly access the associated research data. Citations to data sets are indexed so that the user can measure their impact as well as track their influence. The Data Citation Index offers a single point of access to research data from repositories across disciplines throughout the world. In this index, descriptive records are generated for data objects and linked to literature articles in the Web of Science. As data citation practices increase, the resource aims to provide a distinct picture of the full impact of research output and to act as an important tool for data attribution and discovery. WoS has the most advanced features for citation analysis. It allows the h-index, which is now extensively used to assess the quality of a scholar, to be determined, and, in contrast to other databases, it can search citations across multiple fields. The newer WoS features support searching by grant agency or grant number. It is also possible to export the retrieved records in various formats to a file or to the web-based version of EndNote, a personal bibliographic database. For subsequent import into another bibliographic database, it is appropriate to use the RIS structured export format, which is a recognized standard for this purpose. WoS is easiest to learn with the help of the video tutorials, which are also available to users abroad.
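To show what a RIS export of a bibliographic record looks like, the following minimal sketch serializes one journal article into RIS tagged lines. The function name to_ris and the dictionary keys are hypothetical; the tags used (TY, AU, TI, JO, VL, IS, SP, EP, PY, DO, ER) are common RIS tags, but individual databases may emit a different or larger set, so this should be read as an illustration of the format and not as the exact output of WoS. The sample record is reference [27] of this article.

```python
def to_ris(record):
    """Serialize one journal-article record as RIS tagged lines."""
    lines = ["TY  - JOUR"]                       # record type: journal article
    for author in record["authors"]:
        lines.append(f"AU  - {author}")          # one AU line per author
    lines.append(f"TI  - {record['title']}")
    lines.append(f"JO  - {record['journal']}")
    lines.append(f"VL  - {record['volume']}")
    lines.append(f"IS  - {record['issue']}")
    lines.append(f"SP  - {record['first_page']}")
    lines.append(f"EP  - {record['last_page']}")
    lines.append(f"PY  - {record['year']}")
    lines.append(f"DO  - {record['doi']}")
    lines.append("ER  - ")                       # end-of-record marker
    return "\n".join(lines) + "\n"


example = {
    "authors": ["Burnham, J. F."],
    "title": "Scopus database: a review",
    "journal": "Biomedical Digital Libraries",
    "volume": "3",
    "issue": "1",
    "first_page": "1",
    "last_page": "8",
    "year": "2006",
    "doi": "10.1186/1742-5581-3-1",
}
print(to_ris(example))
```

Because every field is carried by an explicit two-letter tag, a receiving reference manager can map the record onto its own fields without guessing, which is why a tagged format is preferable to a plain formatted citation for transfer between databases.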
and research as each chemical record retains the links to the original source of the material, thereby providing a micro-attribution; these links let a database user trace information of particular interest to its source. Each of the commonly used chemical databases presented here has at least some overlap with each of the others, yet each appears to have its own "niche". A user looking for variety should therefore pay attention to each of them. These databases thus seem to be a valuable resource for the chemical community, as they offer a large collection of compounds, either with related sample availability or with a diverse and unique structure set. As all the investigated databases have developed over the years, the detailed results presented for them essentially represent a snapshot in time. The description reported here may give a useful overview of some of the most important large chemical databases available. In PubChem, unique chemical structures are extracted from the Substance database and stored in the Compound database, which provides an aggregated view of information for a given chemical structure. The COD establishes a worldwide Internet-based collaborative platform committed to the collection and curation of structural knowledge. For PubMed, a general description including its content and unique characteristics was provided. ChemSpider provides a variety of information on a given compound, including physical and chemical properties, molecular structure, synthetic methods, spectral data, and systematic nomenclature, for millions of compounds in a single Web site. The ZINC database provides 3D molecules in several formats compatible with most docking programs. Google Scholar helps to identify the collection of publications for a specific research topic. There is a high correlation between the WoS and Scopus databases, both of which allow searching and sorting queries by chosen parameters such as first author, citation, and institution, as well as by impact factor and h-index. SciFinder meets its goal of effectively exploring the scientific literature, and its search results are mostly highly relevant and often astonishingly inclusive regardless of the level of complexity or syntax of the query. The database that ought to be used depends on the user and the desired information. Therefore, the user must investigate the up-to-date condition of the specific database before making a decision on the acquisition and usage of any of the databases presented here, as there may be changes in scope, configuration, vendor, etc. Libraries wishing to subscribe to a database should make their choice based on the needs of the library.

5. Acknowledgment

The author is thankful to Jubail Industrial College for providing institutional access to the commercial websites for downloading the research articles.

6. References

[1] Masic, I., Review of most important biomedical databases for searching of biomedical scientific literature, Donald School Journal of Ultrasound in Obstetrics and Gynecology 6 (4) (2012) 343-361. https://doi.org/10.5005/jp-journals-10009-1258.

[2] Walters, W. P., Stahl, M. T., Murcko, M. A., Virtual screening - an overview, Drug Discovery Today 3 (4) (1998) 160-178. https://doi.org/10.1016/S1359-6446(97)01163-X.

[3] Hunter, L., Cohen, K. B., Biomedical language processing: what's beyond PubMed? Molecular Cell 21 (5) (2006) 589-594. https://doi.org/10.1016/j.molcel.2006.02.012.

[4] Tenopir, C., Ro, J. S., Full Text Databases, Greenwood Press, Westport, 1990.

[5] Bar-Ilan, J., Which h-index? - A comparison of WoS, Scopus and Google Scholar, Scientometrics 74 (2008) 257-271. https://doi.org/10.1007/s11192-008-0216-y.

[6] Chang, K. C-C., He, B., Li, C., Patel, M., Zhang, Z., Structured databases on the web: Observations and implications, ACM SIGMOD Record 33 (3) (2004) 61-70. https://doi.org/10.1145/1031570.1031584.

[7] Dittmar, P. G., Stobaugh, R. E., Watson, C. E., The Chemical Abstracts Service Chemical Registry System. I. General Design, Journal of Chemical Information and Computer Sciences 16 (2) (1976) 111-121. https://doi.org/10.1021/ci60006a016.

[8] Wright, K., McDaid, C., Reporting of article retractions in bibliographic databases and online journals, Journal of the Medical Library Association 99 (2) (2011) 164-167. https://doi.org/10.3163/1536-5050.99.2.010.
[13] Mendelsohn, L. D., ChemDraw 8 Ultra, Windows and Macintosh Versions, Journal of Chemical Information and Computer Sciences 44 (6) (2004) 2225-2226. https://doi.org/10.1021/ci040123t.

[14] Bharti, N., Leonard, M., Singh, S., Review and Comparison of the Search Effectiveness and User Interface of Three Major Online Chemical Databases, Journal of Chemical Education 93 (5) (2016) 852-863. https://doi.org/10.1021/acs.jchemed.5b00601.

[15] Šilhánek, J., Comparisons of the most important chemistry databases - Scifinder program and reaxys database system, Chemicke Listy 108 (1) (2014) 81-106. https://projekty.upce.cz/sites/default/files/groups/admins/luva3059/2014_01_83-90.pdf.

[16] Wang, Y., Xiao, J., Suzek, T. O., Zhang, J., Wang, J., Bryant, S. H., PubChem: a public information system for analyzing bioactivities of small molecules, Nucleic Acids Research 37 (2 Suppl) (2009) W623-W633. https://doi.org/10.1093/nar/gkp456.

[17] Kim, S., Thiessen, P. A., Bolton, E. E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B. A., Wang, J., Yu, B., Zhang, J., Bryant, S. H., PubChem Substance and Compound databases, Nucleic Acids Research 44 (D1) (2016) 1202-1213. https://doi.org/10.1093/nar/gkv951.

[18] Gražulis, S., Daškevič, A., Merkys, A., Chateigner, D., Lutterotti, L., Quirós, M., Serebryanaya, N. R., Moeck, P., Downs, R. T., Le Bail, A., Crystallography Open Database (COD): an open-access collection of crystal structures and platform for world-wide collaboration, Nucleic Acids Research 40 (D1) (2012) D420-D427. https://doi.org/10.1093/nar/gkr900.

[22] Sterling, T., Irwin, J. J., ZINC 15 – Ligand Discovery for Everyone, Journal of Chemical Information and Modeling 55 (11) (2015) 2324-2337. https://doi.org/10.1021/acs.jcim.5b00559.

[23] Pence, H. E., Williams, A., ChemSpider: An Online Chemical Information Resource, Journal of Chemical Education 87 (11) (2010) 1123-1124. https://doi.org/10.1021/ed100697w.

[24] Hettne, K. M., Williams, A. J., van Mulligen, E. M., Kleinjans, J., Tkachenko, V., Kors, J. A., Erratum to: Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining, Journal of Cheminformatics 2 (4) (2010) 1-7. https://doi.org/10.1186/1758-2946-2-4.

[25] Williams, A., Tkachenko, V., The Royal Society of Chemistry and the delivery of chemistry data repositories for the community, Journal of Computer-Aided Molecular Design 28 (10) (2014) 1023-1030. https://doi.org/10.1007/s10822-014-9784-5.

[26] Zientek, L. R., Werner, J. M., Campuzano, M. V., Nimon, K., The Use of Google Scholar for Research and Research Dissemination, New Horizons in Adult Education & Human Resource Development 30 (1) (2018) 39-46. https://doi.org/10.1002/nha3.20209.

[27] Burnham, J. F., Scopus database: a review, Biomedical Digital Libraries 3 (1) (2006) 1-8. https://doi.org/10.1186/1742-5581-3-1.

[28] Bar-Ilan, J., Tale of Three Databases: The Implication of Coverage Demonstrated for a Sample Query, Frontiers in Research Metrics and Analytics 3 (6) (2018) 1-9. https://doi.org/10.3389/frma.2018.00006.