Biological Databases: Primary Databases, Secondary Databases, Entrez, GenBank, PubMed
Abstract
Advances in genomics and molecular biology created a need to store and organize rapidly
accumulating sequence data, and biological databases were developed for this purpose. Their
two main aims are data retrieval and information discovery. This article describes the
different types of biological databases, explains why these databases need to be
interconnected, and briefly covers the pitfalls of biological databases. The GenBank format
is also described.
Keywords: primary databases, secondary databases, Entrez, GenBank, PubMed
Introduction
As genomics advances, a large amount of raw sequence data is being generated. Biological
databases were established to collect, store and organize this raw data in one place. Once
stored, the raw data can be put to further use by organizing it in different ways.
Databases
A database can be defined as a computerized record or automated archive that stores and
organizes data. Data usually arrives in raw form and is processed and organized once it is in
the database. A database involves both computer hardware and software to manage the data.
The main purpose of establishing and developing databases is easy retrieval: because the data
is organized into structured records, it can be retrieved with little effort.
Three database structures are used by biological databases: flat files, object-oriented
databases and relational databases.
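As a rough illustration of how a flat file differs from a relational structure, the sketch below stores the same two records both ways. The record layout, the accession strings and the table name `entry` are hypothetical, and Python's built-in `sqlite3` module only stands in for a relational system; no real biological database is claimed to work this way.

```python
import sqlite3

# Hypothetical flat file: one record per line, fields separated by "|".
FLAT_FILE = """ACC0001|Homo sapiens|ATGGCC
ACC0002|Mus musculus|ATGGTT"""


def parse_flat_file(text):
    """Turn each line of the flat file into a dict of named fields."""
    records = []
    for line in text.splitlines():
        accession, organism, sequence = line.split("|")
        records.append(
            {"accession": accession, "organism": organism, "sequence": sequence}
        )
    return records


def load_relational(records):
    """Load the same records into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (:accession, :organism, :sequence)", records
    )
    return conn


records = parse_flat_file(FLAT_FILE)
conn = load_relational(records)

# Flat file: the program scans every record itself.
flat_hits = [r["accession"] for r in records if r["organism"] == "Mus musculus"]

# Relational: the database answers a declarative query.
sql_hits = [
    row[0]
    for row in conn.execute(
        "SELECT accession FROM entry WHERE organism = ?", ("Mus musculus",)
    )
]
```

Both lookups return the same accession; the difference is where the search logic lives.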
Primary databases
Primary databases are archives of raw sequence or structural data submitted by the
scientific community. The content is original and is usually controlled by the submitters.
Primary biological databases are also known as archival databases. Examples include
GenBank and the Protein Data Bank (PDB).
Primary sequence databases are large, publicly accessible databases that store the raw
DNA/RNA sequence data produced and submitted by the scientific community around the
world. GenBank, EMBL and DDBJ are examples, and all three are freely available to the public
on the internet. Most of the data is entered with a low level of annotation.
GenBank, EMBL and DDBJ collaborate closely with each other. Data submitted to them is
usually also destined for publication in scientific journals, and depositing it makes it
openly available. Together the three databases constitute the International Nucleotide
Sequence Database Collaboration. Any sequence uploaded to one of them can be accessed
from all three. Although the raw data is the same in all three databases, their formats differ
slightly from each other.
For biological molecules that have 3D structures there is a single archive, the PDB
(Protein Data Bank). The PDB uses a flat file format to present basic information such as
protein and author names, experimental details, cofactors and secondary structures.
Secondary databases
Secondary databases are also called repositories of curated data. They are built from
primary data and require computational processing or manual curation of that data. This
category includes translated protein sequence databases that contain functional annotation.
Secondary databases are controlled by a third party such as NCBI. Examples: SWISS-PROT
and PIR (Protein Information Resource).
Sequence annotation in primary databases is very limited, so the raw data in primary
databases needs post-processing to convert it into more sophisticated biological
information. Secondary databases serve this purpose by processing primary data into a more
organized form. The computational processing applied to the primary data differs from one
secondary database to another. The best example of a secondary database is SWISS-PROT,
which provides more detailed sequence annotation covering structure, function and protein
family assignment. Other examples include TrEMBL in EMBL. Each SWISS-PROT entry's
annotation is of high quality because it is carefully curated by human experts. Almost all the
information is obtained from the scientific literature and is carefully entered. The database
also provides cross-references to related online resources.
UniProt, which was assembled from SWISS-PROT, TrEMBL and PIR, has broader coverage
while retaining the original features of SWISS-PROT: high quality, cross-references and low
redundancy. In addition, the Pfam and BLOCKS databases classify protein families according
to function and structure.
Specialized databases
These databases serve a particular research interest, for example by specializing in a
particular organism or a particular type of data.
Examples: FlyBase, HIV Sequence Database
These databases can be composed of sequences or of other types of information about a
particular organism. Their data may overlap with primary databases, or it may be
completely new data with its own organization and additional annotation submitted by
experts in the field. Taxonomy-specific genome databases are included in this category.
An interconnection between all three types of databases is necessary. Secondary databases
are derived from primary data, so primary databases act as the central storehouse or archive
of raw sequence data on which secondary databases depend. To obtain sequence information,
secondary and specialized databases therefore frequently have to connect or link to
primary biological databases. For this purpose, entries in one database are cross-referenced
or linked to similar entries in other biological databases, which is also convenient for the user.
One obstacle hinders this interconnection between different biological databases:
the data is saved in different formats, which can make one database's format incompatible
with another database system. Fortunately, CORBA (Common Object Request Broker
Architecture) has made it possible for databases to communicate without needing to
understand each other's database structures. XML (eXtensible Markup Language), a markup
language rather than a protocol, also helps databases exchange data.
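To make the idea of cross-referenced entries exchanged as XML more concrete, here is a minimal sketch that reads the cross-references out of one XML record. The element names (`entry`, `xref`), the attribute names and the placeholder identifiers are invented for illustration and do not follow any real database schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record; tag names, attributes and identifiers are made up.
RECORD_XML = """<entry accession="ACC0001">
  <organism>Homo sapiens</organism>
  <xref db="PDB" id="XXXX"/>
  <xref db="PubMed" id="0000000"/>
</entry>"""


def cross_references(xml_text):
    """Return a mapping from target database name to the linked identifier."""
    root = ET.fromstring(xml_text)
    return {xref.get("db"): xref.get("id") for xref in root.findall("xref")}


links = cross_references(RECORD_XML)
```

Because the record describes itself through its tags, the receiving side needs to know only the tag names, not how the sending database stores the entry internally.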
Boolean operators are used to retrieve data from databases. The operators AND, OR and
NOT correspond to different commands: AND requires every term, OR accepts any of the
terms, and NOT excludes a term. Parentheses group terms, and anything within parentheses
is evaluated first.
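The effect of the operators and of parentheses can be sketched with Python sets, where AND is intersection, OR is union and NOT is set difference. The small index below (terms and record identifiers) is hypothetical.

```python
# Hypothetical index: search term -> identifiers of the records that contain it.
INDEX = {
    "kinase": {"r1", "r2", "r3"},
    "human": {"r2", "r3", "r4"},
    "mouse": {"r1", "r5"},
}


def term(word):
    """Look up the set of records matching a single search term."""
    return INDEX.get(word, set())


# kinase AND (human OR mouse): the parenthesized part is evaluated first.
grouped = term("kinase") & (term("human") | term("mouse"))

# (kinase AND human) OR mouse: same terms, different grouping.
regrouped = (term("kinase") & term("human")) | term("mouse")

# kinase NOT mouse: drop every kinase record that also mentions mouse.
excluded = term("kinase") - term("mouse")
```

The first two queries use the same terms but return different record sets, which is why the grouping has to be stated explicitly.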
Entrez
Entrez is a biological data retrieval system that provides access to protein and nucleotide
sequences. It is produced and maintained by NCBI. Entrez allows text-based searches of
annotated genomic data, other structural, sequence and taxonomic data, full papers, abstracts
and citations. It integrates information by cross-referencing between NCBI databases, which
is highly convenient for users because they do not have to visit a separate database for
related information. Entrez can be used effectively once the main features of the search
engine are understood. Several options that are common to all NCBI databases are
PubMed
GenBank
The most complete information on almost every organism's nucleotide sequence data
can be found in GenBank. It holds a huge collection of annotated nucleic acid sequence
data, including high-throughput raw sequence data, sequence polymorphisms, DNA,
RNA, cDNA and ESTs. For protein sequences, there is GenPept.
GenBank can be searched by
text-based search keywords, or
molecular sequence (searching by sequence similarity using BLAST).
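A text-based search can also be expressed as a URL. The sketch below only builds such a URL and sends nothing over the network. It assumes NCBI's E-utilities `esearch` endpoint, which is not covered in this article, and the gene name and field tags in the query are an illustrative choice.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the NCBI E-utilities search endpoint (not described in the article).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, db="nucleotide", retmax=5):
    """Build a keyword-search URL; the caller decides whether to fetch it."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


# Keywords combined with a Boolean operator and restricted by field tags.
url = build_search_url("BRCA1[Gene Name] AND Homo sapiens[Organism]")

# Decode the query string again to check that the term survived URL encoding.
decoded = parse_qs(urlparse(url).query)
```

Encoding the parameters with `urlencode` keeps spaces and square brackets in the search term from breaking the URL.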
GenBank format
GenBank is a relational database. Its output is a flat file, which is easy to read. A flat file
has three further sections
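The list of sections is not reproduced here. As a minimal sketch, assuming the three sections are the conventional parts of a GenBank flat file (a header, a feature table introduced by the FEATURES keyword, and the sequence introduced by ORIGIN and closed by `//`), a record could be split as follows. The sample record is a made-up demonstration entry, not a real GenBank accession.

```python
# Hypothetical, heavily shortened record laid out like a GenBank flat file.
SAMPLE = """\
LOCUS       DEMO0001      12 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical demonstration record.
ACCESSION   DEMO0001
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="synthetic construct"
ORIGIN
        1 atggccattg ta
//
"""


def split_genbank_record(text):
    """Split one flat-file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
            continue  # the ORIGIN keyword line itself carries no bases
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections


def extract_sequence(sequence_lines):
    """Drop position numbers and spaces, keeping only the base letters."""
    return "".join(ch for line in sequence_lines for ch in line if ch.isalpha())


sections = split_genbank_record(SAMPLE)
sequence = extract_sequence(sections["sequence"])
```

For real records an established parser is the safer choice; this sketch only shows how the keywords mark the section boundaries.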
Conclusion
Biological databases are central to biological research. Three main types of electronic
database, namely flat files, object-oriented databases and relational databases, form the
basis on which primary, secondary and specialized biological databases are
developed. Biological databases need to be interconnected, and their problems
must be resolved as soon as possible, because a single error in one sequence record
leads to further errors. These databases should also have low redundancy. To understand
the knowledge they present, one first needs to understand the database
formats.