Biological Databases: Primary Databases, Secondary Databases, Entrez, GenBank, PubMed

Biological Databases

Abstract

With rapid advances in genomics and molecular biology, a need arose to store and organize
the growing volume of sequence data. Biological databases were developed for this purpose.
Data retrieval and information discovery are their two main aims. This article describes the
different types of biological databases, explains why these databases need to be
interconnected, and outlines their pitfalls. The GenBank format is described in detail.

Keywords: primary databases, secondary databases, Entrez, GenBank, PubMed

Introduction

As genomics advances, a large amount of raw sequence data is being generated. Biological
databases were established to collect, store and organize this raw data in one place. Once
organized, the data can be put to further use in different ways.

Databases

A database can be defined as a computerized record or automated archive that stores and
organizes data. Data usually arrives in raw form and is processed and organized within the
database. A database includes both the computer hardware and the software needed to manage
the data. The main purpose of a database is easy retrieval: because the data is organized into
structured records, it can be found again with little effort.

Some specific terms associated with databases are:

• Entry: every record is called an entry, and the actual data is held in fields
within each entry.
• Value: a particular piece of information that the user specifies, and that must be found
in a given field, to retrieve a record from the database.
• Making a query: the whole process of using such a piece of information to
retrieve data from the database.

Flat files, object-oriented databases and relational databases are three different database
structures, and all three are used by biological databases.
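
The terms above can be illustrated with a minimal sketch of a relational table, assuming Python's built-in sqlite3 module. The table name, field names and the three entries are hypothetical and are not taken from any real database.

```python
import sqlite3

# Hypothetical entries: (accession, organism, sequence). All values are made up.
ENTRIES = [
    ("X00001", "Homo sapiens", "ATGGCCTAA"),
    ("X00002", "Mus musculus", "ATGAAATAG"),
    ("X00003", "Homo sapiens", "ATGCCCTGA"),
]


def build_database():
    """Create an in-memory relational table with one row per entry."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", ENTRIES)
    return conn


def query_by_organism(conn, organism):
    """Make a query: use a value in the 'organism' field to retrieve entries."""
    rows = conn.execute(
        "SELECT accession FROM entry WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return [row[0] for row in rows]


conn = build_database()
human_hits = query_by_organism(conn, "Homo sapiens")
print(human_hits)
```

Each row is an entry, each column is a field, and the string passed to query_by_organism is the value used to make the query.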

Types of biological databases

There are three main types:

• Primary databases
• Secondary databases
• Specialized databases

Primary databases

Primary databases are archives of raw sequence or structural data submitted by the
scientific community. The content is original and is usually controlled by the submitters.
Primary biological databases are also known as archival databases. Examples include
GenBank and the Protein Data Bank (PDB).

The primary sequence databases are large, publicly accessible databases that store the raw
DNA/RNA sequence data produced and submitted by scientists from all around the world.
GenBank, EMBL and DDBJ are examples, and they are freely available to the public on the
internet. Most of the data they hold has a low level of annotation.

GenBank, EMBL and DDBJ collaborate closely with each other. Submitting sequence data to
one of the three is usually a precondition for publication in scientific journals, which
ensures that the data is openly available. Together, the three databases constitute the
International Nucleotide Sequence Database Collaboration. Any sequence submitted to one
of them can be accessed from all three. Although the raw data is the same in all three
databases, their formats differ slightly from each other.

For biological molecules with known 3D structures there is a single central archive, the
Protein Data Bank (PDB). PDB uses a flat file format to present basic information such as
the protein name, authors, experimental details, cofactors and secondary structure.

Secondary databases

Secondary databases are also called repositories of curated data. They are built from
primary data and require computational processing or manual curation of that data. This
category includes translated protein sequence databases that contain functional annotation.
Secondary databases are controlled by a third party such as NCBI. Examples: SWISS-PROT
and PIR (Protein Information Resource).

Sequence annotation in primary databases is very limited, so the raw data they hold
needs post-processing to convert it into more sophisticated biological information.
Secondary databases serve this purpose by turning primary data into a more organized
form. The amount of computational processing applied to the primary data differs among
secondary databases. The best example of a secondary database is SWISS-PROT, which
provides detailed sequence annotation covering structure, function and protein family
assignment. Each SWISS-PROT entry is of high quality because it is carefully curated by
human experts. Almost all the information is obtained from the scientific literature and is
carefully entered. The database also provides cross-references to related online
resources. Another example is TrEMBL, which is derived from EMBL.

UniProt, which was created by combining SWISS-PROT, TrEMBL and PIR, has broader coverage
while retaining the original strengths of SWISS-PROT: high quality, cross-referencing and low
redundancy. In addition, the Pfam and BLOCKS databases provide protein family classification
according to function and structure.

Specialized databases

These databases serve a particular research interest, for example by specializing in
a particular organism or a particular type of data.
Examples: FlyBase, HIV Sequence Database.

These databases can be composed of sequences or of other types of information about a
particular organism. Their data may overlap with the primary databases, or it may be
entirely new data with its own organization and additional annotations submitted by
experts in the field. Taxonomy-specific genome databases are included in this
category.

Interconnection between biological databases

An interconnection between all three types of databases is necessary. Secondary databases
are derived from primary data, so the primary databases act as the central storehouse or
archive of raw sequence data on which secondary databases depend. To obtain up-to-date
sequence information, secondary and specialized databases therefore frequently have to
link to the primary biological databases. For this purpose, entries in one database are
cross-referenced or linked to related entries in other biological databases, which is also
convenient for the user.

However, one obstacle hinders this interconnection between different biological
databases: the data is stored in different formats, which can make the format of one
database incompatible with another database system. Fortunately, CORBA
(Common Object Request Broker Architecture) makes it possible for databases to
communicate without needing to understand each other's database
structures. Another standard, XML (eXtensible Markup Language), also helps
databases exchange data.
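
As a rough sketch of how a shared format such as XML lets two systems exchange an entry without knowing each other's internal structure, the example below serializes a record to XML and parses it back using Python's standard xml.etree.ElementTree module. The element and attribute names (entry, accession, organism, sequence) are hypothetical and do not correspond to the schema of any real database.

```python
import xml.etree.ElementTree as ET


def to_xml(entry):
    """Serialize an entry (a dict) to an XML string for exchange."""
    root = ET.Element("entry", accession=entry["accession"])
    ET.SubElement(root, "organism").text = entry["organism"]
    ET.SubElement(root, "sequence").text = entry["sequence"]
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Parse the XML string back into a dict on the receiving side."""
    root = ET.fromstring(xml_text)
    return {
        "accession": root.get("accession"),
        "organism": root.findtext("organism"),
        "sequence": root.findtext("sequence"),
    }


# Hypothetical record used only for illustration.
original = {"accession": "X00001", "organism": "Homo sapiens", "sequence": "ATGGCCTAA"}
message = to_xml(original)    # what the sending database emits
received = from_xml(message)  # what the receiving database reconstructs
print(message)
```

The receiving side only needs to agree on the tag names, not on how the sender stores its records.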

Pitfalls of biological databases

Problems related to biological databases are as follows:

• Overreliance on biological databases: relying too heavily on the sequence
information and annotations provided by databases causes problems, because that
information is sometimes unreliable.
• Errors in sequence databases: errors made while sequencing nucleotides can
introduce frameshifts, which change the downstream reading frame completely and
make it impossible to translate the gene into the correct protein. In effect, the gene
sequence is corrupted. Sequences determined before the 1990s contain more errors, so
older sequence data should be handled with extra care.
• High levels of redundancy in primary databases: information in primary
databases becomes duplicated through repeated submission of identical or overlapping
sequences by the same or different authors, revision of annotations, dumping of EST
data, or poor database management. NCBI developed RefSeq, a non-redundant
database in which all identical and associated sequences from the same
organism are placed under one entry, as a solution to high redundancy. RefSeq can be
considered a secondary database. SWISS-PROT is another secondary database with low
redundancy. UniGene takes a different approach: it builds sequence-clustered
databases that group EST sequences derived from the same gene.
• Erroneous annotations: the same gene sequence may appear under different names, or
unrelated gene sequences may appear under the same name. This causes further
problems, because an error in one annotation propagates to other sequences that are
annotated from it. The problem is addressed by re-annotating the affected gene
sequences. Gene Ontology is a system that provides consistent and unambiguous naming.
These errors may be caused by disagreements between researchers, careless assignment
of protein functions by submitters, omissions or typing mistakes. Some corrections
can be made at the informatics level, while others require the experimental work to
be repeated.
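
The idea of placing identical sequences from the same organism under one entry can be sketched as follows. This is a simplified illustration only, not how RefSeq is actually built: it handles exact duplicates but not overlapping sequences, and the accession codes and sequences are hypothetical.

```python
from collections import defaultdict

# Hypothetical submissions: (accession, organism, sequence).
SUBMISSIONS = [
    ("A1", "Homo sapiens", "ATGGCCTAA"),
    ("A2", "Homo sapiens", "atggcctaa"),  # same sequence, submitted in lower case
    ("A3", "Mus musculus", "ATGGCCTAA"),  # same sequence, different organism
    ("A4", "Homo sapiens", "ATGCCCTGA"),
]


def collapse_redundant(submissions):
    """Group accessions that share the same organism and identical sequence."""
    groups = defaultdict(list)
    for accession, organism, sequence in submissions:
        # Normalize case so trivially different copies compare as equal.
        groups[(organism, sequence.upper())].append(accession)
    return dict(groups)


nonredundant = collapse_redundant(SUBMISSIONS)
for (organism, sequence), accessions in nonredundant.items():
    print(organism, sequence, accessions)
```

Four submissions collapse into three non-redundant entries here, because A1 and A2 carry the same sequence from the same organism.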

Retrieving data from databases requires Boolean operators. The operators AND, OR and NOT
each correspond to a different way of combining search terms. Parentheses can be used to
group terms, and anything within parentheses is evaluated first.
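
A minimal sketch of how Boolean operators combine search terms, using Python sets as a stand-in for a database index. The keyword index and record numbers are hypothetical; AND is modelled as set intersection, OR as union and NOT as set difference.

```python
# Hypothetical keyword index: term -> set of record numbers containing it.
INDEX = {
    "kinase": {1, 2, 5},
    "human": {1, 3, 5},
    "mouse": {2, 4},
    "partial": {5},
}


def hits(term):
    """Return the records indexed under a term (empty set if unknown)."""
    return INDEX.get(term, set())


# (human OR mouse) AND kinase NOT partial
grouped = ((hits("human") | hits("mouse")) & hits("kinase")) - hits("partial")

# human OR (mouse AND kinase): moving the parentheses changes the result
regrouped = hits("human") | (hits("mouse") & hits("kinase"))

print(sorted(grouped))
print(sorted(regrouped))
```

The two queries use the same terms but return different records, which is why the placement of parentheses matters.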

Entrez

Entrez is a biological data retrieval system that provides access to protein and nucleotide
sequences. It is produced and maintained by NCBI. Entrez supports text-based searches and
gives access to annotated genomic data, structural and taxonomic data, full papers, abstracts
and citations. It integrates information by cross-referencing between NCBI databases, which
is highly convenient for users because they do not have to visit separate databases for
related information. Entrez is most effective when the main features of the search engine
are understood. Several options are common to all NCBI databases:

• limits: restricts a search to a subset of a database or to a certain type of data
• preview/index: performs a new search by connecting different searches with Boolean
operators
• history: keeps a record of previous searches, which can be revised or combined for
new searches
• clipboard: stores results for later viewing for a limited amount of time
• send to clipboard: used for storing information in the clipboard
PubMed

The biomedical literature database PubMed is accessible from Entrez. PubMed contains
abstracts and full papers from nearly 4000 journals. It retrieves information under Medical
Subject Headings (MeSH), a standardized, controlled vocabulary of some 20,000 terms
used for indexing articles. MeSH works like a thesaurus: it converts the searched keywords
into standardized terms that describe a concept, and thus allows "smart" searches. The set
of results can be broadened by using the "related articles" option in PubMed.

Complex searches can be carried out with Boolean operators or with combinations of the
preview and limits options. PubMed uses field tags that act as identifiers for each field
and are placed within square brackets after the search term, such as [AU] for author and
[JID] for the journal's identifier.
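
A tagged PubMed query can be assembled as a plain string. PubMed's field tags are written in square brackets after the term, for example [AU] for author and [MH] for MeSH terms. The sketch below also builds a search URL for NCBI's E-utilities esearch endpoint, which is not discussed in this article and is included only as an illustration; the author name and keyword are hypothetical, and the code builds the URL without sending any request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an addition for illustration).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a PubMed field tag to a search term, e.g. 'Smith J[AU]'."""
    return f"{term}[{tag}]"


def build_pubmed_search(author, mesh_term, retmax=20):
    """Combine an author and a MeSH term with AND and build the search URL."""
    term = f"{tagged(author, 'AU')} AND {tagged(mesh_term, 'MH')}"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return term, ESEARCH_URL + "?" + urlencode(params)


# Hypothetical author and keyword.
term, url = build_pubmed_search("Smith J", "genomics")
print(term)
print(url)
```

Keeping the query construction separate from the network call makes it easy to inspect the exact term before running a search.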

Apart from PubMed, OMIM (Online Mendelian Inheritance in Man) is also
accessible from Entrez. OMIM is a non-sequence-based database that stores
information about genetic disorders and diseases, with detailed information
on the genes involved. It can therefore be used to study the genes related to a
particular disease.

GenBank

GenBank holds the most complete collection of nucleotide sequence data, covering almost
every organism. It is a huge collection of annotated nucleic acid sequence data, which can be
high-throughput raw sequence data, sequence polymorphisms, DNA, RNA, cDNA or ESTs. For
protein sequences, there is GenPept.
Searches can be made in GenBank by
• using text-based search keywords, or
• using a molecular sequence (to search by sequence similarity using BLAST).

GenBank format

GenBank is a relational database, but its output is a flat file, which is easy to read. The
flat file has three sections:

• Header: describes the origin of the sequence, the identification of the organism and the
unique identifiers associated with the record. The header begins with the locus line, which
gives an identifier for the sequence's location in the database, followed by the sequence
length and molecule type. Next comes a three-letter code for the GenBank division (there
are 17 divisions in total), with abbreviations such as PLN for plant, BCT for bacteria and
PRI for primate sequences. The date on which the record was made public appears next to
the division. After the locus line comes the definition, which is summary information for
the sequence record (the name of the sequence and its status, for example complete or
partial). The accession number is a unique number assigned to a piece of DNA when it is
first submitted. It is followed by the version number and the gene index (gi) number.
• Features: annotation information about the gene and gene product. It contains the
organism's scientific name and the protein's identifier and name.
• Sequence: contains the sequence data.
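
The three sections can be separated with a few lines of code. The sketch below is a simplified reader, not a complete GenBank parser, and the record it reads is hypothetical: the locus name, accession, organism and sequence are invented for illustration. It assumes the features section starts at the FEATURES keyword, the sequence section starts at ORIGIN, and the record ends with //.

```python
# Hypothetical flat-file record; every value is invented for illustration.
RECORD = """
LOCUS       EXAMPLE01                 24 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical plant gene, partial cds.
ACCESSION   EX000001
VERSION     EX000001.1
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="Examplea hypothetica"
ORIGIN
        1 atggccaaag gtctgaccga ataa
//
"""


def split_sections(record):
    """Split a flat file into header, features and sequence lines."""
    header, features, sequence = [], [], []
    current = header
    for line in record.strip().splitlines():
        if line.startswith("FEATURES"):
            current = features
        elif line.startswith("ORIGIN"):
            current = sequence
            continue  # the ORIGIN keyword itself carries no sequence data
        elif line.startswith("//"):
            break  # end-of-record marker
        current.append(line)
    return header, features, sequence


def parse_locus(line):
    """Read name, length, molecule type, division and date from the locus line."""
    fields = line.split()
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "division": fields[-2],
        "date": fields[-1],
    }


def read_sequence(lines):
    """Drop position numbers and spaces, keeping only the bases."""
    return "".join(ch for line in lines for ch in line if ch.isalpha()).upper()


header, features, seq_lines = split_sections(RECORD)
locus = parse_locus(header[0])
sequence = read_sequence(seq_lines)
print(locus)
print(sequence)
```

Comparing the length declared on the locus line with the length of the sequence actually read is a simple consistency check on a record.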


Conclusion

Biological databases are central to biological research. Three main database
structures, namely flat files, object-oriented databases and relational databases, form
the basis on which primary, secondary and specialized biological databases are
built. Biological databases need to be interconnected, and their problems
must be resolved as early as possible, because an error in one sequence record
leads to further errors. Databases should also have low redundancy. To make use of the
knowledge these databases hold, one first needs to understand the database
formats.