Lecture 2

Bioinformatics
Introduction to Biological Databases
• Publicly available databanks now contain billions of
nucleotides of DNA sequence data collected from
thousands of different organisms
• Major questions:
1. How do these databases store these data?
2. What strategies are followed to extract
information from them?
•Three publicly accessible databases store large
amounts of nucleotide and protein sequence data:
1. GenBank, the National Institutes of Health
(NIH) database maintained by the National
Center for Biotechnology Information (NCBI)
2. DNA Data Bank of Japan (DDBJ)
3. European Bioinformatics Institute (EBI) in
Hinxton, England
•Other resources include the Protein Data Bank in Europe (PDBe),
the Worldwide Protein Data Bank (wwPDB), etc.
Key Bioinformatics Databases in Detail
•NCBI (we will use this database a lot!)
•http://www.ncbi.nlm.nih.gov/
•The National Center for
Biotechnology Information
advances science and health by
providing access to biomedical and
genomic information.
•EBI-EMBL
•http://www.ebi.ac.uk/
•The European Bioinformatics Institute
(EBI) is a non-profit academic organization
that forms part of the European
Molecular Biology Laboratory (EMBL)
•The EBI is a centre for research and
services in bioinformatics. The Institute
manages databases of biological data,
including nucleic acid sequences, protein
sequences and macromolecular structures
•DDBJ
•http://www.ddbj.nig.ac.jp/
•DNA Data Bank of Japan (DDBJ) is the
sole nucleotide sequence data bank in
Asia, which is officially certified to
collect nucleotide sequences from
researchers and to issue the
internationally recognized accession
number to data submitters.
•SIB
•http://www.isb-sib.ch/
•The SIB Swiss Institute of Bioinformatics is
an academic, non-profit foundation
established in 1998. SIB coordinates
research and education in bioinformatics
throughout Switzerland and provides high
quality bioinformatics services to the
national and international research
community.
GenBank: Database of Most Known
Nucleotide and Protein Sequences
• GenBank is a database consisting of most known public
DNA and protein sequences
• In addition to storing these sequences, GenBank
contains bibliographic and biological annotation
• Data from GenBank are available free of charge from
NCBI in the National Library of Medicine at the NIH
• GenBank is part of the International Nucleotide
Sequence Database Collaboration (INSDC), which
comprises the DNA DataBank of Japan (DDBJ), the
European Nucleotide Archive (ENA), and GenBank at
NCBI. These three organizations exchange data on a
daily basis.
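One way to pull a single GenBank record from NCBI programmatically is through the Entrez E-utilities EFetch service. The sketch below only builds the request URL and wraps the download in a helper; the function names are hypothetical, U49845 is used purely as an example accession, and it is assumed that the `nuccore` database with `rettype=gb` returns the record as a GenBank flat file.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Return an EFetch URL asking for one nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",    # nucleotide database (assumed target for this sketch)
        "id": accession,    # accession number of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_genbank_record(accession: str) -> str:
    """Download the record text (needs network access; not called here)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_genbank_record("U49845")` on a machine with network access should return the flat-file text of that record; the URL builder can be checked offline.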
Amount of Sequence Data
•GenBank contains over 31 billion
nucleotides from 24 million
sequences (released April 2003).
•The number of bases in GenBank
has doubled approximately every
14 months.
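A doubling time of about 14 months corresponds to exponential growth, N(t) = N0 · 2^(t/14), with t in months. The sketch below applies that formula; the function name and the starting value in the usage note are illustrative assumptions, and the projection simply assumes the doubling time stays constant.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 14.0) -> float:
    """Project a base count forward, assuming a constant doubling time."""
    # Each elapsed doubling period multiplies the count by two.
    return current_bases * 2 ** (months_ahead / doubling_months)
```

For instance, `projected_bases(31e9, 28)` models two doubling periods, so the starting count is multiplied by four.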
Organisms in GenBank
• Over 100,000 different species are represented in GenBank,
with over 1000 new species added per month (Benson et al., 2002).
Organization and type of data:
How data is stored
Types of data in GenBank
•To help organize the available
information, each sequence name in
a GenBank record is followed by its
data file division and primary
accession number.
•The following codes are used to
designate the data file divisions:
1. PRI: primate sequences
2. ROD: rodent sequences
3. MAM: other mammalian sequences
4. VRT: other vertebrate sequences
5. INV: invertebrate sequences
6. PLN: plant, fungal, and algal sequences
7. BCT: bacterial sequences
8. VRL: viral sequences
9. PHG: bacteriophage sequences
10. SYN: synthetic sequences
11. UNA: unannotated sequences
12. EST: EST sequences (expressed sequence tags)
13. PAT: patent sequences
14. STS: STS sequences (sequence-tagged sites)
15. GSS: GSS sequences (genome survey sequences)
16. HTG: HTGS sequences (high-throughput genomic sequences)
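The division codes listed above can be kept in a small lookup table. The sketch below also shows one possible way to read the code off the LOCUS line of a GenBank flat-file record, assuming the division appears there as a three-letter token; the helper name `division_of` is hypothetical and the parsing is deliberately simplistic.

```python
from typing import Optional

# Division codes and descriptions as listed in the lecture.
GENBANK_DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence-tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high-throughput genomic sequences)",
}


def division_of(locus_line: str) -> Optional[str]:
    """Return the first known division code found on a LOCUS line, if any."""
    # Skip the "LOCUS" keyword and the locus name, which could by chance
    # look like a division code.
    for token in locus_line.split()[2:]:
        if token in GENBANK_DIVISIONS:
            return token
    return None


example = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
code = division_of(example)
```

Here `code` is `"PLN"`, and `GENBANK_DIVISIONS[code]` gives the matching description from the list.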
What are ESTs?
Expressed Sequence Tags (ESTs)
• The database of expressed sequence tags (dbEST) is a
division of GenBank that contains sequence data and
other information on cDNA sequences from a number of
organisms
• An EST is a partial DNA sequence of a cDNA clone
• All cDNA clones, and thus all ESTs, are derived from some
specific RNA source such as human brain or rat liver
• The RNA is converted into a more stable form, cDNA,
which may then be packaged into a cDNA library
• ESTs are typically randomly selected cDNA clones that
are rapidly sequenced on one strand
• ESTs are often 300-800 bp in length
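The conversion of RNA into cDNA can be illustrated at the sequence level: the first cDNA strand is complementary to the mRNA template, so read 5' to 3' it is the reverse complement written in the DNA alphabet. This is a minimal sketch with a hypothetical function name and a made-up input sequence; it models base pairing only, not the enzymology of reverse transcription.

```python
# Map each RNA base to its complementary DNA base.
_RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') complementary to an mRNA."""
    mrna = mrna.upper()
    if set(mrna) - set("ACGU"):
        raise ValueError("mRNA may contain only A, C, G and U")
    # Complement every base, then reverse so the result reads 5'->3'.
    return mrna.translate(_RNA_TO_DNA_COMPLEMENT)[::-1]


cdna = first_strand_cdna("AUGGCC")
```

For the toy input `"AUGGCC"`, `cdna` is `"GGCCAT"`.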
Why is mRNA less stable than cDNA?
Role and importance of ESTs
• Because ESTs represent a copy of the expressed, and therefore most
interesting, part of a genome, they can be used as "tags" to fish a gene
out of a portion of DNA by matching base pairs
• Powerful tools in the hunt for genes involved in hereditary
diseases
• ESTs also have a number of practical advantages:
- Their sequences can be generated rapidly and inexpensively,
- Only one sequencing experiment is needed for each cDNA
generated
- They do not have to be checked for sequencing errors because
mistakes do not prevent identification of the gene from which the
EST was derived
- Because they are expressed, they were clearly transcribed
from a functioning gene
Protein Databases
• Data may also be represented in databases for
proteins such as the nonredundant (nr) database of
GenBank, the SwissProt database and the Protein
Data Bank.
ESTs and UniGene
• A UniGene cluster is a database entry for a gene containing a
group of corresponding ESTs
• The goal of the UniGene (unique gene) project is to create one
unique entry for each gene and to collect all the ESTs associated
with that gene. For example, in the case of RBP4, there is only
one UniGene entry.
• Although there is only a single UniGene entry for RBP4, this entry
currently has over 200 human ESTs that match the RBP4 gene.
• This large number of ESTs reflects how abundantly the RBP4 gene
has been expressed in cDNA libraries that have been sequenced.
• UniGene clusters are created by sequence similarity searching
using ESTs, and by gathering aligned sequences into clusters
• Many clusters have only one member, representing unique
sequences, while other clusters have tens of thousands of EST
members.
• Conclusion: UniGene is a useful tool to look up information
about expressed genes. UniGene displays information about
the abundance of a transcript (expressed gene), as well as its
regional distribution of expression (e.g. brain vs. liver).
Cluster sizes in UniGene
• A gene with 1 associated EST has a cluster size of 1
• A gene with 10 associated ESTs has a cluster size of 10
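The idea of grouping ESTs by sequence similarity and then reporting cluster sizes can be sketched with a toy example. This is not UniGene's actual procedure: it assumes, for illustration only, that two ESTs belong together when they share at least one exact k-mer on the same strand, and it merges such pairs with a union-find structure. All names and sequences are made up.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: Dict[str, str], k: int = 8) -> List[List[str]]:
    """Group EST names into clusters linked by at least one shared k-mer."""
    parent = {name: name for name in ests}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(ests)
    kmer_sets = {name: kmers(ests[name], k) for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if kmer_sets[a] & kmer_sets[b]:  # crude stand-in for a similarity search
                parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=len, reverse=True)


toy_ests = {
    "est1": "ACGTACGTTAGCCGATAGGCTA",
    "est2": "TAGCCGATAGGCTAACCGGTTA",    # overlaps the end of est1
    "est3": "GGGGGGGGCCCCCCCCAAAAAAAA",  # unrelated sequence
}
clusters = cluster_ests(toy_ests)
sizes = [len(c) for c in clusters]
```

With this toy input, est1 and est2 fall into one cluster of size 2 and est3 forms a singleton of size 1. A real pipeline would use alignment-based similarity and consider both strands.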
Sequence-Tagged Sites (STSs)
• dbSTS is an NCBI resource containing STSs, which are short
genomic landmark sequences for which both DNA sequence
data and mapping data are available
• STSs have been obtained from several dozen organisms
• A typical STS is approximately the size of an EST
• An STS is a short DNA sequence that is easily recognizable
and occurs only once in a genome (or chromosome).
• Because they are sometimes polymorphic, containing short
sequence repeats, STSs can be useful for mapping studies
Note: Genetic polymorphism, as originally defined by Ford (1940), is the
simultaneous occurrence in the same locality of two or more discontinuous
forms in such proportions that the rarest of them cannot be maintained just
by recurrent mutation or immigration.
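The requirement that an STS occurs only once in a genome can be checked on a toy sequence by counting exact matches on both strands. This is a minimal sketch with hypothetical function names and a made-up "genome"; real STS mapping relies on PCR primers and tolerates sequence variation, which exact string matching does not capture.

```python
_DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_DNA_COMPLEMENT)[::-1]


def count_overlapping(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def is_unique_landmark(genome: str, candidate: str) -> bool:
    """True if candidate matches exactly once, counting both strands."""
    genome, candidate = genome.upper(), candidate.upper()
    hits = count_overlapping(genome, candidate)
    rc = reverse_complement(candidate)
    if rc != candidate:  # avoid counting a palindromic site twice
        hits += count_overlapping(genome, rc)
    return hits == 1


toy_genome = "ATGCGTACGTTAGCATGCGT"
unique = is_unique_landmark(toy_genome, "TTAGC")
repeated = is_unique_landmark(toy_genome, "ATGCGT")
```

In the toy genome, `TTAGC` occurs once, so `unique` is `True`, while `ATGCGT` occurs twice, so `repeated` is `False`.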
Genome Survey Sequences (GSSs)
• The GSS division of GenBank is similar to the EST division, except
that its sequences are genomic in origin, rather than cDNA (mRNA).
• The GSS division contains the following types of data:
1. Random “single-pass read” genome survey sequences (single pass
means that a sequence has been read on the sequencing machine
only once)
2. Cosmid/BAC/YAC end sequences (cosmids, BACs and YACs are cloning
vectors, the latter two being artificial chromosomes, that carry
integrated DNA fragments from a foreign source)
3. Polymerase chain reaction (PCR) sequences
High-Throughput Genomic Sequence (HTGS)
• The HTGS division was created to make “unfinished”
genomic sequence data rapidly available to the
scientific community.
• It was set up as a coordinated effort among the
three international nucleotide sequence databases:
DDBJ, EMBL, and GenBank.
• The HTGS division contains unfinished DNA
sequences generated by the high-throughput
sequencing centers.
