Class12 Biological Database

16/5/2023
SIJ1004
Class 12
Molecular Sequence Databases
What is a database?
• A database is a computerized archive used to store and organize data so that information can be retrieved by a variety of search criteria.
• Biological databases are often used for knowledge discovery.
• A good database needs to be:
• structured with minimum redundancy
• searchable (data retrieval)
• updated periodically
• cross-referenced with other databases
• Tools for analysis and visualization
• Data that is processed, organized, structured and presented in a useful context → information
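The properties listed above (structured, searchable, cross-referenced) can be illustrated with a very small in-memory database. This is only a minimal sketch using Python's standard `sqlite3` module. The table and column names are hypothetical, and the single record reuses values from the AF115338 entry shown later in these slides.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory sequence database (hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE nucleotide ("
        " accession TEXT PRIMARY KEY,"   # one row per record: minimum redundancy
        " organism TEXT, length_bp INTEGER)"
    )
    con.execute(
        "CREATE TABLE xref ("            # cross-references to other databases
        " accession TEXT REFERENCES nucleotide(accession),"
        " other_db TEXT, other_id TEXT)"
    )
    con.execute("INSERT INTO nucleotide VALUES (?, ?, ?)",
                ("AF115338", "Pseudomonas fluorescens", 591))
    con.execute("INSERT INTO xref VALUES (?, ?, ?)",
                ("AF115338", "protein", "AAD34329.1"))
    con.commit()
    return con

def search_by_organism(con, organism):
    """Retrieve records and their cross-references by one search criterion."""
    rows = con.execute(
        "SELECT n.accession, n.length_bp, x.other_db, x.other_id"
        " FROM nucleotide n LEFT JOIN xref x ON x.accession = n.accession"
        " WHERE n.organism = ?", (organism,))
    return rows.fetchall()
```

Calling `search_by_organism(build_demo_db(), "Pseudomonas fluorescens")` returns the stored record together with its protein cross-reference. Other search criteria would simply be other `WHERE` clauses.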
Databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 ->1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> ?
• WWW 1993 -> ?
• DVD 2001 -> ?
Biological Database
• Libraries related to biological data and information
• Data from scientific experiments
• Published literature
• High-throughput experiment technologies
• Computational analysis
What is a molecular sequence?

• DNA
• RNA
• Protein
Some important biological databases in bioinformatics
• GenBank (ncbi.nlm.nih.gov): nucleotide sequences
• Ensembl (ensembl.org): human/mouse genome
• PubMed (ncbi.nlm.nih.gov): literature references
• NR (ncbi.nlm.nih.gov): protein sequences
• UniProt (uniprot.org/): protein sequences
• InterPro (ebi.ac.uk): protein domains
• OMIM (ncbi.nlm.nih.gov): genetic diseases
• Expasy (expasy.org): enzymes
• PDB (rcsb.org/pdb/): protein structures
• KEGG (genome.ad.jp): metabolic pathways
• MINT (mint.bio.uniroma2.it/): protein-protein interactions
• GeneNet (genenetwork.org/): gene networks
• Transpath (genexplain.com/transpath): signal transduction pathways
Central dogma of molecular biology

• The flow of information: DNA → RNA → protein
[Figure: DNA (replication) → transcription → RNA → translation → protein → biological activity (function) → metabolite; reverse transcription (RNA → DNA) occurs in retroviruses.]
Omics world (following the central dogma)
• Genome (DNA): genomics (genotype)
• Transcription → transcriptome (RNA): transcriptomics
• Translation → proteome (protein): proteomics (phenotype)
• Function → metabolome (metabolite): metabolomics
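The DNA → RNA → protein flow can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: it assumes the standard genetic code, an unambiguous coding-strand sequence (A, C, G, T only) and a reading frame starting at the first base. The function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """DNA coding strand -> mRNA (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> protein, reading codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

Applied to the first 30 bases of the sigX gene shown later in these slides (`atgaataaagcccaaacgctatccacgcgc`), `translate(transcribe(...))` reproduces the start of the annotated protein translation, `MNKAQTLSTR`.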
Types of data generated by molecular biology research
• Molecular sequence:
• Nucleotide sequence (DNA or RNA)
• Protein (amino acids) sequence
• Genome sequence (complete or draft).
• Gene expression
• Gene polymorphism (variation)
• Macromolecular 3D structure
• Protein sequence patterns or motifs
• Metabolic pathways
Genome size (haploid)

Organism                     Genome size (bp)
Phi-X 174                    5,386
Human Mitochondrion          16,569
Mycoplasma genitalium        490,885
Rickettsia prowazekii        1,111,523
Haemophilus influenzae       1,830,138
Streptococcus pneumoniae     2,160,837
Vibrio cholerae              4,033,460
Mycobacterium tuberculosis   4,411,532
Bacillus subtilis            4,214,814
Escherichia coli             4,639,211
Escherichia coli O157:H7     5,440,000
Saccharomyces cerevisiae     12,495,682
Caenorhabditis elegans       100,258,171
Arabidopsis thaliana         115,409,949
Drosophila melanogaster      122,653,977
Anopheles gambiae            178,244,063
Homo sapiens                 3,200,000,000
Oryza sativa                 3,900,000,000

The haploid human genome (23 chromosomes) is estimated to be about 3.2 billion bases long.
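To give a feel for these sizes, the sketch below estimates how many bytes a genome would occupy if each base were packed into 2 bits. The 2-bit encoding is an assumption made for illustration (four letters A, C, G, T only, no ambiguity codes) and is not part of the slides. The dictionary holds a few rows copied from the table.

```python
# A few haploid genome sizes (bp) copied from the table above.
GENOME_SIZES_BP = {
    "Phi-X 174": 5_386,
    "Escherichia coli": 4_639_211,
    "Homo sapiens": 3_200_000_000,
}

def packed_size_bytes(n_bases):
    """Bytes needed at 2 bits per base (assumed encoding), rounded up."""
    return (2 * n_bases + 7) // 8

def size_ratio(organism_a, organism_b):
    """How many times larger genome A is than genome B."""
    return GENOME_SIZES_BP[organism_a] / GENOME_SIZES_BP[organism_b]
```

Under this assumption the human genome needs `packed_size_bytes(3_200_000_000)` bytes, that is 800,000,000 bytes, and `size_ratio("Homo sapiens", "Escherichia coli")` shows it is several hundred times larger than the E. coli genome.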
Primary Databases
• Experimental results are submitted directly into the database by researchers (original submissions)
• Content controlled by the submitter
• Three major databases for nucleotide sequences:
• GenBank
http://www.ncbi.nlm.nih.gov/Genbank/
• European Molecular Biology Laboratory (EMBL)
www.embl.org
• DNA Databank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
• Sequences are exchanged on a daily basis
• US:
• NIH (National Institutes of Health) → NLM (National Library of Medicine) → NCBI (National Center for Biotechnology Information) → GenBank (database)
• European:
• EMBL (European Molecular Biology Laboratory) → EBI (European Bioinformatics Institute) → ENA (European Nucleotide Archive)
Primary sequence databases
[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank, whose records are updated ONLY by submitters.]
Secondary databases
• Significant processing of original raw data.
• Annotation.
• Functional links.
• Carefully curated database (manually or automatically).
• High quality.
• SWISS-PROT, TrEMBL and PIR combined in UniProt.
• http://www.uniprot.org/
• Incorporates:
• Function of the protein
• Subcellular localization of protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.
Primary vs. Secondary sequence databases

[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank, which is updated ONLY by submitters. Curators, genome assembly and algorithms derive RefSeq and UniGene records (e.g. AGCTCCGATA, CCGATGACAA) from these submissions; these are updated continually by NCBI.]
Specialized Databases
• Often focused on a specific aspect of an organism.
• Curated by experts.
• Highly annotated and processed data.
Organism-specific genomic databases

Organism: database/resource (URL)
• Escherichia coli: EcoGene (http://bmb.med.miami.edu/EcoGene/EcoWeb)
• Escherichia coli: EcoCyc, Encyclopedia of E. coli genes and metabolism (http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html)
• Escherichia coli: Colibri (http://genolist.pasteur.fr/Colibri)
• Bacillus subtilis: SubtiList (http://genolist.pasteur.fr/SubtiList)
• Saccharomyces cerevisiae: Saccharomyces Genome Database, SGD (http://genome-www.stanford.edu/Saccharmyces)
• Plasmodium falciparum: PlasmoDB (http://PlasmoDB.org)
• Arabidopsis thaliana: MIPS Arabidopsis thaliana Database, MAtDB (http://mips.gsf.de/proj/thal/db)
• Arabidopsis thaliana: The Arabidopsis Information Resource, TAIR (http://www.arabidopsis.org)
• Drosophila melanogaster: FlyBase (http://flybase.bio.indiana.edu)
• Caenorhabditis elegans: A C. elegans DataBase, ACeDB (http://www.acedb.org)
• Mouse: Mouse Genome Database, MGD (http://www.informatics.jax.org)
• Human: Human Genome Organization, HUGO (http://www.hugo-international.org/)
Nucleotide sequence databases

• Primary databases: GenBank, EMBL and DDBJ
• These three databases contain mainly the same information (with a few differences in format and syntax)
• Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
• Genome projects and sequencing centers
• Individual scientists
• Patent offices (e.g. USPTO, EPO)
• Non-confidential data are exchanged daily
• Size: more than 2.5 × 10^7 sequences, over 3.2 × 10^10 bp
• Sequences from > 500,000 different species
GenBank
• Public nucleotide sequence database
• Development of software tools for sequence analysis
GenBank Flat File
(Slide annotations: Header with title, taxonomy and citation; Features (seq); DNA Sequence)

LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
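Because the flat file is keyword-driven text, simple fields can be pulled out with ordinary string handling. The sketch below is a minimal, assumption-laden example: it only handles a LOCUS line laid out like the one in this record (eight whitespace-separated fields) and a BASE COUNT line as shown. The function names are hypothetical.

```python
def parse_locus(line):
    """Split a GenBank LOCUS line into named fields (whitespace-delimited)."""
    fields = line.split()
    # Assumed layout: LOCUS name length 'bp' molecule topology division date
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "name": fields[1],
        "length": int(fields[2]),
        "molecule": fields[4],
        "topology": fields[5],
        "division": fields[6],
        "date": fields[7],
    }

def parse_base_count(line):
    """Turn 'BASE COUNT 157 a 133 c ...' into {'a': 157, 'c': 133, ...}."""
    tokens = line.split()[2:]          # drop the 'BASE' and 'COUNT' words
    counts = tokens[0::2]              # every other token is a number
    letters = tokens[1::2]             # followed by its base letter
    return {base: int(n) for n, base in zip(counts, letters)}
```

A useful consistency check on this record is that the base counts (157 + 133 + 170 + 131) add up to the 591 bp given on the LOCUS line. For real work a maintained parser, for example Biopython's `Bio.SeqIO` with the `"genbank"` format, is usually a better choice than hand-written splitting.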
Who can put data into GenBank?
Sequence data are submitted to GenBank by scientists from around the world.
Warning: GenBank does not check the validity or accuracy of sequences submitted. This is left up to the scientific community to verify, like all published scientific data.
European Molecular Biology Laboratory
EMBL Flat File
(Slide annotations: Header with title, taxonomy and citation; Features (seq); DNA Sequence)

ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP   1-591
RX   MEDLINE; 99369842.
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16):4746-4754(1999).
RN   [2]
RP   1-591
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source          1..591
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS             1..591
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD34329.1"
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
     gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg       420
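In the EMBL layout every annotation line starts with a two-letter line code (ID, AC, DE, OS, FT, SQ and so on), which makes the record easy to split by code. The following is a minimal sketch under that assumption only. The function names are hypothetical, and sequence lines (which start with spaces) are deliberately skipped.

```python
from collections import defaultdict

def group_embl_lines(text):
    """Group EMBL flat-file lines by their two-letter line code."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        # Sequence lines begin with spaces, so they fail this test.
        if len(code) == 2 and code.isalpha() and code.isupper():
            grouped[code].append(content)
    return dict(grouped)

def parse_sq_line(sq):
    """Parse 'Sequence 591 BP; 157 A; 133 C; ...' into (length, counts)."""
    parts = [p.strip() for p in sq.rstrip(";").split(";")]
    length = int(parts[0].split()[1])
    counts = {}
    for part in parts[1:]:
        n, label = part.split()
        counts[label] = int(n)
    return length, counts
```

For the record above, `group_embl_lines(text)["AC"]` gives `["AF115338;"]`, and `parse_sq_line` on the SQ line returns a length of 591 with base counts that sum to the same value.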
DDBJ - DNA Data Bank of Japan
Genome database
• What Genomes Are Available?
• List of completed genomes increases almost every week
• GOLD Website: Listing of finished and “in progress” genomes
http://www.genomesonline.org/
Genome database in NCBI
https://www.ncbi.nlm.nih.gov/genome/viruses/
Genome database at EMBL-EBI
SRA database
Protein Databases
• Protein sequence databases
• Protein properties
• Protein localization and targeting
• Protein sequence motifs and active sites
• Protein domain databases; protein classification
• Databases of individual protein families
• Protein structure database
Protein Information Resource
• https://pir.georgetown.edu/
• The oldest universal curated protein sequence database.
• Published as the ‘Atlas of Protein Sequence and Structure’ from 1965 to 1978 by the late Margaret O. Dayhoff.
• Established in 1984 as a successor to the original National Biomedical Research Foundation Protein Sequence Database.
SWISS-PROT
• https://web.expasy.org/docs/swiss-prot_guideline.html
• Manually curated, non-redundant protein sequence database.
• Highly integrated with other databases.
TrEMBL
• TrEMBL (Translation from EMBL) database (http://www.ebi.ac.uk/trembl/)
• Automatically curated and derived from the translation of all coding
sequences in the DDBJ/EMBL/GenBank nucleotide sequence database
that are not yet included in Swiss-Prot.
• http://www.bioinfo.pte.hu/more/TrEMBL.htm
UniProt: the next generation of protein sequence databases
• Combines the Swiss-Prot, TrEMBL and PIR-PSD databases into a single resource, UniProt (http://www.uniprot.org)
• UniProt Knowledgebase: continues the work of Swiss-Prot, TrEMBL and PIR by providing an expertly curated database.
• UniProt Archive (UniParc): new and updated sequences are loaded on a daily basis.
• UniProt non-redundant reference databases (UniProt NREF): provide non-redundant views on top of the UniProt Knowledgebase and UniParc.
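The idea of a non-redundant view can be shown with a toy example that merges records whose sequences are exactly identical. This is only a sketch of the concept, assuming exact, case-insensitive identity as the merging rule. It is not a description of how UniProt NREF is actually built, and the identifiers used are hypothetical.

```python
def collapse_identical(records):
    """Group (identifier, sequence) pairs by exact sequence identity.

    Returns a dict mapping each distinct sequence to the list of
    identifiers that share it, in input order.
    """
    merged = {}
    for identifier, sequence in records:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return merged
```

For example, three submissions `("lab1", "TATAGCCG")`, `("lab2", "tatagccg")` and `("lab3", "AGCTCCGATA")` collapse to two non-redundant entries, with the first entry keeping both source identifiers.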
UniProt
NCBI Protein database

• http://www.ncbi.nlm.nih.gov/protein
FASTA
• Begins with a single-line description, followed (after a newline character) by lines of sequence data.
• The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning.
• It is recommended that all lines of text be shorter than 80 characters in length.
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
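The rules above translate directly into a small reader: a line starting with ">" opens a new record, and every following line is sequence data until the next ">". The sketch below is a minimal version with hypothetical function names. It tolerates blank lines and also shows one way to split a pipe-delimited defline such as the GCN4 example into its identifier fields and title.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue                      # tolerate blank lines
        if line.startswith(">"):
            if description is not None:   # close the previous record
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)           # sequence may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

def split_defline(description):
    """Split 'gi|121066|sp|P03069|GCN4_YEAST TITLE...' into (fields, title)."""
    identifier, _, title = description.partition(" ")
    return identifier.split("|"), title
```

With the GCN4 record, `split_defline` yields the fields `gi`, `121066`, `sp`, `P03069`, `GCN4_YEAST` and the title `GENERAL CONTROL PROTEIN GCN4`. A multi-record file simply produces one tuple per ">" line.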
FASTA
> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIV
DTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTI
TSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEW
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV
Another example of sequence data in FASTA format

>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
Accession number
• Labels and identifies a sequence, making its information accessible.
• A string of 4 to 12 characters that is associated with a molecular sequence record.
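The GenBank and EMBL records in these slides pair the accession (AF115338) with a versioned form (AF115338.1). A minimal sketch for separating the two parts is given below. The function name is hypothetical, and the only assumption is the `accession.version` layout seen in those records.

```python
def split_accession_version(identifier):
    """Split 'AF115338.1' into ('AF115338', 1); version is None if absent."""
    accession, dot, version = identifier.strip().partition(".")
    if not accession or (dot and not version.isdigit()):
        raise ValueError("not an accession[.version] string: " + identifier)
    return accession, (int(version) if dot else None)
```

The same function works for the protein identifier in the CDS feature (AAD34329.1), since it follows the same layout.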
Reference Sequence (RefSeq) Project

• GenBank: archival database (highly redundant).
• RefSeq: one entry corresponding to a given gene or gene product (splice variants or distant loci → several RefSeq entries).
• Non-redundant: one record for each gene, or each splice variant, from each organism represented.
• Each record is intended to present an encapsulation of the current understanding of a gene or protein, similar to a review article.
Formats of Accession Numbers for RefSeq Entries
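RefSeq accessions carry a two-letter prefix and an underscore that indicate the molecule type. The sketch below maps a few commonly seen prefixes to their meaning. The list is deliberately partial, the function name is hypothetical, and the example accession digits are made up for illustration.

```python
# A few RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XR": "predicted non-coding RNA",
    "XP": "predicted protein",
}

def refseq_molecule_type(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None                    # no underscore: not RefSeq-style
    return REFSEQ_PREFIXES.get(prefix.upper())
```

An accession such as `NM_123456.1` (illustrative digits) is reported as mRNA, while a GenBank accession such as `AF115338.1` has no underscore and returns `None`.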
The J. Craig Venter Institute (JCVI)

• Non-profit genomics research institute founded in 2006.
• Consolidation of four organizations:
• Center for the Advancement of Genomics,
• The Institute for Genomic Research (TIGR),
• Institute for Biological Energy Alternatives, and
• J. Craig Venter Science Foundation Joint Technology Center.
• https://www.jcvi.org/
• Genomics and the societal implications of genomics
• genomic medicine; environmental genomic analysis; clean energy;
synthetic biology; and ethics, law, and economics