
16/5/2023

SIJ1004

Class 12
Molecular Sequence Databases

What is a database?

• A database is a computerized archive used to store and organize data
so that information can be retrieved by a variety of search criteria.
• Biological databases are often used for knowledge discovery.
• A good database needs to be:
• structured with minimum redundancy
• searchable (data retrieval)
• updated periodically
• cross-referenced with other databases
• equipped with tools for analysis and visualization
• Data that is processed, organized, structured and presented in a useful context
→ information
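
The properties listed above (structured storage, low redundancy, retrieval by search criteria, cross-references) can be illustrated with a small relational sketch. The table and column names are hypothetical and chosen only for illustration; the single row uses values from the GenBank record AF115338 that appears in these notes.

```python
import sqlite3

# In-memory database; the schema below is an illustrative assumption,
# not the layout of any real sequence database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE sequence_record (
           accession TEXT PRIMARY KEY,   -- unique key keeps redundancy low
           organism  TEXT NOT NULL,
           length_bp INTEGER NOT NULL,
           xref      TEXT                -- cross-reference to another database
       )"""
)
conn.execute(
    "INSERT INTO sequence_record VALUES (?, ?, ?, ?)",
    ("AF115338", "Pseudomonas fluorescens", 591, "taxon:294"),
)


def find_by_organism(name):
    """Retrieve records by one search criterion (the organism name)."""
    cur = conn.execute(
        "SELECT accession, length_bp FROM sequence_record WHERE organism = ?",
        (name,),
    )
    return cur.fetchall()


rows = find_by_organism("Pseudomonas fluorescens")
print(rows)
```

Other search criteria (length range, cross-reference) would be further WHERE clauses on the same table.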


Databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 ->1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> ?
• WWW 1993 -> ?
• DVD 2001 -> ?

Biological Database
• Libraries related to biological data and information
• Data from scientific experiments
• Published literature
• High-throughput experiment technologies
• Computational analysis


What is molecular sequence?


• DNA
• RNA
• Protein

Some important biological databases in bioinformatics

Database    URL                          Content
GenBank     ncbi.nlm.nih.gov             Nucleotide sequences
Ensembl     ensembl.org                  Human/mouse genome
PubMed      ncbi.nlm.nih.gov             Literature references
NR          ncbi.nlm.nih.gov             Protein sequences
UniProt     uniprot.org/                 Protein sequences
InterPro    ebi.ac.uk                    Protein domains
OMIM        ncbi.nlm.nih.gov             Genetic diseases
Expasy      expasy.org                   Enzymes
PDB         rcsb.org/pdb/                Protein structures
KEGG        genome.ad.jp                 Metabolic pathways
MINT        mint.bio.uniroma2.it/        Protein-protein interactions
GeneNet     genenetwork.org/             Gene networks
Transpath   genexplain.com/transpath     Signal transduction pathways


Central dogma of molecular biology


• The flow of information: DNA → RNA → protein

[Figure: DNA (replication) → transcription → RNA → translation → protein → biological activity (function) → metabolite; reverse transcription (RNA → DNA) occurs in retroviruses.]
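
The DNA → RNA → protein flow can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and a coding-strand input; the function names are hypothetical, and the input is the first 18 bases of the sigX sequence from the GenBank record shown in these notes.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """DNA coding strand -> mRNA (T is replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein, reading codons until a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


dna = "atgaataaagcccaaacg"
mrna = transcribe(dna)
print(mrna)             # AUGAAUAAAGCCCAAACG
print(translate(mrna))  # MNKAQT
```

The result matches the first six residues of the /translation qualifier in that record.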

Omics world

Central dogma: Genome → (transcription) → Transcriptome → (translation) → Proteome → (function) → Metabolome

Molecule set              Field
Genome (DNA)              Genomics          Genotype
Transcriptome (RNA)       Transcriptomics
Proteome (Protein)        Proteomics        Phenotype
Metabolome (Metabolite)   Metabolomics


Types of data generated by molecular biology research
• Molecular sequence:
• Nucleotide sequence (DNA or RNA)
• Protein (amino acids) sequence
• Genome sequence (complete or draft).
• Gene expression
• Gene polymorphism (variation)
• Macromolecular 3D structure
• Protein sequence patterns or motifs
• Metabolic pathways

Genome size (haploid)

Organism                       Genome size (bp)
Phi-X 174                      5,386
Human Mitochondrion            16,569
Mycoplasma genitalium          490,885
Rickettsia prowazekii          1,111,523
Haemophilus influenzae         1,830,138
Streptococcus pneumoniae       2,160,837
Vibrio cholerae                4,033,460
Mycobacterium tuberculosis     4,411,532
Bacillus subtilis              4,214,814
Escherichia coli               4,639,211
Escherichia coli O157:H7       5,440,000
Saccharomyces cerevisiae       12,495,682
Caenorhabditis elegans         100,258,171
Arabidopsis thaliana           115,409,949
Drosophila melanogaster        122,653,977
Anopheles gambiae              178,244,063
Homo sapiens                   3,200,000,000
Oryza sativa                   3,900,000,000

The haploid human genome (23 chromosomes) is estimated to be about 3.2 billion bases long.


Primary Databases
• Experimental results are submitted by researchers directly into the
database (original submissions)
• Content controlled by the submitter
• Three major databases for nucleotide sequences:
• GenBank
http://www.ncbi.nlm.nih.gov/Genbank/
• European Molecular Biology Laboratory (EMBL)
www.embl.org
• DNA Databank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
• Sequences are exchanged on a daily basis


• US:
• NIH (National Institutes of Health) → NLM (National Library of
Medicine) → NCBI (National Center for Biotechnology Information) →
GenBank (database)
• European:
• EMBL (European Molecular Biology Laboratory) → EBI (European
Bioinformatics Institute) → ENA (European Nucleotide Archive)


Primary sequence databases

[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank; GenBank records are updated ONLY by submitters.]


Secondary databases
• Significant processing of original raw data.
• Annotation.
• Functional links.
• Carefully curated database (manually or automatically).
• High quality.
• SWISS-PROT, TrEMBL and PIR combined in UniProt.
• http://www.uniprot.org/
• Incorporates:
• Function of the protein
• Subcellular localization of protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.


Primary vs. Secondary sequence databases


[Figure: labs and sequencing centers submit sequences (e.g. TATAGCCG) to GenBank, which is updated ONLY by submitters. Curators, genome assembly and algorithms derive RefSeq and UniGene from GenBank; these are updated continually by NCBI.]

Specialized Databases
• Often focused on a specific aspect of an organism.
• Curated by experts.
• Highly annotated and processed data.


Organism-specific genomic databases


Organism                   Database/resource                                        URL
Escherichia coli           EcoGene                                                  http://bmb.med.miami.edu/EcoGene/EcoWeb
                           EcoCyc (Encyclopedia of E. coli genes and metabolism)    http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
                           Colibri                                                  http://genolist.pasteur.fr/Colibri
Bacillus subtilis          SubtiList                                                http://genolist.pasteur.fr/SubtiList
Saccharomyces cerevisiae   Saccharomyces Genome Database (SGD)                      http://genome-www.stanford.edu/Saccharmyces
Plasmodium falciparum      PlasmoDB                                                 http://PlasmoDB.org
Arabidopsis thaliana       MIPS Arabidopsis thaliana Database (MAtDB)               http://mips.gsf.de/proj/thal/db
                           The Arabidopsis Information Resource (TAIR)              http://www.arabidopsis.org
Drosophila melanogaster    FlyBase                                                  http://flybase.bio.indiana.edu
Caenorhabditis elegans     A C. elegans DataBase (ACeDB)                            http://www.acedb.org
Mouse                      Mouse Genome Database (MGD)                              http://www.informatics.jax.org
Human                      Human Genome Organization (HUGO)                         http://www.hugo-international.org/

Nucleotide sequence databases


• Primary databases: GenBank, EMBL and DDBJ
• These three databases contain mainly the same information (with a few
differences in format and syntax)
• Serve as archives containing all sequences (single genes, ESTs,
complete genomes, etc.) derived from:
• Genome projects and sequencing centers
• Individual scientists
• Patent offices (e.g. USPTO, EPO)
• Non-confidential data are exchanged daily
• Size: more than 2.5 × 10^7 sequences, over 3.2 × 10^10 bp
• Sequences from > 500,000 different species


GenBank

• Public nucleotide sequence database
• Development of software tools for sequence analysis


GenBank Flat File (sections: Header with title, taxonomy and citation; Features (seq); DNA Sequence)

LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
  PUBMED    10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
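
The layout of a GenBank flat file (keyword at the start of a line, indented continuation lines, numbered sequence lines after ORIGIN) lends itself to a simple line-based reader. The following is a minimal sketch under stated assumptions: `parse_genbank` is a hypothetical helper, it handles only top-level header keywords and the ORIGIN section (not the FEATURES table), and the embedded excerpt is shortened to the first 120 bases. The `//` line is the usual end-of-record marker, which the excerpt in these notes does not reach.

```python
RECORD = """\
LOCUS       AF115338      591 bp    DNA     linear   BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete
            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
//
"""


def parse_genbank(text):
    """Return (header fields, sequence) from a GenBank-style flat file."""
    fields, blocks = {}, []
    key, in_origin = None, False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("//"):          # end of record
            break
        if in_origin:
            # Drop the leading position number, keep the blocks of bases.
            blocks.extend(line.split()[1:])
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif not line[0].isspace():        # a new top-level keyword
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
        elif key is not None:              # continuation of the previous keyword
            fields[key] += " " + line.strip()
    return fields, "".join(blocks)


fields, sequence = parse_genbank(RECORD)
print(fields["ACCESSION"], len(sequence))
```

For real work, an established parser such as Biopython's `SeqIO` would normally be preferred over a hand-written reader.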


Who can put data into GenBank?

Sequence data are submitted to GenBank by scientists from around the world.

Warning: GenBank does not check the validity or accuracy of submitted
sequences. This is left to the scientific community to verify, as with
all published scientific data.


European Molecular Biology Laboratory


EMBL Flat File (sections: Header with title, taxonomy and citation; Features (seq); DNA Sequence)

ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF115338.1
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP   1-591
RX   MEDLINE; 99369842.
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16):4746-4754(1999).
RN   [2]
RP   1-591
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source          1..591
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS             1..591
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD34329.1"
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
     gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg       420
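
Because every EMBL line starts with a two-letter line code (ID, AC, DE, SQ, ...), the same record can be read by grouping lines on that code. The sketch below converts an EMBL-style entry to FASTA. It is only an illustration: `embl_to_fasta` is a hypothetical helper, the excerpt is shortened to the first 120 bases, and the `//` terminator is the usual end-of-entry marker, which the excerpt in these notes does not reach.

```python
EMBL_EXCERPT = """\
ID   AF115338   standard; DNA; PRO; 591 BP.
AC   AF115338;
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
OS   Pseudomonas fluorescens
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
//
"""


def embl_to_fasta(text, width=60):
    """Convert one EMBL-style entry to FASTA (header built from AC and DE)."""
    by_code = {}      # two-letter line code -> list of line contents
    blocks = []       # blocks of bases from the sequence section
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end of entry
            break
        code, rest = line[:2], line[2:].strip()
        if in_sequence:
            # Keep the base blocks, skip the running position at line end.
            blocks.extend(tok for tok in rest.split() if tok.isalpha())
        elif code == "SQ":
            in_sequence = True
        else:
            by_code.setdefault(code, []).append(rest)
    accession = by_code["AC"][0].split(";")[0]
    description = " ".join(by_code["DE"])
    seq = "".join(blocks)
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([f">{accession} {description}"] + wrapped)


fasta = embl_to_fasta(EMBL_EXCERPT)
print(fasta)
```

The sequence is wrapped at 60 characters per line, which keeps every sequence line under the 80-character limit commonly recommended for FASTA.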


DDBJ - DNA Data Bank of Japan


Genome database
• What Genomes Are Available?
• List of completed genomes increases almost every week
• GOLD Website: Listing of finished and “in progress” genomes
http://www.genomesonline.org/


Genome database in NCBI


https://www.ncbi.nlm.nih.gov/genome/viruses/


Genome database at EMBL-EBI


SRA database


Protein Databases
• Protein sequence databases
• Protein properties
• Protein localization and targeting
• Protein sequence motifs and active sites
• Protein domain databases; protein classification
• Databases of individual protein families
• Protein structure database


Protein Information Resource

• https://pir.georgetown.edu/
• The oldest universal curated protein sequence database.
• Published as the ‘Atlas of Protein Sequence and Structure’ from 1965 to 1978 by the late Margaret O. Dayhoff.
• Established in 1984 as a successor to the original National Biomedical Research Foundation Protein Sequence Database.


SWISS-PROT
• https://web.expasy.org/docs/swiss-prot_guideline.html
• Manually curated, non-redundant protein sequence database.
• Highly integrated with other databases.


TrEMBL
• TrEMBL (Translation from EMBL) database (http://www.ebi.ac.uk/trembl/)
• Automatically annotated entries derived from the translation of all coding
sequences in the DDBJ/EMBL/GenBank nucleotide sequence databases
that are not yet included in Swiss-Prot.
• http://www.bioinfo.pte.hu/more/TrEMBL.htm


UniProt: the next generation of protein sequence databases
• Combines the Swiss-Prot, TrEMBL and PIR-PSD databases into a single
resource, UniProt (http://www.uniprot.org)
• UniProt Knowledgebase: continues the work of Swiss-Prot, TrEMBL and PIR
by providing an expertly curated database.
• UniProt Archive (UniParc): new and updated sequences are loaded on a
daily basis.
• UniProt non-redundant reference databases (UniProt NREF, now called
UniRef): provide non-redundant views on top of the UniProt Knowledgebase
and UniParc.


UniProt


NCBI Protein database


• http://www.ncbi.nlm.nih.gov/protein


FASTA
• Begins with a single-line description, followed (after a newline character) by
lines of sequence data.
• The description line (defline) is distinguished from the sequence data by a
greater-than (">") symbol at the beginning.
• It is recommended that all lines of text be shorter than 80 characters in length.

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
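
The description line of the example above packs several identifiers into one string separated by "|". A minimal sketch of how it could be split is shown below. The function name `parse_defline` and the assumed field layout (gi, number, source database, accession, entry name) are taken from this one example only; other deflines are returned unsplit.

```python
def parse_defline(defline):
    """Split a FASTA description line into identifier fields and description."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA description line must start with '>'")
    identifier, _, description = defline[1:].strip().partition(" ")
    parts = identifier.split("|")
    # Assumed layout: gi|<number>|<db>|<accession>|<entry name>
    if len(parts) != 5 or parts[0] != "gi":
        return {"id": identifier, "description": description}
    return {
        "gi": parts[1],
        "db": parts[2],
        "accession": parts[3],
        "entry_name": parts[4],
        "description": description,
    }


info = parse_defline(">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4")
print(info["accession"], info["entry_name"])  # P03069 GCN4_YEAST
```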


FASTA

> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIV
DTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTI
TSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEW
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV


Another example of sequence data in FASTA format


>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
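
A file with several records, such as the one above, can be read by starting a new record at every ">" line and joining the sequence lines that follow it (seq1 above is wrapped over two lines). The reader below is a minimal sketch; `read_fasta` is a hypothetical helper name, and the sample contains only the first two records.

```python
def read_fasta(text):
    """Return a dict mapping each record's description to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                     # ignore blank lines
        if line.startswith(">"):
            name = line[1:].strip()      # description line starts a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)   # sequence line of the current record
    return {n: "".join(chunks) for n, chunks in records.items()}


SAMPLE = """\
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME
LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
"""

seqs = read_fasta(SAMPLE)
for name, seq in seqs.items():
    print(name, len(seq))
```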


Accession number
• To label and identify a sequence → accessible information.
• A string of 4 to 12 characters that is associated with a
molecular sequence record.
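
Identifiers such as AF115338.1 (the VERSION line of the GenBank record in these notes) combine an accession number with a version suffix. The sketch below separates the two. The 4 to 12 character length comes from the slide; the allowed character set (letters, digits, underscore) and the name `split_accession` are assumptions made for illustration.

```python
import re

# Accession of 4 to 12 characters, optionally followed by ".<version>".
# The character class is an assumption for this sketch.
ACCESSION_VERSION = re.compile(
    r"^(?P<accession>[A-Za-z0-9_]{4,12})(?:\.(?P<version>\d+))?$"
)


def split_accession(identifier):
    """Return (accession, version); version is None when no suffix is given."""
    match = ACCESSION_VERSION.match(identifier.strip())
    if match is None:
        raise ValueError(f"not a recognised accession: {identifier!r}")
    version = match.group("version")
    return match.group("accession"), (int(version) if version is not None else None)


print(split_accession("AF115338.1"))  # ('AF115338', 1)
print(split_accession("Q9X4L7"))      # ('Q9X4L7', None)
```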


Reference Sequence (RefSeq) Project


• GenBank: archival database (highly redundant).
• RefSeq: one entry corresponds to a given gene or gene product (splice
variants or distant loci → several RefSeq entries).
• Non-redundant; one record for each gene, or each splice variant, from
each organism represented
• Each record is intended to present an encapsulation of the current
understanding of a gene or protein, similar to a review article

Formats of Accession Numbers for RefSeq Entries
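
RefSeq accessions are recognisable by a two-letter prefix followed by an underscore. The table from the slide is not reproduced in this text, so the lookup below is a sketch based on commonly documented NCBI prefixes; it covers only a subset, and the function name `classify_refseq` is hypothetical.

```python
# Subset of RefSeq accession prefixes and the molecule type they indicate.
REFSEQ_PREFIXES = {
    "NC_": "genomic: complete genomic molecule",
    "NG_": "genomic: incomplete genomic region",
    "NT_": "genomic: contig or scaffold",
    "NW_": "genomic: contig or scaffold (mostly whole-genome shotgun)",
    "NM_": "mRNA: protein-coding transcript",
    "NR_": "RNA: non-protein-coding transcript",
    "NP_": "protein",
    "XM_": "mRNA: predicted (model) protein-coding transcript",
    "XR_": "RNA: predicted (model) non-protein-coding transcript",
    "XP_": "protein: predicted (model)",
}


def classify_refseq(accession):
    """Describe a RefSeq accession by its prefix; None if the prefix is unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(classify_refseq("NM_000546"))  # mRNA: protein-coding transcript
print(classify_refseq("AF115338"))   # None (a GenBank accession, not RefSeq)
```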


The J. Craig Venter Institute (JCVI)


• Non-profit genomics research institute founded in 2006.
• Consolidation of four organizations:
• Center for the Advancement of Genomics,
• The Institute for Genomic Research (TIGR),
• Institute for Biological Energy Alternatives, and
• J. Craig Venter Science Foundation Joint Technology Center.
• https://www.jcvi.org/
• Genomics and the societal implications of genomics
• genomic medicine; environmental genomic analysis; clean energy;
synthetic biology; and ethics, law, and economics
