Class12 Biological Database
SIJ1004
Class 12
Molecular Sequence Databases
What is a database?
16/5/2023
Databases
• Books, articles 1968 -> 1985
• Computer tapes 1982 -> 1992
• Floppy disks 1984 -> 1990
• CD-ROM 1989 -> ?
• FTP 1989 -> ?
• On-line services 1982 -> ?
• WWW 1993 -> ?
• DVD 2001 -> ?
Biological Database
• Libraries related to biological data and information
• Data from scientific experiments
• Published literature
• High-throughput experiment technologies
• Computational analysis
Omics world
[Figure labels: Genome (DNA), Genomics, Genotype; Transcription; Function; Metabolome (Metabolite), Metabolomics]
Central Dogma
[Figure labels: DNA replication; transcription; translation; reverse transcription in retrovirus; biological activity (function); metabolite]
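The transcription and translation steps of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the function names `transcribe` and `translate` are hypothetical, the codon table is the standard genetic code, and the input is the first 60 bases of the sigX coding sequence from GenBank record AF115338 used later in these slides.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA codon by codon, stopping at the first stop codon."""
    letters = mrna.upper().replace("U", "T")  # the table is keyed by DNA letters
    protein = []
    for pos in range(0, len(letters) - 2, 3):
        amino_acid = CODON_TABLE[letters[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First 60 bases of the sigX CDS (GenBank AF115338), spaces removed.
CODING_DNA = (
    "atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag"
).replace(" ", "")

mrna = transcribe(CODING_DNA)
print(mrna[:12])        # AUGAAUAAAGCC
print(translate(mrna))  # MNKAQTLSTRYDPRELSDEE
```

The printed peptide matches the start of the /translation qualifier in the GenBank record, which is a quick way to check that the reading frame is right.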
Primary Databases
• Original submissions: researchers submit experimental results directly into the database
• Content controlled by the submitter
• Three major databases for nucleotide sequences:
• GenBank
http://www.ncbi.nlm.nih.gov/Genbank/
• European Molecular Biology Laboratory (EMBL)
www.embl.org
• DNA Databank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
• Sequences are exchanged on a daily basis
• US:
• NIH (National Institutes of Health) -> NLM (National Library of Medicine) -> NCBI (National Center for Biotechnology Information) -> GenBank (database)
• European:
• EMBL (European Molecular Biology Laboratory) -> EBI (European Bioinformatics Institute) -> ENA (European Nucleotide Archive)
[Figure: labs and sequencing centers submit sequences (TATAGCCG ...) to GenBank, which is updated ONLY by submitters]
Secondary databases
• Significant processing of original raw data.
• Annotation.
• Functional links.
• Carefully curated database (manually or automatically).
• High quality.
• SWISS-PROT, TrEMBL and PIR are combined in UniProt.
• http://www.uniprot.org/
• Incorporates:
• Function of the protein
• Subcellular localization of protein
• Post-translational modification
• Domains and sites
• Secondary structure
• Quaternary structure
• Similarities to other proteins
• Diseases associated with deficiencies in the protein
• Sequence conflicts, variants, etc.
[Figure labels: Sequencing Centers; Curators; Genome Assembly; Algorithms; GenBank (updated ONLY by submitters); UniGene; updated continually by NCBI]
Specialized Databases
• Often focused on a specific aspect of an organism.
• Curated by experts.
• Highly annotated and processed data.
GenBank
• Public nucleotide sequence database
• Development of software tools for sequence analysis
GenBank Flat File
Slide annotations: Header (Title, Taxonomy, Citation); Features (seq); DNA Sequence

            cds.
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
REFERENCE   1  (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major
            outer membrane protein, OprF, in Pseudomonas aeruginosa and
            Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), 4746-4754 (1999)
  MEDLINE   99369842
   PUBMED   10438740
REFERENCE   2  (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics,
            Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source          1..591
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene            1..591
                     /gene="sigX"
     CDS             1..591
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD34329.1"
                     /db_xref="GI:4959392"
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT      157 a    133 c    170 g    131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
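The layout of a flat file (keyword lines, indented continuation lines, then numbered sequence lines after ORIGIN) can be read with a short parser. The following is a rough sketch using only the Python standard library: the function name `parse_flat_file` is hypothetical, the sample text is an excerpt of the AF115338 record, and the parser assumes continuation lines are indented by 12 spaces. It does not handle the FEATURES table. For real work, an established parser such as Biopython's `SeqIO` module is the safer choice.

```python
from collections import Counter

# Excerpt of the AF115338 record: a few header lines plus two sequence lines.
SAMPLE = """\
ACCESSION   AF115338
VERSION     AF115338.1  GI:4959391
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae;
            Pseudomonas.
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
"""


def parse_flat_file(text):
    """Return (header_fields, sequence) for GenBank-style flat-file text."""
    fields = {}
    sequence_blocks = []
    current_key = None
    in_origin = False
    for line in text.splitlines():
        if not line.strip() or line.startswith("//"):
            continue
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence_blocks.extend(line.split()[1:])
        elif line[:12].strip() == "":
            # Continuation line: append to the most recent keyword's value.
            if current_key is not None:
                fields[current_key] += " " + line.strip()
        else:
            parts = line.split(None, 1)
            keyword = parts[0]
            if keyword == "ORIGIN":
                in_origin = True
            else:
                current_key = keyword
                fields[keyword] = parts[1].strip() if len(parts) > 1 else ""
    return fields, "".join(sequence_blocks)


fields, sequence = parse_flat_file(SAMPLE)
print(fields["ACCESSION"])   # AF115338
print(fields["ORGANISM"])    # organism name followed by the taxonomy lineage
print(len(sequence))         # 120 bases in this excerpt
print(Counter(sequence))     # per-base counts, comparable to a BASE COUNT line
```

Run on a complete record, the per-base counts should add up to the record length (591 for AF115338, according to its BASE COUNT line).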
Genome database
• What Genomes Are Available?
• List of completed genomes increases almost every week
• GOLD Website: Listing of finished and “in progress” genomes
http://www.genomesonline.org/
https://www.ncbi.nlm.nih.gov/genome/viruses/
SRA database
Protein Databases
• Protein sequence databases
• Protein properties
• Protein localization and targeting
• Protein sequence motifs and active sites
• Protein domain databases; protein classification
• Databases of individual protein families
• Protein structure database
PIR (Protein Information Resource)
• https://pir.georgetown.edu/
• The oldest universal curated protein sequence database.
• Published as the ‘Atlas of Protein Sequence and Structure’ from 1965 to 1978 by the late Margaret O. Dayhoff.
• Established in 1984 as a successor to the original National Biomedical Research Foundation Protein Sequence Database.
SWISS-PROT
• https://web.expasy.org/docs/swiss-prot_guideline.html
• Manually curated, non-redundant protein sequence database.
• Highly integrated with other databases.
TrEMBL
• TrEMBL (Translation from EMBL) database (http://www.ebi.ac.uk/trembl/)
• Automatically annotated (not manually curated); derived from the translation of all coding sequences in the DDBJ/EMBL/GenBank nucleotide sequence databases that are not yet included in Swiss-Prot.
• http://www.bioinfo.pte.hu/more/TrEMBL.htm
UniProt
FASTA
• Begins with a single-line description, followed (after a newline character) by lines of sequence data.
• The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning.
• It is recommended that all lines of text be shorter than 80 characters.
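The FASTA rules above are simple enough to sketch as a writer and a reader. This is an illustrative sketch using only the Python standard library: the function names `format_fasta` and `parse_fasta`, the default line width of 70, and the defline text in the example are all assumptions made for the illustration.

```python
def format_fasta(description, sequence, width=70):
    """Return one FASTA record: a '>' defline, then wrapped sequence lines."""
    lines = [">" + description]
    # Wrap the sequence so every line stays under the recommended 80 characters.
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new defline closes the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical defline; the sequence is a made-up 200-base repeat.
record = format_fasta("demo_sequence example record", "ACGT" * 50)
print(record)
print(parse_fasta(record))
```

Parsing the formatted record gives back the original description and the unwrapped sequence, which makes a round trip a convenient sanity check.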
Accession number
• A label that identifies a sequence record and makes its information accessible.
• A string of 4 to 12 characters that is associated with a molecular sequence record.
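As a small illustration, an identifier such as AF115338.1 can be split into its accession and version parts, and the accession checked against common GenBank-style patterns. The sketch below is an assumption-laden example, not an official validator: the function names are hypothetical, and the patterns cover only the widely used layouts (one letter plus five digits, or two letters plus six or eight digits, for nucleotide records; three letters plus five or seven digits for protein records).

```python
import re

# Common GenBank-style accession layouts (assumed here; other formats exist).
NUCLEOTIDE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})$")
PROTEIN = re.compile(r"^[A-Z]{3}(?:\d{5}|\d{7})$")


def split_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


def classify(accession):
    """Return 'nucleotide', 'protein', or 'unknown' for a bare accession."""
    if NUCLEOTIDE.match(accession):
        return "nucleotide"
    if PROTEIN.match(accession):
        return "protein"
    return "unknown"


print(split_version("AF115338.1"))  # ('AF115338', 1)
print(classify("AF115338"))         # nucleotide
print(classify("AAD34329"))         # protein
```

Both example identifiers come from the AF115338 record: AF115338.1 is its VERSION and AAD34329.1 is the protein_id of its CDS.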
Formats of Accession Numbers for RefSeq Entries
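RefSeq accessions differ from GenBank ones in that they start with a two-letter prefix followed by an underscore, and the prefix indicates the molecule type. The lookup below is a sketch covering a handful of common prefixes only: the function name `describe_refseq` is hypothetical, the descriptions are paraphrased, and NM_000546.6 is used purely as an example identifier.

```python
# A subset of RefSeq accession prefixes and the molecule type each denotes.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XR": "predicted non-coding RNA (model)",
    "XP": "predicted protein (model)",
    "WP": "non-redundant protein",
}


def describe_refseq(accession):
    """Return the molecule type for a RefSeq accession, or None if unrecognized."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


print(describe_refseq("NM_000546.6"))  # mRNA
print(describe_refseq("AF115338"))     # None (GenBank-style, no underscore)
```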