02 Databases


4/03/13

Storing sequences
Databases and file formats

Reference
Zvelebil and Baum, Understanding Bioinformatics chapter 3

Molecular biology databases

Sequence databases

Annotated
Low-annotation
Specialized

Structural databases
Motif databases
Genome databases
Proteome databases
RNA expression
Literature
Populations
Mutations
Polymorphisms
Organisms
Pathways

Databases- terminology

Database- collection of information related to a specific subject (e.g. a phone book)
Record- an entry in a database (e.g. your entry in the phone book)
Field- a component of a record (e.g. your address & number)

Flat-file databases store data as text files


Flat-file databases

Pros
Easy to put together and distribute
No need for expensive or complicated database management software
Cons
Detailed targeted searching is difficult
Searching is not efficient
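
The inefficiency of flat-file searching can be sketched as follows. This is a minimal illustration, assuming a hypothetical tab-separated file with one record per line and three fields (code, name, organism); the codes and layout are invented for the example. Every query has to scan every line, because nothing in a plain text file points to the matching records.

```python
# Hypothetical flat file: one record per line, fields separated by tabs.
FLAT_FILE = """\
P1001\tinsulin\tHomo sapiens
P1002\tsomatostatin\tHomo sapiens
P1003\tphycocyanin\tCyanidium caldarium
"""

def search_flat_file(text, field_index, value):
    """Return every record whose chosen field equals value (linear scan)."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[field_index] == value:
            hits.append(fields)
    return hits

# Field 2 is the organism in this made-up layout.
human_records = search_flat_file(FLAT_FILE, 2, "Homo sapiens")
print(human_records)
```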


Relational databases contain interconnected tables


Relational databases

Require a Relational Database Management System (RDBMS)
Queried using SQL (or more commonly, a GUI front-end)


SELECT protab1.protein_name, protab2.protein_sequence FROM protab1, protab2 WHERE protab1.protein_code = protab2.protein_code AND protab1.protein_code = 'P1002';
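
A join of this kind can be tried without a database server using Python's built-in sqlite3 module. This is only a sketch: the table contents are invented, and the column names use underscores because hyphens are not valid in unquoted SQL identifiers.

```python
import sqlite3

# In-memory database with two interconnected tables (made-up contents).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protab1 (protein_code TEXT PRIMARY KEY, protein_name TEXT);
CREATE TABLE protab2 (protein_code TEXT PRIMARY KEY, protein_sequence TEXT);
INSERT INTO protab1 VALUES ('P1001', 'example protein A'), ('P1002', 'example protein B');
INSERT INTO protab2 VALUES ('P1001', 'MKTAYIAK'), ('P1002', 'MALWMRLL');
""")

# The two tables are linked through the shared protein_code field.
query = """
SELECT protab1.protein_name, protab2.protein_sequence
FROM protab1, protab2
WHERE protab1.protein_code = protab2.protein_code
  AND protab1.protein_code = ?;
"""
rows = conn.execute(query, ("P1002",)).fetchall()
print(rows)
conn.close()
```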

Data in a database

Primary data

e.g. DNA sequence, protein sequence, protein 3D structure coordinates

Annotations

e.g. Authors, literature references, protein function, organism of origin, location of coding regions in DNA sequence etc

Database Record Structure

A sequence database record contains both sequence and annotations
Record divided into 3 sections:

Header
Feature table
Sequence
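
The three sections of a record (header, feature table, sequence) can be separated with a few lines of code. The sketch below assumes GenBank-style keywords, where FEATURES opens the feature table, ORIGIN opens the sequence and // ends the record; the record itself is abbreviated and its sequence line is a toy placeholder.

```python
# Abbreviated GenBank-style record; the sequence line is a toy placeholder.
RECORD = """\
LOCUS       HUMSOMI  2667 bp  DNA  linear  PRI 13-JAN-1995
DEFINITION  Human somatostatin I gene and flanks.
ACCESSION   J00306
FEATURES             Location/Qualifiers
     source          1..2667
     CDS             join(1231..1368,2246..2458)
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""

def split_record(text):
    """Split a record into header, feature table and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith("ORIGIN"):
            current = "sequence"
        elif line.startswith("//"):
            break  # end-of-record marker
        sections[current].append(line)
    return sections

sections = split_record(RECORD)
for name, lines in sections.items():
    print(name, len(lines))
```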

GenBank Anatomy: Header



LOCUS       HUMSOMI                 2667 bp    DNA     linear   PRI 13-JAN-1995
DEFINITION  Human somatostatin I gene and flanks.
ACCESSION   J00306
VERSION     J00306.1  GI:338287
KEYWORDS    neuropeptide Y; somatostatin; somatostatin I; somatostatin-14;
            somatostatin-28.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1126 to 1368; 2246 to 2605)
  AUTHORS   Shen,L.P., Pictet,R.L. and Rutter,W.J.
  TITLE     Human somatostatin I: sequence of the cDNA
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 79 (15), 4575-4579 (1982)
   PUBMED   6126875
REFERENCE   2  (bases 1 to 2667)
  AUTHORS   Shen,L.P. and Rutter,W.J.
  TITLE     Sequence of the human somatostatin I gene
  JOURNAL   Science 224 (4645), 168-171 (1984)
   PUBMED   6142531
COMMENT     Original source text: Human fetal liver DNA, Charon 4A library,
            clone pHSI-1-2.7 [2], and pancreatic somatostatinoma tissue,

GenBank Anatomy: Features

FEATURES             Location/Qualifiers
     source          1..2667
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /map="3q28"
     prim_transcript 1126..2605
                     /note="som I mRNA"
     CDS             join(1231..1368,2246..2458)
                     /note="preprosomatostatin I"
                     /codon_start=1
                     /protein_id="AAA60566.1"
                     /db_xref="GI:338288"
                     /translation="MLSCRLQCALAALSIVLALGCVTGAPSDPRLRQFLQKSLAAAAG
                     KQELAKYFLAELLSEPNQTENDALEPEDLSQAAEQDEMRLELQRSANSNPAMAPRERK
                     AGCKNFFWKTFTSC"
     sig_peptide     1231..1302
                     /note="prosomatostatin I signal peptide"
     mat_peptide     2372..2455
                     /product="somatostatin-28 peptide"
     mat_peptide     2414..2455
                     /product="somatostatin-14 peptide"
     gene            1231..1368
                     /gene="SST"
     exon            <1231..1368
                     /gene="SST"
                     /note="preprosomatostatin I; G00-119-604"
                     /number=1
     intron          1369..2245
                     /note="som I cds intron A"
     exon            2246..>2458
                     /note="preprosomatostatin I"
                     /number=2
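
Feature locations such as the CDS location of this record, join(1231..1368,2246..2458), can be turned into coordinates with a regular expression. The sketch below is an assumption-laden simplification: it handles only plain start..end ranges (optionally marked as partial) and ignores other location forms such as complement() or single positions.

```python
import re

def parse_location(location):
    """Return (start, end) pairs from a location like join(1231..1368,2246..2458)."""
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]

def total_length(segments):
    """Number of bases covered; positions are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)

cds = parse_location("join(1231..1368,2246..2458)")
cds_length = total_length(cds)
print(cds, cds_length)
```

The two exons cover 138 and 213 bases, so the coding sequence is 351 bases long, which is a whole number of codons.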


GenBank Anatomy: Sequence



ORIGIN      Chromosome 3q28; 1 bp upstream of EcoRI site.
[2667 bases of sequence, 60 per line in blocks of 10, each line prefixed with the position of its first base; see accession J00306]
//

Accessing sequence databases

Searching the header

Searching the annotations for keywords (organism, gene name etc)

Searching the sequences

Searching for sequences similar to a query sequence using programs such as BLAST
Searching for sequences containing particular patterns

Using the right words: ontologies

[Figure: ontology hierarchy under MOLECULAR FUNCTION, with the terms: nucleic acid binding; enzyme; adenosine triphosphatase; DNA binding; helicase; DNA helicase; chromatin binding; ATP-dependent helicase; DNA-dependent adenosine triphosphatase; ATP-dependent DNA helicase]
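
Searching sequences for a particular pattern, one of the access routes listed under "Searching the sequences", can be sketched with a regular expression. The example looks for the EcoRI recognition site GAATTC in a made-up toy sequence; the function name and the sequence are illustrative only.

```python
import re

def find_pattern(sequence, pattern):
    """Return the 1-based start position of every match of a regular expression."""
    return [m.start() + 1 for m in re.finditer(pattern, sequence, re.IGNORECASE)]

# Made-up toy sequence, not taken from any database record.
toy = "ttgaattcaaggacagaattcgg"
ecori_sites = find_pattern(toy, "GAATTC")
print(ecori_sites)
```

Note that re.finditer reports non-overlapping matches only, which is enough for a fixed site like this one but would miss overlapping occurrences of a repetitive pattern.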

Finding databases for bioinformatics



Google is your friend (but be critical!)
Nucleic Acids Research annual database supplement

Nucleotide sequence repositories



Central repositories for all known public nucleotide sequences
Annotations and sequences are entered and curated by submitters

Quality control issues
Lack of consistency of annotations


Nucleotide sequence repositories



Main repositories:

GenBank (US)
EMBL (Europe)
DDBJ (Japan)

All 3 databases exchange data daily and should contain the same sequences
Databases differ in their format and in the services they offer for searching and submission

Genbank

Currently maintained by the National Center for Biotechnology Information (part of the National Library of Medicine) in Bethesda, MD
Database available for download and searching using Entrez and BLAST
http://www.ncbi.nlm.nih.gov

EMBL

Currently maintained by the European Bioinformatics Institute in Hinxton, UK
Available for download and search using SRS, BLAST, fasta
http://www.ebi.ac.uk

DDBJ (DNA database of Japan)



National Institute of Genetics, Japan
Available for download and search using SRS, BLAST, fasta etc
http://www.ddbj.nig.ac.jp/

Nucleotide sequence data



Genomic DNA (whole or partial genomes)
cDNA and mRNA
ESTs

Genbank divisions

1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTGS sequences (high throughput genomic sequences)
17. HTC - HTC sequences (high throughput cDNA sequences)
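
The division code appears on the LOCUS line of each record. A minimal sketch of looking it up, assuming the simplified layout used in these slides where the division is the token just before the date at the end of the line:

```python
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "PHG": "bacteriophage sequences",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
    "EST": "EST sequences (expressed sequence tags)",
    "PAT": "patent sequences",
    "STS": "STS sequences (sequence tagged sites)",
    "GSS": "GSS sequences (genome survey sequences)",
    "HTG": "HTGS sequences (high throughput genomic sequences)",
    "HTC": "HTC sequences (high throughput cDNA sequences)",
}

def division_of(locus_line):
    """Return (code, description), assuming the division precedes the final date."""
    code = locus_line.split()[-2]
    return code, DIVISIONS.get(code, "unknown division")

print(division_of("LOCUS HUMSOMI 2667 bp DNA linear PRI 13-JAN-1995"))
```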


EST

Expressed Sequence Tags (ESTs) are short (usually about 300-500 bp), single-pass sequence reads from mRNA (cDNA). Typically they are produced in large batches. They represent a snapshot of genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library.

EST entry

LOCUS       T12742        157 bp    mRNA             EST       28-OCT-1993
DEFINITION  zEST00149-5 Zea mays cDNA clone csuh00149/umc382 5' end similar
            to similar to short chain alcohol dehydrogenase.
ACCESSION   T12742
NID         g409680
KEYWORDS    EST.
SOURCE      Maize clone=csuh00149/umc382 library=Maize Leaf, Stratagene
            #937005 strain=B73 vector=Uni-ZAP primer=SK Rsite1=EcoR1
            Rsite2=Xho1 mRNA isolated from illuminated leaves and sheaths
            of 5 week old plant. cDNA directionally cloned into vector. .
  ORGANISM  Zea mays
            Eucaryotae; Embryophyta; Magnoliophyta; Liliopsida; Cyperales;
            Poaceae; Zea.
REFERENCE   1  (bases 1 to 157)
  AUTHORS   Baysdorfer,C.
  TITLE     The Maize cDNA Program
  JOURNAL   Unpublished (1993)
COMMENT     Contact: Baysdorfer C
            California State University
            Dept Biol Sci, California State Univ, Hayward, CA 94542
            Tel: 5108813459
            Fax: 5107272035
            Email: cbaysdor@s1.csuhayward.edu.
FEATURES             Location/Qualifiers
     source          1..157
                     /organism="Zea mays"
                     /clone="csuh00149/umc382"
                     /strain="B73"
BASE COUNT       33 a     42 c     51 g     26 t      5 others
ORIGIN
        1 CCTCAAGGGC GTCGACNNNA TGCCCGAGGA CGTCGCCCAG GNNGTGCTCT
       51 ACCTGGCCAG CGACGAGGCG AGGTACGTCA GCGCGGTCAA CCTCATGGTG
      101 GACGGAGGCT TCACAGCCGT AAACAATAAC CTCAGGGCGT TTGAGGATTA
      151 GTTGAGG
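
The BASE COUNT line of an entry can be checked against its sequence. A minimal sketch using the 157-base sequence of the EST entry above; the ambiguous N positions, typical of single-pass reads, are what the record counts as "others".

```python
from collections import Counter

# Sequence of EST entry T12742, copied from its ORIGIN section.
EST_SEQUENCE = (
    "CCTCAAGGGC GTCGACNNNA TGCCCGAGGA CGTCGCCCAG GNNGTGCTCT "
    "ACCTGGCCAG CGACGAGGCG AGGTACGTCA GCGCGGTCAA CCTCATGGTG "
    "GACGGAGGCT TCACAGCCGT AAACAATAAC CTCAGGGCGT TTGAGGATTA "
    "GTTGAGG"
).replace(" ", "")

counts = Counter(EST_SEQUENCE.lower())
# Anything that is not a, c, g or t is reported as "others".
others = sum(n for base, n in counts.items() if base not in "acgt")

print(len(EST_SEQUENCE), dict(counts), others)
```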

Protein sequence databases



Genbank proteins/TrEMBL
SWISSPROT
PIR
NRL3D
UniProt

Genbank protein, TrEMBL



Translations of CDS features in Genbank or EMBL
Limited annotations (annotations are usually the nucleotide annotations)
Most up-to-date
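
Translating a CDS feature into protein can be sketched with the standard genetic code. The example is a simplified illustration, not the procedure the databases themselves use: it assumes the standard code, reads from the first base, and stops at the first stop codon. The test fragment is the first 24 bases of the preproinsulin coding sequence.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

peptide = translate("atggccctgtggatgcgcctcctg")
print(peptide)
```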

Genbank Protein entry

LOCUS       CYNPCP_1
DEFINITION  Cyanidium caldarium phycocyanin beta-subunit (cpcB) and cpcA
            genes, complete cds.
NID         g304585
DATE        21-Apr-1996
ACCESSION   L13467
ORGANISM    Cyanidium caldarium
COMMENT     Nucleic Acid Features translated to generate this entry:
            CDS             483..1001
                            /gene="cpcB"
                            /standard_name="phycocyanin"
                            /codon_start=1
                            /function="light harvesting"
                            /evidence=experimental
                            /product="phycocyanin beta subunit"
                            /db_xref="PID:g304585"
AMINO       Complete
CARBOXY     Complete
CHECKSUM    137230
WEIGHT      18237.53
LENGTH      172 aa
ORIGIN

Composition
 27 Ala A     7 Gln Q    14 Leu L    13 Ser S
 11 Arg R     7 Glu E     5 Lys K     9 Thr T
  8 Asn N    12 Gly G     6 Met M     0 Trp W
 12 Asp D     0 His H     4 Phe F     5 Tyr Y
  3 Cys C    12 Ile I     4 Pro P    13 Val V

Mol. Wt. Unmod. Chain = 18237.53     Number Of Residues = 172

    1 MLDAFAKVVA QADARGEFLS NTQLDALSKM VSEGNKRLDV VNRITSNASA
   51 IVTNAARALF SEQPQLIQPG GIAYTNRRMA ACLRDMEIIL RYVSYAIIAG
  101 DSSVLDDRCL NGLRETYQAL GVPGASVAVG IEKMKDSAIA IANDPSGITT
  151 GDCSALMAEV GTYFDRAATA VQ

SWISSPROT

Maintained at the Swiss Institute of Bioinformatics and EBI
Manually curated: high-quality (but not perfect) annotations
Not as up to date
http://www.expasy.org


SWISSPROT entry

ID   PHCA_GALSU     STANDARD;      PRT;   162 AA.
AC   P00306;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-FEB-1996 (REL. 33, LAST SEQUENCE UPDATE)
DT   01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)
DE   C-PHYCOCYANIN ALPHA CHAIN.
GN   CPCA.
OS   GALDIERIA SULPHURARIA (CYANIDIUM CALDARIUM).
OG   CHLOROPLAST.
OC   EUKARYOTA; PLANTA; PHYCOPHYTA; RHODOPHYTA (RED ALGAE).
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=IIID2;
RX   MEDLINE; 95232204.
RA   TROXLER R.F., YAN Y., JIANG J.W., LIU B.;
RL   PLANT PHYSIOL. 107:985-994(1995).
CC   -!- FUNCTION: LIGHT-HARVESTING PHOTOSYNTHETIC BILE PIGMENT-PROTEIN
CC       FROM THE PHYCOBILIPROTEIN COMPLEX.
CC   -!- SUBUNIT: HETERODIMER OF AN ALPHA AND A BETA CHAIN.
CC   -!- PTM: CONTAINS ONE COVALENTLY LINKED BILIN CHROMOPHORE.
DR   EMBL; L13467; G304586; -.
DR   EMBL; S77125; G998372; -.
DR   PIR; A00314; CFKKA.
DR   HSSP; P07122; 1CPC.
KW   PHYCOBILISOME; ELECTRON TRANSPORT; PHOTOSYNTHESIS; BILE PIGMENT;
KW   CHLOROPLAST.
FT   BINDING      84     84       PHYCOCYANOBILIN CHROMOPHORE.
FT   CONFLICT     61     61       S -> Q (IN REF. 3).
FT   CONFLICT     95     95       V -> I (IN REF. 3).
FT   CONFLICT    101    101       V -> A (IN REF. 3).
SQ   SEQUENCE   162 AA;  17505 MW;  A4BF84C3 CRC32;

Composition
 26 Ala A     7 Gln Q    13 Leu L    12 Ser S
  8 Arg R     9 Glu E     5 Lys K     9 Thr T
 11 Asn N    12 Gly G     4 Met M     1 Trp W
  5 Asp D     1 His H     3 Phe F    12 Tyr Y
  2 Cys C     9 Ile I     6 Pro P     7 Val V

Mol. Wt. Unmod. Chain = 17505.47     Number Of Residues = 162

    1 MKTPITEAIA AADNQGRFLS NTELQAVNGR YQRAAASLEA ARSLTSNAER
   51 LINGAAQAVY SKFPYTSQMP GPQYASSAVG KAKCARDIGY YLRMVTYCLV
  101 VGGTGPMDEY LIAGLEEINR TFDLSPSWYV EALNYIKANH GLSGQAANEA
  151 NTYIDYAINA LS

TrEMBL and SWISS-PROT

[Figure: EMBL (DNA) is auto-translated into TrEMBL. Automatic sorting splits TrEMBL into REM-TrEMBL (short peptide fragments, immunoglobulins, T-cell receptors, patented sequences, synthetic peptides, non-protein DNA translations, others) and SP-TrEMBL. SP-TrEMBL entries enter SWISS-PROT after annotation.]

UniProt

Unified protein database incorporating PIR, SWISS-PROT, TrEMBL
http://www.uniprot.org

UniProt components

UniProt Knowledgebase (UniProtKB)

central access point for extensive curated protein information, including function, classification, and cross-reference

UniProt Non-redundant Reference (UniRef)



combines closely related sequences into a single record to speed searches

UniProt Archive (UniParc)



comprehensive repository, reflecting the history of all protein sequences.

Specialized sequence databases



Focus on a specific type of sequences
Sequences are often modified or specially annotated
Usage depends on the database
Examples:

Ribosomal RNA databases
Immunology databases

Non-redundant databases

Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Identical duplicate sequences are removed
Examples:

NR Nucleic (GenBank+EMBL+DDBJ+PDB DNA)
NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein)
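
Removing identical duplicates when combining databases can be sketched as below. This is an assumption-based illustration, not how the NR databases are actually built: sequences are compared as exact strings (ignoring case), and the identifiers of merged duplicates are kept together. All identifiers and sequences are made up.

```python
def make_non_redundant(records):
    """Merge records with identical sequences, keeping all their identifiers."""
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # compare sequences case-insensitively
        merged.setdefault(key, []).append(identifier)
    return [(ids, seq) for seq, ids in merged.items()]

# Hypothetical identifiers and toy sequences from three source databases.
records = [
    ("swissprot|A1", "MKTPITEA"),
    ("trembl|B2", "MLDAFAKV"),
    ("genpept|C3", "mktpitea"),
]
nr = make_non_redundant(records)
print(nr)
```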


Data submission and quality



Primary repositories provide tools to submit sequences
Much quality control is left to the submitter
Some automatic quality control, but errors sometimes creep in
Human annotation takes time

RefSeq

NCBI Reference Sequence Collection
Non-redundant
Validated data
Format consistency
Ongoing curation, automated and manual
Distinct accession numbers: XX_NNNNNN

eg: NC_123456, XP_123456

Genomic DNA, transcript (RNA), and protein products, for major research organisms
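
The distinct XX_NNNNNN shape of RefSeq accessions makes them easy to recognise in code. A minimal sketch, assuming only the pattern given above (two capital letters, an underscore, then digits); the function name is illustrative.

```python
import re

# Two capital letters, an underscore, then one or more digits.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)$")

def parse_refseq(accession):
    """Return (prefix, number) for a RefSeq-style accession, else None."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(parse_refseq("NC_123456"))
print(parse_refseq("XP_123456"))
print(parse_refseq("J00306"))  # a GenBank accession, not RefSeq-style
```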

Some important points



Always use the latest version of the database
Pay attention to accession and version numbers
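
The difference between accession and version numbers shows up in the VERSION line of a GenBank record, for example J00306.1. A short sketch of separating the two, assuming the accession.version form with a single dot:

```python
def split_version(identifier):
    """Split 'J00306.1' into ('J00306', 1); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None

def same_sequence_version(a, b):
    """True only if accession and version number both match."""
    return split_version(a) == split_version(b)

print(split_version("J00306.1"))
print(same_sequence_version("X70508.1", "X70508.2"))
```

The accession stays the same across updates of a record, while the version number distinguishes successive versions of its sequence.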

Genbank growth

Live demo

Searching for sequences at NCBI
Searching for sequences in Uniprot

Some reading

NCBI Handbook chapter 1
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch1.pdf


Sequence file formats

Sequences can be stored on a computer in different formats/standards
Different software packages will require sequences to be stored in different formats
Programs such as readseq can be used to convert between formats

Sequence file formats: Genbank format

LOCUS       HSPPI                    450 bp    mRNA    linear   PRI 20-JUL-1993
DEFINITION  Homo sapiens mRNA for insulinoma pre-proinsulin.
ACCESSION   X70508
VERSION     X70508.1  GI:394765
KEYWORDS    preproinsulin.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Chekhranova,M.K., Shuvalova,E.R., Kutin,A.M., Butnev,V.Iu.,
            Valentsova,A.B., Il'ina,E.N. and Pankov,Iu.A.
  TITLE     Cloning, primary structure determination and expression of
            preproinsulin cDNA from human insulinoma in Escherichia coli
  JOURNAL   Mol. Biol. (Mosk.) 26 (3), 596-600 (1992)
  MEDLINE   93024361
FEATURES             Location/Qualifiers
     source          1..450
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="pUEX1Ins12"
                     /clone_lib="Human insulinoma cDNA library"
     sig_peptide     45..80
     CDS             45..377
                     /codon_start=1
                     /product="pre-proinsulin"
                     /protein_id="CAA49913.1"
                     /db_xref="GI:394766"
                     /db_xref="SWISS-PROT:P01308"
                     /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCG
                     ERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSL
                     YQLENYCN"
     mat_peptide     78..374
                     /product="pre-proinsulin"
BASE COUNT       86 a    152 c    136 g     76 t
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
       61 gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg
      121 tgaaccaaca cctgtgcggc tcacacctgg tggaagctct ctacctagtg tgcggggaac
      181 gaggcttctt ctacacaccc aagacccgcc gggaggcaga ggacctgcag gtggggcagg
      241 tggagctggg cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc
      301 tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc taccagctgg
      361 agaactactg caactagacg cagcccgcag gcagcccccc acccgccgcc tcctgcaccg
      421 agagagatgg aataaagccc ttgaaccagc
//


Sequence file formats: Fasta format

>HSPPI 450 bp mRNA linear PRI 20-JUL-1993, 450 bases, 44C checksum.
gctgcatcagaagaggccatcaagcacatcactgtccttctgccatggcc
ctgtggatgcgcctcctgcccctgctggcgctgctggccctctggggacc
tgacccagccgcagcctttgtgaaccaacacctgtgcggctcacacctgg
tggaagctctctacctagtgtgcggggaacgaggcttcttctacacaccc
aagacccgccgggaggcagaggacctgcaggtggggcaggtggagctggg
cgggggccctggtgcaggcagcctgcagcccttggccctggaggggtccc
tgcagaagcgtggcattgtggaacaatgctgtaccagcatctgctccctc
taccagctggagaactactgcaactagacgcagcccgcaggcagcccccc
acccgccgcctcctgcaccgagagagatggaataaagcccttgaaccagc

Sequence file formats: GCG format

HSPPI
HSPPI 450 bp mRNA linear PRI 20-JUL-1993
HSPPI  Length: 450  Jul 23, 2003 13:38  Check: 1100  ..
       1  gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc
      51  ctgtggatgc gcctcctgcc cctgctggcg ctgctggccc tctggggacc
     101  tgacccagcc gcagcctttg tgaaccaaca cctgtgcggc tcacacctgg
     151  tggaagctct ctacctagtg tgcggggaac gaggcttctt ctacacaccc
     201  aagacccgcc gggaggcaga ggacctgcag gtggggcagg tggagctggg
     251  cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc
     301  tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc
     351  taccagctgg agaactactg caactagacg cagcccgcag gcagcccccc
     401  acccgccgcc tcctgcaccg agagagatgg aataaagccc ttgaaccagc
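
Converting between these formats is mostly a matter of stripping position numbers and spacing and rewriting the header. The sketch below is a hypothetical, much simplified stand-in for what a converter such as readseq does: it reads only the LOCUS name and the ORIGIN block of a GenBank-style record and writes FASTA with 50 bases per line. The record is truncated to its first two sequence lines.

```python
def genbank_to_fasta(record, width=50):
    """Convert the ORIGIN block of a GenBank-style record to FASTA text."""
    name = ""
    bases = []
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Drop the leading position number and the spaces between blocks.
            bases.append("".join(line.split()[1:]))
    sequence = "".join(bases)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)

# Truncated copy of the HSPPI record (first 120 bases only).
RECORD = """\
LOCUS       HSPPI  450 bp  mRNA  linear  PRI 20-JUL-1993
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
       61 gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg
//
"""
fasta = genbank_to_fasta(RECORD)
print(fasta)
```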
