02 Databases
Reference
Storing sequences
Databases and file formats
Zvelebil and Baum, Understanding Bioinformatics chapter 3
Databases- terminology
Database- collection of information related to a specific subject (e.g. a phone book)
Record- an entry in a database (e.g. your entry in the phone book)
Field- a component of a record (e.g. your address & number)
Structural databases
Motif databases
Genome databases
Proteome databases
RNA expression
Literature
Populations
Mutations
Polymorphisms
Organisms
Pathways
Flat-file databases
Pros
Easy to put together and distribute
No need for expensive or complicated database management software
Cons
Detailed targeted searching is difficult
Searching is not efficient
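A minimal sketch of why searching a flat file is inefficient: with no index, every query reads the records one by one until it finds a match. The record layout (an ID line, a DE line, and a `//` terminator), the example IDs, and the function name `find_record` are assumptions for illustration only.

```python
import io

# Hypothetical flat file: each record ends with a "//" line.
FLAT_FILE = """\
ID P1001
DE phycocyanin alpha chain
//
ID P1002
DE phycocyanin beta chain
//
"""

def find_record(handle, wanted_id):
    """Scan the file line by line and return the lines of the matching record."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            # End of a record: check whether it is the one we want.
            if record and record[0] == f"ID {wanted_id}":
                return record
            record = []
        else:
            record.append(line)
    return None

hit = find_record(io.StringIO(FLAT_FILE), "P1002")
print(hit)
```

The cost of each lookup grows with the size of the file, which is the problem a database management system avoids by indexing.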
4/03/13
Relational databases
Require a Relational Database Management System (RDBMS)
Queried using SQL (or more commonly, a GUI front-end)
SELECT protab1.protein_name, protab2.protein_sequence
FROM protab1, protab2
WHERE protab1.protein_code = protab2.protein_code AND protab1.protein_code = 'P1002';
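The query can be tried without installing an RDBMS by using Python's built-in sqlite3 module. This is a sketch under assumptions: the table contents are invented, and the column names use underscores because a hyphen in an unquoted SQL identifier is read as a minus sign. The protein code is passed as a quoted string parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical tables that share the protein_code field.
conn.executescript("""
CREATE TABLE protab1 (protein_code TEXT PRIMARY KEY, protein_name TEXT);
CREATE TABLE protab2 (protein_code TEXT PRIMARY KEY, protein_sequence TEXT);
INSERT INTO protab1 VALUES ('P1001', 'example protein A'), ('P1002', 'example protein B');
INSERT INTO protab2 VALUES ('P1001', 'MKTPITEAIA'), ('P1002', 'MLDAFAKVVA');
""")

# Join the tables on protein_code and select a single protein.
query = """
SELECT protab1.protein_name, protab2.protein_sequence
FROM protab1, protab2
WHERE protab1.protein_code = protab2.protein_code
  AND protab1.protein_code = ?;
"""
rows = conn.execute(query, ("P1002",)).fetchall()
print(rows)
conn.close()
```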
Data in a database
Primary data
e.g. DNA sequence, protein sequence, protein 3D structure coordinates
Annotations
e.g. Authors, literature references, protein function, organism of origin, location of coding regions in DNA sequence etc
Annotation fields in a GenBank entry include REFERENCE (with AUTHORS, TITLE, JOURNAL, PUBMED) and COMMENT
DNA binding
Chromatin binding
ATP-dependent helicase
GenBank
Currently maintained by the National Center for Biotechnology Information (part of the National Library of Medicine) in Bethesda, MD
Database available for download and searching using Entrez and BLAST
http://www.ncbi.nlm.nih.gov
All 3 databases (GenBank, EMBL and DDBJ) exchange data daily and should contain the same sequences
Databases differ in their format and in the services they offer for searching and submission
EMBL
Currently maintained by the European Bioinformatics Institute in Hinxton, UK
Available for download and searching using SRS, BLAST, FASTA
http://www.ebi.ac.uk
GenBank divisions
1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTGS sequences (high throughput genomic sequences)
17. HTC - HTC sequences (high throughput cDNA sequences)
EST
Expressed Sequence Tags (ESTs) are short (usually about 300-500 bp), single-pass sequence reads from mRNA (cDNA). Typically they are produced in large batches. They represent a snapshot of genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library.
EST entry
LOCUS       T12742       157 bp    mRNA             EST       28-OCT-1993
DEFINITION  zEST00149-5 Zea mays cDNA clone csuh00149/umc382 5' end similar to
            similar to short chain alcohol dehydrogenase.
ACCESSION   T12742
NID         g409680
KEYWORDS    EST.
SOURCE      Maize clone=csuh00149/umc382 library=Maize Leaf, Stratagene
            #937005 strain=B73 vector=Uni-ZAP primer=SK Rsite1=EcoR1
            Rsite2=Xho1 mRNA isolated from illuminated leaves and sheaths of
            5 week old plant. cDNA directionally cloned into vector. .
  ORGANISM  Zea mays
            Eucaryotae; Embryophyta; Magnoliophyta; Liliopsida; Cyperales;
            Poaceae; Zea.
REFERENCE   1  (bases 1 to 157)
  AUTHORS   Baysdorfer,C.
  TITLE     The Maize cDNA Program
  JOURNAL   Unpublished (1993)
COMMENT     Contact: Baysdorfer C
            California State University
            Dept Biol Sci, California State Univ, Hayward, CA 94542
            Tel: 5108813459
            Fax: 5107272035
            Email: cbaysdor@s1.csuhayward.edu.
FEATURES             Location/Qualifiers
     source          1..157
                     /organism="Zea mays"
                     /clone="csuh00149/umc382"
                     /strain="B73"
BASE COUNT       33 a     42 c     51 g     26 t      5 others
ORIGIN
        1 CCTCAAGGGC GTCGACNNNA TGCCCGAGGA CGTCGCCCAG GNNGTGCTCT
       51 ACCTGGCCAG CGACGAGGCG AGGTACGTCA GCGCGGTCAA CCTCATGGTG
      101 GACGGAGGCT TCACAGCCGT AAACAATAAC CTCAGGGCGT TTGAGGATTA
      151 GTTGAGG
CYNPCP_1 Cyanidium caldarium phycocyanin beta-subunit (cpcB) and cpcA complete cds.
g304585 21-Apr-1996 L13467 Cyanidium caldarium
Nucleic Acid Features translated to generate this entry:
CDS 483..1001
    /gene="cpcB"
    /standard_name="phycocyanin"
    /codon_start=1
    /function="light harvesting"
    /evidence=experimental
    /product="phycocyanin beta subunit"
    /db_xref="PID:g304585"
Complete Complete 137230 18237.53 172 aa
SWISS-PROT
Maintained at the Swiss Institute of Bioinformatics and EBI
Manually curated: high-quality (but not perfect) annotations
Not as up to date as the nucleotide databases, because manual curation takes time
http://www.expasy.org
Composition
27 Ala A     7 Gln Q    14 Leu L    13 Ser S
11 Arg R     7 Glu E     5 Lys K     9 Thr T
 8 Asn N    12 Gly G     6 Met M     0 Trp W
12 Asp D     0 His H     4 Phe F     5 Tyr Y
 3 Cys C    12 Ile I     4 Pro P    13 Val V
Mol. Wt. Unmod. Chain = 18237.53
  1 MLDAFAKVVA QADARGEFLS ...
 51 IVTNAARALF SEQPQLIQPG ...
101 DSSVLDDRCL NGLRETYQAL ...
151 GDCSALMAEV GTYFDRAATA ...
SWISS-PROT entry
ID   PHCA_GALSU     STANDARD;      PRT;   162 AA.
AC   P00306;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   01-FEB-1996 (REL. 33, LAST SEQUENCE UPDATE)
DT   01-FEB-1996 (REL. 33, LAST ANNOTATION UPDATE)
DE   C-PHYCOCYANIN ALPHA CHAIN.
GN   CPCA.
OS   GALDIERIA SULPHURARIA (CYANIDIUM CALDARIUM).
OG   CHLOROPLAST.
OC   EUKARYOTA; PLANTA; PHYCOPHYTA; RHODOPHYTA (RED ALGAE).
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=IIID2;
RX   MEDLINE; 95232204.
RA   TROXLER R.F., YAN Y., JIANG J.W., LIU B.;
RL   PLANT PHYSIOL. 107:985-994(1995).
CC   -!- FUNCTION: LIGHT-HARVESTING PHOTOSYNTHETIC BILE PIGMENT-PROTEIN
CC       FROM THE PHYCOBILIPROTEIN COMPLEX.
CC   -!- SUBUNIT: HETERODIMER OF AN ALPHA AND A BETA CHAIN.
CC   -!- PTM: CONTAINS ONE COVALENTLY LINKED BILIN CHROMOPHORE.
DR   EMBL; L13467; G304586; -.
DR   EMBL; S77125; G998372; -.
DR   PIR; A00314; CFKKA.
DR   HSSP; P07122; 1CPC.
KW   PHYCOBILISOME; ELECTRON TRANSPORT; PHOTOSYNTHESIS; BILE PIGMENT;
KW   CHLOROPLAST.
FT   BINDING      84     84       PHYCOCYANOBILIN CHROMOPHORE.
FT   CONFLICT     61     61       S -> Q (IN REF. 3).
FT   CONFLICT     95     95       V -> I (IN REF. 3).
FT   CONFLICT    101    101       V -> A (IN REF. 3).
SQ   SEQUENCE   162 AA;  17505 MW;  A4BF84C3 CRC32;
Composition
26 A     7 Q    13 L    12 S
 8 R     9 E     5 K     9 T
11 N    12 G     4 M     1 W
 5 D     1 H     3 F    12 Y
 2 C     9 I     6 P     7 V
     MKTPITEAIA AADNQGRFLS NTELQAVNGR YQRAAASLEA ARSLTSNAER LINGAAQAVY
     SKFPYTSQMP GPQYASSAVG KAKCARDIGY YLRMVTYCLV VGGTGPMDEY LIAGLEEINR
     TFDLSPSWYV EALNYIKANH GLSGQAANEA NTYIDYAINA LS
[Figure: flow diagram with the labels Auto-translation, Automatic sorting, Annotation, TrEMBL, SP-TrEMBL, others, SWISS-PROT]
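Because every line of a SWISS-PROT entry starts with a two-letter code (ID, AC, DE, KW and so on), the fields are easy to pull apart with a few lines of code. The sketch below groups lines by their code. It assumes the code is followed by spaces and then the value; the excerpt is shortened from the entry above, and the function name `parse_line_codes` is made up for this example.

```python
from collections import defaultdict

# Shortened excerpt of the PHCA_GALSU entry (assumed spacing).
ENTRY = """\
ID   PHCA_GALSU     STANDARD;      PRT;   162 AA.
AC   P00306;
DE   C-PHYCOCYANIN ALPHA CHAIN.
GN   CPCA.
KW   PHYCOBILISOME; ELECTRON TRANSPORT; PHOTOSYNTHESIS; BILE PIGMENT;
KW   CHLOROPLAST.
"""

def parse_line_codes(text):
    """Group the lines of an entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, _, value = line.partition(" ")
        fields[code].append(value.strip())
    return dict(fields)

fields = parse_line_codes(ENTRY)
accession = fields["AC"][0].rstrip(";")
# Keywords can span several KW lines: join them, then split on ";".
keywords = [k.strip() for k in " ".join(fields["KW"]).rstrip(".").split(";") if k.strip()]
print(accession, keywords)
```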
UniProt
Unified protein database incorporating PIR, SWISS-PROT, TrEMBL
http://www.uniprot.org
UniProt components
UniProt Knowledgebase (UniProtKB)
Central access point for extensive curated protein information, including function, classification, and cross-reference
Non-redundant databases
Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Identical duplicate sequences are removed
Examples:
NR Nucleic (GenBank+EMBL+DDBJ+PDB DNA)
NR Protein (SWISS-PROT+TrEMBL+GenPept+PDB protein)
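A rough sketch of how identical duplicates can be collapsed when databases are combined: the sequence itself is used as the key, and each distinct sequence keeps a list of the source IDs it came from. The IDs and sequences below are invented, and this is not how the actual NR sets are built.

```python
def merge_non_redundant(*databases):
    """Merge {id: sequence} dicts, keeping one entry per distinct sequence."""
    merged = {}  # sequence -> list of source IDs
    for db in databases:
        for seq_id, seq in db.items():
            merged.setdefault(seq.upper(), []).append(seq_id)
    return merged

# Hypothetical inputs: sp_2 and gp_1 carry the same sequence.
swissprot = {"sp_1": "MKTPITEAIA", "sp_2": "MLDAFAKVVA"}
genpept = {"gp_1": "MLDAFAKVVA", "gp_2": "MALWMRLLPL"}

nr = merge_non_redundant(swissprot, genpept)
print(len(nr), nr["MLDAFAKVVA"])
```

Four input records collapse to three distinct sequences. Only exact matches are merged, so sequences that differ by a single residue both stay in the set.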
RefSeq
NCBI Reference Sequence Collection
Non-redundant
Validated data
Format consistency
Ongoing curation, automated and manual
Distinct accession numbers: XX_NNNNNN
e.g. NC_123456, XP_123456
Genomic DNA, transcript (RNA), and protein products, for major research organisms
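The XX_NNNNNN pattern makes RefSeq accessions easy to tell apart from other accession numbers. A minimal sketch, assuming the form is two uppercase letters, an underscore, digits, and an optional ".version" suffix; real RefSeq accessions vary in length, so the pattern here is an approximation. The function name is hypothetical.

```python
import re

# Assumed pattern: two uppercase letters, "_", digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

def split_refseq_accession(accession):
    """Return (prefix, number, version) or None if the pattern does not match."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return None
    return match.groups()

print(split_refseq_accession("NC_123456"))
print(split_refseq_accession("XP_123456.2"))
print(split_refseq_accession("X70508"))  # a GenBank/EMBL-style accession
```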
GenBank growth
Live demo
Searching for sequences at NCBI
Searching for sequences in UniProt
Some reading
NCBI Handbook chapter 1
http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/handbook/ch1.pdf
Sequence file formats
Sequences can be stored on a computer in different formats/standards
Different software packages will require sequences to be stored in different formats
Programs such as readseq can be used to convert between formats
LOCUS       HSPPI        450 bp    mRNA    linear   PRI 20-JUL-1993
DEFINITION  Homo sapiens mRNA for insulinoma pre-proinsulin.
ACCESSION   X70508
VERSION     X70508.1  GI:394765
KEYWORDS    preproinsulin.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Chekhranova,M.K., Shuvalova,E.R., Kutin,A.M., Butnev,V.Iu.,
            Valentsova,A.B., Il'ina,E.N. and Pankov,Iu.A.
  TITLE     Cloning, primary structure determination and expression of
            preproinsulin cDNA from human insulinoma in Escherichia coli
  JOURNAL   Mol. Biol. (Mosk.) 26 (3), 596-600 (1992)
  MEDLINE   93024361
FEATURES             Location/Qualifiers
     source          1..450
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="pUEX1Ins12"
                     /clone_lib="Human insulinoma cDNA library"
     sig_peptide     45..80
     CDS             45..377
                     /codon_start=1
                     /product="pre-proinsulin"
                     /protein_id="CAA49913.1"
                     /db_xref="GI:394766"
                     /db_xref="SWISS-PROT:P01308"
                     /translation="MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCG
                     ERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSL
                     YQLENYCN"
     mat_peptide     78..374
                     /product="pre-proinsulin"
BASE COUNT       86 a    152 c    136 g     76 t
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
       61 gcctcctgcc cctgctggcg ctgctggccc tctggggacc tgacccagcc gcagcctttg
      121 tgaaccaaca cctgtgcggc tcacacctgg tggaagctct ctacctagtg tgcggggaac
      181 gaggcttctt ctacacaccc aagacccgcc gggaggcaga ggacctgcag gtggggcagg
      241 tggagctggg cgggggccct ggtgcaggca gcctgcagcc cttggccctg gaggggtccc
      301 tgcagaagcg tggcattgtg gaacaatgct gtaccagcat ctgctccctc taccagctgg
      361 agaactactg caactagacg cagcccgcag gcagcccccc acccgccgcc tcctgcaccg
      421 agagagatgg aataaagccc ttgaaccagc
//
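Converting between formats mostly means moving the same sequence into a different layout. The sketch below is a hand-written illustration, not readseq: it reads the LOCUS name and the ORIGIN block of a GenBank-style record and writes FASTA. The input is a shortened excerpt of the HSPPI entry, and the function name `genbank_to_fasta` is an assumption for this example.

```python
# Shortened excerpt of the HSPPI record: LOCUS name plus the first 70 bases.
GENBANK_TEXT = """\
LOCUS       HSPPI
ORIGIN
        1 gctgcatcag aagaggccat caagcacatc actgtccttc tgccatggcc ctgtggatgc
       61 gcctcctgcc
//
"""

def genbank_to_fasta(text, width=60):
    """Extract the sequence from the ORIGIN block and format it as FASTA."""
    name = None
    in_origin = False
    residues = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            name = line.split()[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Keep letters only: drops the position number and block spacing.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(residues)
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + str(name) + "\n" + "\n".join(chunks) + "\n"

fasta = genbank_to_fasta(GENBANK_TEXT)
print(fasta)
```

All annotation is lost in this direction, since a FASTA record holds only a header line and the sequence.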