1. Major biological databases such as GenBank, EMBL, and DDBJ store billions of nucleotides of DNA sequence data from thousands of organisms.
2. GenBank is the largest public database, containing over 31 billion nucleotides from 24 million sequences (April 2003 release). It is organized into divisions such as EST, GSS, and HTG, which contain expressed sequence tags, genome survey sequences, and unfinished high-throughput genomic sequences.
3. In addition to sequence data, databases provide annotation information and link sequences to bibliographic sources. Tools like UniGene cluster similar sequences to represent unique genes and show expression patterns.
• Publicly available databanks now contain billions of nucleotides of DNA sequence data collected from thousands of different organisms.
• Major questions:
1. How do these databases store these data?
2. What strategies are followed to extract information from them?
• Three publicly accessible databases store large amounts of nucleotide and protein sequence data:
1. GenBank at the National Center for Biotechnology Information (NCBI), which is part of the National Institutes of Health (NIH).
2. DNA Data Bank of Japan (DDBJ).
3. European Bioinformatics Institute (EBI) in Hinxton, England.
• Other resources include the Protein Data Bank in Europe (PDBe) and the Worldwide Protein Data Bank (wwPDB).

Key Bioinformatics Databases in detail
• NCBI (we will use this database a lot!)
• http://www.ncbi.nlm.nih.gov/
• The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
• EBI-EMBL
• http://www.ebi.ac.uk/
• The European Bioinformatics Institute (EBI) is a non-profit academic organization that forms part of the European Molecular Biology Laboratory (EMBL).
• The EBI is a centre for research and services in bioinformatics. The Institute manages databases of biological data, including nucleic acid sequences, protein sequences, and macromolecular structures.
• DDBJ
• http://www.ddbj.nig.ac.jp/
• The DNA Data Bank of Japan (DDBJ) is the sole nucleotide sequence data bank in Asia that is officially certified to collect nucleotide sequences from researchers and to issue internationally recognized accession numbers to data submitters.
• SIB
• http://www.isb-sib.ch/
• The SIB Swiss Institute of Bioinformatics is an academic, non-profit foundation established in 1998. SIB coordinates research and education in bioinformatics throughout Switzerland and provides high-quality bioinformatics services to the national and international research community.
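To make the second question (how to extract information) a little more concrete, here is a minimal Python sketch that builds a request URL for NCBI's Entrez E-utilities `efetch` service. E-utilities is not covered in these slides, so treat it as an outside assumption. The helper name `genbank_record_url` is hypothetical, and the accession is only a placeholder to be replaced with any accession of interest. The sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the efetch endpoint (NCBI Entrez E-utilities).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession, db="nucleotide"):
    """Build an efetch URL asking for one record as GenBank flat-file text.

    `accession` is any sequence accession number; `db` names the Entrez
    database to query. This helper is illustrative, not part of any library.
    """
    params = {
        "db": db,            # which Entrez database to search
        "id": accession,     # the record to retrieve
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


# Placeholder accession, used here only to show the shape of the URL.
url = genbank_record_url("NM_006744")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the record as text, which is one simple way to pull a single entry out of the database without using the web interface.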
GENBANK: Database of most known nucleotide and protein sequences
• GenBank is a database consisting of most known public DNA and protein sequences.
• In addition to storing these sequences, GenBank contains bibliographic and biological annotation.
• Data from GenBank are available free of charge from NCBI in the National Library of Medicine at the NIH.
• GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA Data Bank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Amount of Sequence Data
• GenBank contains over 31 billion nucleotides from 24 million sequences (release of April 2003).
• The number of bases in GenBank has doubled approximately every 14 months.

Organisms in GenBank
• Over 100,000 different species are represented in GenBank, with over 1000 new species added per month (Benson et al., 2002).

Organization and type of data: how data is stored
Types of data in GenBank
• To help organize the available information, each sequence name in a GenBank record is followed by its data file division and primary accession number.
• The following codes are used to designate the data file divisions:
1. PRI: primate sequences
2. ROD: rodent sequences
3. MAM: other mammalian sequences
4. VRT: other vertebrate sequences
5. INV: invertebrate sequences
6. PLN: plant, fungal, and algal sequences
7. BCT: bacterial sequences
8. VRL: viral sequences
9. PHG: bacteriophage sequences
10. SYN: synthetic sequences
11. UNA: unannotated sequences
12. EST: EST sequences (expressed sequence tags)
13. PAT: patent sequences
14. STS: STS sequences (sequence-tagged sites)
15. GSS: GSS sequences (genome survey sequences)
16. HTG: HTGS sequences (high-throughput genomic sequences)

What are ESTs?
Expressed Sequence Tags (ESTs)
• The database of expressed sequence tags (dbEST) is a division of GenBank that contains sequence data and other information on cDNA sequences from a number of organisms.
• An EST is a partial DNA sequence of a cDNA clone.
• All cDNA clones, and thus all ESTs, are derived from some specific RNA source such as human brain or rat liver.
• The RNA is converted into a more stable form, cDNA, which may then be packaged into a cDNA library.
• ESTs are typically randomly selected cDNA clones that are rapidly sequenced on one strand.
• ESTs are often 300-800 bp in length.
• Question: why is mRNA less stable than cDNA?

Role and importance of ESTs
• Because ESTs represent a copy of the expressed, and therefore most interesting, part of a genome, they can be used as "tags" to fish a gene out of a portion of the DNA by matching base pairs.
• They are powerful tools in the hunt for genes involved in hereditary diseases.
• ESTs also have a number of practical advantages:
- Their sequences can be generated rapidly and inexpensively.
- Only one sequencing experiment is needed per cDNA generated.
- They do not have to be checked for sequencing errors, because mistakes do not prevent identification of the gene from which the EST was derived.
- Because they are expressed, they were clearly transcribed from a functioning gene.

Protein Databases
• Protein data are also held in dedicated protein databases, such as the nonredundant (nr) database of GenBank, the SwissProt database, and the Protein Data Bank.

ESTs and UniGene
• A UniGene cluster is a database entry for a gene containing a group of corresponding ESTs.
• The goal of the UniGene (unique gene) project is to create one unique entry for each gene and to collect all the ESTs associated with that gene. For example, in the case of RBP4, there is only one UniGene entry.
• Although there is only a single UniGene entry for RBP4, this entry currently has over 200 human ESTs that match the RBP4 gene.
• This large number of ESTs reflects how abundantly the RBP4 gene has been expressed in the cDNA libraries that have been sequenced.
• UniGene clusters are created by sequence similarity searching using ESTs, and by gathering aligned sequences into clusters.
• Many clusters have only one member, representing unique sequences, while other clusters have tens of thousands of EST members.
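The idea of gathering similar ESTs into clusters can be illustrated with a toy Python sketch. This is not UniGene's actual procedure: real clustering relies on proper sequence alignment, whereas this sketch assumes a much cruder similarity test (two ESTs are linked if they share a minimum number of k-mers) and then takes connected groups as clusters. The function names, the thresholds `k` and `min_shared`, and the sequences are all made up for illustration.

```python
from itertools import combinations


def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Group ESTs by single linkage on shared k-mers (toy similarity test).

    ests: dict mapping an EST identifier to its sequence.
    Two ESTs are linked when they share at least min_shared k-mers;
    clusters are the connected components of the resulting link graph.
    Returns a list of clusters (lists of identifiers), largest first.
    """
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    profiles = {name: kmers(seq, k) for name, seq in ests.items()}
    for a, b in combinations(ests, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in ests:
        clusters.setdefault(find(name), []).append(name)
    return sorted(clusters.values(), key=len, reverse=True)


# Made-up data: two overlapping reads from one "transcript" plus one unrelated read.
transcript = "ATGGCGTACCTGAAGTTCGACCTGATCGGATCCAAGTGCA"
toy_ests = {
    "est_A": transcript[:30],
    "est_B": transcript[10:],
    "est_C": "TTTTTTTTTTCCCCCCCCCCGGGGGGGGGG",
}
clusters = cluster_ests(toy_ests)
print([len(c) for c in clusters])  # cluster sizes: [2, 1]
```

Here `est_A` and `est_B` overlap and end up in one cluster of size 2, while `est_C` forms a singleton cluster, mirroring the range of cluster sizes described above.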
• Conclusion: UniGene is a useful tool for looking up information about expressed genes. It displays the abundance of a transcript (expressed gene) as well as its regional distribution of expression (e.g., brain vs. liver).

Cluster sizes in UniGene
• A gene with 1 associated EST has a cluster size of 1.
• A gene with 10 associated ESTs has a cluster size of 10.

Sequence-Tagged Sites (STSs)
• dbSTS is an NCBI resource containing STSs, which are short genomic landmark sequences for which both DNA sequence data and mapping data are available.
• STSs have been obtained from several dozen organisms.
• A typical STS is approximately the size of an EST.
• An STS is a short DNA sequence that is easily recognizable and occurs only once in a genome (or chromosome).
• Because they are sometimes polymorphic, containing short sequence repeats, STSs can be useful for mapping studies.
• Genetic polymorphism, as originally defined by Ford (1940), is the simultaneous occurrence in the same locality of two or more discontinuous forms in such proportions that the rarest of them cannot be maintained just by recurrent mutation or immigration.

Genome Survey Sequences (GSSs)
• The GSS division of GenBank is similar to the EST division, except that its sequences are genomic in origin rather than derived from cDNA (mRNA).
• The GSS division contains the following types of data:
1. Random "single-pass read" genome survey sequences (single pass means that a sequence has been read on the sequencing machine only once)
2. Cosmid/BAC/YAC end sequences (cosmids, bacterial artificial chromosomes, and yeast artificial chromosomes are cloning vectors that carry inserted DNA fragments from a foreign source)
3. Polymerase chain reaction (PCR) sequences

High-Throughput Genomic Sequence (HTGS)
• The HTGS division was created to make "unfinished" genomic sequence data rapidly available to the scientific community.
• It was set up as a coordinated effort between the three international nucleotide sequence databases: DDBJ, EMBL, and GenBank.
• The HTGS division contains unfinished DNA sequences generated by the high-throughput sequencing centers.
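
The division codes covered in this section (the taxonomic ones plus EST, STS, GSS, and HTG) appear as a three-letter code on the LOCUS line of a GenBank flat-file record. As a rough sketch, the Python below looks up that code in a table built from the list given earlier. The function name `division_of` is hypothetical, the sample line is invented rather than copied from a real record, and the parsing simply assumes whitespace-separated fields.

```python
# Division codes and short descriptions, taken from the list in this section.
DIVISIONS = {
    "PRI": "primate",
    "ROD": "rodent",
    "MAM": "other mammalian",
    "VRT": "other vertebrate",
    "INV": "invertebrate",
    "PLN": "plant, fungal, and algal",
    "BCT": "bacterial",
    "VRL": "viral",
    "PHG": "bacteriophage",
    "SYN": "synthetic",
    "UNA": "unannotated",
    "EST": "expressed sequence tag",
    "PAT": "patent",
    "STS": "sequence-tagged site",
    "GSS": "genome survey sequence",
    "HTG": "high-throughput genomic",
}


def division_of(locus_line):
    """Return (code, description) for the division on a LOCUS line, or None.

    Assumes fields are separated by whitespace and that the line starts
    with the keyword LOCUS followed by the locus name.
    """
    fields = locus_line.split()
    if not fields or fields[0] != "LOCUS":
        return None
    # Skip the keyword and the locus name, which could itself look like a code.
    for token in fields[2:]:
        if token in DIVISIONS:
            return token, DIVISIONS[token]
    return None


# Invented example line in the general shape of a LOCUS line.
sample = "LOCUS       EXAMPLE01    1070 bp    mRNA    linear   PRI 01-JAN-2003"
print(division_of(sample))  # ('PRI', 'primate')
```

A lookup like this is enough to sort a batch of records by division, for example to separate single-pass EST or GSS reads from finished, annotated sequences.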