
LAB 1 - TEXT SEARCH FROM ONLINE DATABASES
Online database interface
 To become familiar with the setup of an online database
interface
 Know how to search for a program with a sequence
 Know how to work through protocols
Sequence database browsing and text
searching
DNA databases
 GenBank: located at the National Center for Biotechnology
Information (1980s)
 EMBL (European Molecular Biology Laboratory): located at the
European Bioinformatics Institute
 DDBJ (DNA Data Bank of Japan): located at the National
Institute of Genetics (1986)
Annotated sequence databases
 Additional information about sequence, definition, exon
location, associated protein etc.
 EMBL, GenBank, SWISS-PROT etc.
Low-annotation sequence databases
 Basic information only
 EST databases, high-throughput genome sequences
 Sequence comparison searching only
GenBank subset databases
 EST – sequences of cDNA which have been reverse-transcribed
from mRNA
 STS – short DNA segments with a single location in the
genome
 HTG – ‘unfinished’ DNA sequences generated by the
high-throughput sequencing centers
 GSS – similar in nature to ESTs, except that the sequences are
genomic in origin, rather than cDNA
Non-redundant (NR)
 NR databases are created from multiple databases
 Contain sequence data only
 Cannot be text searched
 Can be searched using a sequence
 Databases combine sequences from more than one database
 Ensure that each sequence appears only once in the database
 Reduce missing entries
 Text information is not combined and is lost
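The merging described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real NR database is built: the function name `build_non_redundant`, the record layout (identifier mapped to a description and a sequence), and the example records are all assumptions made for the sketch. It shows identical sequences from different sources collapsing into one entry, with the text descriptions discarded.

```python
def build_non_redundant(*databases):
    """Merge several source databases into one non-redundant set.

    Each database is assumed (for this sketch) to be a dict of the form
    {identifier: (description, sequence)}. The result maps each distinct
    sequence to the list of identifiers it came from. Descriptions are
    deliberately dropped, mirroring the loss of text information.
    """
    merged = {}
    for database in databases:
        for identifier, (_description, sequence) in database.items():
            key = sequence.upper()  # treat upper and lower case as the same sequence
            merged.setdefault(key, []).append(identifier)
    return merged


# Hypothetical records, for illustration only.
db_a = {"A1": ("kinase mRNA", "ATGGCC"), "A2": ("unknown", "ATGTTT")}
db_b = {"B7": ("protein kinase", "atggcc"), "B8": ("tRNA", "GGGCCC")}

nr = build_non_redundant(db_a, db_b)
for sequence, identifiers in nr.items():
    print(sequence, identifiers)
```

Because the result is keyed by sequence alone, it can be queried with a sequence but not with a keyword, which is why such a database cannot be text searched.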
GenBank format: Header
[Figure: GenBank record header with labelled fields: length, type, submission, unique identifier, reference]
GenBank format: Features
GenBank format: Sequence
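To make the header fields concrete, here is a rough sketch that pulls the length, molecule type, date and accession out of a GenBank-style flat-file header. The record below is made up for illustration (the locus name and accession are placeholders, not real entries), the function name `parse_genbank_header` is hypothetical, and the parsing is deliberately simplified: it splits on whitespace rather than honouring the fixed column positions of the real format.

```python
# Made-up record for illustration only; not a real database entry.
RECORD = """\
LOCUS       EXAMPLE01    24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record for illustration only.
ACCESSION   XX000000
FEATURES             Location/Qualifiers
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_genbank_header(text):
    """Extract a few header fields from a GenBank-style record (simplified)."""
    header = {}
    for line in text.splitlines():
        # The header ends where the feature table or the sequence begins.
        if line.startswith(("FEATURES", "ORIGIN")):
            break
        if line.startswith("LOCUS"):
            fields = line.split()
            header["name"] = fields[1]
            header["length"] = int(fields[2])  # sequence length in bp
            header["type"] = fields[4]         # molecule type, e.g. DNA
            header["date"] = fields[-1]        # date at the end of the LOCUS line
        elif line.startswith("DEFINITION"):
            header["definition"] = line[len("DEFINITION"):].strip()
        elif line.startswith("ACCESSION"):
            header["accession"] = line.split()[1]  # unique identifier
    return header


header = parse_genbank_header(RECORD)
print(header)
```

For real records a dedicated parser is the safer choice; this sketch is only meant to show where the labelled header fields sit in the text.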
Protein databases
 GenPept
 Translations of the coding exons of GenBank sequences
(GenBank peptides, hence GenPept)
 trEMBL & SWISS-PROT
 Swiss Institute of Bioinformatics (1998)
 Contains translations of all coding sequences in EMBL
 SP-trEMBL contains entries that will eventually be incorporated
into SWISS-PROT but have not been manually annotated
 REM-trEMBL contains sequences that are not destined to be included
in SWISS-PROT e.g. fragments of fewer than eight amino acids,
synthetic sequences, patented sequences
 High-level annotation, including
 Descriptions of the function of the protein
 Structure of the domains
 Post-translational modifications
 All variants
 Aims to be minimally redundant
 SP-trEMBL is effectively a preliminary section of SWISS-PROT
 Entries in SP-trEMBL are removed when they are incorporated into
the SWISS-PROT database
Problems with the databases
 The bulk of proteins in GenBank have had their functional
annotation assigned by automatic methods, so the quality and
reliability of this information is increasingly doubtful
 Majority of new DNA sequence information is annotated
automatically.
 Genes are found in genomic DNA by automatic open reading frame
(ORF) detection algorithms and automatically annotated based on
similarity searches against the proteins in the databases
 These automatic annotations have proved to be rather powerful in the
case of microbial genomes
 In eukaryotic genomes there are complexities that cannot be easily
overcome by simple algorithms. It requires careful scrutiny to
distinguish between a pseudogene and a member of a multi-gene
family. Transposons and many other types of repeated sequences
occur both within expressed coding sequences and in non-coding
sequences
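The kind of simple ORF detection mentioned above can be sketched as follows. This is a deliberately naive illustration, not the algorithm used by any particular annotation pipeline: the function name `find_orfs`, the `min_codons` threshold and the test sequence are assumptions for the example. It scans only the three forward-strand reading frames for an ATG followed in frame by a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find simple ORFs on the forward strand.

    Returns a list of (start, end, frame) tuples, where start is the index
    of the ATG, end is one past the last base of the stop codon, and frame
    is 0, 1 or 2. An ORF is reported only if it has at least min_codons
    codons before the stop codon (the start codon counts).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Made-up sequence for illustration only.
print(find_orfs("CCATGGCCATTGTAATGGGCCGCTGAAA"))
```

A scan like this assumes an uninterrupted coding region, which is a reasonable approximation for many microbial genes. It has no notion of introns, pseudogenes or repeated elements, which is one reason eukaryotic genomes need more than a simple algorithm.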
