LAB 1 - Text Search
DATABASES
Online database interface
To become familiar with the setup of online database
interfaces
Know how to search for a program with a sequence
Know how to work through protocols
Sequence database browsing and text
searching
DNA databases
GenBank: located at the National Center for Biotechnology
Information (1980s)
EMBL (European Molecular Biology Laboratory): located at the
European Bioinformatics Institute
DDBJ (DNA Data Bank of Japan): located at the National
Institute of Genetics (1986)
Annotated sequence databases
Additional information about sequence, definition, exon
location, associated protein etc.
EMBL, GenBank, SWISS-PROT etc.
Low-annotation sequence databases
Basic information only
EST databases, high-throughput genome sequences
Sequence comparison searching only
GenBank subset databases
EST – sequences of cDNA which have been reverse-transcribed
from mRNA
STS – short DNA segments with a single location in the
genome
HTG – ‘unfinished’ DNA sequences generated by the
high-throughput sequencing centers
GSS – similar in nature to ESTs, except that the sequences are
genomic in origin, rather than cDNA
Non-redundant (NR)
NR databases are created from multiple databases
Contain sequence data only
Cannot be text searched
Can be searched using a sequence
Databases combine sequences from more than one database
Ensure that there is only one copy of every sequence in the database
Reduce missing entries
Text information is not combined and is lost
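The way an NR database is assembled can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `build_nonredundant` and the toy entries are hypothetical, and real NR builds handle far more than exact-match duplicates.

```python
def build_nonredundant(*databases):
    """Merge several databases, keeping a single copy of each sequence.

    Each database is a list of (description, sequence) pairs. Only the
    sequence is kept; the description text is dropped, which is why the
    result cannot be text searched.
    """
    seen = set()
    nr = []
    for db in databases:
        for _description, sequence in db:
            key = sequence.upper()  # treat upper/lower case as identical
            if key not in seen:
                seen.add(key)
                nr.append(key)
    return nr


# Hypothetical toy entries, not real database records.
db_a = [("kinase fragment", "ATGGCCTAA"), ("unknown clone", "ATGTTTTGA")]
db_b = [("same kinase, other submitter", "atggcctaa"), ("new entry", "ATGAAATAG")]

nr = build_nonredundant(db_a, db_b)
print(nr)                  # three unique sequences remain
print("ATGAAATAG" in nr)   # searching with a sequence still works
```

The duplicate kinase entry from the second database is merged away, and none of the descriptions survive, matching the two trade-offs listed above.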
GenBank format: Header
[Figure: GenBank header with labels for length, type, submission, unique identifier and reference]
GenBank format: Features
GenBank format: Sequence
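The three parts of a GenBank entry (header, features, sequence) can be separated with a short Python sketch. The record below is a hypothetical toy entry written for illustration, not a real GenBank record, and `split_genbank` is an assumed helper name. The sketch relies only on the flat-file keywords `LOCUS`, `FEATURES`, `ORIGIN` and the `//` terminator.

```python
RECORD = """\
LOCUS       TOY00001                  24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY00001
REFERENCE   1  (bases 1 to 24)
FEATURES             Location/Qualifiers
     source          1..24
     CDS             1..24
ORIGIN
        1 atggccaaag gtcattgaac ctaa
//
"""


def split_genbank(text):
    """Split one flat-file record into header, features and sequence."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):        # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
            continue
        if line.startswith("ORIGIN"):
            current = "sequence"
            continue
        sections[current].append(line)

    locus = sections["header"][0].split()
    # Feature keys start in column 6; qualifier lines are indented further.
    keys = [l.split()[0] for l in sections["features"]
            if len(l) > 5 and l[5] != " "]
    # Drop the position numbers and spaces from the sequence lines.
    seq = "".join(ch for l in sections["sequence"] for ch in l if ch.isalpha())
    return {"locus_name": locus[1], "length": int(locus[2]),
            "type": locus[4], "feature_keys": keys, "sequence": seq}


rec = split_genbank(RECORD)
print(rec["locus_name"], rec["length"], rec["type"])
print(rec["feature_keys"])
print(rec["sequence"])
```

For real work a maintained parser is a better choice than hand-written splitting; this sketch only shows where each part of the entry sits.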
Protein databases
GenPept
Translated coding exons from GenBank sequences
(GenBank peptide, GenPept)
trEMBL & SWISS-PROT
Swiss Institute of Bioinformatics (1998)
trEMBL contains translations of all coding sequences in EMBL
SP-trEMBL contains entries that will eventually be incorporated
into SWISS-PROT but have not been manually annotated
trEMBL & SWISS-PROT (cont.)
REM-trEMBL contains sequences that are not destined to be included
in SWISS-PROT e.g. fragments of fewer than eight amino acids,
synthetic sequences, patented sequences
SWISS-PROT provides high-level annotation, including
Descriptions of the function of the protein
Structure of the domains
Post-translational modifications
All variants
Aims to be minimally redundant
SP-trEMBL is effectively a preliminary section of SWISS-PROT
Entries in SP-trEMBL are removed when they are incorporated into
the SWISS-PROT database
Problems with the databases
The bulk of proteins in GenBank have had their functional
annotation assigned by automatic methods; the quality and
reliability of this information is increasingly doubtful
Majority of new DNA sequence information is annotated
automatically.
Genes are found in genomic DNA by open reading frame (ORF)
detection algorithms and automatically annotated based on
similarity searches against the proteins in the databases
These automatic annotations have proved to be rather powerful in the
case of microbial genomes
In eukaryotic genomes there are complexities that cannot be easily
overcome by simple algorithms. It requires careful scrutiny to
distinguish between a pseudogene and a member of a multi-gene
family. Transposons and many other types of repeated sequences
occur both within expressed coding sequences and in non-coding
sequences
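The ORF detection step mentioned above can be sketched as a simple scan for a start codon followed in frame by a stop codon. This is a minimal illustration, not the algorithm used by any particular annotation pipeline: the function name `find_orfs`, the `min_codons` threshold and the test sequence are assumptions, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start and end are 0-based, end-exclusive positions; the codon count
    used for the min_codons filter includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs


orfs = find_orfs("CCATGGCCAAATAGGG")
print(orfs)  # [(2, 14, 2)]
```

A scan like this assumes an uninterrupted coding region, which is why it suits microbial genomes better than eukaryotic ones: it knows nothing about introns, pseudogenes or repeated sequences, and those cases need the careful scrutiny described above.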