Bioinformatics Unit I


Bioinformatics

Unit I
Three publicly accessible databases store large amounts of nucleotide and protein sequence data:
1. GenBank, at the National Center for Biotechnology Information (NCBI) of the National
Institutes of Health (NIH) in Bethesda, US.
2. The DNA Database of Japan (DDBJ), at the National Institute of Genetics in Mishima, Japan.
3. The European Nucleotide Archive (ENA), at the European Bioinformatics Institute (EBI) in
Hinxton, England.
These three databases share their sequence data daily. They are coordinated by the International
Nucleotide Sequence Database Collaboration (INSDC).
Expressed Sequence Tags (ESTs)
The database of expressed sequence tags (dbEST) is a division of GenBank that contains
sequence data and other information on “single-pass” cDNA sequences from a number of
organisms.
An EST is a partial DNA sequence of a cDNA clone. All cDNA clones, and thus all ESTs, are
derived from some specific RNA source such as human brain or rat liver. The RNA is converted
into a more stable form, cDNA, which may then be packaged into a cDNA library.
ESTs are typically randomly selected cDNA clones that are sequenced on one strand (and thus
may have a relatively high sequencing error rate). ESTs are often 300 to 800 bp in length. The
earliest efforts to sequence ESTs resulted in the identification of many hundreds of genes that
were novel at the time.
ESTs and UniGene
The goal of the UniGene (unique gene) project is to create gene-oriented clusters by
automatically partitioning ESTs into nonredundant sets. Ultimately there should be one UniGene
cluster assigned to each gene of an organism. A cluster may contain as few as one EST,
reflecting a gene that is rarely expressed, or as many as tens of thousands of ESTs, reflecting a
highly expressed gene.
There are now thought to be approximately 22,000 human genes. One might expect an equal
number of UniGene clusters. However, in practice, there are more UniGene clusters than there
are genes: currently, there are about 120,000 human UniGene clusters. This discrepancy could
occur for three reasons.
(1) Clusters of ESTs could correspond to distinct regions of one gene. In that case there would be
two (or more) UniGene entries corresponding to a single gene. Two UniGene clusters may
properly cluster into one, and the number of UniGene clusters may collapse over time.

(2) In the past several years it has become appreciated that much of the genome is transcribed at
low levels. Currently, 40,000 human UniGene clusters consist of a single EST, and over 76,000
UniGene clusters consist of just one to four ESTs. These could reflect authentic genes that have
not yet been appreciated by other means of gene identification. Alternatively they may represent
rare transcription events of unknown biological relevance.
(3) Some DNA may be transcribed during the creation of a cDNA library without corresponding
to an authentic transcript.
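
To put the figures above in proportion, the following minimal sketch works through the arithmetic. It uses only the approximate numbers quoted in this section (the text gives "over 76,000" for the one-to-four-EST clusters, so that fraction is a lower bound); the variable names are illustrative.

```python
# Approximate figures quoted in the text; they differ between UniGene builds.
human_genes = 22_000
unigene_clusters = 120_000
singleton_clusters = 40_000   # clusters consisting of a single EST
small_clusters = 76_000       # clusters consisting of one to four ESTs (lower bound)

# How many clusters exist per expected gene, and how much of the total
# is made up of very small clusters.
clusters_per_gene = unigene_clusters / human_genes
singleton_fraction = singleton_clusters / unigene_clusters
small_fraction = small_clusters / unigene_clusters

print(f"Clusters per gene: {clusters_per_gene:.1f}")
print(f"Single-EST clusters: {singleton_fraction:.0%}")
print(f"Clusters with 1-4 ESTs: {small_fraction:.0%}")
```

On these numbers there are roughly five clusters for every expected gene, and close to two thirds of all clusters contain four ESTs or fewer, which is why reasons (2) and (3) can account for much of the excess.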
The Reference Sequence (RefSeq) Project
One of the most important recent developments in the management of molecular sequences is
RefSeq. The goal of RefSeq is to provide the best representative sequence for each normal (i.e.,
nonmutated) transcript produced by a gene and for each normal protein product. There may be
hundreds of GenBank accession numbers corresponding to a gene, since GenBank is an archival
database that is often highly redundant. However, there will be only one RefSeq entry
corresponding to a given gene or gene product. RefSeq entries are curated by the staff at NCBI,
and are nearly nonredundant.
We can recognize a RefSeq accession by its format, such as NP_000509 (the P stands for protein;
this entry is beta globin protein) or NM_000518 (the M stands for mRNA; this entry is beta
globin mRNA).
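
The accession format can be checked programmatically. The sketch below is only an illustration: the function name `describe_refseq` and the prefix table are assumptions made for this example, and the table covers a subset of the prefixes RefSeq defines.

```python
import re

# A subset of RefSeq accession prefixes; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "NR": "non-coding RNA",
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "XM": "predicted mRNA model",
    "XP": "predicted protein model",
}

# Two capital letters, an underscore, digits, and an optional ".version" suffix.
_ACCESSION = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def describe_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


print(describe_refseq("NP_000509"))    # protein
print(describe_refseq("NM_006744.4"))  # mRNA
print(describe_refseq("ABC123"))       # None (no two-letter prefix and underscore)
```

The underscore after the two-letter prefix is what distinguishes RefSeq accessions from ordinary GenBank accessions, which do not contain one.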
A variety of RefSeq identifiers are shown in Table for beta globin:

Entrez Gene (Formerly LocusLink)


Entrez Gene is particularly useful as a major portal. It is a curated database containing descriptive
information about genetic loci. You can obtain information on official nomenclature, aliases,
sequence accessions, phenotypes, EC numbers, OMIM numbers, UniGene clusters, HomoloGene
(a database that reports eukaryotic orthologs), map locations, and related websites.

ACCESS TO INFORMATION: PROTEIN DATABASES


UniProt
The Universal Protein Resource (UniProt) is the most comprehensive, centralized protein
sequence catalog; it consists of a combination of three key databases.
(1) Swiss-Prot is considered the best-annotated protein database, with descriptions of protein
structure and function added by expert curators.
(2) The translated EMBL (TrEMBL) Nucleotide Sequence Database Library provides automated
(rather than manual) annotations of proteins not in Swiss-Prot. It was created because of the vast
number of protein sequences that have become available through genome sequencing projects.
(3) The Protein Information Resource (PIR) maintains the Protein Sequence Database, another
protein database curated by experts.
UniProt is organized in three database layers.
(1) The UniProt Knowledgebase (UniProtKB) is the central database that is divided into the
manually annotated UniProtKB/Swiss-Prot and the computationally annotated
UniProtKB/TrEMBL.
(2) The UniProt Reference Clusters (UniRef) offer nonredundant reference clusters based on
UniProtKB. UniRef clusters are available with members sharing at least 50%, 90%, or 100%
identity.
(3) The UniProt Archive, UniParc, consists of a stable, nonredundant archive of protein
sequences from a wide variety of sources (including model organism databases, patent offices,
RefSeq, and Ensembl).
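
The identity thresholds used by UniRef can be illustrated with a toy calculation. This is a minimal sketch and not UniProt's clustering procedure: it assumes the two sequences are already aligned to equal length (with "-" for gaps), defines identity as matching non-gap columns divided by alignment length, and uses made-up peptide sequences. The function names are hypothetical.

```python
UNIREF_THRESHOLDS = (100, 90, 50)  # identity levels named in the text


def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


def uniref_levels_met(identity):
    """List the UniRef identity levels that a given percent identity satisfies."""
    return [f"UniRef{t}" for t in UNIREF_THRESHOLDS if identity >= t]


# Toy sequences differing at one of ten positions.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
print(identity)                     # 90.0
print(uniref_levels_met(identity))  # ['UniRef90', 'UniRef50']
```

Two sequences at 90% identity could share a 90% or 50% cluster but not a 100% one, which is how the three levels trade redundancy against coverage.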
OMIM
Online Mendelian Inheritance in Man (OMIM) is a catalog of human genes and genetic
disorders. It was created by Victor McKusick and his colleagues and developed for the World
Wide Web by NCBI (Hamosh et al., 2005). The database contains detailed reference
information. It also contains links to PubMed articles and sequence information. We describe
OMIM in Chapter 20 (on human disease).

Protein Domain
A protein domain is a conserved part of a given protein sequence and (tertiary) structure that can
evolve, function, and exist independently of the rest of the protein chain.
Each domain forms a compact three-dimensional structure and often can be independently stable
and folded. Domains vary in length from about 25 amino acids up to about 500 amino acids.

Figure: domains in a trans-membrane protein

Protein Structure Classification


Protein structure classification systems allow us to compare known protein structures for the
identification of relationships among structures.
Several protein structure classification systems have been developed. The two most popular are
Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, and
Homologous superfamily (CATH), both of which organize structures into hierarchical levels.

1. Structural Classification of Proteins (SCOP)


SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/ ) is a database for comparing and classifying protein
structures. It is constructed almost entirely based on manual examination of protein structures.
The proteins are grouped into hierarchies of classes, folds, superfamilies, and families.
Class: Classes consist of folds with similar core structures. This is at the highest level of the
hierarchy, which distinguishes groups of proteins by secondary structure compositions such as
all α, all β, α and β, and so on.

Folds: Folds consist of superfamilies with a common core structure, which is determined
manually. This level describes similar overall secondary structures with similar orientation and
connectivity between them. Members within the same fold do not always have evolutionary
relationships. Some of the shared core structure may be a result of analogy.
Superfamily: Superfamilies consist of families with similar structures, but weak sequence
similarity. It is believed that members of the same superfamily share a common ancestral origin,
although the relationships between families are considered distant.
Family: The SCOP families consist of proteins having high sequence identity (>30%). Thus, the
proteins within a family clearly share close evolutionary relationships and normally have the
same functionality. The protein structures at this level are also extremely similar.
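
The four SCOP levels can be modeled as a simple nested record. The sketch below is an illustration only: the class name `ScopClassification`, the function `deepest_shared_level`, and all labels (such as "fold-1") are hypothetical placeholders, not real SCOP entries.

```python
from dataclasses import dataclass

SCOP_LEVELS = ("class", "fold", "superfamily", "family")  # broadest to narrowest


@dataclass(frozen=True)
class ScopClassification:
    scop_class: str
    fold: str
    superfamily: str
    family: str

    def as_tuple(self):
        return (self.scop_class, self.fold, self.superfamily, self.family)


def deepest_shared_level(a, b):
    """Return the narrowest SCOP level at which two proteins share a label, or None."""
    shared = None
    for level, label_a, label_b in zip(SCOP_LEVELS, a.as_tuple(), b.as_tuple()):
        if label_a != label_b:
            break  # a mismatch at a broad level rules out all narrower levels
        shared = level
    return shared


# Hypothetical proteins with placeholder labels.
protein_x = ScopClassification("all alpha", "fold-1", "superfamily-1", "family-1")
protein_y = ScopClassification("all alpha", "fold-1", "superfamily-1", "family-2")
protein_z = ScopClassification("all alpha", "fold-1", "superfamily-2", "family-3")

print(deepest_shared_level(protein_x, protein_y))  # superfamily
print(deepest_shared_level(protein_x, protein_z))  # fold
```

Read against the definitions above, sharing a superfamily suggests a distant common ancestor, while sharing only a fold leaves open the possibility that the similarity arose by analogy.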
2. CATH
CATH (www.biochem.ucl.ac.uk/bsm/cath_new/index.html) classifies proteins based on the
automatic structural alignment program SSAP as well as manual comparison. Structural domain
separation is likewise carried out as a combined effort of a human expert and computer programs.
Individual domain structures are classified at five major levels: class, architecture, fold/topology,
homologous superfamily, and homologous family.
Class: The definition for class in CATH is similar to that in SCOP, and is based on secondary
structure content.
Architecture: Architecture is a unique level in CATH, intermediate between fold and class. This
level describes the overall packing and arrangement of secondary structures independent of
connectivity between the elements.
Topology: The topology level is equivalent to the fold level in SCOP, which describes overall
orientation of secondary structures and takes into account the sequence connectivity between the
secondary structure elements.
The homologous superfamily and homologous family levels are equivalent to the superfamily
and family levels in SCOP with similar evolutionary definitions, respectively.
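
The correspondence between the two hierarchies described above can be written down as a lookup table. This is a minimal sketch; the names `CATH_TO_SCOP` and `scop_equivalent` are assumptions made for this example, and the mapping simply restates the text (the class levels are described as similar rather than strictly identical).

```python
# CATH levels from broadest to narrowest, as listed in the text.
CATH_LEVELS = (
    "class",
    "architecture",
    "topology",
    "homologous superfamily",
    "homologous family",
)

# SCOP counterpart of each CATH level; None marks a level unique to CATH.
CATH_TO_SCOP = {
    "class": "class",
    "architecture": None,
    "topology": "fold",
    "homologous superfamily": "superfamily",
    "homologous family": "family",
}


def scop_equivalent(cath_level):
    """Return the SCOP level matching a CATH level, or None if SCOP has no counterpart."""
    key = cath_level.strip().lower()
    if key not in CATH_TO_SCOP:
        raise KeyError(f"unknown CATH level: {cath_level!r}")
    return CATH_TO_SCOP[key]


for level in CATH_LEVELS:
    print(f"{level} -> {scop_equivalent(level)}")
```

Architecture is the only level that maps to nothing in SCOP, because it ignores the connectivity between secondary structure elements, which SCOP's fold level takes into account.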
