Bioinformatics:: Guide To Bio-Computing and The Internet

Bioinformatics:
Guide to bio-computing and the Internet
Copyright© Kerstin Wagner

Introduction: What is bioinformatics?
Bioinformatics can be defined as the body of tools and algorithms needed to handle large
and complex biological information.
Bioinformatics is a new scientific discipline created from the interaction
of biology and computer science.
The NCBI defines bioinformatics as:

"Bioinformatics is the field of science in which biology, computer
science, and information technology merge into a single discipline"
Genomics era: High-throughput DNA sequencing
The first high-throughput genomics technology was automated DNA sequencing
in the early 1990s.
In 1995, Venter and Hamilton Smith used the whole-genome shotgun sequencing
strategy to sequence the genomes of Mycoplasma and Haemophilus.
In September 1999, Celera Genomics completed the sequencing of the
Drosophila genome.
The 3-billion-bp human genome sequence was generated in a competition between
the publicly funded Human Genome Project and Celera.
High-throughput DNA sequencing
[Figure, top image: confocal detection of fluorescently labeled DNA by the MegaBACE sequencer]
That was then. How about now?
Next Generation Sequencing
(2010) vol11:31
Genomics: Completed genomes as of 2010
The genomes of the following numbers of organisms have been sequenced so far:
1598 bacterial / 85 archaeal / 294 eukaryotic genomes
This generates large amounts of information to be handled by individual computers.
The trend of data growth
The 21st century is a century of biotechnology:
[Figure: nucleotides (billions, 0 to 8) plotted against years (1980 to 2000)]
• Genomics: New sequence information is being produced at increasing rates. (The
contents of GenBank double every year.)
• Microarray: Global expression analysis: RNA levels of every gene in the genome
analyzed in parallel. (OUT! Replaced by RNA-seq.)
• Proteomics: Global protein analysis generates large mass spectra libraries.
• Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.
Metagenomics
- “Who is there and what are they doing?”
- Cultivation-independent approaches to study the big impact of microbes

How to handle the large amount of information?
[Cartoon by Drew Sheneman, New Jersey, The Newark Star Ledger]
Answer: bioinformatics and the Internet

Bioinformatics history
In the 1960s: the birth of bioinformatics
[Figure: IBM 7090 computer]
Margaret Oakley Dayhoff created:
- The first protein database
- The first program for sequence assembly
There is a need for computers and algorithms that allow:
accessing, processing, storing, sharing, retrieving, visualizing, annotating…
Why do we need the Internet?
“Omics” projects and the information associated with them involve a huge amount
of data that is stored on computers all over the world.
Because it is impossible to maintain up-to-date copies of all relevant
databases within the lab, access to the data is via the Internet.
[Figure: database storage, with a "You are here" marker]
The Commercial Market
The current bioinformatics market is worth $300 million / year
(half of it software)
Prediction: $2 billion / year in 5-6 years
~50 bioinformatics companies:
Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode
Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic,
GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools,
Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist,
eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic,
GeneFormatics, Molecular Simulations, Bioinformatics Solutions….
Scope of this lab
The lab will touch on the following computational tasks:
- Similarity search
- Sequence comparison: alignment, multiple alignment, retrieval
- Sequence analysis: signal peptide, transmembrane domain,…
- Protein folding: secondary structure from sequence
- Sequence evolution: phylogenetic trees
It will also make you familiar with the bioinformatics resources available on the
web to do these tasks.
Applying algorithms to analyze genomics data
You have just cloned a gene. Questions to ask:
- Is it already in databases? Accession #? Annotation?
- Protein characteristics? Sub-localization? Soluble? 3D fold?
- Other information? Expression profile? Mutants?
- Are there conserved regions? Alignments? Domains?
- Are there similar sequences? % identity? Family member?
- Evolutionary relationship? Phylogenetic tree
A critical failure of current bioinformatics is the lack of a single software
package that can perform all of these functions.
DNA (nucleotide sequences) databases
These are large databases, and searching any one of the major ones should produce
similar results because they exchange information routinely.
-GenBank (NCBI): http://www.ncbi.nlm.nih.gov
-DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
Specialized databases: tissues, species…
-ESTs (Expressed Sequence Tags)
~at NCBI http://www.ncbi.nlm.nih.gov/dbEST
~at TIGR http://tigr.org/tdb/tgi
- ...many more!
Protein (amino acid) databases
They are big databases too:
-Swiss-Prot (very high level of annotation)
http://au.expasy.org/
-PIR (Protein Information Resource): the world's most
comprehensive catalog of information on proteins
http://www.pir.uniprot.org/
Translated databases:
-TrEMBL (translated EMBL): includes entries that have
not yet been annotated into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
-GenPept (translation of coding regions in GenBank)
-PDB (sequences derived from the 3D structures in the
Brookhaven PDB) http://www.rcsb.org/pdb/
Database homology searching
Homology searching uses algorithms that give searches an efficient mathematical basis,
which can be translated into statistical significance.
It assumes that sequence, structure, and function are inter-related.
All similarity searching methods rely on the concepts of alignment
and distance between sequences.
A similarity score is calculated from a distance: the number of DNA
bases or amino acids that are different between two sequences.
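As a minimal sketch of the distance idea above, the snippet below counts differing positions between two already aligned, equal-length sequences and turns the count into a percent identity. The function names and the example sequences are illustrative assumptions, not part of the lab material.

```python
def hamming_distance(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


def percent_identity(seq_a, seq_b):
    """Similarity derived from the distance: share of identical positions."""
    distance = hamming_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)


# Hypothetical example sequences (7 bases, 2 differences).
print(hamming_distance("GATTACA", "GACTATA"))
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real search tools do not use this raw count directly; they score each aligned position with a substitution matrix and penalize gaps.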
Calculating alignment scores
Scoring system: uses scoring matrices that allow biologists to quantify the
quality of sequence alignments.
The raw score S is calculated by summing the scores for each aligned
position and the scores for gaps. Gap creation/extension scores are
inherent to the scoring system in use (BLAST, FASTA…).
The score for an identity or a mismatch is given by the specified substitution
matrix (e.g., BLOSUM62).
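The raw-score calculation described above can be sketched as follows. This is an illustration under stated assumptions: the substitution function is a toy (+2 for identity, -1 for mismatch) standing in for a real matrix such as BLOSUM62, and the gap creation (-5) and extension (-1) values are hypothetical, not the defaults of any particular program.

```python
def toy_substitution(x, y):
    """Toy stand-in for a substitution matrix lookup (assumption)."""
    return 2 if x == y else -1


def raw_score(aligned_a, aligned_b, substitution=toy_substitution,
              gap_open=-5, gap_extend=-1):
    """Sum per-position scores of a gapped alignment ('-' marks a gap).

    The first position of a gap run costs gap_open; every further
    position of the same run costs gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap = None  # which sequence carried the gap at the last position
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            current_gap = "a" if x == "-" else "b"
            score += gap_extend if current_gap == previous_gap else gap_open
            previous_gap = current_gap
        else:
            score += substitution(x, y)
            previous_gap = None
    return score


# 5 identities, 2 mismatches, one gap of length 2:
# 5*2 + 2*(-1) + (-5) + (-1) = 2
print(raw_score("QKESG--PS", "QQESGLVRS"))
```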
Devising a scoring system
Some popular scoring matrices are:
• PAM (Point Accepted Mutation): for evolutionary studies.
For example, PAM1 corresponds to 1 accepted point mutation per 100 amino
acids.
• BLOSUM (BLOcks SUbstitution Matrix): for finding
common motifs. For example, BLOSUM62 is built from alignments of
sequences sharing no more than 62% identity.
How the PAM matrices were created:
• Very similar sequences were aligned.
• From these alignments, the frequency of substitution between
each pair of amino acids was calculated and then PAM1 was built.
• The full series of PAM matrices can be calculated by multiplying the PAM1
matrix by itself; each result is then normalized to log-odds format.
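The step from PAM1 to higher PAM matrices can be sketched with plain matrix multiplication. The 3-letter alphabet and the probabilities in `TOY_PAM1` are invented for illustration (the real PAM1 is a 20 x 20 matrix derived from aligned protein families); only the mechanism, repeated multiplication of a mutation probability matrix, is the point.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, exponent):
    """Multiply a matrix by itself 'exponent' times, starting from identity."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM1-like matrix: each residue stays unchanged with
# probability 0.99, i.e. about 1 accepted mutation per 100 residues.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

Each row still sums to 1 after multiplication, but the diagonal shrinks as the evolutionary distance grows, which is why higher PAM numbers suit more divergent sequences.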
Devising a scoring system
Importance:
• Scoring matrices appear in all analyses
involving sequence comparison.
• The choice of matrix can strongly influence
the outcome of the analysis.
• Understanding the theory underlying a given
scoring matrix can aid in making the proper
choice:
-Some matrices reflect similarity: good for
database searching
-Some reflect distance: good for phylogenies
• Log-odds matrices, a normalisation method for matrix values:
$$S_{ij} = \log \frac{q_{ij}}{p_i \, p_j}$$
• $S_{ij}$ is the log of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent
and the probability that they are aligned by chance.
$q_{ij}$ is the frequency with which i and j are observed to align in sequences known to
be related. $p_i$ and $p_j$ are their frequencies of occurrence in the set of sequences.
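A small numeric sketch of the log-odds idea: the score is positive when a pair of residues aligns more often in related sequences than chance predicts, and negative when it aligns less often. The frequencies below are made-up numbers, and the base-2 logarithm is an assumption (published matrices apply their own scaling and rounding).

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log of observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical background frequencies for residues i and j.
p_i, p_j = 0.10, 0.05  # chance expectation: 0.10 * 0.05 = 0.005

# Pair seen 4 times more often than chance gives a positive score.
print(round(log_odds(0.02, p_i, p_j), 3))

# Pair seen 4 times less often than chance gives a negative score.
print(round(log_odds(0.00125, p_i, p_j), 3))
```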

Database search methods: Sequence Alignment
Two broad classes of sequence alignments exist:
• Global alignment: not sensitive
QKESGPSSSYC
VQQESGLVRTTC
• Local alignment: faster
ESG
ESG
The most widely used local similarity algorithms are:
- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/;
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
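To make local alignment concrete, here is a compact sketch of the Smith-Waterman dynamic programming scheme applied to the two example sequences from this slide. The scoring values (match +2, mismatch -1, linear gap penalty -2) are assumptions chosen for readability; real searches use a substitution matrix and separate gap creation/extension penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, local_a, local_b = smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC")
print(score)
print(local_a)
print(local_b)
```

The table has one cell per pair of residues, so the work grows with the product of the two sequence lengths. This is why the method is slow on whole databases.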
Which algorithm to use for database similarity search?
Speed:
BLAST > FASTA > Smith-Waterman (it is VERY SLOW and uses a
LOT OF COMPUTER POWER)
Sensitivity/statistics:
FASTA is more sensitive and misses fewer homologues.
Smith-Waterman is even more sensitive.
BLAST calculates probabilities.
FASTA is more accurate than BLAST for DNA-DNA searches.

k-tuple methods provide near-optimal alignments
These methods are heuristic: they are much faster and excellent in comparing sequences,
but they do not guarantee the optimal alignment.
BLAST and FASTA programs are based on k-tuple algorithms:
1-Using the query sequence, derive a list of
words of length w (e.g., 3)
2-Keep high-scoring words using a
scoring matrix (e.g., BLOSUM62)
3-High-scoring words are compared
with database sequences
4-Sequences with many matches to
high-scoring words are used for final
alignments
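The word-based steps above can be sketched in a few lines. This is a deliberate simplification, not how BLAST or FASTA are implemented: step 2 is reduced to exact word matches (no scoring matrix, no neighbourhood words), and the toy database with its entry names is invented for illustration.

```python
def words(seq, w=3):
    """Step 1: all overlapping words of length w in a sequence."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def count_word_hits(query, database, w=3):
    """Step 3: count database words that also occur in the query."""
    query_words = set(words(query, w))
    return {
        name: sum(1 for word in words(seq, w) if word in query_words)
        for name, seq in database.items()
    }


def rank_candidates(query, database, w=3, min_hits=1):
    """Step 4: keep sequences with enough word hits, best first."""
    hits = count_word_hits(query, database, w)
    kept = [(name, count) for name, count in hits.items() if count >= min_hits]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical three-entry protein "database".
toy_database = {
    "seq1": "VQQESGLVRTTC",
    "seq2": "MKTAYIAKQR",
    "seq3": "AQKESGPSTYC",
}
print(rank_candidates("QKESGPSSSYC", toy_database))
```

Only the candidates that survive this cheap filter would be passed on to a full alignment step, which is where the speed advantage over Smith-Waterman comes from.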
Tools to search databases
The dilemma: DNA or protein?
Search by similarity: using the nucleotide sequence or using the amino acid sequence?
• Is the comparison of two nucleotide sequences accurate?
• By translating into amino acid sequence, are we losing information?
The genetic code is degenerate (two or more codons can represent
the same amino acid).
• Very different DNA sequences may code for similar protein sequences.
We certainly do not want to miss those cases!
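A short illustration of the degeneracy argument: the two DNA sequences below were constructed for this example (they are not from the lab material) so that they share few bases yet translate to the same peptide. Only the six codons needed here are listed, all taken from the standard genetic code.

```python
# Partial codon table: just the synonymous codons used in this example.
CODONS = {
    "TTA": "L", "CTG": "L",   # leucine
    "AGT": "S", "TCC": "S",   # serine
    "CGT": "R", "AGA": "R",   # arginine
}


def translate(dna):
    """Translate a DNA string codon by codon (length must be a multiple of 3)."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identical_positions(a, b):
    """Count positions carrying the same letter in two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x == y)


dna_1 = "TTAAGTCGT"
dna_2 = "CTGTCCAGA"

print(translate(dna_1), translate(dna_2))
print(identical_positions(dna_1, dna_2), "of", len(dna_1), "bases identical")
```

A nucleotide search could easily miss this pair, while a protein-level comparison sees a perfect match.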
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: a good alignment with end-gaps; a very poor alignment. Almost 50% identity!]
Proteins are conserved in evolution (DNA similarity decays faster!)
Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
• For very highly similar sequences, nucleotide comparison may give better results.
BLAST and FASTA variants
FASTA: Compares a DNA query to a DNA database, or a protein query
to a protein database.
FASTX: Compares a translated DNA query to a protein database.
TFASTA: Compares a protein query to a translated DNA database.
BLASTN: Compares a DNA query to a DNA database.
BLASTP: Compares a protein query to a protein database.
BLASTX: Compares the 6-frame translations of a DNA query to a protein
database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA
database.
TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame
translations of a DNA database (every frame of the query is compared with
every frame of the database, so each search is comparable to 36
BLASTP searches!)
PSI-BLAST: Performs iterative database searches. The results from each round
are incorporated into a 'position specific' score matrix, which is
used for further searching.
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov
[Figure: BLAST results]
[Figure: detailed BLAST results]
E value: the expectation value, i.e. the number of hits with a similar or better score
that you would expect to find by chance in a database of this size. The lower the E,
the more significant the score.
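For readers who want to see where an E value comes from, the sketch below uses the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), which is not stated on the slide. The default values of `lam` and `k` are illustrative assumptions of the order reported for BLOSUM62 with gaps, and plain sequence lengths are used where real programs use corrected "effective" lengths.

```python
import math


def e_value(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam=0.267, k=0.041):
    """Normalized score, comparable across scoring systems."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(e):
    """Probability of seeing at least one such chance hit."""
    return 1.0 - math.exp(-e)


# Hypothetical search: 300-residue query against 1,000,000 database residues.
for score in (40, 60, 80):
    e = e_value(score, 300, 1_000_000)
    print(score, round(bit_score(score), 1), f"{e:.2e}", f"{p_value(e):.2e}")
```

Because E scales with the database size, the same raw score becomes less significant as the database grows. For small E, the probability and the expectation are nearly equal.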
Database searching tips
Use the latest database version.
Use BLAST first, then a finer tool (FASTA,…)
Search both strands when using FASTA.
Translate sequences where relevant.
Search the 6-frame translation of the DNA database.
E < 0.05 is statistically significant and usually biologically interesting.
If the query has repeated segments, delete them and repeat the search.
Most widely used sites for sequence analysis
Sites for alignment of 2 sequences:
T-COFFEE (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi): more
accurate than ClustalW for sequences with less than 30% identity.
ClustalW (http://www.ch.embnet.org/software/ClustalW.html;
http://align.genome.jp)
bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html)
Sites for DNA to protein translation:
These tools can translate DNA sequences in any of the three forward or three
reverse reading frames.
Translate (http://au.expasy.org/tools/dna.html)
Translate a DNA sequence: (http://www.vivo.colostate.edu/molkit/translate/index.html)
Transeq (http://www.ebi.ac.uk/emboss/transeq)
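What these translation sites do can be sketched as a six-frame translation. The code below is an illustration and not the implementation behind any of the listed tools. It assumes the standard genetic code, an input containing only A, C, G and T, and frame labels (+1 to +3, -1 to -3) chosen for this example. Stop codons are shown as `*`.

```python
from itertools import product

# Standard genetic code in the usual compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate the three forward and three reverse reading frames."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```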
BioEdit — a sequence editing software package
http://www.mbio.ncsu.edu/bioedit/bioedit.html
Oligo Design and Analysis Tools
http://www.idtdna.com/scitools/scitools.aspx
