Bioinformatics: Guide To Bio-Computing and The Internet
(2010) vol11:31
Genomics: Completed genomes as of 2010
The genomes of the following organisms have currently been sequenced:
Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year.)
[Figure: GenBank size in nucleotides (billions, axis 0 to 6) by year, 1980 to 2000, with a "You are here" marker]
The Commercial Market
The current bioinformatics market is worth 300 million / year (half of it software).
-TIGR: http://tigr.org/tdb/tgi
-Yeast: http://yeastgenome.org
-Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi
Translated databases:
-TREMBL (translated EMBL): includes entries that have
not been annotated yet into Swiss-Prot.
http://www.ebi.ac.uk/trembl/access.html
All similarity searching methods rely on the concepts of alignment and distance between sequences.
A similarity score is calculated from a distance: the number of DNA bases or amino acids that differ between two sequences.
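The distance described above can be sketched as a simple position-by-position count. This is a minimal illustration that assumes the two sequences are already aligned and of equal length; the function names and the example sequences are hypothetical and not part of the original slides.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Turn the distance into a simple similarity measure (percent identical positions)."""
    distance = hamming_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)


if __name__ == "__main__":
    # Hypothetical example sequences, chosen only for illustration.
    print(hamming_distance("GATTACA", "GACTATA"))
    print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real search tools do not stop at counting differences: they weight each substitution with a scoring matrix and allow gaps.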
Calculating alignment scores
Scoring system: Uses scoring matrices that allow biologists to quantify the
quality of sequence alignments.
The raw score S is calculated by summing the scores for each aligned
position and the scores for gaps. Gap creation/extension scores are
inherent to the scoring system in use (BLAST, FASTA…)
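As a rough sketch of how a raw score is summed, the function below walks over two already-aligned sequences and adds a matrix score for each aligned pair and a penalty for each gap position. The toy matrix (+2 for a match, -1 for a mismatch) and the gap costs (-5 for the first position of a gap, -1 for each additional position) are assumptions made for illustration; they are not the values used by BLAST or FASTA, and other conventions for charging the first gap position exist.

```python
# Hypothetical toy DNA scoring matrix: +2 for a match, -1 for a mismatch.
TOY_MATRIX = {(a, b): (2 if a == b else -1) for a in "ACGT" for b in "ACGT"}


def raw_score(aligned_a, aligned_b, matrix, gap_open=-5, gap_extend=-1):
    """Sum per-position scores and gap penalties for a given alignment.

    Gaps are written as '-'. The first position of a gap run costs
    gap_open; each further position in the same run costs gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain gaps in both sequences")
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            total += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            total += matrix[(a, b)]
            gap_side = None
    return total


if __name__ == "__main__":
    print(raw_score("ACGT--AC", "ACCTGGAC", TOY_MATRIX))
```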
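The substitution score Sij for two residues, i and j, is a log-odds score: it compares the probability that i and j are aligned by evolutionary descent with the probability that they are aligned by chance.
qij are the frequencies with which i and j are observed to align in sequences known to be related.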
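The formula itself did not survive on this slide. In the standard log-odds form used for substitution matrices, the score is the logarithm of qij divided by the product of the background frequencies of the two residues, scaled by a constant. The sketch below assumes that form; the names p_i, p_j and lam (the scaling constant), and the example frequencies, are assumptions and do not come from the slides.

```python
import math


def log_odds_score(q_ij, p_i, p_j, lam=math.log(2) / 2):
    """Log-odds substitution score: ln(q_ij / (p_i * p_j)) / lam.

    q_ij     : observed frequency of i aligned with j in related sequences
    p_i, p_j : background frequencies of residues i and j (assumed inputs)
    lam      : scaling constant; ln(2)/2 expresses the score in half-bits
    """
    return math.log(q_ij / (p_i * p_j)) / lam


if __name__ == "__main__":
    # Hypothetical frequencies: the pair is seen 8x more often than chance predicts.
    print(round(log_odds_score(0.02, 0.05, 0.05)))
    # A pair seen less often than chance gets a negative score.
    print(log_odds_score(0.001, 0.05, 0.05) < 0)
```

A positive score means the pair is observed more often in related sequences than chance predicts; a negative score means it is observed less often.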
Global alignment: not sensitive
QKESGPSSSYC
VQQESGLVRTTC
Local alignment: faster
ESG
ESG
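The local alignment of the two example sequences can be reproduced with a small Smith-Waterman sketch. The scoring values (match +2, mismatch -3, a flat gap cost of -4 per position) are assumptions chosen for illustration, and the flat gap cost is a simplification of the gap creation/extension scheme that real tools use.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero (this makes it local).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,  # align the two residues
                          H[i - 1][j] + gap,      # gap in seq_b
                          H[i][j - 1] + gap)      # gap in seq_a
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))
```

With these assumed scores, the best local alignment of the two sequences is the shared ESG segment.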
http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
Which algorithm to use for database similarity search?
Speed:
BLAST > FASTA > Smith-Waterman (which is very slow and uses a lot of computing power)
Sensitivity/statistics:
FASTA is more sensitive than BLAST and misses fewer homologues.
Smith-Waterman is even more sensitive.
BLAST calculates probabilities.
Search by similarity
Very different DNA sequences may code for similar protein sequences
We certainly do not want to miss those cases!
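A small sketch can make this concrete: because the genetic code is redundant, two DNA sequences that share few bases can still translate to the same peptide. The codon table below is the standard genetic code; the two DNA sequences are hypothetical examples constructed for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA sequence in reading frame 1 ('*' marks a stop codon)."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identical_positions(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length sequences carry the same letter."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    dna_1 = "CTTTCTCGT"  # hypothetical example
    dna_2 = "TTGAGCAGA"  # hypothetical example
    print(translate(dna_1), translate(dna_2))
    print(identical_positions(dna_1, dna_2), "of", len(dna_1), "bases identical")
```

The two DNA sequences agree at only 2 of 9 positions, yet both encode the same three residues, which a protein-level comparison detects immediately.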
Reasons for translating
Comparing DNA sequences gives more random matches:
[Figure: two example alignments, captioned "A good alignment with end-gaps" and "A very poor alignment"]
Conclusion:
It is almost always better to compare coding sequences in their amino acid form,
especially if they are very divergent.
For very highly similar sequences, however, comparing at the nucleotide level may give better results.
BLAST and FASTA variants
PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position-specific' score matrix, which is used for further searching.
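The idea of a position-specific score matrix can be sketched in a few lines. This is a heavily simplified illustration, not PSI-BLAST's actual procedure: it assumes ungapped hits of equal length, a uniform background frequency, and a fixed pseudocount, and it omits the sequence weighting that PSI-BLAST applies. All function names and example hits are hypothetical.

```python
import math
from collections import Counter


def build_pssm(aligned_hits, alphabet="ACGT", pseudocount=1.0):
    """Build one log-odds score column per alignment position."""
    length = len(aligned_hits[0])
    if any(len(seq) != length for seq in aligned_hits):
        raise ValueError("all hits must have the same length")
    background = 1.0 / len(alphabet)  # assumed uniform background
    total = len(aligned_hits) + pseudocount * len(alphabet)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_hits)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return pssm


def score_with_pssm(pssm, seq):
    """Score a candidate sequence against the matrix, position by position."""
    return sum(column[res] for column, res in zip(pssm, seq))


if __name__ == "__main__":
    hits = ["ACGT", "ACGA", "ACGT"]  # hypothetical hits from one search round
    pssm = build_pssm(hits)
    print(score_with_pssm(pssm, "ACGT") > score_with_pssm(pssm, "TGCA"))
```

In an iterative search, the matrix built from one round's hits would replace the general-purpose scoring matrix in the next round, so residues conserved at a given position count for more there.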
A practical example of sequence alignment
http://www.ncbi.nlm.nih.gov
BLAST results
Detailed BLAST results
accurate than ClustalW for sequences with less than 30% identity.
ClustalW (http://www.ch.embnet.org/software/ClustalW.html;
http://align.genome.jp)
bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html)
Transeq (http://www.ebi.ac.uk/emboss/transeq)
BioEdit — a sequence editing software package
http://www.mbio.ncsu.edu/bioedit/bioedit.html
Oligo Design and Analysis Tools
http://www.idtdna.com/scitools/scitools.aspx