Lecture 3: Algorithms and Programming Languages Algorithms and Bioinformatics
• Review of Programming Languages

§ Have a file containing a list of protein names (IDs) and the corresponding protein sequences.
§ Example:

ID   Protein Name:   Protein Sequence:
1    Tubulin         MRECISIHVGQAGV...
2    p53             MEEPQSDPSVEPP...
3    GroEL           MAAKDVKFGNDAR...
N    Actin           MCDEEVAALVVDN...

§ Task: Devise a search algorithm that will find the protein sequence for the user-specified protein.

§ Efficiency: speed at which the algorithm arrives at the solution to the task or problem.
§ Strategy:
(1) Obtain protein name from user (user_name).
(2) Use Naive Search Algorithm to scan through list of protein names (protein_name) to find match.
(3) Output corresponding protein sequence (or error message if no match is found).
§ Implementation of step 2 in pseudocode:
For each protein_name in list {
    if (protein_name is same as user_name) {
        print corresponding protein sequence
    }
}
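The naive search pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own implementation: the in-memory `proteins` list stands in for the file of names and sequences, and the function name `naive_search` is an assumption made for the example. Unlike the pseudocode, it stops at the first match instead of scanning the rest of the list.

```python
# Hypothetical in-memory stand-in for the file of protein names and sequences.
# The sequences are truncated in the slides, so the "..." is kept as-is.
proteins = [
    ("Tubulin", "MRECISIHVGQAGV..."),
    ("p53", "MEEPQSDPSVEPP..."),
    ("GroEL", "MAAKDVKFGNDAR..."),
    ("Actin", "MCDEEVAALVVDN..."),
]


def naive_search(entries, user_name):
    """Compare user_name with every protein_name in turn.

    Returns the matching sequence, or None if no name matches.
    """
    for protein_name, sequence in entries:
        if protein_name == user_name:
            return sequence
    return None


result = naive_search(proteins, "p53")
print(result if result is not None else "Error: protein not found")
```

Returning `None` for a missing name leaves the choice of error message (step 3 of the strategy) to the caller.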
Analysis of Naive Search Algorithm
§ Correctness: the algorithm checks each protein name for a match, and outputs the protein sequence of a correct match.
§ Efficiency: How many comparisons are made on average using this algorithm?
• For a list of N protein names, the average search will make N/2 comparisons (more if protein names not on list are frequently chosen)
• O(n) efficiency

Binary Search Algorithm
§ Strategy:
(1) Sort protein name (and sequence) entries in alphabetical order.
(2) Obtain protein name from user (user_name).
(3) Use Binary Search Algorithm to search through list of protein names (protein_name) to find match.
(4) Output corresponding protein sequence (or error message if no match is found).
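The binary search strategy can be sketched in Python as below. This is one possible reading under stated assumptions: the function name `binary_search` is hypothetical, positions are 0-based (the slides number IDs from 1), and names are compared case-insensitively so that "p53" sorts between "Hsp90" and "Ras" as in the slides (Python's default string ordering would place the lowercase "p53" after "Tubulin").

```python
def binary_search(entries, user_name):
    """Search a list of (protein_name, sequence) pairs sorted by lowercase name.

    Returns the matching sequence, or None if the name is not in the list.
    """
    target = user_name.lower()
    left_id, right_id = 0, len(entries) - 1  # 0-based; the slides use 1-based IDs
    while left_id <= right_id:
        mid_id = (left_id + right_id) // 2
        protein_name, sequence = entries[mid_id]
        name = protein_name.lower()
        if name == target:
            return sequence
        if target > name:
            left_id = mid_id + 1   # discard the lower half
        else:
            right_id = mid_id - 1  # discard the upper half
    return None


proteins = [
    ("Tubulin", "MRECISIHVGQAGV..."),
    ("p53", "MEEPQSDPSVEPP..."),
    ("GroEL", "MAAKDVKFGNDAR..."),
    ("Actin", "MCDEEVAALVVDN..."),
    ("Bas1", "MSNISTKDIRKSKP..."),
    ("Hsp90", "MPEEVHHGEEEVE..."),
    ("Ras", "MTEYKLVVVGARG..."),
]

# Step (1): sort once, alphabetically and ignoring case.
proteins.sort(key=lambda entry: entry[0].lower())

# Steps (2) to (4): look up a user-specified name.
result = binary_search(proteins, "p53")
print(result if result is not None else "Error: protein not found")
```

The loop ends when `left_id` passes `right_id`, which is how a name that is not on the list is detected.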
§ Initial conditions:
• left_id = 1; right_id = 7

ID:  Protein Name:   Protein Sequence:
1    Actin           MCDEEVAALVVDN...
2    Bas1            MSNISTKDIRKSKP...
3    GroEL           MAAKDVKFGNDAR...
4    Hsp90           MPEEVHHGEEEVE...
5    p53             MEEPQSDPSVEPP...
6    Ras             MTEYKLVVVGARG...
7    Tubulin         MRECISIHVGQAGV...

§ Step 1:
• mid_id = (left_id + right_id)/2
• mid_id = (1 + 7)/2 = 4
• For mid_id = 4, protein_name = Hsp90
§ Step 2:
• user_name = p53
• p53 ≠ Hsp90
§ Step 3:
• p53 > Hsp90, thus left_id = mid_id + 1 = 4 + 1 = 5
• right_id is unchanged: right_id = 7
• Repeat step 1.

§ Conditions (from previous round):
• left_id = 5; right_id = 7
§ Step 1:
• mid_id = (left_id + right_id)/2
• mid_id = (5 + 7)/2 = 6
• For mid_id = 6, protein_name = Ras
§ Step 2:
• user_name = p53
• p53 ≠ Ras
§ Step 3:
• p53 is not greater than Ras (alphabetically)
§ Step 4:
• p53 < Ras, thus left_id is unchanged: left_id = 5
• right_id = mid_id - 1 = 6 - 1 = 5
• Repeat step 1.
§ Conditions (from previous round):
• left_id = 5; right_id = 5
§ Step 1:
• mid_id = (left_id + right_id)/2
• mid_id = (5 + 5)/2 = 5
• For mid_id = 5, protein_name = p53
§ Step 2:
• user_name = p53
• p53 = p53
• Return protein sequence: MEEPQSDPSVEPP...
Finished!

§ Correctness: the algorithm will identify a protein name and print its sequence from an alphabetically sorted list of protein names.
§ Efficiency: How many comparisons are made on average using this algorithm?
• Each comparison eliminates half of the possible protein names in the list.
• Maximum number of comparisons is equal to the number of times we must halve N, until there is one protein name left.
• For a list of N protein names, a search will make at most about log2(N) comparisons
• Binary search has O(log n) efficiency
• Note: Sorting the list (which you only have to do once) will typically require O(n log n) operations/comparisons
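The halving argument can be checked numerically with a short sketch. The function names are assumptions made for this example. Here a "comparison" means examining one protein name, and the binary search count is the exact worst case, floor(log2(N)) + 1, which is slightly above log2(N) and matches the three rounds needed in the 7-name walkthrough.

```python
def naive_average_comparisons(n):
    """Approximate average for naive search when the name is on the list (N/2)."""
    return n / 2


def binary_max_comparisons(n):
    """Worst-case number of names examined by binary search on n sorted names."""
    count = 0
    while n > 0:
        count += 1
        n //= 2  # each round discards about half of the remaining names
    return count


for n in (7, 1_000, 1_000_000):
    print(n, naive_average_comparisons(n), binary_max_comparisons(n))
```

The gap between the two columns widens quickly as N grows, since one count grows linearly and the other logarithmically.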
Comparison of Naive and Binary Search Algorithms
§ For large lists of protein names, the binary search algorithm is orders of magnitude faster than the Naive Search method (e.g. for N = 1,000,000, about 20 comparisons instead of about 500,000 on average).

Algorithm Example: Exact Sequence Matching
Chromosome Sequence:
GATTATAACATTATAAAGGCATTAGAGCTA m = 30
Algorithm: ?
Exact Sequence Matching: Basic Algorithm
§ Implementation (pseudocode):
For all substrings Si of chromosome sequence, initial position (i) = 0 to (m - n) {
    if (substring Si is the same as the query sequence) {
        print “Query sequence found at position i in chromosome sequence”
    }
}

Analysis of Exact Sequence Matching Algorithm
• How fast will our algorithm run with different sized query sequences and chromosomes?
§ Worst Case Scenario:
Q = AAA (n = 3)
P = AAAAAAAAAA (m = 10)
• How many comparisons will be made?
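The basic exact matching algorithm, together with a comparison counter for the worst case scenario, might look like this in Python. It is a sketch under assumptions: the function name `exact_match_positions` is hypothetical, positions are 0-based as in the pseudocode, and a "comparison" is counted per character, stopping at the first mismatch within each substring.

```python
def exact_match_positions(chromosome, query):
    """Slide the query along the chromosome, one start position at a time.

    Returns (positions, comparisons): the 0-based start positions of all
    exact matches and the number of character comparisons performed.
    """
    m, n = len(chromosome), len(query)
    positions = []
    comparisons = 0
    for i in range(m - n + 1):  # i = 0 .. m - n
        matched = True
        for j in range(n):
            comparisons += 1
            if chromosome[i + j] != query[j]:
                matched = False
                break  # stop at the first mismatching character
        if matched:
            positions.append(i)
    return positions, comparisons


# Worst case scenario from the slide: Q = AAA (n = 3), P = AAAAAAAAAA (m = 10)
positions, comparisons = exact_match_positions("AAAAAAAAAA", "AAA")
print(positions, comparisons)
```

Under this counting convention, the worst case compares all n characters at each of the m - n + 1 start positions: (10 - 3 + 1) × 3 = 24 comparisons for the slide's example, which is O(mn) in general.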
§ Used primarily for interactive web pages and web-based databases
§ Handles database transactions in relational databases