
Algorithms and Bioinformatics

Lecture 3: Algorithms and Programming Languages


• Algorithms
• Review of Programming Languages

§ An Algorithm is a precise description of a procedure that will accomplish a task.

§ Correctness: proof that the algorithm correctly solves the given problem or task.

§ Efficiency: speed at which the algorithm arrives at the solution to the task or problem.

§ An algorithm's efficiency is usually expressed in terms of the number of calculations or operations required for a task of size n
• Big Oh notation: approximate number of operations required by algorithm to operate on a data set of size n
• For Insertion Sort algorithm ~ O(n^2)
• For n = 200 items, Insertion Sort requires ~40,000 operations
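
As a rough illustration of the O(n^2) estimate, the sketch below counts the comparisons made by a textbook insertion sort. The function name `insertion_sort_count` and the reversed-list input are assumptions chosen for illustration. The exact count depends on the input order, and Big Oh drops constant factors, so the count need not equal n^2.

```python
def insertion_sort_count(items):
    """Sort a copy of items with insertion sort; return (sorted_list, comparisons)."""
    data = list(items)
    comparisons = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        # Shift larger elements right until the slot for key is found.
        while j >= 0:
            comparisons += 1
            if data[j] <= key:
                break
            data[j + 1] = data[j]
            j -= 1
        data[j + 1] = key
    return data, comparisons


# Worst case for this variant: a reversed list of n = 200 items.
n = 200
sorted_items, count = insertion_sort_count(range(n, 0, -1))
print(count, n * n)
```

For this input the count is n(n - 1)/2 = 19,900, which grows in proportion to n^2. The ~40,000 figure above is the n^2 estimate with constant factors dropped.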

Algorithm Example: Protein Name Search

§ Have a file containing a list of protein names (IDs) and the corresponding protein sequences.

§ Example:
ID   Protein Name   Protein Sequence
1    Tubulin        MRECISIHVGQAGV...
2    p53            MEEPQSDPSVEPP...
3    GroEL          MAAKDVKFGNDAR...
N    Actin          MCDEEVAALVVDN...

§ Task: Devise a search algorithm that will find the protein sequence for the user-specified protein.

Naive Search Algorithm

§ Strategy:
(1) Obtain protein name from user (user_name).
(2) Use Naive Search Algorithm to scan through list of protein names (protein_name) to find match.
(3) Output corresponding protein sequence (or error message if no match is found).

§ Implementation of step 2 in pseudocode:
For each protein_name in list {
    if (protein_name is same as user_name) {
        print corresponding protein sequence
    }
}
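
The pseudocode for step 2 might look as follows in Python. This is a minimal sketch: the function name `naive_search` and the list-of-pairs layout are assumptions, and, unlike the pseudocode, it returns at the first match instead of printing inside the loop, so the caller can handle the "no match" case from step 3.

```python
def naive_search(proteins, user_name):
    """Scan (name, sequence) pairs in order; return the matching sequence or None."""
    for protein_name, sequence in proteins:
        if protein_name == user_name:
            return sequence
    return None  # no match: the caller reports the error


# Example entries copied from the slide (the sequences are truncated there too).
proteins = [
    ("Tubulin", "MRECISIHVGQAGV..."),
    ("p53", "MEEPQSDPSVEPP..."),
    ("GroEL", "MAAKDVKFGNDAR..."),
    ("Actin", "MCDEEVAALVVDN..."),
]

result = naive_search(proteins, "p53")
print(result if result is not None else "Error: protein name not found")
```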
Analysis of Naive Search Algorithm

§ Correctness: the algorithm checks each protein name for a match, and outputs the protein sequence of a correct match.

§ Efficiency: How many comparisons are made on average using this algorithm?
• For a list of N protein names, the average search will make N/2 comparisons (more if protein names not on list are frequently chosen)
• O(n) efficiency

§ Can we find a more efficient algorithm to search for protein names?

Binary Search Algorithm

§ Strategy:
(1) Sort protein name (and sequence) entries in alphabetical order.
(2) Obtain protein name from user (user_name).
(3) Use Binary Search Algorithm to search through list of protein names (protein_name) to find match.
(4) Output corresponding protein sequence (or error message if no match is found).
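
Step (1) of the binary search strategy, sorting the entries by name, can be sketched as below. Ignoring upper and lower case during the sort is an assumption. Without it, Python's default string ordering would place "p53" after every capitalised name.

```python
proteins = [
    ("Tubulin", "MRECISIHVGQAGV..."),
    ("p53", "MEEPQSDPSVEPP..."),
    ("GroEL", "MAAKDVKFGNDAR..."),
    ("Actin", "MCDEEVAALVVDN..."),
]

# Sort by name, ignoring case, so that "p53" is ordered among the
# capitalised names rather than after all of them.
sorted_proteins = sorted(proteins, key=lambda entry: entry[0].lower())
sorted_names = [name for name, _ in sorted_proteins]
print(sorted_names)
```

Each sequence travels with its name because the pair is sorted as a unit, which is what "sort protein name (and sequence) entries" requires.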

Binary Search Algorithm

§ Implementation of step 3 in pseudocode:
Initial conditions: left_ID = 1; right_ID = N

(1) Find protein_name of mid_ID = (left_ID + right_ID)/2 (rounded down) from the alphabetically sorted protein name list.
(2) If (user_name is same as protein_name) {
        return corresponding protein sequence
    }
(3) If (user_name occurs after protein_name [alphabetically]) {
        Repeat step 1 with left_ID = mid_ID + 1 and right_ID = right_ID
    }
(4) If (user_name occurs before protein_name) {
        Repeat step 1 with left_ID = left_ID and right_ID = mid_ID - 1
    }
If left_ID becomes greater than right_ID, user_name is not in the list: stop and output the error message.

Binary Search Algorithm: Example

§ List of protein names:
ID   Protein Name   Protein Sequence
1    Actin          MCDEEVAALVVDN...
2    Bas1           MSNISTKDIRKSKP...
3    GroEL          MAAKDVKFGNDAR...
4    Hsp90          MPEEVHHGEEEVE...
5    p53            MEEPQSDPSVEPP...
6    Ras            MTEYKLVVVGARG...
7    Tubulin        MRECISIHVGQAGV...

§ Protein name inputted by user:
user_name = p53
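
The four pseudocode steps could be written in Python roughly as follows. This is a sketch under stated assumptions: the name `binary_search` is illustrative, indices are 0-based (the slides use IDs 1 to N), and names are compared case-insensitively so that the ordering agrees with the example list, where p53 sits between Hsp90 and Ras.

```python
def binary_search(sorted_proteins, user_name):
    """Return the sequence for user_name from a name-sorted list, or None."""
    target = user_name.lower()
    left, right = 0, len(sorted_proteins) - 1
    while left <= right:
        mid = (left + right) // 2              # step 1
        protein_name, sequence = sorted_proteins[mid]
        name = protein_name.lower()
        if target == name:                     # step 2
            return sequence
        if target > name:                      # step 3: search the right half
            left = mid + 1
        else:                                  # step 4: search the left half
            right = mid - 1
    return None                                # range is empty: no match


sorted_proteins = [
    ("Actin", "MCDEEVAALVVDN..."),
    ("Bas1", "MSNISTKDIRKSKP..."),
    ("GroEL", "MAAKDVKFGNDAR..."),
    ("Hsp90", "MPEEVHHGEEEVE..."),
    ("p53", "MEEPQSDPSVEPP..."),
    ("Ras", "MTEYKLVVVGARG..."),
    ("Tubulin", "MRECISIHVGQAGV..."),
]

print(binary_search(sorted_proteins, "p53"))
```

The `while left <= right` condition covers the case the numbered steps leave implicit: when the search range becomes empty, the name is not in the list.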
Binary Search Algorithm: Round 1

§ Initial conditions:
• left_id = 1; right_id = 7

§ Step 1:
• mid_id = (left_id + right_id)/2
• mid_id = (1 + 7)/2 = 4
• For mid_id = 4, protein_name = Hsp90

§ Step 2:
user_name = p53
• p53 ≠ Hsp90

§ Step 3:
• p53 > Hsp90, thus left_id = mid_id + 1 = 4 + 1 = 5
• right_id = right_id = 7
• Repeat step 1.

Binary Search Algorithm: Round 2

§ Conditions (from previous round):
• left_id = 5; right_id = 7

§ Step 1:
• mid_id = (left_id + right_id)/2
• mid_id = (5 + 7)/2 = 6
• For mid_id = 6, protein_name = Ras

§ Step 2:
user_name = p53
• p53 ≠ Ras

§ Step 3:
• p53 is not greater than Ras (alphabetically)

§ Step 4:
• p53 < Ras, thus left_id = left_id = 5
• right_id = mid_id - 1 = 6 - 1 = 5
• Repeat step 1.

Binary Search Algorithm: Round 3

§ Conditions (from previous round):
• left_id = 5; right_id = 5

§ Step 1:
• mid_id = (left_id + right_id)/2
• mid_id = (5 + 5)/2 = 5
• For mid_id = 5, protein_name = p53

§ Step 2:
user_name = p53
• p53 = p53
• Return protein sequence: MEEPQSDPSVEPP...

Finished!

Analysis of Binary Search Algorithm

§ Correctness: the algorithm will identify a protein name and print its sequence from an alphabetically sorted list of protein names.

§ Efficiency: How many comparisons are made on average using this algorithm?
• Each comparison eliminates half of the possible protein names in the list.
• Maximum number of comparisons is equal to the number of times we must halve N, until there is one protein name left.
• For a list of N protein names, a search will make at most floor(log2(N)) + 1 comparisons, i.e. roughly log2(N)
• Binary search has O(log n) efficiency
• Note: Sorting the list (which you only have to do once) will typically require O(n log n) operations/comparisons
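
The rounds and the halving argument above can be checked with a small simulation. In this sketch the names are abstracted to their IDs, so comparing `target_id` with `mid` stands in for the alphabetical comparison. The helper names `probes` and `worst_case` are hypothetical.

```python
import math


def probes(n, target_id):
    """Return the mid IDs inspected when searching IDs 1..n for target_id."""
    left, right = 1, n
    visited = []
    while left <= right:
        mid = (left + right) // 2
        visited.append(mid)
        if mid == target_id:
            break
        if target_id > mid:
            left = mid + 1
        else:
            right = mid - 1
    return visited


def worst_case(n):
    """Largest number of probes over every ID in a list of n entries."""
    return max(len(probes(n, t)) for t in range(1, n + 1))


print(probes(7, 5))  # p53 has ID 5 in the 7-entry example
for n in (7, 10, 100, 1000):
    # Simulated worst case next to floor(log2(n)) + 1.
    print(n, worst_case(n), math.floor(math.log2(n)) + 1)
```

`probes(7, 5)` visits IDs 4, 6 and 5, matching Rounds 1 to 3, and the simulated worst cases for 10, 100 and 1000 entries come out as 4, 7 and 10 comparisons.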
Comparison of Naive and Binary Search Algorithms

Number of Comparisons (worst case)

Number of Protein Names (N)   Naive Search   Binary Search
10                            10             4
100                           100            7
1000                          1000           10
10,000                        10,000         14
100,000                       100,000        17
1,000,000                     1,000,000      20

§ For large lists of protein names, the binary search algorithm is many orders of magnitude faster than the Naive Search method.

Algorithm Example: Exact Sequence Matching

§ Devise an algorithm to identify all instances, if any, of the following six nucleotide DNA sequence in the 30 nucleotide chromosome:

Query Sequence:
TATAAA   n = 6

Chromosome Sequence:
GATTATAACATTATAAAGGCATTAGAGCTA   m = 30

Algorithm: ?

Exact Sequence Matching: Basic Algorithm

§ Implementation (pseudocode):
For all substrings Si of chromosome sequence, initial position (i) = 0 to (m - n) {
    if (substring Si is the same as the query sequence) {
        print "Query sequence found at position i in chromosome sequence"
    }
}

Analysis of Exact Sequence Matching Algorithm

• How fast will our algorithm run with different sized query sequences and chromosomes?

§ Worst Case Scenario:
Q = AAA (n = 3)
P = AAAAAAAAAA (m = 10)
• How many comparisons will be made?

§ Number of comparisons = n(m - n + 1) = 3(10 - 3 + 1) = 24

§ Number of comparisons ~ O(n x m)

• How many comparisons will be made if n = 10,000 and m = 3,000,000,000?
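
A Python sketch of the basic algorithm, counting nucleotide comparisons along the way, might look like this. The function name `find_exact` and the decision to stop comparing a substring at its first mismatch are assumptions. The worst-case input AAA against AAAAAAAAAA never mismatches, so the count there is the full n(m - n + 1).

```python
def find_exact(query, chromosome):
    """Return (positions, comparisons) for exact matches of query in chromosome."""
    n, m = len(query), len(chromosome)
    positions = []
    comparisons = 0
    for i in range(m - n + 1):  # start positions 0 .. m - n
        matched = True
        for j in range(n):
            comparisons += 1  # one nucleotide-to-nucleotide comparison
            if chromosome[i + j] != query[j]:
                matched = False
                break
        if matched:
            positions.append(i)
    return positions, comparisons


positions, _ = find_exact("TATAAA", "GATTATAACATTATAAAGGCATTAGAGCTA")
print(positions)

_, worst = find_exact("AAA", "AAAAAAAAAA")
print(worst)

# Worst-case bound n(m - n + 1) for the larger problem posed above.
n, m = 10_000, 3_000_000_000
print(n * (m - n + 1))
```

The query occurs once, at position 11 counting from 0. The worst-case example makes 24 comparisons, and the bound for n = 10,000 and m = 3,000,000,000 is 29,999,900,010,000, about 3 x 10^13.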


Exact Sequence Matching: Pre-processing Algorithms

§ How can we speed up this algorithm?
• Pre-process the query (Boyer-Moore & Knuth-Morris-Pratt algorithms)
• Pre-process the chromosome sequence (Suffix Tree algorithms)

§ Pre-processing algorithms operate in linear time: O(n+m)

§ How many comparisons will be made if n = 10,000 and m = 3,000,000,000?

Programming Languages: Basics

§ Definition: a programming language is a tool that allows one to write algorithms and instructions for the computer.

§ The source code--the instructions written in a programming language--is converted by the language compiler or interpreter into binary instructions that can be processed by the computer.

Mechanics of Computer Languages

[Figure: Source Code -> Compiler -> binary instructions -> Computer -> Output]

Source Code:
    #include <stdio.h>
    int main()
    {
        printf("Hello, World!\n");
        return 0;
    }

Binary instructions produced by the compiler:
    1010101011100001000
    0010001001111010010
    0101001110010010100
    110101100011110011...

Output:
    Hello, World!

Programming Languages: Overview

§ Compiled languages:
• C, C++, Java

§ Scripting languages:
• Perl, Python, Ruby, Tcl

§ Web languages:
• HTML, PHP, Perl, JavaScript

§ Database languages:
• SQL
Compiled languages

§ Fast to run/slow to write

§ Used primarily for large applications and where run-time speed is important.

§ Examples: Operating systems
            Word processing applications
            BLAST, etc.

§ Hello, World! in C
    #include <stdio.h>
    int main()
    {
        printf("Hello, World!\n");
        return 0;
    }

Scripting languages

§ Slow to run/fast to write
• many shortcuts and tools to make the programmer's job easier
• language handles memory management automatically

§ Used primarily for small applications, tools, prototypes, and where run-time speed is not important.

§ Examples: System administration programs
            Text processing
            Prototypes
            Small applications

§ Hello, World! in Perl
    #!/usr/bin/perl -w
    print ("Hello, World!\n");

Web languages

§ Used primarily for interactive web pages and web-based databases

§ Examples: Interactive web pages
            Connecting databases to web pages

§ Hello, World! in PHP
    <?php
    echo "<html>\n<head>\n";
    echo "<title>Hello, World!</title>\n";
    echo "</head>\n";
    echo "<body>\n";
    echo "<h1>Hello, World!</h1>\n";
    echo "</body></html>\n";
    ?>

Database Languages

§ Handle database transactions in relational databases

§ Used for database applications

§ Examples: Banking databases
            Biological databases

§ Hello, World! in SQL
    CREATE TABLE announce (message VARCHAR(50) NOT NULL);
    INSERT INTO announce VALUES ('Hello, World!');
    SELECT message FROM announce;
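
SQL statements need a database engine to run them. As one way to try the SQL example above, the sketch below uses SQLite through Python's standard `sqlite3` module. The choice of SQLite is an assumption: the slides do not name a database system.

```python
import sqlite3

# An in-memory database, discarded when the connection closes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE announce (message VARCHAR(50) NOT NULL)")
# The ? placeholder passes the value separately from the SQL text.
conn.execute("INSERT INTO announce VALUES (?)", ("Hello, World!",))
rows = conn.execute("SELECT message FROM announce").fetchall()
conn.close()

for (message,) in rows:
    print(message)
```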
