
Introduction to Python
Incorporating selected material from Dalke Scientific’s
“Introduction to Programming for Bioinformatics in Python”
Monday, September 9, 13
Strings in Python
Computers store text
as strings
>>> s = "GATTACA"
index:  0 1 2 3 4 5 6
s:      G A T T A C A
Each of these is a character
Why are strings important?
• Sequences are strings
• ..catgaaggaa ccacagccca gagcaccaag ggctatccat..
• Database records contain strings

• LOCUS AC005138
• DEFINITION Homo sapiens chromosome 17, clone

hRPK.261_A_13, complete sequence
• AUTHORS Birren,B., Fasman,K., Linton,L.,

Nusbaum,C. and Lander,E.
• HTML is one (big) string

Getting Characters
>>> s = "GATTACA"
index:  0 1 2 3 4 5 6
s:      G A T T A C A
>>> s[0]
'G'
>>> s[1]
'A'
>>> s[-1]
'A'
>>> s[-2]
'C'
>>> s[7]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
IndexError: string index out of range
>>>
Getting substrings
NB half-open intervals
index:  0 1 2 3 4 5 6
s:      G A T T A C A
>>> s[1:3]
'AT'
>>> s[:3]
'GAT'
>>> s[4:]
'ACA'
>>> s[3:5]
'TA'
>>> s[:]
'GATTACA'
>>> s[::2]
'GTAA'
>>> s[-2:2:-1]
'CAT'
>>>
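The half-open rule can be checked directly. The following is a small sketch, written so that it runs under Python 3 (the slides themselves use Python 2); the variable names are illustrative only.

```python
s = "GATTACA"

# s[start:stop] includes index start and excludes index stop.
first_three = s[:3]
rest = s[3:]

# Because the interval is half-open, the two pieces rejoin exactly.
rejoined = first_three + rest

# For in-range indices, the length of s[a:b] is b - a.
piece = s[1:3]
print(first_three, rest, rejoined, len(piece))
```

Splitting at the same index on both sides never duplicates or drops a character, which is the practical benefit of excluding the end point.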
Creating strings
Strings start and end with a single or double
quote character (the two must be the same)
"This is a string"
"This is another string"
""
"Strings can be in double quotes"
'Or in single quotes.'
'There's no difference.'
'Okay, there\'s a small one.'
Special Characters and
Escape Sequences
Backslashes (\) are used to introduce special characters
>>> s = 'Okay, there\'s a small one.'
The \ “escapes” the following single quote
>>> print s
Okay, there's a small one.
Some special characters
Escape Sequence Meaning
\\ Backslash (keep a \)
\' Single quote (keeps the ')
\" Double quote (keeps the ")
\n Newline
\t Tab
Working with strings
>>> len("GATTACA")              # length
7
>>> "GAT" + "TACA"              # concatenation
'GATTACA'
>>> "A" * 10                    # repeat
'AAAAAAAAAA'
>>> "G" in "GATTACA"            # substring test
True
>>> "GAT" in "GATTACA"
True
>>> "AGT" in "GATTACA"
False
>>> "GATTACA".find("ATT")       # substring location
1
>>> "GATTACA".count("T")        # substring count
2
>>>
Converting from/to strings
>>> "38" + 5
TypeError: cannot concatenate 'str' and 'int' objects
>>> int("38") + 5
43
>>> "38" + str(5)
'385'
>>> int("38"), str(5)
(38, '5')
>>> int("2.71828")
ValueError: invalid literal for int(): 2.71828
>>> float("2.71828")
2.71828
>>>
Change a string?
Strings cannot be modified
They are immutable
Instead, create a new one
>>> s = "GATTACA"
>>> s[3] = "C"
TypeError: object doesn't support item assignment
>>> s = s[:3] + "C" + s[4:]
>>> s
'GATCACA'
>>>
Some more methods
>>> "GATTACA".lower()
'gattaca'
>>> "gattaca".upper()
'GATTACA'
>>> "GATTACA".replace("G", "U")
'UATTACA'
>>> "GATTACA".replace("C", "U")
'GATTAUA'
>>> "GATTACA".replace("AT", "**")
'G**TACA'
>>> "GATTACA".startswith("G")
True
>>> "GATTACA".startswith("g")
False
>>>
Ask for a string
The Python function “raw_input” asks
the user (that’s you!) for a string
>>> seq = raw_input("Enter a DNA sequence: ")
Enter a DNA sequence: ATGTATTGCATATCGT
>>> seq.count("A")
4
>>> print "There are", seq.count("T"), "thymines"
There are 7 thymines
>>> "ATA" in seq
True
>>> substr = raw_input("Enter a subsequence to find: ")
Enter a subsequence to find: GCA
>>> substr in seq
True
>>>
Variables and
References in Python
Names, references, and objects
>>> s = "TAGAGAATTCTA"
The name s now refers to the string object "TAGAGAATTCTA".
>>> t = "GAAT"
The name t refers to a second string object, "GAAT".
>>> i = s.find(t)
>>> print i
4
Strings have a “find” method. The name i refers to the integer object 4.
>>> t = s
Now t and s refer to the same object, "TAGAGAATTCTA". Nothing refers to "GAAT" any more.
>>> a
NameError: name 'a' is not defined
A name that was never assigned does not refer to anything.
>>> s = "GA"
>>> print t
TAGAGAATTCTA
Assigning to s makes s refer to a new object, "GA". The name t still refers to "TAGAGAATTCTA".
>>> del t
>>> print t
NameError: name 't' is not defined
“del” removes the name t. The names s and i are unchanged.
>>>
Names, references, and objects: lists
>>> L1 = [2, 4]
>>> L1.append(5)
The name L1 refers to the list [2, 4, 5].
>>> L2 = L1
L2 refers to the same list object as L1.
>>> L2.append(7)
>>> print L1
[2, 4, 5, 7]
>>> print L2
[2, 4, 5, 7]
There is only one list, so the change is visible through both names.
>>> del L2[1]
Both names now show [2, 5, 7].
>>> L2 = L1[:]
>>> L1 == L2
True
The slice L1[:] makes a new list, so L2 now refers to its own copy, [2, 5, 7].
>>> L1.reverse()
>>> L1 == L2
False
L1 is now [7, 5, 2]; the copy L2 is still [2, 5, 7].
>>>
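The difference between a second name for one list and a copy of that list can be tested with the `is` operator. This is a minimal sketch in Python 3 syntax; the names `alias` and `copy` are chosen here for illustration and do not appear in the slides.

```python
L1 = [2, 4, 5]
alias = L1        # a second name for the same list object
copy = L1[:]      # a new list holding the same elements

alias.append(7)   # changes the one shared list

same_object = alias is L1          # True: two names, one object
copy_is_separate = copy is not L1  # True: the slice made a new object
print(L1, alias, copy)
```

`==` compares contents, while `is` asks whether two names refer to the very same object.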
Lists and the ‘for’ loop
Lists
A list is an ordered collection of objects
>>> data = []                      # make an empty list
>>> print data
[]
>>> data.append("Hello!")          # “append” == “add to the end”
>>> print data
['Hello!']
>>> data.append(5)                 # you can put different objects in the same list
>>> print data
['Hello!', 5]
>>> data.append([9, 8, 7])
>>> print data
['Hello!', 5, [9, 8, 7]]
>>> data.extend([4, 5, 6])         # “extend” appends each element of the new list to the old one
>>> print data
['Hello!', 5, [9, 8, 7], 4, 5, 6]
>>>
Lists and strings are similar
Strings
>>> s = "ATCG"
>>> print s[0]
A
>>> print s[-1]
G
>>> print s[2:]
CG
>>> print "C" in s
True
>>> s * 3
'ATCGATCGATCG'
>>> s[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>>
Lists
>>> L = ["adenine", "thymine", "cytosine", "guanine"]
>>> print L[0]
adenine
>>> print L[-1]
guanine
>>> print L[2:]
['cytosine', 'guanine']
>>> print "cytosine" in L
True
>>> L * 3
['adenine', 'thymine', 'cytosine', 'guanine',
 'adenine', 'thymine', 'cytosine', 'guanine',
 'adenine', 'thymine', 'cytosine', 'guanine']
>>> L[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>
But lists are mutable
Lists can be changed. Strings are immutable.
Strings
>>> s = "ATCG"
>>> print s
ATCG
>>> s[1] = "U"
TypeError: object doesn't support item assignment
>>> s.reverse()
  File "<stdin>", line 1, in ?
AttributeError: 'str' object has no attribute 'reverse'
>>> print s[::-1]
GCTA
>>> print s
ATCG
>>>
Lists
>>> L = ["adenine", "thymine", "cytosine", "guanine"]
>>> print L
['adenine', 'thymine', 'cytosine', 'guanine']
>>> L[1] = "uracil"
>>> print L
['adenine', 'uracil', 'cytosine', 'guanine']
>>> L.reverse()
>>> print L
['guanine', 'cytosine', 'uracil', 'adenine']
>>> del L[0]
>>> print L
['cytosine', 'uracil', 'adenine']
>>>
Lists can hold any object
>>> L = ["", 1, "two", 3.0, ["quatro", "fem", [6j], []]]
>>> len(L)
5
>>> print L[-1]
['quatro', 'fem', [6j], []]
>>> len(L[-1])
4
>>> print L[-1][-1]
[]
>>> len(L[-1][-1])
0
>>>
A few more methods
>>> L = ["thymine", "cytosine", "guanine"]
>>> L.insert(0, "adenine")
>>> print L
['adenine', 'thymine', 'cytosine', 'guanine']
>>> L.insert(2, "uracil")
>>> print L
['adenine', 'thymine', 'uracil', 'cytosine', 'guanine']
>>> print L[:2]
['adenine', 'thymine']
>>> L[:2] = ["A", "T"]
>>> print L
['A', 'T', 'uracil', 'cytosine', 'guanine']
>>> L[:2] = []
>>> print L
['uracil', 'cytosine', 'guanine']
>>> L[:] = ["A", "T", "C", "G"]
>>> print L
['A', 'T', 'C', 'G']
>>>
Turn a string into a list
>>> s = "AAL532906 aaaatagtcaaatatatcccaattcagtatgcgctgagta"
Complicated:
>>> i = s.find(" ")
>>> print i
9
>>> print s[:i]
AAL532906
>>> print s[i+1:]
aaaatagtcaaatatatcccaattcagtatgcgctgagta
>>>
Easier!
>>> fields = s.split()
>>> print fields
['AAL532906', 'aaaatagtcaaatatatcccaattcagtatgcgctgagta']
>>> print fields[0]
AAL532906
>>> print len(fields[1])
40
>>>
More split examples
split() uses ‘whitespace’ to find each word
>>> protein = "ALA PRO ILU CYS"
>>> residues = protein.split()
>>> print residues
['ALA', 'PRO', 'ILU', 'CYS']
>>>
>>> protein = " ALA PRO ILU CYS \n"
>>> print protein.split()
['ALA', 'PRO', 'ILU', 'CYS']
split(c) uses that character to find each word
>>> print "HIS-GLU-PHE-ASP".split("-")
['HIS', 'GLU', 'PHE', 'ASP']
>>>
Turn a list into a string
join is the opposite of split
>>> L1 = ["Asp", "Gly", "Gln", "Pro", "Val"]
>>> print "-".join(L1)
Asp-Gly-Gln-Pro-Val
>>> print "**".join(L1)
Asp**Gly**Gln**Pro**Val
>>> print "\n".join(L1)
Asp
Gly
Gln
Pro
Val
The order is confusing.
- string to join is first
- list to be joined is second
>>>
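Since join undoes split, a round trip should give back the original string. A short sketch (Python 3 syntax; the variable names are illustrative):

```python
record = "HIS-GLU-PHE-ASP"

# split("-") cuts the string at every "-" and returns a list.
residues = record.split("-")

# The separator string comes first, the list to join comes second.
rebuilt = "-".join(residues)
spaced = " ".join(residues)
print(residues, rebuilt, spaced)
```

The round trip is exact here because the same separator is used in both directions. With the no-argument `split()`, runs of whitespace are collapsed, so the original spacing cannot be recovered.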
The ‘for’ loop
Lets you do something to each
element in a list
>>> for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
...     print "Hello,", name
...
Hello, Andrew
Hello, Tsanwani
Hello, Arno
Hello, Tebogo
>>>
The ‘for’ loop
The line after the ‘for’ statement starts a new code block.
It must be indented.
IDLE indents automatically when
it sees a ‘:’ on the previous line
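A two line block
All lines in the same code block
must have the same indentation
>>> for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
...     print "Hello,", name
...     print "Your name is", len(name), "letters long"
...
Hello, Andrew
Your name is 6 letters long
Hello, Tsanwani
Your name is 8 letters long
Hello, Arno
Your name is 4 letters long
Hello, Tebogo
Your name is 6 letters long
>>>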
When indentation does not match
>>> a = 1
>>>   a = 1
  File "<stdin>", line 1
    a = 1
    ^
SyntaxError: invalid syntax
Inside a ‘for’ block, a line whose indentation differs from the line before it is also an error:
    print "Your name is", len(name), "letters long"
    ^
SyntaxError: invalid syntax
    print "Your name is", len(name), "letters long"
    ^
IndentationError: unindent does not match any outer indentation level
>>>
‘for’ works on strings
A string is similar to a list of letters
>>> seq = "ATGCATGTCGC"
>>> for letter in seq:
...     print "Base:", letter
...
Base: A
Base: T
Base: G
Base: C
Base: A
Base: T
Base: G
Base: T
Base: C
Base: G
Base: C
>>>
Numbering bases
>>> seq = "ATGCATGTCGC"
>>> n = 0
>>> for letter in seq:
...     print "base", n, "is", letter
...     n = n + 1
...
base 0 is A
base 1 is T
base 2 is G
base 3 is C
base 4 is A
base 5 is T
base 6 is G
base 7 is T
base 8 is C
base 9 is G
base 10 is C
>>>
>>> print "The sequence has", n, "bases"
The sequence has 11 bases
>>>
The range function
>>> range(5)
[0, 1, 2, 3, 4]
>>> range(8)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> range(2, 8)
[2, 3, 4, 5, 6, 7]
>>> range(0, 8, 1)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> range(0, 8, 2)
[0, 2, 4, 6]
>>> range(0, 8, 3)
[0, 3, 6]
>>> range(0, 8, 4)
[0, 4]
>>> range(0, 8, -1)
[]
>>> range(8, 0, -1)
[8, 7, 6, 5, 4, 3, 2, 1]
>>> help(range)
Help on built-in function range:
range(...)
    range([start,] stop[, step]) -> list of integers
    Return a list containing an arithmetic progression of integers.
    range(i, j) returns [i, i+1, i+2, ..., j-1]; start (!) defaults to 0.
    When step is given, it specifies the increment (or decrement).
    For example, range(4) returns [0, 1, 2, 3]. The end point is omitted!
    These are exactly the valid indices for a list of 4 elements.
>>>
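The session above is from Python 2, where range returns a list. In Python 3, range returns a lazy range object instead, so wrapping it in list() is needed to see the numbers. A small sketch of the same calls in Python 3 (variable names are illustrative):

```python
# In Python 3, list() is needed to expand a range into a list.
indices = list(range(5))
evens = list(range(0, 8, 2))
countdown = list(range(8, 0, -1))
empty = list(range(0, 8, -1))      # counting down from 0 never reaches 8

# range(len(seq)) gives exactly the valid indices of seq.
seq = "ATGCATGTCGC"
valid = list(range(len(seq)))
print(indices, evens, countdown, empty, valid[-1])
```

A for loop works the same way in both versions, because it only needs the values one at a time.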
Do something ‘N’ times
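>>> for i in range(3):
...     print "If I tell you three times it must be true."
...
If I tell you three times it must be true.
If I tell you three times it must be true.
If I tell you three times it must be true.
>>>
>>> for i in range(4):
...     print i, "squared is", i*i, "and cubed is", i*i*i
...
0 squared is 0 and cubed is 0
1 squared is 1 and cubed is 1
2 squared is 4 and cubed is 8
3 squared is 9 and cubed is 27
>>>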
Stepping through a
‘for’ loop
At the beginning - run module
for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
    print "Hello,", name
print "The end."
Start with the first line - the ‘for’ statement.
Look at the list. Is it empty? No. Start with the first object.
Assign the first object to the variable ‘name’: create the string object “Andrew” and assign it to the variable named ‘name’.
Then start the first line of the code block. This is the ‘print’ statement.
Print the string object “Hello,” and the value of the variable with name ‘name’.
Hello, Andrew
The print statement is finished. Python is at the end of the code block, so go to the ‘for’ statement again.
Move the list pointer forward by one. Is it past the end? No, it’s on the second item.
Assign the second object to the variable ‘name’: create the string object “Tsanwani” and assign it to the variable named ‘name’. Run the code block again.
Hello, Tsanwani
Move the list pointer forward by one. Is it past the end? No, it’s on the third item.
Assign the third object to the variable ‘name’: create the string object “Arno” and assign it to the variable named ‘name’. Run the code block again.
Hello, Arno
Move the list pointer forward by one. Is it past the end? No, it’s on the fourth item.
Assign the fourth object to the variable ‘name’: create the string object “Tebogo” and assign it to the variable named ‘name’. Run the code block again.
Hello, Tebogo
Move the list pointer forward by one. Is it past the end? Yes!
So skip past the code block to the next statement.
This is another print statement. It prints the string “The end.”
The end.
Python looks for the next statement, but there isn’t any, so Python stops.
Final Program Output
Hello, Andrew
Hello, Tsanwani
Hello, Arno
Hello, Tebogo
The end.
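The walkthrough can be modelled by writing the loop with an explicit “list pointer”. This is only a simplified sketch in Python 3 syntax: Python's real for loop uses an iterator rather than an index, and the names `position` and `output` are made up for the illustration.

```python
names = ["Andrew", "Tsanwani", "Arno", "Tebogo"]
output = []

position = 0                       # the "list pointer"
while position < len(names):       # "Is it past the end?"
    name = names[position]         # assign the current object to 'name'
    output.append("Hello, " + name)
    position = position + 1        # move the pointer forward by one

output.append("The end.")          # first statement after the code block
print("\n".join(output))
```

After the loop finishes, `name` still refers to the last object it was assigned, “Tebogo”.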
The if statement and
files
The if statement
Do a code block only when something is True
if test:
    print "The expression is true"
Example
if "GAATTC" in "ATCTGGAATTCATCG":
    print "EcoRI site is present"
The test is: "GAATTC" in "ATCTGGAATTCATCG"
If the test is true, then print the message.
Here it is done in the Python shell
>>> if "GAATTC" in "ATCTGGAATTCATCG":
...     print "EcoRI is present"
...
EcoRI is present
>>>
What if you want the
false case?
There are several possibilities; here are two
1) Python has a not in operator
if "GAATTC" not in "AAAAAAAAA":
    print "EcoRI will not cut the sequence"
2) The not operator switches true and false
if not "GAATTC" in "AAAAAAAAA":
    print "EcoRI will not cut the sequence"
In the Python shell
>>> x = True
>>> x
True
>>> not x
False
>>> not not x
True
>>> if "GAATTC" not in "AAAAAAAAA":
...     print "EcoRI will not cut the sequence"
...
EcoRI will not cut the sequence
>>> if not "GAATTC" in "ATCTGGAATTCATCG":
...     print "EcoRI will not cut the sequence"
...
>>> if not "GAATTC" in "AAAAAAAAA":
...     print "EcoRI will not cut the sequence"
...
EcoRI will not cut the sequence
>>>
else:
What if you want to do one thing when the test is
true and another thing when the test is false?
if test:
    Do the first code block (after the if:) if the test is true
else:
    Do the second code block (after the else:) if the test is false
Examples with else
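>>> if "GAATTC" in "ATCTGGAATTCATCG":
...     print "EcoRI site is present"
... else:
...     print "EcoRI will not cut the sequence"
...
EcoRI site is present
>>> if "GAATTC" in "AAAACTCGT":
...     print "EcoRI site is present"
... else:
...     print "EcoRI will not cut the sequence"
...
EcoRI will not cut the sequence
>>>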
Where is the site?
The ‘find’ method of strings returns the index of a substring
in the string, or -1 if the substring doesn’t exist
>>> seq = "ATCTGGAATTCATCG"
>>> seq.find("GAATTC")     # there is a GAATTC at position 5
5
>>> seq.find("GGCGC")      # but there is no GGCGC in the sequence
-1
>>>
But where is the site?
>>> seq = "ATCTGGAATTCATCG"
>>> pos = seq.find("GAATTC")
>>> if pos == -1:
...     print "EcoRI does not cut the sequence"
... else:
...     print "EcoRI site starting at index", pos
...
EcoRI site starting at index 5
>>>
Start by creating the string “ATCTGGAATTCATCG” and
assigning it to the variable with name ‘seq’
seq = "ATCTGGAATTCATCG"
pos = seq.find("GAATTC")
if pos == -1:
    print "EcoRI does not cut the sequence"
else:
    print "EcoRI site starting at index", pos
Using the seq string, call the method named find. This looks for the string “GAATTC” in the seq string.
The string “GAATTC” is at position 5 in the seq string. Assign the 5 object to the variable named pos.
The variable name “pos” is often used for positions. Common variations are “pos1”, “pos2”, “start_pos”, “end_pos”.
Do the test for the if statement. Is the variable pos equal to -1?
Since pos is 5 and 5 is not equal to -1, this test is false.
The test is False. Skip the first code block (that is only run if the test is True).
Instead, run the code block after the else:
This is a print statement. Print the index of the start position.
This prints
EcoRI site starting at index 5
There are no more statements so Python stops.
A more complex example
Using if inside a for
restriction_sites = [
    "GAATTC",  # EcoRI
    "GGATCC",  # BamHI
    "AAGCTT",  # HindIII
]
seq = raw_input("Enter a DNA sequence: ")
for site in restriction_sites:
    if site in seq:
        print site, "is a cleavage site"
    else:
        print site, "is not present"
Nested code blocks
for site in restriction_sites:
    if site in seq:
        print site, "is a cleavage site"
    else:
        print site, "is not present"
The four indented lines under the for statement are the code block for the for statement.
The first print line is the true part of the if statement.
The second print line is the false part of the if statement.
The program output
Enter a DNA sequence: AATGAATTCTCTGGAAGCTTA

GAATTC is a cleavage site
GGATCC is not present
AAGCTT is a cleavage site
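The same scan can also report where each site is, by combining the for loop with find. The sketch below is in Python 3 syntax and takes the sequence as an argument instead of asking the user; the function name `scan_sites` is hypothetical and not part of the slides.

```python
def scan_sites(seq, sites):
    """Return (site, position) pairs; position is -1 when the site is absent."""
    return [(site, seq.find(site)) for site in sites]

restriction_sites = [
    "GAATTC",  # EcoRI
    "GGATCC",  # BamHI
    "AAGCTT",  # HindIII
]

hits = scan_sites("AATGAATTCTCTGGAAGCTTA", restriction_sites)
for site, pos in hits:
    if pos == -1:
        print(site, "is not present")
    else:
        print(site, "is a cleavage site at index", pos)
```

Only the first occurrence of each site is reported, because find stops at the first match.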
Read lines from a file
• raw_input() asks the user for input

• Most of the time you’ll get data from a file.
(Or would you rather type in the sequence
every time?)
• To read from a file you need to tell Python to

open that file.
The open function
>>> infile = open("/home/myusername/my_sequences.seq")
>>> print infile
<open file '/home/myusername/my_sequences.seq', mode 'r' at 0x817ca60>
>>>
open returns a new object of type file
A file can’t be displayed like a

number or a string. It is useful
because it has methods for working
with the data in the file.
the readline() method
>>> infile.readline()
'CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>>
readline returns one line from the file

The line includes the end of line
character (represented here by “\n”)
(Note: the last line of some
files may not have a “\n”)
readline finishes with ""
>>> infile.readline()
'CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>> infile.readline()
'ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA\n'
>>> infile.readline()
'TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT\n'
>>> infile.readline()
'CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT\n'
>>> infile.readline()
'TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
>>> infile.readline()
'CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA\n'
>>> infile.readline()
'GTGCTCTCTTCTCGGAGAGAGAAGGTGGGCTGCTTGTCTGCCGATGTACTTTATTAAATCCAATAA\n'
>>> infile.readline()
'CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA\n'
>>> infile.readline()
'TCTGAAAAGTGCAAAGAACGATGATGATGATGATAGAGGAACCTGAGCAGCCATGTCTGAACCTATAGC\n'
>>> infile.readline()
'GTATTGGTCGTCGTGCGACTAAATTAGGTAAAAAAGTAGTTCTAAGAGATTTTGATGATTCAATGCAAAGTTCTATTAATCGTTCAATTG\n'
>>> infile.readline()
''
>>>
When there are no more lines,
readline returns the empty string
Using for with a file
A simple way to read lines from a file
>>> filename = "/home/myusername/my_sequences.seq"
>>> for line in open(filename):
...     print line[:10]
...
CCTGTATTAG
ATTTTTAACT
TTGTTTCTGC
CACACCCAAT
TCTTCTCCAA
CCACACCAAA
GTGCTCTCTT
CCACACCAAA
TCTGAAAAGT
GTATTGGTCG
>>>
for starts with the first line in the file,
then the second, then the third, ...
and finishes with the last line.
A more complex task
List the sequences starting with a cytosine
>>> filename = "/home/myusername/my_sequences.seq"
>>> for line in open(filename):
...     line = line.rstrip()        # use rstrip to get rid of the “\n”
...     if line.startswith("C"):
...         print line
...
CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA
>>>
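The same filter can be tried without the course file. The sketch below (Python 3 syntax) first writes a small temporary file so that it is self-contained; the three short records are stand-ins shortened from the sequences above, and the use of a `with` block to close the file is an addition not covered in the slides.

```python
import os
import tempfile

# Shortened stand-in records; the last line has no trailing "\n".
lines = ["CCTGTATTAG\n", "ATTTTTAACT\n", "CACACCCAAT"]

handle, path = tempfile.mkstemp(suffix=".seq")
with os.fdopen(handle, "w") as outfile:
    outfile.writelines(lines)

starts_with_c = []
with open(path) as infile:          # the file is closed when the block ends
    for line in infile:
        line = line.rstrip()        # drop the trailing "\n", if there is one
        if line.startswith("C"):
            starts_with_c.append(line)

os.remove(path)                     # clean up the temporary file
print(starts_with_c)
```

Calling rstrip on every line is safe even when the last line of the file has no “\n”.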
Searching and Regular
Expressions
Proteins
• 20 amino acids
• Interesting structures
• beta barrel, greek key motif, EF hand ...
• Bind, move, catalyze, recognize, block, ...
• Many post-translational modifications
• Structure/function strongly influenced by
sequence
Sequence Suggests
Structure/Function
When working with tumors you find the p53 tumor antigen,
which is found in increased amounts in transformed cells.
After looking at many p53s you find that the substring

MCNSSCMGGMNRR is well conserved and has few false
(mis)matches.
If you have a new protein sequence and it has this substring

then it is likely to be a p53 tumor antigen.
Finding a string
We’ve covered several ways to find a
substring in a larger string.
site in sequence -- test if the substring site is found anywhere in the sequence.
sequence.find(site) -- find the index of the first site in the sequence. Return -1 if not found.
sequence.count(site) -- count the number of times site is found in the sequence (no overlaps).
Is it a p53 sequence?
>>> p53 = "MCNSSCMGGMNRR"
>>> protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"
>>> protein.find(p53)
10
>>> protein[10:10+len(p53)]
'MCNSSCMGGMNRR'
>>>
p53 needs more than
one test substring
After a while you find that p53s are variable in one residue.
MCNSSCMGGMNRR
or
MCNSSCVGGMNRR
You could test for both cases, but as you add more
possibilities the number of patterns gets really
large, and writing them out is tedious.
Need a pattern
Rather than write each alternative, perhaps we can write a
pattern, which is used to describe all the strings to test.
MCNSSCMGGMNRR or MCNSSCVGGMNRR
becomes
MCNSSC[MV]GGMNRR
Use [] to indicate a list of residues that could match.
[FILAPVM] matches any hydrophobic residue
PROSITE
PROSITE is a database of protein patterns.
http://au.expasy.org/prosite/
The documentation for a pattern is in PRODOC.
PROSITE contains links to SWISS-PROT (a protein

sequence database) and PDB (a structure database)
ANTENNAPEDIA
'Homeobox' antennapedia-type protein signature.
[LIVMFE][FY]PWM[KRQTA]
Look for a substring which:
Starts with L, I, V, M, F, or E
Then has an F or Y
Then the letter P
Followed by a W
Followed by an M
And ending with a K, R, Q, T, or A
Find ANTENNAPEDIA
Can you find [LIVMFE][FY]PWM[KRQTA] ?
MDPDCFAMSS YQFVNSLASC YPQQMNPQQN HPGAGNSSAG GSGGGAGGSG GVVPSGGTNG
GQGSAGAATP GANDYFPAAA AYTPNLYPNT PQPTTPIRRL ADREIRIWWT TRSCSRSDCS
CSSSSNSNSS NMPMQRQSCC QQQQQLAQQQ HPQQQQQQQQ ANISCKYAND PVTPGGSGGG
GVSGSNNNNN SANSNNNNSQ SLASPQDLST RDISPKLSPS SVVESVARSL NKGVLGGSLA
AAAAAAGLNN NHSGSGVSGG PGNVNVPMHS PGGGDSDSES DSGNEAGSSQ NSGNGKKNPP
QIYPWMKRVH LGTSTVNANG ETKRQRTSYT RYQTLELEKE FHFNRYLTRR RRIEIAHALC
LTERQIKIWF QNRRMKWKKE HKMASMNIVP YHMGPYGHPY HQFDIHPSQF AHLSA
That’s why we have computers.
Sequences with the
ANTENNAPEDIA motif
Here are some sequences which contain substrings
which fit the pattern
...LHNEANLRIYPWMRSAGADR...
...PTVGKQIFPWMKES...
...VFPWMKMGGAKGGESKRTR...
Not a given residue
Suppose you know from structural reasons that a
residue cannot be a proline. You could write
[ACDEFGHIKLMNQRSTVWY]
That’s tedious, so let’s use a new notation
[^P]
This matches anything which is not a proline.
(Yes, using the ^ is strange. That’s the way it is.)
N-glycosylation site
This is the pattern for PS00001, ASN_GLYCOSYLATION
N[^P][ST][^P]
Match an N,
Then anything which isn’t a P,
Then an S or T,
And finally, anything which isn’t a P
Allow anything
Sometimes the pattern can have anything in a
given position - it just needs the proper spacing.
Could use [ACDEFGHIKLMNPQRSTVWY] but that gets
tedious. Instead, let’s make a new notation for “anything”
Let’s use the dot, “.”, so that P.P matches a proline

followed by any residue followed by a proline.
Barwin domain signature 1
The pattern is: CG[KR]CL.V.N
Example match: ...SSCGKCLSVTNTG...
The substring must start with a C,
second letter must be a G,
third must be a K or R,
fourth must be a C,
fifth must be an L,
sixth may be any residue,
seventh must be a V,
eighth may also be any residue,
last must be an N.
Repeats
Sometimes you’ll repeat yourself repeat yourself. For
example, a pattern may require 5 hydrophobic residues
between two well conserved regions.
You could write it as
[FILAPVM][FILAPVM][FILAPVM][FILAPVM][FILAPVM]
but that gets tedious. Again that word. And again we’ll create
a new notation. Let’s use {}s with a number inside to indicate
how many times to repeat the previous pattern.
[FILAPVM]{5}
[FILAPVM]{5}
The {}s repeat the previous pattern.
The above matches all of the following
AAAAA
AAPAP
LAPMAVAILA
VILLAMAP
LAPLAMP
And .{6} matches any string of at least length 6.
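These claims can be checked with Python's re module (introduced later in these slides). A small sketch in Python 3 syntax; "AAGAA" is a made-up counterexample added here, not one of the slide's strings.

```python
import re

pattern = "[FILAPVM]{5}"
candidates = ["AAAAA", "AAPAP", "LAPMAVAILA", "VILLAMAP", "LAPLAMP", "AAGAA"]

# re.search looks for the pattern anywhere in the string, so a longer
# string matches as soon as it contains five hydrophobic residues in a row.
matched = [c for c in candidates if re.search(pattern, c)]

# "AAGAA" fails: the G breaks the run, leaving at most two in a row.
print(matched)
```

The same reasoning explains the last line above: .{6} is found in any string that has at least six characters.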
EGF-like domain
signature 1
The pattern for PS00022 is: C.C.{5}G.{2}C
Match a C, followed by any residue, followed by a C, followed
by 5 residues of any type, then a G, then 2 of any residue type,
then a C.
...VCSNEGKCICQPDWTGKDCS...
Count Ranges
Sometimes you may have a range of repeats. For example, a
loop can have 3 to 5 residues in it. All of our patterns so far
only matched a fixed number of characters, so we need to
modify the notation.
{m,n} - repeat the previous pattern at least

m times and up to n times.
For example, A{3,5} matches AAA, AAAA, and AAAAA
but does not match AA nor AATAA. (Write the range without a
space: a space inside the {}s changes the meaning of the pattern.)
EGF-like domain
signature 2
PS01186 is: C.C.{2}[GP][FYW].{4,8}C
Use a spacer of at least 4 residues

and up to (and including) 8 residues.
RHCYCEEGWAPPDCTTQLKA
RHCYCEEGWAPPDECTTQLKA
RHCYCEEGWAPPDEQCTTQLKA
RHCYCEEGWAPPDEQWCTTQLKA
RHCYCEEGWAPPDEQWICTTQLKA
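The five sequences above differ only in the length of the spacer before the final C, and all of them fit the pattern. A sketch with Python's re module (Python 3 syntax); the sequence with a 3-residue spacer is a made-up counterexample.

```python
import re

pattern = "C.C.{2}[GP][FYW].{4,8}C"
seqs = [
    "RHCYCEEGWAPPDCTTQLKA",       # spacer of 4
    "RHCYCEEGWAPPDECTTQLKA",      # spacer of 5
    "RHCYCEEGWAPPDEQCTTQLKA",     # spacer of 6
    "RHCYCEEGWAPPDEQWCTTQLKA",    # spacer of 7
    "RHCYCEEGWAPPDEQWICTTQLKA",   # spacer of 8
]

# Length of the matched substring grows with the spacer.
lengths = []
for seq in seqs:
    match = re.search(pattern, seq)
    lengths.append(match.end() - match.start())

# Made-up sequence with a spacer of only 3 residues: no match.
too_short = re.search(pattern, "RHCYCEEGWAPPCTTQLKA")
print(lengths, too_short)
```

A fixed count such as .{4} would accept only the first sequence; the range is what lets one pattern cover all five.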
Short-hand versions of
counts ranges
This notation is very powerful and widely used outside of
bioinformatics. (I think research on it started in the 1950s).
Some repeat ranges are used so frequently that (to prevent
tedium, and to make things easier to read) there is special
notation for them.
Range    Shorthand    What it means
{0,1}    ?            “optional”
{0,}     *            “0 or more”
{1,}     +            “at least one”
N- and C- terminals
Some things only happen at the N- terminal (start of the
sequence) or C-terminal (end of the sequence). We
don’t have a way to say that so we need - yes, you
guessed it - more notation.
^ means the start of the sequence (a ^ inside
of []s means “not”, outside means “start”)
$ means the end of the sequence
^examples$
^A         start with an A
^[MPK]     start with an M, P, or K
E$         end with an E
[QSN]$     end with a Q, S, or N
^[^P]      start with anything except P
^A.*E$     start with an A and end with an E
Neuromodulin
(GAP-43) signature 1
The pattern for PS00412 is: ^MLCC[LIVM]RR
Does match: MLCCIRRTKPVEKNEEADQE

Does not match: MMLCCIRRTKPVEKNEEADQE
Endoplasmic reticulum
targeting sequence
The pattern for PS00014 is: [KRHQSA][DENQ]EL$
Does match: ADGGVDDDHDEL

Does not match: ADGGVDDDHDELQ
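Both anchored patterns can be checked against the four example sequences with Python's re module. A minimal sketch in Python 3 syntax; the variable names are illustrative.

```python
import re

n_term = "^MLCC[LIVM]RR"        # PS00412: must be at the start
c_term = "[KRHQSA][DENQ]EL$"    # PS00014: must be at the end

starts_ok = re.search(n_term, "MLCCIRRTKPVEKNEEADQE") is not None
starts_bad = re.search(n_term, "MMLCCIRRTKPVEKNEEADQE") is not None
ends_ok = re.search(c_term, "ADGGVDDDHDEL") is not None
ends_bad = re.search(c_term, "ADGGVDDDHDELQ") is not None

# In Python, $ also matches just before a newline at the very end,
# so a line read from a file still matches without rstrip().
ends_with_newline = re.search(c_term, "ADGGVDDDHDEL\n") is not None
print(starts_ok, starts_bad, ends_ok, ends_bad, ends_with_newline)
```

Without the ^, the first pattern would also be found in the second sequence, one residue in from the start.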
Regular expressions
These sorts of patterns which match strings are called “regular
expressions”. (The name “regular” comes from a theoretical
model of how simple computers work, and “expressions”
because they are written as text.)
People don’t like saying “regular expression” all the time so will
often say “regexp”, “regex”, or “re”, or (rarely) “rx”.
Many different regexp
languages
We’ve learned a bit of the “perl5” regular expression
language. It’s the most common and is used by Python
and other languages. There’s even pcre (perl compatible
regular expressions) for C.
There are many others: grep, emacs, awk, POSIX, and the
shells all use different ways to write the same pattern.
PROSITE also has its own unique form (which I
didn’t teach because no one else uses it).
regexps in Python
The re module in Python has functions for working
with regular expressions.
>>> import re
>>>
The ‘search’ method
>>> import re
>>> text = "My name is Andrew"
>>> re.search("[AT]", text)
The first parameter is the pattern, as a string.

The second is the string to search.
The Match object
>>> import re
>>> text = "My name is Andrew"
>>> re.search("[AT]", text)
<_sre.SRE_Match object at 0x3f8d40>
The search returns a “Match” object. Just like a file
object, there is no simple way to show it.
Using the match
>>> match = re.search("[AT]", text)
>>> match.start()
11
>>> match.end()
12
>>> text[11:12]
'A'
>>>
Match a protein motif
>>> pattern = "[LIVMFE][FY]PWM[KRQTA]"
>>> seq = "LHNEANLRIYPWMRSAGADR"
>>> match = re.search(pattern, seq)
>>> match.start()
8
>>> match.end()
14
>>>
If it doesn’t match..
The search returns nothing (the None object)
when no match was found.
>>> import re
>>> match = re.search(pattern, "AAAAAAAAAAAAAA")
>>> print match
None
>>>
List matching patterns
>>> import re
>>> sequences = ["LHNEANLRIYPWMRSAGADR",
... "PTVGKQIFPWMKES",
... "NEANLKQIFPGAATR",
... "VFPWMKMGGAKGGESKRTR"]
>>> for seq in sequences:
...     match = re.search(pattern, seq)
...     if match:
...         print seq, "matches"
...     else:
...         print seq, "does not have the motif"
...
LHNEANLRIYPWMRSAGADR matches
PTVGKQIFPWMKES matches
NEANLKQIFPGAATR does not have the motif
VFPWMKMGGAKGGESKRTR matches
>>>
Groups
Suppose an enzyme modifies a protein, and recognizes
the portion of the sequence matching [ASD]{3,5}[LI][^P]{2,5}
The modification only occurs on the [LI] residue. I want
to know the position of that one residue, and not the
start/end positions of the whole motif. This requires a
new notation, groups.
(groups)
Use ()s to indicate groups. The first ( is the start of the
first group, the second ( is the start of the second
group, etc. A group ends with the matching ).
>>> import re
>>> pattern = "[ASD]{3,5}([LI])[^P]{2,5}"
>>> seq = "EASALWTRD"
>>> match = re.search(pattern, seq)
>>> print match.start(), match.end()
1 9
>>> print match.start(1), match.end(1)
4 5
>>>
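The group gives both the position of the modified residue and the residue itself. A short sketch in Python 3 syntax, using the same pattern and sequence; the names `site` and `residue` are illustrative.

```python
import re

pattern = "[ASD]{3,5}([LI])[^P]{2,5}"
seq = "EASALWTRD"

match = re.search(pattern, seq)
whole = match.span()        # (start, end) of the whole motif
site = match.start(1)       # index of the residue captured by group 1
residue = match.group(1)    # the captured text itself
print(whole, site, residue, seq[site])
```

Group 0 always means the whole match, so `match.start()` and `match.start(0)` are the same value.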
Parsing with regexps
Groups are great for parsing. Suppose I have the string
Name: Andrew Age: 33
and want to get the name and the age values. I can use a
pattern with a group for each field.
Name: ([^ ]+) +Age: ([0123456789]+)
Dissecting that pattern
Name: ([^ ]+) +Age: ([0123456789]+)
“Name: ”          start with this text
([^ ]+)           one or more non-space characters (group 1)
 +                one or more spaces
“Age: ”           then this text
([0123456789]+)   one or more digits (group 2)
Shorthand
Saying [0123456789] is tedious (again!)
There is special shorthand notation for some of
the more common sets.
Name: ([^ ]+) +Age: (\d+)
Some others
\d = [0123456789]
\w = letters, digits, and the underscore
\s = “whitespace” (space, newline, tab, and a few others)
Using it
>>> import re
>>> text = "Name: Andrew Age: 33"
>>> pattern = "Name: ([^ ]+) +Age: ([0123456789]+)"
>>> match = re.search(pattern, text)
>>> match.start(1)
6
>>> match.end(1)
12
>>> match.group(1)
'Andrew'
>>> match.group(2)
'33'
>>>
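Here is a sketch of the same parse using the \d shorthand. The raw string (the r prefix) is my addition; it keeps Python from treating the backslash specially. Converting the age with int() is also my choice, since a group is always returned as a string.

```python
import re

text = "Name: Andrew Age: 33"
# \d+ means "one or more digits", the same as [0123456789]+
pattern = r"Name: ([^ ]+) +Age: (\d+)"

match = re.search(pattern, text)
name = match.group(1)          # the text of group 1
age = int(match.group(2))      # group 2 is the string '33'; make it a number
print("%s is %d years old" % (name, age))
```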
Dictionaries
A “Good morning”
dictionary
English: Good morning
Spanish: Buenos días
Swedish: God morgon
German: Guten morgen
Venda: Ndi matscheloni
Afrikaans: Goeie môre
What’s a dictionary?
A dictionary is a table of items.
Each item has a “key” and a “value”
Keys Values
English Good morning
Spanish Buenos días
Swedish God morgon
German Guten morgen
Venda Ndi matscheloni
Afrikaans Goeie môre
Look up a value
I want to know “Good morning” in Swedish.
Step 1: Get the “Good morning” table
Keys Values
Swedish God morgon
German Guten morgen
Find the item
Step 2: Find the item where the key is “Swedish”
Keys Values
Swedish God morgon
German Guten morgen
Get the value
Step 3: The value of that item is how to say “Good
morning” in Swedish -- “God morgon”
Keys Values
Swedish God morgon
German Guten morgen
In Python
>>> good_morning_dict = {
... "English": "Good morning",
... "Swedish": "God morgon",
... "German": "Guten morgen",
... "Venda": "Ndi matscheloni",
... }
>>> print good_morning_dict["Swedish"]
God morgon
>>>
(I left out Spanish and Afrikaans because they use

‘special’ characters. Those require Unicode, which
I’m not going to cover.)
Dictionary examples
>>> D1 = {}                               # An empty dictionary
>>> len(D1)
0
>>> D2 = {"name": "Andrew", "age": 33}    # A dictionary with 2 items
>>> len(D2)
2
>>> D2["name"]
'Andrew'
>>> D2["age"]
33
>>> D2["AGE"]                             # Keys are case-sensitive
KeyError: 'AGE'
>>>
Add new elements
>>> my_sister = {}
>>> my_sister["name"] = "Christy"
>>> print "len =", len(my_sister), "and value is", my_sister
len = 1 and value is {'name': 'Christy'}
>>> my_sister["children"] = ["Maggie", "Porter"]
>>> print "len =", len(my_sister), "and value is", my_sister
len = 2 and value is {'name': 'Christy', 'children': ['Maggie', 'Porter']}
>>>
Get the keys and values
>>> city = {"name": "Cape Town", "country": "South Africa",
... "population": 2984000, "lat.": -33.93, "long.": 18.46}
>>> print city.keys()
['country', 'long.', 'lat.', 'name', 'population']
>>> print city.values()
['South Africa', 18.460000000000001, -33.93, 'Cape Town', 2984000]
>>> for k in city:
...     print k, "=", city[k]
...
country = South Africa
long. = 18.46
lat. = -33.93
name = Cape Town
population = 2984000
>>>
A few more examples
>>> D = {"name": "Johann", "city": "Cape Town"}
>>> D["city"] = "Johannesburg"
>>> print D
{'city': 'Johannesburg', 'name': 'Johann'}
>>> del D["name"]
>>> print D
{'city': 'Johannesburg'}
>>> D["name"] = "Dan"
>>> print D
{'city': 'Johannesburg', 'name': 'Dan'}
>>> D.clear()
>>>
>>> print D
{}
>>>
Ambiguity codes
Sometimes DNA bases are ambiguous.
E.g., the sequencer might be able to tell that
a base is not a G or T but could be either A or C.
The standard (IUPAC) one-letter code for
DNA includes letters for ambiguity.
M is A or C    Y is C or T       D is A, G or T
R is A or G    K is G or T       B is C, G or T
W is A or T    V is A, C or G    N is G, A, T or C
S is C or G    H is A, C or T
Count Bases #1
This time we’ll include all 15 possible letters
>>> seq = "TKKAMRCRAATARKWC"
>>> A = seq.count("A")
>>> B = seq.count("B")
>>> C = seq.count("C")
>>> D = seq.count("D")
>>> G = seq.count("G")
>>> H = seq.count("H")
>>> K = seq.count("K")
>>> M = seq.count("M")
>>> N = seq.count("N")
>>> R = seq.count("R")
>>> S = seq.count("S")
>>> T = seq.count("T")
>>> V = seq.count("V")
>>> W = seq.count("W")
>>> Y = seq.count("Y")
>>> print "A =", A, "B =", B, "C =", C, "D =", D, "G =", G, "H =", H, "K =", K, "M =", M, "N =", N, "R =", R, "S =", S, "T =", T, "V =", V, "W =", W, "Y =", Y
A = 4 B = 0 C = 2 D = 0 G = 0 H = 0 K = 3 M = 1 N = 0 R = 3 S = 0 T = 2 V = 0 W = 1 Y = 0
>>>
Don’t do this!
Let the computer help out
Count Bases #2
Using a dictionary
>>> counts = {}
>>> counts["A"] = seq.count("A")
>>> counts["B"] = seq.count("B")
>>> counts["C"] = seq.count("C")
>>> counts["D"] = seq.count("D")
>>> counts["G"] = seq.count("G")
>>> counts["H"] = seq.count("H")
>>> counts["K"] = seq.count("K")
>>> counts["M"] = seq.count("M")
>>> counts["N"] = seq.count("N")
>>> counts["R"] = seq.count("R")
>>> counts["S"] = seq.count("S")
>>> counts["T"] = seq.count("T")
>>> counts["V"] = seq.count("V")
>>> counts["W"] = seq.count("W")
>>> counts["Y"] = seq.count("Y")
>>> print counts
{'A': 4, 'C': 2, 'B': 0, 'D': 0, 'G': 0, 'H': 0, 'K': 3, 'M': 1, 'N':
0, 'S': 0, 'R': 3, 'T': 2, 'W': 1, 'V': 0, 'Y': 0}
>>>
Don’t do this either!
Count Bases #3
use a for loop
>>> counts = {}
>>> for letter in "ABCDGHKMNRSTVWY":
...     counts[letter] = seq.count(letter)
...
>>> print counts
{'A': 4, 'C': 2, 'B': 0, 'D': 0, 'G': 0, 'H': 0, 'K': 3, 'M': 1, 'N': 0, 'S': 0, 'R': 3, 'T': 2,
'W': 1, 'V': 0, 'Y': 0}
>>> for base in counts.keys():
...     print base, "=", counts[base]
...
A = 4
C = 2
B = 0
D = 0
G = 0
H = 0
K = 3
M = 1
N = 0
S = 0
R = 3
T = 2
W = 1
V = 0
Y = 0
>>>
Count Bases #4
Suppose you don’t know all the possible bases.
If the base isn’t a key in the counts dictionary then use
zero. Otherwise use the value from the dict.
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
...     if base not in counts:
...         n = 0
...     else:
...         n = counts[base]
...     counts[base] = n + 1
...
>>> print counts
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
>>>
Count Bases #5 (Last
one!)
The idiom “use a default value if the key doesn’t
exist” is very common. Python has a special
method to make it easy.
>>> counts = {}
>>> for base in seq:
...     counts[base] = counts.get(base, 0) + 1
...
>>> print counts
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
>>> counts.get("A", 9)
4
>>> counts["B"]
KeyError: 'B'
>>> counts.get("B", 9)
9
>>>
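Counting things is so common that newer Pythons ship a ready-made counter. This sketch swaps in collections.Counter (available from Python 2.7 on, which is an assumption about your installation) for the get() loop above. A Counter acts like a dictionary but returns 0 for a missing key instead of raising KeyError.

```python
from collections import Counter

seq = "TKKAMRCRAATARKWC"
counts = Counter(seq)          # counts every letter in one step

print(counts["A"])             # a letter that is in the sequence
print(counts["B"])             # a missing letter gives 0, not a KeyError
```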
Reverse Complement
>>> complement_table = {"A": "T", "T": "A", "C": "G", "G": "C"}
>>> seq = "CCTGTATT"
>>> new_seq = []
>>> for letter in seq:
...     complement_letter = complement_table[letter]
...     new_seq.append(complement_letter)
...
>>> print new_seq
['G', 'G', 'A', 'C', 'A', 'T', 'A', 'A']
>>> new_seq.reverse()
>>> print new_seq
['A', 'A', 'T', 'A', 'C', 'A', 'G', 'G']
>>> print "".join(new_seq)
AATACAGG
>>>
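The table above only knows A, T, C and G, so any other letter raises a KeyError. One possible variation, sketched here, uses get() with a default. Mapping every unknown letter to "N" is my assumption for the example, not a rule from the slides, and the input sequence with an N in it is made up.

```python
complement_table = {"A": "T", "T": "A", "C": "G", "G": "C"}
seq = "CCTGNATT"               # hypothetical input containing an ambiguous base

new_seq = []
for letter in seq:
    # Fall back to "N" when the letter is not in the table.
    new_seq.append(complement_table.get(letter, "N"))
new_seq.reverse()
result = "".join(new_seq)
print(result)
```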
Listing Codons
>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> seq[0:3]
'TCT'
>>> seq[3:6]
'CCA'
>>> seq[6:9]
'AGA'
>>> range(0, len(seq), 3)
[0, 3, 6, 9, 12, 15, 18]
>>> for i in range(0, len(seq), 3):
...     print "Codon", i/3, "is", seq[i:i+3]
...
Codon 0 is TCT
Codon 1 is CCA
Codon 2 is AGA
Codon 3 is CGC
Codon 4 is ATC
Codon 5 is CCA
Codon 6 is GTG
>>>
The last “codon”
>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq), 3):
...     print "Base", i/3, "is", seq[i:i+3]
...
Base 0 is TCT
Base 1 is CCA
Base 2 is A          (Not a codon!)
>>>
What to do? It depends on what you want.

But you’ll probably want to know if the
sequence length isn’t divisible by three.
The ‘%’ (remainder) operator
>>> 0 % 3
0
>>> 1 % 3
1
>>> 2 % 3
2
>>> 3 % 3
0
>>> 4 % 3
1
>>> 5 % 3
2
>>> 6 % 3
0
>>>
>>> seq = "TCTCCAA"
>>> len(seq)
7
>>> len(seq) % 3
1
>>>
Two solutions
First one -- refuse to do it
if len(seq) % 3 != 0: # not divisible by 3
    print "Will not process the sequence"
else:
    print "Will process the sequence"
Second one -- skip the last few letters

Here I’ll adjust the length
>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq) - len(seq)%3, 3):
...     print "Base", i/3, "is", seq[i:i+3]
...
Base 0 is TCT
Base 1 is CCA
>>>
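A third possibility, sketched here, is to keep the complete codons and also remember what was left over, so nothing is silently thrown away. The names leftover, usable and tail are mine.

```python
seq = "TCTCCAA"

leftover = len(seq) % 3          # how many letters do not fit in a codon
usable = len(seq) - leftover     # length of the part that is whole codons

codons = []
for i in range(0, usable, 3):
    codons.append(seq[i:i+3])
tail = seq[usable:]              # the incomplete "codon", possibly empty

print(codons)
print(tail)
```

If tail is the empty string then the sequence length was divisible by three.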
Counting codons
>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> codon_counts = {}
>>> for i in range(0, len(seq) - len(seq)%3, 3):
...     codon = seq[i:i+3]
...     codon_counts[codon] = codon_counts.get(codon, 0) + 1
...
>>> codon_counts
{'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
>>>
Notice that the codon_counts dictionary

elements aren’t sorted?
Sorting the output
People like sorted output. It’s easier to
find “GTG” if the codon table is in order.
Use keys to get the dictionary keys then
use sort to sort the keys (put them in order).
>>> codon_counts = {'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
>>> codons = codon_counts.keys()
>>> print codons
['ATC', 'GTG', 'TCT', 'AGA', 'CCA', 'CGC']
>>> codons.sort()
>>> print codons
['AGA', 'ATC', 'CCA', 'CGC', 'GTG', 'TCT']
>>> for codon in codons:
...     print codon, "=", codon_counts[codon]
...
AGA = 1
ATC = 1
CCA = 2
CGC = 1
GTG = 1
TCT = 1
>>>
Code Blocks
and
Indentation
Indentation is important
Think of a recipe - Chocolate Cake
(Mmmmmm.... Chocolate Cake....)
1. Make the cake

2. Put the frosting on
3. Eat and enjoy
Top-down programming style....
Make the cake?
1. Make the cake:
    1A. make the batter
    1B. put into pans
    1C. bake at 180C for 30-35 minutes
2. Put the frosting on
3. Eat and enjoy
Make the batter?
1. Make the cake:
    1A. make the batter:
        1Aa. melt chocolate and butter
        1Ab. prepare egg mixture
        1Ac. sift dry ingredients
        1Ad. combine egg mixture, dry ingredients and milk
        1Ae. fold egg whites into batter
    1B. put into pans
    1C. bake at 180C for 30-35 minutes
2. Put the frosting on
3. Eat and enjoy
Melt the chocolate ... ?
- Make the cake:
    - make the batter:
        - melt chocolate and butter:
            - In a heavy saucepan over low heat:
                - put in 6 ounces semi-sweet chocolate
                - put in 1/2 cup butter
                - while it hasn’t melted:
                    - wait a little bit
                    - stir
            - put aside to let cool
        - prepare egg mixture
        - sift dry ingredients
        - combine egg mixture, dry ingredients and milk
        - fold egg whites into batter
    - put into pans
    - bake at 180C for 30-35 minutes
- Put the frosting on
- Eat and enjoy
Where do I get ... ?
- Prepare for cooking:
    - get 6 ounces of chocolate, 1/2 cup butter
    - get a saucepan, stove, spoon for stirring
- Make the cake:
    - make the batter:
        - melt chocolate and butter:
            - In a heavy saucepan over low heat:
                - put in 6 ounces semi-sweet chocolate
                - put in 1/2 cup butter
                - while it hasn’t melted:
                    - wait a little bit
                    - stir
            - put aside to let cool
        - prepare egg mixture
        - sift dry ingredients
        - combine egg mixture, dry ingredients and milk
        - fold egg whites into batter
    - put into pans
    - bake at 180C for 30-35 minutes
- Put the frosting on
- Eat and enjoy
I have/don’t have that!
Prepare for cooking:
    get a kitchen with a good set of cookware
    start a “Shopping list”
    for each ingredient in [6 ounces of chocolate,
                            1/2 cup of butter]:
        if I don’t have enough of the ingredient:
            add what’s missing to the shopping list
Make the cake:
    make the batter:
        melt chocolate and butter:
            In a heavy saucepan over low heat:
                put in 6 ounces semi-sweet chocolate
                put in 1/2 cup butter
                while it hasn’t melted:
                    wait a little bit
                    stir
            put aside to let cool
        prepare egg mixture
        sift dry ingredients
        combine egg mixture, dry ingredients and milk
        fold egg whites into batter
    put into pans
    bake at 180C for 30-35 minutes
Put the frosting on
Eat and enjoy
And the egg mixture?
Prepare for cooking:
    get a kitchen with a good set of cookware
    start a “Shopping list”
    for each ingredient in [6 ounces of chocolate,
                            1/2 cup of butter, 4 eggs, 1/3 cup sugar]:
        if I don’t have enough of the ingredient:
            add what’s missing to the shopping list
Make the cake:
    make the batter:
        melt chocolate and butter:
            In a heavy saucepan over low heat:
                put in 6 ounces semi-sweet chocolate
                put in 1/2 cup butter
                while it hasn’t melted:
                    wait a little bit
                    stir
            put aside to let cool
        prepare egg mixture:
            get one small bowl for egg whites and another for yolks
            using two eggs:
                put the egg whites in the bowl for egg whites
                put the yolks in the bowl for yolks
            with the other two eggs:
                put the yolks in the yolk bowl
                discard the egg whites
            mix the yolks until they are thick and a lemon color
            in the yolk bowl, add 1/3 cup sugar
            mix the ingredients in the yolk bowl until thick
        sift dry ingredients
        combine egg mixture, dry ingredients and milk
        fold egg whites into batter
    put into pans
    bake at 180C for 30-35 minutes
Put the frosting on
Eat and enjoy
Then the dry ingredients...
And folding in the egg whites ...
And putting everything into the pans ...
And making the frosting ...
Oh, and cleaning up afterwards...
...
Making four cakes
I’m making birthday cakes for four Swedes,
Anders, Lars, Ingela, Jacob.
Prepare for cooking (*4)
Make the cake (*4)
Put the frosting on (*4)
for x in ["Anders", "Lars", "Ingela", "Jacob"]:
    on the cake, write "Happy Birthday, ", x
Eat and enjoy
Functions
Built-in functions
You’ve used several functions already
>>> len("ATGGTCA")
7
>>> abs(-6)
6
>>> float("3.1415")
3.1415000000000002
>>>
What are functions?
A function is a code block with a name
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
Functions start with ‘def’
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
Then the name
This function is named ‘hello’
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
The list of parameters
The parameters are always listed in parentheses.
There are no parameters in this function
so the parameter list is empty.
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
(I’ll cover parameters in more detail soon)
A colon
A function definition starts a new code block.
The definition line must end with a colon (the “:”),
just like the ‘if’ and ‘for’ statements.
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
The code block
These are the statements that are run when the
function is called. They can be any Python
statement (print, assignment, if, for, open, ...)
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
Calling the function
When you “call” a function you ask Python
to execute the statements in the code block
for that function.
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
Which function to call?
Start with the name of the function.
In this case the name is “hello”
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
List any parameters
The parameters are always listed in parentheses.
There are no parameters for this function
so the parameter list is empty.
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
And the function runs
>>> def hello():
...     print "Hello, how are you?"
...
>>> hello()
Hello, how are you?
>>>
Arguments and
Parameters
(Two sides of the same idea)
Most of the time you don’t want the function

to do the same thing over and over. You want
it to run the same algorithm using different
data.
Hello, <insert name here>
Say “Hello” followed by the person’s name
In maths we say “the function is parameterized by
the person’s name”
>>> def hello(name):
...     print "Hello", name
...
>>> hello("Andrew")
Hello Andrew
>>>
Change the function definition
The function now takes one parameter. When the function
is called this parameter will be accessible using the variable
named ‘name’.
>>> def hello(name):
...     print "Hello", name
...
>>> hello("Andrew")
Hello Andrew
>>>
Calling the function
The function call now needs one argument.
Here I’ll use the string “Andrew”.
>>> def hello(name):
...     print "Hello", name
...
>>> hello("Andrew")
Hello Andrew
>>>
And the function runs
The function call assigns the string “Andrew” to
the variable “name” then does the statements
in the code block
>>> def hello(name):
...     print "Hello", name
...
>>> hello("Andrew")
Hello Andrew
>>>
Multiple parameters
Here’s a function which takes two parameters
and subtracts the second from the first.
Two parameters in the definition
>>> def subtract(x, y):
...     print x-y
...
>>> subtract(8, 5)
3
>>>
Two parameters in the call
Returning values
Rarely do functions only print.
More often the function does something and
the results of that are used by something else.
For example, len computes the length of a string
or list then returns that value to the caller.
subtract doesn’t return
anything
By default, a function returns the special value None
>>> def subtract(x, y):
...     print x-y
...
>>> x = subtract(8, 5)
3
>>> print x
None
>>>
The return statement
The return statement tells Python to exit the
function and return a given object.
>>> def subtract(x, y):
...     return x-y
...
>>> x = subtract(8, 5)
>>> print x
3
>>>
You can return anything (list, string, number,
dictionary, even a function).
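To make "you can return anything" concrete, here is a small sketch with two made-up functions (min_and_max and make_adder are my names, not from the slides). The first returns two values at once as a tuple; the second returns a new function.

```python
def min_and_max(values):
    """Return the smallest and largest item as a two-element tuple."""
    return min(values), max(values)

def make_adder(n):
    """Return a new function which adds n to its argument."""
    def add(x):
        return x + n
    return add

low, high = min_and_max([5, 2, 7, 8])   # unpack the returned tuple
add_three = make_adder(3)               # add_three is itself a function
print(low)
print(high)
print(add_three(4))
```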
Making a function
Yes, we’re going to count letters again.
seq = "ATGCATGATGCATGAAAGGTCG"
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]
Identify the function
I’m going to make a function which counts bases.
What’s the best part to turn into a function?
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]

Identify the input
In this example the sequence can change.
That makes seq a good choice as a parameter.
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]

Identify the algorithm
This is the part of your program
which does something.
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]

Identify the output
The output will use the data computed by
your function...
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]

Identify the return value
... which helps you identify the return value
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]
Name the function
First, come up with a good name for your function.
It should be descriptive so that when you or someone

else sees the name then they have an idea of what it
does.
Good names Bad names
do_count
count_bases
count_bases_in_sequence
count_letters
CoUnTbAsEs
countbases
QPXT
Start with the ‘def’ line
The function definition starts with a ‘def’
def count_bases(seq):
It is named ‘count_bases’.
It takes one parameter, which will be accessed using
the variable named ‘seq’.
Remember, the def line ends with a colon
Add the code block
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
Return the results
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
    return counts
Use the function
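def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
    return counts
input_seq = "ATGCATGATGCATGAAAGGTCG"
results = count_bases(input_seq)
for base in results:
    print base, "=", results[base]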
Use the function
Notice that the variables for the parameters and
the return value don’t need to be the same
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
    return counts
input_seq = "ATGCATGATGCATGAAAGGTCG"
results = count_bases(input_seq)
for base in results:
    print base, "=", results[base]
Interactively
>>> def count_bases(seq):
...     counts = {}
...     for base in seq:
...         if base not in counts:
...             counts[base] = 1
...         else:
...             counts[base] = counts[base] + 1
...     return counts
...
(I don’t even need a variable name - just use the values directly.)
>>> count_bases("ATATC")
{'A': 2, 'C': 1, 'T': 2}
>>> count_bases("ATATCQGAC")
{'A': 3, 'Q': 1, 'C': 2, 'T': 2, 'G': 1}
>>> count_bases("")
{}
>>>
Functions can call functions
>>> def gc_content(seq):
...     counts = count_bases(seq)
...     return (counts["G"] + counts["C"]) / float(len(seq))
...
>>> gc_content("CGAATT")
0.333333333333
>>>
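The gc_content above fails with a KeyError when the sequence has no G or no C, and divides by zero on an empty string. A more forgiving sketch follows. Returning 0.0 for an empty sequence is my assumption; you may prefer to treat that as an error.

```python
def count_bases(seq):
    """Count how often each letter occurs in seq."""
    counts = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return counts

def gc_content(seq):
    """Fraction of letters in seq that are G or C."""
    if len(seq) == 0:
        return 0.0                      # assumption: empty means no GC
    counts = count_bases(seq)
    # get() supplies 0 when there are no G's or no C's at all
    gc = counts.get("G", 0) + counts.get("C", 0)
    return gc / float(len(seq))

print(gc_content("CGAATT"))
print(gc_content("AATT"))
```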
Functions can be used
(almost) anywhere
In an ‘if’ statement
>>> def polyA_tail(seq):
...     if seq.endswith("AAAAAA"):
...         return True
...     else:
...         return False
...
>>> if polyA_tail("ATGCTGTCGATGAAAAAAA"):
...     print "Has a poly-A tail"
...
Has a poly-A tail
>>>
Functions can be used
(almost) anywhere
In a ‘for’ statement
>>> def split_into_codons(seq):
...     codons = []
...     for i in range(0, len(seq)-len(seq)%3, 3):
...         codons.append(seq[i:i+3])
...     return codons
...
>>> for codon in split_into_codons("ATGCATGCATGCATGCATGC"):
...     print "Codon", codon
...
Codon ATG
Codon CAT
Codon GCA
Codon TGC
Codon ATG
Codon CAT
>>>
Default arguments
def ask_ok(prompt, retries=4, complaint='Yes or no, please!'):
    while True:
        ok = raw_input(prompt)
        if ok in ('y', 'ye', 'yes'):
            return True
        if ok in ('n', 'no', 'nop', 'nope'):
            return False
        retries = retries - 1
        if retries < 0:
            raise IOError('refusenik user')
        print complaint
Keyword arguments
def parrot(voltage, state='a stiff', action='voom', type='Norwegian Blue'):
    print "-- This parrot wouldn't", action,
    print "if you put", voltage, "volts through it."
    print "-- Lovely plumage, the", type
    print "-- It's", state, "!"
OK:
parrot(1000) # 1 positional argument
parrot(voltage=1000) # 1 keyword argument
parrot(voltage=1000000, action='VOOOOOM') # 2 keyword arguments
parrot(action='VOOOOOM', voltage=1000000) # 2 keyword arguments
parrot('a million', 'bereft of life', 'jump') # 3 positional arguments
parrot('a thousand', state='pushing up the daisies') # 1 positional, 1 keyword
Not OK:
parrot() # required argument missing
parrot(voltage=5.0, 'dead') # non-keyword argument after a keyword argument
parrot(110, voltage=220) # duplicate value for the same argument
parrot(actor='John Cleese') # unknown keyword argument
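The same rules apply to your own functions. In this sketch describe is a made-up function with one required parameter and two defaults; the length of 1000 is an invented value for illustration. It returns a string instead of printing so the results can be compared.

```python
def describe(name, length=0, organism="unknown"):
    """Build a one-line description of a sequence record."""
    return "%s (%d bp, %s)" % (name, length, organism)

a = describe("AC005138")                             # defaults for the rest
b = describe("AC005138", organism="Homo sapiens")    # skip over 'length'
c = describe("AC005138", 1000, "Homo sapiens")       # all positional
d = describe(organism="Homo sapiens", name="AC005138", length=1000)

print(a)
print(b)
print(c == d)    # keyword order does not matter
```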
Sorting and Modules
Sorting
Lists have a sort method
Strings are sorted alphabetically, except ...
>>> L1 = ["this", "is", "a", "list", "of", "words"]
>>> print L1
['this', 'is', 'a', 'list', 'of', 'words']
>>> L1.sort()
>>> print L1
['a', 'is', 'list', 'of', 'this', 'words']
>>>
Uppercase is sorted before lowercase (yes, strange)

>>> L1 = ["this", "is", "a", "list", "Of", "Words"]
>>> print L1
['this', 'is', 'a', 'list', 'Of', 'Words']
>>> L1.sort()
>>> print L1
['Of', 'Words', 'a', 'is', 'list', 'this']
>>>
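If you want the usual dictionary order where case is ignored, one option (a sketch, it needs Python 2.4 or later for the key argument) is to tell sort to compare the lowercase version of each word.

```python
words = ["this", "is", "a", "list", "Of", "Words"]

# key=str.lower: compare "of" and "words" instead of "Of" and "Words".
# The list still contains the original spellings.
words.sort(key=str.lower)
print(words)
```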
ASCII order
>>> for i in range(32, 127):
...     print i, "=", chr(i)
...
32 =      56 = 8    80 = P    104 = h
33 = !    57 = 9    81 = Q    105 = i
34 = "    58 = :    82 = R    106 = j
35 = #    59 = ;    83 = S    107 = k
36 = $    60 = <    84 = T    108 = l
37 = %    61 = =    85 = U    109 = m
38 = &    62 = >    86 = V    110 = n
39 = '    63 = ?    87 = W    111 = o
40 = (    64 = @    88 = X    112 = p
41 = )    65 = A    89 = Y    113 = q
42 = *    66 = B    90 = Z    114 = r
43 = +    67 = C    91 = [    115 = s
44 = ,    68 = D    92 = \    116 = t
45 = -    69 = E    93 = ]    117 = u
46 = .    70 = F    94 = ^    118 = v
47 = /    71 = G    95 = _    119 = w
48 = 0    72 = H    96 = `    120 = x
49 = 1    73 = I    97 = a    121 = y
50 = 2    74 = J    98 = b    122 = z
51 = 3    75 = K    99 = c    123 = {
52 = 4    76 = L    100 = d   124 = |
53 = 5    77 = M    101 = e   125 = }
54 = 6    78 = N    102 = f   126 = ~
55 = 7    79 = O    103 = g
>>> for letter in "Hello":
...     print ord(letter)
...
72
101
108
108
111
>>>
Sorting Numbers
Numbers are sorted numerically
>>> L3 = [5, 2, 7, 8]
>>> L3.sort()
>>> print L3
[2, 5, 7, 8]
>>> L4 = [-7.0, 6, 3.5, -2]
>>> L4.sort()
>>> print L4
[-7.0, -2, 3.5, 6]
>>>
Sorting Both
You can sort with both numbers and strings
>>> L5 = [1, "two", 9.8, "fem"]
>>> L5.sort()
>>> print L5
[1, 9.8000000000000007, 'fem', 'two']
>>>
If you do, it usually means you’ve

designed your program poorly.
Sort returns nothing!
Sort modifies the list “in-place”
>>> L1 = "this is a list of words".split()
>>> print L1
['this', 'is', 'a', 'list', 'of', 'words']
>>> x = L1.sort()
>>> print x
None
>>> print L1
['a', 'is', 'list', 'of', 'this', 'words']
>>>
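If you do want a sorted copy back, the built-in sorted function (Python 2.4 and later) returns a new list and leaves the original alone. A short sketch comparing the two, using variable names of my own:

```python
words = "this is a list of words".split()

in_order = sorted(words)       # a new, sorted list
unchanged = (words == ["this", "is", "a", "list", "of", "words"])

returned = words.sort()        # sorts 'words' itself and returns None

print(in_order)
print(unchanged)
print(returned)
```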
Three steps for sorting
#1 - Get the list
>>> L1 = "this is a list of words".split()
>>> print L1
['this', 'is', 'a', 'list', 'of', 'words']
#2 - Sort it
>>> L1.sort()
#3 - Use the sorted list
>>> print L1
['a', 'is', 'list', 'of', 'this', 'words']
>>>
Sorting Dictionaries
Dictionary keys are unsorted
>>> D = {"ATA": 6, "TGG": 8, "AAA": 1}
>>> print D
{'AAA': 1, 'TGG': 8, 'ATA': 6}
>>>
Sorting Dictionaries
#1 - Get the list
>>> D = {"ATA": 6, "TGG": 8, "AAA": 1}
>>> print D
{'AAA': 1, 'TGG': 8, 'ATA': 6}
>>> keys = D.keys()
>>> print keys
['AAA', 'TGG', 'ATA']
>>>
#2 - Sort the list
>>> D = {"ATA": 6, "TGG": 8, "AAA": 1}
>>> print D
{'AAA': 1, 'TGG': 8, 'ATA': 6}
>>> keys = D.keys()
>>> print keys
['AAA', 'TGG', 'ATA']
>>> keys.sort()
>>> print keys
['AAA', 'ATA', 'TGG']
>>> for k in keys:
...     print k, D[k]
...
AAA 1
ATA 6
TGG 8
>>>
#3 - Use the sorted list
>>> D = {"ATA": 6, "TGG": 8, "AAA": 1}
>>> print D
{'AAA': 1, 'TGG': 8, 'ATA': 6}
>>> keys = D.keys()
>>> print keys
['AAA', 'TGG', 'ATA']
>>> keys.sort()
>>> print keys
['AAA', 'ATA', 'TGG']
>>> for k in keys:
...     print k, D[k]
...
AAA 1
ATA 6
TGG 8
>>>
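Sometimes you want the codons ordered by their counts rather than by name. A sketch using sorted with a key (again assuming Python 2.4 or later): D.get looks up the count for each key, and reverse=True puts the biggest count first.

```python
D = {"ATA": 6, "TGG": 8, "AAA": 1}

by_key = sorted(D)                              # keys in alphabetical order
by_count = sorted(D, key=D.get, reverse=True)   # keys, largest count first

for codon in by_count:
    print("%s %d" % (codon, D[codon]))
```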
More info
There is a “how-to” on sorting at
http://www.amk.ca/python/howto/sorting/sorting.html
Object Oriented Programming
o Abstraction (the idea that the roles of Waiters, Customers and Kitchens
are abstract ideas, apart from any particular instance of a Waiter; Python
refers to the class of waiters, JavaScript to the prototypical Waiter)
o Messages (the function calls that are used to interact with objects; here,
the words in the speech balloons, and also perhaps the coffee & cash)
o Overloading (Waiter's response to "A coffee", different response to "A
black coffee")
o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)
o Encapsulation (Customers, Waiters conceal their internal data, present
interfaces relating to behavior)
o Inheritance (not exactly used here, except implicitly: all types of coffee can
be drunk or spilled, all humans can speak basic English and hold cups of
coffee, etc. A better example of Inheritance is if there are different
specializations of Waiter, e.g. Head Waiter, Sommelier, etc. Then all “inherit”
the core functions of a Waiter, but with different extra functionality)
o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge,
the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is
a Factory Method, etc.
Modules
Modules are collections of objects (like strings,
numbers, functions, lists, and dictionaries)
You’ve seen the math module
>>> import math
>>> math.cos(0)
1.0
>>> math.cos(math.radians(45))
0.70710678118654746
>>> math.sqrt(2) / 2
0.70710678118654757
>>> math.hypot(5, 12)
13.0
>>>
Importing a module
The import statement tells Python to find the
module with the given name.
>>> import math

>>>
This says to import the module named ‘math’.
Using the new module
Objects in the math module are
accessed with the “dot notation”
>>> import math

>>> math.pi
3.1415926535897931
>>>
This says to get the variable named “pi”

from the math module.
Attributes
The dot notation is used for attributes, which are
also called properties.
>>> import math

>>> math.pi
3.1415926535897931
>>> math.degrees(math.pi)
180.0
>>>
“pi” and “degrees” are attributes (or properties) of
the math module.
Make a module
First, create a new file
In IDLE, click on “File” then select “New Window”.

This creates a new window.
In that window, save it to the file name

seq_functions.py
At this point the file is empty.
Add Python code
In the file “seq_functions.py” add the following
BASES = "ATCG"
def GC_content(s):
    return (s.count("G") + s.count("C")) / float(len(s))
Next, save this file (again).
Test it interactively
>>> import seq_functions
>>> seq_functions.BASES
'ATCG'
>>> seq_functions.GC_content("ATCG")
0.5
>>>
Using it from a program
Create a new file called “main.py”
Add the following code

import seq_functions
print "%GC content: ", seq_functions.GC_content(seq_functions.BASES)
Run this program. You should see 0.5 printed out.
Making changes
If you edit “seq_functions.py” then you must tell
Python to reread the statements from the module.
This does not happen automatically.
We have configured IDLE to reread all the modules

when Python runs.
If you edit a file in IDLE, you must do “Run Module”

for Python to see the changes.
Important modules:
Biopython, SQL & COM
Information sources
• python.org
• tutor list (for beginners), the Python
Package index, on-line help, tutorials, links to
other documentation, and more.
• biopython.org (and mailing list)

• newsgroup comp.lang.python
Biopython
• www.biopython.org
• Collection of many bioinformatics modules
• Some well tested, some experimental
• Check with biopython.org before writing new
software. It may already exist.
• Contribute your code (even useful scripts) to

them.
The Seq object
>>> from Bio import Seq
>>> seq = Seq.Seq("ATGCATGCATGATGATCG")
>>> print seq
Seq('ATGCATGCATGATGATCG', Alphabet())
>>>
Alphabet? What’s that?
Python doesn’t know that you gave it DNA.

(It could be a strange protein.)
Alphabets
>>> from Bio.Alphabet import IUPAC
>>> protein = Seq.Seq("ATGCATGCATGC", IUPAC.protein)
>>> dna = Seq.Seq("ATGCATGCATGC", IUPAC.unambiguous_dna)
>>> protein[:10]
Seq('ATGCATGCAT', IUPACProtein())
>>> protein[:10] + protein[::-1]
Seq('ATGCATGCATCGTACGTACGTA', IUPACProtein())
>>> dna[:6]
Seq('ATGCAT', IUPACUnambiguousDNA())
>>> dna[0]
'A'
>>> protein[:10] + dna[:6]
File "/usr/local/lib/python2.3/site-packages/Bio/Seq.py", line 45, in __add__
raise TypeError, ("incompatable alphabets", str(self.alphabet),
TypeError: ('incompatable alphabets', 'IUPACProtein()', 'IUPACUnambiguousDNA()')
>>>
Translation
>>> from Bio.Alphabet import IUPAC
>>> from Bio import Translate
>>>
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> seq = Seq.Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC",
... IUPAC.unambiguous_dna)
>>> standard_translator.translate(seq)
Seq('DRWAYIGSKI', HasStopCodon(IUPACProtein(), '*'))
>>>
Reading sequence files
We’ve put a lot of work into reading common
bioinformatics file formats. As the formats change, we
update our parsers. There’s (almost) no reason for you to
write your own GenBank, SWISS-PROT, ... parser!
Reading a FASTA file
>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> infile = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(infile, parser)
>>> record = iterator.next()
>>> record.title
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and
ITS1 and ITS2 DNA'
>>> record.sequence
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAA
TCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCC
TCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCAC
CGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGAC
TCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATC
CCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGC
TTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGA
ACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATG
GAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC'
Reading all records
>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> infile = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(infile, parser)
>>> while 1:
...     record = iterator.next()
...     if not record:
...         break
...     print record.title[record.title.find(" ")+1:]
...
C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
.... additional lines removed ....
Reading a GenBank file
(Only changed the format name)
>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> infile = open("input.gb")
>>> iterator = GenBank.Iterator(infile, parser)
>>> record = iterator.next()
>>> record.locus
'10A19I'
>>> record.organism
'Oryza sativa (japonica cultivar-group)'
>>> len(record.features)
31
>>> record.features[0].key
'source'
>>> record.features[0].location
'1..99587'
>>> record.taxonomy
['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Poales',
'Poaceae', 'Ehrhartoideae', 'Oryzeae', 'Oryza']
>>>
Get data over the web
Python includes the ‘urllib2’ module (successor to urllib) to fetch
data given a URL. It can handle GET and POST requests
for HTTP and HTTPS, do ftp, and read local files.
Several web servers have an interface for programs to get

data directly from the server instead of going through the
HTML page meant for humans. But most of the time we
have to do some screen-scraping.
NCBI’s EUtils
Designed for software to access NCBI’s databases directly.
Can query literature and sequence databases.
Biopython includes a library for working with it.
Sadly, it’s poorly documented.
The module for this is Bio.EUtils
Remote BLAST
(at NCBI)
Biopython includes a library to run a job on NCBI’s
BLAST server. To your program it looks just like a
normal function call.
The module for this is Bio.Blast.NCBIWWW
BLAST (locally)
If you instead want to use a local installation of
BLAST you can use Bio.Blast.NCBIStandalone
Microsoft/COM/Excel
Python runs on Unix, Macs, Microsoft Windows, and more.
Python support for Windows is very good.
“The most Microsoft compliant language outside Redmond.”
COM is how one program can communicate with another

under Windows. It’s very easy to have Python talk to Excel to
load or get data from a spreadsheet.
For more information, see the book “Python Programming on
Win32” by Mark Hammond and Andy Robinson.
SQL
Python can connect to different databases (like MySQL,
PostgreSQL, and Oracle). Usually there are two ways to
talk to the database: directly using the database-specific
interface or indirectly through a Python adapter which tries
to hide the differences between the databases.
There is a standard open-source database schema for
bioinformatics called BioSQL. Python supports it.
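The adapter approach can be sketched with Python's standard DB-API. This assumes Python 3 and the built-in sqlite3 module with an in-memory database, so it runs without a server. The table and column names are invented for illustration. Drivers for other databases follow the same connect/cursor/execute pattern, although the placeholder style can differ.

```python
import sqlite3

# An in-memory database; a real program would connect to a server or file.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table of sequence records.
cur.execute("CREATE TABLE sequences (name TEXT, seq TEXT)")
cur.executemany(
    "INSERT INTO sequences VALUES (?, ?)",
    [("short", "GATTACA"), ("longer", "ATCTGGAATTCATCG")],
)
conn.commit()

# Parameterized query: the driver fills in the ? placeholder safely.
cur.execute("SELECT name FROM sequences WHERE length(seq) > ?", (10,))
long_names = [row[0] for row in cur.fetchall()]
conn.close()
print(long_names)
```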
“Everything Else”
Find all substrings
We’ve learned how to find the first location of
a string in another string with find. What
about finding all matches?
Start by looking at the documentation.
S.find(sub [,start [,end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
Experiment with find
>>> seq = "aaaaTaaaTaaT"
>>> seq.find("T")
4
>>> seq.find("T", 4)
4
>>> seq.find("T", 5)
8
>>> seq.find("T", 9)
11
>>> seq.find("T", 12)
-1
>>>
How to program it?
The only loop we’ve done so far is “for”.

But we aren’t looking at every element in the list.
We need some way to jump forward and stop when done.
while statement
The solution is the while statement: while the test is true,
do its code block.
>>> pos = seq.find("T")
>>> while pos != -1:
...     print "T at index", pos
...     pos = seq.find("T", pos+1)
...
T at index 4
T at index 8
T at index 11
>>>
There’s duplication...
Duplication is bad. (Unless you’re a gene?)
The more copies there are, the more likely some
will differ from the others.
Here seq.find is called in two places:
>>> pos = seq.find("T")
>>> while pos != -1:
...     print "T at index", pos
...     pos = seq.find("T", pos+1)
...
T at index 4
T at index 8
T at index 11
>>>
The break statement
The break statement says “exit this loop immediately”
instead of waiting for the normal exit.
>>> pos = -1
>>> while 1:
...     pos = seq.find("T", pos+1)
...     if pos == -1:
...         break
...     print "T at index", pos
...
T at index 4
T at index 8
T at index 11
>>>
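The loop can be wrapped in a reusable function. This is a sketch, assuming Python 3 (print is a function); the name find_all is made up for illustration and is not a built-in string method.

```python
def find_all(seq, sub):
    """Return the start index of every occurrence of sub in seq."""
    positions = []
    pos = -1
    while True:
        # Resume one character past the previous hit, so overlapping
        # matches are found as well.
        pos = seq.find(sub, pos + 1)
        if pos == -1:
            break
        positions.append(pos)
    return positions

print(find_all("aaaaTaaaTaaT", "T"))
print(find_all("AAAA", "AA"))
```

Because there is only one call to find, there is no second copy to keep in sync.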
break in a for
A break also works in the for loop
Find the first 10 sequences in a file which have a poly-A tail

sequences = []
for line in open(filename):
    seq = line.rstrip()
    if seq.endswith("AAAAAAAA"):
        sequences.append(seq)
        # Stop as soon as we have 10 (testing "> 10" would collect 11)
        if len(sequences) >= 10:
            break
elif
Sometimes the if statement is more complex than if/else
“If the weather is hot then go to the beach. If it is
rainy, go to the movies. If it is cold, read a book.
Otherwise watch television.”
if is_hot(weather):
    go_to_beach()
elif is_rainy(weather):
    go_to_movies()
elif is_cold(weather):
    read_book()
else:
    watch_television()
tuples
Python has another fundamental data type - a tuple.
A tuple is like a list except it’s immutable (can’t be changed)
>>> data = ("Cape Town", 2004, [])
>>> print data
('Cape Town', 2004, [])
>>> data[0]
'Cape Town'
>>> data[0] = "Johannesburg"
TypeError: object doesn't support item assignment
>>> data[1:]
(2004, [])
>>>
Why tuples?
We already have a list type. What does a tuple add?
This is one of those deep computer science answers.
Tuples can be used as dictionary keys, because they are
immutable so the hash value doesn’t change (provided
every element of the tuple is itself hashable).
Tuples are used as anonymous classes and may contain
heterogeneous elements. Lists should be homogeneous
(e.g., all strings or all numbers or all sequences or...)
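A short sketch of a tuple used as a dictionary key, assuming Python 3. The chromosome names, positions, and bases are made up for illustration.

```python
# Map (chromosome, position) pairs to an observed base.
snps = {
    ("chr17", 1050): "A",
    ("chr17", 2200): "G",
}
print(snps[("chr17", 2200)])

# A list is mutable, so it cannot be hashed and cannot be a key.
try:
    snps[["chr17", 3000]] = "T"
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```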
String Formatting
So far all the output examples used the print statement. Print
puts spaces between fields, and sticks a newline at the end.
Often you’ll need to be more precise.
Python has a new definition for the “%” operator when used
with a string on the left-hand side - “string interpolation”
>>> name = "Andrew"
>>> print "%s, come here" % name
Andrew, come here
>>>
Simple string interpolation
The left side of a string interpolation is always a string.
The right side of the string interpolation may be a dictionary, a
tuple, or anything else. Let’s start with the last.
The string interpolation looks for a “%” followed by a single
character (except that “%%” means to use a single “%”). That
letter immediately following says how to interpret the
object; %s for string, %d for integer, %f for float, and a few
others.
Most of the time you’ll just use %s.
% examples
Also note some of the special formatting codes.
>>> "This is a string: %s" % "Yes, it is"
'This is a string: Yes, it is'
>>> "This is an integer: %d" % 10
'This is an integer: 10'
>>> "This is an integer: %4d" % 10
'This is an integer:   10'
>>> "This is an integer: %04d" % 10
'This is an integer: 0010'
>>> "This is a float: %f" % 9.8
'This is a float: 9.800000'
>>> "This is a float: %.2f" % 9.8
'This is a float: 9.80'
>>>
string % tuple
To convert multiple values, use a tuple on the right.
(Tuple because it can be heterogeneous)
Objects are extracted left to right. First % gets the first
element in the tuple, second % gets the second, etc.
>>> "Name: %s, age: %d, language: %s" % ("Andrew", 33, "Python")
'Name: Andrew, age: 33, language: Python'
>>>
The number of % fields and tuple length must match.

>>> "Name: %s, age: %d, language: %s" % ("Andrew", 33)
TypeError: not enough arguments for format string
>>>
string % dictionary
When the right side is a dictionary, the left side must
include a name, which is used as the key.
>>> d = {"name": "Andrew",
... "age": 33,
... "language": "Python"}
>>>
>>> print "%(name)s is %(age)s years old. Yes, %(age)s." % d
Andrew is 33 years old. Yes, 33.
>>>
A %(name)s field may be repeated, and the dictionary
size and % count don’t need to match.
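The tuple and dictionary forms can be combined to build aligned report lines. This is a sketch, assuming Python 3; the record list and its positions are made up for illustration.

```python
records = [("EcoRI", "GAATTC", 5), ("HindIII", "AAGCTT", 15)]

lines = []
for name, site, pos in records:
    # %-8s left-justifies in 8 columns; %4d right-justifies in 4.
    lines.append("%-8s %s at %4d" % (name, site, pos))

# Dictionary form: fields are looked up by name, in any order.
summary = "%(count)d sites, first is %(first)s" % {
    "count": len(records),
    "first": records[0][0],
}
print("\n".join(lines))
print(summary)
```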
Writing files
Opening a file for writing is very similar to
opening one for reading.
>>> infile = open("sequences.seq")
>>> outfile = open("sequences_small.seq", "w")
The "w" argument opens the file for writing.
The write method
>>> infile = open("sequences.seq")
>>> outfile = open("sequences_small.seq", "w")
>>> for line in infile:
...     seq = line.rstrip()
...     if len(seq) < 1000:
...         outfile.write(seq)
...         outfile.write("\n")
...
>>> outfile.close()
>>> infile.close()
>>>
I need to write my own newline.
The close is optional, but good style. Don’t
fret too much about it.
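One way to stop fretting about close is the with statement, which closes the files when the block ends. This is a sketch, assuming Python 3; the function name keep_short, the temporary directory, and the test sequences are made up for illustration.

```python
import os
import tempfile

def keep_short(in_path, out_path, max_len=1000):
    """Copy sequences shorter than max_len from in_path to out_path."""
    kept = 0
    # The with block closes both files, even if an error occurs.
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            seq = line.rstrip()
            if len(seq) < max_len:
                outfile.write(seq + "\n")  # write() adds no newline
                kept += 1
    return kept

# Try it in a throwaway directory with made-up sequences.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sequences.seq")
dst = os.path.join(workdir, "sequences_small.seq")
with open(src, "w") as handle:
    handle.write("GATTACA\n" + "A" * 1500 + "\nCCGG\n")

kept = keep_short(src, dst)
with open(dst) as handle:
    result = handle.read()
print(kept, repr(result))
```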
Command-line arguments
Python gives you access to the list of Unix command-line
arguments through sys.argv, which is a normal Python list.
% cat show_args.py
import sys
print sys.argv
% python show_args.py
['show_args.py']
% python show_args.py 2 3
['show_args.py', '2', '3']
% python show_args.py "Hello, World"
['show_args.py', 'Hello, World']
%
Parsing options
from optparse import OptionParser
[...]
parser = OptionParser()
parser.add_option("-f", "--file", dest="filename",
                  help="write report to FILE", metavar="FILE")
parser.add_option("-q", "--quiet",
                  action="store_false", dest="verbose", default=True,
                  help="don't print status messages to stdout")
(options, args) = parser.parse_args()
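The standard library also has a newer module, argparse, that covers the same ground. Below is a sketch of the same two options, assuming Python 3. The argument list passed to parse_args and the extra positional argument "inputs" are made up so the example runs without a real command line.

```python
import argparse

parser = argparse.ArgumentParser(description="Example option parsing")
parser.add_argument("-f", "--file", dest="filename", metavar="FILE",
                    help="write report to FILE")
parser.add_argument("-q", "--quiet", action="store_false",
                    dest="verbose", default=True,
                    help="don't print status messages to stdout")
parser.add_argument("inputs", nargs="*", help="positional arguments")

# With no argument, parse_args() would read sys.argv[1:] instead.
options = parser.parse_args(["-q", "--file", "report.txt", "a.seq"])
print(options.filename, options.verbose, options.inputs)
```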
Algorithmic complexity
Big O notation
• In discussing the resource usage of algorithms, it is often
useful to consider asymptotic behavior (e.g. on very large
datasets), rather than every detail
• For example, suppose you have N items and you want to
compare every item to every other item.
This involves N(N-1)/2 comparisons, but it’s often more
illuminating to say it involves “of order N^2” comparisons.
Or, more technically, O(N^2) comparisons.
• Formally, a function f(x) is said to be O(g(x)) if there are
constants K and x0 such that
f(x) < K*g(x) for all x>x0
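The all-against-all example can be checked by brute force. This is a sketch, assuming Python 3; the function name count_comparisons is made up for illustration.

```python
from itertools import combinations

def count_comparisons(n):
    """Count the unordered pairs among n items by enumerating them."""
    return sum(1 for _ in combinations(range(n), 2))

for n in (10, 20, 40, 80):
    pairs = count_comparisons(n)
    # Columns: N, counted pairs, the formula N(N-1)/2, and pairs / N^2.
    print(n, pairs, n * (n - 1) // 2, pairs / n**2)
```

The last column settles toward a constant as N grows, which is the sense in which the count is "of order N^2".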
Big O notation
• Example: if f(x) is a polynomial, f(x) = a + bx + cx^2 + ... + kx^m,
then f(x) = O(x^m) ... only the highest power matters
• Asymptotically (for large x), f(x) will be dominated by the
x^m term. For questions like “how does f(x) change when we
double x”, the coefficient k does not matter.
• In practice, the coefficient k, as well as the lower powers
of x, might matter quite a lot. But in some sense they’re
“details”...
• Note that some functions can’t be written as finite
polynomials (e.g. log(x), exp(x), ...)
[Figure: f(x) lies below K*g(x) for all x > x0]
It comes down to how fast the function f(x) grows for large x
Big O notation
• We are often specifically interested in using Big O
notation to describe...
• Runtime complexity - the time an algorithm takes to run
(note that we can only specify this to within a scaling
coefficient that depends on the CPU clock rate of the
hardware; Big O notation removes this scaling
coefficient from the picture)
• Memory complexity - the amount of RAM that an
algorithm uses (again, this depends on hardware details -
do we use 8, 16, 32 or 64 bits to store an integer? Big O
notation saves us from such details)
Simplification rules
• If f(x) is a sum of several terms, the one with the
largest growth rate is kept, and all others
omitted.
• If f(x) is a product of several factors, any constants
(factors in the product that do not depend on x) are
omitted.
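The two rules can be seen numerically. This is a sketch, assuming Python 3; the polynomial is made up for illustration, and by the rules above it simplifies to O(x^2).

```python
def f(x):
    # A made-up polynomial: 3x^2 + 5x + 2
    return 3 * x**2 + 5 * x + 2

for x in (10, 1000, 100000):
    # Doubling x multiplies f by nearly 4, just as for x^2 itself;
    # the coefficient 3 and the lower-order terms drop out.
    print(x, f(2 * x) / f(x))
```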
Some common Big-O formulae
[Table of common growth rates, grouped as super-linear, linear, and sub-linear]
exp(n) grows faster than any power n^k
n! grows even faster
Sorting and complexity
To find an item in a sorted list of N items takes
about log2(N) steps (“binary search” algorithm)

Similarly, to find the correct place to insert a
new item takes about log2(N) steps

To place N items correctly takes of order N*log2(N) steps

To sort N items takes of order N*log2(N) steps
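A sketch of binary search that also counts its steps, assuming Python 3. The function name and the test data are made up for illustration; the standard library's bisect module offers a ready-made version of the search itself.

```python
def binary_search(items, target):
    """Return (index, steps); index is -1 if target is absent.

    items must already be sorted.
    """
    lo, hi = 0, len(items)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1   # discard the lower half
        else:
            hi = mid       # discard the upper half
    return -1, steps

data = list(range(0, 2000, 2))   # 1000 sorted even numbers
print(binary_search(data, 1998))
print(binary_search(data, 7))
```

Each step at least halves the range still under consideration, so for 1000 items the step count stays near log2(1000), which is about 10.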
Containers and complexity
A linked list is easy to build and modify,
but takes O(N) time to scan N items

Arrays (cf. Python lists) are constant-time to access,
O(log N) to search (if pre-sorted),
O(N log N) to sort, and O(N) to insert/delete

Balanced trees are O(log N) to
search and modify, but a little
slower in practice than hashtables

Hashtables (cf. Python dictionaries) have O(N)
worst-case behavior for many operations, but
typical case is often better, e.g. O(1)
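The difference between scanning a list and a hashtable lookup can be made visible by counting equality comparisons rather than timing. This is a sketch, assuming Python 3; the CountedKey class is made up for illustration, and the exact counts depend on the interpreter's implementation.

```python
class CountedKey:
    """A key that counts how often it is compared for equality."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        CountedKey.comparisons += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)

keys = [CountedKey(i) for i in range(1000)]
as_list = keys
as_dict = {key: True for key in keys}
probe = CountedKey(999)   # equal to the last key, but a new object

# Unsorted list: membership scans the elements one by one.
CountedKey.comparisons = 0
found_in_list = probe in as_list
list_comparisons = CountedKey.comparisons

# Dictionary: the hash value leads almost directly to the entry.
CountedKey.comparisons = 0
found_in_dict = probe in as_dict
dict_comparisons = CountedKey.comparisons

print(found_in_list, list_comparisons)
print(found_in_dict, dict_comparisons)
```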

