131 Scripting (Python) PDF
Monday, September 9, 13
Strings in Python
Computers store text as strings
>>> s = "GATTACA"
index  0 1 2 3 4 5 6
s      G A T T A C A
Why are strings important?
• Sequences are strings
• ..catgaaggaa ccacagccca gagcaccaag ggctatccat..
Special Characters and Escape Sequences
Backslashes (\) are used to introduce special characters
>>> print s
Okay, there's a small one.
Some special characters
Escape Sequence   Meaning
\\                Backslash (keeps a \)
\'                Single quote (keeps the ')
\"                Double quote (keeps the ")
\n                Newline
\t                Tab
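A minimal sketch of the escape sequences listed above. It is written for Python 3, where print is a function (the slides use the Python 2 print statement), and the variable names are only illustrative. Each escape is typed as two characters in the source but stored as a single character in the string.

```python
# Each escape sequence is two characters in the source, one in the string.
backslash = "\\"
single = 'It\'s'
double = "say \"hi\""
newline = "GAT\nTACA"
tab = "GAT\tTACA"

lengths = [len(backslash), len(single), len(double), len(newline), len(tab)]
print(lengths)
print(newline)   # the \n splits the output over two lines
```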
Working with strings
>>> len("GATTACA")              # length
7
>>> "GAT" + "TACA"              # concatenation
'GATTACA'
>>> "A" * 10                    # repeat
'AAAAAAAAAA'
>>> "G" in "GATTACA"            # substring test
True
>>> "GAT" in "GATTACA"
True
>>> "AGT" in "GATTACA"
False
>>> "GATTACA".find("ATT")       # substring location
1
>>> "GATTACA".count("T")        # substring count
2
>>>
Converting from/to strings
>>> "38" + 5
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: cannot concatenate 'str' and 'int' objects
>>> int("38") + 5
43
>>> "38" + str(5)
'385'
>>> int("38"), str(5)
(38, '5')
>>> int("2.71828")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: invalid literal for int(): 2.71828
>>> float("2.71828")
2.71828
>>>
Change a string?
Strings cannot be modified
They are immutable
Instead, create a new one
>>> s = "GATTACA"
>>> s[3] = "C"
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s = s[:3] + "C" + s[4:]
>>> s
'GATCACA'
>>>
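The "create a new one" idea can be wrapped in a small helper. This is only a sketch for Python 3; replace_at is a hypothetical name, not a built-in string method.

```python
def replace_at(text, index, new_char):
    """Return a new string equal to text with one character swapped."""
    # Slicing copies the parts we keep; the original string is never changed.
    return text[:index] + new_char + text[index + 1:]

s = "GATTACA"
mutated = replace_at(s, 3, "C")
print(mutated)
print(s)   # the original string is untouched
```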
Some more methods
>>> "GATTACA".lower()
'gattaca'
>>> "gattaca".upper()
'GATTACA'
>>> "GATTACA".replace("G", "U")
'UATTACA'
>>> "GATTACA".replace("C", "U")
'GATTAUA'
>>> "GATTACA".replace("AT", "**")
'G**TACA'
>>> "GATTACA".startswith("G")
True
>>> "GATTACA".startswith("g")
False
>>>
Ask for a string
The Python function “raw_input” asks
the user (that’s you!) for a string
>>> seq = raw_input("Enter a DNA sequence: ")
Enter a DNA sequence: ATGTATTGCATATCGT
>>> seq.count("A")
4
>>> print "There are", seq.count("T"), "thymines"
There are 7 thymines
>>> "ATA" in seq
True
>>> substr = raw_input("Enter a subsequence to find: ")
Enter a subsequence to find: GCA
>>> substr in seq
True
>>>
Variables and References in Python
Names, Objects, References
>>> s = "TAGAGAATTCTA"        # the name s references the object "TAGAGAATTCTA"
>>> t = "GAAT"                # the name t references the object "GAAT"
>>> i = s.find(t)             # strings have a "find" method; i references the object 4
>>> print i
4
>>> t = s                     # t now references the same object as s
>>> a
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 'a' is not defined
>>> s = "GA"                  # s now references the new object "GA"; t is unchanged
>>> print t
TAGAGAATTCTA
>>> del t                     # removes the name t
>>> print t
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
NameError: name 't' is not defined
>>>
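The same sequence of steps can be checked in code. A minimal sketch for Python 3; the helper names same_object, t_after and t_defined are made up for the illustration.

```python
s = "TAGAGAATTCTA"
t = s                      # t and s now name the same object
same_object = t is s

s = "GA"                   # rebinding s does not touch the object t names
t_after = t

del t                      # removes the name, not the string object
try:
    t
    t_defined = True
except NameError:
    t_defined = False

print(same_object, t_after, s, t_defined)
```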
Names, Objects, References
>>> L1 = [2, 4]
>>> L1.append(5)              # L1 references [2, 4, 5]
>>> L2 = L1                   # L2 references the same list object as L1
>>> L2.append(7)              # the one shared list is now [2, 4, 5, 7]
>>> print L1
[2, 4, 5, 7]
>>> print L2
[2, 4, 5, 7]
>>> del L2[1]                 # the shared list is now [2, 5, 7]
>>> L2 = L1[:]                # L2 references a new copy, [2, 5, 7]
>>> L1 == L2
True
>>> L1.reverse()              # L1 is now [7, 5, 2]; the copy L2 is still [2, 5, 7]
>>> L1 == L2
False
>>>
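A short sketch of the difference between a second name for a list and a copy of it, written for Python 3. The names alias and copy are illustrative.

```python
L1 = [2, 4, 5]
alias = L1          # a second name for the same list object
copy = L1[:]        # a slice builds a new list with the same elements

alias.append(7)     # the change is visible through L1 as well

print(L1)
print(copy)
print(alias is L1, copy is L1)
```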
Lists and the ‘for’ loop
Lists
Lists are an ordered collection of objects
>>> data = []                     # make an empty list
>>> print data
[]
>>> data.append("Hello!")         # "append" == "add to the end"
>>> print data
['Hello!']
>>> data.append(5)                # you can put different objects in the same list
>>> print data
['Hello!', 5]
>>> data.append([9, 8, 7])
>>> print data
['Hello!', 5, [9, 8, 7]]
>>> data.extend([4, 5, 6])        # "extend" appends each element of the new list to the old one
>>> print data
['Hello!', 5, [9, 8, 7], 4, 5, 6]
>>>
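Side by side, the difference between append and extend looks like this. A minimal Python 3 sketch with made-up list names.

```python
a = [1, 2]
a.append([3, 4])    # adds ONE element, which is itself a list

b = [1, 2]
b.extend([3, 4])    # adds each element of the argument

print(a, len(a))
print(b, len(b))
```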
Lists and strings are similar
Strings
>>> s = "ATCG"
>>> print s[0]
A
>>> print s[-1]
G
>>> print s[2:]
CG
>>> print "C" in s
True
>>> s * 3
'ATCGATCGATCG'
>>> s[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: string index out of range
>>>
Lists
>>> L = ["adenine", "thymine", "cytosine", "guanine"]
>>> print L[0]
adenine
>>> print L[-1]
guanine
>>> print L[2:]
['cytosine', 'guanine']
>>> print "cytosine" in L
True
>>> L * 3
['adenine', 'thymine', 'cytosine', 'guanine',
'adenine', 'thymine', 'cytosine', 'guanine',
'adenine', 'thymine', 'cytosine', 'guanine']
>>> L[9]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: list index out of range
>>>
But lists are mutable
Lists can be changed. Strings are immutable.
Lists can hold any object
>>> L = ["", 1, "two", 3.0, ["quatro", "fem", [6j], []]]
>>> len(L)
5
>>> print L[-1]
['quatro', 'fem', [6j], []]
>>> len(L[-1])
4
>>> print L[-1][-1]
[]
>>> len(L[-1][-1])
0
>>>
A few more methods
>>> L = ["thymine", "cytosine", "guanine"]
>>> L.insert(0, "adenine")
>>> print L
['adenine', 'thymine', 'cytosine', 'guanine']
>>> L.insert(2, "uracil")
>>> print L
['adenine', 'thymine', 'uracil', 'cytosine', 'guanine']
>>> print L[:2]
['adenine', 'thymine']
>>> L[:2] = ["A", "T"]
>>> print L
['A', 'T', 'uracil', 'cytosine', 'guanine']
>>> L[:2] = []
>>> print L
['uracil', 'cytosine', 'guanine']
>>> L[:] = ["A", "T", "C", "G"]
>>> print L
['A', 'T', 'C', 'G']
>>>
Turn a string into a list
>>> s = "AAL532906 aaaatagtcaaatatatcccaattcagtatgcgctgagta"
>>> i = s.find(" ")             # complicated
>>> print i
9
>>> print s[:i]
AAL532906
>>> print s[i+1:]
aaaatagtcaaatatatcccaattcagtatgcgctgagta
>>>
>>> fields = s.split()          # easier!
>>> print fields
['AAL532906', 'aaaatagtcaaatatatcccaattcagtatgcgctgagta']
>>> print fields[0]
AAL532906
>>> print len(fields[1])
40
>>>
More split examples
split() uses ‘whitespace’ to find each word
>>> protein = "ALA PRO ILU CYS"
>>> residues = protein.split()
>>> print residues
['ALA', 'PRO', 'ILU', 'CYS']
>>>
>>> protein = " ALA PRO ILU CYS \n"
>>> print protein.split()
['ALA', 'PRO', 'ILU', 'CYS']
Turn a list into a string
join is the opposite of split
>>> L1 = ["Asp", "Gly", "Gln", "Pro", "Val"]
>>> print "-".join(L1)
Asp-Gly-Gln-Pro-Val
>>> print "**".join(L1)
Asp**Gly**Gln**Pro**Val
>>> print "\n".join(L1)
Asp
Gly
Gln
Pro
Val
>>>
The order is confusing.
- string to join is first
- list to be joined is second
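A small round trip through split and join, as a Python 3 sketch. The extra spaces in the input are an assumption added to show that split() with no argument collapses runs of whitespace, so the round trip tidies the string rather than reproducing it exactly.

```python
protein = "  ALA PRO   ILU CYS \n"

residues = protein.split()        # whitespace of any length separates words
rejoined = " ".join(residues)     # separator string first, list second
dashed = "-".join(residues)

print(residues)
print(rejoined)
print(dashed)
```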
The ‘for’ loop
Lets you do something to each
element in a list
A two line block
All lines in the same code block
must have the same indentation
>>> for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
...     print "Hello,", name
...     print "Your name is", len(name), "letters long"
...
Hello, Andrew
Your name is 6 letters long
Hello, Tsanwani
Your name is 8 letters long
Hello, Arno
Your name is 4 letters long
Hello, Tebogo
Your name is 6 letters long
>>>
When indentation does not match
>>> a = 1
>>>   a = 1
  File "<stdin>", line 1
    a = 1
    ^
SyntaxError: invalid syntax
>>> for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
...     print "Hello,", name
...         print "Your name is", len(name), "letters long"
  File "<stdin>", line 3
    print "Your name is", len(name), "letters long"
    ^
SyntaxError: invalid syntax
>>> for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
...     print "Hello,", name
...   print "Your name is", len(name), "letters long"
  File "<stdin>", line 3
    print "Your name is", len(name), "letters long"
    ^
IndentationError: unindent does not match any outer indentation level
>>>
‘for’ works on strings
A string is similar to a list of letters
>>> seq = "ATGCATGTCGC"
>>> for letter in seq:
...     print "Base:", letter
...
Base: A
Base: T
Base: G
Base: C
Base: A
Base: T
Base: G
Base: T
Base: C
Base: G
Base: C
>>>
Numbering bases
>>> seq = "ATGCATGTCGC"
>>> n = 0
>>> for letter in seq:
...     print "base", n, "is", letter
...     n = n + 1
...
base 0 is A
base 1 is T
base 2 is G
base 3 is C
base 4 is A
base 5 is T
base 6 is G
base 7 is T
base 8 is C
base 9 is G
base 10 is C
>>>
>>> print "The sequence has", n, "bases"
The sequence has 11 bases
>>>
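The slide keeps its own counter n. As an alternative (not what the slide does), the built-in enumerate hands back the index and the letter together. A Python 3 sketch:

```python
seq = "ATGCATGTCGC"

lines = []
for n, letter in enumerate(seq):      # n starts at 0, like the manual counter
    lines.append("base %d is %s" % (n, letter))

print("\n".join(lines))
print("The sequence has %d bases" % len(seq))
```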
The range function
>>> range(5)
[0, 1, 2, 3, 4]
>>> range(8)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> range(2, 8)
[2, 3, 4, 5, 6, 7]
>>> range(0, 8, 1)
[0, 1, 2, 3, 4, 5, 6, 7]
>>> range(0, 8, 2)
[0, 2, 4, 6]
>>> range(0, 8, 3)
[0, 3, 6]
>>> range(0, 8, 4)
[0, 4]
>>> range(0, 8, -1)
[]
>>> range(8, 0, -1)
[8, 7, 6, 5, 4, 3, 2, 1]
>>> help(range)
Help on built-in function range:

range(...)
    range([start,] stop[, step]) -> list of integers

    Return a list containing an arithmetic progression of integers.
    range(i, j) returns [i, i+1, i+2, ..., j-1]; start (!) defaults to 0.
    When step is given, it specifies the increment (or decrement).
    For example, range(4) returns [0, 1, 2, 3].  The end point is omitted!
    These are exactly the valid indices for a list of 4 elements.
>>>
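The transcript above is from Python 2, where range returns a list. In Python 3, range returns a lazy range object instead, so this sketch wraps it in list() to see the same values.

```python
evens = list(range(0, 8, 2))
countdown = list(range(8, 0, -1))
empty = list(range(0, 8, -1))     # a negative step cannot go from 0 up to 8

seq = "ATCG"
indices = list(range(len(seq)))   # exactly the valid indices of seq

print(evens, countdown, empty, indices)
```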
Do something ‘N’ times
>>> for i in range(3):
...     print "If I tell you three times it must be true."
...
If I tell you three times it must be true.
If I tell you three times it must be true.
If I tell you three times it must be true.
>>>
>>> for i in range(4):
...     print i, "squared is", i*i, "and cubed is", i*i*i
...
0 squared is 0 and cubed is 0
1 squared is 1 and cubed is 1
2 squared is 4 and cubed is 8
3 squared is 9 and cubed is 27
>>>
Stepping through a ‘for’ loop
for name in ["Andrew", "Tsanwani", "Arno", "Tebogo"]:
    print "Hello,", name
print "The end."
At the beginning - run module
Start with the first line - the ‘for’ statement
Look at the list
Is it empty? No. Start with the first object
Assign the first object to the variable ‘name’ (name now references “Andrew”)
Then start the first line of the code block
This is the ‘print’ statement
Print the string object “Hello,” and the value of the variable with name ‘name’ (output: Hello, Andrew)
The print statement is finished. Python is at the end of the code block...
... so go to the ‘for’ statement again
Move the list pointer forward by one
Is it past the end? No, it’s on the second item.
Assign the second object to the variable ‘name’ (name now references “Tsanwani”)
Then start the first line of the code block and print again (output: Hello, Tsanwani)
The same steps repeat for the third item (name references “Arno”, output: Hello, Arno)
and for the fourth item (name references “Tebogo”, output: Hello, Tebogo)
After the fourth pass, go to the ‘for’ statement again and move the list pointer forward by one
Is it past the end? Yes!
So skip past the code block to the next statement.
This is another print statement
It prints the string “The end.”
Python looks for the next statement...
... but there isn’t any, so Python stops.
Final Program Output
Hello, Andrew
Hello, Tsanwani
Hello, Arno
Hello, Tebogo
The end.
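The walk-through can be spelled out by hand with the built-ins iter() and next(). This is only a rough approximation of what the for statement does, written for Python 3; the names pointer and output are made up for the sketch.

```python
names = ["Andrew", "Tsanwani", "Arno", "Tebogo"]
output = []

pointer = iter(names)              # "look at the list"
while True:
    try:
        name = next(pointer)       # move forward and assign the object to 'name'
    except StopIteration:          # "is it past the end? Yes!"
        break
    output.append("Hello, " + name)    # the code block

output.append("The end.")          # the statement after the loop
print("\n".join(output))
```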
The if statement and files
The if statement
Do a code block only when something is True
if test:
    print "The expression is true"
Example
if "GAATTC" in "ATCTGGAATTCATCG":
    print "EcoRI site is present"
If the test is true, then print the message.
What if you want the false case?
There are several possibilities; here are two
In the Python shell
>>> x = True
>>> x
True
>>> not x
False
>>> not not x
True
>>> if "GAATTC" not in "AAAAAAAAA":
...     print "EcoRI will not cut the sequence"
...
EcoRI will not cut the sequence
>>> if not "GAATTC" in "ATCTGGAATTCATCG":
...     print "EcoRI will not cut the sequence"
...
>>> if not "GAATTC" in "AAAAAAAAA":
...     print "EcoRI will not cut the sequence"
...
EcoRI will not cut the sequence
>>>
else:
What if you want to do one thing when the test is
true and another thing when the test is false?
Examples with else
>>> if "GAATTC" in "ATCTGGAATTCATCG":
...     print "EcoRI site is present"
... else:
...     print "EcoRI will not cut the sequence"
...
EcoRI site is present
>>> if "GAATTC" in "AAAACTCGT":
...     print "EcoRI site is present"
... else:
...     print "EcoRI will not cut the sequence"
...
EcoRI will not cut the sequence
>>>
Where is the site?
The ‘find’ method of strings returns the index of a substring
in the string, or -1 if the substring doesn’t exist
But where is the site?
>>> seq = "ATCTGGAATTCATCG"
>>> pos = seq.find("GAATTC")
>>> if pos == -1:
...     print "EcoRI does not cut the sequence"
... else:
...     print "EcoRI site starting at index", pos
...
EcoRI site starting at index 5
>>>
Stepping through the example
seq = "ATCTGGAATTCATCG"
pos = seq.find("GAATTC")
if pos == -1:
    print "EcoRI does not cut the sequence"
else:
    print "EcoRI site starting at index", pos
Start by creating the string “ATCTGGAATTCATCG” and assigning it to the variable with name ‘seq’
Using the seq string, call the method named find. This looks for the string “GAATTC” in the seq string
The string “GAATTC” is at position 5 in the seq string. Assign the 5 object to the variable named pos.
Do the test for the if statement. Is the variable pos equal to -1?
Since pos is 5 and 5 is not equal to -1, this test is False.
Skip the first code block (that is only run if the test is True)
Instead, run the code block after the else:
This is a print statement. Print the index of the start position
This prints: EcoRI site starting at index 5
There are no more statements so Python stops.
A more complex example
Using if inside a for
restriction_sites = [
    "GAATTC", # EcoRI
    "GGATCC", # BamHI
    "AAGCTT", # HindIII
]
Nested code blocks
for site in restriction_sites:
    if site in seq:
        print site, "is a cleavage site"
    else:
        print site, "is not present"
The indented if/else is the code block for the for statement
The program output
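A minimal runnable sketch of the nested blocks, written for Python 3. It assumes seq is the sequence from the earlier EcoRI example, since the slides do not show which sequence was used here; the report list is also just for illustration.

```python
restriction_sites = [
    "GAATTC",  # EcoRI
    "GGATCC",  # BamHI
    "AAGCTT",  # HindIII
]
seq = "ATCTGGAATTCATCG"   # assumption: borrowed from the earlier EcoRI slides

report = []
for site in restriction_sites:
    # This if/else is the code block of the for statement.
    if site in seq:
        report.append(site + " is a cleavage site")
    else:
        report.append(site + " is not present")

print("\n".join(report))
```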
Read lines from a file
The open function
>>> infile = open("/home/myusername/my_sequences.seq")
>>> print infile
<open file '/home/myusername/my_sequences.seq', mode 'r' at 0x817ca60>
>>>
The readline() method
>>> infile.readline()
'CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>>
readline finishes with ""
>>> infile.readline()
'ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA\n'
>>> infile.readline()
'TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT\n'
>>> infile.readline()
'CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT\n'
>>> infile.readline()
'TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
>>> infile.readline()
'CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA\n'
>>> infile.readline()
'GTGCTCTCTTCTCGGAGAGAGAAGGTGGGCTGCTTGTCTGCCGATGTACTTTATTAAATCCAATAA\n'
>>> infile.readline()
'CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA\n'
>>> infile.readline()
'TCTGAAAAGTGCAAAGAACGATGATGATGATGATAGAGGAACCTGAGCAGCCATGTCTGAACCTATAGC\n'
>>> infile.readline()
'GTATTGGTCGTCGTGCGACTAAATTAGGTAAAAAAGTAGTTCTAAGAGATTTTGATGATTCAATGCAAAGTTCTATTAATCGTTCAATTG\n'
>>> infile.readline()
''
>>>
When there are no more lines, readline returns the empty string
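The empty string makes a handy stop signal for a loop. The sketch below is for Python 3 and writes a tiny stand-in file first so that it can run anywhere; the file contents are hypothetical, not the course data.

```python
import os
import tempfile

# Write a small stand-in file (hypothetical contents, not the course data).
with tempfile.NamedTemporaryFile("w", suffix=".seq", delete=False) as tmp:
    tmp.write("CCTGTATTAG\nATTTTTAACT\n")
    path = tmp.name

lines = []
infile = open(path)
while True:
    line = infile.readline()
    if line == "":          # the empty string means there are no more lines
        break
    lines.append(line)      # every real line still ends with "\n"
infile.close()
os.remove(path)

print(lines)
```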
Using for with a file
A simple way to read lines from a file
>>> filename = "/home/myusername/my_sequences.seq"
>>> for line in open(filename):
...     print line[:10]
...
CCTGTATTAG
ATTTTTAACT
TTGTTTCTGC
CACACCCAAT
TCTTCTCCAA
CCACACCAAA
GTGCTCTCTT
CCACACCAAA
TCTGAAAAGT
GTATTGGTCG
>>>
for starts with the first line in the file, then the second, then the third, ... and finishes with the last line.
A more complex task
List the sequences starting with a cytosine
>>> filename = "/home/myusername/my_sequences.seq"
>>> for line in open(filename):
...     line = line.rstrip()        # use rstrip to get rid of the "\n"
...     if line.startswith("C"):
...         print line
...
CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
CCACACCAAAAAAACTTTCCACGTGAACCGAAAACGAAAGTCTTTGGTTTTAATCAATAA
CCACACCAAAAAAACTTTCCACGTGTGAACTATACTCCAAAAACGAAGTATTGGTTTATCATAA
>>>
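The same filter as a self-contained Python 3 sketch. The stand-in file holds only the first ten bases of four of the lines shown earlier, and the with statement is an addition that closes the file automatically; neither is part of the original slide.

```python
import os
import tempfile

# Hypothetical stand-in data: short prefixes instead of the full course file.
records = ["CCTGTATTAG", "ATTTTTAACT", "TTGTTTCTGC", "CACACCCAAT"]

with tempfile.NamedTemporaryFile("w", suffix=".seq", delete=False) as tmp:
    tmp.write("\n".join(records) + "\n")
    path = tmp.name

starts_with_c = []
with open(path) as infile:          # the file is closed when the block ends
    for line in infile:
        line = line.rstrip()        # drop the trailing "\n"
        if line.startswith("C"):
            starts_with_c.append(line)

os.remove(path)
print(starts_with_c)
```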
Searching and Regular Expressions
Proteins
• 20 amino acids
• Interesting structures
• beta barrel, greek key motif, EF hand ...
• Bind, move, catalyze, recognize, block, ...
• Many post-translational modifications
• Structure/function strongly influenced by sequence
Sequence Suggests Structure/Function
When working with tumors you find the p53 tumor antigen,
which is found in increased amounts in transformed cells.
Finding a string
We’ve covered several ways to find a
substring in a larger string.
site in sequence -- test if the substring site is found
anywhere in the sequence
Is it a p53 sequence?
>>> p53 = "MCNSSCMGGMNRR"
>>> protein = "SEFTTVLYNFMCNSSCMGGMNRRPILTIIS"
>>> protein.find(p53)
10
>>> protein[10:10+len(p53)]
'MCNSSCMGGMNRR'
>>>
p53 needs more than one test substring
After a while you find that p53s are variable in one residue.
MCNSSCMGGMNRR
or
MCNSSCVGGMNRR
You could test for both cases, but as you add more
possibilities the number of patterns gets really
large, and writing them out is tedious.
Need a pattern
Rather than write each alternative, perhaps we can write a
pattern, which is used to describe all the strings to test.
MCNSSCMGGMNRR or MCNSSCVGGMNRR
can be written as the single pattern
MCNSSC[MV]GGMNRR
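Using Python's re module, the bracket notation can be tried directly. A Python 3 sketch; the third candidate (with an A in the variable position) is a made-up negative example.

```python
import re

pattern = "MCNSSC[MV]GGMNRR"
candidates = [
    "MCNSSCMGGMNRR",
    "MCNSSCVGGMNRR",
    "MCNSSCAGGMNRR",   # hypothetical: A is neither M nor V
]

hits = [c for c in candidates if re.search(pattern, c)]
print(hits)
```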
PROSITE
PROSITE is a database of protein patterns.
http://au.expasy.org/prosite/
ANTENNAPEDIA
'Homeobox' antennapedia-type protein signature.
[LIVMFE][FY]PWM[KRQTA]
Look for a substring which:
Starts with L, I, V, M, F, or E
Then has an F or Y
Then the letter P
Followed by a W
Followed by an M
And ending with a K, R, Q, T, or A
Find ANTENNAPEDIA
Can you find [LIVMFE][FY]PWM[KRQTA] ?
MDPDCFAMSS YQFVNSLASC YPQQMNPQQN HPGAGNSSAG GSGGGAGGSG GVVPSGGTNG
GQGSAGAATP GANDYFPAAA AYTPNLYPNT PQPTTPIRRL ADREIRIWWT TRSCSRSDCS
CSSSSNSNSS NMPMQRQSCC QQQQQLAQQQ HPQQQQQQQQ ANISCKYAND PVTPGGSGGG
GVSGSNNNNN SANSNNNNSQ SLASPQDLST RDISPKLSPS SVVESVARSL NKGVLGGSLA
AAAAAAGLNN NHSGSGVSGG PGNVNVPMHS PGGGDSDSES DSGNEAGSSQ NSGNGKKNPP
QIYPWMKRVH LGTSTVNANG ETKRQRTSYT RYQTLELEKE FHFNRYLTRR RRIEIAHALC
LTERQIKIWF QNRRMKWKKE HKMASMNIVP YHMGPYGHPY HQFDIHPSQF AHLSA
Sequences with the ANTENNAPEDIA motif
Here are some sequences which contain substrings
which fit the pattern
[LIVMFE][FY]PWM[KRQTA]
...LHNEANLRIYPWMRSAGADR...
...PTVGKQIFPWMKES...
...VFPWMKMGGAKGGESKRTR...
Not a given residue
Suppose you know from structural reasons that a
residue cannot be a proline. You could write
[ACDEFGHIKLMNQRSTVWY]
That’s tedious, so let’s use a new notation
[^P]
This matches anything which is not a proline.
(Yes, using the ^ is strange. That’s the way it is.)
N-glycosylation site
This is the pattern for PS00001, ASN_GLYCOSYLATION
N[^P][ST][^P]
Match an N,
Then anything which isn’t a P,
Then an S or T,
And finally, anything which isn’t a P
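A quick check of the negated classes with Python's re module (Python 3). The four test strings are hypothetical and were chosen so that each one fails, or passes, for a different reason.

```python
import re

pattern = "N[^P][ST][^P]"

# Hypothetical four-residue fragments and whether they should match.
expected = {
    "NGSA": True,    # N, not-P, S, not-P
    "NPSA": False,   # P in the second position
    "NGTP": False,   # P in the last position
    "NGAA": False,   # third residue is neither S nor T
}

results = {seq: re.search(pattern, seq) is not None for seq in expected}
print(results)
```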
Allow anything
Sometimes the pattern can have anything in a
given position - it just needs the proper spacing.
Could use [ACDEFGHIKLMNPQRSTVWY] but that gets
tedious. Instead, let’s make a new notation for “anything”: the dot, “.”
Barwin domain signature 1
The pattern is: CG[KR]CL.V.N
The substring must start with a C,
second letter must be a G,
third must be a K or R,
fourth must be a C,
fifth must be an L,
sixth may be any residue,
seventh must be a V,
eighth may also be any residue,
last must be an N.
Example: ...SSCGKCLSVTNTG...
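The example fragment can be checked against the pattern with Python's re module. A Python 3 sketch; found and span are illustrative names.

```python
import re

match = re.search("CG[KR]CL.V.N", "SSCGKCLSVTNTG")

found = match.group(0)                 # the text that matched
span = (match.start(), match.end())    # where it sits in the fragment
print(found, span)
```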
Repeats
Sometimes you’ll repeat yourself repeat yourself. For
example, a pattern may require 5 hydrophobic residues
between two well conserved regions.
You could write it as
[FILAPVM][FILAPVM][FILAPVM][FILAPVM][FILAPVM]
but that gets tedious. Again that word. And again we’ll create
a new notation. Let’s use {}s with a number inside to indicate
how many times to repeat the previous pattern.
[FILAPVM]{5}
[FILAPVM]{5}
The {}s repeat the previous pattern.
The above matches all of the following
AAAAA
AAPAP
LAPMAVAILA
VILLAMAP
LAPLAMP
And .{6} matches any string of at least length 6.
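"Matches" here means the pattern is found somewhere in the string, which is what re.search tests. The Python 3 sketch below shows that search only consumes five residues, and that re.fullmatch (which insists on the whole string) behaves differently for the longer examples.

```python
import re

pattern = "[FILAPVM]{5}"
examples = ["AAAAA", "AAPAP", "LAPMAVAILA", "VILLAMAP", "LAPLAMP"]

# search finds the first run of five allowed residues in each string.
found = [re.search(pattern, s).group(0) for s in examples]

# fullmatch only succeeds when the entire string is exactly five residues.
whole = [re.fullmatch(pattern, s) is not None for s in examples]

short = re.search(".{6}", "ABCDE")      # only five characters available
print(found, whole, short)
```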
EGF-like domain signature 1
The pattern for PS00022 is: C.C.{5}G.{2}C
Match a C, followed by any residue, followed by a C, followed
by 5 residues of any type, then a G, then 2 of any residue type,
then a C.
...VCSNEGKCICQPDWTGKDCS...
Count Ranges
Sometimes you may have a range of repeats. For example, a
loop can have 3 to 5 residues in it. All of our patterns so far
only matched a fixed number of characters, so we need to
modify the notation.
EGF-like domain signature 2
PS01186 is: C.C.{2}[GP][FYW].{4,8}C
Short-hand versions of count ranges
This notation is very powerful and widely used outside of
bioinformatics. (I think research on it started in the 1950s).
Some repeat ranges are used so frequently that (to prevent
tedium, and to make things easier to read) there is special
notation for them.
Count range   Shorthand   What it means
{0,1}         ?           “optional”
{0,}          *           “0 or more”
{1,}          +           “at least one”
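A small Python 3 sketch comparing each shorthand with its spelled-out count range. The AB...C patterns and the three test strings are made up for the comparison; re.fullmatch is used so the whole string has to fit the pattern.

```python
import re

texts = ["AC", "ABC", "ABBBC"]

def fits(pattern):
    """Return, for each test string, whether the whole string fits pattern."""
    return [re.fullmatch(pattern, t) is not None for t in texts]

optional = fits("AB?C")      # zero or one B
star = fits("AB*C")          # zero or more Bs
plus = fits("AB+C")          # at least one B

# The shorthand and the explicit range should always agree.
agree = (optional == fits("AB{0,1}C")
         and star == fits("AB{0,}C")
         and plus == fits("AB{1,}C"))
print(optional, star, plus, agree)
```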
N- and C- terminals
Some things only happen at the N-terminal (start of the
sequence) or C-terminal (end of the sequence). We
don’t have a way to say that so we need - yes, you
guessed it - more notation.
^ means the start of the sequence (a ^ inside
of []s means “not”, outside means “start”)
$ means the end of the sequence
^examples$
^A        start with an A
^[MPK]    start with an M, P, or K
E$        end with an E
[QSN]$    end with a Q, S, or N
^[^P]     start with anything except P
^A.*E$    start with an A and end with an E
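The anchors can be tried out with re.search in Python 3. The short test strings below are hypothetical; each row pairs a pattern from the table with a string and the answer we expect.

```python
import re

# (pattern, hypothetical sequence, expected result)
checks = [
    ("^A", "ACDE", True),
    ("^A", "CADE", False),       # there is an A, but not at the start
    ("E$", "ACDE", True),
    ("^[^P]", "PACE", False),    # starts with P
    ("^A.*E$", "ACDE", True),
    ("^A.*E$", "ACDEK", False),  # starts with A but ends with K
]

results = [re.search(p, s) is not None for p, s, _ in checks]
print(results)
```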
Neuromodulin (GAP-43) signature 1
The pattern for PS00412 is: ^MLCC[LIVM]RR
Endoplasmic reticulum targeting sequence
The pattern for PS00014 is: [KRHQSA][DENQ]EL$
Regular expressions
These sorts of patterns which match strings are called “regular
expressions”. (The name “regular” comes from a theoretical
model of how simple computers work, and “expressions”
because they are written as text.)
People don’t like saying “regular expression” all the time so will
often say “regexp”, “regex”, or “re”, or (rarely) “rx”.
Many different regexp
languages
We’ve learned a bit of the “perl5” regular expression
language. It’s the most common and is used by Python
and other languages. There’s even PCRE (Perl Compatible
Regular Expressions) for C.
There are many others: grep, emacs, awk, POSIX, and the
shells all use different ways to write the same pattern.
PROSITE also has its own unique form (which I
didn’t teach because no one else uses it).
regexps in Python
The re module in Python has functions for working
with regular expressions.
>>> import re
>>>
The ‘search’ method
>>> import re
>>> text = "My name is Andrew"
>>> re.search("[AT]", text)
The Match object
>>> import re
>>> text = "My name is Andrew"
>>> re.search("[AT]", text)
<_sre.SRE_Match object at 0x3f8d40>
Using the match
>>> import re
>>> text = "My name is Andrew"
>>> re.search("[AT]", text)
<_sre.SRE_Match object at 0x3f8d40>
>>> match = re.search("[AT]", text)
>>> match.start()
11
>>> match.end()
12
>>> text[11:12]
'A'
>>>
Match a protein motif
>>> pattern = "[LIVMFE][FY]PWM[KRQTA]"
>>> seq = "LHNEANLRIYPWMRSAGADR"
>>> match = re.search(pattern, seq)
>>> match.start()
8
>>> match.end()
14
>>>
If it doesn’t match..
The search returns nothing (the None object)
when no match was found.
>>> import re
>>> pattern = "[LIVMFE][FY]PWM[KRQTA]"
>>> match = re.search(pattern, "AAAAAAAAAAAAAA")
>>> print match
None
>>>
List matching patterns
>>> import re
>>> pattern = "[LIVMFE][FY]PWM[KRQTA]"
>>> sequences = ["LHNEANLRIYPWMRSAGADR",
... "PTVGKQIFPWMKES",
... "NEANLKQIFPGAATR",
... "VFPWMKMGGAKGGESKRTR"]
>>> for seq in sequences:
...     match = re.search(pattern, seq)
...     if match:
...         print seq, "matches"
...     else:
...         print seq, "does not have the motif"
...
LHNEANLRIYPWMRSAGADR matches
PTVGKQIFPWMKES matches
NEANLKQIFPGAATR does not have the motif
VFPWMKMGGAKGGESKRTR matches
>>>
Groups
Suppose an enzyme modifies a protein, and recognizes
the portion of the sequence matching [ASD]{3,5}[LI][^P]{2,5}
The modification only occurs on the [LI] residue. I want
to know the position of that one residue, and not the
start/end positions of the whole motif. This requires a
new notation, groups.
(groups)
Use ()s to indicate groups. The first ( is the start of the
first group, the second ( is the start of the second
group, etc. A group ends with the matching ).
>>> import re
>>> pattern = "[ASD]{3,5}([LI])[^P]{2,5}"
>>> seq = "EASALWTRD"
>>> match = re.search(pattern, seq)
>>> print match.start(), match.end()
1 9
>>> match.start(1), match.end(1)
(4, 5)
>>>
Parsing with regexps
Groups are great for parsing. Suppose I have the string
Name: Andrew Age: 33
and want to get the name and the age values. I can use a
pattern with a group for each field.
Dissecting that pattern
Name: ([^ ]+) +Age: ([0123456789]+)
Name:             Start with “Name: ”
([^ ]+)           One or more non-space characters (group 1)
Age:              “Age: ”
([0123456789]+)   One or more digits (group 2)
Shorthand
Saying [0123456789] is tedious (again!)
There is special shorthand notation for some of
the more common sets.
Name: ([^ ]+) +Age: (\d+)
Some others
\d = [0123456789]
\w = letters, digits, and the underscore
\s = “whitespace” (space, newline, tab, and a few others)
Using it
>>> import re
>>> text = "Name: Andrew Age: 33"
>>> pattern = "Name: ([^ ]+) +Age: ([0123456789]+)"
>>> match = re.search(pattern, text)
>>> match.start(1)
6
>>> match.end(1)
12
>>> match.group(1)
'Andrew'
>>> match.group(2)
'33'
>>>
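The same parse can use the \d shorthand and collect both fields in one step. A minimal sketch in Python 3 syntax; the raw-string prefix and the conversion to int are additions that the slide does not show.

```python
import re

text = "Name: Andrew Age: 33"
# Raw string, so the backslash in \d reaches the re module unchanged.
pattern = r"Name: ([^ ]+) +Age: (\d+)"

match = re.search(pattern, text)
name, age_text = match.groups()   # one string per group, in order
age = int(age_text)               # groups are always strings, so convert
print(name, age)
```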
Dictionaries
A “Good morning”
dictionary
English: Good morning
Spanish: Buenos días
Swedish: God morgon
German: Guten Morgen
Venda: Ndi matscheloni
Afrikaans: Goeie môre
What’s a dictionary?
A dictionary is a table of items.
Each item has a “key” and a “value”
Keys Values
English Good morning
Spanish Buenos días
Swedish God morgon
German Guten Morgen
Venda Ndi matscheloni
Afrikaans Goeie môre
Look up a value
I want to know “Good morning” in Swedish.
Step 1: Get the “Good morning” table
Find the item
Step 2: Find the item where the key is “Swedish”
Get the value
Step 3: The value of that item is how to say “Good
morning” in Swedish -- “God morgon”
In Python
>>> good_morning_dict = {
... "English": "Good morning",
... "Swedish": "God morgon",
... "German": "Guten Morgen",
... "Venda": "Ndi matscheloni",
... }
>>> print good_morning_dict["Swedish"]
God morgon
>>>
Dictionary examples
>>> D1 = {}    # An empty dictionary
>>> len(D1)
0
>>> D2 = {"name": "Andrew", "age": 33}    # A dictionary with 2 items
>>> len(D2)
2
>>> D2["name"]
'Andrew'
>>> D2["age"]
33
>>> D2["AGE"]    # Keys are case-sensitive
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'AGE'
>>>
Add new elements
>>> my_sister = {}
>>> my_sister["name"] = "Christy"
>>> print "len =", len(my_sister), "and value is", my_sister
len = 1 and value is {'name': 'Christy'}
>>> my_sister["children"] = ["Maggie", "Porter"]
>>> print "len =", len(my_sister), "and value is", my_sister
len = 2 and value is {'name': 'Christy', 'children': ['Maggie', 'Porter']}
>>>
Get the keys and values
>>> city = {"name": "Cape Town", "country": "South Africa",
... "population": 2984000, "lat.": -33.93, "long.": 18.46}
>>> print city.keys()
['country', 'long.', 'lat.', 'name', 'population']
>>> print city.values()
['South Africa', 18.460000000000001, -33.93, 'Cape Town', 2984000]
>>> for k in city:
...     print k, "=", city[k]
...
country = South Africa
long. = 18.46
lat. = -33.93
name = Cape Town
population = 2984000
>>>
A few more examples
>>> D = {"name": "Johann", "city": "Cape Town"}
>>> D["city"] = "Johannesburg"
>>> print D
{'city': 'Johannesburg', 'name': 'Johann'}
>>> del D["name"]
>>> print D
{'city': 'Johannesburg'}
>>> D["name"] = "Dan"
>>> print D
{'city': 'Johannesburg', 'name': 'Dan'}
>>> D.clear()
>>>
>>> print D
{}
>>>
Ambiguity codes
Sometimes DNA bases are ambiguous.
Count Bases #1
This time we’ll include all 15 possible letters
Don’t do this! Let the computer help out
>>> seq = "TKKAMRCRAATARKWC"
>>> A = seq.count("A")
>>> B = seq.count("B")
>>> C = seq.count("C")
>>> D = seq.count("D")
>>> G = seq.count("G")
>>> H = seq.count("H")
>>> K = seq.count("K")
>>> M = seq.count("M")
>>> N = seq.count("N")
>>> R = seq.count("R")
>>> S = seq.count("S")
>>> T = seq.count("T")
>>> V = seq.count("V")
>>> W = seq.count("W")
>>> Y = seq.count("Y")
>>> print "A =", A, "B =", B, "C =", C, "D =", D, "G =", G, "H =", H, "K =", K, "M =", M, "N =", N, "R =", R, "S =", S, "T =", T, "V =", V, "W =", W, "Y =", Y
A = 4 B = 0 C = 2 D = 0 G = 0 H = 0 K = 3 M = 1 N = 0 R = 3 S = 0 T = 2 V = 0 W = 1 Y = 0
>>>
Count Bases #2
Using a dictionary
Don’t do this either!
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> counts["A"] = seq.count("A")
>>> counts["B"] = seq.count("B")
>>> counts["C"] = seq.count("C")
>>> counts["D"] = seq.count("D")
>>> counts["G"] = seq.count("G")
>>> counts["H"] = seq.count("H")
>>> counts["K"] = seq.count("K")
>>> counts["M"] = seq.count("M")
>>> counts["N"] = seq.count("N")
>>> counts["R"] = seq.count("R")
>>> counts["S"] = seq.count("S")
>>> counts["T"] = seq.count("T")
>>> counts["V"] = seq.count("V")
>>> counts["W"] = seq.count("W")
>>> counts["Y"] = seq.count("Y")
>>> print counts
{'A': 4, 'C': 2, 'B': 0, 'D': 0, 'G': 0, 'H': 0, 'K': 3, 'M': 1, 'N': 0, 'S': 0, 'R': 3, 'T': 2, 'W': 1, 'V': 0, 'Y': 0}
>>>
Count Bases #3
Use a for loop
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for letter in "ABCDGHKMNRSTVWY":
...     counts[letter] = seq.count(letter)
...
>>> print counts
{'A': 4, 'C': 2, 'B': 0, 'D': 0, 'G': 0, 'H': 0, 'K': 3, 'M': 1, 'N': 0, 'S': 0, 'R': 3, 'T': 2, 'W': 1, 'V': 0, 'Y': 0}
>>> for base in counts.keys():
...     print base, "=", counts[base]
...
A = 4
C = 2
B = 0
D = 0
G = 0
H = 0
K = 3
M = 1
N = 0
S = 0
R = 3
T = 2
W = 1
V = 0
Y = 0
>>>
Count Bases #4
Suppose you don’t know all the possible bases.
If the base isn’t a key in the counts dictionary then use
zero. Otherwise use the value from the dict.
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
...     if base not in counts:
...         n = 0
...     else:
...         n = counts[base]
...     counts[base] = n + 1
...
>>> print counts
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
>>>
Count Bases #5 (Last
one!)
The idiom “use a default value if the key doesn’t
exist” is very common. Python has a special
method to make it easy.
>>> seq = "TKKAMRCRAATARKWC"
>>> counts = {}
>>> for base in seq:
...     counts[base] = counts.get(base, 0) + 1
...
>>> print counts
{'A': 4, 'C': 2, 'K': 3, 'M': 1, 'R': 3, 'T': 2, 'W': 1}
>>> counts.get("A", 9)
4
>>> counts["B"]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
KeyError: 'B'
>>> counts.get("B", 9)
9
>>>
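The standard library also packages this idiom. The sketch below swaps in collections.Counter, which the slides do not use, as an alternative to the get() loop; it is written in Python 3 syntax.

```python
from collections import Counter

seq = "TKKAMRCRAATARKWC"

# Counter is a dict subclass that handles the "default of zero" itself.
counts = Counter(seq)

a_count = counts["A"]   # a letter that occurs in seq
b_count = counts["B"]   # a missing letter reads as 0 instead of raising KeyError
print(a_count, b_count)
```

Reading a missing key from a Counter does not add it, so the stored keys are still only the letters that occur in the sequence.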
Reverse Complement
>>> complement_table = {"A": "T", "T": "A", "C": "G", "G": "C"}
>>> seq = "CCTGTATT"
>>> new_seq = []
>>> for letter in seq:
...     complement_letter = complement_table[letter]
...     new_seq.append(complement_letter)
...
>>> print new_seq
['G', 'G', 'A', 'C', 'A', 'T', 'A', 'A']
>>> new_seq.reverse()
>>> print new_seq
['A', 'A', 'T', 'A', 'C', 'A', 'G', 'G']
>>> print "".join(new_seq)
AATACAGG
>>>
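The interactive steps above can be folded into one reusable function. A sketch in Python 3 syntax; the function and constant names are chosen here, and ambiguity codes are not handled because the table has only four entries.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    # Walk the sequence backwards and look up each base's partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq))

result = reverse_complement("CCTGTATT")
print(result)
```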
Listing Codons
>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> seq[0:3]
'TCT'
>>> seq[3:6]
'CCA'
>>> seq[6:9]
'AGA'
>>> range(0, len(seq), 3)
[0, 3, 6, 9, 12, 15, 18]
>>> for i in range(0, len(seq), 3):
...     print "Codon", i/3, "is", seq[i:i+3]
...
Codon 0 is TCT
Codon 1 is CCA
Codon 2 is AGA
Codon 3 is CGC
Codon 4 is ATC
Codon 5 is CCA
Codon 6 is GTG
>>>
The last “codon”
>>> seq = "TCTCCAA"
>>> for i in range(0, len(seq), 3):
...     print "Base", i/3, "is", seq[i:i+3]
...
Base 0 is TCT
Base 1 is CCA
Base 2 is A          Not a codon!
>>>
The ‘%’ (remainder)
operator
>>> 0 % 3
0
>>> 1 % 3
1
>>> 2 % 3
2
>>> 3 % 3
0
>>> 4 % 3
1
>>> 5 % 3
2
>>> 6 % 3
0
>>> seq = "TCTCCAA"
>>> len(seq)
7
>>> len(seq) % 3
1
>>>
Two solutions
First one -- refuse to do it
if len(seq) % 3 != 0:  # not divisible by 3
    print "Will not process the sequence"
else:
    print "Will process the sequence"
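The slide text names only the first solution. The codon-counting code that follows takes a different approach and drops the trailing bases that do not fill a codon. A sketch of both in Python 3 syntax; the function name and the strict flag are hypothetical.

```python
def split_codons(seq, strict=False):
    """Split seq into codons.

    strict=True refuses sequences whose length is not a multiple of 3;
    strict=False silently drops the 1 or 2 leftover bases at the end.
    """
    leftover = len(seq) % 3
    if strict and leftover != 0:
        raise ValueError("sequence length is not divisible by 3")
    usable = len(seq) - leftover
    return [seq[i:i + 3] for i in range(0, usable, 3)]

lenient = split_codons("TCTCCAA")
print(lenient)
```

Which behaviour is right depends on the data: refusing is safer when a partial codon signals a truncated input, dropping is convenient for quick counts.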
Counting codons
>>> seq = "TCTCCAAGACGCATCCCAGTG"
>>> codon_counts = {}
>>> for i in range(0, len(seq) - len(seq)%3, 3):
...     codon = seq[i:i+3]
...     codon_counts[codon] = codon_counts.get(codon, 0) + 1
...
>>> codon_counts
{'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
>>>
Sorting the output
People like sorted output. It’s easier to
find “GTG” if the codon table is in order.
Use keys to get the dictionary keys then
use sort to sort the keys (put them in order).
>>> codon_counts = {'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}
>>> codons = codon_counts.keys()
>>> print codons
['ATC', 'GTG', 'TCT', 'AGA', 'CCA', 'CGC']
>>> codons.sort()
>>> print codons
['AGA', 'ATC', 'CCA', 'CGC', 'GTG', 'TCT']
>>> for codon in codons:
...     print codon, "=", codon_counts[codon]
...
AGA = 1
ATC = 1
CCA = 2
CGC = 1
GTG = 1
TCT = 1
>>>
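The keys-then-sort steps rely on Python 2, where keys() returns a list. A sketch of the same result with the built-in sorted(), written in Python 3 syntax:

```python
codon_counts = {'ATC': 1, 'GTG': 1, 'TCT': 1, 'AGA': 1, 'CCA': 2, 'CGC': 1}

# sorted() accepts any iterable and returns a new list, so it also works in
# Python 3, where keys() is a view object without a sort() method.
codons = sorted(codon_counts)
lines = ["%s = %d" % (codon, codon_counts[codon]) for codon in codons]
print("\n".join(lines))
```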
Code Blocks
and
Indentation
Indentation is important
Think of a recipe - Chocolate Cake
(Mmmmmm.... Chocolate Cake....)
Make the cake?
1. Make the cake:
    1A. make the batter
    1B. put into pans
    1C. bake at 180C for 30-35 minutes
2. Put the frosting on
3. Eat and enjoy
Make the batter?
1. Make the cake:
    1A. make the batter:
        1Aa. melt chocolate and butter
        1Ab. prepare egg mixture
        1Ac. sift dry ingredients
        1Ad. combine egg mixture, dry
             ingredients and milk
        1Ae. fold egg whites into batter
    1B. put into pans
    1C. bake at 180C for 30-35 minutes
2. Put the frosting on
3. Eat and enjoy
Melt the chocolate ... ?
- Make the cake:
    - make the batter:
Where do I get ... ?
- Prepare for cooking:
    - get 6 ounces of chocolate, 1/2 cup butter
    - get a saucepan, stove, spoon for stirring
- Make the cake:
    - make the batter:
        - melt chocolate and butter:
            - In a heavy saucepan over low heat:
                - put in 6 ounces semi-sweet chocolate
                - put in 1/2 cup butter
                - while it hasn’t melted:
                    - wait a little bit
                    - stir
            - put aside to let cool
        - prepare egg mixture
        - sift dry ingredients
        - combine egg mixture, dry ingredients and milk
        - fold egg whites into batter
    - put into pans
    - bake at 180C for 30-35 minutes
- Put the frosting on
- Eat and enjoy
I have/don’t have that!
Prepare for cooking:
    get a kitchen with a good set of cookware
    start a “Shopping list”
    for each ingredient in [6 ounces of chocolate,
                            1/2 cup of butter]:
        if I don’t have enough of the ingredient:
            add what’s missing to the shopping list
Make the cake:
    make the batter:
        melt chocolate and butter:
            In a heavy saucepan over low heat:
                put in 6 ounces semi-sweet chocolate
                put in 1/2 cup butter
                while it hasn’t melted:
                    wait a little bit
                    stir
            put aside to let cool
        prepare egg mixture
        sift dry ingredients
        combine egg mixture, dry ingredients and milk
        fold egg whites into batter
    put into pans
    bake at 180C for 30-35 minutes
Put the frosting on
Eat and enjoy
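The shopping-list part of the recipe maps almost line for line onto Python, with indentation marking which steps belong to the loop and which to the if. A sketch in Python 3 syntax; the pantry contents are made up for the example.

```python
# Amounts the recipe needs (from the slide) and what is already in the
# kitchen (hypothetical values).
needed = {"chocolate (ounces)": 6, "butter (cups)": 0.5}
pantry = {"chocolate (ounces)": 2, "butter (cups)": 1}

shopping_list = {}                      # start a "Shopping list"
for ingredient in needed:               # for each ingredient ...
    have = pantry.get(ingredient, 0)
    if have < needed[ingredient]:       # if I don't have enough of it:
        # add what's missing to the shopping list
        shopping_list[ingredient] = needed[ingredient] - have

print(shopping_list)
```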
And the egg mixture?
Prepare for cooking:
    get a kitchen with a good set of cookware
    start a “Shopping list”
Then the dry ingredients...
And folding in the egg whites ...
And putting everything into the pans ...
And making the frosting ...
Oh, and cleaning up afterwards...
...
Making four cakes
I’m making birthday cakes for four Swedes,
Anders, Lars, Ingela, Jacob.
Prepare for cooking (*4)
Make the cake (*4)
Put the frosting on (*4)
for x in ["Anders", "Lars", "Ingela", "Jacob"]:
    on the cake, write "Happy Birthday, ", x
Eat and enjoy
Functions
Built-in functions
You’ve used several functions already
>>> len("ATGGTCA")
7
>>> abs(-6)
6
>>> float("3.1415")
3.1415000000000002
>>>
What are functions?
A function is a code block with a name
Functions start with ‘def’
Then the name
This function is named ‘hello’
The list of parameters
The parameters are always listed in parentheses.
There are no parameters in this function
so the parameter list is empty.
A colon
A function definition starts a new code block.
The definition line must end with a colon (the “:”),
just like the ‘if’ and ‘for’ statements.
The code block
These are the statements that are run when the
function is called. They can be any Python
statement (print, assignment, if, for, open, ...)
Calling the function
When you “call” a function you ask Python
to execute the statements in the code block
for that function.
Which function to call?
Start with the name of the function.
In this case the name is “hello”
List any parameters
The parameters are always listed in parentheses.
There are no parameters for this function
so the parameter list is empty.
And the function runs
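The code shown on these slides did not survive in the text. A minimal sketch of what the descriptions add up to, a function named hello with an empty parameter list; the two print lines in the body are an assumption, and the syntax is Python 3.

```python
def hello():                  # 'def', the name, an empty parameter list, a colon
    print("Hello!")           # the code block (its content is assumed here)
    print("How are you?")

result = hello()              # the call: the name, then an empty argument list
```

Calling hello() runs both statements in the code block. Nothing is returned, so result holds None.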
Arguments and
Parameters
(Two sides of the same idea)
Hello, <insert name here>
Say “Hello” followed by the person’s name
In maths we say “the function is parameterized by
the person’s name”
>>> def hello(name):
...     print "Hello", name
...
>>> hello("Andrew")
Hello Andrew
>>>
Change the function definition
The function now takes one parameter. When the function
is called this parameter will be accessible using the variable
named name
Calling the function
The function call now needs one argument.
Here I’ll use the string “Andrew”.
And the function runs
The function call assigns the string “Andrew” to
the variable “name” then does the statements
in the code block
Multiple parameters
Here’s a function which takes two parameters
and subtracts the second from the first.
>>> def subtract(x, y):    # Two parameters in the definition
...     print x-y
...
>>> subtract(8, 5)         # Two arguments in the call
3
>>>
Returning values
Rarely do functions only print.
More often the function does something and
the results of that are used by something else.
For example, len computes the length of a string
or list then returns that value to the caller.
subtract doesn’t return
anything
By default, a function returns the special value None
The return statement
The return statement tells Python to exit the
function and return a given object.
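A sketch of the difference between printing and returning, in Python 3 syntax. The name subtract_and_print is made up here to keep the two versions apart.

```python
def subtract_and_print(x, y):
    print(x - y)          # shows the result but hands nothing back

def subtract(x, y):
    return x - y          # exits the function and gives the value to the caller

printed = subtract_and_print(8, 5)
returned = subtract(8, 5)

print(printed)    # the default return value
print(returned)   # the value given to 'return'
```

Only the returned value can be stored, compared, or passed on to another function.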
Making a function
Yes, we’re going to count letters again.
seq = "ATGCATGATGCATGAAAGGTCG"
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
Identify the function
I’m going to make a function which counts bases.
What’s the best part to turn into a function?
Identify the input
In this example the sequence can change.
That makes seq a good choice as a parameter.
Identify the algorithm
This is the part of your program
which does something.
Identify the output
The output will use the data computed by
your function...
Identify the return value
... which helps you identify the return value
seq = "ATGCATGATGCATGAAAGGTCG"
counts = {}
for base in seq:
    if base not in counts:
        counts[base] = 1
    else:
        counts[base] = counts[base] + 1
for base in counts:
    print base, "=", counts[base]
Name the function
First, come up with a good name for your function.
Start with the ‘def’ line
The function definition starts with a ‘def’
def count_bases(seq):
Add the code block
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
Return the results
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
    return counts
Use the function
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
    return counts
input_seq = "ATGCATGATGCATGAAAGGTCG"
results = count_bases(input_seq)
for base in results:
    print base, "=", results[base]
Use the function
Notice that the variables for the parameters and
the return value don’t need to be the same
def count_bases(seq):
    counts = {}
    for base in seq:
        if base not in counts:
            counts[base] = 1
        else:
            counts[base] = counts[base] + 1
    return counts
input_seq = "ATGCATGATGCATGAAAGGTCG"
results = count_bases(input_seq)
for base in results:
    print base, "=", results[base]
Interactively
(I don’t even need a variable name - just use the values directly.)
>>> def count_bases(seq):
...     counts = {}
...     for base in seq:
...         if base not in counts:
...             counts[base] = 1
...         else:
...             counts[base] = counts[base] + 1
...     return counts
...
>>> count_bases("ATATC")
{'A': 2, 'C': 1, 'T': 2}
>>> count_bases("ATATCQGAC")
{'A': 3, 'Q': 1, 'C': 2, 'T': 2, 'G': 1}
>>> count_bases("")
{}
>>>
Functions can call functions
>>> def gc_content(seq):
...     counts = count_bases(seq)
...     return (counts["G"] + counts["C"]) / float(len(seq))
...
>>> gc_content("CGAATT")
0.333333333333
>>>
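As written, gc_content raises a KeyError for a sequence with no G or no C, and divides by zero for an empty one. A more defensive sketch in Python 3 syntax; returning 0.0 for an empty sequence is a choice made here, not something the slides specify.

```python
def count_bases(seq):
    counts = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return counts

def gc_content(seq):
    """Fraction of G and C bases; 0.0 for an empty sequence (assumed)."""
    if not seq:
        return 0.0
    counts = count_bases(seq)
    # get() avoids a KeyError when the sequence has no G or no C.
    return (counts.get("G", 0) + counts.get("C", 0)) / len(seq)

print(gc_content("CGAATT"))
print(gc_content("AATT"))
```

In Python 3 the / operator always performs true division, so the float() call from the original is not needed.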
Functions can be used
(almost) anywhere
In an ‘if’ statement
>>> def polyA_tail(seq):
...     if seq.endswith("AAAAAA"):
...         return True
...     else:
...         return False
...
>>> if polyA_tail("ATGCTGTCGATGAAAAAAA"):
...     print "Has a poly-A tail"
...
Has a poly-A tail
>>>
Functions can be used
(almost) anywhere
In a ‘for’ statement
>>> def split_into_codons(seq):
...     codons = []
...     for i in range(0, len(seq)-len(seq)%3, 3):
...         codons.append(seq[i:i+3])
...     return codons
...
>>> for codon in split_into_codons("ATGCATGCATGCATGCATGC"):
...     print "Codon", codon
...
Codon ATG
Codon CAT
Codon GCA
Codon TGC
Codon ATG
Codon CAT
>>>
Default arguments
def ask_ok(prompt, retries=4, complaint='Yes or no, please!'):
    while True:
        ok = raw_input(prompt)
        if ok in ('y', 'ye', 'yes'):
            return True
        if ok in ('n', 'no', 'nop', 'nope'):
            return False
        retries = retries - 1
        if retries < 0:
            raise IOError('refusenik user')
        print complaint
Keyword arguments
def parrot(voltage, state='a stiff', action='voom', type='Norwegian Blue'):
    print "-- This parrot wouldn't", action,
    print "if you put", voltage, "volts through it."
    print "-- Lovely plumage, the", type
    print "-- It's", state, "!"
OK:
parrot(1000) # 1 positional argument
parrot(voltage=1000) # 1 keyword argument
parrot(voltage=1000000, action='VOOOOOM') # 2 keyword arguments
parrot(action='VOOOOOM', voltage=1000000) # 2 keyword arguments
parrot('a million', 'bereft of life', 'jump') # 3 positional arguments
parrot('a thousand', state='pushing up the daisies') # 1 positional, 1 keyword
Not OK:
parrot() # required argument missing
parrot(voltage=5.0, 'dead') # non-keyword argument after a keyword argument
parrot(110, voltage=220) # duplicate value for the same argument
parrot(actor='John Cleese') # unknown keyword argument
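The same calling rules applied to a sequence example. A sketch in Python 3 syntax; count_motif and its default values are hypothetical and serve only to show defaults and keywords together.

```python
def count_motif(seq, motif="AT", start=0):
    """Count non-overlapping occurrences of motif in seq, from index start."""
    return seq.count(motif, start)

seq = "ATGATATCAT"
calls = [
    count_motif(seq),                         # both defaults used
    count_motif(seq, "CAT"),                  # second argument given by position
    count_motif(seq, start=3),                # skip 'motif', set 'start' by keyword
    count_motif(start=3, motif="T", seq=seq), # keywords may come in any order
]
print(calls)
```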
Sorting and Modules
Sorting
Lists have a sort method
Strings are sorted alphabetically, except ...
>>> L1 = ["this", "is", "a", "list", "of", "words"]
>>> print L1
['this', 'is', 'a', 'list', 'of', 'words']
>>> L1.sort()
>>> print L1
['a', 'is', 'list', 'of', 'this', 'words']
>>>
ASCII order
>>> for i in range(32, 127):
...     print i, "=", chr(i)
...
32 =     56 = 8   80 = P   104 = h
33 = !   57 = 9   81 = Q   105 = i
34 = "   58 = :   82 = R   106 = j
35 = #   59 = ;   83 = S   107 = k
36 = $   60 = <   84 = T   108 = l
37 = %   61 = =   85 = U   109 = m
38 = &   62 = >   86 = V   110 = n
39 = '   63 = ?   87 = W   111 = o
40 = (   64 = @   88 = X   112 = p
41 = )   65 = A   89 = Y   113 = q
42 = *   66 = B   90 = Z   114 = r
43 = +   67 = C   91 = [   115 = s
44 = ,   68 = D   92 = \   116 = t
45 = -   69 = E   93 = ]   117 = u
46 = .   70 = F   94 = ^   118 = v
47 = /   71 = G   95 = _   119 = w
48 = 0   72 = H   96 = `   120 = x
49 = 1   73 = I   97 = a   121 = y
50 = 2   74 = J   98 = b   122 = z
51 = 3   75 = K   99 = c   123 = {
52 = 4   76 = L   100 = d  124 = |
53 = 5   77 = M   101 = e  125 = }
54 = 6   78 = N   102 = f  126 = ~
55 = 7   79 = O   103 = g
>>> for letter in "Hello":
...     print ord(letter)
...
72
101
108
108
111
>>>
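The practical consequence of ASCII order is that capital letters sort before lower-case ones. A sketch in Python 3 syntax with a made-up word list; the key=str.lower variant is an addition that these slides do not cover.

```python
words = ["banana", "Cherry", "apple", "Date"]

# A plain sort compares character codes, so every capital letter (65-90)
# comes before every lower-case letter (97-122).
ascii_order = sorted(words)

# Passing key=str.lower compares lower-cased copies instead.
alphabetical = sorted(words, key=str.lower)

print(ascii_order)
print(alphabetical)
```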
Sorting Numbers
Numbers are sorted numerically
>>> L3 = [5, 2, 7, 8]
>>> L3.sort()
>>> print L3
[2, 5, 7, 8]
>>> L4 = [-7.0, 6, 3.5, -2]
>>> L4.sort()
>>> print L4
[-7.0, -2, 3.5, 6]
>>>
Sorting Both
You can sort a list that holds both numbers and strings
>>> L5 = [1, "two", 9.8, "fem"]
>>> L5.sort()
>>> print L5
[1, 9.8000000000000007, 'fem', 'two']
>>>
Sort returns nothing!
Sort modifies the list “in-place”
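A sketch of what "returns nothing" means in practice, in Python 3 syntax. The built-in sorted() is shown for contrast; it is not mentioned on the slide.

```python
words = ["this", "is", "a", "list", "of", "words"]

result = words.sort()      # sorts the list itself ...
print(result)              # ... and returns None
print(words)

original = ["b", "c", "a"]
copy_sorted = sorted(original)   # sorted() builds a new list instead
print(original, copy_sorted)
```

A common mistake is writing words = words.sort(), which replaces the list with None.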
Three steps for sorting
#1 - Get the list
>>> L1 = "this is a list of words".split()
>>> print L1
['this', 'is', 'a', 'list', 'of', 'words']
#2 - Sort it
>>> L1.sort()
Sorting Dictionaries
Dictionary keys are unsorted
>>> D = {"ATA": 6, "TGG": 8, "AAA": 1}
>>> print D
{'AAA': 1, 'TGG': 8, 'ATA': 6}
>>>
Sorting Dictionaries
#1 - Get the list
>>> D = {"ATA": 6, "TGG": 8, "AAA": 1}
>>> print D
{'AAA': 1, 'TGG': 8, 'ATA': 6}
>>> keys = D.keys()
>>> print keys
['AAA', 'TGG', 'ATA']
>>>
#2 - Sort the list
>>> keys.sort()
>>> print keys
['AAA', 'ATA', 'TGG']
>>>
#3 - Use the sorted list
>>> for k in keys:
...     print k, D[k]
...
AAA 1
ATA 6
TGG 8
>>>
More info
There is a “how-to” on sorting at
http://www.amk.ca/python/howto/sorting/sorting.html
Object Oriented Programming
o Abstraction (the idea that the roles of Waiters, Customers and Kitchens
are abstract ideas, apart from any particular instance of a Waiter; Python
refers to the class of waiters, JavaScript to the prototypical Waiter)
o Messages (the function calls that are used to interact with objects; here,
the words in the speech balloons, and also perhaps the coffee & cash)
o Overloading (Waiter's response to "A coffee", different response to "A
black coffee")
o Polymorphism (Waiter and Kitchen implement "A black coffee" differently)
o Encapsulation (Customers, Waiters conceal their internal data, present
interfaces relating to behavior)
o Inheritance (not exactly used here, except implicitly: all types of coffee can
be drunk or spilled, all humans can speak basic English and hold cups of
coffee, etc. A better example of Inheritance is if there are different
specializations of Waiter, e.g. Head Waiter, Sommelier, etc. Then all “inherit”
the core functions of a Waiter, but with different extra functionality)
o Various OOP Design Patterns: the Waiter is an Adapter and/or a Bridge,
the Kitchen is a Factory (and perhaps the Waiter is too), asking for coffee is
a Factory Method, etc.
Modules
Modules are collections of objects (like strings,
numbers, functions, lists, and dictionaries)
You’ve seen the math module
>>> import math
>>> math.cos(0)
1.0
>>> math.cos(math.radians(45))
0.70710678118654746
>>> math.sqrt(2) / 2
0.70710678118654757
>>> math.hypot(5, 12)
13.0
>>>
Importing a module
The import statement tells Python to find the
module with the given name.
Using the new module
Objects in the math module are
accessed with the “dot notation”
Attributes
The dot notation is used for attributes, which are
also called properties.
Make a module
First, create a new file
Add Python code
In the file “seq_functions.py” add the following:
BASES = "ATCG"
def GC_content(s):
    return (s.count("G") + s.count("C")) / float(len(s))
Test it interactively
>>> import seq_functions
>>> seq_functions.BASES
'ATCG'
>>> seq_functions.GC_content("ATCG")
0.5
>>>
Using it from a program
Create a new file called “main.py”
Making changes
If you edit “seq_functions.py” then you must tell
Python to reread the statements from the module,
for example with reload(seq_functions) in Python 2.
Important modules:
Biopython, SQL & COM
Information sources
• python.org
• tutor list (for beginners), the Python
Package index, on-line help, tutorials, links to
other documentation, and more.
Biopython
• www.biopython.org
• Collection of many bioinformatics modules
• Some well tested, some experimental
• Check with biopython.org before writing new
software. It may already exist.
The Seq object
>>> from Bio import Seq
>>> seq = Seq.Seq("ATGCATGCATGATGATCG")
>>> print seq
Seq('ATGCATGCATGATGATCG', Alphabet())
>>>
Alphabets
>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein = Seq.Seq("ATGCATGCATGC", IUPAC.protein)
>>> dna = Seq.Seq("ATGCATGCATGC", IUPAC.unambiguous_dna)
>>> protein[:10]
Seq('ATGCATGCAT', IUPACProtein())
>>> protein[:10] + protein[::-1]
Seq('ATGCATGCATCGTACGTACGTA', IUPACProtein())
>>> dna[:6]
Seq('ATGCAT', IUPACUnambiguousDNA())
>>> dna[0]
'A'
>>> protein[:10] + dna[:6]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
File "/usr/local/lib/python2.3/site-packages/Bio/Seq.py", line 45, in __add__
raise TypeError, ("incompatable alphabets", str(self.alphabet),
TypeError: ('incompatable alphabets', 'IUPACProtein()', 'IUPACUnambiguousDNA()')
>>>
Translation
>>> from Bio import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio import Translate
>>>
>>> standard_translator = Translate.unambiguous_dna_by_id[1]
>>> seq = Seq.Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC",
... IUPAC.unambiguous_dna)
>>> standard_translator.translate(seq)
Seq('DRWAYIGSKI', HasStopCodon(IUPACProtein(), '*'))
>>>
Reading sequence files
We’ve put a lot of work into reading common
bioinformatics file formats. As the formats change, we
update our parsers. There’s (almost) no reason for you to
write your own GenBank, SWISS-PROT, ... parser!
Reading a FASTA file
>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> infile = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(infile, parser)
>>> record = iterator.next()
>>> record.title
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and
ITS1 and ITS2 DNA'
>>> record.sequence
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAA
TCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCC
TCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCAC
CGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGAC
TCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATC
CCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGC
TTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGA
ACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATG
GAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC'
Reading all records
>>> from Bio import Fasta
>>> parser = Fasta.RecordParser()
>>> infile = open("ls_orchid.fasta")
>>> iterator = Fasta.Iterator(infile, parser)
>>> while 1:
...     record = iterator.next()
...     if not record:
...         break
...     print record.title[record.title.find(" ")+1:-1]
...
C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DN
C.californicum 5.8S rRNA gene and ITS1 and ITS2 DN
C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DN
C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DN
C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DN
C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DN
.... additional lines removed ....
Monday, September 9, 13
Reading a GenBank file
Compared with the FASTA example, only the format name changed.
>>> from Bio import GenBank
>>> parser = GenBank.RecordParser()
>>> infile = open("input.gb")
>>> iterator = GenBank.Iterator(infile, parser)
>>> record = iterator.next()
>>> record.locus
'10A19I'
>>> record.organism
'Oryza sativa (japonica cultivar-group)'
>>> len(record.features)
31
>>> record.features[0].key
'source'
>>> record.features[0].location
'1..99587'
>>> record.taxonomy
['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta',
'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Poales',
'Poaceae', 'Ehrhartoideae', 'Oryzeae', 'Oryza']
>>>
Monday, September 9, 13
Get data over the web
Python includes the ‘urllib2’ module (successor to urllib) to fetch
data given a URL. It can handle GET and POST requests
over HTTP and HTTPS, fetch files over FTP, and read local files.
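A minimal sketch of fetching data by URL. It is written for Python 3, where urllib2's urlopen lives in urllib.request. To stay runnable without a network it reads a temporary local file through a file: URL; the helper name fetch and the file contents are made up for illustration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen  # Python 3 home of urllib2's urlopen


def fetch(url):
    """Return the raw bytes found at url."""
    with urlopen(url) as response:
        return response.read()


# Write a small FASTA file, then read it back through a file: URL.
handle, name = tempfile.mkstemp(suffix=".fasta")
with os.fdopen(handle, "w") as outfile:
    outfile.write(">demo\nGATTACA\n")

data = fetch(Path(name).as_uri())
os.remove(name)
print(data.decode("ascii"))
```

The same fetch call should accept an http:// or https:// URL in place of the file: URL.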
Monday, September 9, 13
NCBI’s EUtils
Designed for software to access NCBI’s databases directly.
Can query literature and sequence databases.
Biopython includes a library for working with it.
Sadly, it’s poorly documented.
Monday, September 9, 13
Remote BLAST
(at NCBI)
Biopython includes a library to run a job on NCBI’s
BLAST server. To your program it looks just like a
normal function call.
Monday, September 9, 13
BLAST (locally)
If you instead want to use a local installation of
BLAST, you can use Bio.Blast.NCBIStandalone.
Monday, September 9, 13
Microsoft/COM/Excel
Python runs on Unix, Macs, Microsoft Windows, and more.
Python support for Windows is very good.
“The most Microsoft compliant language outside Redmond.”
Monday, September 9, 13
SQL
Python can connect to different databases (like MySQL,
PostgreSQL, and Oracle). Usually there are two ways to
talk to the database: directly, using the database-specific
interface, or indirectly, through a Python adapter which tries
to hide the differences between the databases.
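A small sketch of the adapter style, using the sqlite3 module because it ships with Python and needs no server. sqlite3 is not one of the databases named above, and the table and column names are invented. Adapters for other databases follow the same connect / cursor / execute pattern but differ in the connect() arguments and the placeholder style.

```python
import sqlite3

# An in-memory database, so nothing is left on disk.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("CREATE TABLE sequences (name TEXT, residues TEXT)")
# Placeholders (? for sqlite3) keep values separate from the SQL text.
cursor.executemany(
    "INSERT INTO sequences VALUES (?, ?)",
    [("seq1", "GATTACA"), ("seq2", "ATGGCC")],
)
connection.commit()

cursor.execute("SELECT name, length(residues) FROM sequences ORDER BY name")
rows = cursor.fetchall()
connection.close()
print(rows)
```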
Monday, September 9, 13
“Everything Else”
Monday, September 9, 13
Find all substrings
We’ve learned how to find the first location of
a string in another string with find. What
about finding all matches?
Start by looking at the documentation.
S.find(sub [,start [,end]]) -> int
Return -1 on failure.
Monday, September 9, 13
Experiment with find
>>> seq = "aaaaTaaaTaaT"
>>> seq.find("T")
4
>>> seq.find("T", 4)
4
>>> seq.find("T", 5)
8
>>> seq.find("T", 9)
11
>>> seq.find("T", 12)
-1
>>>
Monday, September 9, 13
How to program it?
Monday, September 9, 13
while statement
The solution is the while statement: while the test is true, do its code block.
>>> pos = seq.find("T")
>>> while pos != -1:
...     print "T at index", pos
...     pos = seq.find("T", pos+1)
...
T at index 4
T at index 8
T at index 11
>>>
Monday, September 9, 13
There’s duplication...
Duplication is bad. (Unless you’re a gene?)
The more copies there are, the more likely some
will differ from the others. Here the find call is written twice.
>>> pos = seq.find("T")
>>> while pos != -1:
...     print "T at index", pos
...     pos = seq.find("T", pos+1)
...
T at index 4
T at index 8
T at index 11
>>>
Monday, September 9, 13
The break statement
The break statement says “exit this loop immediately”
instead of waiting for the normal exit.
>>> pos = -1
>>> while 1:
...     pos = seq.find("T", pos+1)
...     if pos == -1:
...         break
...     print "T at index", pos
...
T at index 4
T at index 8
T at index 11
>>>
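The same loop can be wrapped in a function that returns the positions instead of printing them. This is a sketch written for Python 3; the name find_all is made up here and is not a string method.

```python
def find_all(text, sub):
    """Return the start index of every occurrence of sub in text."""
    positions = []
    pos = -1
    while True:
        # Restart the search one character past the previous match,
        # so overlapping matches are reported too.
        pos = text.find(sub, pos + 1)
        if pos == -1:
            break
        positions.append(pos)
    return positions


print(find_all("aaaaTaaaTaaT", "T"))
print(find_all("aaaa", "aa"))
```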
Monday, September 9, 13
break in a for
A break also works in the for loop
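For instance, a for loop over codons can stop at the first stop codon. This is an illustrative sketch in Python 3; the codon list is made up.

```python
codons = ["ATG", "GCC", "TAA", "GGA"]
stop_codons = ("TAA", "TAG", "TGA")

coding = []
for codon in codons:
    if codon in stop_codons:
        break  # leave the for loop at the first stop codon
    coding.append(codon)

print(coding)
```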
Monday, September 9, 13
elif
Sometimes the if statement is more complex than if/else
“If the weather is hot then go to the beach. If it is
rainy, go to the movies. If it is cold, read a book.
Otherwise watch television.”
if is_hot(weather):
    go_to_beach()
elif is_rainy(weather):
    go_to_movies()
elif is_cold(weather):
    read_book()
else:
    watch_television()
Monday, September 9, 13
tuples
Python has another fundamental data type - a tuple.
A tuple is like a list except it’s immutable (can’t be changed)
>>> data = ("Cape Town", 2004, [])
>>> print data
('Cape Town', 2004, [])
>>> data[0]
'Cape Town'
>>> data[0] = "Johannesburg"
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> data[1:]
(2004, [])
>>>
Monday, September 9, 13
Why tuples?
We already have a list type. What does a tuple add?
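One common answer, offered here as an assumption rather than as the point made in the talk: because a tuple cannot change, it can be used as a dictionary key, while a list cannot. A sketch in Python 3 with made-up data:

```python
# Keys are (chromosome, position) tuples; the values are invented.
snp_base = {}
snp_base[("chr1", 1042)] = "A"
snp_base[("chr2", 77)] = "G"
print(snp_base[("chr1", 1042)])

list_key_failed = False
try:
    snp_base[["chr1", 1042]] = "T"  # a list is mutable, so it is not hashable
except TypeError:
    list_key_failed = True
print(list_key_failed)
```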
Monday, September 9, 13
String Formatting
So far all the output examples used the print statement. Print
puts spaces between fields, and sticks a newline at the end.
Often you’ll need to be more precise.
Python has a new definition for the “%” operator when used
with a string on the left-hand side - “string interpolation”
Monday, September 9, 13
Simple string interpolation
The left side of a string interpolation is always a string.
The right side of the string interpolation may be a dictionary, a
tuple, or anything else. Let’s start with the last.
Monday, September 9, 13
% examples
Also note some of the special formatting codes.
>>> "This is a string: %s" % "Yes, it is"
'This is a string: Yes, it is'
>>> "This is an integer: %d" % 10
'This is an integer: 10'
>>> "This is an integer: %4d" % 10
'This is an integer:   10'
>>> "This is an integer: %04d" % 10
'This is an integer: 0010'
>>> "This is a float: %f" % 9.8
'This is a float: 9.800000'
>>> "This is a float: %.2f" % 9.8
'This is a float: 9.80'
>>>
Monday, September 9, 13
string % tuple
To convert multiple values, use a tuple on the right.
(Tuple because it can be heterogeneous)
Objects are extracted left to right. First % gets the first
element in the tuple, second % gets the second, etc.
>>> "Name: %s, age: %d, language: %s" % ("Andrew", 33, "Python")
'Name: Andrew, age: 33, language: Python'
>>>
Monday, September 9, 13
string % dictionary
When the right side is a dictionary, the left side must
include a name, which is used as the key.
>>> d = {"name": "Andrew",
... "age": 33,
... "language": "Python"}
>>>
>>> print "%(name)s is %(age)s years old. Yes, %(age)s." % d
Andrew is 33 years old. Yes, 33.
>>>
Monday, September 9, 13
Writing files
Opening a file for writing is very similar to
opening one for reading.
>>> infile = open("sequences.seq")
>>> outfile = open("sequences_small.seq", "w")
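A short sketch of writing with the file's write method, in Python 3. The records are made up and the file goes into a temporary directory so nothing in the working directory is overwritten.

```python
import os
import tempfile

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sequences_small.seq")

outfile = open(path, "w")
for name, seq in [("seq1", "GATTACA"), ("seq2", "ATGGCC")]:
    # Unlike print, write() adds no spaces or newline of its own.
    outfile.write("%s\t%s\n" % (name, seq))
outfile.close()  # make sure buffered data reaches the disk

with open(path) as infile:
    contents = infile.read()
print(contents)
```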
Monday, September 9, 13
The write method
% cat show_args.py
import sys
print sys.argv
% python show_args.py
['show_args.py']
% python show_args.py 2 3
['show_args.py', '2', '3']
% python show_args.py "Hello, World"
['show_args.py', 'Hello, World']
%
Monday, September 9, 13
Parsing options
from optparse import OptionParser
[...]
parser = OptionParser()
parser.add_option("-f", "--file", dest="filename",
                  help="write report to FILE", metavar="FILE")
parser.add_option("-q", "--quiet",
                  action="store_false", dest="verbose", default=True,
                  help="don't print status messages to stdout")
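To show what the parser produces, the sketch below repeats the two options and then calls parse_args. It passes an explicit argument list (made up for the example) rather than reading sys.argv, so it runs the same way anywhere. Written for Python 3, where optparse is still available.

```python
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-f", "--file", dest="filename",
                  help="write report to FILE", metavar="FILE")
parser.add_option("-q", "--quiet",
                  action="store_false", dest="verbose", default=True,
                  help="don't print status messages to stdout")

# With no argument, parse_args() would use sys.argv[1:] instead.
options, args = parser.parse_args(["-f", "report.txt", "-q", "input.seq"])

print(options.filename)  # value given to -f
print(options.verbose)   # -q stores False; the default is True
print(args)              # whatever is left over after the options
```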
Monday, September 9, 13
Algorithmic complexity
Monday, September 9, 13
Big O notation
• In discussing the resource usage of algorithms, it is often
useful to consider asymptotic behavior (e.g. on very large
datasets), rather than every detail
Monday, September 9, 13
Big O notation
• Formally, we say f(x)=O(g(x)) if there is some constant K such that
f(x) < K*g(x) for all large enough x (x>x0)
[Figure: curves f(x) and K*g(x), with x0 marked on the x axis]
It comes down to how fast the function f(x) grows for large x
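A numeric sanity check of the definition, in Python 3. The functions and the constants K = 4 and x0 = 5 are chosen for illustration: f(x) = 3x^2 + 5x + 2 is O(x^2) because f(x) < 4*x^2 once x is larger than 5. Checking a finite range is a demonstration, not a proof.

```python
def f(x):
    return 3 * x**2 + 5 * x + 2


def g(x):
    return x**2


K = 4
x0 = 5

# The bound holds for every integer x with x0 < x <= 10000 ...
holds_beyond_x0 = all(f(x) < K * g(x) for x in range(x0 + 1, 10001))
# ... but not at x0 itself, which is why the definition says "large enough x".
holds_at_x0 = f(x0) < K * g(x0)

print(holds_beyond_x0, holds_at_x0)
```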
Monday, September 9, 13
Big O notation
• We are often specifically interested in using Big O
notation to describe...
Monday, September 9, 13
Simplification rules
Monday, September 9, 13
Some common Big-O formulae
[Figure: growth curves grouped as super-linear, linear, and sub-linear]
Monday, September 9, 13
Sorting and complexity
Monday, September 9, 13
Containers and complexity
Arrays (c.f. Python lists) are constant to access, O(log N) to search (if pre-sorted), O(N log N) to sort, and O(N) to insert/delete
Balanced trees are O(log N) to search and modify, but a little slower in practice than hashtables
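A sketch of these costs using built-in Python containers, in Python 3. The helper name sorted_contains and the data are made up. Python has no built-in balanced tree, so the comparison here is between a sorted list and a hash table (set).

```python
from bisect import bisect_left


def sorted_contains(sorted_items, target):
    """Binary search: O(log N) comparisons on an already sorted list."""
    index = bisect_left(sorted_items, target)
    return index < len(sorted_items) and sorted_items[index] == target


names = ["seq%04d" % i for i in range(1000)]
names.sort()                   # O(N log N)
in_list = sorted_contains(names, "seq0500")
missing = sorted_contains(names, "seq1000")

names.insert(0, "seq")         # O(N): every later element shifts by one
lookup = set(names)            # hash table, as used by set and dict
in_set = "seq0999" in lookup   # constant time on average

print(in_list, missing, in_set)
```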
Monday, September 9, 13