Download as pdf or txt
Download as pdf or txt
You are on page 1of 3

Numerical project (Problem set 7):

Information in sequence data


BIO-369
Prof. Anne-Florence Bitbol
EPFL

This numerical project will be graded and count for 40% of your final mark. Each student should
hand in their personal solution as a Jupyter notebook using Python 3, by sending it by email to
both nicola.dietler@epfl.ch and richard.servajean@epfl.ch on May 3 at the latest. Please
clearly label question numbers using markdown cells. Please answer all questions (including those
requiring sentences and not code as an answer) in the same Jupyter notebook, using markdown cells
for text. Please name your Jupyter notebook LASTNAME FirstName.ipynb.
Two problem classes (April 21 and 28) will be dedicated to this project, and during them, you can
ask questions to the teaching assistants and discuss with other students as usual. However, in the end,
you must hand in your personal solution.

1 Loading sequence data and analyzing frequencies


We will start by working on the sequences in the file crp sites.fa, which contains DNA sequences
of various binding sites of the same transcriptional regulator CRP in the Escherichia coli genome. If
you open this file, which is in a format called fasta, you will see that, for each entry, there is first
a header line starting with “>”, and then the next line is the sequence itself as a string of letters A,
T, C and G. These sequences are already aligned, meaning that they all have the same length and
that the first nucleotide in the first sequence has the same role as the first nucleotide in the second
sequence, and so on. We will call these positions in the alignment the columns of the alignment.
We aim to understand the conserved features of CRP binding sites from these examples of different
binding site sequences. For this, we will first compute the frequency of occurrence of each nucleotide
(A, T, C, G) in each column. The frequency of A in the first column is the number of sequences where
there is an A in the first column divided by the total number of sequences in the alignment.

a) In Python, load the file crp sites.fa and extract all the sequences in it as an array of strings.
We will not need the headers for this analysis, so you can discard them.
There are different ways to do this, and for instance you may use the function SeqIO.parse from
Bio, and use list to transform its output into a list. Once you have obtained this list, sequences
without headers can be obtained from it using the method seq, for instance if your list is called
mylist, you can obtain the first sequence as a string from it by using str(mylist[0].seq). Again,
this is just one possible method, there are other possibilities.

b) Next, in order to be able to count the nucleotides in each column, transform the array of strings
you obtained into a Numpy array of integer numbers, by using the mapping that A is 0, C is 1, G
is 2 and T is 3.
Again, there are different ways to do this, but one of them is to define the mapping as a Python
dictionary using dict, and then to use it for each character in each row of the array of strings to
replace it by the appropriate integer number.
c) Now compute the number of times each nucleotide appears in the first column of the alignment,
and deduce the corresponding frequencies.

1
d) Produce a Numpy array called frequencies that contains the frequency of each nucleotide in each
column of the alignment. Each column should correspond to a nucleotide, and each row should
correspond to a column of the alignment.
e) Now we will represent the information in the frequencies graphically, by using logomaker [1]. For
this, first install logomaker e.g. from PyPI by executing pip install logomaker at the command
line. Next, execute the following Python commands in your Jupyter notebook:
import pandas as pd
import logomaker as lm
%matplotlib inline
plt.ion()
frequencies pd = pd.DataFrame(data=frequencies,columns=["A","C","G","T"])
lm.Logo(frequencies pd)
You should now see a sequence logo, where at each site (in each column) the height of each letter
is equal to the frequency of the corresponding nucleotide, and the most frequent nucleotide is on
top, the next frequent one just below, and so on.
f) Using the logo you just produced, compare columns 0 and 5 of the alignment: what is the main
difference between them? Which one should have the largest entropy? Which one is more con-
served?

g) The consensus sequence is the sequence that has the most frequent nucleotide at each site. How
can you see the consensus sequence in the logo you just produced?

2 Entropy and (information) logo


a) Produce a Numpy array called entropies that contains the entropy of each column of the align-
ment, computed from the frequencies determined before. Each row should correspond to a column
of the alignment.
b) Plot entropy versus column index, using a bar plot. What is the maximum value that entropy can
take here? Which distribution would it correspond to? Compare columns 0 and 5 again.

c) Plot the difference between the maximal possible value of entropy and the actual value of entropy
versus column index, using a bar plot. We will call this quantity the “height” of each column of the
alignment, as it will become the height in a new logo we will construct later. How is this quantity
related to conservation? What are its minimum and maximum possible values? Compare columns
0 and 5 again.

d) Produce a Numpy array called products that contains, for each nucleotide and each column of the
alignment, the product of the frequency of each nucleotide in each column of the alignment and of
the quantity “height” defined just before for each column of the alignment. Each column should
correspond to a nucleotide, and each row should correspond to a column of the alignment.

e) Now we will make a new graphical representation of the data, still using logomaker [1]. For this,
execute the following Python commands in your Jupyter notebook:
products pd = pd.DataFrame(data=products,columns=["A","C","G","T"])
lm.Logo(products pd)
You should now see a new sequence logo, where at each site (in each column) the height of each
letter is proportional to the frequency of the corresponding nucleotide, and the most frequent
nucleotide is on top, the next frequent one just below, and so on. But now, the total height of each
column (the total height of the stack of four letters) is the quantity that we called “height”. This
is the most common type of sequence logo.
f) In what sense is this new logo a good representation of the data in the alignment? Compare it to
the previous one.

g) Where on the DNA sequences studied here do you think the strongest binding of CRP to DNA
occurs?

2
3 Other datasets
We will now consider two other datasets, which are also sequence alignments in fasta format.

a) Consider the file h3 trunk.fna, which contains an alignment of DNA sequences coding for influenza
A H3N2 hemagglutinin. Use the code you wrote above (by copying it) to load the data and produce
the two types of logos studied so far for this alignment. For better visualization, you can focus on
a subset of columns, e.g. 50 consecutive ones, since this alignment has more columns than the one
studied so far. Compare the logos obtained to those obtained for CRP binding sites and discuss.
b) Now consider the file HK HisKA.fasta, which contains an alignment of histidine kinase amino acid
sequences (focusing on the HisKA conserved domain). What modifications will you need to make to
the code above to compute frequencies and entropies in this data? What is the maximum entropy
possible here? Note in addition that this alignment comprises some gaps (-) that correspond to
insertions/deletions. To compute frequencies and entropies, you can just consider gaps as an extra
character. Make these modifications and compute frequencies and entropies for this alignment.
c) This question is still about the file HK HisKA.fasta. Produce the two types of logos studied so
far for this alignment. Note that in logomaker, gaps should be ignored, and thus you should
remove the data corresponding to gaps from the tables used as input to logomaker. For better
visualization, focus on the first 30 consecutive columns of this alignment. What do you observe
at site (column) number 10? And around it? Recall that these proteins are histidine kinases, and
comment on the importance of the amino acid at site number 10 for the function of these proteins.

References
[1] A. Tareen and J. B. Kinney. Logomaker: beautiful sequence logos in Python. Bioinformatics,
36(7):2272–2274, 04 2020.

You might also like