A Majorcan Algebrist in King RNA's Court

Francesc Rosselló
Dept. Mathematics and Computer Science

Research Institute of Health Science (IUNICS)
University of the Balearic Islands
Partially supported by the Spanish DGES and the EU program FEDER, project BFM2003-00771 ALBIOM.

Algebra: comes from the Arabic word al-jabr, which means “combining.”
Algebrista (in old Spanish): a person whose job was to join together the parts of broken bones.
Some algebra in “computational biology”:
• M. L. Reed. “Algebraic structures of genetic algebras.” Bull. Amer. Math. Soc. 64 (1997), 107–130.
• C. Reidys, P. Stadler. “Bio-molecular shapes and algebraic structures.” Computers & Chemistry 20 (1996), 85–94.

Contact structures of biopolymers
RNA molecules and proteins fold into complex 3-dimensional structures that determine their function.

Prediction, study and comparison of 3-dimensional structures are central topics in computational biology.
Comparison of structures on biomolecules of a fixed length is of interest in itself:
• Comparison of suboptimal secondary structures generated by an algorithm on a given RNA molecule.
• Comparison of structures generated by different algorithms on a given RNA molecule or protein.
• Study of sequence-structure maps and phenotype spaces.

Contact structures of biopolymers
Different problems require different levels of detail.
Contact structure: undirected graph on numbered monomers, with arcs representing spatial neighborhood of non-consecutive monomers in the 3-dimensional structure.

RNA secondary structure
One node, (at most) one contact.
Contacts cannot cross each other.
[Figure: secondary structure of Phe-tRNA]

RNA secondary structure... with pseudoknots

RNA tertiary structure

Protein contact structure

Let the algebra enter
[n] := {1, ..., n}
Definition. A contact structure of length n is an undirected graph without multiple edges or self-loops Γ = ([n], Q), for some n ≥ 1, whose arcs {i, j} ∈ Q, called contacts, satisfy the following condition:
i) For every i ∈ [n], {i, i + 1} ∉ Q.
A contact structure has unique bonds when:
ii) For every i ∈ [n], if {i, j}, {i, k} ∈ Q, then j = k.
An RNA secondary structure is a contact structure with unique bonds and:
iii) without pseudoknots: if {i, j}, {k, l} ∈ Q and i < k < j, then i < l < j.

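The three conditions in the definition can be checked mechanically. The following is a minimal sketch, assuming contacts are given as pairs of 1-based node labels; the function names are illustrative and do not come from the talk.

```python
def normalize(contacts):
    """Return the contacts as a set of sorted pairs (i, j) with i <= j."""
    return {tuple(sorted(c)) for c in contacts}


def is_contact_structure(n, contacts):
    """Condition (i) plus the graph requirements: nodes lie in [n],
    there are no self-loops and no contacts between consecutive nodes."""
    return all(1 <= i and j <= n and j - i >= 2 for i, j in normalize(contacts))


def has_unique_bonds(contacts):
    """Condition (ii): every node takes part in at most one contact."""
    nodes = [x for c in normalize(contacts) for x in c]
    return len(nodes) == len(set(nodes))


def has_no_pseudoknots(contacts):
    """Condition (iii): if {i, j}, {k, l} are contacts and i < k < j, then i < l < j."""
    q = normalize(contacts)
    for i, j in q:
        for c in q:
            if c == (i, j):
                continue
            # Try both labelings of the second contact's endpoints.
            for k, l in (c, c[::-1]):
                if i < k < j and not i < l < j:
                    return False
    return True


def is_rna_secondary_structure(n, contacts):
    """Contact structure with unique bonds and without pseudoknots."""
    return (is_contact_structure(n, contacts)
            and has_unique_bonds(contacts)
            and has_no_pseudoknots(contacts))
```

For instance, the nested contacts {1 · 9, 2 · 8, 3 · 6} on [10] pass all three checks, while {1 · 5, 3 · 8} has unique bonds but contains a pseudoknot.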
Some notations:
Contact {i, j}: i · j or j · i
RNAn: set of RNA secondary structures of length n
Un: set of contact structures with unique bonds of length n
Cn: set of contact structures of length n
Primary structure: ordered sequence of monomers as a word over an alphabet

Edit distances for contact structures with
unique bonds
A set of edit operations (e.g., relabeling, deleting and inserting base pairs) and a cost function are fixed. The edit distance is the cost of a minimum-cost sequence of edit operations that transforms one structure into the other.
Computing edit distances is usually hard (e.g., MAX-SNP-hard), even for contact structures of the same length. They require complicated algorithms, they are difficult to analyze formally, and they are time-consuming to compute.
And, after all, they need not be more biologically meaningful than other, simpler metrics (and, according to R. Giegerich, they are an incorrect generalization of string distance metrics).

Topological indices on RNAn
An interesting approach, but (for the moment) it only yields similarity functions. For instance (Benedetti, Morosetti 1996):
Represent an RNA secondary structure by a graph with the loops as nodes and the stems as edges. Then, nodes of degree 2 are omitted.
Let d(i) denote the degree of node i in a graph. The Randić index of a graph G is
\[
\xi(G) = \sum_{\text{edges } \{i,j\}} \frac{1}{\sqrt{d(i)\,d(j)}}
\]
Given two RNA secondary structures, the smaller the difference of their representations’ Randić indices, the closer they are.
Open problem: Get a metric with this approach.

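As a rough illustration of the index itself (not of the loop-and-stem graph construction, which is not reproduced here), the sketch below computes the Randić index of a graph given as an edge list. The function name and the input format are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def randic_index(edges):
    """Sum of 1 / sqrt(d(i) * d(j)) over the edges {i, j} of a simple graph."""
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return sum(1 / sqrt(degree[i] * degree[j]) for i, j in edges)


# A star with three leaves and a path with two edges.
star = [(0, 1), (0, 2), (0, 3)]
path = [(1, 2), (2, 3)]
similarity_gap = abs(randic_index(star) - randic_index(path))
```

The gap between two indices behaves as a similarity score only: two different graphs can share the same index, which is why it does not give a metric directly.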
Mountain representation and distances on RNAn
Given Γ = ([n], Q), for every k ∈ [n] let fΓ(k) be the number of contacts i · j ∈ Q such that i ≤ k < j.
[Figure: mountain representation of Phe-tRNA’s secondary structure]
\[
d^{p}_{\mathrm{mount}}(\Gamma_1,\Gamma_2) = \sqrt[p]{\sum_{k=1}^{n} \bigl| f_{\Gamma_1}(k) - f_{\Gamma_2}(k) \bigr|^{p}}
\]
Problem: Generalize this distance to Un and Cn.

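A minimal sketch of the mountain function and the resulting distance, assuming contacts are pairs of 1-based positions; the names `mountain` and `d_mount` are chosen here for illustration.

```python
def mountain(n, contacts):
    """f(k) = number of contacts i.j with i <= k < j, for k = 1..n."""
    return [sum(1 for i, j in contacts if min(i, j) <= k < max(i, j))
            for k in range(1, n + 1)]


def d_mount(n, q1, q2, p=1):
    """p-th root of the sum of |f1(k) - f2(k)|^p over k = 1..n."""
    f1, f2 = mountain(n, q1), mountain(n, q2)
    return sum(abs(a - b) ** p for a, b in zip(f1, f2)) ** (1 / p)


# The nested structure {1.6, 2.5} on [6] has profile [1, 2, 2, 2, 1, 0].
profile = mountain(6, [(1, 6), (2, 5)])
```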
Normalization and generalization to Un
Zuker et al., 2000
Given Γ = ([n], Q), for every i ∈ [n],
\[
w_\Gamma(i) =
\begin{cases}
1/(j-i) & \text{if } i \cdot j \in Q,\ i < j \\
-1/(i-j) & \text{if } j \cdot i \in Q,\ j < i \\
0 & \text{otherwise}
\end{cases}
\]
Now
\[
f'_\Gamma(k) = \sum_{i=1}^{k} w_\Gamma(i)
\]
and
\[
d_{\mathrm{mount}}(\Gamma_1,\Gamma_2) = \sum_{k=1}^{n} \bigl| f'_{\Gamma_1}(k) - f'_{\Gamma_2}(k) \bigr|
\]
(Still) Open problem: Extend this distance to Cn.

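The normalized version can be sketched in the same way. The code below assumes unique bonds (so that each node gets at most one weight) and uses exact fractions to avoid rounding; the function names are illustrative.

```python
from fractions import Fraction


def weights(n, contacts):
    """w(i) = 1/(j-i) at the left end of a contact, -1/(j-i) at the right end, 0 elsewhere."""
    w = [Fraction(0)] * (n + 1)          # index 0 is unused
    for i, j in contacts:
        lo, hi = min(i, j), max(i, j)
        w[lo] = Fraction(1, hi - lo)
        w[hi] = Fraction(-1, hi - lo)
    return w


def normalized_mountain(n, contacts):
    """f'(k) = w(1) + ... + w(k), for k = 1..n."""
    w = weights(n, contacts)
    profile, total = [], Fraction(0)
    for k in range(1, n + 1):
        total += w[k]
        profile.append(total)
    return profile


def d_mount_normalized(n, q1, q2):
    """Sum over k of |f'_1(k) - f'_2(k)|."""
    f1, f2 = normalized_mountain(n, q1), normalized_mountain(n, q2)
    return sum(abs(a - b) for a, b in zip(f1, f2))
```

With this weighting every contact contributes total area 1 to the profile, so a long stem no longer dominates the comparison.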
Involution representation and distance
Reidys-Stadler, 1996
Sn: symmetric group of permutations of [n].
Given Γ = ([n], Q) ∈ Un,
\[
\pi(\Gamma) = \prod_{i \cdot j \in Q} (i, j) \in S_n
\]
dinv(Γ1, Γ2) = least number of transpositions necessary to represent π(Γ1) · π(Γ2).
dinv is a metric on Un.
What does it measure?

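One way to evaluate this distance on small examples, sketched under the standard fact that a permutation of [n] with c cycles (fixed points included) needs exactly n − c transpositions. The helper names are assumptions made for this illustration.

```python
def involution(n, contacts):
    """Permutation of [n] that swaps the two ends of every contact (index 0 is unused)."""
    perm = list(range(n + 1))
    for i, j in contacts:
        perm[i], perm[j] = j, i
    return perm


def compose(p, q):
    """(p o q)(x) = p(q(x))."""
    return [p[q[x]] for x in range(len(p))]


def min_transpositions(perm):
    """n minus the number of cycles of perm, counting fixed points as cycles."""
    n = len(perm) - 1
    seen = [False] * (n + 1)
    cycles = 0
    for start in range(1, n + 1):
        if not seen[start]:
            cycles += 1
            x = start
            while not seen[x]:
                seen[x] = True
                x = perm[x]
    return n - cycles


def d_inv(n, q1, q2):
    return min_transpositions(compose(involution(n, q1), involution(n, q2)))
```

For example, `d_inv(5, [(1, 3)], [(3, 5)])` is 2, because (1 3)(3 5) is a 3-cycle.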
Given Γ1 = ([n], Q1), Γ2 = ([n], Q2) ∈ Un, their symmetric difference is
Γ1ΔΓ2 = ([n], Q1ΔQ2) ∈ Cn
• Orbit of Γ1ΔΓ2: a connected component of it with some arc (not the isolated nodes).
• Length of an orbit: its number of nodes.
• Closed orbit of Γ1ΔΓ2: a cyclic connected component; all its nodes have degree 2, and its length is always even and ≥ 4.
Θ(m): # of closed orbits of length m.
• Open orbit of Γ1ΔΓ2: all other orbits; they have exactly 2 nodes of degree 1.
Ω(m): # of open orbits of length m.
Ωm: # of open orbits of length ≥ m.
Result (R, 2003):
\[
d_{\mathrm{inv}}(\Gamma_1,\Gamma_2) = |Q_1 \Delta Q_2| - 2 \sum_{m \geq 4} \Theta(m)
\]
Open problem: Generalize this metric to Cn.

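The orbit description can be explored with a small sketch that classifies the connected components of Γ1ΔΓ2 and evaluates |Q1ΔQ2| minus twice the number of closed orbits. The factor 2 is what a direct cycle count of π(Γ1) · π(Γ2) gives on small cases: an open orbit with e arcs yields one (e + 1)-cycle, while a closed orbit of length 2k yields two k-cycles. All names below are illustrative.

```python
from collections import defaultdict


def orbits(q1, q2):
    """Return (Q1 delta Q2, lengths of closed orbits, lengths of open orbits)."""
    norm = lambda q: {tuple(sorted(c)) for c in q}
    sym = norm(q1) ^ norm(q2)
    adj = defaultdict(set)
    for i, j in sym:
        adj[i].add(j)
        adj[j].add(i)
    seen, closed, opened = set(), [], []
    for start in adj:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:                      # depth-first search of one component
            x = stack.pop()
            component.append(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if all(len(adj[x]) == 2 for x in component):
            closed.append(len(component))
        else:
            opened.append(len(component))
    return sym, sorted(closed), sorted(opened)


def d_inv_from_orbits(q1, q2):
    sym, closed, _ = orbits(q1, q2)
    return len(sym) - 2 * len(closed)
```

On Q1 = {1 · 3, 5 · 7} and Q2 = {3 · 5, 1 · 7} the symmetric difference is a single closed orbit of length 4, and the value is 4 − 2 = 2, matching the product (1 5)(3 7).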
Subgroup representation and distance
Reidys-Stadler, 1996
Given Γ = ([n], Q) ∈ Un,
\[
G(\Gamma) = \bigl\langle \{ (i, j) \mid i \cdot j \in Q \} \bigr\rangle \in \mathrm{Sub}(S_n)
\]
\[
d_{\mathrm{sgr}}(\Gamma_1,\Gamma_2) = \log_2 \left| \frac{G(\Gamma_1) \cdot G(\Gamma_2)}{G(\Gamma_1) \cap G(\Gamma_2)} \right|
\]
dsgr is a metric on Un.
What does it measure?
Result (R, 2003):
dsgr(Γ1, Γ2) = |Q1ΔQ2|
Problem: Generalize this representation and metric to Cn. (For a solution, wait a few minutes.)

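A short counting argument makes the result plausible. It is only a sketch, and it assumes that the bars denote the quotient of the two cardinalities. For Γ ∈ Un the generating transpositions are pairwise disjoint, so G(Γ) is an elementary abelian 2-group with one factor per contact, and an element lying in both groups must be a product of transpositions common to Q1 and Q2.

```latex
\begin{align*}
|G(\Gamma_k)| &= 2^{|Q_k|}, \qquad
|G(\Gamma_1) \cap G(\Gamma_2)| = 2^{|Q_1 \cap Q_2|}, \\
|G(\Gamma_1) \cdot G(\Gamma_2)|
  &= \frac{|G(\Gamma_1)|\,|G(\Gamma_2)|}{|G(\Gamma_1) \cap G(\Gamma_2)|}
   = 2^{|Q_1| + |Q_2| - |Q_1 \cap Q_2|}, \\
\log_2 \frac{|G(\Gamma_1) \cdot G(\Gamma_2)|}{|G(\Gamma_1) \cap G(\Gamma_2)|}
  &= |Q_1| + |Q_2| - 2\,|Q_1 \cap Q_2| = |Q_1 \Delta Q_2|.
\end{align*}
```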
Matrix representation and distance
Magarshak 1993, Reidys-Stadler 1996
Given Γ = ([n], Q) ∈ Un,
\[
S_\Gamma =
\begin{pmatrix}
s_{1,1} & \dots & s_{1,n} \\
\vdots & \ddots & \vdots \\
s_{n,1} & \dots & s_{n,n}
\end{pmatrix}
\in GL(n, \mathbb{Q}),
\]
where
\[
s_{i,j} =
\begin{cases}
-1 & \text{if } i \neq j \text{ and } i \cdot j \in Q \\
1 & \text{if } i = j \text{ and } i \cdot l \notin Q \text{ for every } l \\
0 & \text{otherwise}
\end{cases}
\]
Properties:
i) SΓ⁻¹ = SΓ.
ii) (Magarshak) Eigenvalues “correspond” to Watson-Crick compatible sequences:
If we represent Γ’s primary structure as a column vector x = (x1, ..., xn) ∈ ℂⁿ using
A ↦ i, C ↦ −1, G ↦ 1, U ↦ −i
then for every RNA molecule b = b1b2...bn of length n and for every Γ = ([n], Q) ∈ Un, if x ∈ ℂⁿ is the vector representing b, then SΓ ∘ x = x if and only if b is compatible with Γ, in the sense that if j · k ∈ Q, then either {bj, bk} = {A, U} or {bj, bk} = {C, G}.

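Both properties are easy to check on a toy example. The sketch below builds the matrix with plain Python lists and complex numbers; the function names and the example sequence are assumptions made for illustration.

```python
# Encoding of the bases used in property (ii).
CODE = {"A": 1j, "C": -1, "G": 1, "U": -1j}


def magarshak_matrix(n, contacts):
    """-1 at (i, j) and (j, i) for each contact i.j, 1 on the diagonal of unpaired nodes."""
    s = [[0] * n for _ in range(n)]
    paired = set()
    for i, j in contacts:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
        paired.update((i, j))
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    return s


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def is_fixed(sequence, contacts):
    """True when S x = x for the vector x encoding the sequence."""
    s = magarshak_matrix(len(sequence), contacts)
    x = [CODE[base] for base in sequence]
    return [sum(a * b for a, b in zip(row, x)) for row in s] == x


contacts = [(1, 5), (2, 4)]
s = magarshak_matrix(5, contacts)
identity = [[int(i == j) for j in range(5)] for i in range(5)]
```

Here `mat_mul(s, s) == identity` illustrates property i), and `is_fixed("GAUUC", contacts)` holds because positions 1 · 5 carry G, C and positions 2 · 4 carry A, U.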
Transfer matrix: TΓ1,Γ2 = SΓ2 ∘ SΓ1
A metric on Un can be defined through
(Γ1, Γ2) ↦ ‖TΓ1,Γ2‖,
with ‖ · ‖ some length function on GL(n, ℚ).
For instance, taking
‖A‖ = rank(A − Id),
we have
dmag(Γ1, Γ2) = rank(TΓ1,Γ2 − Id)
Result (CMR, 2002):
dmag(Γ1, Γ2) = dinv(Γ1, Γ2)

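The rank-based distance can be computed exactly with rational Gaussian elimination. This is a minimal sketch with hypothetical helper names, not the authors' implementation.

```python
from fractions import Fraction


def s_matrix(n, contacts):
    """Magarshak-style matrix: -1 for contacts, 1 on the diagonal of unpaired nodes."""
    s = [[0] * n for _ in range(n)]
    paired = set()
    for i, j in contacts:
        s[i - 1][j - 1] = s[j - 1][i - 1] = -1
        paired.update((i, j))
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    return s


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rank(matrix):
    """Rank over the rationals by row reduction."""
    m = [[Fraction(v) for v in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            if factor:
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r


def d_mag(n, q1, q2):
    """rank(T - Id) with T = S_2 o S_1."""
    t = mat_mul(s_matrix(n, q2), s_matrix(n, q1))
    return rank([[t[i][j] - (i == j) for j in range(n)] for i in range(n)])
```

On Q1 = {1 · 3} and Q2 = {3 · 5} with n = 5 this gives 2, the same value as the number of transpositions needed for (1 3)(3 5).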
Let us enter
Matrix model and metric for non-W-C pairs
CMR, 2002
We introduced a systematic way to produce Magarshak-like matrix representations of contact structures with unique bonds and any complementarity relation on monomers.
We had to work on a commutative quasi-semiring (qs-ring) Mm extending a finite field F_{2^m}, constructed “algorithmically.”
A qs-ring is an algebraic structure (X, +, ∗, 0, 1) where (X, +, 0) and (X, ∗, 1) are commutative monoids, and 0 ∗ x = x ∗ 0 = 0 for every x ∈ X (but no distributivity).
We proved that there is no solution to the problem on richer algebraic structures (no semiring, no lattice, no qs-extension of other fields).

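Given Γ = ([n], Q), we always define
\[
S_\Gamma =
\begin{pmatrix}
s_{1,1} & \dots & s_{1,n} \\
\vdots & \ddots & \vdots \\
s_{n,1} & \dots & s_{n,n}
\end{pmatrix}
\in GL(n, \mathbb{F}_{2^m})
\]
where
\[
s_{j,k} =
\begin{cases}
\alpha + 1 & \text{if } j \neq k \text{ and } j \cdot k \in Q \\
0 & \text{if } j \neq k \text{ and } j \cdot k \notin Q \\
\alpha & \text{if } j = k \text{ and } j \cdot l \in Q \text{ for some } l \\
1 & \text{if } j = k \text{ and } j \cdot l \notin Q \text{ for every } l
\end{cases}
\]
with α any generator of the cyclic group (F_{2^m} − {0}, ∗, 1).
SΓ⁻¹ = SΓ
Let TΓ1,Γ2 = SΓ2 ∘ SΓ1.
• The product of the elements of the main diagonal of TΓ1,Γ2 is α^(2|Q1ΔQ2|).
• rank(TΓ1,Γ2 − Id) = dinv(Γ1, Γ2).
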
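The involution property can be checked by hand on one contact. As a sketch, assume Γ has unique bonds, so that SΓ is block diagonal up to reordering, with one 2 × 2 block per contact j · k and a 1 on the diagonal for each unpaired node. Over a field of characteristic 2:

```latex
\[
\begin{pmatrix} \alpha & \alpha + 1 \\ \alpha + 1 & \alpha \end{pmatrix}^{2}
=
\begin{pmatrix}
\alpha^{2} + (\alpha + 1)^{2} & 2\alpha(\alpha + 1) \\
2\alpha(\alpha + 1) & \alpha^{2} + (\alpha + 1)^{2}
\end{pmatrix}
=
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
\]
since $(\alpha + 1)^{2} = \alpha^{2} + 1$ and $2 = 0$ in $\mathbb{F}_{2^m}$.
```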
How to obtain Mm? (m ≥ 2)
Assume we allow RNA bases A, C, G, U, X, K (X = xanthine and K = 2,6-diaminopyrimidine) and A · U, C · G, G · U, and X · K base pairings.
The carrier set of the qs-ring Mm is the (disjoint) union of F_{2^m} and a finite set
\[
\begin{aligned}
X_m ={} & \{a, a_1, a_2, \dots, a_{2^m-2}\} \cup \{c, c_1, c_2, \dots, c_{2^m-2}\} \\
& \cup \{g, g_1, g_2, \dots, g_{2^m-2}\} \cup \{u, u_1, u_2, \dots, u_{2^m-2}\} \\
& \cup \{x, x_1, x_2, \dots, x_{2^m-2}\} \cup \{k, k_1, k_2, \dots, k_{2^m-2}\}
\end{aligned}
\]

The product ∗ on Mm is defined as follows:
• ∗ is taken to be commutative;
• the product of elements of F_{2^m} is the usual one;
• 0 ∗ x = 0 and 1 ∗ x = x for every x ∈ Xm;
• x ∗ y = 0 for every x, y ∈ Xm;
• α ∗ y = y_1, α ∗ y_1 = y_2, ..., α ∗ y_{2^m−2} = y for y = a, c, g, u, x, k;
• if λ = α^l ∈ F_{2^m}, 2 ≤ l ≤ 2^m − 2, then
\[
\lambda * x = \underbrace{\alpha * (\alpha * \cdots (\alpha}_{l} * x) \cdots ) \quad \text{for every } x \in X_m .
\]

The sum + on Mm is defined as follows:
• + is taken to be commutative, the sum of elements of F_{2^m} is the usual one, and λ + x = x for every λ ∈ F_{2^m} and x ∈ Xm;
• if β = α^{ℓ_m},
\[
\begin{aligned}
a_1 + u_{\ell_m} &= u_{\ell_m} + a_1 = a & u_1 + a_{\ell_m} &= a_{\ell_m} + u_1 = u \\
u_1 + g_{\ell_m} &= g_{\ell_m} + u_1 = u & g_1 + u_{\ell_m} &= u_{\ell_m} + g_1 = g \\
g_1 + c_{\ell_m} &= c_{\ell_m} + g_1 = g & c_1 + g_{\ell_m} &= g_{\ell_m} + c_1 = c \\
x_1 + k_{\ell_m} &= k_{\ell_m} + x_1 = x & k_1 + x_{\ell_m} &= x_{\ell_m} + k_1 = k
\end{aligned}
\]
• all other sums x + y with x, y ∈ Xm are defined to be equal to a fixed element in Xm different from a, c, g, u, x, k, a_1, c_1, g_1, u_1, x_1, k_1, a_{ℓ_m}, c_{ℓ_m}, g_{ℓ_m}, u_{ℓ_m}, x_{ℓ_m}, k_{ℓ_m}.

Matrix representation and metric for Cn
Matrix representations seem useful in the representation of the temporal evolution of structures, e.g., RNA structure folding (Magarshak, H. González).
They could provide a generalization of dinv to Cn.
Work in (very slow) progress.

Edge ideal representation and metric for Cn
LR, 2003
They generalize the subgroup representation and metric.
Some notations:
M(x): monomials in x1, ..., xn.
M(x)^(m): monomials in x1, ..., xn of total degree m.
M(x)_m: monomials in x1, ..., xn of total degree ≤ m.

Given Γ = ([n], Q) ∈ Cn,
IΓ = ⟨{xi xj | i · j ∈ Q}⟩ ⊆ F2[x]
IΓ is the edge ideal of Γ; it characterizes Γ.
πm(IΓ) = IΓ/⟨M(x)^(m)⟩
Γ ↦ πm(IΓ) is injective on Cn for m ≥ 3.
π3(IΓ) ≅ G(Γ) as groups if Γ ∈ Un.

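\[
d'_m(\Gamma_1,\Gamma_2) = \log_2 \left| \frac{\pi_m(I_{\Gamma_1}) + \pi_m(I_{\Gamma_2})}{\pi_m(I_{\Gamma_1}) \cap \pi_m(I_{\Gamma_2})} \right|
\]
d′m is a metric on Cn for every m ≥ 3.
Taking
\[
H_I(m) = \bigl| \{ x^{\alpha} \in M(x)_m \mid x^{\alpha} \notin I \} \bigr|
\]
(the Hilbert function of I),
\[
d'_m(\Gamma_1,\Gamma_2) = H_{I_{\Gamma_1}}(m-1) + H_{I_{\Gamma_2}}(m-1) - 2 H_{I_{\Gamma_1} \cup I_{\Gamma_2}}(m-1)
\]
Computable with CoCoA, Macaulay, and other computer algebra systems.
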
If Γ0 = ([n], ∅) and Γ1 = ([n], {i · j}), then
\[
d'_m(\Gamma_0,\Gamma_1) = \binom{n+m-3}{n}
\]
It should be 1. Thus, we normalize d′m by this value:
\[
d_m = \frac{1}{\binom{n+m-3}{n}}\, d'_m \quad \text{on } C_n
\]
Now dm+1(Γ1, Γ2) ≤ dm(Γ1, Γ2).

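For small n and m the metric can be evaluated without a computer algebra system, by counting monomials directly. The sketch below relies on the fact that a monomial lies in an edge ideal exactly when it is divisible by some xi xj with i · j a contact; the names `hilbert` and `d_m` are chosen here, and contacts are assumed to be given as sorted pairs.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb


def hilbert(n, contacts, max_degree):
    """Number of monomials in x_1..x_n of total degree <= max_degree outside the edge ideal."""
    count = 0
    for degree in range(max_degree + 1):
        # A monomial of this degree is a multiset of variable indices.
        for monomial in combinations_with_replacement(range(1, n + 1), degree):
            support = set(monomial)
            if not any(i in support and j in support for i, j in contacts):
                count += 1
    return count


def d_m(n, m, q1, q2):
    """Normalized edge ideal distance, computed from Hilbert functions at m - 1."""
    q1, q2 = set(q1), set(q2)
    raw = (hilbert(n, q1, m - 1) + hilbert(n, q2, m - 1)
           - 2 * hilbert(n, q1 | q2, m - 1))
    return Fraction(raw, comb(n + m - 3, n))
```

With n = 4, the structures {1 · 3} and {2 · 4} are at distance 2 for m = 3 and 28/15 for m = 5, in line with the inequality dm+1 ≤ dm. The enumeration grows quickly with n and m, so this is only practical for toy cases.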
Some computations
m = 3 on Cn:
d3(Γ1, Γ2) = |Q1ΔQ2|
(it generalizes dsgr!)
m = 4 on Cn:
A(Γ) = |{{i · j, j · k} ⊆ Q | i ≠ k}|
T(Γ) = |{{i, j, k} ⊆ [n] | i · j, j · k, i · k ∈ Q}|
\[
d_4(\Gamma_1,\Gamma_2) = |Q_1 \Delta Q_2| - \frac{1}{n+1} \Bigl( 2A(\Gamma_1 \cup \Gamma_2) - A(\Gamma_1) - A(\Gamma_2) - 2T(\Gamma_1 \cup \Gamma_2) + T(\Gamma_1) + T(\Gamma_2) \Bigr)
\]

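m = 4 on Un:
\[
d_4(\Gamma_1,\Gamma_2) = |Q_1 \Delta Q_2| - \frac{2}{n+1} \bigl( |Q_1 \Delta Q_2| - \Omega_2 \bigr)
\]
m = 5 on Un:
\[
d_5(\Gamma_1,\Gamma_2) = |Q_1 \Delta Q_2| - \frac{1}{\binom{n+2}{2}} \Bigl( 2(n-1) \bigl( |Q_1 \Delta Q_2| - \Omega_2 \bigr) + 2\binom{|Q_1 \cup Q_2|}{2} - \binom{|Q_1|}{2} - \binom{|Q_2|}{2} + 2 \bigl( \Omega_3 + \Theta(4) \bigr) \Bigr)
\]
d5 depends not only on the structure of Q1ΔQ2, but also on |Q1 ∩ Q2|.
(Ωm: number of open orbits of length ≥ m; Θ(m): number of closed orbits of length m)
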
More info on edge ideal metrics
• Each dm, 3 ≤ m ≤ n, uses new invariants of Γ1 ∪ Γ2.
• If Γ1 ≠ Γ2, then
\[
\lim_{m \to \infty} d_m(\Gamma_1,\Gamma_2) =
\begin{cases}
0 & \text{if } \Gamma_1, \Gamma_2 \text{ are non-empty} \\
1 & \text{if either } \Gamma_1 \text{ or } \Gamma_2 \text{ is empty}
\end{cases}
\]
• (AJ) For every m there exist a, b ∈ ℝ such that dm+1 ≈ a·dm + b on Un, with high correlation (a new property of Hilbert functions?).
• If Γ*1, Γ*2 ∈ Cn+1 are obtained from Γ1, Γ2 by adding node n + 1 as an isolated node to both of them, then dm(Γ1, Γ2) ≤ dm(Γ*1, Γ*2).

Other ideal-based representations and
metrics for Cn
Given Γ = ([n], Q), its clique ideal
JΓ ⊆ F2[x1, ..., xn]
is generated by the set of monomials consisting of one square-free monomial x_{i1} ··· x_{ik} for each non-trivial maximal clique (maximal complete subgraph) {i1, ..., ik}, with k ≥ 2, of Γ.
If Γ ∈ Un, then JΓ = IΓ, but if Γ does not have unique bonds they can be different.
Example: Γ = ([5], {1 · 3, 3 · 5, 1 · 5}),
IΓ = ⟨x1x3, x3x5, x1x5⟩, JΓ = ⟨x1x3x5⟩
Of course, it is difficult to compute.

Given a primary structure b = b1...bn and a contact structure Γ = ([n], Q) on it, let
KΓ ⊆ F2[a1, c1, g1, u1, ..., an, cn, gn, un]
be the ideal generated by:
• for every i · j ∈ Q, if bi = x and bj = y, we have a generator x_i y_j;
• if the node i is isolated and bi = x, we have a generator x_i².
KΓ characterizes the primary and secondary structure simultaneously.

Now, the ideals JΓ and KΓ can be used to define metrics in the same way as IΓ:
πm(JΓ) = JΓ/⟨M(x)^(m)⟩ for m = n + 1
πm(KΓ) = KΓ/⟨M(x)^(m)⟩ for m ≥ 3
\[
d''_{n+1}(\Gamma_1,\Gamma_2) = \log_2 \left| \frac{\pi_{n+1}(J_{\Gamma_1}) + \pi_{n+1}(J_{\Gamma_2})}{\pi_{n+1}(J_{\Gamma_1}) \cap \pi_{n+1}(J_{\Gamma_2})} \right|
\]
\[
d'''_m(\Gamma_1,\Gamma_2) = \log_2 \left| \frac{\pi_m(K_{\Gamma_1}) + \pi_m(K_{\Gamma_2})}{\pi_m(K_{\Gamma_1}) \cap \pi_m(K_{\Gamma_2})} \right|
\]
They are metrics. And so on...
Work in progress... stopped for the moment.

Many open problems
(Some of us are working on some of them)
• Further study of these metrics (diameters, balls, other correlations).
• Biochemical meaning of these distances?
• Other models and distances?
• VIP!!! Generalization of everything to structures of variable length
• VIP??? Similar models of 3-dimensional structures of lipids, from scratch
• Applications

Some references: ours
• J. Casasnovas, J. Miró, F. Rosselló. On the algebraic representation of RNA secondary structures with G·U pairs. Journal of Mathematical Biology 47 (2003), 1–22.
• M. Llabrés, F. Rosselló. A new family of metrics for biopolymer contact structures. Computational Biology and Chemistry 28 (2004), 21–37.
• F. Rosselló. On Reidys and Stadler’s metrics for RNA secondary structures. Mathematical and Computer Modelling (to appear, 2004).
• F. Rosselló. Comparing biomolecular contact structures: the algebraic way. Chapter n in Recent advances in biomolecular mathematics (to appear, Kronos Publ., 2004).

References: others’
• I. Hofacker, P. Stadler. Modeling RNA folding. To appear. See Univ. Wien TBI Preprint No. BIOINF 03-013.
• V. Moulton, M. Zuker, M. Steel, R. Pointon, D. Penny. Metrics on RNA secondary structures. Journal of Computational Biology 7 (2000), 277–292.
• C. Reidys, P. Stadler. Bio-molecular shapes and algebraic structures. Computers & Chemistry 20 (1996), 85–94.
• P. Schuster, P. Stadler. Discrete models of biopolymers. To appear. See also Univ. Wien TBI Preprint No. pks-99-012.
