Download as pdf or txt
Download as pdf or txt
You are on page 1of 13

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO.

3, MAY/JUNE 2015 507

Product Grammars for Alignment and Folding


€ner zu Siederdissen, Ivo L. Hofacker, and Peter F. Stadler
Christian Ho

Abstract—We develop a theory of algebraic operations over linear and context-free grammars that makes it possible to combine
simple “atomic” grammars operating on single sequences into complex, multi-dimensional grammars. We demonstrate the utility
of this framework by constructing the search spaces of complex alignment problems on multiple input sequences explicitly as
algebraic expressions of very simple one-dimensional grammars. In particular, we provide a fully worked frameshift-aware,
semiglobal DNA-protein alignment algorithm whose grammar is composed of products of small, atomic grammars. The compiler
accompanying our theory makes it easy to experiment with the combination of multiple grammars and different operations.
Composite grammars can be written out in LA TE X for documentation and as a guide to implementation of dynamic programming
algorithms. An embedding in Haskell as a domain-specific language makes the theory directly accessible to writing and using
grammar products without the detour of an external compiler. Software and supplemental files available here: http://www.bioinf.
uni-leipzig.de/Software/gramprod/

Index Terms—linear grammar, context free grammar, product structure, multiple alignment, Haskell

1 INTRODUCTION

T HE well-known dynamic programming (DP) algorithms


for the simultaneous alignment of n sequences [1] have
a structure that is reminiscent of topological product struc-
only time-consuming but also error prone. The framework of
algebraic dynamic programming could improve this situa-
tion considerably, but still does not provide a satisfactory
tures. This is expressed e.g., by the fact that intermediary solution because even the underlying grammars with nearly
tables are n-dimensional. Here we explore whether this intu- a hundred non-terminals are non-trivial to check. It is impos-
ition can be made precise and operational. To this end we sible to explore variants and refinements of these algorithms
build on the conceptual framework of Algebraic Dynamic without major programming efforts unless ways and means
Programming (ADP) [2], [3]. In this setting a dynamic pro- can be found to construct the underlying grammars in a
gramming algorithm is separated into a context-free gram- modular fashion. Product constructions, as introduced in [8]
mar (CFG) that generates the search space and an and significantly expanded in the present work, are one
evaluation algebra. In this contribution we will mainly be promising approach towards this end.
concerned with a notion of product grammars to facilitate Before we delve into a more formal presentation, consider
the construction of the search space. the context-free grammar for pairwise sequence alignment
Recent advances in RNA folding with pseudoknots [4], with affine gap costs as an example. Gotoh’s algorithm [9]
RNA-RNA interactions [5], [6], or RNA consensus structure uses three non-terminals M, D, I, depending on whether the
prediction [7] have lead to the design of dynamic program- right end of the alignment is a match state, a gap in the first
ming algorithms with dozens of intermediate tables. Their sequence, or a gap in the second sequence. The correspond-
direct implementation in C or C++ is a major effort that is not ing productions are of the form
      
M ! M uv  D uv  I uv  $$
 C. H€oner zu Siederdissen is with the Department of Theoretical Chemistry       
University of Vienna, W€ ahringerstrasse 17, A-1090, Vienna, Austria. D ! M u  D u:  I u (1)
E-mail: choener@tbi.univie.ac.at.      : 
 I.L. Hofacker is with the Department of Theoretical Chemistry, University I! M v D v I v ;
of Vienna, W€ ahringerstrasse 17, A-1090, Vienna, Austria, the Research
Group Bioinformatics and Computational, W€ ahringerstrasse 29 A-1090, where u and v denote terminal symbols, ‘’ corresponds to
Vienna, and the Center for RNA in Technology and Health, University of
gap opening, while ‘:’ denotes the (differently scored) gap
Copenhagen, Grønnega rdsvej 3, Frederiksberg C, Denmark.
E-mail: ivo@tbi.univie.ac.at. extension. The $ here takes the role of the “sentinel charac-
 P.F. Stadler is with the Bioinformatics Group of the Department of Com- ter”, i.e., matches the end of the input. Each of the non-ter-
puter Science and with the Interdisciplinary Center for Bioinformatics of minals reads simultaneously from two separate input tapes.
the University of Leipzig, D-04107, Leipzig, Germany, the Max Planck
Institute for Mathematics in the Sciences, Inselstraße 22, D-04103, Leip- To make this  property
  Xmore
 transparent
 Y  in the notation, we
zig, Germany, the Fraunhofer Institute for Cell Therapy and Immunology, write Mˆ X X
, Dˆ Y
, and Iˆ X
. This yields produc-
Perlickstrasse 1, D-04103, Leipzig, Germany, the Department of Theoreti- tions such as
cal Chemistry of the University of Vienna, W€ ahringerstrasse 17, A-1090,
Vienna, the Center for RNA in Technology and Health, Univ. Copenha- X   u   Xu 

gen, Grønnega rdsvej 3, Frederiksberg C, Denmark, and the Santa Fe Insti- ! X v ’ or
X
Y  X     Xv  (2)
tute, 1399 Hyde Park Road, Santa Fe NM 87501.
X
! Y v ’ Yv :
X X
E-mail: studla@bioinf.uni-leipzig.de.
 
Manuscript received 12 Dec. 2013; revised 29 Apr. 2014; accepted 10 May Apart from the conspicuous absence of YY , i.e., alignments
2014. Date of publication 21 May 2014; date of current version 1 June 2015. ending in an all-gap column, to which we will return later,
For information on obtaining reprints of this article, please send e-mail to:
reprints@ieee.org, and reference the Digital Object Identifier below. this notation strongly suggests to consider the 1-dimen-
Digital Object Identifier no. 10.1109/TCBB.2014.2326155 sional projections of the two-dimensional productions of
1545-5963 ß 2014 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution
requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
508 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015

Eq. (2), which obviously have the form The none symbol. The “none” symbol ("), on the other hand
   is a purely formal symbol that is used here for notational con-
X ! Xu  Yu  $ and Y ! Y:  X  : (3) venience. As a terminal, it denotes the empty string. In multi-
tape algorithms such as Needleman-Wunsch it denotes the
This simple grammar either reads a symbol (non-terminal deletion in an in-del event (by not reading a character). More
X) or it ignores it (non-terminal Y , with ‘:‘ extending a gap, generally, it allows a more compact notation by recasting
‘‘ opening a gap). Each copy of the “step grammar” (3) several rules into a single common format, as in the case of a
operates on its own input tape. This example suggests that non-terminal followed by terminal in left-linear grammars
dynamic programming algorithms for alignment problems or the Greibach normal form, see Section 3.4 below. When-
in general have a product-like structure. Indeed, n-way ever "-only symbols appear in projections (defined below),
alignments can be seen as an n-fold product of the simple they can be savely eliminated. It is possible to avoid ", albeit
step grammar with itself. at the expense of a somewhat more lengthy presentation.
The aim of the present work is more general than align- In the next two sections we will consider in particular
ments. We introduce products of grammars as a very gen- left-linear grammars, i.e., those for which all productions
eral framework to facilitate a more effective design of are of the form A ! Bx with A; B 2 N and x 2 T .
dynamic programming algorithms. This requires that we The example of Gotoh’s algorithm in the introductory
clarify the precise meaning of a product of CFGs. Since section motivates us to introduce algebraic operations on
alignment algorithms are naturally expressed as left-linear grammars in a more systematic way. As a running example,
CFGs we first develop a theory for this special case and we will use one of the simplest alignment algorithms. The
demonstrate in some detail how our framework can facili- Needleman-Wunsch algorithm [11] aligns two sequences
tate the construction of complex alignment algorithms. As a x1...n and y1...m so that the sum of matches and in/del scores
showcase application we consider mixed nucleotide/pro- is maximized. The basic recursion over the memoization
tein alignments with frameshifts. We then proceed to table T reads
explore possibilities to generalize the product construction 8
to context free grammars in general and show that normal >
> Ti1;j þ d
<
forms can be employed to guide such constructions. We Ti;j1 þ d
Tij ¼ max (4)
find that the Greibach normal form (GNF) admits an asso- >
> T þ mðxi ; yj Þ
: i1;j1
ciative product that conserves the normal form but does not 0 if i ¼ 0 and j ¼ 0:
subsume the direct product of linear grammars.
In the recursive scheme, the base case is given by the align-
2 ALGEBRAIC OPERATIONS ON LINEAR ment of two empty substrings “on the left”, while the other
GRAMMARS cases extend the already aligned part of the strings to the
left. This slightly unusual variant of the algorithm was cho-
2.1 Notation sen to be identical to the grammatical description that fol-
A CFG G ¼ ðN; T; P; SÞ consists of a finite set N of non-ter- lows. The first two cases denote an in/del operation with
minals, a finite set T of terminals so that N \ T ¼ ;, a set P cost d, while mð : ; : Þ scores the (mis)match xi with yj .
of productions X ! a where X 2 N and a 2 ðT [ NÞ , and A two-tape grammar equivalent to the recursion in
a start symbol S 2 N. (It will be convenient below to con- Equ. (4) is
sider also grammars without a start symbol S). Further- X  X  a    X  "    X  a    $ 
more, we need at least one special symbol $ denoting the !    : (5)
Y Y " Y a Y a $
empty string, an “empty production” ? and the " symbol
denoting a “none”-symbol. This symbol emits nothing There are several differences between the formulation in
(like $). As a parsing symbol, however, it succeeds always Eq. (4) and Eq. (5). The recursive formulation working on the
(in contrast to $, which only succeeds on empty substrings). memoization table T does not store the alignment directly
Single and multiple tapes. Below we will make use of the but rather the score of each partial, optimal alignment. The
term tape. A (single) tape is an input sequence. Grammars grammatical description, on the other hand, describes the
operating simultaneously on k > 1 input sequences are search space of all possible alignments without any notion of
called multi-tape grammars of dimension k [10]. All termi- scoring. In addition, recursive descriptions usually include
nal Qsymbols of a k-tape grammar are k-dimensional, explicit annotations for
 base
 cases,
 $  here the empty alignment.
t 2 i ðTi [ Þ, where Ti is the alphabet of the ith tape. Non- The production rule X Y
! $
has this role in our example.
terminals are not necessarily tied to individual tapes. It will In general, grammatical descriptions abstract away certain
be convenient in many cases, however, to use “multi- implementation details. Some of these will, however,
dimensional” symbols for the non-terminals to emphasize become important when constructing more complex gram-
their semantics. mars from simpler ones, as we shall see below.
The sentinel terminal. The use of $ is arguably optional as Our task will be to construct Eq. (5) from even simpler,
the rule X ! t with t 2 T also terminates a derivation and it “atomic” constituents. These grammars are
is possible to append a sentinel character as the last charac- 
ter to each input string. While there is no formal requirement S ¼ ðfXg; fag; fX ! Xa  Xg; XÞ ; (6)
for $, an explicit treatment of the terminating case can be
advantageous in practical implementations as it does not
N ¼ ðfXg; f$g; fX ! $g; XÞ ; (7)
require adding a sentinel symbol to the alphabet.
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 509

L ¼ ðfXg; fg; fX ! Xg; XÞ : (8) is empty, i.e., each non-terminal is tagged with a unique
identifier. This is in contrast to our definition of the þ opera-
The rules in Eqs. (6–8) can be associated with single-tape tion, where we explicitly treat symbols as equal if they have
rules for Eq. (5) by writing X ! X as X ! X", which con- the same name.
forms to the intuition built up in the introduction and The  operator. While the þ operator unifies two sets of
Eq. (1). The grammar S in Eq. (6) performs a “step”. It production rules, the  operator acts as a set difference
either reads a single character on the right and recurses on operator
the left, or simply recurses. S describes the action of an  
alignment algorithm as seen by a single tape, i.e., without P1n  P2n ¼ p 2 P1n j p 2
= P2n : (11)
knowledge to what the recursion is doing simultaneously
on other tapes. Note that by itself these rules do not termi- As for þ, it requires operands of the same dimensionality. By
nate. The grammar N , Eq. (7), matches the empty input (or construction,  is not associative. Thus does not form a semi-
any empty substring of the input) and immediately termi- group but merely a magma. The empty set of production
nates. Finally, L (Eq. (8)) singles out the non-terminating rules acts as the neutral element on the right. This operator is
loop case already seen in Eq. (6). Intuitively, we can com- important to explicitly remove production rules that yield
bine these three components on a single tape as infinite derivations. In our example, we need to remove 
fX ! Xg. With the help of  we can write ðX ! Xa  XÞ

S þ N  L ¼ ðfXg; fa; $g; fX ! Xa  $g; XÞ (9) ðX ! XÞ ¼ ðX ! XaÞ. We shall see that it is often conve-
nient to “temporarily” introduce productions that later on
to obtain a grammar that simply reads an input tape. While are excluded again from the final algorithm.
not particularly useful in its own right, S þ N  L is the sin- The  monoid. The definition of a direct product of left lin-
gle-tape analog of an alignment grammar. More impor- ear grammars lies at the heart of this contribution.
tantly, it highlights the convenience of algebraic operations
on grammars, which we shall introduce rigorously below Definition 1. Let G1 ¼ ðN1 ; T1 ; P1 ; S1 Þ and G2 ¼ ðN2 ; T2 ; P2 ;
for linear grammars. S2 Þ be left-linear CFGs, i.e., all productions are of the form
Each operator introduced below primarily acts on sets A ! Bx or A ! y. Their direct product G1  G2 is the gram-
of production rules. They implicitly carry over to the mar G ¼ ðN; T; P; SÞ with non-terminals N ¼ N1  N2 [
involved sets of terminals and non-terminals in an obvious N1  f"g [ f"g  N2 , terminals T ¼ T1  T2 [ T1  f"g[
manner. Two production rules are equivalent if they are f"g  T2 , the start symbol of the product is S ¼ SS12 . The pro-
isomorphic as in Eq. (14). This is of relevance insofar that ductions are of the forms
it leads to idempotency in one of the operators below, but  A1   B1  x1    B  x1    "  y1    y1 
!  1 y  B  y
does not otherwise interfere with parsing.1 In the following A 
A2 B x2 "
 B2  x    y 
2 2 x2 2

we use the notation P n to emphasize that the productions 1 ! 1 1  1 (12)


" "
operate on n tapes. We will refer to dim G ¼ n as the  ""   ""  "    " 
A2 ! B2 x2
 y2 ;
dimension of the grammar.
where A1 ! B1 x1 and A1 ! y1 , are productions in P1 and
2.2 Algebraic Operations on Grammars A2 ! B2 x2 and A2 ! y2 are productions in P2 , respectively.
The þ monoid. The þ operator is defined as the union of all By construction G is again a left-linear CFG that now oper-
production rules of the two grammars P1n and P2n each of ates on two bands. It will be convenient to abuse the notation
dimension n: and write productions of the form Ai ! yi as Ai ! "yi .
P1n þ P2n ¼ P1n [ P2n : (10) Henceall productions
 B1  x1 in the product grammar can be writ-
ten as A 1
A2 ! B2 x2 with Ai ; Bi 2 Ni [ f"g, xi 2 Ti [ f"g
We enforce explicitly that the þ operator requires that the subject to the following
"  "conditions:
 Ai ¼ " implies
two operand grammars have the same dimensionality. The Bi ¼ xi ¼ ", AA2 6¼ " , and " on the r.h.s. is omitted. We
1

þ operation forms a monoid over the set of production will also make use of notation ðA1 ! B1 y1 Þ  ðA2 ! B2 y2 Þ
rules. Since the production rules form a set, isomorphic for the product of two individual productions. By construc-
rules collapse to a single rule. The empty set P n ¼ fg is a tion, we have
neutral element and P n þ P n ¼ P n , i.e., the þ monoid is
idempotent. Isomorphism on production rules is also sym- dimðG1  G2 Þ ¼ dim G1 þ dim G2 : (13)
bolic, that is, X ! X is isomorphic to X ! X but not to The empty string " in the two-dimensional terminals and
fX ! Y; Y ! Xg, even though the latter set of two rules non-terminals is not necessarily associated with terminating
reduces to the first. For our example,
 we have ðX ! Xa  the reading from the input band(s) as it denotes the absence
XÞ þ ðX ! $Þ ¼ ðX ! Xa X $Þ.  
of a parsing symbol. The $ terminal symbol, on the other
Please note that the þ operation has different semantics hand, explicitly parses only the empty (sub)-string.
than the usually encountered union of two grammars. In To see that  is associative we need to demonstrate that
particular, the union g [ h of two grammars is typically the productions of ðG1  G2 Þ  G3 and G1  ðG2  G3 Þ are iso-
defined in such a way that the intersection of non-terminals morphic, i.e.,
  x1     a1    x1   a1 
1. This is not completely true in the context of stochastic linear  x2   a2 
x2 ! a2 ’ ! : (14)
grammars: replication of a rule in an SCFG that already has duplicated x3 a3 x3 a3
rules requires that we sum over the probabilities for isomorphic rules.
510 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015

This is most easily seen in the notation with the extra " sym- This
 X  grammarX contains the two-dimensional loop rule
bols since in this case the ai are strings of length 2 that are X
! X
, derived from ðX ! XÞ  ðX ! XÞ that even-
simply decomposed in a column-wise fashion. Hence multi- tually needs to be eliminated. To this end, it will be conve-
ple products are well-defined. Furthermore, permutations nient to consider yet another operation on productions.
of rows are isomorphisms. Thus G1  G2 ’ G2  G1 , i.e., The structure-preserving power  For any k-dimensional
exchanging the order of factors affects the order of the coor- grammar G and any natural number n 2 Z, G  n denotes
dinates only. Due to the associativity of , we can safely the k  n-dimensional grammar with the same structure.
extend these constructions to more than two factors. One Each k-dimensional (terminal or non-terminal) symbol
easily checks that  and þ are distributive, i.e., ðG1 þ ða1 ; . . . ; ak Þ> is transformed to an k  n-dimensional symbol
G01 Þ  ðG2 þ G02 Þ ¼ G1  G2 þ G1  G02 þ G01  G2 þ G01  G02 . ðða1 . . . ak Þ; . . . ; ða1 . . . ak ÞÞ> . Note that for a grammar with a
The canonical projection pi : G1  G2 ! Gi is obtained by single production rule we have G  G  G  2.
formally isolating the ith coordinate and contracting the For our example grammar, this operation is useful as
empty strings " and the empty productions ? ¼ ð" ! "Þ. short-hand for both Eq. (7) and Eq. (8). In the case of linear
Clearly we have pi ðT Þ ¼ Ti , pi ðNÞ ¼ Ni , pi ðSÞ ¼ Si , and grammars, the  operator is mostly useful as shorthand to
pi ðP Þ ¼ Pi . The grammar product  thus has the basic expand singleton grammars. It is worth noting, however,
properties of a well-defined product. that some algorithms in computational biology, notably the
Let lanðGÞ denote the language generated by G. Note that Sankoff algorithm [12], work on multiple tapes with a gram-
a “string” in lanðGÞ is, by construction, a sequence of mar structured very similar to the one-dimensional rela-
x
terminals, each of which is either of the form ð x12 Þ with tives. We will return to the topic in Section 3.3.
x1
x1 2 T1 and x2 2 T2 , or of the form ð " Þ with x1 2 T1 , or of
"
the form ð x2 Þ with x2 2 T2 . Thus lanðG1  G2 Þ consists of 2.3 The Needleman-Wunsch Alignment Grammar
alignments of strings ai 2 Gi . To see this, note that each We can now construct the full Needleman-Wunsch align-
string ai 2 Gi is generated from si using a finite sequence ment grammar from the much simpler 1-dimensional con-
}i ¼ ðp1i ; p2i ; . . .Þ of productions. Any partial matching of the stituents of Eqs.(6–8) in the following way:
}1 and }2 that preserves the sequential order of the two
input sequences gives rise to a sequence } of productions of NW ¼ S  S þ N  2  L  2 : (18)
the product grammar by matching all unmatched pki with
the empty production ? . By construction pi ð}Þ ¼ }i , i.e., } Written in terms of the productions only, this can be
derives an alignment of the input strings b1 and b2 . Con- rephrased as
versely, given a sequence } of productions of the product
grammar, we know that pi ð}Þ is a sequence of productions ðX ! XajXÞ  ðX ! XajXÞ
of Gi ; hence it constructs strings in lanðGi Þ. It follows that þ ðX ! $Þ  2  ðX ! XÞ  2 (19)
the product language satisfies                
¼ X ! X a  X a  X "  $ :
X X a X " X a $
pi ðlanðG1  G2 ÞÞ ¼ lanðGi Þ: (15)
Again we have used a distinct symbol $ to highlight the ter-
mination case deriving from N . Since our construction of
Similarly, we find that parse trees have a natural align- the Needleman-Wunsch grammar is based on well-defined
ment structure. Let t be a parse tree for an input b 2 algebraic operations we can readily use the same approach
lanðG1  G2 Þ. Its interior nodes are labeled by the produc- to construct much more complex alignment algorithms.
1 !B1 x1 A1 !B1 x1 " Before we proceed, however, we provide with Gotohs algo-
tions, i.e., pairs of the form ð A
A2 !B2 x2 Þ, ð "
Þ, or ð A !B x Þ.
2 2 2
The projections pi ðtÞ are explained by retaining only the ith rithm a more complex example and address the technical
coordinate of the vertex label and contracting all vertices issue of loop rules.
labeled by " in pi ðtÞ yields a valid parse tree for pi ðbÞ w.r.t.
Gi . Thus t is a tree alignment of the parse trees for the two 2.4 Gotoh’s Grammar in Product Form
input strings. The construction of Gotoh’s algorithm (Eq. 1) is a bit more
The direct product  forms a monoid on grammars with complicated than the simple product construction for the
arbitrary dimensions since Needleman-Wunsch grammar. It will make use of the fol-
  lowing simple component grammars, of which grammar S
P1m  P2n ¼ ðp1  p2 Þmþn j pm
1 2 P1 ; p2 2 P2 ;
m n n
(16) (Eq. 24) defines a start symbol:
 
where p1  p2 is explained in Definition 1. The neutral ele- G ¼ ðfX; Y g; fag; fX ! Xa  Ya; Y ! X  Y gÞ ; (20)
ment of the  monoid is the zero-dimensional grammar
which has one production rule "0 ! "0 that neither reads
nor writes anything as it does not operate on a tape. Albeit V ¼ ðfX; Y; Sg; fag; fX ! Ya; Y ! Y; S ! Y gÞ ; (21)
rather artificial at first glance, it is useful to have a neutral
element available. For our example, we have 
W ¼ ðfX; Y g; fg; fY ! X  Y gÞ ; (22)
ðX ! Xa j XÞ  ðX ! Xa j XÞ
    a    X  a    X  "    X  (17)
¼ X ! X    : N ¼ ðfXg; f$g; fX ! $gÞ ; (23)
X X a X " X a X
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 511

 2.5 Grammars with Loops


S ¼ ðfX; Y; Sg; fg; fS ! X  Y g; SÞ ; (24)
In Eq. (18) we explicitly added a terminating base case X ! $
Grammar G encodes the generic “step” and distinguishes and removed a production rule with infinite derivations
between “non-gap” state X and gap state Y . Correspond- X ! X, or, equivalently X ! X". Why do we insist on per-
ingly, the two-tape product G  G generates the core of the forming this operation explicitly instead of modifying the
 definition of the direct product  accordingly?
pairwise Gotoh grammar.  The rules X ! Xa Ya align a
character, while Y ! X  Y denotes a deletion. Gap opening The main reason lies in performance considerations. An
or gap extension costs are assigned to Y ! X and Y ! Y , “intelligent” product operator would first need to deter-
respectively. G  G has the 16 two-tape productions mine which rules have infinite derivations. For linear gram-
mars with only one non-terminal a rule is not infinite if a
X  X  a    X  a    Y  a    Y  a 
!    single terminal (except ") is present. $ rules are also fine, as
X  X  aa    X
Y
 aa    X  aa    YY  aa  long as only the empty word case X ! $ is present. Produc-
X
! X "  Y "  X "  Y "
X Y
 YY    "    X  "    Y  "    Y  "  (25) tions of the form fX ! Y; Y ! Xg, however need to be fol-
! X   
Y 
X
XX
 a  X Y
 a  X  a
 Y a lowed up to a depth of the number of production rules
!   Y  Y : present. For context-free grammars, the complexity will
Y X Y X Y
increase further, as in general multiple non-terminals may
There are seven productions  that involve the semantically exist on the right-hand side. For both convenience and effi-
meaningless non-terminal YY , which would denote a dele- ciency (by a constant factor), it does not seem to be desirable
tion on both tapes. S  S introduces transitions from the start to transform the grammar into Chomsky normal form
symbol. The grammar V  V comprises exactly the rules that (CNF). The second problem is the need for rewriting. In the
ð Þ, while W  W provides the production with case of fX ! Y; Y ! Xg, rewriting yields X ! X by insert-
Y
produce
 Y
Y
Y
on their left hand side. Subtracting these from G  G ing the rules for Y wherever Y is used. More complicated
leaves the well known transitions of Gotoh’s algorithm. It grammars might quite easily require major rewrites before
only remains to add the termination rule N  2. This con- all loop cases can be removed.
struction immediately generalized to arbitrary dimensions: Finally, using looping productions can be conceptually
the k-tape Gotoh grammar can be written as useful during construction. In case of Eq. (6), we either want
to read a character in a “step” X ! Xa or perform an in/
k k k
G þ S  V  W þ N  k:
k
(26) del with a “stand” X ! X. The direct product of Eq. (6)
1 1 1 1 then yields all possibilities of stepping or standing on two
(or more) tapes. Of these cases we only want to remove the
In this notation, alignment grammars for affine gap cost case where all tapes “stand”. This case is quite easily deter-
functions can be generated easily for any fixed dimension k. mined as Eq. (8) and just needs to be scaled (with ) to the
Already for k ¼ 3 this task is quite a bit of chore if it has to correct dimension and subtracted from the complete
be done manually. grammar.
Specialized to k ¼ 2, Eq. (26) evaluates to
S    X  Y 
! X   2.6 Implementation
X
S  X  a    X  a    X
X Y  a    $ 
! X a   Y  We have implemented a small compiler for our grammar
X      
a  X  Y  a    X  a 
a  Y a
$ (27)
X
! X "  Y "  X "
X
product formalism with four output targets. First, we
 YY   X  "   X  "   Y  " 
X
! X a  Y a  X a : generate LA TE X output. This supports researchers in the
  development of complex, multiple dimensional linear
Here, X X
corresponds
  to the
 Y match
 M in Eq. (1), while D is and context-free (in 2-GNF) grammars, facilitates the
replaced by X Y
and I by X
. This notation has the addi- comparison with the intended model for an elaborate
tional advantage that it makes explicit on which tape(s) a alignment-like algorithm. It assists implementation of the
match or insertion takes place, since for k 3 the three non- grammar in the users’ programming language of choice
terminals for match (M), deletion (D), insertion (I)  are
 not as the mathematical description of the recurrences
sufficient. We also include an explicit start symbol SS now. reduces the chance that a production rule or recursion is
The distinction between a gap opening and a gap extension simply forgotten.
is encoded at the level of the tagged rules. Compared to In addition, we directly target the functional program-
Eq. (1) we do not distinguish terminal symbols ‘:’ and ‘’ ming language Haskell [13]. It is possible to emit a Haskell
anymore. Instead we use " to denote a missing character module prototype which then needs to be extended with
without incurring a loss of expressability. user-defined evaluation (scoring) algebras. This mode mir-
It could be argued that both V and W are unnecessary if rors the LA TE X output. Advanced users may make use of
the production rules are somehow tagged as “illegal” in TemplateHaskell [14] and QuasiQuotation [15] to directly
the scoring algebra. We suggest, however, to always embed our domain-specific language as a proper extension
remove unwanted production rules (and non-terminals) of Haskell itself. Both Haskell-based approaches ultimately
explicitly. This ensures that the grammar describes exactly make use of stream fusion optimizations [16] by way of the
the search space under consideration in a formally correct ADPfusion [17] framework that produces efficient code for
manner and reduces the danger of introducing bugs into dynamic programming algorithms.
the implementation by, say, accidentally evaluating an Currently, the emitted Haskell code for non-trivial appli-
“illegal” rule with a finite score. cations is slower than optimized C by a factor of 2 [17].
512 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015

Recent additions to the compiler infrastructure [18], which 3.1.2 Semi-Global Alignments
provide instruction-level parallelism, will reduce this factor We first modify the Needleman-Wunsch grammar,
further. As ADPfusion is built on top of the Repa [19] equ. (18), in such a way that it models a semi-global align-
library for CPU-level parallelism, we can expect improve- ment, i.e., to allow the grammar to act locally on one or
ments in this regard to be available for our dynamic pro- more tapes. This allows us to construct scanning-type algo-
gramming algorithms in the near future. rithms that can be used in genome-wide applications such
Finally, we provide colored, pretty-printed diagnostics to as HMMer [27] and Infernal [28], [29]. The basic idea is to
aid during grammar development. replace the one-dimensional “step grammar” by a slightly
more complex one that allows us to skip a prefix or a suffix:
3 ALIGNMENT ALGORITHMS 
S ¼ ðfXg; fag; fX ! Xa  XgÞ (28)
The overwhelming majority of alignment programs solve
pairwise alignment problems by exact DP but use heuristics
to combine the pairwise solutions to multiple alignments. NL ¼ ðfLg; f$g; fL ! $gÞ (29)
The main reason is practicality. Full-fledged n-way DP
alignments have exponential running time in n and hence
NX ¼ ðfXg; f$g; fX ! $gÞ (30)
are of little practical use for large n despite of elaborate
divide-and-conquer strategies have been proposed to prune
the search space, see e.g., [1]. Three-way alignments never- L ¼ ðfXg; fg; fX ! Xg; XÞ (31)
theless are employed in practise in particular when high
accuracy is crucial, see e.g., [20], [21], [22], [23]. Four-way O ¼ðfX; R; Lg; fag;
alignments were recently explored for aligning short words  (32)
from human language data [24]. We suspect that DP fR ! Ra  X; X ! L; L ! Lag; RÞ:
approaches for moderate values of n have not been explored
for specialized application because of the effort for their The extension from a global to a semi-global (or global)
implementation. In this section we demonstrate how the alignment is done using another grammar with a total of
product construction can help, using a combined nucleic- four rules. These rules allow the removal of nucleotides on
acid/protein alignment algorithm as an example. the right (via R rules) or left (L rules) and switching to and
from the actual alignment grammar. The extended step
grammar O introduces the transitions from the right
3.1 Global, Semi-Global, and Local Alignments
“ignored” part R to the aligned part X, and finally from X
3.1.1 Global Alignments to the left “ignored” part L. It reads through the “ignored”
The global alignment described above is the simplest variant parts with R ! Ra and L ! La rules. The “stop grammar”
of pairwise sequence alignment. Needleman-Wunsch style N now needs to recognize the end of the tape $ as the r.h.s.
global alignments in grammatical form have a very convenient of the non-terminal L.
structure. The global alignment of k sequences (and therefore k The combined semi-global grammar can be written as
tapes) can be written as SW k ¼ ki¼1 S  L  k þ N  k, where
SG ¼ S  S þ O  L  L  2 þ NL  NX (33)
ki¼1 S denotes the k-fold product of a grammar with itself. By
virtue of having a monoidal (and hence associative) structure has the eight productions
of the  operator it is well-defined.
R  R  a    X 
This property of the global alignment grammar was quite ! X 
"
useful in recent work on historical linguistics [25] where all X
X
 X  a    X  "    X  a    L 
! X a  X  
(34)
alignments for k-tuples with k 2 f2; 3; 4g, or two- to four- X
L  L  a    $ 
X a X " X

way alignments, were required. Scoring in these grammars ! 


X X " $
was done by algebras using the sum-of-pairs scheme. We
will come back to these kinds of scoring schemes in the con- R
clusion, as they open up ways to describe automatic genera- with X as the start symbol. It embeds the core of the
tion of algebras2 for grammar products. alignment grammar, S  S into a head and tail part that
In many applications one is interested in local align- steps through the first band only. By construction SG is
ments that allow prefixes and suffixes to remain local only on the first tape, and global on the second
unaligned. It is possible to perform a local alignment tape. Intuitively, we can understand it as SG
NWþ
with an adapted scoring scheme (as done in the Smith- O  L, i.e., as the Needleman-Wunsch grammar plus the
Waterman algorithm [26]). Within the grammar-centered skipping of a prefix and suffix on the first tape.
framework explored here, however, it seems preferable to
devise a grammar that describes such a local alignment. 3.1.3 Local Alignments
Below we consider two natural extensions that are practi- The Smith-Waterman algorithm [26] for local sequence
cal importance as semi-global (glocal) or local alignment alignment is usually implemented via the scoring
algorithms. scheme. Including a neutral element (i.e., 0 for max-opti-
mizations where sub-scores are summed up) into the
2. We do not use the term algebra product in this case as algebra prod- optimization function yields a local alignment algorithm.
ucts already describe well-defined combinations of algebras in ADP [2]. As for the semi-global alignment, we again employ a
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 513

grammar-based scheme to derive a local algorithm from misread as T in sequencing [31]; to model such effects,
a global one. Our construction is based on the observa- asymmetric substitution score matrices are required. The
tion that we can interpret the Smith-Waterman algorithm same is true for alignments of lexical data in a computa-
as a “concatenation” of three interconnected Needleman- tional linguistics setting because sound changes are direc-
Wunsch algorithms, where the first and the last one tional between two languages [24]. To enforce symmetry we
score only the excluded parts of the sequences. This can may use the same non-terminal symbols for each tape, while
be written as asymmetry can be indicated by the use of different (or
indexed) symbols on the different tapes.
SW ¼ þ2i¼0 ðS i  S i  Li  2Þ þ N 2  2 þ T  2 : (35)

Here, S; L; N are derived from the global alignment ver- 3.2 DNA-Protein Alignment
sions The problem of aligning a protein sequence to a nucleic acid
 (RNA or DNA) sequence is a rather specialized problem
S i ¼ ðfXi g; fag; fXi ! Xi a  Xi gÞ that arises in particular in the context of (homology-based)
Li ¼ ðfXi g; fg; fXi ! Xi gÞ gene annotation. The best example is probably NCBI’s
N 2 ¼ ðfX2 g; f$g; fX2 ! $gÞ prosplign, which aligns a protein query sequence to a
piece of genomic DNA allowing also for introns. A detailed
T ¼ ðfX0 ; X1 ; X2 g; fg; fX0 ! X1 ; X1 ! X2 g; X0 Þ
description of this dynamic programming algorithm has not
yet become available. An interesting variation on this theme
and T defines the transitions between the grammars. The is gene annotation in the presence of extensive insertion/
resulting product grammar contains 12 production rules deletion editing as observed in the mitochondria of Physa-
with ð X0
X0 Þ as start symbol:
rum polycephalum or trypanosomatids. Frequent changes of
the reading frame make it virtually impossible to identify
  
      
X0
! X0 "  X0 a  X0 a
X1 homologs of mitochondrial proteins by tblastn, thus call-
X0 X X1 X " X
    0 a 0  0 a ing for specialized alignment algorithms [32], [33].
X1 X2  X1 "  X1 a  X1 a
X ! X X X " X (36) Our task is to align the amino acid sequence of a protein
 1   2 1 a 1  1 a
X2 $  X2 "  X2 a  X2 a that may be present in a mitochondrial genome to the entire
!
X2 X2 a X2 " :X2 a
$ nucleic acid sequence of a mitogenome. Since we suspect
that mRNAs may be subject to insertion or deletion editing,
The na€ıve formulation is not ideal in practise in that it it is necessary to track frameshifts. Fig. 1 shows a general
requires
 0 three
 X1 memoization
 X2  tables for the three non-termi- version of such an approach. The DNA sequence is read in
nals XX0 , X1 , and X2 . In case of a localscoring system
 X2  one of three reading frames (RFs), and a deletion or inser-
where the excluded parts of the alignment ( X 0
X0 and X2 ) tion does not yield a “simple” in/del but also a frame shift
are scored by a constant, it is possible to replace the OðnmÞ to account for the effect of in/dels on the translation of the
memo-tables with tables of size Oð1Þ. This is possible by rec- DNA into protein according to the “codons” of the genetic
ognizing that every subword (index) in such a table can be code. In Fig. 1 frame shifts (with scoring functions rf1 and
memoized by the same single value. We will come back to rf2) are enabled. Staying within a frame is modelled either
this point in the discussion section. by a (mis)match stay or by the deletion of all three charac-
ters of a codon (del). Finally, the alignment is to be calcu-
3.1.4 Symmetric and Asymmetric Scoring lated locally w.r.t. the DNA sequence but globally w.r.t. the
It is important to recognize that the grammar alone is a amino acid sequence. In the grammar of Fig. 1 this is
device that enumerates all possible alignments of a DNA achieved by adding a component grammar that “skips” an
sequence with a protein sequence. In particular, the gram- unaligned prefix and suffix on the DNA band while leaving
mar itself will not disallow alignments that are biologically the protein band untouched. This follows the same insight
unsound. However, each grammar created using our frame- as in the simpler alignment grammars above.
work has all of its rules tagged with function symbols. As each of the three frames, and shifts to the other two
These function symbols are also known as algebra symbols frames, is by itself similar to the other two frames, a special
in the context of ADP [2] where we also borrowed the tag- encoding saves a lot of work. The F non-terminal indicating
ging symbol ‘<<<‘ used in Fig. 1. In this sense, our frame- the current frame is indexed with indices 0, 1, and 2. Frame
work is very similar to S-attribute grammars [30]. shifts are thus calculated modulo 3 instead of explicitly creat-
Nevertheless, we can support the construction of the ing all three frame indices F0 to F2 and their corresponding
scoring algebra already during grammar design be explic- production rules. Furthermore, all alignments are local with
itly making use of symmetries in the model. The alignment respect to the DNA sequence but global with respect to the
of two sequences of the same type is usually simplified due protein sequence. The product of an embedding grammar
to mirrored operations. Recalling the alignment grammar (DNAlocal) with a grammar that does not read any amino
from above, we speak of in/del operations as an insertion in acid character yields the correct semi-global embedding.
one sequence that may just as well be described as a deletion For the protein sequence, the corresponding PROstand
in the other sequence. In addition, it does not matter which grammar can simply be reused.
sequence is bound to which input tape. In some applications The complexity of the DNA-protein alignment stems
this symmetry is broken. For example, ancient DNA is par- from the fact that we need to “align” the different frame
tially chemically degraded by cytosin deamination, i.e., C is shifting possibilities in the DNA input while matching
514 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015

Fig. 1. Atomic grammars for the DNA-Protein alignment example. (I) Nucleotides are read in triplets (three nucleotides each). The genome is aligned
locally to the complete amino acid sequence. Using DNAlocal, nucleotides can be removed from the left or right end of the DNA sequence. The
choice between local or global alignment for each tape is made based on adding the grammar product DNAlocal >< PROstand. The DNA grammar
switches between reading frames. DNAdone and DNAstand handle the terminating and looping case. (II) The PROtein grammar works similarly, but
reads only a single amino acid at a time. The expansion of the DNA grammar is more complicated, as the indexed non-terminal symbol F expands to
three different non-terminals corresponding to the three possible reading frames. (III) The grammar product of DNA and PROtein without the looping
case “stand” and with the terminating case “done”. In code, >< represents the direct product (). The resulting 34-production rule grammar is
shown in the Supplemental Material, available online, together with an extended description.

zero to three nucleotides to zero or one amino acid in the two DNA sequences with two protein sequences. This
protein input. In addition, once a frame shift has occurred grammar can be calculated at basically no additional cost but
all following alignments of three nucleotides against one would pose a daunting task if implemented by hand.
amino acid are scored in the new reading frame until In addition to the 34-production rule grammar, a set of
another frame shift occurs or the alignment is completed. (10 þ 1) function types3 is created. Each function type is asso-
Under normal circumstances, the scoring algebra for the ciated with a pair of function symbols, one for the DNA and
DNA-Protein grammar will assign very high costs to one for the protein tape; apart from the choice function (the
frameshift productions rf1 and rf2. In order to model “þ1”). This signature, as it is called in Algebraic Dynamic
the frequent cytosine insertions in P. polycephalum, how- Programming [2], [17], associates one type with one or more
ever, we simply use a moderate or low penalty for rf1 of the production rules. For example, the production rule
when the incomplete codon is corrected by the inclusion of created from the product of F{i} -> stay <<< F
a ‘C’ that is not encoded in the DNA. {i} c c c and P ->amino  <<< a creates,
  P    among others,
Our framework simplifies the complexity of designing the production rule FP0 ! FP0 ac "c "c .
this algorithm considerably. While the combined grammar is The corresponding evaluation function stay_amino
highly complex, the individual grammars are rather simple. has type S  ðN  AÞ  ðN  EÞ  ðN  EÞ ! S. This type
As already mentioned, the protein “stepping grammar” is specifies that stay_amino accepts the score (of type S)
one of the simplest possible ones. The DNA grammar is of the alignments as calculated up to this position fol-
more complex as we need to handle stepping and frame lowed by tuples of characters read from each tape. The
shifts in all three reading frames. But considering that we first nucleotide character (of the set N ¼ fA; C; G; Tg) is
allow indexed non-terminals and calculations on these indi- aligned with the corresponding amino acid character (of
ces (modulo 3 in the frame shift case), even the frame shift the set A of the 20 amino acids). The remaining two
grammar has only four rules, just twice as much as the sim- nucleotides are aligned with elements from the empty
plest stepping grammar. set (E which emits the “non-informative character”
The resulting 34-production rule grammar is easily calcu-
lated in our frame work. We emphasize that one may read- 3. The 11th function type designates the optimizing choice function,
ily extend this grammar to allow for, say, an alignment of which is part of every evaluation algebra.
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 515

3.2.1 Application of the Frame-Shift Grammar


To evaluate our DNA-Protein alignment algorithm we use
a scenario as a test case in which frameshifts are rather fre-
quent events. The mitochondrial genome of the amoeba
Physarum polycephalum has long resisted comprehensive
annotation because insertion editing is so frequent in most
of its transcripts that blastp-based searches for homo-
logs of known mitochondrial proteins have long remained
unsuccessful [32]. This situation has changed only with
the construction of a dedicated DNA/protein alignment
algorithm that specifically modelled the C insertion [33],
Fig. 2. Alignment of R. americana proteins to the P. polycephalum [34]. With the recent characterization of the mitochondrial
mitogenome. The central panel displays expression data from [35]. transcriptome of P. polycephalum [35] a comprehensive list
Above and below the known protein-coding (P) and ncRNA (R) of insertion editing sites became available.
genes are shown (thick black lines with delimiters for each gene)
together with the alignment scores (normalized per nucleotide) for Here we test the implementation of the frameshift DNA/
the R. americana proteins. protein alignment algorithm given in Fig. 1 for the task of
annotating the P. polycephalum mitogenome (62,862 nt [36])
by homology search. We use the 67 protein sequences
represented by the empty tuple ()). This design allows encoded by the mitochondrial genome of Reclinomonas amer-
us to effectively align the single amino acid character icana. This jakobid excavate is only very distantly related to
with three nucleotide characters. Finally, the stay_ P. polycephalum. It has been reported, however, to host the
amino function emits a score of type S. most complete, bacterial-like protein complement of all
General scoring for Frameshift-aware alignments: The user mitogenomes investigated so far [37]. We therefore aligned
can now implement the required scoring functions. The scor- each of the Reclinomonas americana proteins semi-globally
ing we describe here makes no assumptions on the knowl- against the mitogenome of P. polycephalum.
edge of frequent cytosin (C) insertions. Instead we The P. polycephalum mitogenome contains 39 protein
implement scoring for a generic frameshift-aware alignment coding genes annotated in Redbase4 [35], 32 of which have
algorithm which we then evaluate. a homolog in R. americana. For 18 of these the best hit of
To this end we use three score lookup tables: (i) The the DNA-Protein alignment correctly identifies the geno-
alignment of three nucleotides to a single amino acid is mic location of the gene in P. polycephalum with only minor
performed by translating the codon into its respective deviations of the exact start and stop positions. In [33] 9
amino acid, after which the, say, BLOSUM similarity matri- genes previously not identified in P. polycephalum were
ces can be used to score the alignment. (ii) In case of a annotated with a handcrafted DNA-Protein alignment
frame shift combined with a (mis)match either only one or algorithm that specifically favours C insertions using all
two nucleotides are aligned with an amino acid. For, say, proteins in NCBI’s non-redundant protein database.
two nucleotides on the DNA sequence there are 12 possible Our approach recovers three of these nine proteins.
codons: ac1 c2 , c1 ac2 , and c1 c2 a, where a 2 fA; C; G; T g and However, we have made no efforts to optimize parameters,
c1 ; c2 are the data of the DNA sequence. The insertion that our algorithm does not distinguish between C insertions
maximizes the BLOSUM similarity is used to score the (mis) and other frame-shifting insertions or deletions, and we use
match. (iii) In addition to the BLOSUM-based scores, six gap only a single, evolutionary very remote mitogenome as
parameters are required. Complete deletions of either a query. We recover also nearly half of the C insertion sites.
three-nucleotide codon or an amino acid (gccc ¼ 15, Using the P. polycephalum proteins as query, we find nearly
ga ¼ 10), as well as the frame shift versions are penalized. all editing sites located in the coding regions (e.g., 76 of the
To model the abundance of insertion editing sites, one- 79 in nad5).
nucleotide frameshifts receive only a moderate penalty The purpose of this application example, however, is not
(approximately four strong mismatches (20), partially off- to improve the annotation of the P. polycephalum mitoge-
set by the BLOSUM-based match score for the repaired nome beyond the level of accuracy that was achieved with
codon). Aligning a single nucleotide to an amino acid the help of transcriptome sequencing [35]. Our point is that
incurs a malus of 60, while nucleotide deletions incurring a pilot study into this rather specialized topic can be set up
a frameshift are heavily penalized with a malus of 45 or literally within a few hours with the help of grammar prod-
75. Finally, given that we want to align the protein semi- ucts and the software support described in Section 2.6.
globally to the DNA sequence, transitions to and from the Applications such as the alignment of tens or more pro-
flanking part of the DNA have zero cost. tein sequences to full genomes, albeit a rather small one in
It is important to keep in mind that the generation of can- this case, require a modicum of performance considerations.
didate alignments by the grammar is completely separate Since we need to scan a full genome (though one only about
from the concrete scoring of the alignment by a scoring alge- 60,000 nt in length), we have opted for a sliding window
bra. The amalgamation of the two concepts grammar and approach for the DNA sequence, whereas the protein
scoring algebra is taken care of by the ADPfusion frame- sequence is always aligned in full.
work [17], which also optimizes the resulting code such that
its running time performance is competitive with hand-
written C-based implementations. 4. http://bioserv.mps.ohio-state.edu/redbase/.
516 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015

In addition, the algorithm makes use of multiple cores question whether such generalization could be obtained as
on a single machine in a parallel setup. This algorithm is subsets of the product of NUS with itself:
embarrassingly parallel as all pairs of DNA windows and
S  $    $    $    Sa    Sa    Sa  
proteins can be aligned simultaneously, given enough !      
S $ Sa SB $ Sa SB
resources.  SB    SB    SB 
 
The final ingredient to good performance is the imple- $ Sa SB
S  $    Sa    SB 
mentation of the dynamic programming algorithm itself. !  
B
B aS a^
 aS 
aS a^ aS^
   a
a (40)
Our implementation uses ADPfusion as the underlying a^  aS a^  aS^
! $
dynamic programming framework. In this way, questions  SB   aSa^  Sa SB

of performance for grammar products are reduced to those B


! aS a^
:
of better grammar design. ADPfusion then transforms the
resulting grammar into efficient Haskell code. A closer inspection of this “formal” product reveals several
Our larger-scale DNA-Protein example performs quite problems.
well—although we, of course, have no algorithm in C to While some of the formal   right  S hand sides have natural
a
compare against. The alignments of the 67 protein sequen- explanations, such as Sa Sa
’ S a ,  a^  require
others  a  S the 
ces of various lengths ranging from around 100 amino introduction of " symbols, such as aS Sa
’ " S
a^
a
.
acids to several hundred to the approximately 60,000 nt of Other terms are more difficult to
 Sa make sense of. For instance,
 Sa 
the mitogenome can be done in 289 minutes on an Intel what
 S  ashould
 we mean by SB ? We might write SB ’
Xeon E5-2680 running at 2.7 GHz running in single S B leaving us with an undesirable combination of a ter-
threaded mode. minal and a non-terminal. In line with our construction for
linear grammars we may include all order-preserving  Sa 
3.3 Sankoff’s Consensus Structure Algorithm combinations
    in a form such as the following SB ’
a a
A classical problem in RNA bioinformatics is the simulta-
S
S "
"
B j S
S
"
B " or possibly even more general
neous computation of a pairwise alignment and a consensus expansions. In contrast to the linear grammars considered in
secondary structure of two input RNA sequences. David the previous section general CFGs can have arbitrary strings
Sankoff already noticed in [12] that this problem smells of of terminals and non-terminals as the r.h.s. of their produc-
product structures. tions. This may lead to an exponentially large number of
For the sake of brevity we consider here only a variant of “padded” terms in the interpretation of a formal product
the “Nussinov grammar” [38] that distinguishes the non- term. As a consequence it becomes very difficult to establish
terminal S for unconstrained structures and the non-termi- the algebraic properties of such a product.
nal B for secondary structures enclosed by a base pair A possible remedy comes from considering normal
instead of the more commonly used “Zuker grammar” that forms, of which several types have been explored in detail
accounts for the full loop decomposition [39]. The grammar for CFGs. The two best known ones are the Chomsky nor-
NUS has the productions mal form and the Greibach normal form [40], [41]. Both nor-
mal forms have been useful both in practise and as a
S ! $ j Sa j SB theoretical device. We therefore explore here the possibility
(37) to construct direct products of context-free grammars in
B ! aS a^;
Greibach normal form as the two-GNF has the useful prop-
where the terminals a and a^ denote nucleotides that can pair erty of a r.h.s. with a single terminal followed by zero, one,
with each other. This is just a short hand for the  six legal or two non-terminal symbols. This property simplifies ques-
base pairs explicitly, i.e., aS a^ ¼ aSu  cSg  gSc  gSu  tions of alignment considerably.
uSa  uSg.
The -operation is easily generalized
   ato
arbitrary g CFGs in 3.4 A Product for Greibach Normal Forms
b
the form ðA ! ab . . . gÞ  2 ¼ A A
! a b . . . g , where Every context free grammar that does not produce $ can
a; b; . . . ; g are either terminals or non-terminals. The natural be transformed into an equivalent grammar with rules of
version of the Sankoff algorithm for two input sequences is the form
S 
       S  a    S  " 
S
! $$  SS aa 
S "

S a A ! aBC j bD j c (41)
S   
S
! SS B (38)
B  a  SB  a^  known as its Greibach normal form of order 2, or simply
B
! a S a^
: two-GNF. If the empty string is produced, the rule S ! $
must be added, where S is the start symbol. We ignore this
This can be expressed much more compactly in the form technicality for brevity of exposition. It is easily included if
one allows $ as a terminal symbol. It is however mostly a
SANK ¼ NW þ ðS ! SB; B ! aS a^Þ  2; (39) question of semantics if a grammar should consider empty
input tapes legal input or not.
when we allow base pairs in either sequence only when A natural product for grammars in two-GNF, which we
they also appear in the consensus. More complex grammars denote by } , can be obtained as follows: Terminals in the
are required when we allow e.g. also the breaking of arcs, product are pairs of terminals of the input grammars, and
the insertion and deletion of entire base pairs, or alignments the set of non-terminals is, as in the case of linear grammars,
of paired and unpaired bases, see e.g., [7]. This begs the the Cartesian product of the sets of input non-terminals
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 517

   
augmented by non-terminals of the form A" and A" . The instance, the } -square of the step grammar ðX ! aX j
production rules of the product are the following: "XÞ has the productions
X  a  X    a  X    a  "  
ðA1 ! a1 B1 C1 Þ } ðA2 ! a2 B2 C2 Þ !   
   a  1  C  
X a
 a  X
a a X
    X"    a  "  
X  a  
¼ A1
! a12 B 1
" " " X
A2 B2 C2  "  X     X"    "  "  
X  "
(43)
 
ðA1 ! a1 B1 C1 Þ } ðA2 ! b2 D2 Þ a X
 "  X
a a X
   "  X"    "  " 
   a  B1  C    a1  B  C    
A1  b X :
¼ A2 ! b21 D 2 "
1
2 "
1 1
D2
" X " " "

ðA1 ! a1 B1 C1 Þ } ðA2 ! c2 Þ Only the first term in each line appears in the  product.
   a     We have fully implemented definition of the } product
A1
¼ A2 ! c21 B"1 C"1
for the two-GNF of Eq. (42), so that general CFGs in two-
ðA1 ! b1 D1 Þ } ðA2 ! a2 B2 C2 Þ GNF can be defined and multiplied like linear (left-, right-,
            and general linear) grammars in our domain-specific lan-
¼ A1
! ab12 D1 "  b1 B" D1
A2 B2 C2 a2 2 C2 guage, providing access to an efficient implementation of
ðA1 ! b1 D1 Þ } ðA2 ! b2 D2 Þ the resulting multi-tape product grammars.
      (42)
¼ A1
! bb12 D1 
A2 D2 4 DISCUSSION
        
b1 D1
D
"  b1 D" D1 Our main contribution is a formal, abstract algebra on linear
b2 " 2 b2 2 "
grammars. This algebra provides operations to create com-
     
ðA1 ! b1 D1 Þ } ðA2 ! c2 Þ ¼ A1
! b
1 D 1 plex, multi-tape grammars from simple, single-tape atomic
A2 c2 "
ones. More informally, we have created a method and imple-
ðA1 ! c1 Þ } ðA2 ! a2 B2 C2 Þ mentation to “multiply” dynamic programming algorithms.
   c1   "   "  
A1 We also provide a compiler framework that makes the gram-
¼ A2 ! a2 B2 C2
  mars readily available for actual deployment with good per-
A1  c  "   formance of the resulting code. Products of linear grammars
ðA1 ! c1 Þ } ðA2 ! b2 D2 Þ ¼ A2 ! b12 D2
  c  make it very easy to construct the grammars underlying key
A1
ðA1 ! c1 Þ } ðA2 ! c2 Þ ¼ A2 ! c12 : algorithms in the field of string comparison starting from
almost trivial single-tape factors. This approach is particu-
larly fruitful in highly specialized applications as it drasti-
By construction, } is commutative (up to exchanging the cally reduces the efforts required for implementing
coordinates). One easily checks that the product grammar is prototypes, as the example of DNA/protein alignments with
again in two-GNF since the r.h.s. of each production consists frameshifts shows. The work presented here is just a first
of a terminal followed by zero, one, or two non-terminal step towards a general theory of grammar products. Many
symbols. As in the case of the linear grammars   we explain questions, both theoretical and practical, remain open.
the productions of non-terminals of the form A" by the pro- Although we have succeeded in constructing an algebrai-
ductions
A  a of
 BA in  the first factor grammar, for instance cally meaningful product operation of CFGs in normal form
"
! " "
C
"
. The distributive law ðP1 þ P2 Þ } P3 ¼ it is currently restricted to the Greibach normal form. A
P1 } P3 þ P2 } P3 also holds by construction. fully generic version of the grammar product currently
To show that } is associative, it therefore suffices to eludes us. While this poses no theoretical problem given
show that the product is associative for any three pro- that every CFG can be transformed into an equivalent CFG
ductions. Since there are only three types of rules in a in (Greibach) Normal Form, it poses a problem in practice.
two-GNF, it suffices to consider the 27 possible products Often, a production rule is associated with a certain struc-
of triples of production rules, which altogether lead to tural feature that one wants to retain. Transformations into
57 rules. We used a computational proof to establish that normal form also increase the number of non-terminals
associativity is indeed satisfied, see Supplemental Mate- (and thereby resource usage) by a polynomial depending
rial, which can be found on the Computer Society Digital on the number of non-terminals [42].
Library at http://doi.ieeecomputersociety.org/10.1109/ It will also be important to explore both the interplay of
TCBB.2014.2326155. It is important to note that the two different operators on grammars (especially our þ opera-
“decoupling rules” for ðA1 ! b1 D1 Þ } ðA2 ! b2 D2 Þ indi- tion and the union ([) of grammars), and to formalize
cated by the box in Eq. (42) are necessary for associativ- meaning and operation. This will provide, in the long-term,
ity of the product. a full-fledged algebraic framework in which it should be
Linear grammars can be understood as special cases  easily possible to describe even complex grammatical prob-
of two-GNF with productions of the form A ! Bx  y lems. As pointed out by a reviewer, it may sometimes be
(except that we now deal with right linear instead of left convenient to introduce a grammar G ¼ ðfZg; fg; fgÞ for the
linear grammars). A comparison of the definition of  in purpose of introducing the non-terminal Z (using þ) or for
Eq. (12) and } in Eq. (42) shows that the restriction of } the purpose of removing all rules that contain Z (using ).
to linear grammars does not recover (the mirror image For instance, ðfY g;fg; fgÞ  k could remove the unwanted
of) . The discrepancy are exactly the two “decoupling double-deletion YY in Equ. (25) in a quite convenient man-
rules” necessary for associativity of the } product. For ner. Our formalism, however, defines algebraic operations
518 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015

primarily on sets of production rules. This construction thus Good and optimal table designs based on yield size anal-
will require careful (re-)definition of the algebraic opera- ysis have been considered in [47], extending earlier ideas on
tions acting on grammars. more restricted dynamic programming algorithms [48]. Our
An intriguing idea (proposed by Robert Giegerich) is to proposed extension will consider not only the yield size but
identify irreducible grammars from which more complex the actual evaluation algebra, thereby including more
ones can be constructed, possibly in a unique way. To that domain-specific knowledge.
end, we will require a much better understanding of abstract
algebras on grammars as well as the impact of constructions ACKNOWLEDGMENTS
such as (Z; fg; fg). Since k-tape grammars can formally be
This work was funded, in part, by the Austrian FWF, project
written without making the dimension of the non-terminals
“SFB F43 RNA regulation of the transcriptome” and the
explicit, a conceptually related question is whether product
German DFG project “MI439/14-1”. CHzS thanks Jing,
grammars can be efficiently recognized. More abstractly: ist
Katja, Lydia, and Nancy (and gin, as well as a mad man in a
there a “unique prime-factorization theorem” for grammars
box). The authors thank Maria Walter for her hospitality
akin to analogous results for graphs? [43].
that was very conducive for the development of this work,
Another avenue of future research is the question of
and the referees – Prof. Robert Giegerich and one anony-
semantic ambiguity [44] of the resulting grammars. Simple
mous – for providing in-depth comments.
products of the same grammar yield ambiguous alignments
on sequences of in-dels, that is multiple derivations exist
that yield the same alignment. This problem was studied REFERENCES
extensively for stochastic context-free grammars imple- [1] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu, “A tool for mul-
menting covariance models for structural RNA search [45]. tiple sequence alignment,” Proc. Nat. Acad. Sci. USA, vol. 86,
It is typically dealt with a good grammar design that explic- no. 12, pp. 4412–4415, 1989.
[2] R. Giegerich and C. Meyer, “Algebraic dynamic programming,”
itly allows only one order of successive insertions and dele- in Proc. Algebraic Methodol. Softw. Technol., 2002, vol. 2422,
tions on multiple tapes. Automatic dis-ambiguation is pp. 349–364.
probably complicated but would further simplify the crea- [3] R. Giegerich, C. Meyer, and P. Steffen, “A discipline of dynamic
tion of complex multi-tape grammars. programming over sequence data,” Sci. Comput. Program., vol. 51,
no. 3, pp. 215–263, 2004.
In this contribution we have focussed entirely on the [4] C. M. Reidys, F. W. D. Huang, J. E. Andersen, R. C. Penner, P. F.
grammars underlying the dynamic programming algo- Stadler, and M. E. Nebel, “Topology and prediction of RNA
rithms and disregarded almost entirely the construction of pseudoknots,” Bioinformatics, vol. 27, pp. 1076–1085, 2011.
[5] H. Chitsaz, R. Salari, S. C. Sahinalp, and R. Backofen, “A partition
scoring algebras for product grammars. We anticipate that function algorithm for interacting nucleic acid strands,” Bioinfor-
in many cases, a scoring algebra can be expressed as a matics, vol. 25, pp. i365–i373, 2009.
form of product itself where the two scoring functions (one [6] F. W. D. Huang, J. Qin, C. M. Reidys, and P. F. Stadler, “Partition
for each grammar) are themselves combined in some well- function and base pairing probabilities for RNA-RNA interaction
prediction,” Bioinformatics, vol. 25, pp. 2646–2654, 2009.
defined form. One possibility is the use of a folding opera- [7] S. Will, C. Schmiedl, M. Miladi, M. M€ ohl, and R. Backofen,
tion to combine scores for subsets of the individual dimen- “SPARSE: Quadratic time simultaneous alignment and folding of
sions. It then follows that given two algebras AG1 and AG2 RNAs without sequence-based heuristics,” in Proc. 17th Int. Conf.
for grammars G1 and G2 we should be able to define an Res. Comput. Mol. Biol., 2013, pp.289–290.
[8] C. H€ oner zuSiederdissen, I. L. Hofacker, and P. F. Stadler, “How
operation AG1 k AG2 which generates appropriate alge- to multiply dynamic programming algorithms,” in Proc. Brazilian
bras from algebras for atomic grammars. As long as k has Symp. Bioinformat., 2013, pp. 82–93.
some structure similar to a fold or another operation on [9] O. Gotoh, “An improved algorithm for matching biological
sequences,” J. Mol. Biol., vol. 162, pp. 705–708, 1982.
subsets of the dimensions (of the grammars) involved, [10] F. Lefebvre, Grammairs s-attribuees multi-bandes et applications a
appropriate products can be automatically defined. This l’analyse automatique de sequences biologiques, Ph.D. dissertation,
will become particularly useful when aiming at ADP-like 
Ecole Politechnique, Palaiseau, France, 1997.
[46] algebra-products to explore the rich space of combined [11] S. B. Needleman and C. D. Wunsch, “A general method applicable
to the search for similarities in the amino acid sequence of two
algebras on grammars constructed from algebraic opera- proteins,” J. Mol. Biol., vol. 48, no. 3, pp. 443–453, 1970.
tions on atomic grammars. [12] D. Sankoff, “Simultaneous solution of the RNA folding, alignment
In Section 3.1 on local alignments we mentioned that a and protosequence problems,” SIAM J. Appl. Math., vol. 45,
pp. 810–825, 1985.
na€ıve memoization of the three non-terminals of the Smith- [13] The GHC Team. (1989–2013). The Glasgow Haskell Compiler
Waterman algorithm leads to a three-fold increase in mem- (GHC) [Online]. Available: http://www.haskell.org/ghc/
ory usage compared to the usual implementation based on [14] T. Sheard, and S. P. Jones, “Template Meta-programming for
one table and a neutral element in the scoring function. In Haskell,” in Proc. ACM SIGPLAN Workshop Haskell, 2002, pp. 1–16.
[15] G. Mainland, “Why it’s nice to be quoted: Quasiquoting for
case of local alignments, the evaluation functions attached Haskell,” in Proc. ACM SIGPLAN Workshop Haskell Workshop,
to each production rule return a constant value, often the 2007, pp. 73–82.
neutral element (i.e., a score of 0 for summations), for all [16] D. Coutts, R. Leshchinskiy, and D. Stewart, “Stream fusion: From
productions and all substrings. lists to streams to nothing at all,” in Proc. 12th ACM SIGPLAN Int.
Conf. Funct. Program., 2007, pp. 315–326.
We plan to extend our framework to make it possible to [17] C. H€ oner zu Siederdissen, “Sneaking around concatMap: Efficient
evaluate combinations of grammars and algebras in more combinators for dynamic programming,” in Proc. 17th ACM SIG-
complex ways. This should allow us to automatically deter- PLAN Int. Conf. Funct. Program., 2012, pp. 215–226.
[18] G. Mainland, R. Leshchinskiy, S. P. Jones, and S. Marlow,
mine what kind of memo-table is required for each gram- “Exploiting vector instructions with generalized stream fusion,”
mar and algebra, thereby optimizing memory consumption in Proc. 18th ACM SIGPLAN Int. Conf. Funct. Program., 2013,
of the dynamic programming algorithms. pp. 37–48.
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 519

[19] G. Keller, M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, [45] R. Giegerich, and C. H€ oner zu Siederdissen, “Semantics and ambi-
and B. Lippmeier, “Regular, shape-polymorphic, parallel arrays guity of stochastic RNA family models,” IEEE/ACM Trans. Com-
in Haskell,” in Proc. 15th ACM SIGPLAN Int. Conf. Funct. Program., put. Biol. Bioinformat., vol. 8, no. 2, pp. 499–516, Mar. 2011.
2010, pp. 261–272. [46] P. Steffen, and R. Giegerich, “Versatile and declarative
[20] O. Gotoh, “Alignment of three biological sequences with an effi- dynamic programming using pair algebras,” BMC Bioinformat.,
cient traceback procedure,” J. Theor. Biol., vol. 121, pp. 327–337, vol. 6, p. 224, 2005.
1986. [47] P. Steffen and R. Giegerich, “Table design in dynamic pro-
[21] T. G. Dewey, “A sequence alignment algorithm with an arbitrary gramming,” Inf. Comput., vol. 204, no. 9, pp. 1325–1345, 2006.
gap penalty function,” J. Comput. Biol., vol. 8, pp. 177–190, 2001. [48] H. L. Bodlaender and J. A. Telle. (2004, Dec.). Space-efficient con-
[22] A. S. Konagurthu, J. Whisstock, and P. J. Stuckey, “Progressive struction variants of dynamic programming. Nordic J. Comput.
multiple alignment using sequence triplet optimization and three- [Online]. 11(4), pp. 374–385. Available: http://dl.acm.org/
residue exchange costs,” J. Bioinf. Comput. Biol., vol. 2, pp. 719– citation.cfm?id=1062991.1062995
745, 2004.
[23] M. Kruspe, and P. F. Stadler, “Progressive multiple sequence Christian Ho € ner zu Siederdissen is a postdoc-
alignments from triplets,” BMC Bioinformat., vol. 8, p. 254, 2007. toral researcher with the Institute for Theoretical
[24] L. Steiner, P. F. Stadler, and M. Cysouw, “A pipeline for computa- Chemistry at the University of Vienna. His
tional historical linguistics,” Lang. Dynamics Change, vol. 1, pp. 89– research interests include optimization problems
127, 2011. in biology, Bayesian statistics, formal languages,
[25] N. Retzlaff, “Bigramm-Alignierung und ihre Anwendung in and functional programming.
der historischen Linguistik,” Bachelors Thesis, Department of
Computer Science, Eberhard Karls Universit€at Tbingen, Tb€ uin-
gen, Germany, 2013.
[26] T. F. Smith, and M. S. Waterman, “Identification of common
molecular subsequences,” J. Mol. Biol., vol. 147, pp. 195–197, 1981.
[27] S. R. Eddy, “HMMER: Profile HMMs for protein sequence analy-
sis,” Bioinformatics, vol. 14, pp. 755–763, 1998.
Ivo L. Hofacker is a professor at the Departments
[28] S. R. Eddy, and R. Durbin, “RNA sequence analysis using covari-
of Chemistry and of Computer Science at the Uni-
ance models,” Nucl. Acids Res., vol. 22, pp. 2079–2088, 1994.
versity of Vienna, where he heads the Institute for
[29] E. P. Nawrocki, D. L. Kolbe, and S. R. Eddy, “Infernal 1.0: infer-
Theoretical Chemistry and the Research Group
ence of RNA alignments,” Bioinformatics, vol. 25, no. 10, pp. 1335–
Bioinformatics and Computational Biology,
1337, 2009.
respectively. His research interests focus on the
[30] F. Lefebvre, “An optimized parsing algorithm well suited to
prediction of RNA structures, noncoding RNA
RNA folding,” in Proc. 3rd Int. Conf. Intell. Syst. Mol. Biol.,
annotation, and the analysis of folding landscapes.
1995, pp. 222–230.
[31] K. Pr€ufer, U. Stenzel, M. Hofreiter, S. P€a€abo, J. Kelso, and R. E.
Green, “Computational challenges in the analysis of ancient
DNA,” Genome Biol., vol. 11, p. R47, 2010.
[32] J. M. Gott, N. Parimi, and R. Bundschuh, “Discovery of new genes
and deletion editing in Physarum mitochondria enabled by a novel Peter F. Stadler received the PhD degree in
algorithm for finding edited mRNAs,” Nucl. Acids Res., vol. 33, Chemistry from the University of Vienna in 1990,
pp. 5063–5072, 2005. and then worked as an assistant and associate
[33] C. Beargie, T. Liu, M. Corriveau, H. Y. Lee, J. Gott, and R. professor for theoretical chemistry at the same
Bundschuh, “Genome annotation in the presence of insertional School. In 2002, he moved to Leipzig as a full pro-
RNA editing,” Bioinformatics, vol. 24, pp. 2571–2578, 2008. fessor for bioinformatics. Since 1994, he is an
[34] R. Bundschuh, “Computational approaches to insertional RNA external professor at the Santa Fe Institute. He is
editing,” Methods Enzymol., vol. 424, pp. 173–195, 2007. an external scientific member of the Max Planck
[35] R. Bundschuh, J. Altm€ uller, C. Becker, P. N€urnberg, and J. M. Society since 2009 and a corresponding member
Gott, “Complete characterization of the edited transcriptome of abroad of the Austrian Academy of Sciences
the mitochondrion of Physarum polycephalum using deep sequenc- since 2010.
ing of RNA,” Nucl. Acids Res., vol. 39, pp. 6044–6055, 2011.
[36] H. Takano, T. Abe, R. Sakurai, Y. Moriyama, Y. Miyazawa, H. " For more information on this or any other computing topic,
Nozaki, S. Kawano, N. Sasaki, and T. Kuroiwa, “The complete
please visit our Digital Library at www.computer.org/publications/dlib.
DNA sequence of the mitochondrial genome of Physarum poly-
cephalum,” Mol. General Genetics, vol. 264, pp. 539–545, 2001.
[37] B. F. Lang, G. Burger, C. J. O’Kelly, R. Cedergren, G. B. Golding, C.
Lemieux, D. Sankoff, M. Turmel, and M. W. Gray, “An ancestral
mitochondrial DNA resembling a eubacterial genome in mini-
ature,” Nature, vol. 387, pp. 493–497, 1997.
[38] R. Nussinov, G. Piecznik, J. R. Griggs, and D. J. Kleitman,
“Algorithms for loop matching,” SIAM J. Appl. Math., vol. 35,
pp. 68–82, 1978.
[39] R. D. Dowell, and S. R. Eddy, “Evaluation of several lightweight
stochastic context-free grammars for RNA secondary structure
prediction,” BMC Bioinformat., vol. 5, p. 71, 2004.
[40] S. A. Greibach, “A new normal-form theorem for context-free
phrase structure grammars,” J. ACM, vol. 12, pp. 42–52, 1965.
[41] A. Ehrenfeucht, and G. Rozenberg, “An easy proof of Greibach
normal form,” Inf. Control, vol. 63, pp. 190–199, 1984.
[42] N. Blum, and R. Koch, “Greibach normal form transformation
revisited,” Inf. Comput., vol. 150, pp. 112–118, 1999.
[43] R. Hammack, W. Imrich, and S. Klavzar, Handbook of Product
Graphs, 2nd ed. Boca Raton, FL, USA: CRC Press, 2011.
[44] R. Giegerich, “Explaining and controlling ambiguity in dynamic
programming,” in Combinatorial Pattern Matching. New York, NY,
USA: Springer, 2000, pp. 46–59.

You might also like