Product Grammars For Alignment and Folding
Abstract—We develop a theory of algebraic operations over linear and context-free grammars that makes it possible to combine
simple “atomic” grammars operating on single sequences into complex, multi-dimensional grammars. We demonstrate the utility
of this framework by constructing the search spaces of complex alignment problems on multiple input sequences explicitly as
algebraic expressions of very simple one-dimensional grammars. In particular, we provide a fully worked frameshift-aware,
semiglobal DNA-protein alignment algorithm whose grammar is composed of products of small, atomic grammars. The compiler
accompanying our theory makes it easy to experiment with the combination of multiple grammars and different operations.
Composite grammars can be written out in LaTeX for documentation and as a guide to implementation of dynamic programming
algorithms. An embedding in Haskell as a domain-specific language makes the theory directly accessible to writing and using
grammar products without the detour of an external compiler. Software and supplemental files available here: http://www.bioinf.uni-leipzig.de/Software/gramprod/
Index Terms—linear grammar, context free grammar, product structure, multiple alignment, Haskell
1 INTRODUCTION
Eq. (2), which obviously have the form

$$X \to Xu \mid Yu \mid \$ \quad \text{and} \quad Y \to Y{.} \mid X{-}\,. \tag{3}$$

This simple grammar either reads a symbol (non-terminal $X$) or it ignores it (non-terminal $Y$, with '.' extending a gap, '-' opening a gap). Each copy of the “step grammar” (3) operates on its own input tape. This example suggests that dynamic programming algorithms for alignment problems in general have a product-like structure. Indeed, $n$-way alignments can be seen as an $n$-fold product of the simple step grammar with itself.

The aim of the present work is more general than alignments. We introduce products of grammars as a very general framework to facilitate a more effective design of dynamic programming algorithms. This requires that we clarify the precise meaning of a product of CFGs. Since alignment algorithms are naturally expressed as left-linear CFGs we first develop a theory for this special case and demonstrate in some detail how our framework can facilitate the construction of complex alignment algorithms. As a showcase application we consider mixed nucleotide/protein alignments with frameshifts. We then proceed to explore possibilities to generalize the product construction to context free grammars in general and show that normal forms can be employed to guide such constructions. We find that the Greibach normal form (GNF) admits an associative product that conserves the normal form but does not subsume the direct product of linear grammars.

2 ALGEBRAIC OPERATIONS ON LINEAR GRAMMARS

2.1 Notation
A CFG $G = (N, T, P, S)$ consists of a finite set $N$ of non-terminals, a finite set $T$ of terminals so that $N \cap T = \emptyset$, a set $P$ of productions $X \to \alpha$ where $X \in N$ and $\alpha \in (T \cup N)^*$, and a start symbol $S \in N$. (It will be convenient below to consider also grammars without a start symbol $S$.) Furthermore, we need at least one special symbol $\$$ denoting the empty string, an “empty production” $(\varepsilon \to \varepsilon)$ and the $\varepsilon$ symbol denoting a “none”-symbol. This symbol emits nothing (like $\$$). As a parsing symbol, however, it succeeds always (in contrast to $\$$, which only succeeds on empty substrings).

Single and multiple tapes. Below we will make use of the term tape. A (single) tape is an input sequence. Grammars operating simultaneously on $k > 1$ input sequences are called multi-tape grammars of dimension $k$ [10]. All terminal symbols of a $k$-tape grammar are $k$-dimensional, $t \in \prod_i (T_i \cup \{\varepsilon\})$, where $T_i$ is the alphabet of the $i$th tape. Non-terminals are not necessarily tied to individual tapes. It will be convenient in many cases, however, to use “multi-dimensional” symbols for the non-terminals to emphasize their semantics.

The sentinel terminal. The use of $\$$ is arguably optional as the rule $X \to t$ with $t \in T$ also terminates a derivation and it is possible to append a sentinel character as the last character to each input string. While there is no formal requirement for $\$$, an explicit treatment of the terminating case can be advantageous in practical implementations as it does not require adding a sentinel symbol to the alphabet.

The none symbol. The “none” symbol ($\varepsilon$), on the other hand, is a purely formal symbol that is used here for notational convenience. As a terminal, it denotes the empty string. In multi-tape algorithms such as Needleman-Wunsch it denotes the deletion in an in-del event (by not reading a character). More generally, it allows a more compact notation by recasting several rules into a single common format, as in the case of a non-terminal followed by terminal in left-linear grammars or the Greibach normal form, see Section 3.4 below. Whenever $\varepsilon$-only symbols appear in projections (defined below), they can be safely eliminated. It is possible to avoid $\varepsilon$, albeit at the expense of a somewhat more lengthy presentation.

In the next two sections we will consider in particular left-linear grammars, i.e., those for which all productions are of the form $A \to Bx$ with $A, B \in N$ and $x \in T$.

The example of Gotoh’s algorithm in the introductory section motivates us to introduce algebraic operations on grammars in a more systematic way. As a running example, we will use one of the simplest alignment algorithms. The Needleman-Wunsch algorithm [11] aligns two sequences $x_{1 \ldots n}$ and $y_{1 \ldots m}$ so that the sum of matches and in/del scores is maximized. The basic recursion over the memoization table $T$ reads

$$T_{i,j} = \max \begin{cases} T_{i-1,j} + d \\ T_{i,j-1} + d \\ T_{i-1,j-1} + m(x_i, y_j) \\ 0 \quad \text{if } i = 0 \text{ and } j = 0. \end{cases} \tag{4}$$

In the recursive scheme, the base case is given by the alignment of two empty substrings “on the left”, while the other cases extend the already aligned part of the strings to the left. This slightly unusual variant of the algorithm was chosen to be identical to the grammatical description that follows. The first two cases denote an in/del operation with cost $d$, while $m(\cdot, \cdot)$ scores the (mis)match $x_i$ with $y_j$.

A two-tape grammar equivalent to the recursion in Eq. (4) is

$$\binom{X}{Y} \to \binom{X}{Y}\binom{a}{\varepsilon} \;\Big|\; \binom{X}{Y}\binom{\varepsilon}{a} \;\Big|\; \binom{X}{Y}\binom{a}{a} \;\Big|\; \binom{\$}{\$}. \tag{5}$$

There are several differences between the formulation in Eq. (4) and Eq. (5). The recursive formulation working on the memoization table $T$ does not store the alignment directly but rather the score of each partial, optimal alignment. The grammatical description, on the other hand, describes the search space of all possible alignments without any notion of scoring. In addition, recursive descriptions usually include explicit annotations for base cases, here the empty alignment. The production rule $\binom{X}{Y} \to \binom{\$}{\$}$ has this role in our example. In general, grammatical descriptions abstract away certain implementation details. Some of these will, however, become important when constructing more complex grammars from simpler ones, as we shall see below.

Our task will be to construct Eq. (5) from even simpler, “atomic” constituents. These grammars are

$$\mathcal{S} = (\{X\}, \{a\}, \{X \to Xa \mid X\}, X), \tag{6}$$

$$\mathcal{N} = (\{X\}, \{\$\}, \{X \to \$\}, X), \tag{7}$$
SIEDERDISSEN ET AL.: PRODUCT GRAMMARS FOR ALIGNMENT AND FOLDING 509
$$\mathcal{L} = (\{X\}, \{\}, \{X \to X\}, X). \tag{8}$$

The rules in Eqs. (6–8) can be associated with single-tape rules for Eq. (5) by writing $X \to X$ as $X \to X\varepsilon$, which conforms to the intuition built up in the introduction and Eq. (1). The grammar $\mathcal{S}$ in Eq. (6) performs a “step”. It either reads a single character on the right and recurses on the left, or simply recurses. $\mathcal{S}$ describes the action of an alignment algorithm as seen by a single tape, i.e., without knowledge of what the recursion is doing simultaneously on other tapes. Note that by themselves these rules do not terminate. The grammar $\mathcal{N}$, Eq. (7), matches the empty input (or any empty substring of the input) and immediately terminates. Finally, $\mathcal{L}$ (Eq. (8)) singles out the non-terminating loop case already seen in Eq. (6). Intuitively, we can combine these three components on a single tape as

$$\mathcal{S} + \mathcal{N} - \mathcal{L} = (\{X\}, \{a, \$\}, \{X \to Xa \mid \$\}, X) \tag{9}$$

to obtain a grammar that simply reads an input tape. While not particularly useful in its own right, $\mathcal{S} + \mathcal{N} - \mathcal{L}$ is the single-tape analog of an alignment grammar. More importantly, it highlights the convenience of algebraic operations on grammars, which we shall introduce rigorously below for linear grammars.

Each operator introduced below primarily acts on sets of production rules. They implicitly carry over to the involved sets of terminals and non-terminals in an obvious manner. Two production rules are equivalent if they are isomorphic as in Eq. (14). This is of relevance insofar that it leads to idempotency in one of the operators below, but does not otherwise interfere with parsing.1

1. This is not completely true in the context of stochastic linear grammars: replication of a rule in an SCFG that already has duplicated rules requires that we sum over the probabilities for isomorphic rules.

The $+$ operation forms a monoid over the set of production rules. Since the production rules form a set, isomorphic rules collapse to a single rule. The empty set $P^n = \{\}$ is a neutral element and $P^n + P^n = P^n$, i.e., the $+$ monoid is idempotent. Isomorphism on production rules is also symbolic, that is, $X \to X$ is isomorphic to $X \to X$ but not to $\{X \to Y, Y \to X\}$, even though the latter set of two rules reduces to the first. For our example, we have $(X \to Xa \mid X) + (X \to \$) = (X \to Xa \mid X \mid \$)$.

Please note that the $+$ operation has different semantics than the usually encountered union of two grammars. In particular, the union $g \cup h$ of two grammars is typically defined in such a way that the intersection of non-terminals is empty, i.e., each non-terminal is tagged with a unique identifier. This is in contrast to our definition of the $+$ operation, where we explicitly treat symbols as equal if they have the same name.

The $-$ operator. While the $+$ operator unifies two sets of production rules, the $-$ operator acts as a set difference operator

$$P_1^n - P_2^n = \{ p \in P_1^n \mid p \notin P_2^n \}. \tag{11}$$

As for $+$, it requires operands of the same dimensionality. By construction, $-$ is not associative. Thus $-$ does not form a semigroup but merely a magma. The empty set of production rules acts as the neutral element on the right. This operator is important to explicitly remove production rules that yield infinite derivations. In our example, we need to remove $\{X \to X\}$. With the help of $-$ we can write $(X \to Xa \mid X) - (X \to X) = (X \to Xa)$. We shall see that it is often convenient to “temporarily” introduce productions that later on are excluded again from the final algorithm.

The $\otimes$ monoid. The definition of a direct product of left linear grammars lies at the heart of this contribution.

Definition 1. Let $G_1 = (N_1, T_1, P_1, S_1)$ and $G_2 = (N_2, T_2, P_2, S_2)$ be left-linear CFGs, i.e., all productions are of the form $A \to Bx$ or $A \to y$. Their direct product $G_1 \otimes G_2$ is the grammar $G = (N, T, P, S)$ with non-terminals $N = N_1 \times N_2 \cup N_1 \times \{\varepsilon\} \cup \{\varepsilon\} \times N_2$, terminals $T = T_1 \times T_2 \cup T_1 \times \{\varepsilon\} \cup \{\varepsilon\} \times T_2$, and start symbol $S = \binom{S_1}{S_2}$. The productions are of the forms

$$\binom{A_1}{A_2} \to \binom{B_1}{B_2}\binom{x_1}{x_2} \;\Big|\; \binom{B_1}{\varepsilon}\binom{x_1}{y_2} \;\Big|\; \binom{\varepsilon}{B_2}\binom{y_1}{x_2} \;\Big|\; \binom{y_1}{y_2}. \tag{12}$$

In the following we will also make use of the notation $(A_1 \to B_1 y_1) \otimes (A_2 \to B_2 y_2)$ for the product of two individual productions. By construction, we have

$$\dim(G_1 \otimes G_2) = \dim G_1 + \dim G_2. \tag{13}$$

The empty string $\varepsilon$ in the two-dimensional terminals and non-terminals is not necessarily associated with terminating the reading from the input band(s) as it denotes the absence of a parsing symbol. The $\$$ terminal symbol, on the other hand, explicitly parses only the empty (sub)-string.

To see that $\otimes$ is associative we need to demonstrate that the productions of $(G_1 \otimes G_2) \otimes G_3$ and $G_1 \otimes (G_2 \otimes G_3)$ are isomorphic, i.e.,

$$\begin{pmatrix} \binom{x_1}{x_2} \\ x_3 \end{pmatrix} \to \begin{pmatrix} \binom{\alpha_1}{\alpha_2} \\ \alpha_3 \end{pmatrix} \;\simeq\; \begin{pmatrix} x_1 \\ \binom{x_2}{x_3} \end{pmatrix} \to \begin{pmatrix} \alpha_1 \\ \binom{\alpha_2}{\alpha_3} \end{pmatrix}. \tag{14}$$
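As a small illustration of Definition 1 and of the isomorphism in Eq. (14), the following Python sketch stores every production column-wise as a triple (left-hand side, right-hand non-terminal, terminal), each part being a tuple with one entry per tape. The names `rule`, `times`, `gprod` and `dim` are hypothetical helpers chosen for this example and are not part of the authors' software. Because the coordinates are kept as flat tuples, the two bracketings of a triple product yield literally the same set, which is one way to read the claim that they are isomorphic.

```python
from itertools import product as cartesian

EPS = "ε"  # the "none" symbol


def rule(lhs, nt, term):
    """One-tape left-linear rule  lhs -> nt term.

    Assumed encoding: a terminating rule A -> y is stored with nt = EPS,
    and a rule without a terminal (such as the loop X -> X) with term = EPS.
    """
    return ((lhs,), (nt,), (term,))


def times(p, q):
    """Direct product of two productions: stack the coordinates of each part."""
    return tuple(a + b for a, b in zip(p, q))


def gprod(P, Q):
    """Direct product of two production sets (all pairwise products)."""
    return {times(p, q) for p, q in cartesian(P, Q)}


def dim(P):
    """Number of tapes; assumes a non-empty set of rules of equal dimension."""
    return len(next(iter(P))[0])


# Step grammar of Eq. (6): X -> Xa | X
S = {rule("X", "X", "a"), rule("X", "X", EPS)}

SS = gprod(S, S)
left = gprod(gprod(S, S), S)    # (S ⊗ S) ⊗ S
right = gprod(S, gprod(S, S))   # S ⊗ (S ⊗ S)

print(len(SS), dim(SS))          # number of rules and tapes of S ⊗ S
print(left == right, dim(left))  # both bracketings give the same rule set
```

The dimension of the product is the sum of the dimensions of the factors, in line with Eq. (13). A representation that kept the nesting of Eq. (14) explicit would need an additional flattening step before the two sides could be compared.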
510 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015
This is most easily seen in the notation with the extra $\varepsilon$ symbols since in this case the $\alpha_i$ are strings of length 2 that are simply decomposed in a column-wise fashion. Hence multiple products are well-defined. Furthermore, permutations of rows are isomorphisms. Thus $G_1 \otimes G_2 \simeq G_2 \otimes G_1$, i.e., exchanging the order of factors affects the order of the coordinates only. Due to the associativity of $\otimes$, we can safely extend these constructions to more than two factors. One easily checks that $\otimes$ and $+$ are distributive, i.e., $(G_1 + G_1') \otimes (G_2 + G_2') = G_1 \otimes G_2 + G_1 \otimes G_2' + G_1' \otimes G_2 + G_1' \otimes G_2'$.

The canonical projection $\pi_i : G_1 \otimes G_2 \to G_i$ is obtained by formally isolating the $i$th coordinate and contracting the empty strings $\varepsilon$ and the empty productions $(\varepsilon \to \varepsilon)$. Clearly we have $\pi_i(T) = T_i$, $\pi_i(N) = N_i$, $\pi_i(S) = S_i$, and $\pi_i(P) = P_i$. The grammar product thus has the basic properties of a well-defined product.

Let $\mathrm{lan}(G)$ denote the language generated by $G$. Note that a “string” in $\mathrm{lan}(G)$ is, by construction, a sequence of terminals, each of which is either of the form $\binom{x_1}{x_2}$ with $x_1 \in T_1$ and $x_2 \in T_2$, or of the form $\binom{x_1}{\varepsilon}$ with $x_1 \in T_1$, or of the form $\binom{\varepsilon}{x_2}$ with $x_2 \in T_2$. Thus $\mathrm{lan}(G_1 \otimes G_2)$ consists of alignments of strings $\alpha_i \in G_i$. To see this, note that each string $\alpha_i \in G_i$ is generated from $s_i$ using a finite sequence $\wp_i = (p_i^1, p_i^2, \ldots)$ of productions. Any partial matching of the $\wp_1$ and $\wp_2$ that preserves the sequential order of the two input sequences gives rise to a sequence $\wp$ of productions of the product grammar by matching all unmatched $p_i^k$ with the empty production $(\varepsilon \to \varepsilon)$. By construction $\pi_i(\wp) = \wp_i$, i.e., $\wp$ derives an alignment of the input strings $\beta_1$ and $\beta_2$. Conversely, given a sequence $\wp$ of productions of the product grammar, we know that $\pi_i(\wp)$ is a sequence of productions of $G_i$; hence it constructs strings in $\mathrm{lan}(G_i)$. It follows that the product language satisfies

$$\pi_i(\mathrm{lan}(G_1 \otimes G_2)) = \mathrm{lan}(G_i). \tag{15}$$

Similarly, we find that parse trees have a natural alignment structure. Let $t$ be a parse tree for an input $\beta \in \mathrm{lan}(G_1 \otimes G_2)$. Its interior nodes are labeled by the productions, i.e., pairs of the form $\binom{A_1 \to B_1 x_1}{A_2 \to B_2 x_2}$, $\binom{A_1 \to B_1 x_1}{\varepsilon}$, or $\binom{\varepsilon}{A_2 \to B_2 x_2}$. The projections $\pi_i(t)$ are obtained by retaining only the $i$th coordinate of each vertex label; contracting all vertices labeled by $\varepsilon$ in $\pi_i(t)$ yields a valid parse tree for $\pi_i(\beta)$ w.r.t. $G_i$. Thus $t$ is a tree alignment of the parse trees for the two input strings.

The direct product forms a monoid on grammars with arbitrary dimensions since

$$P_1^m \otimes P_2^n = \{ (p_1 \otimes p_2)^{m+n} \mid p_1^m \in P_1^m,\ p_2^n \in P_2^n \}, \tag{16}$$

where $p_1 \otimes p_2$ is explained in Definition 1. The neutral element of the monoid is the zero-dimensional grammar which has one production rule $\varepsilon^0 \to \varepsilon^0$ that neither reads nor writes anything as it does not operate on a tape. Albeit rather artificial at first glance, it is useful to have a neutral element available. For our example, we have

$$(X \to Xa \mid X) \otimes (X \to Xa \mid X) = \binom{X}{X} \to \binom{X}{X}\binom{a}{a} \;\Big|\; \binom{X}{X}\binom{a}{\varepsilon} \;\Big|\; \binom{X}{X}\binom{\varepsilon}{a} \;\Big|\; \binom{X}{X}. \tag{17}$$

This grammar contains the two-dimensional loop rule $\binom{X}{X} \to \binom{X}{X}$, derived from $(X \to X) \otimes (X \to X)$, that eventually needs to be eliminated. To this end, it will be convenient to consider yet another operation on productions.

The structure-preserving power $\ast$. For any $k$-dimensional grammar $G$ and any natural number $n \in \mathbb{N}$, $G \ast n$ denotes the $k \cdot n$-dimensional grammar with the same structure. Each $k$-dimensional (terminal or non-terminal) symbol $(a_1, \ldots, a_k)^\top$ is transformed to a $k \cdot n$-dimensional symbol $((a_1 \ldots a_k), \ldots, (a_1 \ldots a_k))^\top$. Note that for a grammar with a single production rule we have $G \otimes G \equiv G \ast 2$.

For our example grammar, this operation is useful as short-hand for both Eq. (7) and Eq. (8). In the case of linear grammars, the $\ast$ operator is mostly useful as shorthand to expand singleton grammars. It is worth noting, however, that some algorithms in computational biology, notably the Sankoff algorithm [12], work on multiple tapes with a grammar structured very similarly to their one-dimensional relatives. We will return to the topic in Section 3.3.

2.3 The Needleman-Wunsch Alignment Grammar
We can now construct the full Needleman-Wunsch alignment grammar from the much simpler 1-dimensional constituents of Eqs. (6–8) in the following way:

$$NW = \mathcal{S} \otimes \mathcal{S} + \mathcal{N} \ast 2 - \mathcal{L} \ast 2. \tag{18}$$

Written in terms of the productions only, this can be rephrased as

$$(X \to Xa \mid X) \otimes (X \to Xa \mid X) + (X \to \$) \ast 2 - (X \to X) \ast 2 = \binom{X}{X} \to \binom{X}{X}\binom{a}{a} \;\Big|\; \binom{X}{X}\binom{a}{\varepsilon} \;\Big|\; \binom{X}{X}\binom{\varepsilon}{a} \;\Big|\; \binom{\$}{\$}. \tag{19}$$

Again we have used a distinct symbol $\$$ to highlight the termination case deriving from $\mathcal{N}$. Since our construction of the Needleman-Wunsch grammar is based on well-defined algebraic operations we can readily use the same approach to construct much more complex alignment algorithms. Before we proceed, however, we present Gotoh’s algorithm as a more complex example and address the technical issue of loop rules.

2.4 Gotoh’s Grammar in Product Form
The construction of Gotoh’s algorithm (Eq. 1) is a bit more complicated than the simple product construction for the Needleman-Wunsch grammar. It will make use of the following simple component grammars, of which grammar $\mathcal{S}$ (Eq. 24) defines a start symbol:

$$\mathcal{G} = (\{X, Y\}, \{a\}, \{X \to Xa \mid Ya,\ Y \to X \mid Y\}), \tag{20}$$

$$\mathcal{V} = (\{X, Y, S\}, \{a\}, \{X \to Ya,\ Y \to Y,\ S \to Y\}), \tag{21}$$

$$\mathcal{W} = (\{X, Y\}, \{\}, \{Y \to X \mid Y\}), \tag{22}$$

$$\mathcal{N} = (\{X\}, \{\$\}, \{X \to \$\}), \tag{23}$$
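Returning to the Needleman-Wunsch construction of Eqs. (18) and (19), the algebra can be replayed on plain sets of productions. The following Python sketch is only an illustration under stated assumptions: productions are stored column-wise as (left-hand side, right-hand non-terminal, terminal) tuples, $+$ is taken to be set union (as suggested by the description of $+$ as unifying two rule sets), $-$ is the set difference of Eq. (11), and the expression is evaluated from left to right. The helper names `rule`, `gprod`, `power` and `show` are hypothetical and do not come from the authors' compiler.

```python
from itertools import product as cartesian

EPS = "ε"  # the "none" symbol


def rule(lhs, nt, term):
    """One-tape rule lhs -> nt term; nt or term may be EPS (assumed encoding)."""
    return ((lhs,), (nt,), (term,))


def gprod(P, Q):
    """Direct product: pair every rule of P with every rule of Q."""
    return {tuple(a + b for a, b in zip(p, q)) for p, q in cartesian(P, Q)}


def power(P, n):
    """Structure-preserving power: repeat every symbol on n copies of the tapes."""
    return {tuple(part * n for part in p) for p in P}


def show(p):
    """Render one production with '/' separating the tapes."""
    lhs, nt, term = p
    col = lambda xs: "(" + "/".join(xs) + ")"
    return f"{col(lhs)} -> {col(nt)}{col(term)}"


S = {rule("X", "X", "a"), rule("X", "X", EPS)}  # Eq. (6): X -> Xa | X
N = {rule("X", EPS, "$")}                        # Eq. (7): X -> $
L = {rule("X", "X", EPS)}                        # Eq. (8): X -> X

# Eq. (18): NW = S ⊗ S + N*2 - L*2, evaluated left to right
NW = (gprod(S, S) | power(N, 2)) - power(L, 2)

for p in sorted(NW):
    print(show(p))
```

The resulting set contains the three stepping rules and the terminating rule of Eq. (19); the two-dimensional loop rule introduced by $\mathcal{S} \otimes \mathcal{S}$ is removed by subtracting $\mathcal{L} \ast 2$.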
Recent additions to the compiler infrastructure [18], which provide instruction-level parallelism, will reduce this factor further. As ADPfusion is built on top of the Repa [19] library for CPU-level parallelism, we can expect improvements in this regard to be available for our dynamic programming algorithms in the near future.

Finally, we provide colored, pretty-printed diagnostics to aid during grammar development.

3 ALIGNMENT ALGORITHMS
The overwhelming majority of alignment programs solve pairwise alignment problems by exact DP but use heuristics to combine the pairwise solutions to multiple alignments. The main reason is practicality. Full-fledged $n$-way DP alignments have exponential running time in $n$ and hence are of little practical use for large $n$, even though elaborate divide-and-conquer strategies have been proposed to prune the search space, see e.g., [1]. Three-way alignments nevertheless are employed in practice, in particular when high accuracy is crucial, see e.g., [20], [21], [22], [23]. Four-way alignments were recently explored for aligning short words from human language data [24]. We suspect that DP approaches for moderate values of $n$ have not been explored for specialized applications because of the effort required for their implementation. In this section we demonstrate how the product construction can help, using a combined nucleic-acid/protein alignment algorithm as an example.

3.1 Global, Semi-Global, and Local Alignments

3.1.1 Global Alignments
The global alignment described above is the simplest variant of pairwise sequence alignment. Needleman-Wunsch style global alignments in grammatical form have a very convenient structure. The global alignment of $k$ sequences (and therefore $k$ tapes) can be written as $SW_k = \bigotimes_{i=1}^{k} \mathcal{S} - \mathcal{L} \ast k + \mathcal{N} \ast k$, where $\bigotimes_{i=1}^{k} \mathcal{S}$ denotes the $k$-fold product of a grammar with itself. By virtue of the monoidal (and hence associative) structure of the $\otimes$ operator it is well-defined.

This property of the global alignment grammar was quite useful in recent work on historical linguistics [25] where all alignments for $k$-tuples with $k \in \{2, 3, 4\}$, or two- to four-

3.1.2 Semi-Global Alignments
We first modify the Needleman-Wunsch grammar, Eq. (18), in such a way that it models a semi-global alignment, i.e., to allow the grammar to act locally on one or more tapes. This allows us to construct scanning-type algorithms that can be used in genome-wide applications such as HMMer [27] and Infernal [28], [29]. The basic idea is to replace the one-dimensional “step grammar” by a slightly more complex one that allows us to skip a prefix or a suffix:

$$\mathcal{S} = (\{X\}, \{a\}, \{X \to Xa \mid X\}) \tag{28}$$

$$\mathcal{N}_L = (\{L\}, \{\$\}, \{L \to \$\}) \tag{29}$$

$$\mathcal{N}_X = (\{X\}, \{\$\}, \{X \to \$\}) \tag{30}$$

$$\mathcal{L} = (\{X\}, \{\}, \{X \to X\}, X) \tag{31}$$

$$\mathcal{O} = (\{X, R, L\}, \{a\}, \{R \to Ra \mid X,\ X \to L,\ L \to La\}, R). \tag{32}$$

The extension from a global to a semi-global (or global) alignment is done using another grammar with a total of four rules. These rules allow the removal of nucleotides on the right (via $R$ rules) or left ($L$ rules) and switching to and from the actual alignment grammar. The extended step grammar $\mathcal{O}$ introduces the transitions from the right “ignored” part $R$ to the aligned part $X$, and finally from $X$ to the left “ignored” part $L$. It reads through the “ignored” parts with $R \to Ra$ and $L \to La$ rules. The “stop grammar” $\mathcal{N}$ now needs to recognize the end of the tape $\$$ as the r.h.s. of the non-terminal $L$.

The combined semi-global grammar can be written as

$$SG = \mathcal{S} \otimes \mathcal{S} + \mathcal{O} \otimes \mathcal{L} - \mathcal{L} \ast 2 + \mathcal{N}_L \otimes \mathcal{N}_X \tag{33}$$

and has the eight productions

$$\begin{aligned} \binom{R}{X} &\to \binom{R}{X}\binom{a}{\varepsilon} \;\Big|\; \binom{X}{X} \\ \binom{X}{X} &\to \binom{X}{X}\binom{a}{a} \;\Big|\; \binom{X}{X}\binom{\varepsilon}{a} \;\Big|\; \binom{X}{X}\binom{a}{\varepsilon} \;\Big|\; \binom{L}{X} \\ \binom{L}{X} &\to \binom{L}{X}\binom{a}{\varepsilon} \;\Big|\; \binom{\$}{\$} \end{aligned} \tag{34}$$
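The count of eight productions for the semi-global grammar of Eq. (33) can be checked mechanically. The Python sketch below is an illustration rather than the authors' implementation: it assumes that $+$ is set union, that $-$ is set difference, that the expression is evaluated from left to right, and that rules without a terminal (such as $R \to X$) carry $\varepsilon$ in the terminal position. The helpers `rule`, `gprod` and `power` are hypothetical names introduced for this example.

```python
from collections import Counter
from itertools import product as cartesian

EPS = "ε"  # the "none" symbol


def rule(lhs, nt, term):
    """One-tape rule lhs -> nt term; nt or term may be EPS (assumed encoding)."""
    return ((lhs,), (nt,), (term,))


def gprod(P, Q):
    """Direct product: pair every rule of P with every rule of Q."""
    return {tuple(a + b for a, b in zip(p, q)) for p, q in cartesian(P, Q)}


def power(P, n):
    """Structure-preserving power: repeat every symbol on n copies of the tapes."""
    return {tuple(part * n for part in p) for p in P}


S = {rule("X", "X", "a"), rule("X", "X", EPS)}   # Eq. (28)
NL = {rule("L", EPS, "$")}                        # Eq. (29)
NX = {rule("X", EPS, "$")}                        # Eq. (30)
L = {rule("X", "X", EPS)}                         # Eq. (31)
O = {rule("R", "R", "a"), rule("R", "X", EPS),    # Eq. (32)
     rule("X", "L", EPS), rule("L", "L", "a")}

# Eq. (33): SG = S ⊗ S + O ⊗ L - L*2 + N_L ⊗ N_X, evaluated left to right
SG = ((gprod(S, S) | gprod(O, L)) - power(L, 2)) | gprod(NL, NX)

# Group the productions by their two-dimensional left-hand side
per_lhs = Counter(lhs for lhs, _, _ in SG)
for lhs, count in sorted(per_lhs.items()):
    print("/".join(lhs), count)
print(len(SG))
```

Grouping by left-hand side gives two rules for $(R/X)$, four for $(X/X)$ and two for $(L/X)$, which matches the three lines of Eq. (34).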
grammar-based scheme to derive a local algorithm from a global one. Our construction is based on the observation that we can interpret the Smith-Waterman algorithm as a “concatenation” of three interconnected Needleman-Wunsch algorithms, where the first and the last one score only the excluded parts of the sequences. This can be written as

$$SW = \sum_{i=0}^{2} (\mathcal{S}_i \otimes \mathcal{S}_i - \mathcal{L}_i \ast 2) + \mathcal{N}_2 \ast 2 + \mathcal{T} \ast 2. \tag{35}$$

Here, $\mathcal{S}$, $\mathcal{L}$, $\mathcal{N}$ are derived from the global alignment versions

$$\begin{aligned} \mathcal{S}_i &= (\{X_i\}, \{a\}, \{X_i \to X_i a \mid X_i\}) \\ \mathcal{L}_i &= (\{X_i\}, \{\}, \{X_i \to X_i\}) \\ \mathcal{N}_2 &= (\{X_2\}, \{\$\}, \{X_2 \to \$\}) \\ \mathcal{T} &= (\{X_0, X_1, X_2\}, \{\}, \{X_0 \to X_1,\ X_1 \to X_2\}, X_0) \end{aligned}$$

and $\mathcal{T}$ defines the transitions between the grammars. The resulting product grammar contains 12 production rules with $\binom{X_0}{X_0}$ as start symbol:

$$\begin{aligned} \binom{X_0}{X_0} &\to \binom{X_0}{X_0}\binom{\varepsilon}{a} \;\Big|\; \binom{X_0}{X_0}\binom{a}{\varepsilon} \;\Big|\; \binom{X_0}{X_0}\binom{a}{a} \;\Big|\; \binom{X_1}{X_1} \\ \binom{X_1}{X_1} &\to \binom{X_2}{X_2} \;\Big|\; \binom{X_1}{X_1}\binom{\varepsilon}{a} \;\Big|\; \binom{X_1}{X_1}\binom{a}{\varepsilon} \;\Big|\; \binom{X_1}{X_1}\binom{a}{a} \\ \binom{X_2}{X_2} &\to \binom{\$}{\$} \;\Big|\; \binom{X_2}{X_2}\binom{\varepsilon}{a} \;\Big|\; \binom{X_2}{X_2}\binom{a}{\varepsilon} \;\Big|\; \binom{X_2}{X_2}\binom{a}{a} \end{aligned} \tag{36}$$

The naïve formulation is not ideal in practice in that it requires three memoization tables for the three non-terminals $\binom{X_0}{X_0}$, $\binom{X_1}{X_1}$, and $\binom{X_2}{X_2}$. In case of a local scoring system where the excluded parts of the alignment ($\binom{X_0}{X_0}$ and $\binom{X_2}{X_2}$) are scored by a constant, it is possible to replace the $O(nm)$ memo-tables with tables of size $O(1)$. This is possible by recognizing that every subword (index) in such a table can be memoized by the same single value. We will come back to this point in the discussion section.

3.1.4 Symmetric and Asymmetric Scoring
It is important to recognize that the grammar alone is a device that enumerates all possible alignments of a DNA sequence with a protein sequence. In particular, the grammar itself will not disallow alignments that are biologically unsound. However, each grammar created using our framework has all of its rules tagged with function symbols. These function symbols are also known as algebra symbols in the context of ADP [2], from which we also borrowed the tagging symbol ‘<<<’ used in Fig. 1. In this sense, our framework is very similar to S-attribute grammars [30].

Nevertheless, we can support the construction of the scoring algebra already during grammar design by explicitly making use of symmetries in the model. The alignment of two sequences of the same type is usually simplified due to mirrored operations. Recalling the alignment grammar from above, we speak of in/del operations as an insertion in one sequence that may just as well be described as a deletion in the other sequence. In addition, it does not matter which sequence is bound to which input tape. In some applications this symmetry is broken. For example, ancient DNA is partially chemically degraded by cytosine deamination, i.e., C is misread as T in sequencing [31]; to model such effects, asymmetric substitution score matrices are required. The same is true for alignments of lexical data in a computational linguistics setting because sound changes are directional between two languages [24]. To enforce symmetry we may use the same non-terminal symbols for each tape, while asymmetry can be indicated by the use of different (or indexed) symbols on the different tapes.

3.2 DNA-Protein Alignment
The problem of aligning a protein sequence to a nucleic acid (RNA or DNA) sequence is a rather specialized problem that arises in particular in the context of (homology-based) gene annotation. The best example is probably NCBI’s prosplign, which aligns a protein query sequence to a piece of genomic DNA allowing also for introns. A detailed description of this dynamic programming algorithm has not yet become available. An interesting variation on this theme is gene annotation in the presence of extensive insertion/deletion editing as observed in the mitochondria of Physarum polycephalum or trypanosomatids. Frequent changes of the reading frame make it virtually impossible to identify homologs of mitochondrial proteins by tblastn, thus calling for specialized alignment algorithms [32], [33].

Our task is to align the amino acid sequence of a protein that may be present in a mitochondrial genome to the entire nucleic acid sequence of a mitogenome. Since we suspect that mRNAs may be subject to insertion or deletion editing, it is necessary to track frameshifts. Fig. 1 shows a general version of such an approach. The DNA sequence is read in one of three reading frames (RFs), and a deletion or insertion does not yield a “simple” in/del but also a frame shift to account for the effect of in/dels on the translation of the DNA into protein according to the “codons” of the genetic code. In Fig. 1 frame shifts (with scoring functions rf1 and rf2) are enabled. Staying within a frame is modelled either by a (mis)match stay or by the deletion of all three characters of a codon (del). Finally, the alignment is to be calculated locally w.r.t. the DNA sequence but globally w.r.t. the amino acid sequence. In the grammar of Fig. 1 this is achieved by adding a component grammar that “skips” an unaligned prefix and suffix on the DNA band while leaving the protein band untouched. This follows the same insight as in the simpler alignment grammars above.

As each of the three frames, and shifts to the other two frames, is by itself similar to the other two frames, a special encoding saves a lot of work. The F non-terminal indicating the current frame is indexed with indices 0, 1, and 2. Frame shifts are thus calculated modulo 3 instead of explicitly creating all three frame indices F0 to F2 and their corresponding production rules. Furthermore, all alignments are local with respect to the DNA sequence but global with respect to the protein sequence. The product of an embedding grammar (DNAlocal) with a grammar that does not read any amino acid character yields the correct semi-global embedding. For the protein sequence, the corresponding PROstand grammar can simply be reused.

The complexity of the DNA-protein alignment stems from the fact that we need to “align” the different frame shifting possibilities in the DNA input while matching
Fig. 1. Atomic grammars for the DNA-Protein alignment example. (I) Nucleotides are read in triplets (three nucleotides each). The genome is aligned
locally to the complete amino acid sequence. Using DNAlocal, nucleotides can be removed from the left or right end of the DNA sequence. The
choice between local or global alignment for each tape is made based on adding the grammar product DNAlocal >< PROstand. The DNA grammar
switches between reading frames. DNAdone and DNAstand handle the terminating and looping case. (II) The PROtein grammar works similarly, but
reads only a single amino acid at a time. The expansion of the DNA grammar is more complicated, as the indexed non-terminal symbol F expands to
three different non-terminals corresponding to the three possible reading frames. (III) The grammar product of DNA and PROtein without the looping
case “stand” and with the terminating case “done”. In code, >< represents the direct product ($\otimes$). The resulting 34-production rule grammar is
shown in the Supplemental Material, available online, together with an extended description.
zero to three nucleotides to zero or one amino acid in the protein input. In addition, once a frame shift has occurred all following alignments of three nucleotides against one amino acid are scored in the new reading frame until another frame shift occurs or the alignment is completed.

Under normal circumstances, the scoring algebra for the DNA-Protein grammar will assign very high costs to frameshift productions rf1 and rf2. In order to model the frequent cytosine insertions in P. polycephalum, however, we simply use a moderate or low penalty for rf1 when the incomplete codon is corrected by the inclusion of a ‘C’ that is not encoded in the DNA.

Our framework simplifies the complexity of designing this algorithm considerably. While the combined grammar is highly complex, the individual grammars are rather simple. As already mentioned, the protein “stepping grammar” is one of the simplest possible ones. The DNA grammar is more complex as we need to handle stepping and frame shifts in all three reading frames. But considering that we allow indexed non-terminals and calculations on these indices (modulo 3 in the frame shift case), even the frame shift grammar has only four rules, just twice as many as the simplest stepping grammar.

The resulting 34-production rule grammar is easily calculated in our framework. We emphasize that one may readily extend this grammar to allow for, say, an alignment of two DNA sequences with two protein sequences. This grammar can be calculated at basically no additional cost but would pose a daunting task if implemented by hand.

In addition to the 34-production rule grammar, a set of (10 + 1) function types3 is created. Each function type is associated with a pair of function symbols, one for the DNA and one for the protein tape, apart from the choice function (the “+1”). This signature, as it is called in Algebraic Dynamic Programming [2], [17], associates one type with one or more of the production rules. For example, the product of F{i} -> stay <<< F{i} c c c and P -> amino <<< a creates, among others, the production rule $\binom{F_0}{P} \to \binom{F_0}{P}\binom{c}{a}\binom{c}{\varepsilon}\binom{c}{\varepsilon}$.

3. The 11th function type designates the optimizing choice function, which is part of every evaluation algebra.

The corresponding evaluation function stay_amino has type $\mathcal{S} \times (\mathcal{N} \times \mathcal{A}) \times (\mathcal{N} \times \mathcal{E}) \times (\mathcal{N} \times \mathcal{E}) \to \mathcal{S}$. This type specifies that stay_amino accepts the score (of type $\mathcal{S}$) of the alignments as calculated up to this position followed by tuples of characters read from each tape. The first nucleotide character (of the set $\mathcal{N} = \{A, C, G, T\}$) is aligned with the corresponding amino acid character (of the set $\mathcal{A}$ of the 20 amino acids). The remaining two nucleotides are aligned with elements from the empty set ($\mathcal{E}$ which emits the “non-informative character”
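In addition, the algorithm makes use of multiple cores on a single machine in a parallel setup. This algorithm is embarrassingly parallel as all pairs of DNA windows and proteins can be aligned simultaneously, given enough resources.

The final ingredient to good performance is the implementation of the dynamic programming algorithm itself. Our implementation uses ADPfusion as the underlying dynamic programming framework. In this way, questions

question whether such generalization could be obtained as subsets of the product of NUS with itself:

$$\begin{aligned} \binom{S}{S} &\to \binom{\$}{\$} \;\Big|\; \binom{\$}{Sa} \;\Big|\; \binom{\$}{SB} \;\Big|\; \binom{Sa}{\$} \;\Big|\; \binom{Sa}{Sa} \;\Big|\; \binom{Sa}{SB} \;\Big|\; \binom{SB}{\$} \;\Big|\; \binom{SB}{Sa} \;\Big|\; \binom{SB}{SB} \\ \binom{S}{B} &\to \binom{\$}{aS\hat{a}} \;\Big|\; \binom{Sa}{aS\hat{a}} \;\Big|\; \binom{SB}{aS\hat{a}} \\ \binom{B}{B} &\to \binom{aS\hat{a}}{aS\hat{a}} \\ \binom{B}{S} &\to \binom{aS\hat{a}}{\$} \;\Big|\; \binom{aS\hat{a}}{Sa} \;\Big|\; \binom{aS\hat{a}}{SB} \end{aligned} \tag{40}$$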
augmented by non-terminals of the form $\binom{A}{\varepsilon}$ and $\binom{\varepsilon}{A}$. The production rules of the product are the following:

\[
\begin{aligned}
(A_1 \to a_1 B_1 C_1) \mathbin{\}} (A_2 \to a_2 B_2 C_2) &= \binom{A_1}{A_2} \to \binom{a_1}{a_2}\binom{B_1}{B_2}\binom{C_1}{C_2} \\
(A_1 \to a_1 B_1 C_1) \mathbin{\}} (A_2 \to b_2 D_2) &= \binom{A_1}{A_2} \to \binom{a_1}{b_2}\binom{B_1}{D_2}\binom{C_1}{\varepsilon} \mid \binom{a_1}{b_2}\binom{B_1}{\varepsilon}\binom{C_1}{D_2} \\
(A_1 \to a_1 B_1 C_1) \mathbin{\}} (A_2 \to c_2) &= \binom{A_1}{A_2} \to \binom{a_1}{c_2}\binom{B_1}{\varepsilon}\binom{C_1}{\varepsilon} \\
(A_1 \to b_1 D_1) \mathbin{\}} (A_2 \to a_2 B_2 C_2) &= \binom{A_1}{A_2} \to \binom{b_1}{a_2}\binom{D_1}{B_2}\binom{\varepsilon}{C_2} \mid \binom{b_1}{a_2}\binom{\varepsilon}{B_2}\binom{D_1}{C_2} \\
(A_1 \to b_1 D_1) \mathbin{\}} (A_2 \to b_2 D_2) &= \binom{A_1}{A_2} \to \binom{b_1}{b_2}\binom{D_1}{D_2} \; \boxed{\mid \binom{b_1}{b_2}\binom{D_1}{\varepsilon}\binom{\varepsilon}{D_2} \mid \binom{b_1}{b_2}\binom{\varepsilon}{D_2}\binom{D_1}{\varepsilon}} \\
(A_1 \to b_1 D_1) \mathbin{\}} (A_2 \to c_2) &= \binom{A_1}{A_2} \to \binom{b_1}{c_2}\binom{D_1}{\varepsilon} \\
(A_1 \to c_1) \mathbin{\}} (A_2 \to a_2 B_2 C_2) &= \binom{A_1}{A_2} \to \binom{c_1}{a_2}\binom{\varepsilon}{B_2}\binom{\varepsilon}{C_2} \\
(A_1 \to c_1) \mathbin{\}} (A_2 \to b_2 D_2) &= \binom{A_1}{A_2} \to \binom{c_1}{b_2}\binom{\varepsilon}{D_2} \\
(A_1 \to c_1) \mathbin{\}} (A_2 \to c_2) &= \binom{A_1}{A_2} \to \binom{c_1}{c_2}.
\end{aligned}
\tag{42}
\]

By construction, } is commutative (up to exchanging the coordinates). One easily checks that the product grammar is again in two-GNF since the r.h.s. of each production consists of a terminal followed by zero, one, or two non-terminal symbols. As in the case of the linear grammars we explain the productions of non-terminals of the form $\binom{A}{\varepsilon}$ by the productions of $A$ in the first factor grammar, for instance $\binom{A}{\varepsilon} \to \binom{a}{\varepsilon}\binom{B}{\varepsilon}\binom{C}{\varepsilon}$. The distributive law $(P_1 + P_2) \mathbin{\}} P_3 = P_1 \mathbin{\}} P_3 + P_2 \mathbin{\}} P_3$ also holds by construction.
To show that } is associative, it therefore suffices to show that the product is associative for any three productions. Since there are only three types of rules in a two-GNF, it suffices to consider the 27 possible products of triples of production rules, which altogether lead to 57 rules. We used a computational proof to establish that associativity is indeed satisfied, see Supplemental Material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2014.2326155. It is important to note that the two “decoupling rules” for $(A_1 \to b_1 D_1) \mathbin{\}} (A_2 \to b_2 D_2)$ indicated by the box in Eq. (42) are necessary for associativity of the product.
Linear grammars can be understood as special cases of two-GNF with productions of the form A → Bx y (except that we now deal with right linear instead of left linear grammars). A comparison of the definition of the linear-grammar product in Eq. (12) with that of } in Eq. (42) shows that the restriction of } to linear grammars does not recover (the mirror image of) the product of Eq. (12). The discrepancy is exactly the two “decoupling rules” necessary for associativity of the } product. For instance, the }-square of the step grammar $(X \to aX \mid \varepsilon X)$ has the productions

\[
\begin{aligned}
\binom{X}{X} &\to \binom{a}{a}\binom{X}{X} \mid \binom{a}{a}\binom{X}{\varepsilon}\binom{\varepsilon}{X} \mid \binom{a}{a}\binom{\varepsilon}{X}\binom{X}{\varepsilon} \\
\binom{X}{X} &\to \binom{a}{\varepsilon}\binom{X}{X} \mid \binom{a}{\varepsilon}\binom{X}{\varepsilon}\binom{\varepsilon}{X} \mid \binom{a}{\varepsilon}\binom{\varepsilon}{X}\binom{X}{\varepsilon} \\
\binom{X}{X} &\to \binom{\varepsilon}{a}\binom{X}{X} \mid \binom{\varepsilon}{a}\binom{X}{\varepsilon}\binom{\varepsilon}{X} \mid \binom{\varepsilon}{a}\binom{\varepsilon}{X}\binom{X}{\varepsilon} \\
\binom{X}{X} &\to \binom{\varepsilon}{\varepsilon}\binom{X}{X} \mid \binom{\varepsilon}{\varepsilon}\binom{X}{\varepsilon}\binom{\varepsilon}{X} \mid \binom{\varepsilon}{\varepsilon}\binom{\varepsilon}{X}\binom{X}{\varepsilon}.
\end{aligned}
\tag{43}
\]

Only the first term in each line appears in the product of Eq. (12).
We have fully implemented the definition of the } product for the two-GNF of Eq. (42), so that general CFGs in two-GNF can be defined and multiplied like linear (left-, right-, and general linear) grammars in our domain-specific language, providing access to an efficient implementation of the resulting multi-tape product grammars.

4 DISCUSSION
Our main contribution is a formal, abstract algebra on linear grammars. This algebra provides operations to create complex, multi-tape grammars from simple, single-tape atomic ones. More informally, we have created a method and implementation to “multiply” dynamic programming algorithms. We also provide a compiler framework that makes the grammars readily available for actual deployment with good performance of the resulting code. Products of linear grammars make it very easy to construct the grammars underlying key algorithms in the field of string comparison starting from almost trivial single-tape factors. This approach is particularly fruitful in highly specialized applications as it drastically reduces the efforts required for implementing prototypes, as the example of DNA/protein alignments with frameshifts shows. The work presented here is just a first step towards a general theory of grammar products. Many questions, both theoretical and practical, remain open.
Although we have succeeded in constructing an algebraically meaningful product operation of CFGs in normal form, it is currently restricted to the Greibach normal form. A fully generic version of the grammar product currently eludes us. While this poses no theoretical problem given that every CFG can be transformed into an equivalent CFG in (Greibach) Normal Form, it poses a problem in practice. Often, a production rule is associated with a certain structural feature that one wants to retain. Transformations into normal form also increase the number of non-terminals (and thereby resource usage) by a polynomial depending on the number of non-terminals [42].
It will also be important to explore both the interplay of different operators on grammars (especially our + operation and the union (∪) of grammars), and to formalize meaning and operation. This will provide, in the long-term, a full-fledged algebraic framework in which it should be easily possible to describe even complex grammatical problems. As pointed out by a reviewer, it may sometimes be convenient to introduce a grammar G = ({Z}, {}, {}) for the purpose of introducing the non-terminal Z (using +) or for the purpose of removing all rules that contain Z (using ). For instance, ({Y}, {}, {}) k could remove the unwanted double-deletion YY in Eq. (25) in a quite convenient manner. Our formalism, however, defines algebraic operations
518 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 12, NO. 3, MAY/JUNE 2015
primarily on sets of production rules. This construction thus will require careful (re-)definition of the algebraic operations acting on grammars.
An intriguing idea (proposed by Robert Giegerich) is to identify irreducible grammars from which more complex ones can be constructed, possibly in a unique way. To that end, we will require a much better understanding of abstract algebras on grammars as well as the impact of constructions such as (Z, {}, {}). Since k-tape grammars can formally be written without making the dimension of the non-terminals explicit, a conceptually related question is whether product grammars can be efficiently recognized. More abstractly: is there a “unique prime-factorization theorem” for grammars akin to analogous results for graphs [43]?
Another avenue of future research is the question of semantic ambiguity [44] of the resulting grammars. Simple products of the same grammar yield ambiguous alignments on sequences of in-dels, that is, multiple derivations exist that yield the same alignment. This problem was studied extensively for stochastic context-free grammars implementing covariance models for structural RNA search [45]. It is typically dealt with by a good grammar design that explicitly allows only one order of successive insertions and deletions on multiple tapes. Automatic disambiguation is probably complicated but would further simplify the creation of complex multi-tape grammars.
In this contribution we have focussed entirely on the grammars underlying the dynamic programming algorithms and disregarded almost entirely the construction of scoring algebras for product grammars. We anticipate that in many cases, a scoring algebra can be expressed as a form of product itself where the two scoring functions (one for each grammar) are themselves combined in some well-defined form. One possibility is the use of a folding operation to combine scores for subsets of the individual dimensions. It then follows that given two algebras $A_{G_1}$ and $A_{G_2}$ for grammars $G_1$ and $G_2$ we should be able to define an operation $A_{G_1}$ k $A_{G_2}$ which generates appropriate algebras from algebras for atomic grammars. As long as k has some structure similar to a fold or another operation on subsets of the dimensions (of the grammars) involved, appropriate products can be automatically defined. This will become particularly useful when aiming at ADP-like [46] algebra-products to explore the rich space of combined algebras on grammars constructed from algebraic operations on atomic grammars.
In Section 3.1 on local alignments we mentioned that a naïve memoization of the three non-terminals of the Smith-Waterman algorithm leads to a three-fold increase in memory usage compared to the usual implementation based on one table and a neutral element in the scoring function. In case of local alignments, the evaluation functions attached to each production rule return a constant value, often the neutral element (i.e., a score of 0 for summations), for all productions and all substrings.
We plan to extend our framework to make it possible to evaluate combinations of grammars and algebras in more complex ways. This should allow us to automatically determine what kind of memo-table is required for each grammar and algebra, thereby optimizing memory consumption of the dynamic programming algorithms.
Good and optimal table designs based on yield size analysis have been considered in [47], extending earlier ideas on more restricted dynamic programming algorithms [48]. Our proposed extension will consider not only the yield size but the actual evaluation algebra, thereby including more domain-specific knowledge.

ACKNOWLEDGMENTS
This work was funded, in part, by the Austrian FWF, project “SFB F43 RNA regulation of the transcriptome” and the German DFG project “MI439/14-1”. CHzS thanks Jing, Katja, Lydia, and Nancy (and gin, as well as a mad man in a box). The authors thank Maria Walter for her hospitality that was very conducive for the development of this work, and the referees – Prof. Robert Giegerich and one anonymous – for providing in-depth comments.

REFERENCES
[1] D. J. Lipman, S. F. Altschul, and J. D. Kececioglu, “A tool for multiple sequence alignment,” Proc. Nat. Acad. Sci. USA, vol. 86, no. 12, pp. 4412–4415, 1989.
[2] R. Giegerich and C. Meyer, “Algebraic dynamic programming,” in Proc. Algebraic Methodol. Softw. Technol., 2002, vol. 2422, pp. 349–364.
[3] R. Giegerich, C. Meyer, and P. Steffen, “A discipline of dynamic programming over sequence data,” Sci. Comput. Program., vol. 51, no. 3, pp. 215–263, 2004.
[4] C. M. Reidys, F. W. D. Huang, J. E. Andersen, R. C. Penner, P. F. Stadler, and M. E. Nebel, “Topology and prediction of RNA pseudoknots,” Bioinformatics, vol. 27, pp. 1076–1085, 2011.
[5] H. Chitsaz, R. Salari, S. C. Sahinalp, and R. Backofen, “A partition function algorithm for interacting nucleic acid strands,” Bioinformatics, vol. 25, pp. i365–i373, 2009.
[6] F. W. D. Huang, J. Qin, C. M. Reidys, and P. F. Stadler, “Partition function and base pairing probabilities for RNA-RNA interaction prediction,” Bioinformatics, vol. 25, pp. 2646–2654, 2009.
[7] S. Will, C. Schmiedl, M. Miladi, M. Möhl, and R. Backofen, “SPARSE: Quadratic time simultaneous alignment and folding of RNAs without sequence-based heuristics,” in Proc. 17th Int. Conf. Res. Comput. Mol. Biol., 2013, pp. 289–290.
[8] C. Höner zu Siederdissen, I. L. Hofacker, and P. F. Stadler, “How to multiply dynamic programming algorithms,” in Proc. Brazilian Symp. Bioinformat., 2013, pp. 82–93.
[9] O. Gotoh, “An improved algorithm for matching biological sequences,” J. Mol. Biol., vol. 162, pp. 705–708, 1982.
[10] F. Lefebvre, Grammaires S-attribuées multi-bandes et applications à l’analyse automatique de séquences biologiques, Ph.D. dissertation, École Polytechnique, Palaiseau, France, 1997.
[11] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J. Mol. Biol., vol. 48, no. 3, pp. 443–453, 1970.
[12] D. Sankoff, “Simultaneous solution of the RNA folding, alignment and protosequence problems,” SIAM J. Appl. Math., vol. 45, pp. 810–825, 1985.
[13] The GHC Team. (1989–2013). The Glasgow Haskell Compiler (GHC) [Online]. Available: http://www.haskell.org/ghc/
[14] T. Sheard and S. P. Jones, “Template Meta-programming for Haskell,” in Proc. ACM SIGPLAN Workshop Haskell, 2002, pp. 1–16.
[15] G. Mainland, “Why it’s nice to be quoted: Quasiquoting for Haskell,” in Proc. ACM SIGPLAN Workshop Haskell Workshop, 2007, pp. 73–82.
[16] D. Coutts, R. Leshchinskiy, and D. Stewart, “Stream fusion: From lists to streams to nothing at all,” in Proc. 12th ACM SIGPLAN Int. Conf. Funct. Program., 2007, pp. 315–326.
[17] C. Höner zu Siederdissen, “Sneaking around concatMap: Efficient combinators for dynamic programming,” in Proc. 17th ACM SIGPLAN Int. Conf. Funct. Program., 2012, pp. 215–226.
[18] G. Mainland, R. Leshchinskiy, S. P. Jones, and S. Marlow, “Exploiting vector instructions with generalized stream fusion,” in Proc. 18th ACM SIGPLAN Int. Conf. Funct. Program., 2013, pp. 37–48.
[19] G. Keller, M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier, “Regular, shape-polymorphic, parallel arrays in Haskell,” in Proc. 15th ACM SIGPLAN Int. Conf. Funct. Program., 2010, pp. 261–272.
[20] O. Gotoh, “Alignment of three biological sequences with an efficient traceback procedure,” J. Theor. Biol., vol. 121, pp. 327–337, 1986.
[21] T. G. Dewey, “A sequence alignment algorithm with an arbitrary gap penalty function,” J. Comput. Biol., vol. 8, pp. 177–190, 2001.
[22] A. S. Konagurthu, J. Whisstock, and P. J. Stuckey, “Progressive multiple alignment using sequence triplet optimization and three-residue exchange costs,” J. Bioinf. Comput. Biol., vol. 2, pp. 719–745, 2004.
[23] M. Kruspe and P. F. Stadler, “Progressive multiple sequence alignments from triplets,” BMC Bioinformat., vol. 8, p. 254, 2007.
[24] L. Steiner, P. F. Stadler, and M. Cysouw, “A pipeline for computational historical linguistics,” Lang. Dynamics Change, vol. 1, pp. 89–127, 2011.
[25] N. Retzlaff, “Bigramm-Alignierung und ihre Anwendung in der historischen Linguistik,” Bachelor's thesis, Department of Computer Science, Eberhard Karls Universität Tübingen, Tübingen, Germany, 2013.
[26] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” J. Mol. Biol., vol. 147, pp. 195–197, 1981.
[27] S. R. Eddy, “HMMER: Profile HMMs for protein sequence analysis,” Bioinformatics, vol. 14, pp. 755–763, 1998.
[28] S. R. Eddy and R. Durbin, “RNA sequence analysis using covariance models,” Nucl. Acids Res., vol. 22, pp. 2079–2088, 1994.
[29] E. P. Nawrocki, D. L. Kolbe, and S. R. Eddy, “Infernal 1.0: inference of RNA alignments,” Bioinformatics, vol. 25, no. 10, pp. 1335–1337, 2009.
[30] F. Lefebvre, “An optimized parsing algorithm well suited to RNA folding,” in Proc. 3rd Int. Conf. Intell. Syst. Mol. Biol., 1995, pp. 222–230.
[31] K. Prüfer, U. Stenzel, M. Hofreiter, S. Pääbo, J. Kelso, and R. E. Green, “Computational challenges in the analysis of ancient DNA,” Genome Biol., vol. 11, p. R47, 2010.
[32] J. M. Gott, N. Parimi, and R. Bundschuh, “Discovery of new genes and deletion editing in Physarum mitochondria enabled by a novel algorithm for finding edited mRNAs,” Nucl. Acids Res., vol. 33, pp. 5063–5072, 2005.
[33] C. Beargie, T. Liu, M. Corriveau, H. Y. Lee, J. Gott, and R. Bundschuh, “Genome annotation in the presence of insertional RNA editing,” Bioinformatics, vol. 24, pp. 2571–2578, 2008.
[34] R. Bundschuh, “Computational approaches to insertional RNA editing,” Methods Enzymol., vol. 424, pp. 173–195, 2007.
[35] R. Bundschuh, J. Altmüller, C. Becker, P. Nürnberg, and J. M. Gott, “Complete characterization of the edited transcriptome of the mitochondrion of Physarum polycephalum using deep sequencing of RNA,” Nucl. Acids Res., vol. 39, pp. 6044–6055, 2011.
[36] H. Takano, T. Abe, R. Sakurai, Y. Moriyama, Y. Miyazawa, H. Nozaki, S. Kawano, N. Sasaki, and T. Kuroiwa, “The complete DNA sequence of the mitochondrial genome of Physarum polycephalum,” Mol. General Genetics, vol. 264, pp. 539–545, 2001.
[37] B. F. Lang, G. Burger, C. J. O’Kelly, R. Cedergren, G. B. Golding, C. Lemieux, D. Sankoff, M. Turmel, and M. W. Gray, “An ancestral mitochondrial DNA resembling a eubacterial genome in miniature,” Nature, vol. 387, pp. 493–497, 1997.
[38] R. Nussinov, G. Piecznik, J. R. Griggs, and D. J. Kleitman, “Algorithms for loop matching,” SIAM J. Appl. Math., vol. 35, pp. 68–82, 1978.
[39] R. D. Dowell and S. R. Eddy, “Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction,” BMC Bioinformat., vol. 5, p. 71, 2004.
[40] S. A. Greibach, “A new normal-form theorem for context-free phrase structure grammars,” J. ACM, vol. 12, pp. 42–52, 1965.
[41] A. Ehrenfeucht and G. Rozenberg, “An easy proof of Greibach normal form,” Inf. Control, vol. 63, pp. 190–199, 1984.
[42] N. Blum and R. Koch, “Greibach normal form transformation revisited,” Inf. Comput., vol. 150, pp. 112–118, 1999.
[43] R. Hammack, W. Imrich, and S. Klavžar, Handbook of Product Graphs, 2nd ed. Boca Raton, FL, USA: CRC Press, 2011.
[44] R. Giegerich, “Explaining and controlling ambiguity in dynamic programming,” in Combinatorial Pattern Matching. New York, NY, USA: Springer, 2000, pp. 46–59.
[45] R. Giegerich and C. Höner zu Siederdissen, “Semantics and ambiguity of stochastic RNA family models,” IEEE/ACM Trans. Comput. Biol. Bioinformat., vol. 8, no. 2, pp. 499–516, Mar. 2011.
[46] P. Steffen and R. Giegerich, “Versatile and declarative dynamic programming using pair algebras,” BMC Bioinformat., vol. 6, p. 224, 2005.
[47] P. Steffen and R. Giegerich, “Table design in dynamic programming,” Inf. Comput., vol. 204, no. 9, pp. 1325–1345, 2006.
[48] H. L. Bodlaender and J. A. Telle. (2004, Dec.). Space-efficient construction variants of dynamic programming. Nordic J. Comput. [Online]. 11(4), pp. 374–385. Available: http://dl.acm.org/citation.cfm?id=1062991.1062995

Christian Höner zu Siederdissen is a postdoctoral researcher with the Institute for Theoretical Chemistry at the University of Vienna. His research interests include optimization problems in biology, Bayesian statistics, formal languages, and functional programming.

Ivo L. Hofacker is a professor at the Departments of Chemistry and of Computer Science at the University of Vienna, where he heads the Institute for Theoretical Chemistry and the Research Group Bioinformatics and Computational Biology, respectively. His research interests focus on the prediction of RNA structures, noncoding RNA annotation, and the analysis of folding landscapes.

Peter F. Stadler received the PhD degree in Chemistry from the University of Vienna in 1990, and then worked as an assistant and associate professor for theoretical chemistry at the same School. In 2002, he moved to Leipzig as a full professor for bioinformatics. Since 1994, he has been an external professor at the Santa Fe Institute. He has been an external scientific member of the Max Planck Society since 2009 and a corresponding member abroad of the Austrian Academy of Sciences since 2010.