An Exceptionally Easy Way to Generate and Parse Randomized Algebraic Expressions --- Revision 2

Brian Beckman
14 June 2012; 27 Sept 2021

ABSTRACT
Mathematica (or the Wolfram language) offers an exceptionally easy and straightforward way to generate
tools like test generators, parsers, type-checkers, visualizers, and solvers from certain concise grammars.
In this paper, we illustrate our methods with syntax-driven travesties (random utterances) and syntax-driven
parser generation for an example grammar, F, of algebraic expressions. Elsewhere, we tested our generators
on much bigger grammars, for instance, the famous game Wff-n-Proof (link below), for which we
include inferencing, and also for a small programming language from “Types and Programming Languages”
by Benjamin C. Pierce, for which we exhibit type checking.

(Wff-n-Proof: http://www.wffnproof.com, http://books.google.com/books?id=xwC6SgAACAAJ&dq=Wff-n-Proof&hl=en&sa=X&ei=LWXoT4KlIKKo2wXZ5YnaCQ&ved=0CEIQ6AEwAQ)

OVERVIEW
Travesties are syntactically valid but random utterances, useful for testing. We could, for example, easily
write a strategy for Python’s Hypothesis property-based testing library using methods from this paper
(https://hypothesis.readthedocs.io/en/latest/ ).

Parsers convert utterances (streams of tokens) into trees, imposing the recursive structure of the grammar.
A parser generator produces a parser from a grammar. Parser generation is usually considered abstruse
and difficult. Think what it would take you to replicate yacc. However, our yacc-alike is trivialized because of
the specific form we require of our grammars.

The grammars we found easy to process are in stripped-down Polish Prefix Form (PPF). In our version of
PPF, each non-terminal in the grammar has a unique leading head symbol. Arities of functional forms, i.e.,
numbers of arguments, are fixed. No punctuation is needed or allowed because a head immediately dictates
how many arguments follow: no curly braces, commas, semicolons, brackets, parentheses, etc. Also, no
operators, overloads, or precedence tables are needed or allowed.
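
To see why fixed arities make punctuation unnecessary, here is a minimal sketch in the Wolfram Language. The names arityOf, readPrefix, and node are hypothetical and belong neither to this notebook nor to Jacquard. The sketch only illustrates that a head symbol alone determines how many sub-expressions to consume from a flat token list.

```wolfram
ClearAll[arityOf, readPrefix, node];

(* Hypothetical arity table: each head fixes its argument count. *)
arityOf[Plus] = 2;
arityOf[Times] = 2;
arityOf[_] = 0; (* anything else is a leaf *)

(* readPrefix takes a flat token list and returns {tree, remainingTokens}. *)
readPrefix[{}] := Throw["unexpected end of tokens"];
readPrefix[{head_, rest___}] :=
  Module[{args = {}, toks = {rest}, sub},
    Do[
      sub = readPrefix[toks];     (* parse one argument *)
      AppendTo[args, First[sub]]; (* keep its tree *)
      toks = Last[sub],           (* continue with the leftover tokens *)
      {arityOf[head]}];
    (* node is an inert wrapper so Plus and Times do not evaluate *)
    {If[args === {}, head, node[head, Sequence @@ args]], toks}];

(* The flat stream Plus x Times y z needs no parentheses: *)
readPrefix[{Plus, x, Times, y, z}]
(* {node[Plus, x, node[Times, y, z]], {}} *)
```

The second element of the result is the list of unconsumed tokens, which is empty when the whole stream forms exactly one expression.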

We speculate that more complicated grammars, say those for C++, Python, or Haskell, can be translated
into our PPF by a top-level, traditional parser that treats our PPF as an intermediate representation. We
further speculate that it might be worth the effort to transcribe complicated syntaxes into PPF because,
stripped of syntactic noise, our PPFs ease the work of downstream language-processing tools like test
generators, parsers, type-checkers, visualizers, and solvers.

We do not strive for academic rigor, preferring to illustrate by examples and to invite the reader to join the
fun, to extend and apply our methods. We also state, up front, that any novelty here comes from using
Mathematica. Our methods have been known to Lisp programmers, for example, for a very long time (see
my paper from 1989, “A Scheme for Interactive Graphics,” link below). The target audience here
comprises programmers who are accustomed to working much too hard, at too low a level of abstraction,
to achieve similar results.

(Notebook file: AlgebraicExpressionTravesties002.nb)

https://www.researchgate.net/publication/2319434_A_Scheme_for_Interactive_Graphics

Preliminaries

In[2]:= << Utilities`CleanSlate`

If you don’t mention their explicit locations, you’ll need the following packages on your Mathematica $Path.
The following shows the explicit locations where I put the packages on my system:

In[3]:= << "c:/Users/brianbeckman/Dropbox/MMA/Packages/JacquardProlog.m"
<< "c:/Users/brianbeckman/Dropbox/MMA/logic.m"
<< "c:/Users/brianbeckman/Dropbox/MMA/Packages/Jacquard.m"

POLISH PREFIX GRAMMARS


Our Polish prefix grammars (PPFs) are of the form [non-terminal] [some-fixed-number-of-arguments].
Such grammars are easy to parse, and it’s even easy to automatically generate parsers for
such grammars.

DEFINITIONS

Forward references to later definitions are in non-bold italics. Undefined words like field are also in non-bold
italics. Definitions are in bold-italics.

Let a grammar be a function from non-terminal symbols to productions, each of which is a suit (ordered
collection with no duplicates, aka permutation) of alternatives. Alternatives are ordered for the convenience
of pairing (zipping) them with travesty-generation probabilities; logically, alternatives are unordered, i.e., a
set rather than a suit.

For example, F below is our grammar for algebraic expressions. We assume an algebraic field with addition
and multiplication following the usual commutative, distributive, and associative laws. (Mathematica automatically
enforces these laws when simplifying expressions. Incidentally, Mathematica’s simplifications are not
well founded, as Mathematica freely annihilates expressions multiplied by zero without stipulating that
denominators be non-zero.)

In[6]:= ClearAll[F, Expr, FSum, FInt, FVar, FProd, Start]; (* for "Field" *)
F[Expr] = {{FSum}, {FProd}, {FVar}, {FInt}}; (* "Sum" collides with a built-in. *)
F[FSum] = {{Plus, Expr, Expr}}; (* Do we want the built-in "Plus"? *)
F[FInt] = {{RandomInteger}};
F[FVar] = {
  {x}, {y}, {z}, {-x}, {-y}, {-z}, {1/x}, {1/y}, {1/z}
  (*,{u},{v},{w},{p},{q},{r},{s},{t}*)};
F[FProd] = {{Times, Expr, Expr}};
F[Start] = {{Expr}};

In this example, F is a function from the non-terminal symbols Expr, FSum, FInt, FVar, FProd, and Start,
each to a list (as a suit) of alternatives, each alternative a list (as a sequence) of terms, recursively. This list
of non-terminal symbols is the domain of the function.

A production is a suit of alternatives. The value of F[Expr] above, namely {{FSum}, {FProd}, {FVar},
{FInt}}, is an example of a production.

Each alternative is a list, stream, sequence, or array (all synonyms for an ordered collection, duplicates
allowed) of terms. The terms must match, in both order and form, the terms in utterances.

An utterance is a list (as a stream) of terminals like {Plus, x, y}. This utterance has the parse tree
FSum[Plus, FVar[x], FVar[y]], which is a stand-in, written in our PPF grammar, for the algebraic expression
x+y. The parser, which we will generate automatically from the grammar, converts utterances into
parse trees. Parse trees exhibit the recursive structure of utterances, a structure imposed by the grammar.

A term is either a terminal symbol or a non-terminal symbol.

A terminal symbol is a literal like an integer or a bottom-level expression like x, 1/x, or -1. A terminal
symbol must match exactly a token appearing in the input.

A token is an atomic expression in the source language. Examples include a symbol, e.g., Plus or FSum, a
String (we don’t use strings here, but they work just fine), or an FInt enclosing a literal integer.

A non-terminal symbol recurses back into the grammar. It’s always of Mathematica type (i.e., Head)
Symbol.

The start symbol is the special, distinguished name and symbol Start, not available for other uses.

How about a pretty display for the domain of F? From the Jacquard library imported in the Preliminaries
section of this document, gridRules exhibits terminals in dark red, non-terminals in dark blue, lists in green
boxes, and lists of lists in doubled green boxes. This visual notation does not distinguish lists as suits, lists as
sets, lists as sequences, or lists as bags. Mathematica Rules have orange boxes on the left and light-yellow
boxes on the right. The following displays the rules above. The code includes a little inscrutable gymnastics
to prevent premature evaluation by Mathematica (see Robby Villegas’s paper “Working with Unevaluated
Expressions,” https://library.wolfram.com/infocenter/Conferences/377/ ).

In[13]:= ClearAll[visGrammar];
visGrammar[g_] := Module[{d = DownValues[g]},
  Zip[
    Function[hp, hp[[1, 1, 1]]] /@ Unevaluated /@ d,
    Function[hp, hp[[1, 1]]] /@ d, Rule]] // gridRules;
visGrammar[F]

Out[15]= (gridRules display; each row is one rule, left-hand side -> right-hand side)
Expr  -> {FSum}, {FProd}, {FVar}, {FInt}
FInt  -> {RandomInteger}
FProd -> {Times, Expr, Expr}
FSum  -> {Plus, Expr, Expr}
FVar  -> {x}, {y}, {z}, {Times[-1, x]}, {Times[-1, y]}, {Times[-1, z]},
         {Power[x, -1]}, {Power[y, -1]}, {Power[z, -1]}
Start -> {Expr}

Fish out the non-terminals from our grammar, F:



In[16]:= ClearAll[nonTerminalsFromGrammar];
nonTerminalsFromGrammar[ps_] := #[[1, 1, 1]] & /@ DownValues[ps]
(nonTerminalsFromGrammar[F]) // InputForm

Out[18]//InputForm=
{Expr, FInt, FProd, FSum, FVar, Start}

The terminals in a grammar are the complement of the non-terminals against the set of all symbols, which is
the union of all the right-hand sides of the productions.

In[19]:= ClearAll[allSymbols];
allSymbols[ps_] := Union @ Flatten[#[[2]] & /@ DownValues[ps]]

In our pretty display for allSymbols via Jacquard’s gridExpression below, terminals are in a light-yellow
background, non-terminals in a purple background; gridExpression is a visual version of Mathematica’s
built-in FullForm, so it shows the Head List explicitly, whereas gridRules above exhibits lists as colored
boxes without the explicit Head List.

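In[21]:= allSymbols[F] // gridExpression

Out[21]= (gridExpression display: the Head List over the symbols Expr, FInt, FProd, FSum, FVar, Plus,
RandomInteger, Times, and the terminals Power[x, -1], Times[-1, x], x, Power[y, -1], Times[-1, y], y,
Power[z, -1], Times[-1, z], z)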

Mathematica’s built-in InputForm is another way to visualize data.

In[22]:= allSymbols[F] // InputForm


Out[22]//InputForm=

{Expr, FInt, FProd, FSum, FVar, Plus, RandomInteger, Times, x^(-1), -x, x, y^(-1),
-y, y, z^(-1), -z, z}

Here are the terminals from our example grammar, F, prettified by Mathematica’s automatic typesetting
when we don’t use InputForm:

In[23]:= ClearAll[terminalsFromGrammar];
terminalsFromGrammar[ps_] :=
  Complement[
    allSymbols @ ps,
    nonTerminalsFromGrammar @ ps]
(T = terminalsFromGrammar @ F)

Out[25]= {Plus, RandomInteger, Times, 1/x, -x, x, 1/y, -y, y, 1/z, -z, z}

ANTI-PARSING: SYNTAX-DRIVEN TRAVESTIES

◼ Inject Probabilities into the Grammar

Transform each alternative (sequence of terms) into a list of Mathematica Rules (a list of rules is a
Jacquard object): one rule for the probability of choosing the alternative in a travesty and another rule for
the original alternative itself. Generate the probabilities with a function that maps the alternatives to a
parallel, isocardinal (zippable) list. The orders of the alternatives and of the probabilities are important only
for Zip, even though the alternatives are, notionally, a set.

In[26]:= ClearAll[injectGenerationProbabilities];
injectGenerationProbabilities[
  grammar_,
  probsFromAlternatives_] :=
 Module[{newTable, nonTerminals = nonTerminalsFromGrammar @ grammar},
  Scan[(* Scan is like Map, but just for side-effects *)
   Function[nonTerminal,
    With[{
      alternatives = grammar[nonTerminal],
      probabilities = probsFromAlternatives[grammar[nonTerminal]]},
     newTable[nonTerminal] = (* side-effect this definition *)
      Zip[probabilities, alternatives,
       Function[{prob, alt},
        {"probability" -> prob, "alternative" -> alt}]]]],
   nonTerminals]; (* scan over this list *)
  newTable]; (* return the side-effected table *)
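
Zip comes from the Jacquard package, which may not be on your $Path. As a sketch, and assuming that Zip[as, bs, f] applies f to corresponding elements of two equal-length lists (an assumption inferred from the call above, not checked against Jacquard itself), the built-in MapThread can build the same probabilized alternatives. The name probabilize is hypothetical.

```wolfram
ClearAll[probabilize];

(* Pair each probability with its alternative as a small list of rules. *)
probabilize[probabilities_List, alternatives_List] :=
  MapThread[
    {"probability" -> #1, "alternative" -> #2} &,
    {probabilities, alternatives}];

probabilize[{0.5, 0.5}, {{Plus, Expr, Expr}, {Times, Expr, Expr}}]
(* {{"probability" -> 0.5, "alternative" -> {Plus, Expr, Expr}},
    {"probability" -> 0.5, "alternative" -> {Times, Expr, Expr}}} *)
```

MapThread complains if the two lists differ in length, which is a useful check that the probabilities really are parallel to the alternatives.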

The following function assigns equal probabilities to every element of a list. It’s the default. Use something
else if you have better estimates of appropriate probabilities for travesty generation.

In[28]:= ClearAll[equiProbabilities];
equiProbabilities[list_List] :=
  With[{l = Length @ list}, Table[N[1/l], {l}]]
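
If equal probabilities make travesties grow too large, one option is to weight some alternatives more heavily. The sketch below has the same calling convention as equiProbabilities (a list of alternatives in, a parallel list of probabilities out). The names weightOf and weightedProbabilities, and the weight 3, are illustrative assumptions only.

```wolfram
ClearAll[weightOf, weightedProbabilities];

(* Hypothetical weights: favor alternatives that lead straight to leaves. *)
weightOf[{FVar}] = 3;
weightOf[{FInt}] = 3;
weightOf[_] = 1;

(* Normalize the weights of the alternatives so that they sum to 1. *)
weightedProbabilities[w_][alternatives_List] :=
  With[{ws = w /@ alternatives}, N[ws/Total[ws]]];

weightedProbabilities[weightOf][{{FSum}, {FProd}, {FVar}, {FInt}}]
(* {0.125, 0.125, 0.375, 0.375} *)
```

Passing weightedProbabilities[weightOf] in place of equiProbabilities to injectGenerationProbabilities should then bias travesties toward shorter expressions, because Expr would expand to a leaf three times out of four.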

Here is the grammar, F, probabilized:

In[30]:= (FProbabilized = injectGenerationProbabilities[F,
    equiProbabilities]) // visGrammar

Out[30]= (gridRules display; each row is one non-terminal with its probabilized alternatives)
Expr  -> probability 0.25 each: {FSum}, {FProd}, {FVar}, {FInt}
FInt  -> probability 1.: {RandomInteger}
FProd -> probability 1.: {Times, Expr, Expr}
FSum  -> probability 1.: {Plus, Expr, Expr}
FVar  -> probability 0.111111 each: {x}, {y}, {z}, {Times[-1, x]}, {Times[-1, y]}, {Times[-1, z]},
         {Power[x, -1]}, {Power[y, -1]}, {Power[z, -1]}
Start -> probability 1.: {Expr}

◼ Choose From Alternatives

We need a function from probabilized alternatives to a particular choice, given an input die roll.

If there is only one alternative, choose it. Notice that we access the key “alternative” in the lookup table
(Jacquard object) by applying the Jacquard rules that implement the object via /., ReplaceAll. ReplaceAll
fills the role of dot from object-oriented programming in Jacquard’s lightweight polymorphism, the part of
object-oriented programming that Jacquard exploits.

In[31]:= ClearAll[chooseFromAlternatives];
chooseFromAlternatives[{probabilizedAlternative_}, dieRoll_] :=
"alternative" /. probabilizedAlternative;

If there are many alternatives, pick the first whose cumulative probability exceeds dieRoll. Track the
cumulative probability by decrementing dieRoll by each looked-up probability as we recurse down the
alternatives. (More scalable is Walker’s method of aliases (e.g.,
https://www.keithschwarz.com/darts-dice-coins/), but cumulation is good enough for small cases.)

In[33]:= chooseFromAlternatives[{probabilizedAlternative_, rest___}, dieRoll_] :=
  With[{p = "probability" /. probabilizedAlternative},
   If[dieRoll < p,
    (* then *) "alternative" /. probabilizedAlternative,
    (* else *) chooseFromAlternatives[{rest}, dieRoll - p]]];
chooseFromAlternatives[badArgs___] := Throw[{"CHOOSE:BADARGS: ", {badArgs}}];
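
For comparison, Mathematica’s built-in RandomChoice accepts weights directly in the form RandomChoice[{w1, w2, ...} -> {e1, e2, ...}], so a weighted pick can also be sketched without a hand-rolled cumulative scan. The function chooseWithRandomChoice is a hypothetical name. It assumes the same list-of-rules representation of probabilized alternatives used above, and it draws its own random number instead of taking a dieRoll.

```wolfram
ClearAll[chooseWithRandomChoice];

(* "key" /. {{rules1}, {rules2}, ...} looks the key up in every rule list,
   giving parallel lists of probabilities and of alternatives. *)
chooseWithRandomChoice[probabilizedAlternatives_List] :=
  RandomChoice[
    ("probability" /. probabilizedAlternatives) ->
    ("alternative" /. probabilizedAlternatives)];

(* Illustrative data, not taken from FProbabilized: *)
demoAlternatives = {
  {"probability" -> 0.9, "alternative" -> {FVar}},
  {"probability" -> 0.1, "alternative" -> {FInt}}};

(* Counts vary from run to run; {FVar} should dominate. *)
Tally[Table[chooseWithRandomChoice[demoAlternatives], {1000}]]
```

The trade-off is testability: the version with an explicit dieRoll is deterministic given its input, whereas this sketch is reproducible only through SeedRandom.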

◼ generateSentence [ probabilizedGrammar, groundTerm, terminalSymbols, recursionLimit:100 ]

The ground term is the term to force when the recursion limit is exceeded. The default recursion limit is 100.
That’s too big for algebraic expressions, where 20 is ample. But 100 or even 500 is useful in other
applications.

◻ chainExpansion: helper
It’s debatable whether to have a symbol like debugPrint for Print debugging, or to use
Block[{Print=Identity},...] to switch off printing. We opt for the former.

In[35]:= ClearAll[debugPrint];
debugPrint = Identity; (* Set it to Print when debugging,
                          to Identity when not debugging. *)

In[37]:= ClearAll[chainExpansion];

Iterate down the terms, randomly choosing a branch to explore.

Case: exhausted the production:

In[38]:= chainExpansion[iP_, groundTerm_, T_, production : {}, sentence_,
   i_, iLim_] := Module[{},
   debugPrint[<|"branch" -> "exhausted",
     "i" -> i, "production" -> production, "sentence" -> sentence|>];
   sentence];

Case: we have a term in the production:



In[39]:= chainExpansion[iP_, gd_, T_,
   production : {term_, rest___},
   sentence_, i_, iLim_] /; (i < iLim) :=
  Module[{},
   If[MemberQ[T, term],
    (* terminal symbol *)
    (* handle something like "if Term then Term else Term" *)
    Module[{subtrees, result},
     subtrees =
      If[MemberQ[T, #], {#},
         chainExpansion[iP, gd, T, {#}, {}, i + 1, iLim]] & /@ {rest};
     result = Join[sentence, {term}, Flatten[subtrees, 1]];
     debugPrint[<|"branch" -> "term", "i" -> i, "prodn" -> production,
       "term" -> term, "rest" -> {rest}, "sub" -> subtrees, "res" -> result|>];
     result],
    (* non-terminal symbol *)
    Module[{choice = chooseFromAlternatives[iP[term], RandomReal[]],
      replacement, result},
     replacement = chainExpansion[iP, gd, T, choice, {}, i + 1, iLim];
     result = Join[sentence, replacement];
     debugPrint[<|"branch" -> "nterm", "i" -> i, "prodn" -> production, "term" -> term,
       "rest" -> {rest}, "choice" -> choice, "repl" -> replacement, "res" -> result|>];
     result]]];

Case: exceeded the recursion limit:

In[40]:= chainExpansion[iP_, gd_, T_, production_, sentence_, i_, iLim_] :=
  Module[{choice = chooseFromAlternatives[iP[gd], RandomReal[]], result},
   result = Join[sentence, choice];
   debugPrint[<|"branch" -> "exceeded", "i" -> i, "sen" -> sentence,
     "choice" -> choice, "prodn" -> production, "res" -> result|>];
   result];

◻ generateSentence [ iP, groundTerm, T, recursionLimit ]

In[41]:= ClearAll[generateSentence];

In[42]:= generateSentence[iP_, groundTerm_, T_, recursionLimit_ : 100] :=
  Block[{$RecursionLimit = 4096}, (* supports recursionLimits up to 500 *)
   chainExpansion[iP, groundTerm, T, {Start}, {}, 0, recursionLimit]]

Let’s have a button to generate random sentences. We’ll do something much more interesting a little bit
below, but first unit-test the gadgetry so far:

In[43]:= DynamicModule[{foo},
  Manipulate[foo, Button["GEN", foo =
     ((generateSentence[FProbabilized, FVar, terminalsFromGrammar @ F, 10] // InputForm))]]]

Out[43]= (a Manipulate panel with a GEN button; before the first click it shows the unassigned symbol FE`foo$$32)

Now we have utterances in the grammar. Let’s parse them and display them.

DATA-DRIVEN PARSING

Can we write parserFromGrammar, a function that writes a parser from a grammar? Such a thing is analogous
to yacc, but our version is pitifully short because it exploits convenient properties of PPF.

Any of our parsers takes a stream of tokens and a tree, then iteratively augments the tree by side effect. We
don’t write down that function signature explicitly; just remember it.

Here is a parserFromGrammar that works on any of our PPF grammars:

First, inspect an example grammar to guide our work:

In[44]:= DownValues[F] // TableForm

Out[44]//TableForm=
HoldPattern[F[Expr]] :> {{FSum}, {FProd}, {FVar}, {FInt}}
HoldPattern[F[FInt]] :> {{RandomInteger}}
HoldPattern[F[FProd]] :> {{Times, Expr, Expr}}
HoldPattern[F[FSum]] :> {{Plus, Expr, Expr}}
HoldPattern[F[FVar]] :> {{x}, {y}, {z}, {-x}, {-y}, {-z}, {1/x}, {1/y}, {1/z}}
HoldPattern[F[Start]] :> {{Expr}}

Next, we need to disassemble a grammar rule into its constituent pieces.



◼ prefix, suffix, terminalsFromRule, nonTerminalHeadFromRule, arityFromPrefixRule

In[45]:= ClearAll[
prefix, suffix,
terminalsFromRule,
nonTerminalHeadFromRule,
arityFromPrefixRule];

In[46]:= prefix = First; suffix = Rest;

In[47]:= terminalsFromRule[rule_, nonTerminals_] :=
  Select[prefix /@ rule[[2]], ! MemberQ[nonTerminals, #] &];

In[48]:= nonTerminalHeadFromRule[rule_, nonTerminals_] :=
  With[{h = rule[[1, 1, 1]]},
   If[! MemberQ[nonTerminals, h],
    Throw[{"NON-TERMINAL HEAD FROM RULE: CATASTROPHE", nonTerminals, h}]];
   h];

In[49]:= arityFromPrefixRule[rule_] := With[{lens = Union[Length /@ suffix /@ rule[[2]]]},
   If[Length @ lens > 1,
    Throw[{"ARITY FROM PREFIX RULE: CATASTROPHE", lens}]];
   First @ lens]

◼ genParserBody, parserDefFromGrammarRule

In[50]:= ClearAll[genParserBody, parserDefFromGrammarRule];

For arity-0 productions:

In[51]:= genParserBody[0, tok_, toks_, head_, parts_, parse_] :=
  {toks, head[tok, Sequence @@ parts]}

Recurse for productions with right-hand sides



In[52]:= ClearAll[remToks, partTree];
remToks[list_List] := First @ list;
partTree[list_List] := First @ Rest @ list;
genParserBody[arity_ ? (# > 0 &), tok_, toks_, head_, parts_, parse_] :=
  Module[{rec = parse[toks]},
   genParserBody[arity - 1, tok, remToks @ rec,
    head, Join[parts, {partTree @ rec}], parse]]

Accumulate the overloads of parserTargetSym: for each set of terminals in a rule, write a Mathematica
pattern to recognize those terminals as Mathematica Alternatives (firstPattern below). The second pattern
just picks up the tree-so-far.

In[56]:= parserDefFromGrammarRule[parserTargetSym_, rule_, nonTerminals_] :=
  Module[{tok, toks, tree, parts, xs},
   With[{
     ts = terminalsFromRule[rule, nonTerminals],
     h = nonTerminalHeadFromRule[rule, nonTerminals],
     a = arityFromPrefixRule[rule]},
    If[Length @ ts =!= 0,
     (* then *)
     Module[{
       firstPattern = {tok : Alternatives @@ ts, toks___},
       secondPattern = tree_ : Null},
      (* perform the def in-place by side-effect on the given symbol *)
      parserTargetSym[firstPattern, secondPattern] :=
       genParserBody[a, tok, {toks}, h, {}, parserTargetSym]],
     (* else *)
     parserTargetSym[{}, tree_ : Null] := {{}, tree}]];
   parserTargetSym[xs___] :=
    Throw[{ToString @ parserTargetSym <> ": CATASTROPHE: ", xs}]]
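
The key move in parserDefFromGrammarRule is building a pattern from data: Alternatives @@ ts turns a list of terminals into a pattern that matches any one of them. Here is a small self-contained sketch of that idea. The symbols demoTerminals and classify are hypothetical, and the sketch injects the computed pattern with With, which is a different mechanism from the Module locals used above.

```wolfram
ClearAll[demoTerminals, classify];
demoTerminals = {Plus, Times};

(* With substitutes the computed pattern Plus | Times into the
   left-hand side before the definition is made. *)
With[{pat = Alternatives @@ demoTerminals},
  classify[{tok : pat, toks___}] := {"operator", tok, {toks}}];

(* Fallback for any other leading token. *)
classify[{tok_, toks___}] := {"other", tok, {toks}};

classify[{Plus, x, y}]  (* {"operator", Plus, {x, y}} *)
classify[{x, y}]        (* {"other", x, {y}} *)
```

Because the pattern is computed from a list, adding a terminal to the grammar changes the generated definitions without any change to the generator.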

Build a parser up by side-effect into a variable (parserTable, a parameter below) by scanning
parserDefFromGrammarRule over all the rules in the grammar:

In[57]:= ClearAll[parserPatterns];
parserPatterns[grammar_, parserTable_] :=
  Scan[
   parserDefFromGrammarRule[parserTable, #, nonTerminalsFromGrammar @ grammar] &,
   DownValues @ grammar];

That’s it for the parser generator! It’s equivalent to a yacc for our massively simplified PPF grammars. It’s
small because it doesn’t consider punctuation, operator precedence, and overloads in the object language
of the PPF.

Build a parser from our grammar, automatically:



In[59]:= ClearAll[$exprTable];
parserPatterns[F, $exprTable];

Make a little ad-hoc function (ad-hoc-ness signified by the dollar sign in the name of the function) to generate
random sentences, parse them, and then typeset them in Mathematica.

In[61]:= ClearAll[$fexpr];
$fexpr[dummy_, depth_ : 20] :=
  With[{sen = generateSentence[
      FProbabilized, FVar, terminalsFromGrammar @ F, depth]},
   With[{parse = ($exprTable @ sen)[[2]]},
    With[{interp = {
        FInt[RandomInteger] :> RandomInteger[10],
        FVar -> Identity,
        (FSum | FProd)[f_, args__] :> Apply[f, {args}]}},
     {TreeForm[parse, VertexLabeling -> Automatic],
      (parse //. interp // FullSimplify)}]]]
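
The interpretation step is plain rule rewriting. As a sketch on a hand-written parse tree (demoTree is an illustrative assumption, not generator output, and demoInterp is a hypothetical name), the same kind of rules collapse the grammar heads into an ordinary Mathematica expression:

```wolfram
ClearAll[demoTree, demoInterp];

(* A hand-written parse tree standing in for x + y * <random integer>. *)
demoTree = FSum[Plus, FVar[x], FProd[Times, FVar[y], FInt[RandomInteger]]];

demoInterp = {
  FInt[RandomInteger] :> RandomInteger[10], (* :> draws a fresh integer each time *)
  FVar -> Identity,                         (* FVar[x] becomes Identity[x], i.e., x *)
  (FSum | FProd)[f_, args__] :> Apply[f, {args}]}; (* drop the grammar head *)

(* //. reapplies the rules until nothing changes. *)
demoTree //. demoInterp
(* something like x + 7 y; the integer varies from run to run *)
```

ReplaceRepeated is needed rather than a single ReplaceAll because replacing an outer FSum exposes inner heads that are only rewritten on the next pass.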

Here’s an animation that exhibits typeset expressions, their leaf counts, and their parse trees. It keeps track
of the biggest expression found so far, for bragging rights.

In[63]:= $fcontender = 0;
Animate[
  Module[{t, e},
   {t, e} = $fexpr[dummy];
   With[{le = LeafCount[e]},
    If[le > LeafCount @ $fcontender, $fcontender = e];
    Grid[{{e, SpanFromLeft}, {le, t}}, Frame -> All]]],
  {dummy, 1, 25, 1}]

Out[64]= (animation frame: a slider for dummy, a typeset random expression in x, y, and z, its leaf count, 23 in the captured frame, and its parse tree)

In[65]:= Dynamic[$fcontender]

Out[65]= (typeset display of the biggest expression found so far, a sum of rational terms in x, y, and z)
