
CST to AST Tutorial

Fidel Viegas August 18, 2004

Introduction

Prior to SableCC 3.0, the user had no control over the construction of Abstract Syntax Trees. This led to huge trees with many unnecessary nodes. To overcome this problem, SableCC 3.0 introduced a new section: the Abstract Syntax Tree section. In this section, the user designates the types of nodes that may be part of a tree. The tree is still constructed by the parser, but the user now specifies how it is constructed in the Productions section of the grammar specification. This is an advantage, because experienced users can reduce the size of the tree generated by the parser, which reduces memory consumption and, in turn, improves speed.

In this document I will introduce users to the new features found in SableCC 3.0. To be more precise, this tutorial is an extension of a tutorial I published on my web site [2]. Users unfamiliar with SableCC should read that tutorial before going through this one.

With the inclusion of the Abstract Syntax Tree section, this is how the specification file will look (all of these sections are optional):

Package declaration
Tokens declaration
Ignored Tokens declaration
Productions
Abstract Syntax Tree

In the sections that follow, I will take you through the new syntax for Productions and the Abstract Syntax Tree.

What to do before creating the AST section?

Before you create an Abstract Syntax Tree section in your specification file, you should first write the grammar as you would in SableCC prior to version 3.0, so that you have a working parser. Once the parser works, you can start thinking about your AST: look at your grammar, pick out the most important elements, and add them to the AST section. Obvious things to remove are keywords and operators (e.g. mathematical and logical operators).

In this tutorial, we are going to use the SmallPascal grammar as an example. Before describing the new syntax for the Productions section, we describe the Abstract Syntax Tree section and what SableCC generates from it.

Abstract Syntax Tree section

In SableCC prior to version 3.0, we wrote the Productions section as our grammar to generate the parser, and out of that grammar SableCC would also generate the tree node types. To recap, let's have a look at the following Productions section from the grammar found in Appendix B:

Productions
  exp = {plus} exp plus term
      | {minus} exp minus term;

  term = {mult} term mult factor
       | {div} term div factor
       | {factor} factor;

  factor = {number} number
         | {exp} l_par exp r_par;

From this Productions section, SableCC generates the parser for the language, and also generates the types. It generates the types as follows:

- For each production, it generates a type whose name is P followed by the name of the production with its first letter capitalised (e.g. for exp it generates PExp, for term, PTerm, etc.).
- For each alternative, it generates a type whose name is A, followed by the name of the alternative, followed by the name of the production (e.g. for plus it generates APlusExp, for mult, AMultTerm, etc.).

Now, if we replace the name Productions by Abstract Syntax Tree, we have our Abstract Syntax Tree section.
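The two naming rules can be sketched in a few lines of Python. This is only an illustration of the convention described above, not SableCC's own code: the helper names camel, production_type and alternative_type are made up for this example, and the treatment of underscores (so that exp_list would become PExpList) is an assumption about how multi-word names are capitalised.

```python
def camel(name: str) -> str:
    """Capitalise each underscore-separated part: 'exp_list' -> 'ExpList'.

    The underscore handling is an assumption made for this sketch.
    """
    return "".join(part.capitalize() for part in name.split("_"))


def production_type(production: str) -> str:
    """Rule 1: 'P' followed by the capitalised production name."""
    return "P" + camel(production)


def alternative_type(alternative: str, production: str) -> str:
    """Rule 2: 'A', then the alternative name, then the production name."""
    return "A" + camel(alternative) + camel(production)


print(production_type("exp"))            # PExp
print(production_type("term"))           # PTerm
print(alternative_type("plus", "exp"))   # APlusExp
print(alternative_type("mult", "term"))  # AMultTerm
```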

That is all you have to know to create an Abstract Syntax Tree section. The only difference is that this section only generates the types; it doesn't generate the parser. The parser is generated from the grammar in the Productions section. Also, bear in mind that if you include an Abstract Syntax Tree section in your specification file, you have to provide a CST-AST mapping. This is done in the Productions section, which we describe below. If you include an Abstract Syntax Tree section and don't provide a mapping in the Productions section, SableCC will report an error. If your Productions section is exactly the same as your Abstract Syntax Tree section, it won't report an error, as SableCC does the mapping automatically. But this would be the same as omitting the Abstract Syntax Tree section, and wouldn't make much sense, because the whole purpose of this new section is to reduce the size of the trees constructed by the parser by removing the unnecessary nodes. In the following section, I'll introduce you to the new Productions section in SableCC 3.0.

The New Productions Section

The new Productions section has a new syntax, which looks like this:

production {-> prod_transform1 prod_transform2 ... prod_transformN} =
    element1 element2 ... elementN
        {-> alt_transform1 alt_transform2 ... alt_transformN} ;

SableCC 3.0 still accepts the old SableCC 2.0 syntax, which means that you can still generate code from your old grammars.

In this new syntax, we can specify two kinds of transformation: the transformation of the production itself and the transformation of each of its alternatives. As you can see above, there is a set of prod_transforms between {-> and } before the = sign, and another set before the semicolon. The set before the = sign is the production transformation, and the set before the semicolon is the alternative transformation. The {-> token is treated as a single token composed of the { character, the - character and the > character; no spaces are permitted in between. It means "this transforms to"; in programming language parlance, we could say that what follows it is the return variable of a function or method. So, in the case above, production transforms to (or returns) prod_transform1, prod_transform2, up to prod_transformN. That is, the parser will generate a node of type prod_transform1, followed by a node of type prod_transform2, up to a node of type prod_transformN.

There are a couple of rules we must follow for the production to be accepted:

1. The number of prod_transforms in the production transformation has to equal the number of alt_transforms in the alternative transformation. That is, the number of elements in {-> prod_transform1 prod_transform2 ... prod_transformN} has to be the same as the number of elements in {-> alt_transform1 alt_transform2 ... alt_transformN}.

2. The type of each element should match on both sides. That is, prod_transform1 should be the same type as alt_transform1, prod_transform2 the same type as alt_transform2, and so on. This means that prod_transform1 and alt_transform1 are the same type as defined in the Abstract Syntax Tree section, that they are both lists of homogeneous tokens or AST nodes, or that they match the same token.

The best way to understand how this works is to pretend that the production is a function (or method) that returns multiple values. Let's say that production is our function and prod_transform1 prod_transform2 ... prod_transformN are the types of the values it should return. Rewriting this as a function, we would have something like this:
function production returns prod_transform1, prod_transform2, ... prod_transformN

Now, as in a function, we do some processing and then return some values; prod_transform1 ... prod_transformN are the variables that are going to hold those values. To return the values from a function, we would use a return statement, which would look like this:
return alt_transform1, alt_transform2, ... alt_transformN
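To make the analogy concrete, here is a small Python sketch in which a production is literally a function returning several values, and a helper checks the two rules stated earlier (same number of elements, matching types). Everything in it is hypothetical: the Identifier and Statement classes, the check_transform helper and the example production are invented for illustration and are not part of SableCC.

```python
from dataclasses import dataclass


@dataclass
class Identifier:      # stands in for a token type
    text: str


@dataclass
class Statement:       # stands in for an AST node type
    label: str


def check_transform(declared, returned):
    """Mimic the two rules: same count, and matching type position by position."""
    if len(declared) != len(returned):
        raise TypeError(f"expected {len(declared)} values, got {len(returned)}")
    for position, (wanted, value) in enumerate(zip(declared, returned), start=1):
        if not isinstance(value, wanted):
            raise TypeError(f"value {position} should be a {wanted.__name__}")
    return returned


# "function production returns Identifier, Statement"
DECLARED = (Identifier, Statement)


def production(element1: str, element2: str):
    # ... some processing of the elements, then the equivalent of
    # "return alt_transform1, alt_transform2"
    return check_transform(DECLARED, (Identifier(element1), Statement(element2)))


name, body = production("x", "assignment")
print(name, body)
```

Returning only one value, or returning values in the wrong order, makes check_transform raise a TypeError, which is the role SableCC's own error report plays when the two transformations do not agree.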

As you can see, when we return our values, the types of prod_transform1 and alt_transform1 must match, and so on. The same applies when we work with SableCC production transformations.

We now know that the output variables correspond to prod_transform1 ... prod_transformN and that our return values correspond to alt_transform1 ... alt_transformN. We also know that before we return the values, we need to do some processing. The same is true in SableCC. One thing I did not mention when I described the syntax is the elements (element1 element2 ... elementN) that come before the alternative transformation. These elements are the rules that govern the production; that is, they are what goes after the = sign in SableCC prior to version 3.0. They are also the material from which the alternative transformation is built: each alt_transformN is constructed from them.

There are five ways of constructing the alternative transformation: identity, new node, homogeneous list, elimination and empty. If you do not understand the terminology as you go through each of them, don't worry. It will become clear when we go through the SmallPascal grammar in detail. Below is a description of each one with a practical example.

4.1 Identity

This, as its name implies, returns the element itself. The element can be either a token or the return value of a production (remember the function concept above; more on this later). Here is an example. Let's look at the program_heading production of the SmallPascal grammar in Appendix A:

program_heading {-> identifier} =
    T.program identifier semicolon {-> identifier} ;

As you can see, the production transformation returns an identifier, so on the right-hand side we must also return an identifier. That identifier is the one that follows T.program. This is what we call identity: you just place the token itself. Now let's look at an example where we use the return value (remember, we are still thinking in terms of functions):

expression =
    {term} term {-> term.expression}
  | {plus} expression plus term {-> New expression.plus(expression, term.expression)}
  | {minus} expression minus term {-> New expression.minus(expression, term.expression)};

term {-> expression} =
    {factor} factor {-> factor.expression}
  | {mult} term mult factor {-> New expression.mult(term.expression, factor.expression)}
  | {div} term div factor {-> New expression.div(term.expression, factor.expression)};

Look at production expression: it omits the production transformation. That is, there is nothing between the production name and the = sign. This is because there is a production of the same name in the Abstract Syntax Tree section. When this happens, we don't have to provide the production transformation, because SableCC supplies it by default: the production transforms to a tree node of the same name. Now look at production term. It returns an expression, and in alternative term of production expression we have {-> term.expression}: term.expression is the return value of production term. Thinking in terms of functions, we take the return value of term, which in this case is an expression. You only use this dotted notation when there is an explicit return value. As stated before, production expression does not return a value explicitly, so when we place it inside an alternative transformation we just write its name without the dot (.), that is, as its own identity. The subsection below explains how expression is used when creating new nodes.

4.2 New Node

This is used when we want to create a new node. In the productions shown above, it is used in alternatives plus, minus, mult and div. The syntax for this kind of transformation is:

New production[.alternative](arg1, arg2, ..., argN)

Where [.alternative] is an optional part, and arg1 ... argN are what goes in the alternative. production is an Abstract Syntax Tree production and alternative is an alternative of that same production. To understand it better, let's include a production from the Abstract Syntax Tree section:

expression =
    {plus} [left]:expression [right]:expression
  | {minus} [left]:expression [right]:expression
  | {mult} [left]:expression [right]:expression
  | {div} [left]:expression [right]:expression
  | {identifier} identifier
  | {number} number
  | {expression} expression ;

Here we have a production expression, with alternatives plus, minus, mult, div, identifier, number and expression. The arguments of the transformation are what goes after the alternative name (e.g. [left]:expression [right]:expression in the plus alternative). So, if we want to create a node of type APlusExpression, we would do it as follows:

New expression.plus(expression1, expression2)

Where expression1 and expression2 are nodes of type expression. Looking at the grammar in Appendix A, this is an example:
{plus} expression plus term {-> New expression.plus(expression, term.expression)}

As you can see, the arguments to this constructor are expression and term.expression, both of type expression. We construct the new node from the elements of the alternative, that is, expression and term. If we look at production term, we can see that it returns a node of type expression. This is what we pass to the constructor for plus, because that is what it takes as an argument. For expression, we used its identity. This is because expression does not have a production transformation (remember the concept of functions: expression does not explicitly return anything). If, on the other hand, our expression production looked like this:

expression {-> expression} =
    {term} term {-> term.expression}
  | {plus} expression plus term {-> New expression.plus(expression, term.expression)}
  | {minus} expression minus term {-> New expression.minus(expression, term.expression)};

Then, we would have to change our alternative transformation as follows:


{plus} expression plus term {-> New expression.plus(expression.expression, term.expression)}

Now, we are also using the node being returned by expression, which is:
{-> expression}
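Continuing the function analogy, the expression, term and factor productions can be modelled as Python functions that each return an expression node: an identity transformation passes a node through unchanged, and a New transformation constructs a node from the elements of the alternative. The sketch below is a hand-written model, not code generated by SableCC. The class names imitate the A-prefixed naming convention described earlier, the nested-tuple encoding of the parse is an assumption made just for this example, and only the term, plus, factor, mult and number alternatives are covered to keep it short.

```python
from dataclasses import dataclass


@dataclass
class ANumberExpression:
    number: str


@dataclass
class APlusExpression:
    left: object
    right: object


@dataclass
class AMultExpression:
    left: object
    right: object


def expression(parse):
    """expression: no explicit transformation, it yields an expression node."""
    alt, *elements = parse
    if alt == "term":      # {term} term {-> term.expression}
        return term(elements[0])            # identity: pass term's node through
    if alt == "plus":      # {-> New expression.plus(expression, term.expression)}
        left, _plus_token, right = elements  # the plus token is dropped
        return APlusExpression(expression(left), term(right))
    raise ValueError(alt)


def term(parse):
    """term {-> expression}"""
    alt, *elements = parse
    if alt == "factor":    # {factor} factor {-> factor.expression}
        return factor(elements[0])
    if alt == "mult":      # {-> New expression.mult(term.expression, factor.expression)}
        left, _mult_token, right = elements
        return AMultExpression(term(left), factor(right))
    raise ValueError(alt)


def factor(parse):
    """factor {-> expression}"""
    alt, *elements = parse
    if alt == "number":    # {number} number {-> New expression.number(number)}
        return ANumberExpression(elements[0])
    raise ValueError(alt)


# Parse of "1 + 2 * 3", written out by hand as (alternative, element, ...).
one = ("term", ("factor", ("number", "1")))
two_times_three = ("mult", ("factor", ("number", "2")), "*", ("number", "3"))
tree = expression(("plus", one, "+", two_times_three))
print(tree)
```

The resulting tree contains only plus, mult and number nodes; the intermediate term and factor steps and the operator tokens leave no trace in it.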

4.3 Homogeneous List

There are times when a production actually returns a list of homogeneous nodes, or when we are constructing a node that takes a list of homogeneous nodes. In such cases, we append a * character to the element name in the production transformation, and in the alternative transformation we create a list with the list constructor [elm1, elm2, ..., elmN]. Here is an example:

exp_list {-> exp*} =
    exp exp_list_tail* {-> [exp, exp_list_tail.exp]};

exp_list_tail {-> exp} =
    comma exp {-> exp};

In this example, production exp_list returns a list of exp (remember, this exp is from the Abstract Syntax Tree section); note the * appended to exp in {-> exp*}. On the right-hand side, in the alternative transformation, we build the list with the list constructor: [exp, exp_list_tail.exp]. The list includes both exp and exp_list_tail.exp, the latter being the node returned by exp_list_tail, which is an exp. Remember, this is a homogeneous list: the production transformation {-> exp*} says that exp_list returns a list of exp, so the right-hand side has to return a list of exp. If you want to return an empty list, just return [].
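One way to picture the list constructor is as concatenation: in [exp, exp_list_tail.exp], exp contributes one node and exp_list_tail.exp contributes one node for each repetition of exp_list_tail. The Python sketch below models that behaviour. The make_list helper is hypothetical, nodes are represented by plain strings, and the flattening rule is my reading of the example above rather than a statement about how SableCC implements it.

```python
def make_list(*items):
    """Model of [elm1, elm2, ...]: single nodes and lists of nodes are
    merged into one flat list, keeping their order."""
    result = []
    for item in items:
        if isinstance(item, list):
            result.extend(item)   # an element that is already a list (e.g. from x*)
        else:
            result.append(item)   # a single node
    return result


def exp_list_tail(comma, exp):
    """exp_list_tail {-> exp} = comma exp {-> exp}; the comma is dropped."""
    return exp


def exp_list(exp, tails):
    """exp_list {-> exp*} = exp exp_list_tail* {-> [exp, exp_list_tail.exp]}"""
    tail_exps = [exp_list_tail(comma, e) for comma, e in tails]
    return make_list(exp, tail_exps)


print(exp_list("a", [(",", "b"), (",", "c")]))  # ['a', 'b', 'c']
print(exp_list("a", []))                        # ['a']
print(make_list())                              # []
```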

4.4 Elimination Node

There isn't much to say about this one, because I haven't come across a grammar that uses it. Basically, you return Null in place of a node that is optional. Here is an example:

{plus} [exp1]:expression plus [exp2]:expression? {-> New expression.plus(exp1, Null)}

As you can see, we put Null in place of exp2. You can only use the Null operator when your nonterminal or terminal is optional, that is, when it has a ? appended to it.
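For code that later walks the tree, a Null argument simply shows up as a missing child that has to be checked for. A minimal Python sketch of that situation follows, with None playing the role of Null; the PlusExpression class and the describe function are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlusExpression:
    left: str
    right: Optional[str]   # None models Null for the optional element


def describe(node: PlusExpression) -> str:
    # Code that walks the tree has to allow for the missing child.
    if node.right is None:
        return f"plus({node.left}, <nothing>)"
    return f"plus({node.left}, {node.right})"


print(describe(PlusExpression("a", None)))  # plus(a, <nothing>)
print(describe(PlusExpression("a", "b")))   # plus(a, b)
```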

4.5 Empty

Finally, there are times when we don't want a production to produce any node. In such cases, we provide a production transformation and alternative transformations without any elements, as follows:

type {-> } =
    {boolean} boolean {-> }
  | {integer} integer {-> };

When we do this, the parser won't include anything for this production in the tree. This is quite different from the Null operator: Null puts a null in the place of a node, whereas an empty transformation generates nothing at all in the tree.

Now we are going to look at the grammar in Appendix B and explain how we created the Abstract Syntax Tree shown in the grammar in Appendix C.

CST-AST mapping

The first thing to do is create our SableCC specification file from our EBNF grammar. So, for the EBNF grammar:

exp    = exp "+" term | exp "-" term
term   = term "*" factor | term "/" factor | factor
factor = number | "(" exp ")"
number = digit+
digit  = "0" .. "9"

we create the SableCC specification file found in Appendix B. Then, from the grammar found in Appendix B, we create something like this:

Package expression;

Helpers
  digit = ['0' .. '9'];
  tab = 9;
  cr = 13;
  lf = 10;
  eol = cr lf | cr | lf; // this takes care of the different platforms
  blank = (' ' | tab | eol)+;

Tokens
  l_par = '(';
  r_par = ')';
  plus = '+';
  minus = '-';
  mult = '*';
  div = '/';
  comma = ',';
  blank = blank;
  number = digit+;

Ignored Tokens
  blank;

Productions
  exp = {plus} exp plus term
      | {minus} exp minus term;

  term = {mult} term mult factor
       | {div} term div factor
       | {factor} factor;

  factor = {number} number
         | {exp} l_par exp r_par;

Abstract Syntax Tree
  exp = {plus} exp plus term
      | {minus} exp minus term;

  term = {mult} term mult factor
       | {div} term div factor
       | {factor} factor;

  factor = {number} number
         | {exp} l_par exp r_par;

Now that we have our Abstract Syntax Tree, we look for elements that aren't needed in our tree. Looking at the grammar, we can see that we don't need any of the operators, because the type alone tells us whether a node is a plus or a minus: APlusExp, AMinusExp, and so on. These types are sufficient to allow us to process the tree later on. We don't need l_par or r_par either. Removing these terminals gives us an Abstract Syntax Tree as follows:

Abstract Syntax Tree
  exp = {plus} exp term
      | {minus} exp term;

  term = {mult} term factor
       | {div} term factor
       | {factor} factor;

  factor = {number} number
         | {exp} exp ;

Now, if you look at our AST grammar, you can see that alternatives plus and minus look exactly the same, and so do mult and div. In the Productions section this would not be allowed, because it would cause a reduce/reduce conflict. The same does not happen in the AST section, because there we are creating the types, not the parser. So this is valid. We could leave our AST grammar as it is, but usually we write it as a single production. So we get rid of productions term and factor, move everything into production exp, and replace every occurrence of term and factor by exp. Here is our final AST grammar:

Abstract Syntax Tree
  exp = {plus} [exp1]:exp [exp2]:exp
      | {minus} [exp1]:exp [exp2]:exp
      | {mult} [exp1]:exp [exp2]:exp
      | {div} [exp1]:exp [exp2]:exp
      | {number} number
      | {exp} exp ;
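Since the operator tokens are gone, the node type alone tells a tree walker which operation to perform. The Python sketch below models the node types this AST section describes and evaluates a tree by dispatching on the type. The class names imitate the generated names (APlusExp and so on), but the classes and the evaluate function are hand-written assumptions for illustration, and treating div as integer division is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ANumberExp:       # {number} number
    number: str


@dataclass
class AExpExp:          # {exp} exp
    exp: object


@dataclass
class APlusExp:         # {plus} [exp1]:exp [exp2]:exp
    exp1: object
    exp2: object


@dataclass
class AMinusExp:
    exp1: object
    exp2: object


@dataclass
class AMultExp:
    exp1: object
    exp2: object


@dataclass
class ADivExp:
    exp1: object
    exp2: object


def evaluate(node) -> int:
    """Walk the tree; the node type is all we need to pick the operation."""
    if isinstance(node, ANumberExp):
        return int(node.number)
    if isinstance(node, AExpExp):
        return evaluate(node.exp)
    if isinstance(node, APlusExp):
        return evaluate(node.exp1) + evaluate(node.exp2)
    if isinstance(node, AMinusExp):
        return evaluate(node.exp1) - evaluate(node.exp2)
    if isinstance(node, AMultExp):
        return evaluate(node.exp1) * evaluate(node.exp2)
    if isinstance(node, ADivExp):
        return evaluate(node.exp1) // evaluate(node.exp2)  # assumed integer division
    raise TypeError(f"unexpected node: {node!r}")


# (1 + 2) * 3
tree = AMultExp(APlusExp(ANumberExp("1"), ANumberExp("2")), ANumberExp("3"))
print(evaluate(tree))  # 9
```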

Remember that in prior versions of SableCC, when you had more than one occurrence of a nonterminal or a terminal in an alternative, you had to prefix it with [name]:? The same holds true for the AST section. When you have two nonterminals, or two terminals, of the same name, you have to distinguish them by giving each a name, and you do this by prefixing it with [name]: as shown in the grammar above.

Now that we have our AST grammar ready, it is time to map it to our Productions grammar. Let's look at our Productions again:

Productions
  exp = {plus} exp plus term
      | {minus} exp minus term;

  term = {mult} term mult factor
       | {div} term div factor
       | {factor} factor;

  factor = {number} number
         | {exp} l_par exp r_par;

Looking at the grammar above, we can see some resemblance to our new AST section. Let's start from the top of our grammar. We want to map exp to a production in our AST section. The first thing to do is to create the production transformation for each production. Since we only have one type in our AST, all our productions return an exp in their production transformation. The grammar will now look like this:

Productions
  exp {-> exp} =
      {plus} exp plus term
    | {minus} exp minus term;

  term {-> exp} =
      {mult} term mult factor
    | {div} term div factor
    | {factor} factor;

  factor {-> exp} =
      {number} number
    | {exp} l_par exp r_par;

Next, we create the alternative transformations. If we begin with production factor, we know that we have to return a node of type exp. Looking at alternative exp of this production, we can see that we need to return the return value of exp, so we just return its identity. For alternative number, on the other hand, we need to construct a new node, so we need the New operator described earlier in this document. Here is how the grammar looks after we have mapped production factor:

Productions
  exp {-> exp} =
      {plus} exp plus term
    | {minus} exp minus term;

  term {-> exp} =
      {mult} term mult factor
    | {div} term div factor
    | {factor} factor;

  factor {-> exp} =
      {number} number {-> New exp.number(number)}
    | {exp} l_par exp r_par {-> exp.exp} ;

If you look at alternative number, you can see that we used the New operator, whereas in alternative exp we used the identity. Remember when I said that, if you have a production in Productions with the same name as a production in Abstract Syntax Tree, you can omit the production transformation? Well, we do have exp. So, in this case, we can get rid of it. Our grammar will now look like this:

Productions
  exp =
      {plus} exp plus term
    | {minus} exp minus term;

  term {-> exp} =
      {mult} term mult factor
    | {div} term div factor
    | {factor} factor;

  factor {-> exp} =
      {number} number {-> New exp.number(number)}
    | {exp} l_par exp r_par {-> exp} ;

As you can see, in alternative exp of production factor, we now return just exp. This is because we omitted the transformation from production exp itself. That is, exp no longer returns a node explicitly. It still does so implicitly, but you don't use the dot (.) notation. Following the same principle, we do the other transformations, and our final grammar looks like this:

Productions
  exp =
      {plus} exp plus term {-> New exp.plus(exp, term.exp)}
    | {minus} exp minus term {-> New exp.minus(exp, term.exp)};

  term {-> exp} =
      {mult} term mult factor {-> New exp.mult(term.exp, factor.exp)}
    | {div} term div factor {-> New exp.div(term.exp, factor.exp)}
    | {factor} factor {-> factor.exp} ;

  factor {-> exp} =
      {number} number {-> New exp.number(number)}
    | {exp} l_par exp r_par {-> exp} ;

Looking at alternative factor of production term, we can see that we used the identity. In fact, we are using the node returned by production factor: factor.exp, which is the same exp as in factor {-> exp}. The same applies to all the others. If you want a more complex grammar, have a look at the one in Appendix A.

This is just a preliminary version of the tutorial. I am going to refine it so that it makes sense to users and is easy to read. If you have any queries regarding this tutorial, please drop me an email.

All the best,
Fidel.

Appendix A
Package org.sablecc.pascal; // package name

Helpers

  /**
   * Pascal is a case-insensitive language. So, we'll use helpers
   * to simplify our regular expressions. E.g. instead of writing,
   * for instance, end = ('e' | 'E') ('n' | 'N') ('d' | 'D'), we
   * may write: end = e n d, which takes less space and makes the
   * regular expression more readable.
   */
  a = 'a' | 'A' ; // this could also be written as ['a' + 'A']
  b = 'b' | 'B' ; // but I prefer the old lex style
  d = 'd' | 'D' ;
  e = 'e' | 'E' ;
  g = 'g' | 'G' ;
  i = 'i' | 'I' ;
  l = 'l' | 'L' ;
  m = 'm' | 'M' ;
  n = 'n' | 'N' ;
  o = 'o' | 'O' ;
  p = 'p' | 'P' ;
  r = 'r' | 'R' ;
  t = 't' | 'T' ;
  v = 'v' | 'V' ;
  w = 'w' | 'W' ;

  l_curly_bracket = '{' ;
  r_curly_bracket = '}' ;
  ascii_char = [32 .. 127] ;

  // letters and digits
  letter = [['a' .. 'z'] + ['A' .. 'Z']];
  digit = ['0' .. '9'] ;

  // un-printable characters
  tab = 9 ;
  cr = 13 ;
  lf = 10 ;
  blank = ' ' ;

Tokens

  // reserved words
  end = e n d ;
  div = d i v ; // integer division
  var = v a r ;
  begin = b e g i n ;
  program = p r o g r a m ;
  writeln = w r i t e l n ;

  // I prefer to let the parser do the job
  // of tracking the standard type rather
  // than processing it in the semantic phase
  integer = i n t e g e r ;

  // arithmetic symbols
  plus = '+' ;
  minus = '-' ;
  mult = '*' ;
  assignop = ':=' ;

  // symbols separators
  comma = ',' ;
  colon = ':' ;
  semicolon = ';' ;
  dot = '.' ;
  l_paren = '(' ;
  r_paren = ')' ;

  // identifiers
  identifier = letter (letter | digit)* ;

  // numbers
  number = digit+ ; // integer numbers only

  // comments
  comment = l_curly_bracket [ascii_char - [l_curly_bracket + r_curly_bracket]]* r_curly_bracket ;

  // blanks
  blanks = blank | cr lf | cr | lf | tab ;

Ignored Tokens

  comment, blanks ;

Productions

  program =
      program_heading declarations body dot
        {-> New program(program_heading.identifier, [declarations.identifier], [body.statement])};

  program_heading {-> identifier} =
      // program must be prefixed with T. because there is a token and a production with
      // the same name
      T.program identifier semicolon {-> identifier};

  // declarations
  declarations {-> identifier*} =
      variables_declaration? {-> [variables_declaration.identifier]};

  variables_declaration {-> identifier*} =
      var variables_definition_list {-> [variables_definition_list.identifier]};

  variables_definition_list {-> identifier*} =
      {single} variables_definition {-> [variables_definition.identifier]}
    | {multiple} variables_definition_list variables_definition
        {-> [variables_definition_list.identifier, variables_definit

  variables_definition {-> identifier*} =
      identifier_list colon type semicolon {-> [identifier_list.identifier]};

  identifier_list {-> identifier*} =
      {single} identifier {-> [identifier]}
    | {multiple} identifier_list comma identifier {-> [identifier_list.identifier, identifier]};

  type = integer ; // only data type allowed is the integer data type

  // body definition
  body {-> statement*} =
      begin statement_sequence end {-> [statement_sequence.statement]};

  // statements
  statement_sequence {-> statement*} =
      {single} statement {-> [statement]}
    | {multiple} statement_sequence semicolon statement {-> [statement_sequence.statement, statement]};

  statement =
      {writeln} writeln l_paren expression r_paren {-> New statement.writeln(expression)}
    | {assignment} identifier assignop expression {-> New statement.assignment(identifier, expression)}
    | {empty} ;

  // expressions
  expression =
      {term} term {-> term.expression}
    | {plus} expression plus term {-> New expression.plus(expression, term.expression)}
    | {minus} expression minus term {-> New expression.minus(expression, term.expression)};

  term {-> expression} =
      {factor} factor {-> factor.expression}
    | {mult} term mult factor {-> New expression.mult(term.expression, factor.expression)}
    | {div} term div factor {-> New expression.div(term.expression, factor.expression)};

  factor {-> expression} =
      {identifier} identifier {-> New expression.identifier(identifier)}
    | {number} number {-> New expression.number(number)}
    | {expression} l_paren expression r_paren {-> New expression.expression(expression)} ;

Abstract Syntax Tree

  program = identifier identifier* statement* ;

  statement =
      {writeln} expression
    | {assignment} identifier expression
    | {empty} ;

  expression =
      {plus} [left]:expression [right]:expression
    | {minus} [left]:expression [right]:expression
    | {mult} [left]:expression [right]:expression
    | {div} [left]:expression [right]:expression
    | {identifier} identifier
    | {number} number
    | {expression} expression ;

// end of grammar.

Appendix B
Package expression;

Helpers
  digit = ['0' .. '9'];
  tab = 9;
  cr = 13;
  lf = 10;
  eol = cr lf | cr | lf; // this takes care of the different platforms
  blank = (' ' | tab | eol)+;

Tokens
  l_par = '(';
  r_par = ')';
  plus = '+';
  minus = '-';
  mult = '*';
  div = '/';
  comma = ',';
  blank = blank;
  number = digit+;

Ignored Tokens
  blank;

Productions
  exp = {plus} exp plus term
      | {minus} exp minus term;

  term = {mult} term mult factor
       | {div} term div factor
       | {factor} factor;

  factor = {number} number
         | {exp} l_par exp r_par;

Appendix C
Package expression;

Helpers
  digit = ['0' .. '9'];
  tab = 9;
  cr = 13;
  lf = 10;
  eol = cr lf | cr | lf; // this takes care of the different platforms
  blank = (' ' | tab | eol)+;

Tokens
  l_par = '(';
  r_par = ')';
  plus = '+';
  minus = '-';
  mult = '*';
  div = '/';
  comma = ',';
  blank = blank;
  number = digit+;

Ignored Tokens
  blank;

Productions
  exp =
      {plus} exp plus term {-> New exp.plus(exp, term.exp)}
    | {minus} exp minus term {-> New exp.minus(exp, term.exp)};

  term {-> exp} =
      {mult} term mult factor {-> New exp.mult(term.exp, factor.exp)}
    | {div} term div factor {-> New exp.div(term.exp, factor.exp)}
    | {factor} factor {-> factor.exp} ;

  factor {-> exp} =
      {number} number {-> New exp.number(number)}
    | {exp} l_par exp r_par {-> exp} ;

Abstract Syntax Tree
  exp = {plus} [exp1]:exp [exp2]:exp
      | {minus} [exp1]:exp [exp2]:exp
      | {mult} [exp1]:exp [exp2]:exp
      | {div} [exp1]:exp [exp2]:exp
      | {number} number
      | {exp} exp ;

References
[1] Etienne Gagnon, SableCC, An Object-Oriented Compiler Framework, Master's thesis, McGill University, Montreal, Quebec, March 1998.
[2] Fidel Viegas, SableCC Tutorial, 2003. World-Wide Web Page URL: http://www.brainycreatures.co.uk/compiler/sablecc.asp/.
