YACT

Yet Another Compiler Tutorial


By Massimo Pozzoni

Why another tutorial on YACC and parser generators?


Because when I first tried to understand how YACC works, I went crazy trying to find all the
information I needed. The information is available on the net, but I found it very difficult to get a
rational, organized description of the whole picture.
So I tried to write one myself, at least in a form that works for me.
The objective of this work is not to give an exhaustive description of the YACC parser
functionality (or of its associated LEX lexer) but to give a logical overview of the overall
functionality and principles of operation, leaving to the reader the research of further details in
the large amount of documentation already available on the net.

1. Let’s start with the parser


Every computer language has a syntax, made of rules for combining words into sentences that
the language accepts.
The syntax can accept precisely identified words (if, then, else etc.) as well as generic identifiers
(variable1, variable2 etc.), numbers, or other symbols (+, -, etc.).
As in any language, only some combinations of words are accepted, and the rules defining which
combinations are accepted make up the ‘grammar’ of the language.

The grammar can be very complex, as an example we can define that an ‘if statement’ is made
according to the following description:

IFSTAT : ‘if’ CONDITION ‘then’ ACTION ‘else’ ACTION ‘end’

In the above grammar rule, ‘if’, ‘then’ ‘else’ and ‘end’ are identified symbols that do not require
further grammar definitions. We call these ‘terminals’, because they terminate our grammar
analysis.
Instead, CONDITION and ACTION don’t terminate the grammar analysis, but we must search
inside the grammar definition how these grammar terms are defined, in the same way as we have
defined the IFSTAT term.
As an example,

CONDITION : ‘identifier’ ‘<=’ ‘number’
          | ‘identifier’ ‘>=’ ‘number’
          | ‘identifier’ ‘==’ ‘number’

The symbol ‘|’ means ‘or’: a CONDITION can be an identifier that is less than or equal to a
number, or greater than or equal to a number, etc.
We can put whatever we need inside a grammar, using the above notation.
The rules we have defined above (IFSTAT and CONDITION) are also called ‘productions’:
each production has a ‘head’ at the left (IFSTAT, CONDITION) and a ‘body’ at the right of the
colon.
In the body of a production we can find symbols that are terminals and symbols that are not
terminals. Non-terminal symbols are simply defined through other dedicated productions. Let’s
now focus instead on terminal symbols.
Every computer program that we write using the syntax of a computer language is made of words,
numbers, symbols etc. in summary we have a sequence of characters. We need to analyze the
sequence of characters to check if it is compatible with a grammar defined for that syntax: this
operation is performed by a ‘parser’.
On the other hand, if we consider the examples given above, we can see that we have not gone
down to the level of characters, we have supposed that some sequences of characters have already
been recognized in what we have called terminals. A terminal can even be a single character: for
example, we can decide that the character ‘+’ is named ‘plus’ inside the production bodies. Or we
can have sequences of multiple characters that follow precisely identified rules: for example, we
have called ‘number’ a sequence of characters that follows the syntax of a number (we can decide
what this syntax is, and we may have multiple definitions of numbers, as we want).
Basically, we don’t want the parser to analyze the sequence of characters, but we want the parser
to analyze the sequence of high level terms which may be non-terminals, defined in the parser
grammar itself or terminals (‘tokens’) prepared by a previous analysis done on the sequence of
characters. This previous analysis done on the sequence of characters is performed by a ‘lexer’
that transforms the sequence of characters in a sequence of tokens, according to another lower-
level grammar that defines which are the tokens and which sequence of characters can make each
token.
The parser calls the lexer when it needs to receive a new token from the sequence to be checked
and checks if the received sequence of tokens matches the parser grammar.
The lexer performs the first part of the job: it creates the sequence of tokens according to a specified
grammar that defines all the possible tokens accepted by the language syntax.
The parser performs the second part of the job: it receives the sequence of tokens from the lexer
and checks if this sequence matches the grammar specified for the parser.

2. A few words on the lexer


We have said that the lexer provides to the parser the sequence of tokens extracted from the
sequence of characters of the text under analysis.
Each time the parser invokes the lexer, the lexer will provide to the parser the information relative
to the next token in the text under analysis and will move its current position inside the text to the
following token.
Not all the characters of the text must necessarily be part of a token. For example, the grammar
of the lexer may decide that new-line characters will be ignored, that spaces or tabs will be
used just as token separators etc. But a token can even include spaces inside: for example, the
grammar of the lexer can decide that a sequence of characters comprised between two ‘"’
characters will be identified as a single ‘string’ token.
Basically, we can say that a lexer commonly implements 4 operations:
1 Analyzes the text using a lexer grammar that specifies the rules to split the text into a
sequence of tokens.
2 Assigns an identifier to each token (usually a code-number) and provides this number to
the parser to communicate which kind of token is following.
3 Saves and provides to the parser further information, such as the value of the token itself.
For example, if the token is a number, the parser will receive the information that the token
is a number by means of the token code-number, but it may also receive the information of
which number it is. If the token is a generic identifier, the parser may also receive the
sequence of characters that makes up the identifier. We must say that in order to check if
the text satisfies the parser grammar it may be enough to check that the token is an
identifier, but for further operations in the flow, such as the generation of compiled code,
it will also be required to know which identifier it is (see the sketch after this list).
4 For each recognized token the lexer can also perform additional auxiliary operations: for
example, it can save the line number where the token was found or make other kinds of
information available to the parser.
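
As a hedged illustration of operations 2, 3 and 4, a single lex rule could look as follows (the token
name NUMBER, the yylval field numval and the line counter token_line are assumptions used only
for this sketch, not predefined lex names):

[0-9]+ { yylval.numval = atoi(yytext); /* operation 3: pass the token value */
         token_line = current_line;    /* operation 4: auxiliary information */
         return NUMBER; }              /* operation 2: return the code-number */

The syntax of these rules is described in paragraph 4.1.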

The unix environment provides the YACC parser generator that can be used in conjunction with
the LEX lexer generator. In general the parser generated by YACC will call a function yylex()
expecting to receive a code-number identifying the following token. It is possible to define the
code of the yylex() function inside the program embedding the parser or to embed the yylex()
function provided by the unix LEX generator.
In this text we will assume the second case.

3. LEX and YACC together


LEX and YACC are two commands available in the unix environment (now equivalent commands
are also made available in other platforms) to generate the C source code of a lexer and of a parser.
The lexer C code can be embedded in a C program that will call the function yylex() in order to
receive the code-number of the following token.
The parser C code can be embedded in a C program that will call the yyparse() in order to perform
the parser operation. The yyparse() will call a function named yylex() when the parsing algorithm
will need a further token to proceed in its operation.
The lex can be invoked as
lex lexfile.l

where the lexfile.l is a text file specifying the lexer grammar and operations in an appropriate lex
syntax.
The above command will generate the C file lex.yy.c, embedding a function yylex() together with
further data.

The yacc can be invoked as


yacc -d yaccfile.y

where the yaccfile.y is a text file specifying the parser grammar and operations in an appropriate
yacc syntax.
The above command will generate the C file y.tab.c, including the yyparse() function together with
additional data.
The -d option tells yacc to generate also another file named y.tab.h including the definition of the
tokens and of their code-numbers so that the lexer can use the same code convention expected by
the parser. We will have to instruct the lex generator to include the y.tab.h inside the generated C
lexer.
As an alternative we will have to specify a set of code-numbers inside the lexfile.l file and the
same set of codes inside the yaccfile.y file.

In the unix environment, when we compile our C program including the yyparse() function created
by yacc, we need to include the libl.a and liby.a libraries:

cc lex.yy.c y.tab.c otherfiles.c -ll -ly

On platforms different from unix, libl and liby may not always be required.
Recall that the -l option tells the compiler linker the name of the library to be used, with the
assumption that the library name always begins with ‘lib’ (so the prefix ‘lib’ must be removed
in the -l specification) and that the ‘.a’ or ‘.so’ extension is never required in the -l specification.

4. The LEX definition file


The objective of this paragraph is to give an overview of the lex definition file: it is not intended
to be exhaustive, but rather a starting point for further investigations.

The lex definition file consists of 3 sections, separated by the symbols %% to be placed starting
from the first column of the file.

1st section: optional C definitions or declarations and auxiliary lex definitions


%{
Optional C definitions or declarations
%}
Auxiliary lex definitions
%%
2nd section: Lex rules
%%
3rd section: optional C subroutines
The purpose of lex is to generate a C source with a code performing the lexer yylex() function.
The two optional sections give the possibility of adding our own C statements to the generated
source.
Basically, we can add C definitions or declarations (#define…, variables definitions etc.) in the
first section. C definitions or declarations must be included in a subsection comprised between %{
and %}.
A further subsection with lex definitions may follow.
We can add our own C functions, for example we could add a main() function in the last section.

The first C definition subsection will be copied at the beginning of the C code generated by lex.
The second section will be translated by lex into the function yylex(), including also the subsection
of the lex definitions.
The third section will be copied at the end of the C code generated by lex.

Let’s assume for example that in the second or in the third section we want to use a C printf
function. In this case, in the first section we will add the required C header #include <stdio.h>.
We have also said that in order to have lex and yacc to use the same code-numbers for the token
we can run yacc with the option -d to generate the file y.tab.h including the definition of the tokens
and of their code-numbers. This means that in the first section we need to insert the statement
#include “y.tab.h”.

For example, inside the y.tab.h file we can find token code-number definitions like

#define OPEN_BRACKET 258

This gives the code 258 to the OPEN_BRACKET token and we will be able to use the term
OPEN_BRACKET both inside the lex definition file and also inside the yacc grammar rules, where
we can refer to OPEN_BRACKET inside any production body.
The concept is that the definition of the terminals used in a yacc grammar is performed inside the
yacc definition file, but the same list of definitions can be included by lex in order to have the same
definitions.

There is a third important set of definitions that can be included in the first section: the indication
of variables used to send information to yacc.
We have said that the function yylex() will return the code-number associated to the token that has
been recognized. For example, we will instruct lex to return OPEN_BRACKET (which is defined
as the code 258) when the token ‘(‘ is recognized by the lexer.
But we have also said that the token code-number may not be enough to the parser to perform all
its required actions: it is enough to understand if the parser grammar is matched but for further
actions (for example to generate a compiled code) we need to have also further information. For
example, we could need to send to the parser the string of characters that make the token.
The yacc parser assumes that a type YYSTYPE is defined and in case it is not defined by the user
it will be defined inside the y.tab.h file as a C union whose fields will be read by the yacc parser
to receive the additional required information.
The name and the kind of fields of this union will be defined in the yacc definition file.
The variable ‘yylval’ of type YYSTYPE is also defined and the fields of yylval can be written by
the lexer and can be read by yacc for this exchange of information.
Basically, at this point of the description we can say that in the first section of the lex definition
file we can add the statement
extern YYSTYPE yylval;

We are saying that the C code created by lex can have access to the union yylval defined inside
the y.tab.h file and can write the fields of this union to send further information to the yacc parser.
The standard usage of YYSTYPE is as a union type, but it is also possible to define it as a
structure type, in case we would like to send multiple simultaneous pieces of information to the
parser and not just the single piece of information allowed by a union.
Actually, without any kind of definition action yylval will be defined as an integer, but we
assume that in practical usage a definition of YYSTYPE will always be done, and we consider
the definition as a union the standard one.
Said in other terms, yacc will define the fields of a union of type YYSTYPE and an associated
variable yylval of type YYSTYPE. Moreover, yacc will make available mechanisms to read the
fields of this yylval variable for its parsing operations.
At this point of our description we need to know that the lexer is required to fill the fields of this
yylval variable as a way to pass information to the parser.

If the sentence ‘extern YYSTYPE yylval;’ is already inside the included y.tab.h file it will not be
required to repeat the same definition in the first section of the lex definition file, but we have tried
here to give an overview of how the exchange of information between the lexer and the yacc parser
works.
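
As a hedged sketch of this exchange (the field name stringval is an assumption reused from the
examples in this text), the yacc definition file could declare

%union {
    char stringval[256];  /* characters that make the token */
}

and yacc will then emit in y.tab.h something equivalent to

typedef union { char stringval[256]; } YYSTYPE;
extern YYSTYPE yylval;

so that a lex rule can execute strcpy(yylval.stringval, yytext) before returning the token
code-number.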

In case we want to use the yylex() function generated by lex with another parser, not generated by
yacc, we can add in the first section of the lex definition file all the token definitions that we need
and all the variables that we need to exchange information with the parser.

Before starting the description of the second section of the lex definition file, let’s add one further
piece of information.
When the lexer generated by lex recognizes a token, the char* ‘yytext’ (or char yytext[], depending
on the lex implementation) will contain the sequence of recognized characters that make the
token, and the int variable yyleng will contain the number of recognized characters. These two
variables are defined inside the C source code generated by lex and can be accessed inside the
lex definition file (or inside any other C module, by declaring these variables as external).
The lexer generated by lex will read the text to be scanned from the variable FILE *yyin which is
set to stdin as default but can be reassigned to a user file before calling the yylex() function.
In the same way, the output will be sent to FILE *yyout, set as default to stdout.
There may be further variables or functions defined by lex that can be used in the exchange of data
between the lex generated code and additional external code, but we limit the description to the
above to avoid an excess of details.

4.1. Lex rules


We have said that the rules applied by the lexer generated by lex to identify the tokens need to be
listed in the second section of the lex definition file.
The rules must start in the 1st column (no space before the rule) and consist of two parts:
• the definition of the character pattern to be matched
• the definition of C code to be run when that pattern is matched

As a simple example, a rule can be the following:

"(" { strcpy(yylval.stringval,yytext); return OPEN_BRACKET; }

In the above we can see the sequence of characters to be matched within double quotes (in this
case just one character, the open round bracket) followed by a sequence of two C instructions,
comprised between curly braces.
The last C instruction instructs the lexer to return the defined token code-number associated to the
specific token.
The first C instruction copies the recognized sequence of characters into the field stringval of the
union yylval that has been defined by yacc.
In this case we assume the union YYSTYPE has been defined as having a field char stringval[..].
Every time the lexer recognizes the defined token, the associated sequence of C commands will
be executed.
In summary, in the second section of the lex definition file we place a list of rules, each made of
the token match pattern followed by the list of associated C instructions.
In the above example the match pattern to identify a token is defined just as a sequence of
characters comprised between double quotes. This is just one of the possible methods. The second
and more powerful method is by means of regular expressions.

Defining a token by means of a regular expression opens a very powerful method to define the
token. Without any intention to give here any explanation on how regular expressions work, we
can just give below some of the most useful matching rules.

Regular expression   Usage

.                    Any character except newline
\n                   Newline
^                    Beginning of line
$                    End of line
*                    Zero or more copies of the preceding expression
+                    One or more copies of the preceding expression
?                    Zero or one copy of the preceding expression
b|c                  Single character b or c
(cd)*                Zero or more copies of the sequence cd. Round brackets
                     can identify a portion of a regular expression as a
                     grouped expression to which the metacharacters *, + etc.
                     can be applied
[a-zA-Z0-9]          [] are used to identify a class of characters. In this
                     case, whatever character is in the defined ranges
                     (ranges are defined by using the symbol - separating
                     the first and the last character of the range)
[a\-b]               One of the characters a, -, b. In this case the escape \
                     has been used to identify the literal - character, not
                     to be used as a range identifier
[-cd]                One of the characters -, c, d
[ \n\t]              Space or newline or tab
[^cd]                Any character except c or d
[c^d]                One of the characters c, ^, d
[c|d]+               One of the characters c, |, d, repeated one or more times

In the usage of regular expressions to define lex rules, we need to consider two additional rules
that lex will apply:
• If more than one rule matches, lex chooses the rule matching the longer sequence of characters.
• If more rules match the same number of characters, lex chooses the rule that occurs
earliest in the list.

This means that we can add a rule


. {; }

at the end of the list as a way of skipping any character that has not been identified as a part of any
regular expression. We don’t return any token to the parser.
Or we can identify sequences to be skipped, as
[ \t\r] {;}
to skip white spaces, tabs or carriage return characters

The above are just examples.
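
A hedged example of the effect of these two rules (the token names IF and ID are assumptions) is
the classic case of a keyword versus a generic identifier:

"if"    { return IF; }
[a-z]+  { return ID; }

For the input ‘if’ both rules match two characters, so the earliest rule wins and IF is returned; for
the input ‘iffy’ the second rule matches more characters, so ID is returned.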

There are still a couple of concepts that can be listed.


The first is that whatever is present in the second section of the lex definition file but not starting
from the first column will be copied in the C generated file as it is.
For example, we can place comments in C syntax /* comment */, but they must start from the
second column.
The second is that it is possible to define a portion of a regular expression in the auxiliary lex
definition subsection following the C definitions and declarations.
As an example, we can define

letters [a-zA-Z]

and use {letters} inside any regular expression.


Substitutions like these can simplify the complexity of regular expression definitions.
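
Putting the pieces of this paragraph together, a minimal lex definition file could look like the
following sketch (the token names NUMBER and ID and the yylval field stringval are assumptions
carried over from the previous examples, supposed to be defined in the yacc definition file):

%{
#include <string.h>       /* strcpy */
#include "y.tab.h"        /* token code-numbers generated by yacc -d */
extern YYSTYPE yylval;    /* union shared with the parser */
%}
letters [a-zA-Z]
digits  [0-9]
%%
"("         { return OPEN_BRACKET; }
{digits}+   { strcpy(yylval.stringval, yytext); return NUMBER; }
{letters}+  { strcpy(yylval.stringval, yytext); return ID; }
[ \t\r\n]   { ; }   /* skip white spaces and new lines */
.           { ; }   /* skip any other character */
%%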

4.2. A simple test


A simple test of the lexer without involving any parser could be done by associating to any
recognized token a print instruction as

{ printf("%s\n", yytext); return TOKENCODE; }

where TOKENCODE is defined as a function of the token.

A simple main can be written as:

#include <stdio.h>

extern FILE *yyin;       /* input stream read by the lexer */
extern int yylex(void);  /* lexer function generated by lex */

int main()
{
    int lex_retval;
    FILE *fp;

    fp = fopen("filename.txt", "r");
    yyin = fp;

    lex_retval = yylex();
    while (lex_retval != 0) /* lex returns the value 0 at the end of file */
    {
        printf("Returned Code: %d\n", lex_retval);
        lex_retval = yylex();
    }
    fclose(fp);

    return 0;
}
5. The Yacc definition file
The yacc definition file follows the same structure as the lex definition file. It consists of 3 sections,
separated by the symbols %% to be placed starting from the first column of the file.

1st section: optional C definitions or declarations and yacc declarations


%{
Optional C definitions or declarations
%}
Yacc declarations
%%
2nd section: Yacc grammar rules
%%
3rd section: optional C subroutines

The optional C definitions or declarations section contains C instructions that will be copied at the
beginning of the C source created by yacc.
For example, we can add
extern int yylex(void);
to refer to the external yylex function created by the lexer.

The following yacc declaration section makes use of specific yacc directives that begin with the
symbol %.
Let’s consider three commonly used directives: %union, %token and %type.

The directive %union is used to define the YYSTYPE union that is used by the lexer to exchange
information with the parser through the variable yylval declared of this type.
The syntax is the following:
%union {
C-like field definition;
C-like field definition;
…..
}
The fields of the union are defined as inside a C union definition. As an example, char
stringval[256]; can define an array of characters where the lexer can write the sequence of
characters that makes the token.
On the other hand, there is a second very important usage of the fields of the yylval union: the
fields are used not only to store values returned by the lexer but also to store values returned by
the parser at each step of the parsing sequence. In fact, the lexer can be seen as just the first step
of the parsing chain, the step that recognizes terminal symbols, but any production inside the parser
grammar recognizes a non-terminal symbol (the production head) by means of terminals and non-
terminals recognized inside the productions that make the parser grammar.
For this reason, each production, when recognizing the non-terminal symbol of its head, will have
the capability of returning a value to the productions that use the same non-terminal symbol in
their bodies. This returned value can be stored in one of the fields of the union.
This exchange of data from the bottom to the top (yacc is a bottom-up parser) is commonly used
to build some intermediate format, usually a tree, that will be used to proceed towards further
translations of the parsed sentence into its final target. For example, an expression that adds two
numbers can be converted into a tree structure where the numbers are the leaves and the summation
operator is the parent node. And this node can in turn become the leaf of a higher node of the tree.
In this way the tree can be built as the parser progresses in the syntax recognition. It is easy to
think that the yylval field dedicated to this purpose can be a pointer to the tree node associated to
the non-terminal symbol that is the head of a production: each production can create its node, make
it point to the nodes associated to the symbols inside the production body, and pass the pointer
to its node to the other productions using its symbol inside their production bodies.
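
As a hedged sketch of this technique (the node type and the field names are assumptions, not
something imposed by yacc), the first section of the yacc definition file could contain:

%{
typedef struct node {
    int op;                     /* operator code, e.g. '+' */
    double value;               /* used when the node is a number leaf */
    struct node *left, *right;  /* subtrees built by lower productions */
} node;
%}
%union {
    double numval;       /* value written by the lexer for a number token */
    struct node *nptr;   /* tree node returned by each production */
}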

The directive %token is used to declare tokens that can be used both by the lexer and the parser as
terminal symbols.
As an example
%token ID
will create a #define ID IDNUM
where IDNUM is a number assigned by yacc to the token.
All the defines created by the %token directive will be placed inside y.tab.h, to be included in
the lex definition file. Tokens defined by the %token directive need not all be returned by the
lexer: the lexer can return a subset of the defined tokens. The %token directive just allows the
definition of a set of code-numbers associated to the token labels.

A second usage of the %token directive allows specifying which field of the yylval union is
associated to the value returned by the lexer. As an example, if the field name is stringval, as
already used in the above examples, we will write:

%token <stringval> ID

It is not required to specify the associated field for all the tokens returned by the lexer to the parser.
It is mandatory to specify the field associated to a token only if the parser wants to read the value
passed by the lexer when the lexer recognizes that token. If the parser needs to read a returned
value from the lexer, the parser must know in which field of the yylval union the returned value is
stored. Different tokens can be associated to different fields, if needed.

Finally, the third directive in this analysis is the %type.


With the syntax
%type <fieldname> head
we need to declare all the production heads used in the parser grammar and assign a yylval field
to each.
In this way all the symbols used inside the production bodies, terminals or non-terminals, will have
an associated field of the yylval union (as said above, terminals may even have no associated
field, but only in case the parser will never need to read their value).
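
Continuing the sketch given for the %union directive (all the names are assumptions), the
declaration section could then contain:

%token <numval> NUMBER    /* value written by the lexer into yylval.numval */
%token PLUS MULT          /* tokens with no associated field */
%type  <nptr> expr term   /* each production head returns a tree node */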

5.1. Yacc rules


The second section of the yacc definition file contains the list of the production rules that make
up the parser grammar.
The rules must start in the 1st column (no space before the rule) and consist of two parts:
• the production rule
• the definition of C code to be run when the production is matched while the parser proceeds

Each production rule is expressed in the form

HEAD1 : BA1 BA2 …. BAn { C instructions for the 1st body }


| BB1 BB2 …. BBn { C instructions for the 2nd body }
| ….
;
HEAD2 : …..
;

All the head symbols need to be declared with the %type directive, as well as body non-terminal
symbols, while body terminal symbols need to be declared with the %token directive.
Multiple production bodies associated to the same head can be separated by the symbol |.
Even an empty body is allowed, just by not indicating any symbol in the body.
The symbol ; closes the definition of the production.

Each production can be associated to C instructions that will be executed when the parser matches
the body. These C instructions can make use of the yylval fields associated to each of the symbols
inside the body: the value associated to the 1st symbol can be accessed through the label $1, the
value associated to the 2nd symbol through $2 etc. The C type of the values corresponds to the
yylval field associated to the symbol in the %token and %type directives.
Finally, the sequence of C instructions must calculate the return value for the head, assigning it
to the label $$.

We can assume that the C instructions associated to the grammar productions are also used to
generate the output of the parser itself.
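
As a hedged example combining the declarations sketched in the previous paragraph (mknode and
mkleaf are hypothetical helpers that allocate tree nodes, not yacc facilities):

expr : expr PLUS term   { $$ = mknode('+', $1, $3); }
     | term             { $$ = $1; }
     ;
term : term MULT NUMBER { $$ = mknode('*', $1, mkleaf($3)); }
     | NUMBER           { $$ = mkleaf($1); }
     ;

Here $1 and $3 carry the fields declared for the corresponding body symbols (nptr or numval),
while $$ assigns the nptr field of the head.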

6. About parser functionality


We can start our discussion on the parser functionality considering the simple grammar below:
G : E
E : F
  | E + F
F : id
  | F * id

We can note the following about the above grammar:


1) E and F (which may represent expression and factor) have 2 alternatives each. On the other
hand, the entry point of the grammar is unique. In general, we can always have a single
entry point of the grammar. For example, if we had G : F | F+id we could have added the
symbol F’ to write G : F’; F’ : F | F+id.
So, we can assume the entry point of our grammars is unique.
2) We have 3 terminals (tokens) in our grammar: id, +, *. And we have 2 non-terminals: E, F.
3) The grammar is Left-Recursive: E is defined as a function of E, at the left of the grammar
sentence. F is defined as a function of F, at the left of the production.
Left recursion may have implications on some parsers.
Let’s consider a parser that starts the analysis of the sentence from the left side. A
wide class of parsers consists of the so-called Recursive Descent parsers, which means that
with a top-down approach they check a production by recursively replacing the non-
terminal symbols of the production body with the productions of the same symbols (starting
from the left symbol) until a terminal is found, which has to match the incoming token.
If we consider, as an example, the production E : E + F, replacing the first symbol E with
its productions means applying the same production, which leads to an infinite recursion,
never reaching a terminal symbol. This is a consequence of the fact that the production has
a Left Recursion.
Fortunately, Left Recursion can be automatically removed by simple grammar
manipulations (A:AB|C is equivalent to A:CA’ ; A’:BA’|e, where e means that we accept
the symbol being empty. A,B,C are three grammar symbols, as well as A’).
On the other hand, the parser implemented by yacc does not belong to this kind of parsers,
so we accept our grammar being left recursive.

We have said that in a recursive parser, which isn’t our yacc case, the parser checks if a
production can replace its symbols with the productions of the same symbols to
progressively ‘derive’ the sentence under analysis, starting from the left. Then, the analysis
proceeds towards a further portion of the sentence, checking if this matches a more
complex production, including the first production that has already been matched. On the
other hand, in case of a mismatch, the parser may be required to come back at the initial
decision to check another production. This means that what already decided is only
temporary and it can be confirmed only at the end of the parsing, with the possibility of
doing ‘backtracking’ to change previous decisions.
This backtracking need may be another bottleneck of parsers: if we want to construct some
intermediate code as the parser proceeds, all the intermediate code we are building will be
thrown away in case the parser backtracks.
The yacc parser belongs to a class of ‘predictive’ parsers, which means that backtracking
is not accepted. If the parser has reached one point this means that no further token that
will arrive can be in conflict with what has been already checked, because for each
incoming token the production to be further checked is predicted from the token itself. Said
in other terms, the incoming token may be compatible or not compatible with what has
been already decided. If it is not compatible, this is an error: the sentence under analysis is
not matching the grammar.
Of course, not all grammars are compatible with this kind of property: without going
into details we can say that we may talk of LL grammars (Scan from Left, produce Left-most
‘derivation’) for recursive top-down parsers and of LR grammars (Scan from Left, produce
Right-most ‘derivation’) for bottom-up parsers like yacc.
In an LL grammar, not only do we want to scan the text from the left, but we also want to
‘derive’ the sentence under analysis by expanding symbols in a production body with
productions of the same symbols, starting from left symbols and moving to right symbols.
Similarly, in an LR grammar we start from right symbols.
In a bottom-up approach, we can say that all the terminals of the sentence under analysis
(the sentence is only made by terminals) can be progressively ‘reduced’ to higher level
grammar rules up to the start symbol of the grammar.
In the above, we have used two words: derivation and reduction, which require some more
discussion.

Let’s consider the sentence: S = id + id*id*id + id


Can this sentence be ‘derived’ from the grammar?

Derivation of a sentence from a grammar is a top-down approach where we start from the
entry point of the grammar and make replacements of symbols with productions of the
symbols, to arrive at the final sentence.

Applied production      Result

G : E                   G = E
E : E + F               G = E + F
F : id                  G = E + id
E : E + F               G = E + F + id
F : F * id              G = E + F * id + id
F : F * id              G = E + F * id * id + id
F : id                  G = E + id * id * id + id
E : F                   G = F + id * id * id + id
F : id                  G = id + id * id * id + id

The above table can be better represented by a tree.

The derivation tree puts in evidence the sequence of derivations for each token of the
sentence, independently from the order of the derivation.
Grammars which are not well built may be ‘ambiguous’, which means that there may be
multiple derivation trees for a single sentence.

Let’s now follow a bottom-up approach: we push the incoming tokens into a stack until the
top of the stack matches a production body. At this point the production body is replaced
by the production head (reduction).

Current lookahead token   Stack        Production used

id                        id
+                         F            F : id
+                         E            E : F
+                         E +
id                        E + id
*                         E + F        F : id
*                         E + F *
id                        E + F * id
*                         E + F        F : F * id
*                         E + F *
id                        E + F * id
+                         E + F        F : F * id
+                         E            E : E + F
+                         E +
id                        E + id
                          E + F        F : id
                          E            E : E + F
                          G            G : E

We can make the following comments:


1) In each line we have done one of two possible operations:
a. Either we have shifted the lookahead token into the stack (SHIFT)
b. Or we have reduced the top of the stack using a production (REDUCE)
2) We have looked at one lookahead token to decide which of the above 2 actions (shift
or reduce) to apply.
3) Reduction has always been done at the top of the stack

We need to say that, although the above bottom-up parsing method is easy to understand, the
rules that have been applied to decide between shift and reduce may not be straightforward. That’s
why tools like yacc implement an algorithmic procedure to extract these rules from the grammar
and build the parser state machine.

Let’s start from the entry point of the grammar: G : E. We can say that at the beginning our status
of analysis is placed before the symbol E. We indicate this by placing a dot before the part of the
grammar body that still has to be recognized.
G : . E

When the parser state machine is in this initial state, it can expect only an expression E to come.
If it is not an expression E the incoming sentence is not matching the grammar.
But what does it mean in terms of the incoming token?
Let’s write a list of all the productions of the symbol that is just after the dot.
E : . F
E : . E + F

We are saying that the parser can expect a factor F as well as another expression E.
But we can go ahead: As the symbol F is just after the dot, the parser may expect also
F : . id
F : . F * id
In this analysis we break any possible recursion by checking that a production is not added more
than once.

We can say that from this initial state we can evolve towards 3 possible further states, as a function
of what the incoming symbol will be:
• 2a) If the incoming symbol is the terminal id, the state machine can move to a new state
corresponding to F : id .
• 2b) If the incoming symbol is E, the state machine can move to a state corresponding to
the productions
  G : E .
  E : E . + F
• 2c) If the incoming symbol is F, the state machine can move to a state corresponding to the
productions
  E : F .
  F : F . * id
In these three new states the dot has moved one step to the right, because now we know that the
previous symbol has been recognized.

Let’s suppose we are in state 1 and receive id as the lookahead token. As the token id is one of
the expected terminal tokens in this state, the parser knows that it has to be shifted onto the stack
and that the state machine has to move into state 2a.
In state 2a there are no more expected tokens, as the dot is at the end of the production. So the
only thing that the parser can do is to reduce using the rule F : id associated to the state.
At this point the parser returns to the incoming state 1, knowing that the last grammar symbol was
F (a reduction to F has occurred).
But in state 1 the parser knows that for a grammar symbol F it must evolve towards state 2c.

Basically, there are two different kind of transitions from one state to another:
• If the transition is driven by a terminal (id, in our example), it happens when the next token
matches the terminal
• If the transition is driven by a non-terminal (E, F in our example), it happens when the
parser comes back from another state where a reduction to that non-terminal symbol has
happened
If the incoming token doesn’t match any transition driven by a terminal symbol, this means that
the state will have to make a reduction or that the sentence does not match the grammar.

Let’s now suppose we are in state 2b): G : E . ; E : E . + F

We can apply the same addition of productions associated to the symbol that is at the right of the
dot. In this case the symbol is +, which is a terminal symbol, so we don’t have further
productions to be added to this state 2b). We will just create a state 3b) associated to the production
E : E + . F. In this state the dot has moved after the plus and the symbol after the dot is F, whose
productions will be added to this state 3b).

Let’s now suppose we are in state 2c): E : F . ; F : F . * id

Even in this case there are no non-terminal symbols after the dot, so we don’t need to add further
productions.
In case the next token is *, we will evolve to a state 3c): F : F * . id where the dot has moved
before the id.
In the other cases the reduction E : F will be applied and the state machine will come back to state
1 with the information that the non-terminal symbol is E (a reduction to E has occurred). At this
point the state machine will go to state 2b). In state 2b, either the next token will be a +, or
the reduction G : E will be applied and the state machine will go back to state 1, where the
parsing will now be completed.

The state machine is shown in the figure below.
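
In text form, the same state machine can be sketched as follows (this follows the simplified
construction used in this paragraph; the complete machine built by yacc contains a few more states
for the completed bodies E + F and F * id):

state 1:  G : . E ; E : . F ; E : . E + F ; F : . id ; F : . F * id
          on id: shift, go to 2a; after a reduction to E: go to 2b;
          after a reduction to F: go to 2c
state 2a: F : id .
          reduce using F : id
state 2b: G : E . ; E : E . + F
          on +: shift, go to 3b; otherwise reduce using G : E (parsing done)
state 2c: E : F . ; F : F . * id
          on *: shift, go to 3c; otherwise reduce using E : F
state 3b: E : E + . F ; plus the productions of F added by the closure
          on id: shift, go to 2a; after a reduction to F the body E + F is complete
state 3c: F : F * . id
          on id: shift, then reduce using F : F * id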


In the above example, we have said that if the transition from one state to another is driven by
a terminal symbol, the transition will happen when the lookahead token matches that terminal
symbol. This corresponds to a shift of the token onto the stack.
We have also said that in case the incoming token doesn’t match any transition driven by a terminal
symbol, a reduction will happen.
In summary, we can say that shift has precedence over reduction: when both shift and
reduce would be possible (a shift-reduce conflict), the shift has precedence.
This is a good property for grammars that are built from the left side and may be left recursive,
because symbols are reduced only when there is no possibility of a larger match
including further incoming tokens.
On the other hand, grammars may also give reduce-reduce conflicts. In this case yacc makes
a parser anyway, giving priority to the productions according to the order in which they are listed.

Let’s now spend some words about the usage of the stack.
We have said that when the parser is in a certain state and the incoming token matches one of
the expected tokens, then the token is shifted onto the stack and the parser moves into a new state.
Actually, as the change of state corresponds to the advance by one symbol in the sequence of
recognized terminal and non-terminal symbols, what is pushed onto the stack is the code-
number of the state of the parser. When the parser moves from state a into state b, state a is pushed
onto the stack. When a reduction takes place, virtually removing n symbols from the stack, this
means that n states can be removed from the stack, thus leaving at the top of the stack the state
where the parser has to come back after the reduction.

The state machine created by yacc can be made visible by using the options --debug --verbose and
by adding the assignment int yydebug = 1; in the yacc definition file.
With the above, yacc generates the file y.output listing the state machine and prints to its output
all the steps done in the parsing process.

Inside each state of the state machine it is possible to find:


• The rules associated to the state, with the dot representing the status of recognition of the
rule
• The incoming tokens that will give a shift, with the new associated target states
• The possible reduction, and its reduction rule
o The yacc parser state machine can be more advanced than the simple description
given above and multiple reductions can be associated to multiple incoming tokens.
That’s because the parser can analyze the grammar to calculate the set of possible
tokens ‘following’ any symbol that can be reduced inside a production body.
• The indication of which state to go to, as a function of the symbol that has been reduced
by the state coming back to the current state

7. Error management
In case of error yacc calls a user defined function int yyerror(char *errormsg) that the user can
define in any of the C sections of the yacc definition file or in an external C module (by defining
the function as external at the beginning of the yacc definition file).
There is a pre-defined token ‘error’ that can be used as a regular token in a production rule, most
likely at the entry point of the grammar.
A typical entry point of the grammar may define multiple bodies, including
• An empty body
• The real entry point of the grammar
• An error management body.

As an example of entry point:


Entry :
      | Entry Real_Entry { ; }
      | Entry error { ; }
      ;
Having a production that recognizes the error token allows that rule to be successfully matched in
case an error occurs, which means that the parser finds a match anyway.
By default, the parser stays in an error state until three tokens have been successfully shifted.
It may be useful instead to have better control of the state of the parser when an error occurs, by
calling the predefined macro yyerrok as a C action of the rule matching the error. In this case no
tokens following the error are discarded and the current lookahead token (if present) is reanalyzed
immediately.
It may be desired to remove this lookahead token from the further parser analysis. In fact, if the
parser is giving an error we can think of two cases:
• The lookahead token may not be compatible with the previous tokens already shifted. In
this case we could keep this token as the first one of the following token sequence to be
analyzed.
• The lookahead token may not be compatible with being the first token to be shifted. In this
case, if the token is not cleared the error will keep presenting itself forever.

The predefined macro yyclearin allows the removal of the lookahead token.
On the lookahead token, we can add that when it is available, it can be read from the variable int
yychar. If it is not available, the same variable will take the value YYEMPTY or YYEOF.
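
In practice the two macros are invoked in the C action of the production matching the error token;
a hedged variation of the entry point shown above:

Entry :
      | Entry Real_Entry { ; }
      | Entry error { yyerrok;   /* leave the error state immediately */
                      yyclearin; /* discard the offending lookahead token */ }
      ;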

The last token read by lex may or may not be a lookahead token, according to whether it has
already been shifted onto the stack when a reduction occurs. In any case, the character
sequence of the last token may be read from the lex variable yytext (which may be defined as char
yytext[] or char *yytext).

8. Conflicts
One of the most delicate aspects of a grammar can be the ambiguity rising from the precedence
we would like to assign to the tokens.
Let’s consider, as an example, the simple grammar:
G : S
S : if c S
  | if c S else S
  | a

Where if, c, else and a are terminal symbols.


The sentence
if c if c a else a
Can be read as:
if c (if c a else a)
or as
if c (if c a) else a

which means that the else can be assigned to the first or to the second if.
If we make the yacc state machine for the above grammar, we can understand that at a certain point
we will arrive into a state with two possible productions:
• S: if c S .
• S: if c S . else S

The yacc default functionality will check if the following token is else and, in this case, it will shift
it into the stack, thus solving the ambiguity in favor of the assignment of the else to the final if.

In case this behavior is not the desired one we should tell yacc to behave in a different way, by
means of defining precedence and associativity of tokens.
Tokens can receive a precedence and the last terminal token inside a production gives the same
precedence to the production itself.
At this point, in a shift-reduce conflict, the precedence of the production subject to reduction is
compared to the precedence of the incoming token and the action associated to the higher
precedence is performed.
In case the precedence is the same, associativity is checked.
Tokens may have left or right associativity. The default behavior giving precedence to shift means
that the incoming token is managed as having a right associativity. On the other hand, specifying
a left associativity for the incoming token will modify the yacc behavior in favor of reduction.
In case the tokens are defined as non-associative, a conflict will give an error.

The %left, %right and %nonassoc directives can be used to define associativity in the yacc
declaration section. These directives can be used as an alternative to the %token declaration.
The order of the associativity directives defines the precedence, from lower to higher
precedence.
%left PLUS MINUS
%left MULT DIV
Will give priority to MULT and DIV tokens vs. PLUS and MINUS tokens.

In case a production does not include any token from which it may take its precedence, it is possible
to assign to the production the precedence of a token, for example a dummy token whose
precedence can be assigned by the %nonassoc directive.
At the end of the production body, typically just before the C action, it is possible to add a
%prec directive followed by the name of the token whose precedence will be given to the
production body.
In this way, by comparing the precedence of this token with the precedence of the shift token,
it is possible to avoid the assertion of shift-reduce conflict warnings.
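
The classic example is unary minus; a hedged fragment (UMINUS is a dummy token introduced
only to carry a precedence, and integer semantic values are assumed):

%left PLUS MINUS
%left MULT DIV
%nonassoc UMINUS
%%
expr : expr MINUS expr         { $$ = $1 - $3; }
     | MINUS expr %prec UMINUS { $$ = -$2; }
     ;

Here the production MINUS expr takes the (higher) precedence of UMINUS instead of the
precedence of the MINUS token, so a following MINUS or MULT token will not be shifted before
the reduction.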

We can say that precedence and associativity can help in removing conflicts coming from grammar
ambiguities. For example, if we define
E : E PLUS E
  | NUMBER
we have clearly created an ambiguity in the usage of the PLUS operator: a shift-reduce warning
will be asserted, but the parser will anyway perform in a predictable way, giving precedence to
shift.
On the other hand, by defining:
E : NUMBER
  | E PLUS NUMBER
the expression is built from the left, without the need of defining the associativity of the PLUS
operator and without conflicts.
9. Further on parsing theory and techniques
In the above paragraphs we have talked about yacc, but we have also said that another important
family of parsers consists of the top-down parsers, like recursive descent parsers.
On the other hand, we would like our parsers to be predictive, to avoid backtracking.
Having analyzed the functionality of the stack in the case of a bottom-up parser like yacc, we can
easily understand the similar functionality of the stack in the case of a top-down parser:
• In a bottom-up parser we fill the stack with the terminal symbols read from the input. When
the symbols at the top of the stack match a production body, they are replaced by the
production head.
• In a top-down parser, we start by pushing onto the stack the head of the entry production
of our grammar, and then we can replace the top symbol of the stack with production
bodies, pushing their symbols in reverse order (left symbols will always be at the top of
the stack). For each lookahead symbol, if the top of the stack is a terminal symbol
matching the lookahead symbol, it can be removed from the stack. If it is a non-terminal
symbol, we can replace it by a production of that non-terminal symbol, choosing a
production that can derive the lookahead as the first symbol (see the sketch below).

The above explicit usage of a stack (instead of having it implicit by recursion) allows the
construction of a non-recursive top-down parser.
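
As a minimal runnable sketch of this mechanism (the toy grammar S : ‘a’ S | ‘b’ and the
hard-coded input are assumptions chosen only for illustration):

#include <stdio.h>

#define MAXSTACK 100
#define NONTERM_S 'S' /* the only non-terminal of the toy grammar */

int main()
{
    const char *p = "aaab"; /* sentence to be checked */
    char stack[MAXSTACK];
    int top = 0;

    stack[top++] = NONTERM_S; /* push the entry point of the grammar */

    while (top > 0) {
        char x = stack[--top]; /* symbol at the top of the stack */
        if (x == NONTERM_S) {
            /* choose the production predicted by the lookahead symbol */
            if (*p == 'a') { /* S : a S, body pushed in reverse order */
                stack[top++] = NONTERM_S;
                stack[top++] = 'a';
            } else if (*p == 'b') { /* S : b */
                stack[top++] = 'b';
            } else {
                printf("error: unexpected character\n");
                return 1;
            }
        } else if (*p == x) { /* terminal: must match the lookahead */
            p++;
        } else {
            printf("error: expected '%c'\n", x);
            return 1;
        }
    }
    printf(*p == '\0' ? "accepted\n" : "error: trailing input\n");
    return 0;
}

Since for every non-terminal and lookahead symbol there is exactly one applicable production,
the parser never backtracks.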

The requirement for a top-down parser to be predictive is that, given the lookahead symbol and
given the non-terminal symbol at the top of the stack, we have just one possible production of the
non-terminal symbol that can derive strings having the lookahead symbol as the ‘first’ symbol.
Otherwise we should try multiple productions.

Calling a grammar satisfying the above an LL grammar, we can summarize by saying that in an
LL grammar:
• If a symbol has multiple productions, they must derive strings starting with different
terminals. This is valid also in case the empty string can be derived: only one production
can derive the empty string.

On the other hand, if one of the productions of a symbol derives an empty string, given a lookahead
symbol we cannot know if we must choose the production that may derive an empty string (which
means that the lookahead symbol will be associated to the following symbol inside the stack) or if
we must choose another production that may generate that symbol.
This means that in an LL grammar:
• If a symbol has multiple productions and one of them can derive an empty string,
all the other productions must derive strings not starting with a terminal symbol that can
‘follow’ the symbol.

The above two requirements allow a grammar to be an LL grammar, thus supporting a predictive
top-down parsing technique, where the production to be selected is decided from the leftmost input
symbol of the incoming pattern.
In case a grammar is not an LL grammar, it must be converted (if possible) into an LL grammar,
mainly by:
• Removing left recursions
• Applying left factorization

An example of left factorization is the following:


A : B C
  | B D
can be converted into
A : B A’
A’ : C
   | D
The first form is not an LL grammar, as the 2 productions of A start with the same symbol.
The second form is an LL grammar, if the two productions of A’ satisfy the requirements of an LL
grammar.

In a bottom-up parser, instead, the production to be selected is decided looking at the entire
instance of the production body, which means that a wider range of languages can be managed.
Moreover, although in the examples given in previous paragraphs we have applied the simple rule
that a reduction is performed in case no shift is driven by the lookahead token, bottom-up parsers
commonly also check the set of terminal symbols that may ‘follow’ the symbol candidate to be
reduced, before reducing it. So the grammar can support multiple reduce items inside a single
state, if they don’t have common terminal symbols that can ‘follow’ them.
A grammar that ensures no conflicts in its parsing state machine, comparing the lookahead symbol
with the possible terminal symbols that may ‘follow’ the symbol candidate to be reduced, is an
SLR(1) grammar, where the 1 means that only one lookahead terminal symbol is checked.
This means that if we have productions like:
A : C
B : C
In order not to have reduce-reduce conflicts it is required that the symbols following A in the
grammar and the symbols following B in the grammar cannot derive sentences starting with the
same terminals.
The set of terminals that can start sentences generated by the symbols following S is commonly
called FOLLOW(S).
• The above means that FOLLOW(A) and FOLLOW(B) must be disjoint sets.

A second case we can consider is:


A : C
B : C D
In this case, in order not to have shift-reduce conflicts, it is required that the symbols following A
in the grammar and the symbol D cannot derive sentences starting with the same terminals.
The set of terminals that can start sentences generated by a symbol S is commonly called
FIRST(S).
• The above means that FIRST(D) and FOLLOW(A) must be disjoint sets.

The S in the SLR name indicates a ‘simple’ version of the original LR(1), whose state
machine was constructed to have each reduction associated to specific lookahead symbols,
instead of being allowed just for the lookahead symbols that can ‘follow’ the symbol candidate to
be reduced.
Thanks to this simplification, an SLR parser can work with a more limited number of states.

The usage of the lookahead symbol, both in SLR and in LR parsers, is the key factor in the parsing
flow, allowing the prediction of the following action to be accomplished among a possible shift
or multiple reduce actions.
Taking a step back, the starting point is the generation of all the possible states where the parser
can be in the analysis of a production. For example, for the production G : A B C we can generate
the following ‘items’, where the dot indicates the parser status.

G : . A B C ;  G : A . B C ;  G : A B . C ;  G : A B C .

For each of the above items it is possible to generate a CLOSURE of the item, which means adding
the productions generated by the symbol after the dot, and the productions generated applying the
same concept to the added productions (being limited by the rule that a production must be present
in the closure only one time). This is what has been done in the generation of the bottom-up parser
state machine.
Let’s consider the second case: G : A . B C.
Generating the closure means adding the productions of B. The problem may arise in case the
productions generated by B in the calculation of the CLOSURE contain two productions E and F
with the same body. We know that the conflict will be removed in case FOLLOW(E) and
FOLLOW(F) are disjoint sets. Otherwise:
• A symbol that can generate two productions with the same body will cause a reduce-reduce
conflict.
In this case it can be recommended to check the state of the state machine giving a goto to the
conflicting state. In this state, the symbol after the dot is the one generating the multiple
productions with the same body.
In case a reduce-reduce conflict is limited to a specific lookahead symbol, it may be possible to
modify the grammar by adding a new symbol including the lookahead inside and replacing this
new symbol in the productions where the original symbol was followed by the lookahead token.
This may give the advantage of parsing with 2 lookahead symbols.

In the generation of the state machine, we have then done a further step: we have generated other
states, each state including a collection of ‘items’ (each collection with its CLOSURE), each
collection generated starting from one of the symbols at the right of the dot.
Each state is a set of ‘items’ (with the CLOSURE applied inside each set).
We can say that the new set (Ix) of items generated from the set I considering the symbol X (that
is present inside the set I at the right of the dot) is usually called GOTO(I,X).

The collection of these sets (one set for each state) is also commonly called the canonical LR(0)
collection.

So we can summarize that for a SLR parser:


• Being in a state I (including A : B . x D), with x a terminal symbol, when the lookahead
symbol is x the parser will shift and move to the state GOTO(I,x)
• Being in a state I (including A : B .), when the lookahead symbol x is in FOLLOW(A) the
parser will reduce using A : B
• Being in a state I (including A : B . C D), with C a non-terminal symbol, when the parser
comes back to the state I after a reduction to C the parser will move to the state GOTO(I,C).
