YACC
The grammar can be very complex. As an example, we can define that an 'if statement' is made
according to the following description:
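The rule itself is missing here; judging from the symbols discussed just below, it was presumably of this form (the exact notation is an assumption):

```
IFSTAT : if CONDITION then ACTION else ACTION end
```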
In the above grammar rule, 'if', 'then', 'else' and 'end' are identified symbols that do not require
further grammar definitions. We call these 'terminals', because they terminate our grammar
analysis.
CONDITION and ACTION, instead, do not terminate the grammar analysis: we must search inside
the grammar definition for how these grammar terms are defined, in the same way as we have
defined the IFSTAT term.
As an example, CONDITION may in turn be defined by further grammar rules, down to terminal symbols.
The unix environment provides the YACC parser generator, which can be used in conjunction with
the LEX lexer generator. In general, the parser generated by YACC calls a function yylex(),
expecting to receive a code-number identifying the next token. It is possible to write the
code of the yylex() function inside the program embedding the parser, or to embed the yylex()
function generated by the unix LEX tool.
In this text we will assume the second case.
The lexer is generated by invoking lex on the definition file (lex lexfile.l), where lexfile.l is a
text file specifying the lexer grammar and operations in the appropriate lex syntax.
This command generates the C file lex.yy.c, embedding a function yylex() together with
further data.
The parser is generated by invoking yacc on the definition file (yacc yaccfile.y), where yaccfile.y
is a text file specifying the parser grammar and operations in the appropriate yacc syntax.
This command generates the C file y.tab.c, including the yyparse() function together with
additional data.
The -d option tells yacc also to generate another file, named y.tab.h, including the definition of the
tokens and of their code-numbers, so that the lexer can use the same code convention expected by
the parser. We will have to instruct the lex generator to include y.tab.h inside the generated C
lexer.
As an alternative, we would have to specify a set of code-numbers inside the lexfile.l file and the
same set of codes inside the yaccfile.y file.
In a unix environment, when we compile a C program including the yyparse() function created by
yacc, we need to link the libl.a and liby.a libraries, for example with cc y.tab.c lex.yy.c -ly -ll.
On platforms other than unix, libl and liby may not always be required.
Recall that the -l option tells the compiler's linker the name of the library to be used, with the
assumption that the library name always begins with 'lib' (so the 'lib' prefix must be removed
in the -l specification) and that the '.a' or '.so' extension is never given in the -l specification.
The lex definition file consists of 3 sections, separated by the symbols %% to be placed starting
from the first column of the file.
The first C definition subsection will be copied at the beginning of the C code generated by lex.
The second section will be translated by lex into the function yylex(), including also the subsection
of the lex definitions.
The third section will be copied at the end of the C code generated by lex.
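Schematically, the layout of a lex definition file can be sketched as follows (a sketch, not tied to any particular lex implementation):

```
%{
  /* C definitions: copied verbatim to the top of lex.yy.c */
%}
  /* lex definitions (named patterns, options) */
%%
  /* rules: pattern / action pairs, translated into yylex() */
%%
  /* user C code: copied verbatim to the end of lex.yy.c */
```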
Let’s assume for example that in the second or in the third section we want to use a C printf
function. In this case, in the first section we will add the required C header #include <stdio.h>.
We have also said that, in order to have lex and yacc use the same code-numbers for the tokens,
we can run yacc with the option -d to generate the file y.tab.h, including the definition of the tokens
and of their code-numbers. This means that in the first section we need to insert the statement
#include "y.tab.h".
For example, inside the y.tab.h file we can find token code-number definitions like
This gives the code 258 to the OPEN_BRACKET token, and we will be able to use the name
OPEN_BRACKET both inside the lex definition file and inside the yacc grammar rules, where
we can refer to OPEN_BRACKET inside any production body.
The concept is that the definition of the terminals used in a yacc grammar is performed inside the
yacc definition file, but the same list of definitions can be included by lex in order to have the same
definitions.
There is a third important set of definitions that can be included in the first section: the indication
of variables used to send information to yacc.
We have said that the function yylex() will return the code-number associated to the token that has
been recognized. For example, we will instruct lex to return OPEN_BRACKET (which is defined
as the code 258) when the token '(' is recognized by the lexer.
But we have also said that the token code-number alone may not be enough for the parser to perform
all its required actions: it is enough to check whether the parser grammar is matched, but for further
actions (for example, to generate compiled code) we also need further information. For example,
we may need to send to the parser the string of characters that makes up the token.
The yacc parser assumes that a type YYSTYPE is defined and in case it is not defined by the user
it will be defined inside the y.tab.h file as a C union whose fields will be read by the yacc parser
to receive the additional required information.
The name and the kind of fields of this union will be defined in the yacc definition file.
The variable ‘yylval’ of type YYSTYPE is also defined and the fields of yylval can be written by
the lexer and can be read by yacc for this exchange of information.
Basically, at this point of the description we can say that in the first section of the lex definition
file we can add the statement
extern YYSTYPE yylval;
We are saying that the C code created by lex can have access to the union yylval defined inside
the y.tab.h file and can write the fields of this union to send further information to the yacc parser.
The standard usage of YYSTYPE is as a union type, but it is also possible to define it as a
structure type, in case we would like to send multiple pieces of information to the parser
simultaneously, and not just a single value as allowed by a union.
Actually, without any definition action, yylval will be defined as an integer, but we
assume that in practical usage a definition of YYSTYPE will always be given, and we consider
the union definition as the standard one.
Said in other terms, yacc will define the fields of a union of type YYSTYPE and an associated
variable yylval of type YYSTYPE. Moreover, yacc will make available mechanisms to read the
fields of this yylval variable for its parsing operations.
At this point of our description we need to know that the lexer is required to fill the fields of this
yylval variable as a way to pass information to the parser.
If the statement 'extern YYSTYPE yylval;' is already inside the included y.tab.h file, it is not
required to repeat the same declaration in the first section of the lex definition file; we have tried
here to give an overview of how the exchange of information between the lexer and the yacc parser
works.
In case we want to use the yylex() function generated by lex with another parser, not generated by
yacc, we can add in the first section of the lex definition file all the token definitions that we need
and all the variables that we need to exchange information with the parser.
Before starting the description of the second section of the lex definition file, let's add a further
point.
When the lexer generated by lex recognizes a token, the char * 'yytext' (or char yytext[], depending
on the lex implementation) will contain the sequence of recognized characters that make up the
token, and the int yyleng variable will contain the number of recognized characters. These two
variables are defined inside the C source file generated by lex and can be accessed inside the
lex definition file (or inside any other C module, by declaring these variables as external).
The lexer generated by lex will read the text to be scanned from the variable FILE *yyin which is
set to stdin as default but can be reassigned to a user file before calling the yylex() function.
In the same way, the output will be sent to FILE *yyout, set as default to stdout.
There may be further variables or functions defined by lex that can be used in the exchange of data
between the lex generated code and additional external code, but we limit the description to the
above to avoid an excess of details.
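The rule discussed in the next paragraph is missing from the text; from the description it was presumably of this form (a sketch, assuming the field stringval and the token OPEN_BRACKET introduced earlier):

```
"("  { strcpy(yylval.stringval, yytext); return OPEN_BRACKET; }
```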
In the above we can see the sequence of characters to be matched within "" (in this case just one
character, the open round bracket), followed by a sequence of two C instructions enclosed in
curly braces.
The last C instruction tells the lexer to return the token code-number associated to the
specific token.
The first C instruction copies the recognized sequence of characters into the field stringval of the
union yylval that has been defined by yacc.
In this case we assume the union YYSTYPE has been defined as having a field char stringval[..].
Every time the lexer recognizes the defined token, the associated sequence of C commands will
be executed.
In summary, in the second section of the lex definition file we place a list of rules, each made of
a token match pattern followed by the list of associated C instructions.
In the above example the match pattern identifying a token is defined just as a sequence of characters
enclosed in " " quotes. This is just one of the possible methods. The second and more
powerful method is by means of regular expressions.
Defining a token by means of a regular expression opens a very powerful method to define the
token. Without any intention to give here any explanation on how regular expressions work, we
can just give below some of the most useful matching rules.
In the usage of regular expressions to define lex rules we need to consider two additional rules that
lex will apply:
• In case more rules match, lex chooses the rule matching the longest sequence of characters.
• In case more rules match the same number of characters, lex chooses the rule that occurs
earliest in the list.
For example, we can place the catch-all pattern . (dot) with an empty action at the end of the list,
as a way of skipping any character that has not been identified as part of any regular expression;
in this case we don't return any token to the parser.
Or we can identify sequences to be skipped, as
[ \t\r] {;}
to skip white spaces, tabs or carriage return characters
In the first section we can also place named pattern definitions, such as letters [a-zA-Z], which
can then be referenced in the rules of the second section.
#include <stdio.h>

extern FILE *yyin;       /* input stream read by the lex-generated scanner */
extern int yylex(void);  /* scanner function generated by lex */

int main(void)
{
    int lex_retval;
    FILE *fp;
    fp = fopen("filename.txt", "r");
    yyin = fp;
    lex_retval = yylex();
    while (lex_retval != 0) /* lex returns the value 0 at the end of file */
    {
        printf("Returned Code: %d\n", lex_retval);
        lex_retval = yylex();
    }
    fclose(fp);
    return 0;
}
5. The Yacc definition file
The yacc definition file follows the same structure as the lex definition file. It consists of 3 sections,
separated by the symbols %%, placed starting from the first column of the file.
The optional C definitions or declarations section contains C instructions that will be copied at the
beginning of the C source created by yacc.
For example, we can add
extern int yylex(void);
to refer to the external yylex function created by the lexer.
The following yacc declaration section makes use of specific yacc directives that begin with the
symbol %.
Let’s consider three commonly used directives: %union, %token and %type.
The directive %union is used to define the YYSTYPE union that is used by the lexer to exchange
information with the parser through the variable yylval declared of this type.
The syntax is the following:
%union {
C-like field definition;
C-like field definition;
…..
}
The fields of the union are defined as inside a C union definition. As an example, char stringval[256]
can define an array of characters where the lexer can write the sequence of characters that makes
up the token.
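As a concrete sketch (the field names here are just examples), a %union declaration could look like this:

```
%union {
    char stringval[256];   /* token text, written by the lexer */
    int  intval;           /* numeric value of a number token */
}
```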
On the other hand, there is a second very important usage of the fields of the yylval union: the
fields are used not only to store values returned by the lexer but also to store values returned by
the parser at each step of the parsing sequence. In fact, the lexer can be seen as just the first step
of the parsing chain, the step that recognizes terminal symbols, but any production inside the parser
grammar recognizes a non-terminal symbol (the production head) by means of terminals and non-
terminals recognized inside the productions that make the parser grammar.
For this reason, each production, when recognizing the non-terminal symbol of its head, will have
the capability of returning a value to the productions that use that non-terminal symbol in
their bodies. This returned value can be one of the fields of the union.
This exchange of data from the bottom to the top (yacc is a bottom-up parser) is commonly used
to build some intermediate format, usually a tree, that will be used to proceed towards further
translations of the parsed sentence into its final target. For example, an expression that adds two
numbers can be converted into a tree structure where the numbers are the leaves and the summation
operator is their parent node; this node can in turn become the leaf of a higher node of the tree.
In this way the tree can be built as the parser progresses in the syntax recognition. It is natural
for the yylval field dedicated to this purpose to be a pointer to the tree node associated to
the non-terminal symbol that is the head of a production: each production can create its node, make
it point to the nodes associated to the symbols inside the production body, and pass the pointer
to its node to the other productions that use its symbol inside their production bodies.
The directive %token is used to declare tokens that can be used both by the lexer and the parser as
terminal symbols.
As an example
%token ID
will create a #define ID IDNUM,
where IDNUM is a number assigned by yacc to the token.
All the defines created by the %token directive will be placed inside y.tab.h, to be included in
the lex definition file. Definitions made by the %token directive do not necessarily have to be
terminal symbols returned by the lexer: the lexer can return a subset of the defined tokens. The
%token directive just defines a set of code-numbers associated to the token labels.
A second usage of the %token directive allows specifying which field of yylval is associated
to the value returned by the lexer. As an example, if the field name is stringval, as
already used in the above examples, we will write:
%token <stringval> ID
It is not required to specify the associated field for all the tokens returned by the lexer to the parser.
It is mandatory to specify the field associated to a token only if the parser wants to read the value
passed by the lexer when the lexer recognizes that token. If the parser needs to read a returned
value from the lexer, the parser must know in which field of the yylval union the returned value is
stored. Different tokens can be associated to different fields, if needed.
All the head symbols, as well as the non-terminal symbols used in production bodies, need to be
declared with the %type directive, while the terminal symbols used in production bodies need to
be declared with the %token directive.
Multiple production bodies associated to the same head can be separated by the symbol |.
Even an empty body is allowed, just by not indicating any symbol in the body.
The symbol ; closes the definition of the production.
Each production can be associated with C instructions that will be executed when the parser matches
the body. These C instructions can make use of the yylval fields associated to each of the symbols
inside the body: the value associated to the 1st symbol can be accessed through the label $1, the
value associated to the 2nd symbol through $2, and so on. The C type of the values
corresponds to the yylval field associated to the symbol in the %token and %type directives.
Finally, the sequence of C instructions must compute the value returned for the head,
assigning it to the label $$.
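For example, a hypothetical rule (assuming an int-valued yylval field has been associated to these symbols via the %type and %token directives) could be:

```
expr : expr PLUS term  { $$ = $1 + $3; }   /* $2 is the PLUS token itself */
     | term            { $$ = $1; }
     ;
```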
We can assume that the C instructions associated to the grammar productions are also used to
generate the output of the parser itself.
6. How the yacc parser works
We have said that in a recursive parser, which is not our yacc case, the parser checks whether a
production can replace its symbols with the productions of the same symbols, to
progressively 'derive' the sentence under analysis, starting from the left. Then the analysis
proceeds towards a further portion of the sentence, checking if this matches a more
complex production, including the first production that has already been matched. On the
other hand, in case of a mismatch, the parser may be required to come back to the initial
decision and check another production. This means that what has already been decided is only
temporary and can be confirmed only at the end of the parsing, with the possibility of
doing 'backtracking' to change previous decisions.
This need for backtracking may be another bottleneck of parsers: if we want to construct some
intermediate code as the parser proceeds, all the intermediate code we are
building will be thrown away in case the parser backtracks.
The yacc parser belongs to a class of 'predictive' parsers, which means that backtracking
is not allowed. If the parser has reached a certain point, no further incoming token
can be in conflict with what has already been checked, because for each
incoming token the production to be further checked is predicted from the token itself. Said
in other terms, the incoming token may be compatible or not compatible with what has
already been decided. If it is not compatible, this is an error: the sentence under analysis
does not match the grammar.
Of course, not all grammars are compatible with this kind of property: without going
into details, we can say that we talk of LL grammars (scan from Left, produce a Left-most
'derivation') for recursive top-down parsers and of LR grammars (scan from Left, produce a
Right-most 'derivation') for bottom-up parsers like yacc.
In an LL grammar, not only do we want to scan the text from the left, but we also want to
'derive' the sentence under analysis by expanding the symbols in a production body with
productions of the same symbols, starting from the left symbols and moving to the right.
Similarly, in an LR grammar we start from the right symbols.
In a bottom-up approach, we can say that all the terminals of the sentence under analysis
(the sentence is only made by terminals) can be progressively ‘reduced’ to higher level
grammar rules up to the start symbol of the grammar.
In the above, we have used two words: derivation and reduction, which require some more
discussion.
Derivation of a sentence from a grammar is a top-down approach where we start from the
entry point of the grammar and make replacements of symbols with productions of the
symbols, to arrive at the final sentence.
Let’s now follow a bottom-up approach: we push the incoming tokens into a stack until the
top of the stack matches a production body. At this point the production body is replaced
by the production head (reduction).
We need to say that, although the above bottom-up parsing method can be easily understood, the
rules applied to decide between shift and reduce may not be straightforward. That's
why tools like yacc implement an algorithmic procedure to extract these rules from the grammar
and build the parser state machine.
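The example grammar used in the following discussion, as it can be reconstructed from the items listed below, is:

```
G : E
E : F
  | E + F
F : id
  | F * id
```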
Let’s start from the entry point of the grammar: G : E. We can say that at the beginning our status
of analysis is placed before the symbol E. We indicate this by placing a dot before the part of the
grammar body that still has to be recognized.
G:.E
When the parser state machine is in this initial state, it can expect only an expression E to come.
If it is not an expression E the incoming sentence is not matching the grammar.
But what does it mean in terms of the incoming token?
Let’s write a list of all the productions of the symbol that is just after the dot.
E:.F
E:.E+F
We are saying that the parser can expect a factor F as well as another expression E.
But we can go ahead: As the symbol F is just after the dot, the parser may expect also
F : . id
F: . F * id
In this analysis we break any possible recursion by checking that a production is not used more
than one time.
We can say that from this initial state we can evolve towards 3 possible further states, as a function
of what will be the incoming symbol:
• 2a) If the incoming symbol is the terminal id, the state machine can move to a new state
corresponding to F: id .
• 2b) If the incoming symbol is E, the state machine can move to a state corresponding to
the productions
• G: E .
• E:E.+F
• 2c) If the incoming symbol is F, the state machine can move to a state corresponding to the
productions
• E: F .
• F : F . * id
In these three new states the dot has moved one step to the right, because now we know that the
previous symbol has been recognized.
Let's suppose we are in state 1 and receive id as the lookahead token. As the token id is one of
the expected terminal tokens in this state, the parser knows that it has to be shifted
onto the stack and that the state machine has to move into state 2a.
In state 2a there are no more expected tokens, as the dot is at the end of the production. So the
only thing the parser can do is to reduce using the rule F : id associated to the state.
At this point the parser returns to the originating state 1, knowing that the last grammar symbol was
F (a reduction to F has occurred).
But in state 1 the parser knows that for the grammar symbol F it must evolve towards state
2c.
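Putting the pieces together, the complete parse of the single-token sentence id can be sketched as follows (our own trace, using only the state names introduced above):

```
state 1,  input: id     -> shift id, go to state 2a
state 2a, dot at end    -> reduce by F : id, return to state 1
state 1,  last symbol F -> go to state 2c
state 2c, input: (end)  -> reduce by E : F, return to state 1
state 1,  last symbol E -> go to state 2b
state 2b, input: (end)  -> reduce by G : E, accept
```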
Basically, there are two different kinds of transitions from one state to another:
• If the transition is driven by a terminal (id, in our example), it happens when the next token
matches the terminal
• If the transition is driven by a non-terminal (E, F in our example), it happens when the
parser comes back from another state where a reduction to that non-terminal symbol has
happened
If the incoming token doesn’t match any transition driven by a terminal symbol, this means that
the state will have to make a reduction or that the sentence does not match the grammar.
Let’s now spend some words about the usage of the stack.
We have said that when the parser is in a certain state and the incoming token is matching one of
the expected tokens then the token is shifted into the stack and the parser is moved into a new state.
Actually, as each change of state corresponds to advancing by one symbol in the sequence of
recognized terminal and non-terminal symbols, what is pushed onto the stack is the code-
number of the parser state. When the parser moves from state a into state b, state a is pushed
onto the stack. When a reduction takes place, virtually removing n symbols from the stack,
n states can be removed from the stack, thus leaving at the top of the stack the state to which
the parser has to come back after the reduction.
The state machine created by yacc can be made visible by using the options --debug and --verbose
and by adding the assignment int yydebug = 1; in the yacc definition file.
With the above, yacc generates the file y.output listing the state machine and prints to its output
all the steps done in the parsing process.
7. Error management
In case of error yacc calls a user defined function int yyerror(char *errormsg) that the user can
define in any of the C sections of the yacc definition file or in an external C module (by defining
the function as external at the beginning of the yacc definition file).
There is a pre-defined token ‘error’ that can be used as a regular token in a production rule, most
likely at the entry point of the grammar.
A typical entry point of the grammar may define multiple bodies, including
• An empty body
• The real entry point of the grammar
• An error management body.
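A hypothetical entry point following this pattern (the symbol names here are our own) could be:

```
start : /* empty body */
      | program                  /* the real entry point of the grammar */
      | error  { yyclearin; }    /* error management body */
      ;
```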
The predefined macro yyclearin allows the removal of the lookahead token.
On the lookahead token, we can add that when it is available, it can be read from the variable int
yychar. If it is not available, the same variable will take the value YYEMPTY or YYEOF.
The last token read by lex may or may not be a lookahead token, depending on whether it has
already been shifted into the stack when a reduction occurs. In any case, the character
sequence of the last token may be read from the lex variable yytext (which may be defined as char
yytext[] or char *yytext).
8. Conflicts
One of the most delicate aspects of a grammar can be the ambiguity rising from the precedence
we would like to assign to the tokens.
Let’s consider, as an example the simple grammar:
G: S
S: if c S
| if c S else S
| a
which is ambiguous: in a nested sentence such as if c if c a else a, the else can be assigned to the first or to the second if.
If we make the yacc state machine for the above grammar, we can understand that at a certain point
we will arrive into a state with two possible productions:
• S: if c S .
• S: if c S . else S
The yacc default functionality will check whether the following token is else and, in this case, it will
shift it into the stack, thus solving the ambiguity in favor of assigning the else to the innermost if.
In case this behavior is not the desired one we should tell yacc to behave in a different way, by
means of defining precedence and associativity of tokens.
Tokens can receive a precedence and the last terminal token inside a production gives the same
precedence to the production itself.
At this point, in a shift-reduce conflict, the precedence of the production subject to reduction is
compared to the precedence of the incoming token and the action associated to the higher
precedence is performed.
In case the precedence is the same, associativity is checked.
Tokens may have left or right associativity. The default behavior giving precedence to shift means
that the incoming token is managed as having a right associativity. On the other hand, specifying
a left associativity for the incoming token will modify the yacc behavior in favor of reduction.
In case the tokens are defined as non-associative, a conflict will give an error.
The %left, %right and %nonassoc directives can be used to define associativity in the yacc
declaration section. These directives can be used as an alternative to the %token declaration.
The order of the associativity directives defines the precedence, from the lowest to the highest.
%left PLUS MINUS
%left MULT DIV
will give priority to the MULT and DIV tokens over the PLUS and MINUS tokens.
In case a production does not include any token from which it may take its precedence, it is possible
to assign to a production the same precedence of a token, for example a dummy token whose
precedence can be assigned by the %nonassoc directive.
At the end of the production body, after the definition of the C statements, it is possible to add a
%prec directive followed by the name of the token whose priority will be given to the
production body.
In this way, by comparing the precedence of this token with the precedence of the shift token, it is
possible to avoid the assertion of shift-reduce conflict warnings.
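The classic example of this mechanism (a sketch; UMINUS is a dummy token introduced only to carry a precedence) is the unary minus:

```
/* in the declaration section, after the other precedence directives: */
%nonassoc UMINUS

/* in the grammar section: */
expr : MINUS expr %prec UMINUS  { $$ = -$2; }
```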
We can say that precedence and associativity can help removing conflicts coming from grammar
ambiguities. For example, if we define
E: E PLUS E
| NUMBER
we have clearly created an ambiguity in the usage of the PLUS operator.
On the other hand, by defining:
E: NUMBER
| E PLUS NUMBER
the expression is built from the left, without the need of defining the associativity of the PLUS
operator.
In the first, ambiguous case a shift-reduce warning will be asserted, but the default resolution in favor of shift will still make the parser work; in the second case no conflict arises.
9. Further on parsing theory and techniques
In the above paragraphs we have talked about yacc but we have also said that another important
family of parsers consists in the top-down parsers, like recursive descent parsers.
On the other hand, we would like our parsers to be predictive, to avoid backtracking.
Having analyzed the functionality of the stack in case of a top-down parser like yacc, we can easily
understand the similar functionality of the stack in case of a top-down parser:
• In a bottom-up parser we fill the stack with the terminal symbols read from the input. When
the symbols at the top of the stack match a production body, they are replaced by the
production head.
• In a top-down parser, we start putting in the stack the production head of our grammar and
then we can replace the top symbol of the stack with production bodies, shifting symbols
in reverse order (left symbols will always be at the top of the stack). For each lookahead
symbol, if the top of the stack is a terminal symbol ‘matching’ the lookahead symbol, it
can be removed from the stack. If it is a non-terminal symbol, we can replace it by a
production of that non-terminal symbol, choosing a production that can derive the
lookahead as the first symbol.
The above explicit usage of a stack (instead of having it implicit by recursion) allows the
construction of a non-recursive top-down parser.
The requirement for a top-down parser to be predictive is that, given the lookahead symbol and
given the non-terminal symbol at the top of the stack, we have just one possible production of the
non-terminal symbol that can derive strings having the lookahead symbol as the ‘first’ symbol.
Otherwise we should try multiple productions.
A grammar satisfying the above requirement is an LL grammar. We can summarize by saying that
in an LL grammar:
• If a symbol has multiple productions, they must derive strings starting with different
terminals. This is valid also in case the empty string can be derived: only one production
can derive the empty string.
On the other hand, if one of the productions of a symbol derives an empty string, given a lookahead
symbol we cannot know if we must choose the production that may derive an empty string (which
means that the lookahead symbol will be associated to the following symbol inside the stack) or if
we must choose another production that may generate that symbol.
This means that in a LL grammar:
• If a symbol has multiple productions and one of the productions can derive an empty string,
all the other productions must derive strings not starting with a terminal symbol that can
‘follow’ the symbol.
The above two requirements allow a grammar to be a LL grammar, thus supporting a predictive
top-down parsing technique, where the production to be selected is decided from the leftmost input
symbol of the incoming pattern.
In case a grammar is not a LL grammar it must be converted (if possible) into a LL grammar,
mainly by:
• Removing left recursions
• Applying left factorization
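For example (a standard transformation, not taken from the text), the left-recursive rule E : E + F | F can be rewritten with a new tail symbol as:

```
E     : F Etail
Etail : + F Etail
      | /* empty */
```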
In a bottom-up parser, instead, the production to be selected is decided looking at the entire
instance of the production body, which means that a wider range of languages can be managed.
Moreover, although in the examples given in previous paragraphs we have applied the simple rule
that a reduction is performed in case no shift is driven by the lookahead token, bottom-up parsers
commonly also check the set of terminal symbols that may 'follow' the symbol candidate for
reduction, before reducing it. So the grammar can support multiple reduce items inside a single
state, provided they don't have common terminal symbols that can 'follow' them.
A grammar that ensures no conflicts in its parsing state machine, comparing the lookahead symbol
with the possible terminal symbols that may ‘follow’ the symbol candidate to be reduced is a
SLR(1) grammar, where the 1 means that only one lookahead terminal symbol is checked.
This means that if we have productions like:
A:C
B:C
in order not to have reduce-reduce conflicts it is required that the terminal symbols that can appear,
in the grammar, immediately after A and those that can appear immediately after B do not
overlap.
The set of terminal symbols that can appear immediately after a symbol S in some sentence
generated by the grammar is commonly called FOLLOW(S).
• The above means that FOLLOW(A) and FOLLOW(B) must be disjoint sets.
The S inside the SLR name indicates a 'simplified' version of the original LR(1), whose state
machine was constructed to have each reduction associated to specific lookahead symbols,
instead of simply being allowed for any lookahead symbol that can 'follow' the symbol candidate
for reduction, as in SLR.
Thanks to this simplification, an SLR parser can work with a more limited number of states.
The usage of the lookahead symbol, both in SLR or in LR parsers is the key factor in the parsing
flow, by allowing the prediction of the following action to be accomplished among possible shift
or multiple reduce actions.
Taking a step back, the starting point is the generation of all the possible states in which the parser
can be during the analysis of a production. For example, for the production G : A B C we can
generate the following 'items', where the dot indicates the parser status.
G: . A B C ; G: A . B C ; G: A B . C ; G: A B C .
For each of the above items it is possible to generate a CLOSURE of the item, which means adding
the productions generated by the symbol after the dot, and the productions generated applying the
same concept to the added productions (being limited by the rule that a production must be present
in the closure only one time). This is what has been done in the generation of the bottom-up parser
state machine.
Let’s consider the second case: G: A . B C.
Generating the closure means adding the productions of B. A problem may arise in case the
productions added in the calculation of the CLOSURE contain two productions, with heads E
and F, having the same body. We know that the conflict will be removed in case FOLLOW(E) and
FOLLOW(F) are disjoint sets. Otherwise:
• A symbol that can generate two productions with the same body will cause a reduce-reduce
conflict.
In this case it is recommended to check the state of the state machine that has a goto to the
conflicting state: in that state, the symbol after the dot is the one generating the multiple
productions with the same body.
In case a reduce-reduce conflict is limited to a specific lookahead symbol, it may be possible to
modify the grammar by defining a new symbol that includes the lookahead, and substituting this
new symbol in the productions where the original symbol was followed by the lookahead token.
This gives an effect similar to parsing with 2 lookahead symbols.
In the generation of the state machine we have then performed a further step: we have generated
other states, each including a collection of 'items' (each collection with its CLOSURE), each
collection generated starting from one of the symbols at the right of the dot.
Each state is a set of 'items' (with the CLOSURE applied inside each set).
The new set (Ix) of items generated from the set I considering the symbol X (present inside
the set I at the right of the dot) is usually called GOTO(I,X).
The collection of these sets (one set for each state) is also commonly called the canonical LR(0)
collection.