Designing The Lexical Analyzer


Introduction
As part of the nGineer suite, there was a need for both a lexical analyzer and a
grammatical parser, neither of which is provided by the .NET Framework, so they
had to be written. This article explains the main design of the lexical analyzer as a
document to aid those intending to read the code or simply learn how the lexical analyzer works.

Goals
When I first set out to design the lexical analyzer, the main goal I had in mind was to make
it as simple as possible. Throughout the process, I kept the KISS principle in mind: I
wanted the most easily understandable code and design.
I decided to separate the lexical analyzer from the grammatical parser so that its
functionality could more easily be leveraged on its own, as a tool that disassembles and
'understands' the text that makes up a C# code file.

How Lexical Analyzers Work


Every lexical analyzer is essentially a state machine. The main goal is to find the best
(i.e. most characters / longest) matching lexical element that fits as the next one in line.
Take for example the sentence:

All your base are belong to us

We have the following available lexical elements: "All", "All your", "your base", "all your
base", "are", "to us", "are belong to", "us", "us!" and space (" ").
Let's decide what the first element is. It can be "All" or "All your" ("all your base" may
be longer and look like it matches, but we're working character by character and
"A" is not the same as "a"). We'll decide on "All your" because it is the longest match.
The second element can be "are" or "are belong to"; working with the longest-match
principle again, we'll select "are belong to". The third match will be "us" and not "us!",
because "us!", even though it is longer, is not part of the analyzed text.
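
To make the longest-match rule concrete, here is a toy sketch (an illustration only, not
nGineer code; the method and class names are made up for this example):

using System;

class LongestMatchDemo
{
    // Returns the longest element that matches the text at the given
    // position, or null if none matches.
    static string Match(string text, int position, string[] elements)
    {
        string best = null;

        foreach (string element in elements)
        {
            // The candidate must appear, character for character, at the
            // current position (the comparison is case sensitive)...
            if (position + element.Length <= text.Length &&
                string.CompareOrdinal(text, position, element, 0, element.Length) == 0)
            {
                // ...and we keep the longest such candidate.
                if (best == null || element.Length > best.Length)
                {
                    best = element;
                }
            }
        }

        return best;
    }

    static void Main()
    {
        string text = "All your base are belong to us";
        string[] elements = { "All", "All your", "your base", "all your base",
                              "are", "to us", "are belong to", "us", "us!", " " };

        Console.WriteLine(Match(text, 0, elements)); // prints "All your"
    }
}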

Top-Level Design
Most functional lexical analyzers are just made up of a set of functions trying to discover
what the next characters are. Most object-oriented lexical analyzers are the same as the
functional ones, just wrapped in a class. I decided that nGineer's lexical analyzer would be
made up of self-creating classes that form an object grid which, in turn, describes the
text. Each lexical element gets its own class, and I used inheritance to express the "is a"
relationship (in the coming example, a pp-equality-expression is a pp-and-expression, so
PPEqualityExpression derives from PPAndExpression).
Let's take the pp-and-expression for instance:
(Common means of describing lexical grammars are LL(1), LALR, etc. ECMA uses a
different notation, described in Chapter 5 of the C# standard. We'll use that notation
in this article.)

pp-and-expression::
    pp-equality-expression
    pp-and-expression whitespace(opt) && whitespace(opt) pp-equality-expression

The code generated from this will be:
using System;

namespace nGineer.LexicalGrammar
{
    /// <summary>
    /// A.1.10
    /// pp-and-expression::
    ///     pp-equality-expression
    ///     pp-and-expression whitespace(opt) && whitespace(opt) pp-equality-expression
    /// </summary>
    public abstract class PPAndExpression : PPOrExpression, ILexicalElement
    {
        #region ComplexAndExpression
        public class ComplexAndExpression : PPAndExpression, ILexicalElement
        {
            internal const string AndOperator = "&&";

            private PPAndExpression m_AndExpression;
            private WhiteSpace m_PreOperatorWhiteSpace = null;
            private WhiteSpace m_PostOperatorWhiteSpace = null;
            private PPEqualityExpression m_EqualityExpression;

            protected ComplexAndExpression()
            {
            }

            internal static new ComplexAndExpression Create(VirtualString codeSegment,
                PPDeclarationRequestEventHandler requestEventHandler)
            {
                ComplexAndExpression newObject = null;

                if (codeSegment.Contains(AndOperator))
                {
                    codeSegment.SetCheckPoint();

                    // Try to parse the left-hand pp-and-expression from the
                    // text that precedes the last "&&".
                    VirtualString tempString = codeSegment.SubstringLast(AndOperator);
                    int oldLength = tempString.Length;

                    PPAndExpression andExpression =
                        PPAndExpression.Create(tempString, requestEventHandler);

                    if (andExpression != null)
                    {
                        codeSegment.Advance(oldLength - tempString.Length);

                        WhiteSpace preOperatorWhiteSpace = WhiteSpace.Create(codeSegment);

                        if (codeSegment.Length >= AndOperator.Length &&
                            codeSegment.IndexOf(AndOperator, 0, AndOperator.Length) == 0)
                        {
                            codeSegment.Advance(AndOperator.Length);

                            WhiteSpace postOperatorWhiteSpace = WhiteSpace.Create(codeSegment);

                            PPEqualityExpression equalityExpression =
                                PPEqualityExpression.Create(codeSegment, requestEventHandler);

                            if (equalityExpression != null)
                            {
                                newObject = new ComplexAndExpression();
                                newObject.m_AndExpression = andExpression;
                                newObject.m_PreOperatorWhiteSpace = preOperatorWhiteSpace;
                                newObject.m_PostOperatorWhiteSpace = postOperatorWhiteSpace;
                                newObject.m_EqualityExpression = equalityExpression;

                                // The whole pattern matched; keep the consumed characters.
                                codeSegment.AcceptChanges();
                            }
                            else
                            {
                                codeSegment.RejectChanges();
                            }
                        }
                        else
                        {
                            codeSegment.RejectChanges();
                        }
                    }
                    else
                    {
                        codeSegment.RejectChanges();
                    }
                }

                return newObject;
            }

            #region <snip />
            ILexicalElement[] ILexicalElement.ChildElements
            {
                get
                {
                    return new ILexicalElement[] {
                        m_AndExpression,
                        m_PreOperatorWhiteSpace,
                        new Constant("AndOperator", AndOperator),
                        m_PostOperatorWhiteSpace,
                        m_EqualityExpression
                    };
                }
            }
            #endregion

            public override bool Evaluate()
            {
                return (m_AndExpression.Evaluate() && m_EqualityExpression.Evaluate());
            }

            public PPAndExpression AndExpression
            {
                get { return m_AndExpression; }
            }

            public WhiteSpace PreOperatorWhiteSpace
            {
                get { return m_PreOperatorWhiteSpace; }
            }

            public WhiteSpace PostOperatorWhiteSpace
            {
                get { return m_PostOperatorWhiteSpace; }
            }

            public PPEqualityExpression EqualityExpression
            {
                get { return m_EqualityExpression; }
            }
        }
        #endregion

        protected PPAndExpression()
        {
        }

        internal static new PPAndExpression Create(VirtualString codeSegment,
            PPDeclarationRequestEventHandler requestEventHandler)
        {
            // Try both alternatives of the production; the VirtualString
            // keeps whichever possibility consumed the most characters.
            codeSegment.BeginPossibilityCheck();

            codeSegment.BeginPossibility();
            codeSegment.EndPossibility(
                PPEqualityExpression.Create(codeSegment, requestEventHandler));

            codeSegment.BeginPossibility();
            codeSegment.EndPossibility(
                ComplexAndExpression.Create(codeSegment, requestEventHandler));

            return codeSegment.EndPossibilityCheck() as PPAndExpression;
        }

        #region <snip />
        ILexicalElement[] ILexicalElement.ChildElements
        {
            get
            {
                throw new InvalidOperationException(
                    "Can not get child elements for abstract classes");
            }
        }
        #endregion
    }
}
The PPAndExpression object can be either a PPEqualityExpression or a combination of a
PPAndExpression object, an optional WhiteSpace object, the text "&&", another optional
WhiteSpace object and a PPEqualityExpression object. Failing to find either form means
that the text we're looking at is not a pp-and-expression.
Note the Create method on both classes, which checks whether matching text comes
next and, if it does, returns a new object.

The Virtual String


Text in .NET is represented using the string class. I wanted to be able to go back and
forth over the string without changing it, so I initially created a new string instance
for each call to a Create method. This made the analyzer horribly sluggish.
That is when I came up with the VirtualString, which holds a single string instance and an
index that can be moved (SetCheckPoint, AcceptChanges, RejectChanges, etc.). It also
has a built-in mechanism for deciding which element is better (longer) when selecting
between several possibilities (BeginPossibility, EndPossibilityCheck, etc.), as was the
case with the "All" and "All your" possibilities.
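
To illustrate the idea, here is a minimal sketch of the checkpoint mechanism. It is a
simplification written for this article, not the actual nGineer implementation, and it
omits the possibility-checking methods:

using System.Collections.Generic;

public class VirtualString
{
    private readonly string m_Text;   // the single underlying string
    private int m_Index;              // current read position
    private readonly Stack<int> m_CheckPoints = new Stack<int>();

    public VirtualString(string text)
    {
        m_Text = text;
    }

    // Characters remaining from the current position.
    public int Length
    {
        get { return m_Text.Length - m_Index; }
    }

    public void Advance(int count)
    {
        m_Index += count;
    }

    // Remember the current position so a failed match can be undone.
    public void SetCheckPoint()
    {
        m_CheckPoints.Push(m_Index);
    }

    // Keep everything consumed since the last checkpoint.
    public void AcceptChanges()
    {
        m_CheckPoints.Pop();
    }

    // Roll the index back to the last checkpoint.
    public void RejectChanges()
    {
        m_Index = m_CheckPoints.Pop();
    }
}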

The Pre-processing Directives Problem


Using the design I had formed, I came across a problem: the following will not be
analyzed correctly:
#define kek

#if kek
#define keke
#endif

#if keke
#endif
This is valid code, and the #if keke section should be rendered as an input section, not a
skipped section. But that cannot be recognized unless there is some table of symbols.
Then again, what if we were to create a symbol that is not a real symbol, but just an
option that can be discarded?
So what I did was put a list of such symbols in each input section and allow requests to
be made using delegates that are propagated to the classes below (see the
PPAndExpression example's Create method).
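
A minimal sketch of what that request mechanism might look like follows. The delegate
name comes from the PPAndExpression example above, but its signature and the
InputSection members shown here are assumptions made for illustration:

using System.Collections.Generic;

// Assumed signature: asks whether a pre-processing symbol is defined.
public delegate bool PPDeclarationRequestEventHandler(string symbolName);

public class InputSection
{
    private readonly List<string> m_DefinedSymbols = new List<string>();

    public void Define(string symbol)
    {
        m_DefinedSymbols.Add(symbol);
    }

    // Passed down the Create calls so that, e.g., "#if keke" can ask
    // whether keke was defined earlier in this input section.
    public bool OnDeclarationRequest(string symbolName)
    {
        return m_DefinedSymbols.Contains(symbolName);
    }
}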

Walkers
Once the elements' object grid is finished, a mechanism is needed for walking over it,
to use the lexical analyzer to its fullest extent. This is where walkers come into
play.
The walkers' design is simple: derive from the base class WalkerBase and add
code that does whatever you want whenever a certain element is encountered. Three
basic implementations come with nGineer - CodeWalker (persists the code
back to a file), XmlWalker (which serializes the object grid to XML) and
HighlightedCodeWalker (on which webify is based). There's more about walkers in the
next article, "Implementing A Lexical Walker"; a rough sketch follows below.
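
As a rough sketch, a custom walker might look something like this. The OnElement hook
is an assumption made for illustration; the actual hooks exposed by WalkerBase are
covered in the next article:

// Counts the whitespace elements found while walking the object grid.
public class WhiteSpaceCounter : WalkerBase
{
    public int Count;

    // Assumed hook: called once for every element the walker visits.
    protected override void OnElement(ILexicalElement element)
    {
        if (element is WhiteSpace)
        {
            Count++;
        }
    }
}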

Using The Lexical Analyzer


Using the lexical analyzer is simple:

nGineer.LexicalGrammar.Input input =
    new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(cSharpCode, defs);

Where cSharpCode is a string containing the code and defs is an array of pre-processing
conditional symbols that were defined for the file. This is useful when analyzing a file
which has #if DEBUG in it, etc.
More examples:

nGineer.LexicalGrammar.Input input =
    new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(cSharpCode, "DEBUG");

nGineer.LexicalGrammar.Input input =
    new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(cSharpCode, "DEBUG", "SOMETHING_ELSE");

nGineer.LexicalGrammar.Input input =
    new nGineer.LexicalGrammar.CSharpCodeLexer().Analyze(myFile.ReadToEnd(), "DEBUG", "SOMETHING_ELSE");

Please remember that pre-processing conditional symbols are case-sensitive.
