

8. Lexical Analysis


John Roberts

• Lexical Analysis

• Assignment Two

• Project Code Overview

• Lexing

Lexical Analysis

• Read a stream of characters that make up the source program, and create a stream of tokens by combining the characters appropriately

• Tokens are sometimes also referred to as lexical units, or lexemes

• Example: the characters ‘t’, ‘h’, ‘e’, ‘n’ will be combined to build the then token

• Example: the characters ‘1’, ‘2’, ‘4’, ‘7’ will be combined to form an integer token with a value of 1247
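
The two examples above can be sketched in a few lines of Java. This is only an illustration of grouping characters into lexemes, not the course's lexer: the class name CharGrouping and the method group are hypothetical, and the sketch treats every character that is neither a letter, a digit, nor whitespace as a one-character lexeme.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the lexer package.
class CharGrouping {
    // Groups runs of letters into words and runs of digits into numbers.
    static List<String> group(String source) {
        List<String> lexemes = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                      // whitespace only separates lexemes
                continue;
            }
            int start = i;
            if (Character.isLetter(c)) {
                while (i < source.length() && Character.isLetter(source.charAt(i))) i++;
            } else if (Character.isDigit(c)) {
                while (i < source.length() && Character.isDigit(source.charAt(i))) i++;
            } else {
                i++;                      // assumption: anything else is one character long
            }
            lexemes.add(source.substring(start, i));
        }
        return lexemes;
    }

    public static void main(String[] args) {
        List<String> lexemes = group("then 1247");
        System.out.println(lexemes);                          // [then, 1247]
        System.out.println(Integer.parseInt(lexemes.get(1))); // 1247
    }
}
```

Once the digits have been grouped, converting the lexeme to its integer value is a separate step (here Integer.parseInt).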

The Lexer

• We will be working with the lexer package

• Recall that the lexer’s responsibility is to generate Tokens

Generating Tokens
Token Categories

Category         Tokens

Reserved Words   program int boolean if then else while return
Identifiers      <the same as Java identifiers>
Integers         <a sequence of digits>
Operators        = == != < <= + - * / | &
Separators       { } ( ) ,
Comments         // until end of line
Whitespace       <spaces> <newlines> and other Java whitespace


We’ll see how we use this shortly

tokens file

• The tokens are defined in a tokens file

• Each line in the file will have two strings:

    • The symbolic constant we will use in the compiler for the token

    • The actual token

    Program     program
    Int         int
    BOOLean     boolean
    If          if
    Then        then
    Else        else
    While       while
    Function    function
    Return      return
    Identifier  <id>
    INTeger     <int>
    LeftBrace   {
    RightBrace  }
    LeftParen   (
    RightParen  )
    Comma       ,
    Assign      =
    Equal       ==
    NotEqual    !=
    Less        <
    LessEqual   <=
    Plus        +
    Minus       -
    Or          |
    And         &
    Multiply    *
    Divide      /
    Comment     //
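
As a rough sketch of how a two-strings-per-line file like this could be read, the Java below splits each line on whitespace and maps the symbolic constant to the actual token. The class TokensFileSketch and its parse method are hypothetical names for illustration; the course's own setup code may work differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for illustration; not the course's setup code.
class TokensFileSketch {
    // Maps each symbolic constant to its token text, keeping file order.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {          // skip blank or malformed lines
                table.put(parts[0], parts[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, String> table =
            parse(List.of("Program program", "LessEqual <=", "Comment //"));
        System.out.println(table.get("LessEqual")); // <=
    }
}
```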

Token Setup

• TokenSetup will read the tokens file and automatically generate two files from it

• The Tokens enum is actually a class - you can add methods, instance fields, and a constructor that can only be used to construct the enumerated values

• Values are accessed as Tokens.If, etc.
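
To illustrate the point that a Java enum is really a class, here is a small sketch with an instance field, a constructor, and a method. The text field and text() method are assumptions added for illustration; the generated Tokens enum may contain only the bare constants, and this sketch lists just a few of them.

```java
// Illustrative enum only; the generated Tokens enum may look different.
enum Tokens {
    Program("program"), Int("int"), If("if"), Assign("="), Equal("==");

    private final String text;          // instance field (hypothetical)

    // Enum constructors are implicitly private: they can only be used
    // to construct the enumerated values listed above.
    Tokens(String text) {
        this.text = text;
    }

    String text() {                      // ordinary instance method
        return text;
    }
}

class TokensEnumDemo {
    public static void main(String[] args) {
        System.out.println(Tokens.If);          // If
        System.out.println(Tokens.If.text());   // if
    }
}
```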


• Examine code to ensure we understand how it works

• Execute TokenSetup and inspect the two generated files


• Examine code to ensure we understand how it works

• Note that we will be updating this file to generate better output (which we’ll see in a minute when we run Lexer)


• Each Token contains four pieces of information

• String of Token found in source

• TokenType

• Starting column from source file

• Ending column

• The first two items are grouped as a Symbol
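
The four pieces of information can be pictured with the following sketch, in which a Token holds a Symbol plus the two column positions. The field and method names are assumptions for illustration and the enum lists only a few constants; the project's own classes may differ.

```java
// Only a few constants, for illustration.
enum Tokens { Program, LeftBrace, Int, Identifier }

// Groups the first two items: the string found in the source and its type.
class Symbol {
    private final String text;
    private final Tokens kind;

    Symbol(String text, Tokens kind) {
        this.text = text;
        this.kind = kind;
    }

    @Override
    public String toString() {
        return "Symbol( \"" + text + "\", Tokens." + kind + " )";
    }
}

// A Symbol plus the starting and ending columns in the source line.
class Token {
    private final Symbol symbol;
    private final int leftColumn;    // hypothetical field name
    private final int rightColumn;   // hypothetical field name

    Token(Symbol symbol, int leftColumn, int rightColumn) {
        this.symbol = symbol;
        this.leftColumn = leftColumn;
        this.rightColumn = rightColumn;
    }

    @Override
    public String toString() {
        return "Token( " + symbol + ", " + leftColumn + ", " + rightColumn + " )";
    }

    public static void main(String[] args) {
        System.out.println(new Token(new Symbol("program", Tokens.Program), 1, 7));
        // Token( Symbol( "program", Tokens.Program ), 1, 7 )
    }
}
```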

Note we’ve seen this hash pattern before…

• String from the source, and TokenType

• All Strings (corresponding to tokens) found in the source program will be placed into the hash table in the Symbol class (the Symbol table)

• Before we begin, we place all Tokens in the Symbol hash table

• Each String will (should) be inserted exactly once

Example

    program { int j int k
     j = j + k
    }

Token( Symbol( "program", Tokens.Program ), 1, 7 )

Token( Symbol( "{", Tokens.LeftBrace ), 9, 9 )
Token( Symbol( "int", Tokens.Int ), 11, 13 )
Token( Symbol( "j", Tokens.Identifier ), 15, 15 )
Token( Symbol( "int", Tokens.Int ), 17, 19 )
Token( Symbol( "k", Tokens.Identifier ), 21, 21 )
Token( Symbol( "j", Tokens.Identifier ), 2, 2 )
Token( Symbol( "=", Tokens.Assign ), 4, 4 )
Token( Symbol( "j", Tokens.Identifier ), 6, 6 )
Token( Symbol( "+", Tokens.Plus ), 8, 8 )
Token( Symbol( "k", Tokens.Identifier ), 10, 10 )
Token( Symbol( "}", Tokens.RightBrace ), 1, 1 )

• Symbol.symbol( String s, Tokens kind ) - insert s into the hash table with value given by kind; if the entry is already in the table, then just return the entry

Example

• Note that we repeated a Symbol three times

Symbol( "j", Tokens.Identifier )

• For efficiency, we only want to create one instance of each Symbol, so we use the hash table to check if the Symbol has already been created. If so, re-use, if not, create a new instance.

• Logic encapsulated in Symbol class
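
A minimal sketch of this create-once, re-use-afterwards logic is shown below. It assumes a static HashMap keyed by the source string and a private constructor so that all instances go through one factory method; the actual Symbol class in the project may be organized differently, and the enum here lists only two constants.

```java
import java.util.HashMap;
import java.util.Map;

// Only two constants, for illustration.
enum Tokens { Identifier, Int }

class Symbol {
    // The symbol table: one entry per distinct source string.
    private static final Map<String, Symbol> symbols = new HashMap<>();

    private final String text;
    private final Tokens kind;

    // Private, so every instance is created through symbol(...).
    private Symbol(String text, Tokens kind) {
        this.text = text;
        this.kind = kind;
    }

    // Returns the existing entry for text, or creates and stores a new one.
    static Symbol symbol(String text, Tokens kind) {
        Symbol existing = symbols.get(text);
        if (existing == null) {
            existing = new Symbol(text, kind);
            symbols.put(text, existing);
        }
        return existing;
    }

    static int tableSize() {
        return symbols.size();
    }

    public static void main(String[] args) {
        Symbol first = symbol("j", Tokens.Identifier);
        Symbol second = symbol("j", Tokens.Identifier);
        System.out.println(first == second); // true
        System.out.println(tableSize());     // 1
    }
}
```

Because both calls return the same object, the three occurrences of j in the example share one Symbol instance.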


Performing Lexical Analysis

• Prior to processing the user’s program, we’ll create Symbol instances for all reserved words, operators, etc. so we can find them later (see

• Once the lexer starts processing the user’s program, the only new symbols that will be created (added to the hash map) will be identifiers and numbers - all other symbols will already have been created


• Insert all tokens into a HashMap<String, Symbol>

• The tokens HashMap in TokenType holds all of the known Token/Symbol pairs, e.g.





• Each of these is stored in the symbol table as it is generated (see implementation for Symbol.symbol)

• At this point, Symbol.symbols.get( "program" ) yields Symbol( "program", Tokens.Program )

• Scan the program line by line (character by character), and insert symbols not already in the symbols table (identifiers and ints)

• If we look up an identifier in the symbols:

    • Reserved word (e.g. program) - found and Symbol( "program", Tokens.Program ) returned

    • User id not already in symbols - we don’t find it, so we put a new entry and return the new Symbol

    • User id already in symbols - return the entry
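
The three identifier cases above can be sketched as a single lookup against a table that was pre-loaded with the reserved words. To keep the sketch short it stores the token kind directly rather than a Symbol object; the class IdentifierLookupSketch, the nested Kind enum, and the lookup method are all hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; stores token kinds instead of Symbol instances.
class IdentifierLookupSketch {
    enum Kind { Program, Int, Identifier }

    static final Map<String, Kind> symbols = new HashMap<>();

    static {
        // Pre-load reserved words before any user program is scanned.
        symbols.put("program", Kind.Program);
        symbols.put("int", Kind.Int);
    }

    // Reserved word or known id: return the existing entry.
    // Unknown word: add it as an Identifier and return the new entry.
    static Kind lookup(String word) {
        return symbols.computeIfAbsent(word, w -> Kind.Identifier);
    }

    public static void main(String[] args) {
        System.out.println(lookup("program")); // Program
        System.out.println(lookup("j"));       // Identifier (new entry)
        System.out.println(lookup("j"));       // Identifier (existing entry)
        System.out.println(symbols.size());    // 3
    }
}
```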


• If we look up other tokens in symbols

    • Numbers - put new entry, if not already there

    • Not found - don’t do anything

    • e.g. = vs. == vs. !=, / vs. // - these are either one- or two-character tokens

    • e.g. x =abc + y - we can key on the character =, and save the a for the start of the next token (the abc
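
One common way to tell one-character tokens from two-character tokens is to peek at the next character before deciding. The sketch below does that for the operators in the token table; it is an assumption about how the decision could be made, not the project's Lexer, and the class and method names are hypothetical. It collects only operators and skips everything else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one-character lookahead; not the project's Lexer.
class OperatorScanSketch {
    // Returns the operator tokens found in one source line, in order.
    static List<String> operators(String line) {
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            char next = (i + 1 < line.length()) ? line.charAt(i + 1) : '\0';
            if (c == '/' && next == '/') {
                found.add("//");          // comment runs until end of line
                break;
            }
            if ((c == '=' || c == '!' || c == '<') && next == '=') {
                found.add("" + c + next); // two-character token: == != <=
                i += 2;
            } else if ("=<+-*/|&".indexOf(c) >= 0) {
                found.add(String.valueOf(c)); // one-character token
                i++;                          // next char starts the next token
            } else {
                i++;                      // not an operator; skipped in this sketch
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(operators("x =abc + y"));       // [=, +]
        System.out.println(operators("a == b / c // end")); // [==, /, //]
    }
}
```

In x =abc + y, the character after = is a, which cannot extend the operator, so = is emitted alone and a is left to begin the next token.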
