
06/19/09 Copyright Ankita 1

Lexical Analyzer

Compiler phases: Source program → Lexical analyzer → Syntax analyzer →
Semantic analyzer → Intermediate code generator → Code optimizer →
Code generator → Target program

• Groups sequences of characters into lexemes – the smallest meaningful
entities in a language (keywords, identifiers, constants)
• Characters read from a file are buffered – buffering helps decrease the
latency due to I/O; the lexical analyzer manages the buffer
• Makes use of the theory of regular languages and finite state machines
Parser

• Converts a linear structure – a sequence of tokens – to a hierarchical
tree-like structure – an AST
• The parser imposes the syntax rules of the language
• Work should be linear in the size of the input (else unusable) → type
consistency cannot be checked in this phase
• Deterministic context-free languages and pushdown automata form the basis
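To make the "tree-like structure" concrete, here is a minimal sketch (not from these slides) using a hypothetical nested-tuple AST for the expression 31 + 28*x; the node shapes and names are illustrative only:

```python
# Hypothetical nested-tuple AST for 31 + 28*x.
# Node shapes: ("num", value), ("id", name), (op, left, right).
ast = ("+", ("num", 31), ("*", ("num", 28), ("id", "x")))

def evaluate(node, env):
    """Walk the tree bottom-up; precedence is encoded in its shape."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "id":
        return env[node[1]]
    left = evaluate(node[1], env)
    right = evaluate(node[2], env)
    return left + right if kind == "+" else left * right
```

With x = 2 this evaluates to 87, i.e. 31 + (28*2): the grouping comes from the tree's shape, not from the linear token order.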
Semantic Analysis

• Calculates the program's "meaning"
• Rules of the language are checked (variable declaration, type checking)
• Type checking is also needed for code generation (code gen for a + b
depends on the types of a and b)
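A minimal sketch of the last point, assuming a toy two-type language; the opcode names are made up for illustration and do not come from any real instruction set:

```python
def add_instruction(type_a, type_b):
    """Pick an add opcode from the operand types (names illustrative)."""
    if type_a == "int" and type_b == "int":
        return "ADD_INT"
    if "float" in (type_a, type_b):
        return "ADD_FLOAT"  # the int operand would be converted first
    raise TypeError(f"cannot add {type_a} and {type_b}")
```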
Intermediate Code Generation

• Makes it easy to port the compiler to other architectures
• Can also be the basis for interpreters
• Enables optimizations that are not machine specific
Intermediate Code Optimization

• Constant propagation, dead code elimination, common sub-expression
elimination, strength reduction, etc.
• Based on dataflow analysis – properties that are independent of
execution paths
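A hedged sketch of constant propagation with folding on a made-up three-address IR (tuples of dest, op, arg1, arg2); real compilers do this over a control-flow graph, but the single-block idea is the same:

```python
# Toy IR: ("x", "const", 4, None) loads a literal; ("z", "add", a, b)
# adds. Known-constant operands are substituted and folded.
def propagate_constants(code):
    consts, out = {}, []
    for dest, op, a1, a2 in code:
        a1 = consts.get(a1, a1)   # replace known-constant operands
        a2 = consts.get(a2, a2)
        if op == "const":
            consts[dest] = a1
        elif op == "add" and isinstance(a1, int) and isinstance(a2, int):
            consts[dest] = a1 + a2
            op, a1, a2 = "const", a1 + a2, None   # fold at compile time
        out.append((dest, op, a1, a2))
    return out

code = [("x", "const", 4, None),
        ("y", "const", 5, None),
        ("z", "add", "x", "y")]
```

On this input the add becomes `("z", "const", 9, None)`; a dead-code pass could then delete x and y if nothing else reads them.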
Native Code Generation

• Intermediate code is translated into native code
• Register allocation, instruction selection

Native Code Optimization

• Peephole optimizations – a small window of instructions is optimized
at a time
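A hedged sketch of one classic peephole pattern on a toy instruction list (the instruction names are invented): a push of a register immediately followed by a pop of the same register cancels out.

```python
def peephole(instrs):
    """Slide a two-instruction window; drop push r / pop r pairs."""
    out = []
    for ins in instrs:
        if out and out[-1] == ("push", ins[1]) and ins[0] == "pop":
            out.pop()          # the pair is a no-op: delete both halves
        else:
            out.append(ins)
    return out

prog = [("load", "r1"), ("push", "r1"), ("pop", "r1"), ("store", "r1")]
```

Running `peephole(prog)` leaves only the load and the store.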
Informal sketch of lexical analysis
 Identifies tokens in the input string

Issues in lexical analysis
 Lookahead
 Ambiguities

Specifying lexers
 Regular expressions
Simplifies the design of the compiler
 LL(1) or LR(1) parsing, where the 1 indicates that one symbol of
lookahead is used to decide the next parsing step

Provides an efficient implementation
 Systematic techniques to implement lexical analyzers by hand or
automatically from specifications
 Stream buffering methods to scan the input

Improves portability
 Non-standard symbols and alternate character encodings can be
normalized
Source Program → Lexical Analyzer → Parser

The parser requests the next token ("get next token"); the lexical
analyzer answers with a (token, tokenval) pair. Either phase may report
an error, and both consult the Symbol Table.
y := 31 + 28*x  →  Lexical analyzer  →

<id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">

Each token may carry a tokenval (token attribute); the token stream is
passed to the parser.
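The token stream above can be produced by a small regex-driven scanner. This is a sketch using Python's `re` module with named groups; the token names follow the slide, and the pattern set is just large enough for this one input:

```python
import re

# One alternative per token class; named groups tell us which matched.
TOKEN_SPEC = [("num", r"\d+"), ("id", r"[A-Za-z]\w*"),
              ("assign", r":="), ("op", r"[+*]"), ("ws", r"\s+")]
SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    for m in SCANNER.finditer(text):
        if m.lastgroup != "ws":        # whitespace yields no token
            yield (m.lastgroup, m.group())

print(list(tokenize("y := 31 + 28*x")))
```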
A token is a classification of lexical units
 For example: id and num
Lexemes are the specific character strings that
make up a token
 For example: abc and 123
Patterns are rules describing the set of lexemes
belonging to a token
 For example: “letter followed by letters and digits” and
“non-empty sequence of digits”

A syntactic category
 In English:
noun, verb, adjective, …

 In a programming language:
Identifier, Integer, Single-Float, Double-Float, operator
(perhaps single or multiple character), Comment,
Keyword, Whitespace, string constant, …

This is the job of the Lexer, the first pass over
the source code. The Lexer classifies program
substrings according to role.

We also tack on to each token the location (line
and column) of where it ends, so we can report
errors in the source code by location.

We read tokens until we reach end-of-file.

Output of lexical analysis is a stream of tokens . . .

This stream is the input to the next pass, the parser.

Parser makes use of token distinctions.

Define a finite set of tokens
 Tokens describe all items of interest
 The choice of tokens depends on the language and the design of the
parser

Useful tokens for a typical expression:
Integer, Keyword, operator, Identifier

Describe which strings belong to each token
category

Recall:
 Identifier: strings of letters or digits, starting with a
letter
 Integer: a non-empty string of digits
 Keyword: “else” or “if” or “begin” or …
 Whitespace: a non-empty sequence of blanks, newlines,
and tabs

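The informal descriptions above can be written as regular expressions. A sketch in Python (anchored with `\Z` so each pattern must match a whole string; the keyword list is abbreviated as on the slide):

```python
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")  # letter, then letters/digits
integer    = re.compile(r"[0-9]+\Z")                # non-empty digit string
keyword    = re.compile(r"(else|if|begin)\Z")       # abbreviated, as above
whitespace = re.compile(r"[ \t\n]+\Z")              # blanks, tabs, newlines
```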
Is it as easy as it sounds?

Not quite!

Look at some history . . .

FORTRAN rule: Whitespace is insignificant

E.g., VAR1 is the same as VA R1

Footnote: the FORTRAN whitespace rule was motivated by the inaccuracy
of punch card operators

The goal of lexical analysis is to
 Partition the input string into lexemes
 Identify the token type, and perhaps the value, of each lexeme

Because of the left-to-right scan, lookahead is sometimes required.

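A minimal sketch of why lookahead is needed, using the `=` vs `==` pair as an assumed example (the mechanism is the same for any prefix-overlapping operators): on seeing `=`, the scanner must peek one character ahead before committing.

```python
def scan_ops(text):
    """Extract only '=' and '==' tokens; everything else is skipped."""
    i, toks = 0, []
    while i < len(text):
        if text[i] == "=" and i + 1 < len(text) and text[i + 1] == "=":
            toks.append("==")
            i += 2             # lookahead let us consume both characters
        elif text[i] == "=":
            toks.append("=")
            i += 1
        else:
            i += 1             # non-operator characters ignored in sketch
    return toks
```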
There are several formalisms for specifying tokens.

Regular languages are the most popular
 Simple and useful theory
 Easy to understand
 Efficient implementations are possible
 Almost powerful enough (can't handle nested comments)
 Popular "almost automatic" tools exist for writing lexers

The standard notation for regular languages is regular expressions.

A single character denotes a set containing one string
 'c' = { "c" }

The epsilon character denotes a set containing one zero-length string
 ε = { "" }

The empty set {} = ∅ is not the same as {ε}
 Size(∅) = 0, Size({ε}) = 1

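Modelling languages as Python sets of strings (a sketch, not part of the slides) makes the ∅-versus-{ε} distinction concrete, including their different behavior under concatenation:

```python
empty_set = set()   # ∅ : contains no strings at all
epsilon   = {""}    # {ε}: contains exactly one string, of length zero
A = {"a", "b"}

# Concatenating with ∅ yields ∅; concatenating with {ε} leaves A alone.
with_empty   = {x + y for x in empty_set for y in A}
with_epsilon = {x + y for x in epsilon for y in A}
```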
Union: if A and B are REs then
 A + B = { s | s ∈ A or s ∈ B }
Concatenation of sets → concatenation of strings
 AB = { ab | a ∈ A and b ∈ B }
Iteration (Kleene closure)
 A* = ∪_{i≥0} A^i, where A^i = A·A···A (i times) and A^0 = {ε}

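The three operations can be sketched on finite languages as Python sets (Kleene closure is infinite in general, so the sketch computes a finite approximation up to a chosen power):

```python
def union(A, B):
    """A + B: strings in either language."""
    return A | B

def concat(A, B):
    """AB: every string of A followed by every string of B."""
    return {a + b for a in A for b in B}

def power(A, i):
    """A^i: A concatenated with itself i times; A^0 = {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result

def kleene(A, max_i):
    """Finite approximation of A* = ∪_{i≥0} A^i, up to i = max_i."""
    star = set()
    for i in range(max_i + 1):
        star |= power(A, i)
    return star
```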
Translate regular expressions to an NFA
Translate the NFA to an efficient DFA (this second step is optional)

regular expressions → NFA → DFA

Either simulate the NFA to recognize tokens, or simulate the DFA to
recognize tokens.
An NFA is a 5-tuple (S, Σ, δ, s0, F) where
 S is a finite set of states
 Σ is a finite set of symbols, the alphabet
 δ is a mapping from S × (Σ ∪ {ε}) to subsets of S
 s0 ∈ S is the start state
 F ⊆ S is the set of accepting (or final) states
An NFA can be diagrammatically represented by a labeled directed graph
called a transition graph. For the NFA below:

 S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}
 start → 0 –a→ 1 –b→ 2 –b→ 3
 with self-loops on state 0 for both a and b
The mapping δ of an NFA can be represented in a transition table

δ(0,a) = {0,1}        State | Input a | Input b
δ(0,b) = {0}          ------+---------+--------
δ(1,b) = {2}            0   |  {0,1}  |  {0}
δ(2,b) = {3}            1   |         |  {2}
                        2   |         |  {3}

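Simulating an NFA means tracking the *set* of states reachable after each input symbol. A sketch driven directly by the transition table above (start state 0, accepting state 3):

```python
delta = {(0, "a"): {0, 1}, (0, "b"): {0},
         (1, "b"): {2}, (2, "b"): {3}}
start, accepting = 0, {3}

def nfa_accepts(s):
    states = {start}
    for ch in s:
        # Union of all moves from every currently-possible state.
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return bool(states & accepting)
```

On "abb" the state sets evolve {0} → {0,1} → {0,2} → {0,3}, which contains the accepting state 3.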
A deterministic finite automaton is a special case of an NFA
 No state has an ε-transition
 For each state s and input symbol a there is at most one edge labeled a
leaving s
Each entry in the transition table is a single state
 At most one path exists to accept a string
 The simulation algorithm is simple

δ(0,a) = {0,1}
δ(0,b) = {0}            State | Input a | Input b
δ({0,1},a) = {0,1}      ------+---------+--------
δ({0,1},b) = {0,2}       {0}  |  {0,1}  |  {0}
δ({0,2},a) = {0,1}      {0,1} |  {0,1}  |  {0,2}
δ({0,2},b) = {0,3}      {0,2} |  {0,1}  |  {0,3}
δ({0,3},a) = {0,1}      {0,3} |  {0,1}  |  {0}
δ({0,3},b) = {0}
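The table above is produced by the subset construction: start from {0} and, for each symbol, take the union of the NFA moves out of every state in the set. A sketch (this NFA has no ε-transitions, so no closure step is needed):

```python
delta = {(0, "a"): {0, 1}, (0, "b"): {0},
         (1, "b"): {2}, (2, "b"): {3}}

def subset_construction(start, alphabet):
    start_set = frozenset([start])
    dfa, work = {}, [start_set]
    while work:
        S = work.pop()
        if S in dfa:
            continue                    # already expanded this DFA state
        dfa[S] = {}
        for ch in alphabet:
            target = frozenset().union(
                *(delta.get((q, ch), set()) for q in S))
            dfa[S][ch] = target
            work.append(target)         # new DFA states to expand later
    return dfa

dfa = subset_construction(0, "ab")
```

Running this yields exactly the four DFA states {0}, {0,1}, {0,2}, {0,3} of the table.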
A DFA that accepts (a|b)*abb

start → 0 –a→ 1 –b→ 2 –b→ 3 (accepting)
 on a: states 1, 2, and 3 all go (back) to 1
 on b: states 0 and 3 go to 0
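Because the DFA above has exactly one successor per (state, symbol) pair, its simulation is a plain loop over a transition dictionary:

```python
# The DFA above, with state 3 accepting.
move = {(0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 1, (2, "b"): 3,
        (3, "a"): 1, (3, "b"): 0}

def dfa_accepts(s):
    state = 0
    for ch in s:
        state = move[(state, ch)]   # exactly one next state per symbol
    return state == 3
```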