
06/19/09 Copyright Ankita 1

Lexical Analyzer

Compiler phases: Source program → Lexical analyzer → Syntax analyzer →
Semantic analyzer → Intermediate code generator → Code optimizer →
Code generator → Target program

• Groups sequences of characters into lexemes – the smallest meaningful
entities in a language (keywords, identifiers, constants)
• Characters read from a file are buffered – buffering helps decrease the
latency due to I/O; the lexical analyzer manages the buffer
• Makes use of the theory of regular languages and finite state machines
Parser

• Converts a linear structure – a sequence of tokens – to a hierarchical
tree-like structure – an AST
• The parser imposes the syntax rules of the language
• Work should be linear in the size of the input (else unusable) → type
consistency cannot be checked in this phase
• Deterministic context-free languages and pushdown automata form the basis
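To make the "tree-like structure" concrete, here is a minimal sketch (not from these slides) using a hypothetical nested-tuple AST for the expression 31 + 28*x; the node shapes and names are illustrative only:

```python
# Hypothetical nested-tuple AST for 31 + 28*x.
# Node shapes: ("num", value), ("id", name), (op, left, right).
ast = ("+", ("num", 31), ("*", ("num", 28), ("id", "x")))

def evaluate(node, env):
    """Walk the tree bottom-up; precedence is encoded in its shape."""
    kind = node[0]
    if kind == "num":
        return node[1]
    if kind == "id":
        return env[node[1]]
    left = evaluate(node[1], env)
    right = evaluate(node[2], env)
    return left + right if kind == "+" else left * right
```

With x = 2 this evaluates to 87, i.e. 31 + (28*2): the grouping comes from the tree's shape, not from the linear token order.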
Semantic Analysis

• Calculates the program's "meaning"
• Rules of the language are checked (variable declaration, type checking)
• Type checking is also needed for code generation (code gen for a + b
depends on the types of a and b)
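A minimal sketch of the last point, assuming a toy two-type language; the opcode names are made up for illustration and do not come from any real instruction set:

```python
def add_instruction(type_a, type_b):
    """Pick an add opcode from the operand types (names illustrative)."""
    if type_a == "int" and type_b == "int":
        return "ADD_INT"
    if "float" in (type_a, type_b):
        return "ADD_FLOAT"  # the int operand would be converted first
    raise TypeError(f"cannot add {type_a} and {type_b}")
```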
Intermediate Code Generation

• Makes it easy to port the compiler to other architectures
• Can also be the basis for interpreters
• Enables optimizations that are not machine specific
Intermediate Code Optimization

• Constant propagation, dead code elimination, common sub-expression
elimination, strength reduction, etc.
• Based on dataflow analysis – properties that are independent of
execution paths
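A hedged sketch of constant propagation with folding on a made-up three-address IR (tuples of dest, op, arg1, arg2); real compilers do this over a control-flow graph, but the single-block idea is the same:

```python
# Toy IR: ("x", "const", 4, None) loads a literal; ("z", "add", a, b)
# adds. Known-constant operands are substituted and folded.
def propagate_constants(code):
    consts, out = {}, []
    for dest, op, a1, a2 in code:
        a1 = consts.get(a1, a1)   # replace known-constant operands
        a2 = consts.get(a2, a2)
        if op == "const":
            consts[dest] = a1
        elif op == "add" and isinstance(a1, int) and isinstance(a2, int):
            consts[dest] = a1 + a2
            op, a1, a2 = "const", a1 + a2, None   # fold at compile time
        out.append((dest, op, a1, a2))
    return out

code = [("x", "const", 4, None),
        ("y", "const", 5, None),
        ("z", "add", "x", "y")]
```

On this input the add becomes `("z", "const", 9, None)`; a dead-code pass could then delete x and y if nothing else reads them.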
Native Code Generation

• Intermediate code is translated into native code
• Register allocation, instruction selection

Native Code Optimization

• Peephole optimizations – a small window of instructions is optimized
at a time
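A hedged sketch of one classic peephole pattern on a toy instruction list (the instruction names are invented): a push of a register immediately followed by a pop of the same register cancels out.

```python
def peephole(instrs):
    """Slide a two-instruction window; drop push r / pop r pairs."""
    out = []
    for ins in instrs:
        if out and out[-1] == ("push", ins[1]) and ins[0] == "pop":
            out.pop()          # the pair is a no-op: delete both halves
        else:
            out.append(ins)
    return out

prog = [("load", "r1"), ("push", "r1"), ("pop", "r1"), ("store", "r1")]
```

Running `peephole(prog)` leaves only the load and the store.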
Informal sketch of lexical analysis
 Identifies tokens in the input string

Issues in lexical analysis
 Lookahead
 Ambiguities

Specifying lexers
 Regular expressions
Simplifies the design of the compiler
 LL(1) or LR(1) parsing, where the 1 indicates that one symbol of
lookahead is used to decide the next parsing step

Provides an efficient implementation
 Systematic techniques to implement lexical analyzers by hand or
automatically from specifications
 Stream buffering methods to scan the input

Improves portability
 Non-standard symbols and alternate character encodings can be
normalized
Source Program → Lexical Analyzer → Parser

The parser requests the next token ("get next token"); the lexical
analyzer answers with a (token, tokenval) pair. Either phase may report
an error, and both consult the Symbol Table.
y := 31 + 28*x  →  Lexical analyzer  →

<id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">

Each token may carry a tokenval (token attribute); the token stream is
passed to the parser.
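The token stream above can be produced by a small regex-driven scanner. This is a sketch using Python's `re` module with named groups; the token names follow the slide, and the pattern set is just large enough for this one input:

```python
import re

# One alternative per token class; named groups tell us which matched.
TOKEN_SPEC = [("num", r"\d+"), ("id", r"[A-Za-z]\w*"),
              ("assign", r":="), ("op", r"[+*]"), ("ws", r"\s+")]
SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    for m in SCANNER.finditer(text):
        if m.lastgroup != "ws":        # whitespace yields no token
            yield (m.lastgroup, m.group())

print(list(tokenize("y := 31 + 28*x")))
```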
A token is a classification of lexical units
 For example: id and num
Lexemes are the specific character strings that
make up a token
 For example: abc and 123
Patterns are rules describing the set of lexemes
belonging to a token
 For example: “letter followed by letters and digits” and
“non-empty sequence of digits”

A syntactic category
 In English:
noun, verb, adjective, …

 In a programming language:
Identifier, Integer, Single-Float, Double-Float, operator
(perhaps single or multiple character), Comment,
Keyword, Whitespace, string constant, …

This is the job of the Lexer, the first pass over
the source code. The Lexer classifies program
substrings according to role.

We also tack on to each token the location (line
and column) of where it ends, so we can report
errors in the source code by location.

We read tokens until we reach end-of-file.

Output of lexical analysis is a stream of tokens . . .

This stream is the input to the next pass, the parser.

Parser makes use of token distinctions.

Define a finite set of tokens
 Tokens describe all items of interest
 The choice of tokens depends on the language and the design of the
parser

Useful tokens for a typical expression:
Integer, Keyword, operator, Identifier

Describe which strings belong to each token
category

Recall:
 Identifier: strings of letters or digits, starting with a
letter
 Integer: a non-empty string of digits
 Keyword: “else” or “if” or “begin” or …
 Whitespace: a non-empty sequence of blanks, newlines,
and tabs

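The informal descriptions above can be written as regular expressions. A sketch in Python (anchored with `\Z` so each pattern must match a whole string; the keyword list is abbreviated as on the slide):

```python
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")  # letter, then letters/digits
integer    = re.compile(r"[0-9]+\Z")                # non-empty digit string
keyword    = re.compile(r"(else|if|begin)\Z")       # abbreviated, as above
whitespace = re.compile(r"[ \t\n]+\Z")              # blanks, tabs, newlines
```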
Is it as easy as it sounds?

Not quite!

Look at some history . . .

FORTRAN rule: Whitespace is insignificant

E.g., VAR1 is the same as VA R1

Footnote: the FORTRAN whitespace rule was motivated by the inaccuracy
of punch card operators

The goal of lexical analysis is to
 Partition the input string into lexemes
 Identify the token type, and perhaps the value, of each lexeme

Because of the left-to-right scan, lookahead is sometimes required.

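A minimal sketch of why lookahead is needed, using the `=` vs `==` pair as an assumed example (the mechanism is the same for any prefix-overlapping operators): on seeing `=`, the scanner must peek one character ahead before committing.

```python
def scan_ops(text):
    """Extract only '=' and '==' tokens; everything else is skipped."""
    i, toks = 0, []
    while i < len(text):
        if text[i] == "=" and i + 1 < len(text) and text[i + 1] == "=":
            toks.append("==")
            i += 2             # lookahead let us consume both characters
        elif text[i] == "=":
            toks.append("=")
            i += 1
        else:
            i += 1             # non-operator characters ignored in sketch
    return toks
```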
There are several formalisms for specifying tokens.

Regular languages are the most popular
 Simple and useful theory
 Easy to understand
 Efficient implementations are possible
 Almost powerful enough (can't handle nested comments)
 Popular "almost automatic" tools exist for writing lexers

The standard notation for regular languages is regular expressions.

A single character denotes a set containing one string
 'c' = { "c" }

The epsilon character denotes a set containing one zero-length string
 ε = { "" }

The empty set {} = ∅ is not the same as {ε}
 Size(∅) = 0, Size({ε}) = 1

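Modelling languages as Python sets of strings (a sketch, not part of the slides) makes the ∅-versus-{ε} distinction concrete, including their different behavior under concatenation:

```python
empty_set = set()   # ∅ : contains no strings at all
epsilon   = {""}    # {ε}: contains exactly one string, of length zero
A = {"a", "b"}

# Concatenating with ∅ yields ∅; concatenating with {ε} leaves A alone.
with_empty   = {x + y for x in empty_set for y in A}
with_epsilon = {x + y for x in epsilon for y in A}
```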
Union: if A and B are REs then
 A + B = { s | s ∈ A or s ∈ B }
Concatenation of sets → concatenation of strings
 AB = { ab | a ∈ A and b ∈ B }
Iteration (Kleene closure)
 A* = ∪_{i≥0} A^i, where A^i = A·A···A (i times) and A^0 = {ε}

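The three operations can be sketched on finite languages as Python sets (Kleene closure is infinite in general, so the sketch computes a finite approximation up to a chosen power):

```python
def union(A, B):
    """A + B: strings in either language."""
    return A | B

def concat(A, B):
    """AB: every string of A followed by every string of B."""
    return {a + b for a in A for b in B}

def power(A, i):
    """A^i: A concatenated with itself i times; A^0 = {ε}."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result

def kleene(A, max_i):
    """Finite approximation of A* = ∪_{i≥0} A^i, up to i = max_i."""
    star = set()
    for i in range(max_i + 1):
        star |= power(A, i)
    return star
```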
Translate regular expressions to an NFA
Translate the NFA to an efficient DFA (this second step is optional)

regular expressions → NFA → DFA

Either simulate the NFA to recognize tokens, or simulate the DFA to
recognize tokens.
An NFA is a 5-tuple (S, Σ, δ, s0, F) where
 S is a finite set of states
 Σ is a finite set of symbols, the alphabet
 δ is a mapping from S × (Σ ∪ {ε}) to subsets of S
 s0 ∈ S is the start state
 F ⊆ S is the set of accepting (or final) states
An NFA can be diagrammatically represented by a labeled directed graph
called a transition graph. For the NFA below:

 S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}
 start → 0 –a→ 1 –b→ 2 –b→ 3
 with self-loops on state 0 for both a and b
The mapping δ of an NFA can be represented in a transition table

δ(0,a) = {0,1}        State | Input a | Input b
δ(0,b) = {0}          ------+---------+--------
δ(1,b) = {2}            0   |  {0,1}  |  {0}
δ(2,b) = {3}            1   |         |  {2}
                        2   |         |  {3}

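Simulating an NFA means tracking the *set* of states reachable after each input symbol. A sketch driven directly by the transition table above (start state 0, accepting state 3):

```python
delta = {(0, "a"): {0, 1}, (0, "b"): {0},
         (1, "b"): {2}, (2, "b"): {3}}
start, accepting = 0, {3}

def nfa_accepts(s):
    states = {start}
    for ch in s:
        # Union of all moves from every currently-possible state.
        states = set().union(*(delta.get((q, ch), set()) for q in states))
    return bool(states & accepting)
```

On "abb" the state sets evolve {0} → {0,1} → {0,2} → {0,3}, which contains the accepting state 3.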
A deterministic finite automaton is a special case of an NFA
 No state has an ε-transition
 For each state s and input symbol a there is at most one edge labeled a
leaving s
Each entry in the transition table is a single state
 At most one path exists to accept a string
 The simulation algorithm is simple

δ(0,a) = {0,1}
δ(0,b) = {0}            State | Input a | Input b
δ({0,1},a) = {0,1}      ------+---------+--------
δ({0,1},b) = {0,2}       {0}  |  {0,1}  |  {0}
δ({0,2},a) = {0,1}      {0,1} |  {0,1}  |  {0,2}
δ({0,2},b) = {0,3}      {0,2} |  {0,1}  |  {0,3}
δ({0,3},a) = {0,1}      {0,3} |  {0,1}  |  {0}
δ({0,3},b) = {0}
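The table above is produced by the subset construction: start from {0} and, for each symbol, take the union of the NFA moves out of every state in the set. A sketch (this NFA has no ε-transitions, so no closure step is needed):

```python
delta = {(0, "a"): {0, 1}, (0, "b"): {0},
         (1, "b"): {2}, (2, "b"): {3}}

def subset_construction(start, alphabet):
    start_set = frozenset([start])
    dfa, work = {}, [start_set]
    while work:
        S = work.pop()
        if S in dfa:
            continue                    # already expanded this DFA state
        dfa[S] = {}
        for ch in alphabet:
            target = frozenset().union(
                *(delta.get((q, ch), set()) for q in S))
            dfa[S][ch] = target
            work.append(target)         # new DFA states to expand later
    return dfa

dfa = subset_construction(0, "ab")
```

Running this yields exactly the four DFA states {0}, {0,1}, {0,2}, {0,3} of the table.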
A DFA that accepts (a|b)*abb

start → 0 –a→ 1 –b→ 2 –b→ 3 (accepting)
 on a: states 1, 2, and 3 all go (back) to 1
 on b: states 0 and 3 go to 0
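Because the DFA above has exactly one successor per (state, symbol) pair, its simulation is a plain loop over a transition dictionary:

```python
# The DFA above, with state 3 accepting.
move = {(0, "a"): 1, (0, "b"): 0,
        (1, "a"): 1, (1, "b"): 2,
        (2, "a"): 1, (2, "b"): 3,
        (3, "a"): 1, (3, "b"): 0}

def dfa_accepts(s):
    state = 0
    for ch in s:
        state = move[(state, ch)]   # exactly one next state per symbol
    return state == 3
```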