Compiler Design (CD) : Lab Assignment 1
Roll no. 62
Batch 3
TY-CSA
PRN: 12111387
Theory:
Lexical Analysis
Lexical analysis is the first phase of compiling a program. It involves breaking the
input text into a sequence of tokens, which are meaningful units such as keywords,
identifiers, literals, and punctuation symbols. The lexical analyser reads the input
character stream and groups characters into tokens based on predefined patterns.
This conversion of the input character stream into a stream of tokens is known as
tokenization. Tokens can be classified into identifiers, keywords, operators,
constants, and special characters.
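For instance, for the statement int count = x + 10; a lexical analyser would emit
the tokens int (keyword), count (identifier), = (operator), x (identifier),
+ (operator), 10 (constant), and ; (special character).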
• Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.
• Tokenization: This is the process of breaking the input text into a sequence
of tokens. This is usually done by matching the characters in the input text
against a set of patterns or regular expressions that define the different types
of tokens.
• Token classification: In this stage, the lexer determines the type of each
token. For example, in a programming language, the lexer might classify
keywords, identifiers, operators, and punctuation symbols as separate token
types.
• Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
• Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens
can then be passed to the next stage of compilation or interpretation.
Working of Lex:
Declarations
%%
Translation rules
%%
Auxiliary procedures
A. Declarations:
The declarations part contains declarations of variables, manifest constants, and
regular definitions, along with any C code enclosed between %{ and %}.
B. Translation Rules:
The translation rules are pairs of the form
r.e.1 {action1}
r.e.2 {action2}
....
r.e.n {actionn}
Each action is C code to be carried out when the corresponding regular expression
matches the input. Think event-driven programming. For example:
[Ii][Ff]  { return(IF); }
{id}      { yylval = storeId(yytext, yyleng); return(ID); }
{snum}    { yylval = storeNum(yytext, yyleng, atoi(yytext), INTEGER); return(CON); }
Processing continues through the list of rules until a return() is executed. When
more than one rule can match the input, the longest match is taken.
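For example, on the input <= a rule for the two-character operator is preferred
over a rule matching < alone, and ifx is matched as a single identifier rather
than as the keyword if followed by x.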
C. Auxiliary Functions
Auxiliary functions can be added to the lex.l file. These functions contain
additional C code that the actions of the translation rules can call, keeping the
code in the translation rules simple. They can also be compiled separately; they
need not be in the Lex file, since the generated scanner is typically linked with
other files anyway.
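For illustration, an auxiliary function such as the storeId called in the rules
above could be sketched as follows; the fixed-size table, its dimensions, and the
choice of returning a table index are assumptions of this sketch (and <string.h>
is assumed to be included in the declarations part):
/* Hypothetical symbol table for this sketch. */
char symtab[100][50];
int symcount = 0;

int storeId(char *text, int len)
{
    strncpy(symtab[symcount], text, len);  /* copy the lexeme */
    symtab[symcount][len] = '\0';          /* terminate it */
    return symcount++;                     /* attribute value: the table index */
}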
Code:
%{
#include <stdio.h>
int cc = 0, wc = 0, sc = 0, lc = 0;  /* characters, words, spaces, lines */
%}
%%
[a-zA-Z0-9]+ { wc++; cc += yyleng; } /* a word and its characters */
[ ]          { sc++; }
[\t]         { sc += 4; }            /* count a tab as four spaces */
[\n]         { lc++; }
.            { cc++; }               /* any other visible character */
%%
int main(){
    yylex();  /* reads from standard input until end of file */
    printf("Characters: %d\nWords: %d\nSpaces: %d\nLines: %d\n", cc, wc, sc, lc);
    return 0;
}
int yywrap(){
    return 1;
}
Input File:
Output:
Conclusion:
Lex and Flex are powerful tools for generating lexical analysers, making them essential in the
development of compilers, interpreters, and other language processing tools. By defining
lexical rules using regular expressions, Lex/Flex allows developers to efficiently tokenize
input text, enabling various text processing tasks such as counting characters, words, and
lines as demonstrated in this assignment. It showcases the ease and flexibility of Lex/Flex in
handling such tasks efficiently.
Gauri Choudhari
Roll no. 62
Batch 3
TY-CSA
PRN: 12111387
Theory:
Tokenization in compiler design is the process of breaking down a sequence of
characters (source code) into meaningful units called tokens. Tokens are the
smallest units of a program that have meaning, such as keywords, identifiers, literals,
and operators.
Purpose of Tokenization:
Tokenization serves as the first step in the compilation process of a programming
language. It involves dividing the input source code into tokens, which are the
smallest meaningful units such as keywords, identifiers, literals, operators, and
punctuation symbols. These tokens serve as building blocks for subsequent phases
of compilation, such as parsing, semantic analysis, and code generation.
Tokenization enables the compiler or interpreter to understand the structure and
semantics of the program and perform various analyses and optimizations.
Tokenization of C Programs:
Tokenizing C programs using Lex/Flex involves defining rules to recognize various
tokens in the C language. These tokens include keywords (e.g., int, if, while),
identifiers (e.g., variable names, function names), literals (e.g., integers, strings),
operators (e.g., arithmetic, logical, relational), and punctuation symbols (e.g., braces,
semicolons).
Code:
%{
#include <stdio.h>
#include <string.h>
struct tab
{
    char lexeme[50];
    char type[50];
};
/* One table and counter per token class. */
struct tab keyword[100], operators[100], specialsymbol[100], constants[100], identifier[100];
int counter_keyword = 0, counter_opr = 0, counter_ss = 0, counter_cons = 0, counter_id = 0;
%}
%%
"int"|"char"|"string"|"float"|"double"|"long"|"long long"|"struct"|"union" {
    strcpy(keyword[counter_keyword].lexeme, yytext);
    strcpy(keyword[counter_keyword].type, "datatype");
    counter_keyword++; }
("int"|"char"|"string"|"float"|"double"|"long"|"long long"|"struct"|"union")[ ]*"*" {
    strcpy(keyword[counter_keyword].lexeme, yytext);
    strcpy(keyword[counter_keyword].type, "pointer");
    counter_keyword++; }
"return"|"void"|"null"|"if"|"else"|"if else"|"break"|"continue"|"switch"|"while"|"for" {
    strcpy(keyword[counter_keyword].lexeme, yytext);
    strcpy(keyword[counter_keyword].type, "keyword");
    counter_keyword++; }
"=="|"!="|"<"|">"|"<="|">=" {
    strcpy(operators[counter_opr].lexeme, yytext);
    strcpy(operators[counter_opr].type, "Comparison Operator");
    counter_opr++; }
"+"|"-"|"*"|"/"|"%"|"=" {
    strcpy(operators[counter_opr].lexeme, yytext);
    strcpy(operators[counter_opr].type, "Mathematical Operator");
    counter_opr++; }
";"|","|"#"|"&" {
    strcpy(specialsymbol[counter_ss].lexeme, yytext);
    strcpy(specialsymbol[counter_ss].type, "Special Symbol");
    counter_ss++; }
"("|")"|"{"|"}"|"["|"]" {
    strcpy(specialsymbol[counter_ss].lexeme, yytext);
    strcpy(specialsymbol[counter_ss].type, "Parenthesis");
    counter_ss++; }
["].*["] {
    strcpy(constants[counter_cons].lexeme, yytext);
    strcpy(constants[counter_cons].type, "String");
    counter_cons++; }
[0-9]+ {
    strcpy(constants[counter_cons].lexeme, yytext);
    strcpy(constants[counter_cons].type, "Integer");
    counter_cons++; }
[a-zA-Z_][a-zA-Z0-9_]* {
    strcpy(identifier[counter_id].lexeme, yytext);
    strcpy(identifier[counter_id].type, "identifier");
    counter_id++; }
[ \t\n]+ { /* skip whitespace */ }
.        { }
%%
int main()
{
    int itr;
    yylex();  /* tokenize the program supplied on standard input */
    printf("KEYWORDS AND DATATYPES:\n");
    for (itr = 0; itr < counter_keyword; itr++)
        printf("%s\t%s\n", keyword[itr].lexeme, keyword[itr].type);
    fflush(stdout);
    printf("OPERATORS:\n");
    for (itr = 0; itr < counter_opr; itr++)
        printf("%s\t%s\n", operators[itr].lexeme, operators[itr].type);
    fflush(stdout);
    printf("SPECIAL SYMBOLS:\n");
    for (itr = 0; itr < counter_ss; itr++)
        printf("%s\t%s\n", specialsymbol[itr].lexeme, specialsymbol[itr].type);
    fflush(stdout);
    printf("CONSTANTS:\n");
    for (itr = 0; itr < counter_cons; itr++)
        printf("%s\t%s\n", constants[itr].lexeme, constants[itr].type);
    fflush(stdout);
    printf("IDENTIFIERS:\n");
    for (itr = 0; itr < counter_id; itr++)
        printf("%s\t%s\n", identifier[itr].lexeme, identifier[itr].type);
    return 0;
}
int yywrap()
{
    return 1;
}
Input File:
Output:
Conclusion:
The tokenization of C programs using Lex/Flex demonstrates the lexical analysis
phase in compiler design. By breaking down the code into tokens, Lex/Flex facilitates
various compilation tasks such as parsing, semantic analysis, and code generation.
This assignment provides hands-on experience in using Lex/Flex to tokenize C
programs, reinforcing compiler construction concepts and techniques.
Gauri Choudhari
Roll no. 62
Batch 3
TY-CSA
PRN: 12111387
Title: Write a Lex program to convert all uppercase letters to lowercase and to
compute the summation of digits whenever a number is found in the file.
Theory:
Text processing and numerical operations are fundamental aspects of computer
programming, essential for various applications such as data parsing, analysis, and
transformation. In this theoretical exploration, we delve into the design and
implementation of a Lex program to convert uppercase letters to lowercase and
calculate the summation of digits in a given text file.
Lexical Analysis:
The first step is to define lexical rules using regular expressions to identify uppercase
letters and numeric digits within the input text. Regular expressions provide a
concise and powerful mechanism for pattern matching, allowing us to specify
complex patterns with ease.
Tokenization:
Once the lexical rules are defined, Lex breaks down the input text into tokens based
on the specified patterns. Tokens represent meaningful units of text, such as words,
numbers, or punctuation symbols. For this problem, tokens correspond to uppercase
letters and numeric digits.
Action Execution:
Upon identifying tokens corresponding to uppercase letters, the Lex program
executes actions to convert them to lowercase. Similarly, when numeric digits are
encountered, the program computes the summation of digits and accumulates the
result.
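For example, given the input line Hello World 123, the program would produce
hello world 6, since H and W are shifted to lowercase and the digits sum to
1 + 2 + 3 = 6.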
Regular Expressions:
Regular expressions consist of symbols and operators that represent
sequences of characters or character classes. By constructing appropriate regular
expressions, we can identify uppercase letters and numeric digits within the input
text efficiently.
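For this program, two patterns suffice: [A-Z] matches a single uppercase letter,
and [0-9]+ matches a maximal run of consecutive digits; every other character can
simply be echoed unchanged.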
Lexical Tokens:
In Lex programming, each token corresponds to a specific pattern defined by regular
expressions and triggers corresponding actions when matched. By tokenizing the
input text, Lex breaks down complex input sequences into manageable units,
facilitating further processing and analysis.
Code:
%{
#include <stdio.h>
#include <stdlib.h>
int num, sum;
%}
%%
[A-Z] { printf("%c", yytext[0] + 32); }  /* shift uppercase to lowercase */
[0-9]+ {
    num = atoi(yytext);
    sum = 0;
    while (num != 0)
    {
        sum += num % 10;  /* add the last digit */
        num /= 10;        /* drop the last digit */
    }
    printf("%d", sum);    /* replace the number with its digit sum */
}
%%
int main()
{
    yylex();  /* reads standard input; unmatched characters are echoed unchanged */
    return 0;
}
int yywrap()
{
    return 1;
}
Input:
Output: Modified File
Conclusion:
This Lex program effectively achieves its goals of converting uppercase letters to
lowercase and computing the summation of digits in a given text file. By utilizing
simple rules and actions defined in the Lex syntax, the program seamlessly
processes the input text, providing a lowercase version while also calculating the
sum of digits whenever numbers are encountered. This program demonstrates the
versatility and efficiency of Lex in performing text processing tasks, making it a
valuable tool for various applications requiring lexical analysis and manipulation of
textual data.
Gauri Choudhari
Roll no. 62
Batch 3
TY-CSA
PRN: 12111387
Theory:
Parsing:
Parsing is the process of analysing a linear sequence of tokens to determine its
grammatical structure. It is carried out by the parser, the component of the
translator that organises the linear token stream according to a set of defined
rules known as a grammar.
YACC
YACC (Yet Another Compiler-Compiler) is an LALR(1) parser generator: it performs a
Left-to-right scan of the input and produces a Rightmost derivation in reverse,
using one token of LookAhead. YACC was originally designed to be complemented by Lex.
Working of YACC
• Yacc reads a grammar file that specifies the language's syntax rules.
• It generates a parser based on this grammar.
• The generated parser reads tokens produced by Lex.
• It constructs a parse tree or performs actions based on the grammar rules.
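The two generated files are then compiled and linked together; a typical build
(file names here are illustrative) looks like:
yacc -d parser.y   # emits y.tab.c and the token definitions in y.tab.h
lex scanner.l      # emits lex.yy.c
cc y.tab.c lex.yy.c -o parser
With the GNU tools, the equivalent commands are bison -d and flex.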
Input File: YACC input file is divided into three parts.
/* definitions */
....
%%
/* rules */
....
%%
/* auxiliary routines */
....
Definition Part:
The definition part includes information about the tokens used in the syntax
definition:
%token NUMBER
%token ID
• Yacc automatically assigns numbers for tokens, but this can be overridden by
%token NUMBER 621
• Yacc also recognizes single characters as tokens. Therefore, explicitly
assigned token numbers should not overlap the ASCII codes.
• The definition part can include C code external to the definition of the
parser and variable declarations, within %{ and %} in the first column.
• It can also include the specification of the starting symbol in the
grammar:
%start nonterminal
Rule Part:
• The rules part contains grammar definitions in a modified BNF form.
• Actions are C code enclosed in { } and can be embedded inside the rules
(translation schemes), as in the example below.
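For example, a rule with an embedded action might look like this (expr and term
are hypothetical nonterminals, not part of this assignment's grammar):
expr : expr '+' term { $$ = $1 + $3; }
     | term          { $$ = $1; }
     ;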
Code:
A. Lex File
%{
#include "Feb22_yacc.tab.h"
%}
%%
He|She|They|I|You|he|she|they|you|we|We|It|it { return PRON; }
is|was|are|writes|write|reads|read|watch|watches|study|learn|studies|eat|likes { return VERB; }
girl|boy|teacher|books|TV|email|college|engineering|food { return NOUN; }
and|but|or|because|so|since|though { return CONJ; }
[ \t\n]+ { /* skip whitespace */ }
.        { /* ignore anything else */ }
%%
int yywrap() {
return 1;
}
B. Yacc File
%{
#include <stdio.h>
FILE *yyin;
int yylex();
void yyerror();
%}
%token PRON VERB NOUN CONJ
%%
/* A simple statement is subject-verb-object; a compound statement
   joins two simple statements with a conjunction. */
text: /* empty */
    | text sentence
    ;
sentence: simple    { printf("Simple statement\n"); }
        | compound  { printf("Compound statement\n"); }
        ;
simple: subject VERB object
      ;
compound: simple CONJ simple
        ;
subject: PRON
       | NOUN
       ;
object: NOUN
      ;
%%
int main()
{
yyin = fopen("feb22_input.txt", "r");
while(!feof(yyin))
{
yyparse();
}
fclose(yyin);
}
void yyerror(char *s)
{
printf("%s", s);
}
Input:
Output:
Conclusion:
This assignment successfully distinguishes between simple and compound
statements using language processing and parsing. By integrating lexical
analysis (Lex) and syntax analysis (Yacc), we have created a robust tool capable of
accurately identifying the structural complexity of input sentences. By defining lexical
rules in Lex to tokenize input text and specifying grammar rules in Yacc to parse
syntactic structures, our program effectively distinguishes between simple
statements, consisting of a single clause or expression, and compound statements,
composed of multiple clauses or expressions connected by conjunctions or
punctuation.
Gauri Choudhari
Roll no. 62
Batch 3
TY-CSA
PRN: 12111387
Theory:
Code Optimization
The code optimization in the synthesis phase is a program transformation
technique which tries to improve the intermediate code by making it consume
fewer resources (e.g., CPU cycles, memory) so that faster-running machine code
results. The compiler's optimizing process should meet the following objectives:
• The optimization must be correct; it must not, in any way, change the
meaning of the program.
• Optimization should increase the speed and performance of the program.
• The compilation time must be kept reasonable.
• The optimization process should not delay the overall compiling process.
Optimization of the code is often performed at the end of the development stage
since it reduces readability and adds code that is used to increase the
performance.
Need of Optimization:
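Consider, as a small illustration (the variable and temporary names are chosen
for the example), a fragment whose intermediate code computes the same value twice:
/* Before optimization: b * c is computed twice. */
t1 = b * c;
a  = t1 + d;
t2 = b * c;     /* same operator and operands as t1: redundant */
e  = t2 + f;

/* After common-subexpression elimination: */
t1 = b * c;
a  = t1 + d;
e  = t1 + f;    /* reuses t1 instead of recomputing */
Eliminating the second computation of b * c saves a multiplication on every
execution of this fragment; this is precisely the common-subexpression
elimination implemented by the optimizer below.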
Code:
A. Lex Code
%{
#include "Mar28_Yacc.tab.h"
%}
%%
[0-9]+    { yylval.symbol = (char)yytext[0]; return NUMBER; } /* only the first digit is kept as the symbolic operand */
[a-zA-Z]+ { yylval.symbol = (char)yytext[0]; return ID; }     /* likewise, identifiers are tracked by their first letter */
[-+*/=();] { return yytext[0]; }
[ \t\n] { }
. { }
%%
B. Yacc Code
%{
#include<stdio.h>
#include<stdlib.h>
int yylex();
void yyerror();
char temp ='A'-1;
int index1=0;
char addtotable(char, char, char);
struct expr{
char operand1;
char operand2;
char operator;
char result;
};
%}
%union
{
char symbol;
}
%left '+''-'
%left '*''/'
%token <symbol> NUMBER ID
%type <symbol> exp
%%
st: ID '=' exp ';' {addtotable((char)$1,(char)$3,'='); YYACCEPT;};
exp: exp '+' exp {$$ = addtotable((char)$1,(char)$3,'+');}
|exp '-' exp {$$ = addtotable((char)$1,(char)$3,'-');}
|exp '/' exp {$$ = addtotable((char)$1,(char)$3,'/');}
|exp '*' exp {$$ = addtotable((char)$1,(char)$3,'*');}
|'(' exp ')' {$$ = (char)$2;}
|NUMBER {$$ = (char)$1;}
|ID {$$=(char)$1;};
%%
struct expr table[20];
void printTable()
{
    int i;
    for (i = 0; i < index1; i++)
    {
        if (table[i].operator == '!') continue; /* skip entries removed by the optimizer */
        printf("%c:= %c %c %c\n", table[i].result, table[i].operand1,
               table[i].operand2, table[i].operator);
    }
}
void optim()
{
    int i, j, z;
    for (i = 0; i < index1; i++)
        for (j = i + 1; j < index1; j++)
        {
            /* Two entries with the same operator and operands compute the
               same value: a common subexpression. */
            if (table[i].operator == table[j].operator &&
                table[i].operand1 == table[j].operand1 &&
                table[i].operand2 == table[j].operand2)
            {
                /* Redirect later uses of the duplicate's result to the
                   earlier result, then mark the duplicate as dead ('!'). */
                for (z = j + 1; z < index1; z++)
                {
                    if (table[z].operand1 == table[j].result)
                        table[z].operand1 = table[i].result;
                    if (table[z].operand2 == table[j].result)
                        table[z].operand2 = table[i].result;
                }
                table[j].operator = '!';
            }
        }
}
/* Records one three-address entry and returns its temporary name. */
char addtotable(char a, char b, char op)
{
    temp++;  /* next temporary: A, B, C, ... */
    table[index1].operand1 = a;
    table[index1].operand2 = b;
    table[index1].operator = op;
    table[index1].result = temp;
    index1++;
    return temp;
}
int main()
{
    temp = 'A' - 1;
    printf("Enter the expression\n");
    yyparse();
    printTable();
    optim();
    printf("After Optimization\n");
    printTable();
    return 0;
}
void yyerror(char *s)
{
    printf("%s\n", s);
}
int yywrap()
{
    return 1;
}
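As an illustrative walk-through (the input expression here is chosen for
demonstration), running the program on a = x * y + x * y; records one table
entry per addtotable() call and should print along these lines:
Enter the expression
a = x * y + x * y;
A:= x y *
B:= x y *
C:= A B +
D:= a C =
After Optimization
A:= x y *
C:= A A +
D:= a C =
Entry B duplicates A (same operator and operands), so optim() redirects C's
second operand from B to A and marks B with '!', which printTable() then skips.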
Output:
Conclusion:
The code optimizer for a subset of the C programming language has been developed
in this assignment. By implementing these optimizations within the constraints of the
C subset, we have demonstrated the ability to improve code performance without
sacrificing correctness or portability. The optimizer analyses code structures, identifies
optimization opportunities, and applies transformation rules to generate optimized
code that exhibits improved runtime behaviour and resource utilization. Optimized
code not only enhances application performance but also reduces energy
consumption and improves scalability, making it crucial for a wide range of computing
platforms, including embedded systems, mobile devices, and high-performance
computing clusters.
Gauri Choudhari
Roll no. 62
Batch 3
TY-CSA
PRN: 12111387
Theory:
Code Generation
Code generation is the final phase in the compilation process, where the
compiler translates the intermediate representation (IR) of the source code into
target machine code or another intermediate representation suitable for execution on
the target platform. This phase bridges the gap between the platform-independent
representation of the source code and the platform-specific machine instructions.
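For instance, a single intermediate instruction such as t1 = a + b might be
lowered to target instructions of the following form (the register names and
mnemonics are illustrative, not a specific instruction set):
LOAD  R1, a    ; bring operand a into a register
ADD   R1, b    ; add operand b
STORE t1, R1   ; write the result back to t1
The code generator in this assignment takes a simpler route: rather than
emitting assembly, it re-emits a compilable C program from the parsed input,
demonstrating the same translation idea.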
Code:
A. Lex Code
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "Apr04_yacc.tab.h"
%}
%%
"int"|"char"|"string"|"float"|"if"|"else"|"else if" { yylval.expr =
strdup(yytext); return KEYWORD; }
"printf""(".*");" { yylval.expr = strdup(yytext); return INBUILT; }
[0-9]+ { yylval.num = atoi(yytext); return NUMBER; }
[a-zA-Z]+ { yylval.var = strdup(yytext); return VARIABLE; }
"=="|"!="|"<"|">"|"<="|">=" { yylval.var = strdup(yytext); return
COMPARISON; }
[-+*/=();{}%] { return yytext[0]; }
[ \t\n] { /* ignore whitespace */ }
. { }
%%
B. Yacc Code
%{
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern FILE *yyin;
char *buffer;
int i = 0;
int yylex();
void yyerror();
%}
%union
{
int num;
char* var;
char* expr;
}
%left '+''-'
%left '*''/'
%token <num> NUMBER
%token <var> VARIABLE COMPARISON BOOLOPR
%token <expr> KEYWORD INBUILT
%type <expr> exp st finalst condition condblock
%%
finalst: finalst st {
printf("%s", $2);
}
| {}
;
st: VARIABLE '=' exp ';' {
    buffer = (char *)malloc(strlen($1) + strlen($3) + 7); /* " = " + ";" + "\n" + NUL */
    sprintf(buffer, "%s = %s;\n", $1, $3);
    $$ = buffer;
}
| condblock {
    buffer = (char *)malloc(strlen($1) + 2); /* "\n" + NUL */
    sprintf(buffer, "%s\n", $1);
    $$ = buffer;
}
| KEYWORD {
    buffer = (char *)malloc(strlen($1) + 2); /* " " + NUL */
    sprintf(buffer, "%s ", $1);
    $$ = buffer;
}
| INBUILT {
    buffer = (char *)malloc(strlen($1) + 2); /* "\n" + NUL */
    sprintf(buffer, "%s\n", $1);
    $$ = buffer;
}
| '{' {
    buffer = (char *)malloc(4);
    sprintf(buffer, "{ \n");
    $$ = buffer;
}
| '}' {
    buffer = (char *)malloc(4);
    sprintf(buffer, "} \n");
    $$ = buffer;
}
;
/* exp, condition and condblock: a minimal completion matching the %type
   declarations above; the string-building style mirrors the st rule. */
exp: exp '+' exp {
    buffer = (char *)malloc(strlen($1) + strlen($3) + 4);
    sprintf(buffer, "%s + %s", $1, $3);
    $$ = buffer;
}
| exp '-' exp {
    buffer = (char *)malloc(strlen($1) + strlen($3) + 4);
    sprintf(buffer, "%s - %s", $1, $3);
    $$ = buffer;
}
| exp '*' exp {
    buffer = (char *)malloc(strlen($1) + strlen($3) + 4);
    sprintf(buffer, "%s * %s", $1, $3);
    $$ = buffer;
}
| exp '/' exp {
    buffer = (char *)malloc(strlen($1) + strlen($3) + 4);
    sprintf(buffer, "%s / %s", $1, $3);
    $$ = buffer;
}
| NUMBER {
    buffer = (char *)malloc(12);
    sprintf(buffer, "%d", $1);
    $$ = buffer;
}
| VARIABLE { $$ = $1; }
;
condition: exp COMPARISON exp {
    buffer = (char *)malloc(strlen($1) + strlen($2) + strlen($3) + 3);
    sprintf(buffer, "%s %s %s", $1, $2, $3);
    $$ = buffer;
}
;
condblock: KEYWORD '(' condition ')' {
    buffer = (char *)malloc(strlen($1) + strlen($3) + 4);
    sprintf(buffer, "%s (%s)", $1, $3);
    $$ = buffer;
}
;
%%
int main()
{
printf("#include<stdio.h>\n\n");
printf("int main(){\n");
yyin = fopen("apr04_input.txt", "r");
while(!feof(yyin))
{
yyparse();
}
fclose(yyin);
printf("return 0;\n}");
return 0;
}
Input:
Output:
Conclusion:
This assignment successfully demonstrates code generation for C language.
Throughout the assignment, we have explored the stages involved: tokenization,
parsing, and emission of the target program. By implementing these stages,
the code generator can produce executable code that faithfully reflects the original
program's behavior within the constraints of the C/C++ subset. This assignment
highlights the essential role of code generation in compiler design and software
development.