Compiler Construction (CS354)
Compiler Construction: Front End
Parser: Context Free Grammar
Lecture 12
Imran RAO
imranrao@gmail.com

Reference Slide: Compiler Phases

Front-End: The front-end of a compiler deals with source code analysis. It is responsible for understanding and checking the source code and preparing it for the later stages. The key processes are:
o Lexical Analysis: Converts code into tokens.
o Syntax Analysis: Constructs a syntax tree based on grammatical rules.
o Semantic Analysis: Checks for semantic errors and annotates the syntax tree.
o Language Specificity: The front-end is typically language-specific, as it must understand the syntax and semantics of the source language.

Middle-End: Also known as the optimizer, the middle-end transforms the program's intermediate representation (IR). Its goal is to improve the program's performance and efficiency without altering its semantics. The key processes are:
o Optimization: Performs various optimizations on the IR, such as loop optimization, dead code elimination, and constant folding.
o Language Neutrality: The middle-end is generally language-neutral, working with the intermediate representation rather than language-specific constructs.

Back-End: The back-end of a compiler is responsible for generating the target code (often machine code) and performing optimizations specific to the target machine. The key processes are:
o Code Generation: Converts the IR to target machine code.
o Machine-Specific Optimization: Tailors the code to use the hardware efficiently, for example through register allocation and instruction scheduling.
o Target Specificity: The back-end is specific to the target machine, as it needs to understand the details of the hardware architecture.

CS354 Compiler Construction, Dr. Imran RAO, BLUE BRACKETS

Reference Slide: DFA

A DFA (deterministic finite automaton) is a mathematical model of a finite state machine that reads an input string and determines whether it belongs to a specific language.
The formal definition of a DFA is a 5-tuple (Q, Σ, δ, q0, F), where each element represents the following:
Q: A finite set of states. This set includes all possible states that the DFA can be in.
Σ (Sigma): A finite alphabet. It represents the set of all input symbols that the DFA can process.
δ (Delta), the transition function: A function that takes two arguments, a state from Q and an input symbol from Σ, and returns a state from Q. It defines how the automaton moves from one state to another based on the input symbol.
q0: The initial state. This is the state where the DFA starts processing the input string. q0 is an element of Q.
F: A set of final or accepting states. F is a subset of Q. If the DFA ends up in one of these states after processing the entire input string, the string is considered to be accepted by the DFA.

In simpler terms, a DFA reads a string of symbols from the alphabet, starting from the initial state q0. For each symbol, the DFA uses the transition function δ to determine the next state. If, after processing the entire string, the DFA is in one of the final states in F, the string is accepted; otherwise, it is rejected.
The deterministic aspect of a DFA means that for each state and input symbol, there is exactly one defined transition.

Reference Slide: Terminals & Non Terminals

Here is how they fit into the broader framework of the terminology we discussed:
Alphabet: The set of all basic symbols from which strings are formed. This includes both terminals and non-terminals as fundamental elements.
Lexeme: In the context of programming languages, a lexeme can be a terminal. For instance, a keyword or an operator in a programming language is a lexeme and also a terminal.
Token: Similar to lexemes, tokens are often the categorized representation of terminals. In parsing, tokens are the elements of the language's vocabulary (terminals) that are recognized by the lexical analyzer.
Syntax: This is where non-terminals play a significant role. Syntax rules (or grammar) define how terminals and non-terminals can be combined to form valid strings or sentences in a language. Non-terminals in a grammar represent abstract syntactic categories or structures, like expressions or statements, that can be further expanded into sequences of terminals and other non-terminals.
String: In formal language theory, a string is any finite sequence of symbols from the alphabet, which may include both terminals and non-terminals. However, in the context of a specific language defined by a grammar, strings usually refer to sequences of terminals.
Sentence: A sentence is a string that is derived from the start symbol (a non-terminal) of a grammar and consists entirely of terminals. It represents a valid construction in the language according to its syntax rules.
Semantics: Semantics assign meaning to sentences. In a grammar, this involves interpreting the sequences of terminals (and the structures represented by non-terminals) according to the rules of the language.
Language: In formal language theory, a language is a set of sentences. Each sentence is a string of terminals derived from the start symbol according to the syntax rules.

Languages and Grammars

* A grammar is just a way of describing a language.
* There are actually an infinite number of grammars for a particular language.
* Two grammars are equivalent if they describe the same language.
o This becomes extremely important when parsing top-down.
o Most programming language manuals contain a grammar in BNF or EBNF, which we may modify to fit our parsing method better.

Chomsky Hierarchy

The Chomsky Hierarchy, developed by Noam Chomsky, is a classification of different types of formal grammars which generate different types of formal languages.
This hierarchy is important in the fields of computer science, linguistics, and mathematics, particularly in the areas of language theory, automata theory, and compiler design.
It categorizes grammars into four levels. Each level in the hierarchy includes all levels below it. From Type 3 to Type 0, the hierarchy shows increasing expressive power and decreasing computational simplicity.
o Type 3: computationally the simplest and the least powerful in expressiveness
o Type 0: the most powerful in expressiveness and computationally the least simple

Chomsky Hierarchy

Type 0: Unrestricted Grammar (Recognized by Turing Machine)
Type 1: Context Sensitive Grammar (Accepted by Linear Bounded Automata)
Type 2: Context Free Grammar (Accepted by Pushdown Automata)
Type 3: Regular Grammar (Accepted by Finite Automata)

Type 0 - Recursively Enumerable Grammars

* Flexibility: Type 0 grammars are the most general and powerful. They have no restrictions on the form of their production rules.
* Production Rule Example: aAbBc ::= xyyZzA
* Explanation:
o In this type of grammar, both sides of a production rule can contain any mix of terminals and non-terminals in any order.
o The string on the left can be transformed into the string on the right by applying the production rule.
o Recursively enumerable grammars can generate any language that a Turing machine can recognize, including all possible programming languages and even some languages that are not computationally decidable.

Type 1 - Context-Sensitive Grammars

* Restrictions: The productions must not decrease the length of the string, and each rule must have at least one non-terminal on the left side.
* Production Rule Example: aA ::= aXa
* Explanation:
o Here, the single non-terminal A (in the context of a preceding a) is replaced by the longer string Xa, which includes the non-terminal X.
o In this grammar type, the length of the string on the right must be greater than or equal to the length of the string on the left.
o This ensures that the string never gets shorter during a derivation. The name "context-sensitive" reflects that a non-terminal may be rewritten only when it appears in a particular surrounding context.
o This type of grammar is more restrictive than Type 0 and is used for some natural language constructs, but it is less common for programming languages due to its complexity.
* There are some context-sensitive aspects to programming languages, like variable declaration before use or type checking.

Type 2 - Context-Free Grammars

* Single Non-terminal: The left side of the production rule must consist of a single non-terminal.
* Production Rule Example: A ::= aBb
* Explanation:
o In a context-free grammar, each production replaces one non-terminal with a sequence of terminals and/or non-terminals.
o This is the type of grammar most commonly used for programming languages because it is powerful enough to express the necessary constructs while still being efficiently parsable.
o The example rule says that A can be replaced by aBb, where B is a non-terminal and a, b are terminals.
* Programming languages tend to be based on context-free grammars (Type 2) because they strike a good balance between expressive power and parsability. The structure of programming languages often includes constructs like loops, conditionals, and function calls, which can be nicely expressed using context-free grammars.

Type 3 - Regular Grammars

* Limitations: The left side of a production must be a single non-terminal, and the right side must be either a single terminal or a terminal followed by a non-terminal.
* Production Rule Example: A ::= a or A ::= aB
* Explanation:
o Regular grammars are the most restrictive and correspond to regular expressions.
o They are used to describe regular languages, which can be recognized by the simplest computational models, finite automata.
o These grammars are used for simple string pattern matching, like tokenization in compilers, rather than for full-fledged programming languages.
o In the example (A ::= a or A ::= aB), A can either be replaced by a single terminal a or by a terminal a followed by a non-terminal B.

Context Free Grammar (CFG)

* Context-free grammars are well-suited to programming languages because they restrict the manner in which programming constructs can be used and thus simplify the process of analyzing their use in a program.
* They are called context-free because the manner in which we parse any nonterminal is independent of the other symbols surrounding it (i.e., parsing is done without respect to context).
* The grammars of most programming languages are explicitly context-free (although a few have one or two context-sensitive elements).

Context Free and Context Sensitive Grammar

* Context Free Grammar
if_statement -> IF ( expression ) statement ELSE statement
* Context Sensitive Grammar
o C++: int myArray[10];
o C++98: std::vector<std::vector<int> >
o C++11: std::vector<std::vector<int>>

CSG and CFG

* Context-Free Grammar (CFG):
o Rules: In CFGs, the production rules are of the form A → α, where A is a single non-terminal and α is a string of terminals and/or non-terminals.
o Example: A simple CFG rule might be S → aSb | ε, where S is a non-terminal, a and b are terminals, and ε represents the empty string.
o These rules imply that the substitution of A with α is independent of the surrounding symbols (the context).
* Context-Sensitive Grammar (CSG):
o Rules: In CSGs, the production rules are of the form αAβ → αγβ, where A is a non-terminal and α, β, and γ are strings of terminals and/or non-terminals.
o Example: A context-sensitive rule might be aAb → abb, meaning that A can be replaced with b only when it is preceded by a and followed by b.
o The production rule implies that the substitution of A with γ can depend on the context provided by α and β.

Context Free Grammar (CFG)

* A context-free grammar is defined by the 4-tuple G = (T, N, S, P) where
o T = The set of terminals (e.g. the tokens returned by the scanner)
o N = The set of non-terminals (denoting structures within the language such as DeclarationSection, Function)
o S = The start symbol (in most instances, our program)
o P = The set of productions (rules governing how tokens are arranged into syntactic units)

Example:
o T (Terminals): {int, ;, identifier, =, number}
o N (Non-terminals): {Program, Statement, Declaration}
o S (Start Symbol): Program
o P (Productions):
Program → Statement
Statement → Declaration ;
Declaration → int identifier = number

A string such as int x = 5; is a valid program in this language, and it can be derived starting from Program: Program => Statement => Declaration ; => int identifier = number ; where x is an identifier and 5 is a number.

CFG Notations

* Capital letters at the beginning of the alphabet will represent nonterminals, e.g. A, B, C, D.
* Lowercase letters at the end of the alphabet will represent terminals, e.g. t, u, v, w.
* Lowercase Greek letters will represent arbitrary strings of terminals and nonterminals, e.g. α, β, γ.

Backus-Naur Form (BN Form)

BNF (Backus-Naur Form) is a metalanguage for describing a context-free grammar.
* The symbol ::= (or →) is used for "may derive".
* The symbol | separates alternative strings on the right-hand side.

Example:
E ::= E + T | T
T ::= T * F | F
F ::= id | constant | (E)
where E is Expression, T is Term, and F is Factor.

Extended Backus-Naur Form (EBN Form)

EBNF (Extended Backus-Naur Form) adds a few additional metasymbols whose main advantage is replacing recursion with iteration.
* {a} means that a occurs zero or more times.
* [a] means that a appears once or not at all.
Example: Our expression grammar can become:
E ::= T {+ T}
T ::= F {* F}
F ::= id | constant | (E)

A Sample Grammar

* Start symbol S:
o S ::= ABc
o A ::= aA | b
o B ::= Ab | a
* Generates strings such as abbbc, aaabac, aaaababbc.
* Similarly,
o S ::= a | (bSS)
* Generates strings such as (baa), (b(baa)a), a.
* Can you determine how?

The Empty String

Productions within a grammar can contain ε, the empty string.
A → B is equivalent to A → Bε.
It is also possible to write the production A → ε; such productions become particularly useful in top-down parsing.
The grammar might have a rule like:
statements → statement statements | ε
Here, statements can either derive a statement followed by more statements, or it can derive ε, indicating that having no statements at all is valid. This is crucial in parsing a block of code where zero or more statements are allowed.

Derivations

* A derivation in a context-free grammar is the process of starting with the start symbol and repeatedly replacing non-terminals with strings of symbols according to the grammar's production rules until a string of terminals (the desired language construct) is formed.
* Example Derivation:
o S ::= Aa
o A ::= Ab | c
The string cbba is derived from S as follows:
o Start with S.
o Replace S with Aa (S ::= Aa), resulting in Aa.
o Replace the leftmost A in Aa with Ab (A ::= Ab), resulting in Aba.
o Again, replace the leftmost A in Aba with Ab (A ::= Ab), resulting in Abba.
o Finally, replace the leftmost A in Abba with c (A ::= c), resulting in cbba.
* If the start symbol S derives a string β which contains nonterminals, β is a sentential form.
* If S derives a string β which contains only terminals, β is a sentence.
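
The derivation of cbba can also be replayed mechanically. The following Python sketch is only an illustration, assuming the grammar S ::= Aa, A ::= Ab | c from the example and the convention that upper-case letters are non-terminals. The names PRODUCTIONS, apply_leftmost, derive and is_sentence are hypothetical helpers, not part of the lecture.

```python
# Grammar from the example: S ::= Aa, A ::= Ab | c
# Assumption: upper-case letters are non-terminals, lower-case letters are terminals.
PRODUCTIONS = {"S": ["Aa"], "A": ["Ab", "c"]}


def apply_leftmost(form, lhs, rhs):
    """Rewrite the leftmost occurrence of non-terminal lhs in form using lhs ::= rhs."""
    if rhs not in PRODUCTIONS.get(lhs, []):
        raise ValueError(f"{lhs} ::= {rhs} is not a production of the grammar")
    index = form.find(lhs)
    if index < 0:
        raise ValueError(f"{lhs} does not occur in {form}")
    return form[:index] + rhs + form[index + 1:]


def derive(start, steps):
    """Return every sentential form obtained by applying steps in order."""
    forms = [start]
    for lhs, rhs in steps:
        forms.append(apply_leftmost(forms[-1], lhs, rhs))
    return forms


def is_sentence(form):
    """A sentential form is a sentence when it contains no non-terminals."""
    return not any(symbol in PRODUCTIONS for symbol in form)


# The four steps listed in the example derivation.
STEPS = [("S", "Aa"), ("A", "Ab"), ("A", "Ab"), ("A", "c")]
FORMS = derive("S", STEPS)
print(" => ".join(FORMS))  # S => Aa => Aba => Abba => cbba
```

Every intermediate entry of FORMS (Aa, Aba, Abba) still contains a non-terminal and is therefore a sentential form; only the last entry, cbba, is a sentence.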
Derivations

One Step Derivation (=>):
o The notation A => α means that A derives α in one derivation step.
o For example, if A ::= α is a production of the grammar, then A => α.

One or More Steps Derivation (=>+):
o The notation A =>+ α means that A derives α in one or more derivation steps.
o This implies that at least one production rule must be applied to get from A to α.

Zero or More Steps Derivation (=>*):
o The notation A =>* α indicates that A derives α in zero or more derivation steps. This includes the following three possibilities:
* Direct derivation: If there is a production rule A ::= α, A can derive α in one step.
* Multiple derivation steps: A can derive α through a series of production steps, transforming A into α through a sequence of intermediate derivations.
* No derivation needed: If A is already α, then A trivially derives α with zero derivation steps. This is more of a theoretical inclusion because, in formal grammars, A (typically a non-terminal) and α (typically a terminal or a string of terminals and non-terminals) are usually distinct. However, this inclusion is crucial for completeness and for certain theoretical considerations. In most practical grammars, especially those of programming languages, you would not typically find a scenario where A =>* α is realized in zero steps; the concept is more relevant in theoretical computer science, when discussing the properties of formal languages and grammars.
o The =>* relation is an umbrella covering all kinds of derivations, including those of =>+. That is, whenever you have a derivation A =>+ α, it automatically means that A =>* α is also true, because the latter encompasses all possibilities: no steps, one step, or multiple steps. For instance, in our example, S =>* cbba holds because S derives cbba in four steps; for the same reason S =>+ cbba holds as well.
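
The three relations can be made concrete with a small search over sentential forms. This is a sketch under stated assumptions: it uses the grammar S ::= Aa, A ::= Ab | c from the earlier derivation example, and it discards forms longer than the target, which is only valid because this grammar has no ε-productions (no step shortens a form). The function names one_step, derives_plus and derives_star are hypothetical.

```python
# Grammar assumed from the earlier derivation example: S ::= Aa, A ::= Ab | c
PRODUCTIONS = {"S": ["Aa"], "A": ["Ab", "c"]}


def one_step(form):
    """All forms reachable from form in exactly one derivation step (=>)."""
    results = set()
    for index, symbol in enumerate(form):
        for rhs in PRODUCTIONS.get(symbol, []):
            results.add(form[:index] + rhs + form[index + 1:])
    return results


def derives_plus(source, target):
    """source =>+ target: reachable in one or more steps."""
    seen = set()
    frontier = one_step(source)
    while frontier:
        if target in frontier:
            return True
        seen |= frontier
        # Assumption: no production shrinks a form, so longer forms can be dropped.
        frontier = {
            nxt
            for form in frontier
            for nxt in one_step(form)
            if len(nxt) <= len(target) and nxt not in seen
        }
    return False


def derives_star(source, target):
    """source =>* target: reachable in zero or more steps."""
    return source == target or derives_plus(source, target)


print(one_step("S"))               # the single form Aa
print(derives_plus("S", "cbba"))   # S =>+ cbba
print(derives_star("S", "S"))      # zero steps are allowed for =>*
print(derives_plus("S", "S"))      # but not for =>+
```

The only difference between derives_star and derives_plus is the zero-step case, which is exactly the "no derivation needed" possibility described above.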
Derivations

To explain derivations in a simple way, let's consider these symbols in the context of a journey.

=> (One Step Derivation):
o This is like taking a single, direct flight from City A to City B. There is only one leg to this journey, directly connecting A and B.

=>+ (One or More Steps Derivation):
o Here, you are guaranteed to travel from City A to City B, but this time you might take one or more flights to get there. It could be a direct flight (one step) or a series of connecting flights (multiple steps).

=>* (Zero or More Steps Derivation):
o This is the most flexible journey plan. You could already be in City B (zero flights needed), take a direct flight (one step), or take several connecting flights (multiple steps).

Reduced Grammars

* A grammar being "reduced" means that every production (rule) in the grammar is necessary and contributes to generating some string in the language. In other words, there are no useless productions.
* The formal requirement for a grammar to be reduced is that every production A ::= α fits the derivation pattern: S =>* x A z => x α z =>* x y z.
* This means:
o Start from S: There must be a derivation starting from the start symbol S that produces a string containing A, which is x A z.
o Apply the Production: Then, the production A ::= α is applied within this string, changing it to x α z.
o Complete the Derivation: Finally, there must exist further derivations that transform x α z into a string x y z that is in the language defined by the grammar.

Reduced Grammars

* Example of Reduced Grammar:
o Consider a simple grammar G:
o Productions: S ::= aA | b, A ::= a
o In this grammar, every production contributes to generating strings of the language. For example, the string aa can be derived as S => aA => aa.
* Example of Non-Reduced Grammar:
o Now consider a grammar with a useless production:
o Productions: S ::= aA | b, A ::= a, B ::= c
o If the production B ::= c is never used in any derivation starting from S (i.e., there is no way to derive a string containing B from S), then this production is useless, and the grammar is not reduced.

Ambiguous Grammars

* While there may be an infinite number of grammars that describe a given language, their parse trees may be very different.
* A grammar capable of producing two different parse trees for the same sentence is called ambiguous.
* Ambiguous grammars are highly undesirable.

Ambiguous Grammars: Example

The IF-THEN-ELSE ambiguity is a classical example of an ambiguous grammar.

Statement ::= if Expression then Statement else Statement
            | if Expression then Statement

How would you parse the following string?

IF x > 0 THEN IF y > 0 THEN z := x + y ELSE z := x;

There are two possible parse trees:

(Figure: the two parse trees for this statement. In one, the ELSE is attached to the inner IF; in the other, it is attached to the outer IF.)
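
The two parse trees can be enumerated with a small exhaustive parser. This is a minimal sketch, not a practical parsing method: it assumes each Expression has already been reduced to the single token "expr" and each simple statement (such as an assignment) to the single token "stmt", and it adds the rule Statement ::= stmt, which the slide leaves implicit. The names parse_statement and all_parse_trees are hypothetical.

```python
# Grammar assumed for this sketch:
#   Statement ::= if expr then Statement else Statement
#               | if expr then Statement
#               | stmt            (assumed rule for simple statements)


def parse_statement(tokens, start):
    """Return every (tree, next_index) pair for a Statement beginning at start."""
    results = []
    if start >= len(tokens):
        return results
    if tokens[start] == "stmt":
        results.append(("stmt", start + 1))
    elif tokens[start] == "if" and tokens[start + 1:start + 3] == ["expr", "then"]:
        for then_tree, after_then in parse_statement(tokens, start + 3):
            # Alternative 1: if-then without an else part.
            results.append((("if", then_tree), after_then))
            # Alternative 2: if-then-else, when an else follows the then-branch.
            if after_then < len(tokens) and tokens[after_then] == "else":
                for else_tree, after_else in parse_statement(tokens, after_then + 1):
                    results.append((("if-else", then_tree, else_tree), after_else))
    return results


def all_parse_trees(tokens):
    """Keep only the parses that consume the whole token list."""
    return [tree for tree, end in parse_statement(tokens, 0) if end == len(tokens)]


# IF x > 0 THEN IF y > 0 THEN z := x + y ELSE z := x
TOKENS = ["if", "expr", "then", "if", "expr", "then", "stmt", "else", "stmt"]
TREES = all_parse_trees(TOKENS)
for tree in TREES:
    print(tree)
print(len(TREES))  # 2
```

One tree nests an "if-else" node inside a plain "if" (ELSE bound to the inner IF), and the other nests a plain "if" inside an "if-else" (ELSE bound to the outer IF). Finding more than one complete tree for the same token sequence is what makes the grammar ambiguous.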
