Syntax Analysis, or Parsing
Parsing
• A.K.A. Syntax Analysis
– Recognize sentences in a language.
– Discover the structure of a document/program.
– Construct (implicitly or explicitly) a tree, called a parse tree, to represent the structure.
– The parse tree is used later to guide translation.
Parsing During Compilation
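[Figure: source program → lexical analyzer → parser → rest of front end → intermediate representation. The parser asks the lexical analyzer for each token ("get next token") and hands a parse tree to the rest of the front end. Also shown: the symbol table and error output.]
• Lexical analyzer
– Specified by regular expressions
• Parser
– Uses a grammar to check the structure of tokens
– Produces a parse tree
– Handles syntactic errors and recovery
– Recognizes correct syntax
– Reports errors
• Rest of front end
– Collects token information
– Performs type checking
– Generates intermediate code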
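The "get next token" interaction can be sketched in Python as a lexical analyzer that yields tokens only when asked. This is a minimal illustration, not part of the lecture: the token classes in TOKEN_SPEC and the names tokens and token_kinds are assumptions chosen for the example.

```python
import re

# Hypothetical token classes, each described by a regular expression.
TOKEN_SPEC = [
    ("real",    r"\d+\.\d+"),
    ("integer", r"\d+"),
    ("id",      r"[A-Za-z_]\w*"),
    ("assign",  r":="),
    ("op",      r"[+\-]"),
    ("semi",    r";"),
    ("skip",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))


def tokens(source):
    """Lexical analyzer: yield (kind, text) pairs one at a time, on demand."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"lexical error at position {pos}: {source[pos]!r}")
        if m.lastgroup != "skip":          # whitespace is discarded
            yield m.lastgroup, m.group()
        pos = m.end()


def token_kinds(source):
    """Stand-in for the parser: pull every token and keep only its kind."""
    stream = tokens(source)
    kinds = []
    while True:
        tok = next(stream, None)           # "get next token"
        if tok is None:
            break
        kinds.append(tok[0])
    return kinds
```

Calling token_kinds("x := y + 3.5 - 2;") pulls eight tokens, one per request. A real parser would check each token against the grammar instead of collecting the kinds in a list.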
Parsing Responsibilities
Syntax Error Identification / Handling
Recall typical error types:
1. Lexical: misspellings, e.g. if x<1 thenn y = 5:
2. Syntactic: omission, wrong order of tokens, e.g. if ((x<1) & (y>5)))
3. Semantic: incompatible types, undefined IDs, e.g. if (x+5) then
4. Logical: infinite loop / recursive call, e.g. if (i<9) then ... (should be <= not <)
The majority of error processing occurs during syntax analysis.
NOTE: Not all errors are identifiable!
Error Detection
• Much responsibility rests on the parser
– Many errors are syntactic in nature
– Modern parsing methods can detect the presence of syntactic errors in
programs very efficiently
– Detecting semantic or logical errors is difficult
• Challenges for the error handler in the parser
– It should report errors clearly and accurately
– It should recover from each error and continue
– It should not significantly slow down the processing of correct programs
• The good news
– Common errors are simple and relatively easy to catch
– Errors don’t occur that frequently:
• 60% of programs are syntactically and semantically correct
• 80% of erroneous statements have only 1 error, 13% have 2
• Most errors are trivial: 90% are single-token errors
• 60% are punctuation errors, 20% operator, 15% keyword, 5% other
Adequate Error Reporting is Not a Trivial Task
• Difficult to generate clear and accurate error messages.
Example
function foo () {
...
if (...) {
...
} else {
...
...
}
<eof>          ← A } is missing somewhere above; not detected until here
Example
int myVarr;    ← Misspelled ID here
...
x = myVar;     ← Not detected until here
...
Error Recovery
• After the first error is found
– The compiler must go on!
• Restore to some state and process the rest of the input
• Error-correcting compilers
– Issue an error message
– Fix the problem
– Produce an executable
Example
Error on line 23: “myVarr” undefined.
“myVar” was used.
• May not be a good idea!
– Guessing the programmer’s intention is not easy!
Error Recovery May Trigger More Errors!
• Inadequate recovery may introduce more errors
– Those were not the programmer’s errors
• Example:
int myVar flag ;     ← Declaration of flag is discarded
...
x := flag;           ← Variable flag is undefined
...
...
while (flag==0)      ← Variable flag is undefined
...
• Too many error messages may obscure the problem
– They may bury the real message
– Remedy:
• Allow 1 message per token or per statement
• Quit after a maximum number of errors (e.g. 100)
Error Recovery Approaches: Panic Mode
• Discard tokens until we see a “synchronizing” token.
• The key...
– A good set of synchronizing tokens
– Knowing what to do then
• Advantages
– Simple to implement
– Does not go into an infinite loop
– Commonly used
• Disadvantage
– May skip over large sections of source without checking them for further errors
Example
Skip to the next occurrence of
} end ;
Resume by parsing the next statement
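Panic-mode recovery can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's implementation: the toy statement form `id := id ;`, the use of plain strings as tokens, and the names SYNC and parse_statements are all invented for the example.

```python
SYNC = {";", "}", "end"}   # synchronizing tokens from the example above


def parse_statements(toks):
    """Parse a sequence of `id := id ;` statements with panic-mode recovery.

    Returns (number of well-formed statements, list of error messages).
    """
    pos, ok, errors = 0, 0, []
    while pos < len(toks):
        try:
            for expected in ("id", ":=", "id", ";"):
                if pos >= len(toks) or toks[pos] != expected:
                    found = toks[pos] if pos < len(toks) else "<eof>"
                    raise SyntaxError(
                        f"token {pos}: expected {expected!r}, found {found!r}")
                pos += 1
            ok += 1
        except SyntaxError as err:
            errors.append(str(err))
            # Panic mode: discard tokens until a synchronizing token is seen...
            while pos < len(toks) and toks[pos] not in SYNC:
                pos += 1
            # ...then consume it and resume with the next statement.
            pos += 1
    return ok, errors
```

Every pass through the outer loop consumes at least one token, so the parser cannot loop forever. One error produces exactly one message, and the statement after the synchronizing token is parsed normally.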
Error Recovery Approaches: Phrase-Level Recovery
• The compiler corrects the program by deleting or inserting tokens
...so it can proceed to parse from where it was.
• The key...
– Don’t get into an infinite loop
Example
while (x==4) y:= a + b
Insert do to fix the statement
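The `do` insertion from the example can be sketched as follows. This is only an illustration of a single repair: the function name repair_while and the pre-split token list are assumptions, and a real parser would apply such repairs inside its normal parsing routines.

```python
def repair_while(toks):
    """Check a `while ( ... ) do ...` statement; insert a missing `do`.

    Returns (repaired token list, list of messages).
    """
    messages = []
    if not toks or toks[0] != "while":
        raise SyntaxError("expected 'while'")
    # Find the parenthesis that closes the loop condition.
    depth, pos = 0, 1
    while pos < len(toks):
        if toks[pos] == "(":
            depth += 1
        elif toks[pos] == ")":
            depth -= 1
            if depth == 0:
                break
        pos += 1
    if pos >= len(toks) or toks[1] != "(":
        raise SyntaxError("malformed loop condition")
    pos += 1                       # first token after the condition
    repaired = list(toks)
    if pos >= len(toks) or toks[pos] != "do":
        # Phrase-level repair: insert the missing token and carry on.
        repaired.insert(pos, "do")
        messages.append(f"token {pos}: inserted missing 'do'")
    return repaired, messages
```

The repair is applied at most once and the position only moves forward, which is how this sketch avoids the infinite loop that the slide warns about.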
Context Free Grammars (CFG)
• A context free grammar is a formal model that consists of:
• Terminals
– Keywords
– Token classes
– Punctuation
• Non-terminals
– Any symbol appearing on the left-hand side of any rule
• Start symbol
– Usually the non-terminal on the left-hand side of the first rule
• Rules (or “Productions”)
– BNF: Backus-Naur Form / Backus-Normal Form
– Stmt ::= if Expr then Stmt else Stmt
Rule Alternative Notations
Context Free Grammars : A First Look
assign_stmt → id := expr ;
expr → expr operator term
expr → term
term → id
term → real
term → integer
operator → +
operator → -
Derivation: A sequence of grammar rule applications and
substitutions that transforms a starting nonterminal into a sequence
of terminals / tokens.
Derivation
Let’s derive: id := id + real - integer ;

Sentential form                               Using production
assign_stmt                                   assign_stmt → id := expr ;
⇒ id := expr ;                                expr → expr operator term
⇒ id := expr operator term ;                  expr → expr operator term
⇒ id := expr operator term operator term ;    expr → term
⇒ id := term operator term operator term ;    term → id
⇒ id := id operator term operator term ;      operator → +
⇒ id := id + term operator term ;             term → real
⇒ id := id + real operator term ;             operator → -
⇒ id := id + real - term ;                    term → integer
⇒ id := id + real - integer ;
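The derivation can be replayed mechanically. The Python sketch below is illustrative only: the helper name derive and the list-of-symbols representation are assumptions. It applies each production to the leftmost occurrence of its left-hand side and records every sentential form.

```python
def derive(start, steps):
    """Apply (lhs, rhs) productions to the leftmost occurrence of lhs, in order.

    Returns every sentential form, starting with [start].
    """
    form = [start]
    forms = [list(form)]
    for lhs, rhs in steps:
        i = form.index(lhs)        # leftmost occurrence; ValueError if absent
        form[i:i + 1] = rhs        # substitute the right-hand side
        forms.append(list(form))
    return forms


# The productions used in the derivation, in the same order.
STEPS = [
    ("assign_stmt", ["id", ":=", "expr", ";"]),
    ("expr", ["expr", "operator", "term"]),
    ("expr", ["expr", "operator", "term"]),
    ("expr", ["term"]),
    ("term", ["id"]),
    ("operator", ["+"]),
    ("term", ["real"]),
    ("operator", ["-"]),
    ("term", ["integer"]),
]
```

Running derive("assign_stmt", STEPS) yields ten sentential forms, and joining the last one with spaces gives id := id + real - integer ;.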
Example Grammar: Simple Arithmetic Expressions
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
9 Production rules
Terminals: id + - * / ↑ ( )
Nonterminals: expr, op
Start symbol: expr
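The same bookkeeping can be computed from the rules themselves. In this Python sketch the PRODUCTIONS list and the classify helper are assumed names. It treats every left-hand side as a nonterminal, every other symbol as a terminal, and the left-hand side of the first rule as the start symbol.

```python
# The nine productions of the arithmetic-expression grammar.
PRODUCTIONS = [
    ("expr", ["expr", "op", "expr"]),
    ("expr", ["(", "expr", ")"]),
    ("expr", ["-", "expr"]),
    ("expr", ["id"]),
    ("op", ["+"]),
    ("op", ["-"]),
    ("op", ["*"]),
    ("op", ["/"]),
    ("op", ["↑"]),
]


def classify(productions):
    """Return (nonterminals, terminals, start symbol) for a list of rules."""
    nonterminals = {lhs for lhs, _ in productions}
    terminals = {sym for _, rhs in productions for sym in rhs} - nonterminals
    start = productions[0][0]      # left-hand side of the first rule
    return nonterminals, terminals, start
```

For this grammar, classify(PRODUCTIONS) reproduces the three lists given above.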
Notational Conventions
• Terminals
– Lower-case letters early in the alphabet: a, b, c
– Operator symbols: +, -
– Punctuation symbols: parentheses, comma
– Boldface strings: id or if
• Nonterminals:
– Upper-case letters early in the alphabet: A, B, C
– The letter S (start symbol)
– Lower-case italic names: expr or stmt
• Upper-case letters late in the alphabet, such as X, Y, Z,
represent either nonterminals or terminals.
• Lower-case letters late in the alphabet, such as u, v, …, z,
represent strings of terminals.
Notational Conventions
• Lower-case Greek letters, such as α, β, γ, represent strings of
grammar symbols. Thus A→ α indicates that there is a single
nonterminal A on the left side of the production and a string of
grammar symbols α to the right of the arrow.
• If A→ α1, A→ α2, …., A→ αk are all productions with A on the
left, we may write A→ α1 | α2 | …. | αk
• Unless otherwise stated, the left side of the first production is
the start symbol.
E → E A E | ( E ) | -E | id
A → + | - | * | / | ↑
Derivations
– A sentence is a sentential form that doesn’t contain nonterminals
Leftmost Derivation
Rightmost Derivation
Parse Tree
Ambiguous Grammar
• A grammar is ambiguous if it yields more than one parse tree for some sentence.
– The grammar for a programming language may be ambiguous
– It then needs to be modified for parsing.
• Also: a grammar may be left recursive.
– It also needs to be modified for parsing.
Elimination of Ambiguity
• Ambiguous
– A grammar is ambiguous if there are multiple parse
trees for the same sentence.
• Disambiguation
– Express a preference for one parse tree over the others
– Add a disambiguating rule to the grammar
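Ambiguity can be checked by counting parse trees. The Python sketch below is one way to do it, not a standard library routine. It assumes the grammar E → E A E | ( E ) | -E | id with A → + | - | * | / | ↑, has no empty productions, and the names GRAMMAR and count_parse_trees are invented for the example.

```python
from functools import lru_cache

# E -> E A E | ( E ) | - E | id ;  A -> + | - | * | / | ↑
GRAMMAR = {
    "E": (("E", "A", "E"), ("(", "E", ")"), ("-", "E"), ("id",)),
    "A": (("+",), ("-",), ("*",), ("/",), ("↑",)),
}


def count_parse_trees(tokens, start="E", grammar=GRAMMAR):
    """Count distinct parse trees for `tokens`.

    Assumes no empty productions and no cycles of single-symbol rules.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        # Number of trees deriving tokens[i:j] from `symbol`.
        if symbol not in grammar:                      # terminal
            return int(j == i + 1 and tokens[i] == symbol)
        return sum(seq(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        # Number of ways the symbol string `rhs` derives tokens[i:j].
        if len(rhs) == 1:
            return trees(rhs[0], i, j)
        # Every symbol derives at least one token, so leave room for the rest.
        return sum(trees(rhs[0], i, k) * seq(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return trees(start, 0, len(tokens))
```

For the sentence id + id * id the count is 2, one tree grouping the addition first and one grouping the multiplication first, so the grammar is ambiguous. For id + id the count is 1.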
Resolving Problems: Ambiguous Grammars
Consider the following grammar segment:
stmt → if expr then stmt
| if expr then stmt else stmt
| other (any other statement)
if E1 then S1 else if E2 then S2 else S3
Simple parse tree:
stmt
├─ if
├─ expr ─ E1
├─ then
├─ stmt ─ S1
├─ else
└─ stmt
   ├─ if
   ├─ expr ─ E2
   ├─ then
   ├─ stmt ─ S2
   ├─ else
   └─ stmt ─ S3
Example : What Happens with this string?
if E1 then if E2 then S1 else S2
How is this parsed?
if E1 then
    if E2 then
        S1
    else
        S2
vs.
if E1 then
    if E2 then
        S1
else
    S2
Parse Trees: If E1 then if E2 then S1 else S2
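Form 1 (else attached to the outer if):
stmt
├─ if
├─ expr ─ E1
├─ then
├─ stmt
│  ├─ if
│  ├─ expr ─ E2
│  ├─ then
│  └─ stmt ─ S1
├─ else
└─ stmt ─ S2
Form 2 (else attached to the inner if):
stmt
├─ if
├─ expr ─ E1
├─ then
└─ stmt
   ├─ if
   ├─ expr ─ E2
   ├─ then
   ├─ stmt ─ S1
   ├─ else
   └─ stmt ─ S2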
Removing Ambiguity
Take Original Grammar:
stmt → if expr then stmt
| if expr then stmt else stmt
| other (any other statement)
Revise to remove ambiguity:
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
unmatched_stmt → if expr then stmt
| if expr then matched_stmt else unmatched_stmt
Rule: Match each else with the closest previous
unmatched then.
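The matching rule can be sketched as a small hand-written recursive-descent parser in Python. It is not generated from the revised grammar, and it rests on some assumptions: the name parse_stmt is invented, tokens are plain strings, and any token other than if, then or else is treated as an opaque expression or an `other` statement.

```python
def parse_stmt(toks, pos=0):
    """Parse one statement; return (tree, next position).

    Trees are nested tuples: ("if", cond, then_part[, else_part]).
    """
    def expect(word, at):
        if at >= len(toks) or toks[at] != word:
            raise SyntaxError(f"token {at}: expected {word!r}")
        return at + 1

    if pos >= len(toks):
        raise SyntaxError("unexpected end of input")
    if toks[pos] != "if":
        return toks[pos], pos + 1              # stmt -> other
    if pos + 1 >= len(toks):
        raise SyntaxError("missing condition after 'if'")
    cond = toks[pos + 1]                       # expr, kept opaque
    pos = expect("then", pos + 2)
    then_part, pos = parse_stmt(toks, pos)
    # Greedy choice: an `else` seen here belongs to this, the innermost
    # still-open `if`, i.e. the closest previous unmatched `then`.
    if pos < len(toks) and toks[pos] == "else":
        else_part, pos = parse_stmt(toks, pos + 1)
        return ("if", cond, then_part, else_part), pos
    return ("if", cond, then_part), pos
```

For the tokens of if E1 then if E2 then S1 else S2, the parser returns ("if", "E1", ("if", "E2", "S1", "S2")), which attaches the else to the inner if.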
Any Question?

Lecture 04 syntax analysis
