Lecture 04 syntax analysis

Parsing
• A.K.A. Syntax Analysis
– Recognize sentences in a language.
– Discover the structure of a document/program.
– Construct (implicitly or explicitly) a tree, called a
parse tree, to represent the structure.
– The above tree is used later to guide translation.

Parsing During Compilation
[Figure: position of the parser in the compiler. source program → lexical analyzer → (token / get next token) → parser → parse tree → rest of front end → intermediate representation. The phases share the symbol table, and errors are reported along the way.]
Lexical analyzer
• based on regular expressions
Parser
• uses a grammar to check structure of tokens
• produces a parse tree
• syntactic errors and recovery
• recognize correct syntax
• report errors
Rest of front end
• Collecting token information
• Perform type checking
• Intermediate code generation

Parsing Responsibilities
Syntax Error Identification / Handling
Recall typical error types:
1. Lexical : Misspellings
   Example: if x<1 thenn y = 5:
2. Syntactic : Omission, wrong order of tokens
   Example: if ((x<1) & (y>5)))
3. Semantic : Incompatible types, undefined IDs
   Example: if (x+5) then
4. Logical : Infinite loop / recursive call
   Example: if (i<9) then ... (should be <= not <)
Majority of error processing occurs during syntax analysis
NOTE: Not all errors are identifiable!

Error Detection
• Much responsibility on Parser
– Many errors are syntactic in nature
– Modern parsing methods can detect the presence of syntactic errors in
programs very efficiently
– Detecting semantic or logical errors is difficult
• Challenges for error handler in Parser
– It should report errors clearly and accurately
– It should recover from an error and continue
– It should not significantly slow down the processing of correct programs
• Good news is
– Common errors are simple and relatively easy to catch.
– Errors don’t occur that frequently!
• 60% of programs are syntactically and semantically correct
• 80% of erroneous statements have only 1 error, 13% have 2
• Most errors are trivial: 90% are single-token errors
• 60% punctuation, 20% operator, 15% keyword, 5% other errors

Adequate Error Reporting is Not a Trivial Task
• Difficult to generate clear and accurate error messages.
Example
function foo () {
...
if (...) {
...
} else {
...
...
}
<eof>
A } is missing inside the function body, but this is not detected until <eof>.
Example
int myVarr;
...
x = myVar;
...
The ID is misspelled at the declaration (myVarr), but this is not detected until the use of myVar.

Error Recovery
• After recovering from the first error
– Compiler must go on!
• Restore to some state and process the rest of the input
• Error-Correcting Compilers
– Issue an error message
– Fix the problem
– Produce an executable
Example
Error on line 23: “myVarr” undefined.
“myVar” was used.
May not be a good idea!
– Guessing the programmer’s intention is not easy!

Error Recovery May Trigger More Errors!
• Inadequate recovery may introduce more errors
– Those were not the programmer’s errors
• Example:
int myVar flag ;      ← Declaration of flag is discarded
...
x := flag;            ← Variable flag is undefined
...
...
while (flag==0)       ← Variable flag is undefined
...
• Too many error messages may obscure the problem
– May bury the real message
– Remedy:
• allow 1 message per token or per statement
• Quit after a maximum (e.g. 100) number of errors
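The remedy above can be sketched as a small reporter object. This is only a minimal illustration in Python; the names ErrorReporter, TooManyErrors and report are hypothetical and are not taken from any particular compiler.

```python
class TooManyErrors(Exception):
    """Raised when the error limit is reached and compilation should stop."""


class ErrorReporter:
    def __init__(self, max_errors=100):
        self.max_errors = max_errors
        self.messages = []
        self._reported_statements = set()

    def report(self, statement_no, message):
        """Record a message; return False if it was suppressed."""
        # Allow only one message per statement.
        if statement_no in self._reported_statements:
            return False
        self._reported_statements.add(statement_no)
        self.messages.append(f"statement {statement_no}: {message}")
        # Quit after a maximum number of errors.
        if len(self.messages) >= self.max_errors:
            raise TooManyErrors(f"quitting after {self.max_errors} errors")
        return True


reporter = ErrorReporter(max_errors=100)
reporter.report(1, "declaration of flag is discarded")
reporter.report(1, "unexpected identifier")      # suppressed: same statement
reporter.report(5, "variable flag is undefined")
print(reporter.messages)
```

Suppressing by statement rather than by token is a design choice made here for brevity; the slide allows either.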

Error Recovery Approaches: Panic Mode
• Discard tokens until we see a “synchronizing” token.
• The key...
– Good set of synchronizing tokens
– Knowing what to do then
• Advantage
– Simple to implement
– Does not go into an infinite loop
– Commonly used
• Disadvantage
– May skip over large sections of source, so errors inside them go unreported
Example
Skip to the next occurrence of one of:  }  end  ;
Resume by parsing the next statement
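A minimal sketch of panic-mode recovery in Python, assuming a toy language whose only statement is "name := operand ;". The statement form and the helper names (ParseError, expect, parse_assignment, parse_program) are assumptions made for illustration; the synchronizing tokens are the ones from the example above.

```python
SYNC_TOKENS = {";", "}", "end"}  # synchronizing tokens from the example above


class ParseError(Exception):
    def __init__(self, pos, message):
        super().__init__(message)
        self.pos = pos  # index of the offending token


def expect(tokens, pos, predicate, what):
    """Return pos + 1 if tokens[pos] satisfies predicate, else raise ParseError."""
    if pos < len(tokens) and predicate(tokens[pos]):
        return pos + 1
    found = tokens[pos] if pos < len(tokens) else "<eof>"
    raise ParseError(pos, f"token {pos}: expected {what}, found {found!r}")


def is_name(tok):
    return tok.isidentifier() and tok not in SYNC_TOKENS


def parse_assignment(tokens, pos):
    """Parse one toy statement of the form: name := operand ;"""
    pos = expect(tokens, pos, is_name, "a name")
    pos = expect(tokens, pos, lambda t: t == ":=", "':='")
    pos = expect(tokens, pos, lambda t: is_name(t) or t.isdigit(), "an operand")
    return expect(tokens, pos, lambda t: t == ";", "';'")


def parse_program(tokens):
    parsed, errors, pos = [], [], 0
    while pos < len(tokens):
        try:
            end = parse_assignment(tokens, pos)
            parsed.append(tokens[pos:end])
            pos = end
        except ParseError as err:
            errors.append(str(err))
            pos = err.pos
            # Panic mode: discard tokens until a synchronizing token is seen.
            while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
                pos += 1
            pos += 1  # consume the synchronizing token, so progress is guaranteed
    return parsed, errors


tokens = "x := 1 ; y := := 2 ; z := x ;".split()
parsed, errors = parse_program(tokens)
print(parsed)   # the first and third statements
print(errors)   # one message for the second statement
```

Because every recovery step consumes at least one token, the loop cannot run forever, which is the property the slide lists as an advantage.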

Error Recovery Approaches: Phrase-Level Recovery
• Compiler corrects the program by deleting or inserting tokens
...so it can proceed to parse from where it was.
• The key...
– Don’t get into an infinite loop
Example
while (x==4) y:= a + b
Insert do to fix the statement
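The insertion in the example can be sketched in Python as below. This is a simplified illustration, assuming the input is already split into tokens and that a while statement has the form "while ( condition ) do statement"; the function name insert_missing_do is hypothetical.

```python
def insert_missing_do(tokens):
    """Return (repaired_tokens, messages), inserting 'do' after a while condition."""
    out, messages = [], []
    i = 0
    while i < len(tokens):
        out.append(tokens[i])
        i += 1
        if out[-1] != "while":
            continue
        # Copy the parenthesised condition that follows 'while'.
        depth = 0
        while i < len(tokens):
            tok = tokens[i]
            out.append(tok)
            i += 1
            depth += (tok == "(") - (tok == ")")
            if depth == 0:
                break
        # Phrase-level repair: if 'do' is missing, insert it and carry on.
        if i >= len(tokens) or tokens[i] != "do":
            messages.append(f"inserted 'do' before token {i}")
            out.append("do")
    return out, messages


tokens = "while ( x == 4 ) y := a + b".split()
repaired, messages = insert_missing_do(tokens)
print(" ".join(repaired))
print(messages)
```

Each pass of the outer loop consumes at least one input token, so inserting tokens never causes the repair to loop forever.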

Context Free Grammars (CFG)
• A context free grammar is a formal model that consists of:
• Terminals
– Keywords
– Token Classes
– Punctuation
• Non-terminals
– Any symbol appearing on the left-hand side of any rule
• Start Symbol
– Usually the non-terminal on the left-hand side of the first rule
• Rules (or “Productions”)
– BNF: Backus-Naur Form / Backus-Normal Form
Stmt ::= if Expr then Stmt else Stmt

Context Free Grammars : A First Look
assign_stmt → id := expr ;
expr → expr operator term
expr → term
term → id
term → real
term → integer
operator → +
operator → -
Derivation: A sequence of grammar rule applications and
substitutions that transforms a starting nonterminal into a sequence
of terminals / tokens.

Derivation
Let’s derive: id := id + real - integer ;
Sentential form                                  using production:
assign_stmt                                      assign_stmt → id := expr ;
→ id := expr ;                                   expr → expr operator term
→ id := expr operator term ;                     expr → expr operator term
→ id := expr operator term operator term ;       expr → term
→ id := term operator term operator term ;       term → id
→ id := id operator term operator term ;         operator → +
→ id := id + term operator term ;                term → real
→ id := id + real operator term ;                operator → -
→ id := id + real - term ;                       term → integer
→ id := id + real - integer ;
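The derivation above always rewrites the leftmost nonterminal, so it can be replayed mechanically. The following Python sketch does that; the function name leftmost_derivation and the list-of-pairs encoding of productions are assumptions chosen for illustration.

```python
NONTERMINALS = {"assign_stmt", "expr", "term", "operator"}


def leftmost_derivation(start, productions):
    """Apply each (lhs, rhs) production to the leftmost nonterminal in turn."""
    form = [start]
    steps = [" ".join(form)]
    for lhs, rhs in productions:
        # Locate the leftmost nonterminal in the current sentential form.
        idx = next((i for i, s in enumerate(form) if s in NONTERMINALS), None)
        if idx is None or form[idx] != lhs:
            raise ValueError(f"cannot apply {lhs} -> {' '.join(rhs)} to {' '.join(form)}")
        form[idx:idx + 1] = rhs  # substitute the right-hand side
        steps.append(" ".join(form))
    return steps


# The productions in the order used in the derivation above.
PRODUCTIONS = [
    ("assign_stmt", ["id", ":=", "expr", ";"]),
    ("expr", ["expr", "operator", "term"]),
    ("expr", ["expr", "operator", "term"]),
    ("expr", ["term"]),
    ("term", ["id"]),
    ("operator", ["+"]),
    ("term", ["real"]),
    ("operator", ["-"]),
    ("term", ["integer"]),
]

steps = leftmost_derivation("assign_stmt", PRODUCTIONS)
for step in steps:
    print(step)
```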

Example Grammar: Simple Arithmetic Expressions
expr → expr op expr
expr → ( expr )
expr → - expr
expr → id
op → +
op → -
op → *
op → /
op → ↑
9 Production rules
Terminals: id + - * / ↑ ( )
Nonterminals: expr, op
Start symbol: expr
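The summary above (production count, terminals, nonterminals) can be checked with a few lines of Python. The dictionary encoding of the grammar and the function name summarize are assumptions for illustration; the only rule used is that a nonterminal is any symbol appearing on the left-hand side of a production.

```python
GRAMMAR = {
    "expr": [["expr", "op", "expr"], ["(", "expr", ")"], ["-", "expr"], ["id"]],
    "op": [["+"], ["-"], ["*"], ["/"], ["↑"]],
}


def summarize(grammar):
    """Return (nonterminals, terminals, number of productions)."""
    nonterminals = set(grammar)  # symbols that appear on a left-hand side
    rhs_symbols = {sym for alts in grammar.values() for rhs in alts for sym in rhs}
    terminals = rhs_symbols - nonterminals
    production_count = sum(len(alts) for alts in grammar.values())
    return nonterminals, terminals, production_count


nonterminals, terminals, production_count = summarize(GRAMMAR)
print(sorted(nonterminals))
print(sorted(terminals))
print(production_count)
```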

Notational Conventions
• Terminals
– Lower-case letters early in the alphabet: a, b, c
– Operator symbols: +, -
– Punctuation symbols: parentheses, comma
– Boldface strings: id or if
• Nonterminals:
– Upper-case letters early in the alphabet: A, B, C
– The letter S (start symbol)
– Lower-case italic names: expr or stmt
• Upper-case letters late in the alphabet, such as X, Y, Z,
represent either nonterminals or terminals.
• Lower-case letters late in the alphabet, such as u, v, …, z,
represent strings of terminals.

Notational Conventions
• Lower-case Greek letters, such as α, β, γ, represent strings of
grammar symbols. Thus A→ α indicates that there is a single
nonterminal A on the left side of the production and a string of
grammar symbols α to the right of the arrow.
• If A→ α1, A→ α2, …, A→ αk are all productions with A on the
left, we may write A→ α1 | α2 | … | αk
• Unless otherwise stated, the left side of the first production is
the start symbol.
E → E A E | ( E ) | -E | id
A → + | - | * | / | ↑

Derivations
• Sentence: a sentential form that doesn’t contain nonterminals

Ambiguous Grammar
• More than one parse tree for some sentence.
– The grammar for a programming language may be
ambiguous
– Need to modify it for parsing.
• Also: Grammar may be left recursive.
– Need to modify it for parsing.

Elimination of Ambiguity
• Ambiguous
– A grammar is ambiguous if there are multiple parse
trees for the same sentence.
• Disambiguation
– Express preference for one parse tree over others
– Add a disambiguating rule to the grammar

Resolving Problems: Ambiguous Grammars
Consider the following grammar segment:
stmt → if expr then stmt
| if expr then stmt else stmt
| other (any other statement)
if E1 then S1 else if E2 then S2 else S3
simple parse tree:
stmt
  if
  expr: E1
  then
  stmt: S1
  else
  stmt
    if
    expr: E2
    then
    stmt: S2
    else
    stmt: S3

Example : What Happens with this string?
if E1 then if E2 then S1 else S2
How is this parsed?
if E1 then
    if E2 then
        S1
    else
        S2
vs.
if E1 then
    if E2 then
        S1
else
    S2

Parse Trees: if E1 then if E2 then S1 else S2
Form 1 (else attached to the outer if):
stmt
  if
  expr: E1
  then
  stmt
    if
    expr: E2
    then
    stmt: S1
  else
  stmt: S2
Form 2 (else attached to the inner if):
stmt
  if
  expr: E1
  then
  stmt
    if
    expr: E2
    then
    stmt: S1
    else
    stmt: S2

Removing Ambiguity
Take Original Grammar:
stmt → if expr then stmt
| if expr then stmt else stmt
| other (any other statement)
Revise to remove ambiguity:
stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
unmatched_stmt → if expr then stmt
| if expr then matched_stmt else unmatched_stmt
Rule: Match each else with the closest previous unmatched then.
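The rule can also be enforced directly in a hand-written parser. The Python sketch below is a recursive-descent shortcut, not an implementation of the rewritten grammar itself: after parsing the statement that follows then, it takes an else immediately if one is present, which attaches each else to the closest unmatched then. The function name parse_stmt and the tuple tree format are assumptions, and expressions and other statements are treated as single tokens.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement; return (tree, next_pos)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] != "if":
        return tokens[pos], pos + 1  # 'other': any single-token statement
    if pos + 2 >= len(tokens) or tokens[pos + 2] != "then":
        raise SyntaxError(f"expected 'then' after condition at token {pos + 1}")
    cond = tokens[pos + 1]
    then_branch, pos = parse_stmt(tokens, pos + 3)
    else_branch = None
    # Greedy choice: an 'else' seen here belongs to this (innermost open) 'if'.
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
    return ("if", cond, then_branch, else_branch), pos


tree, _ = parse_stmt("if E1 then if E2 then S1 else S2".split())
print(tree)  # ('if', 'E1', ('if', 'E2', 'S1', 'S2'), None)
```

The printed tree shows else S2 attached to the inner if, which is the parse the revised grammar selects.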
