Transcript CS252-Slides-2015-to..
CS252: Systems Programming
Ninghui Li Topic 5: Parsing Prepared by Evan Hanau ehanau@purdue.edu
Introduction to Parsing with Yacc
An Introduction to Parsing with Yacc
• • • Context-Free Grammars Yacc Parsing An example Infix Calculator Program
Context-Free Grammar
Background: The Context-Free Grammar • By CS252 you are already somewhat familiar with Regular Expressions.
• Regular expressions can be used to describe regular languages, which belong to a larger classification of language types.
Context-Free Grammar
In CS, we classify languages on the Chomsky Hierarchy. Type-0 Recursively Enumerable Type-1 Context-Sensitive Type-2 Context-Free Type-3 Regular Type-(i) is a superset of Type-(i+1)
Context-Free Grammar
• • • Languages generated by regular expressions belong to type 3. Note: Your specific regular expression engine (e.g. POSIX extended RE) is likely capable of more complex productions.
In any case, we need more than regular expressions to parse computer programming languages and shell scripts.
Context-Free Grammar
• You can do a great deal with regular expressions.
• Exercise: Create a regular expression that matches on
any
English phrase that is a palindrome, for instance the string “some men interpret nine memos”.
Context-Free Grammar
• • • This is in fact not possible with regex (by its strict CS definition!). You would be limited to palindromes of a finite length only.
• • RE cannot express “a n b n ”, a string with some number of a’s followed by equal number of b’s The expression a*b* does not require number of a’s equal that of b’s We must use a
context-free grammar
palindromes and other constructs.
to describe More powerful than a regular expression, and useful when some notion of “what came before” is required.
Backus-Naur form
• BNF or Backus-Naur form is used in CS to describe context-free grammars. It is often used to describe the syntax of programming languages.consists of one or more of the following:
terminals
and
nonterminals
or nothing (epsilon).
US Post Address in Backus-Naur form (from wikipedia)
Context-Free Grammar for Simple Expressions
Let’s define a grammar for a primitive add or multiply expression:
Context-Free Grammar
• Clearly, there is some ambiguity here, because operator precedence (sometimes referred to as
binding
) is not defined.
• The grammar does not distinguish between 2+2*2+2 = 16 (incorrect under normal rules) or 2+2*2+2 = 8 (correct).
Context-Free Grammar
One Solution: Define expressions of different levels:
Context-Free Grammar
Associativity follows from the above example (Hint: What side of the multiply and add operation did we have the “deeper” production on?)
CFG for palindrome
….
Or,
CFG for anbn:
Chomsky Hierarchy (From Wikipedia Page)
Grammar Type-0 Type-1 Type-2 Type-3 Languages Recursively enumerable Context-sensitive Context-free Regular Automaton Production rules (constraints) Turing machine Linear-bounded non-deterministic Turing machine Non deterministic push down automaton (equivalently, Right side no shorter than left) Finite state automaton (no restrictions) and
Chomsky Hierarchy Revisited.
Type-0 Recursively Enumerable Type-1 Context-Sensitive Cannot encode all strings r 1 r 2 such that r 1 and r 2 are two regular expressions that are equivalent Type-2 Context-Free (Pushdown Automaton, i.e, Finite State Automaton with a Stack) Can encode a n b n , but not a n b n c n Type-3 Regular (Finite State Automaton) Can encode a*b*, but not a n b n
Why is Context-Free Grammar Called Context Free?
In a CFG, the left hand of each production is a single non-terminal, e.g.,
In a Context-Sensitive Grammar, left hand of production rules can include other things
An Example Context-Sensitive Grammar for a
n
b
n
c
n 1. S a B C 2. S a S B C 3. B C C B 4. a B a b 5. b B b b 6. b C b c 7. c C c c
S
→ 2 → 1
aSBC
aaBCBC → 3
aaBBCC
→ 4
aab
B
CC
→ 5 → 6 → 7
aabbCC aabbcC aabbcc
Yacc & Parsing
• • There are many ways to parse BNF grammars, most of which are discussed in a compilers course.
Recall: A finite state automaton (FSA) is used for regular expressions. (CS182).
For a context-free grammar, we use a pushdown automaton, which combines features of a FSA with a stack.
Yacc & Parsing
Yacc generates what is known as a LALR parser, which is generated from the BNF grammar in your Yacc file. This parser is defined in the C source file that Yacc generates.
We use Lex to make a lexer to generate our terminals, which are matched with regular expressions before being fed into the parser.
Yacc & Parsing
Input Characters LEX Lexer terminals YACC Parser Rule Based Behavior Yacc is capable of generating a powerful parser that will handle many different grammars.
Yacc & Parsing
• • • Recall that parsing combines a state machine with a stack. States go on a stack to keep track of where parsing is. Yacc uses a
parse table
which defines possible states.
Yacc’s parser operates using two primary actions,
shift
and
reduce.
shift
puts a state on the stack, reduce pops state(s) off the stack and
reduces
combinations of nonterminals and terminals to a single nonterminal. After a reduction to a rule, Yacc’s parser will optionally run some user-defined code.
Yacc & Parsing
A
very
basic example:
A Lex/Yacc Infix Calculator
• Yacc’s parser is powerful, but is not capable of parsing all grammars.
• Certain ambiguous grammars may produce what is known as a
shift/reduce
or
reduce/reduce
conflict. Yacc will, by default, shift instead of reduce.
Yacc & Parsing
Consider the classic
shift/reduce
example: conflict
A Lex/Yacc Infix Calculator
• To demonstrate the utility of Lex and Yacc (or in our case, Flex and Bison) we provide an example infix calculator.
• Similar to several of the examples provided on the Lex and Yacc manpage at http://dinosaur.compilertools.net, but with added features
A Lex/Yacc Infix Calculator
• Make sure to read ALL source code comments, particularly those that describe source file organization.
• • • • Lex definition file: calculator.l
Yacc grammar file: calculator.y
AST Classes: ast.cc
Symbol table: symtab.cc
A Lex/Yacc Infix Calculator
The example calculator application uses Lex and Yacc to parse mathematical expressions and produce an Abstract Syntax Tree, which is then used to evaluate those expressions.
It allows the =, +, *, -, +, ^, () and unary minus operators, with appropriate levels of binding and precedence. Examine calculator.y, because it is heavily commented.
A Lex/Yacc Infix Calculator
• The symbol table (implemented here in simple O(n) access time) maps variables to values.
• Print the AST after every expression evaluation by running calculator with the –t flag, e.g. “calc –t”.
A Lex/Yacc Infix Calculator
A calculator example. Type “2*2^3/3 and press enter: calc> 2*2^3/3 = 5.333333
3.000
/ (/) \ 3.000
/ (^) \ 2.000
/ (*) \ 2.000
Review
• What are required: • Able to write simple Context Free Grammars, similar to those used in implementing FIZ • Able to determine whether a string of tokens is accepted by a grammar • Able to show how a string of tokens is parsed into some non-terminal (i.e., draw the parsing tree)