Recursive Descent Parsing
Specifying Lexical Analyzers
- could just write code, and this would be fine for many applications
(though probably not for large languages)
- complexities: identifiers, strings, comments, integers, floats, macro
pre-processing
- solution: tool to convert pattern descriptions into machines
Regular Expressions
- goal: define sets of strings
- language: set of strings specified by some RE
- notation
- symbols (a, b, X, '[')
- alternation: a | b
- concatenation: ab or a \cdot b
- example: "else" in any mix of letters
- ε: empty string
- example: aba with optional b
- repetition: "Kleene closure"
- example: string of a's, string of a's and b's, string of a followed by b
- examples:
- strings of 0's and 1's with at least one 1
- strings of a's and b's with at least 3 a's in them
- strings of a's and b's with at least 3 consecutive a's in them
- strings of a's and b's with no double'd a's or b's
- strings of a's and b's with every a separated by at least one b
- shorthand:
- [see Fig. 2.1]
- exercises: C++ floats, comments
Lexical Specifications
- Extend set of REs as follows:
- . (dot) - all but newline
- [^x]: all but x
- quotes: quote characters such as | that might be interpreted
as parts of REs
- backslash for non-ascii characters: \n, \a, \001
- pattern for all characters?
- disambiguating rules
- if more than one RE can match, pick longest
- if two REs match same string, use the first one
- handles "if" before identifier
- lexical specification file:
- set of RE's and actions; when match RE, perform action
- usual action: return token to parser
- should be complete
- usually have dot at end of file with action "illegal token"
Finite Automata
- states, transitions, start state (arrow), final states (double-circle)
- acceptance
- language: set of accepted inputs
- examples: top part
of Fig. 2.3
- a-z means 26 parallel lines labeled a through z (avoids clutter)
- deterministic finite automata: only one transition from each
state for each input
- Can show: every RE can be translated into a DFA and vice-versa
- more examples
- see above REs
- odd number of 0's and 1's
- fig 2.3
- Figure 2.4: combined results
- execute on "if", "ifnot", "23.4"
- Translating DFA's to code
- book's method: transition matrix
- alternative: case statement
- Recognizing the longest match
- track last final state reached and input position when at that state
Nondeterministic Finite Automata
- States may have more than one outgoing transition with the same symbol
or may do a transition on ε
- why NFAs?
- notion of "guessing right input" is neat
- model randomness
- easy for implementing regular expressions
- converting an RE into an NFA
- basic structure: transition into state labeled by symbol (p. 25)
- example: piece of machine for "ab"
- machine for "a|b|c" (don't forget the ε coming in)
- machine for "a|ε"
- machine for "(ab)*"
- final state: reached end of RE
- full construction:
see Figure 2.6
- Figure 2.7: "if" or identifier or number or error
Converting an NFA into a DFA
- hard to build an actual machine which can "guess"
- solution: try all posibilities at once
- track all states could be in!
- failure: list of valid states is empty
- success: list of valid states contains final state when reach end of
input
- example: run Fig. 2.7 on input "in"
- note: use notion of ε-closure
- See textbook Ch. 2 for details...
- recognizing tokens:
- apply algorithm for recognizing longest tokens (see above)
- when reach final state (and continuing fails), return token from
first rule in the input file
Minimizing DFAs
- Converting an NFA to a DFA can create a machine with more states than needed
- in particular, can combine states [10,11,13,15] and [11,12,13]
- more generally: states s1 and s2 are
equivalent when an input is accepted by starting in state
s1 iff that input is accepted by starting in state s2
- can then change all of s2's incoming edges to point to
s1 and delete s2
- can build algorithm to identify such states
Review
- REs
- DFAs
- NFAs
- REs to NFAs
- NFAs to DFAs
- hence, can construct DFA for any RE
Properties of REs and finite automata
- Note every DFA is automatically an NFA: can convert back and forth
- Every DFA has an equivalent RE (construction not given in text)
- basic method: sequence = concatenation, choice = alternation, loop = *,
ε-transition = ε
- RE's, DFAs, NFAs are "equivalent" in power (in sets of languages that
can be recognized/specified)
- Can we construct a NFA for anbn?
- no! would require "matching" a's and b's - NFA has no real memory
- impact: cannot handle matching nested parens, braces, begin/end pairs,
etc.
Lexical-Analyzer Generators
- We will use SableCC
- SableCC: takes specification (using REs) and generates Java classes
which implement that specification
- SableCC specification (input): up to six sections (all optional):
- package declaration; determines package for resulting Java class
- helper declarations: abbreviations
- state declarations: allow lexical analyzer to recognize certain tokens
only when in some particular state; very handy for "modal" input, but not
covered here
- token declarations: definitions of tokens using REs
- ignored tokens: tokens such as whitespace that are thrown away
- productions: grammar for language
Review
- RE -> NFA -> DFA: all equivalent
- issue: how effecient is an automatically-generated lexer?
- transition matrix version can be very large; must be careful
- Gray [1988]: DFAs translated directly into executable code (using case
statements) can run as fast as hand-coded lexer
- SableCC: tool for generating these
- other tools: JavaCC, flex