CS 4980 CC, Note 2

could just hand-write the lexer code, and this would be fine for many applications (though probably not for large languages)
complexities: identifiers, strings, comments, integers, floats, macro pre-processing
solution: tool to convert pattern descriptions into machines

Extend set of REs as follows:
- . (dot) - all but newline
- [^x]: all but x
- quotes: quote characters such as | that might be interpreted as parts of REs
- backslash for non-printing characters: \n, \a, \001
  - pattern for all characters?
disambiguating rules
- if more than one RE can match, pick longest
  - applies to "ifstmt" (one identifier, not "if" followed by "stmt") and to "<>" and "<<" (each a single token, not two)
- if two REs match same string, use the first one
  - handles "if" before identifier
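A minimal Python sketch of the two disambiguating rules, assuming a hypothetical rule list (the token names IF, ID, NE, SHL, LT are made up for illustration) and using Python's re module in place of a generated machine:

```python
import re

# Hypothetical rules in priority order: (token name, regular expression).
RULES = [
    ("IF", r"if"),
    ("ID", r"[a-z][a-z0-9]*"),
    ("NE", r"<>"),
    ("SHL", r"<<"),
    ("LT", r"<"),
]


def next_token(text, pos=0):
    """Return (token name, lexeme) for the best match at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(text, pos)
        # A strictly longer match wins; on a tie the earlier rule is kept.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best
```

With these rules, "ifstmt" comes back as a single ID (longest match), while "if" alone comes back as IF because the IF rule is listed before ID.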
lexical specification file:
- set of REs and actions; when an RE matches, perform its action
- usual action: return token to parser
should be complete
- usually have a dot rule as the last rule in the file, with action "illegal token"

states, transitions, start state (arrow), final states (double-circle)
- acceptance
- language: set of accepted inputs
examples: top part of Fig. 2.3
- a-z means 26 parallel lines labeled a through z (avoids clutter)
- deterministic finite automaton (DFA): at most one transition from each state for each input symbol, and no ε-transitions
Can show: every RE can be translated into a DFA and vice-versa
more examples
- see above REs
- odd number of 0's and 1's
- fig 2.3
  - show execution!
- Figure 2.4: combined results
  - execute on "if", "ifnot", "23.4"
Translating DFA's to code
- book's method: transition matrix
- alternative: case statement
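A rough sketch of the transition-matrix idea in Python. The machine (identifiers and integers only), its state numbers, and the use of character classes as columns are assumptions made here to keep the table small; they are not taken from the book's figures:

```python
# Character classes index the columns of the matrix. A full table would
# have one column per input character instead.
LETTER, DIGIT, OTHER = 0, 1, 2


def char_class(c):
    if c.isascii() and c.isalpha():
        return LETTER
    if c.isascii() and c.isdigit():
        return DIGIT
    return OTHER


# Rows are states: 0 is the dead state, 1 is the start state.
#   letter digit other
TRANSITIONS = [
    [0, 0, 0],  # 0: dead
    [2, 3, 0],  # 1: start
    [2, 2, 0],  # 2: identifier
    [0, 3, 0],  # 3: number
]
FINAL = {2: "ID", 3: "NUM"}


def run(text):
    """Run the DFA over all of text; return the token kind or None."""
    state = 1
    for c in text:
        state = TRANSITIONS[state][char_class(c)]
    return FINAL.get(state)
```

The case-statement alternative would replace the table lookup with a branch on the current state, each branch testing the input character directly.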
Recognizing the longest match
- track last final state reached and input position when at that state
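The bookkeeping for the longest match can be sketched as follows. The DFA is a small hypothetical one for "if", identifiers, integers, and reals; its state numbers are illustrative, and REAL here requires digits on both sides of the point, which is a simplifying assumption:

```python
FINAL = {2: "ID", 3: "IF", 4: "ID", 5: "NUM", 7: "REAL"}


def step(state, c):
    """Transition function of the hypothetical DFA; 0 is the dead state."""
    letter = c.isascii() and c.isalpha()
    digit = c.isascii() and c.isdigit()
    if state == 1:                      # start
        if c == "i":
            return 2
        if letter:
            return 4
        if digit:
            return 5
    elif state == 2:                    # seen "i"
        if c == "f":
            return 3
        if letter or digit:
            return 4
    elif state in (3, 4):               # seen "if" or some identifier
        if letter or digit:
            return 4
    elif state == 5:                    # seen digits
        if digit:
            return 5
        if c == ".":
            return 6
    elif state in (6, 7):               # seen digits and "."
        if digit:
            return 7
    return 0


def longest_match(text, start=0):
    """Return (token kind, lexeme) for the longest token at start, or None."""
    state = 1
    last_final = None   # token kind of the last final state reached
    last_end = start    # input position just after reaching that state
    pos = start
    while pos < len(text) and state != 0:
        state = step(state, text[pos])
        pos += 1
        if state in FINAL:
            last_final, last_end = FINAL[state], pos
    if last_final is None:
        return None
    return last_final, text[start:last_end]
```

On "23.x" the machine reads past "23." before dying, then falls back to the remembered final state and reports NUM "23"; scanning would resume at the ".".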

Nondeterministic finite automata (NFAs): states may have more than one outgoing transition with the same symbol, or may make a transition on ε (without consuming input)
- see bottom page 24
- [What does this compute?]
why NFAs?
- notion of "guessing right input" is neat
- model randomness
- easy for implementing regular expressions
converting an RE into an NFA
- basic structure: transition into state labeled by symbol (p. 25)
  - example: piece of machine for "ab"
  - machine for "a|b|c" (don't forget the ε coming in)
  - machine for "a|ε"
  - machine for "(ab)*"
  - final state: reached end of RE
- full construction: see Figure 2.6
Figure 2.7: "if" or identifier or number or error
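One way to sketch the RE-to-NFA translation in Python. This is a Thompson-style construction that may differ in detail from Figure 2.6; the tuple encoding of REs ("sym", "eps", "cat", "alt", "star") and the NFA class are assumptions made for illustration:

```python
import itertools


class NFA:
    """Edges are (source, label, target); label None is an ε-transition."""

    def __init__(self):
        self.edges = []
        self._ids = itertools.count()

    def new_state(self):
        return next(self._ids)

    def add(self, src, label, dst):
        self.edges.append((src, label, dst))


def build(nfa, re_ast):
    """Add a fragment for re_ast to nfa; return its (start, final) states."""
    kind = re_ast[0]
    start, final = nfa.new_state(), nfa.new_state()
    if kind == "sym":                   # single symbol
        nfa.add(start, re_ast[1], final)
    elif kind == "eps":                 # empty string
        nfa.add(start, None, final)
    elif kind == "cat":                 # M N: run M, then N
        s1, f1 = build(nfa, re_ast[1])
        s2, f2 = build(nfa, re_ast[2])
        nfa.add(start, None, s1)
        nfa.add(f1, None, s2)
        nfa.add(f2, None, final)
    elif kind == "alt":                 # M | N | ...: ε into each branch
        for sub in re_ast[1:]:
            s, f = build(nfa, sub)
            nfa.add(start, None, s)
            nfa.add(f, None, final)
    elif kind == "star":                # M*: zero or more passes through M
        s, f = build(nfa, re_ast[1])
        nfa.add(start, None, final)
        nfa.add(start, None, s)
        nfa.add(f, None, s)
        nfa.add(f, None, final)
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return start, final
```

For example, "(ab)*" would be written ("star", ("cat", ("sym", "a"), ("sym", "b"))). The construction adds more ε-edges than strictly necessary, which keeps each case independent of the others.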

hard to build an actual machine which can "guess"
solution: try all possibilities at once
- track all states the machine could be in!
- failure: list of valid states is empty
- success: list of valid states contains a final state when the end of input is reached
- example: run Fig. 2.7 on input "in"
- note: use notion of ε-closure
See textbook Ch. 2 for details...
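A small Python sketch of tracking all possible states with ε-closure. The NFA below is a hypothetical cut-down machine ("if" or a lowercase identifier) with made-up state numbers, not the machine of Fig. 2.7:

```python
import string

EPS = None  # label used for ε-transitions

# (state, label) -> set of target states.
# State 0 is the start; 3 is final for "if"; 5 is final for identifiers.
DELTA = {(0, EPS): {1, 4}, (1, "i"): {2}, (2, "f"): {3}}
for ch in string.ascii_lowercase:
    DELTA[(4, ch)] = {5}
    DELTA[(5, ch)] = {5}
FINAL = {3, 5}


def closure(states):
    """All states reachable from states using only ε-transitions."""
    result, work = set(states), list(states)
    while work:
        s = work.pop()
        for t in DELTA.get((s, EPS), ()):
            if t not in result:
                result.add(t)
                work.append(t)
    return frozenset(result)


def move(states, c):
    """ε-closure of the states reachable from states on symbol c."""
    targets = set()
    for s in states:
        targets |= DELTA.get((s, c), set())
    return closure(targets)


def accepts(text):
    current = closure({0})
    for c in text:
        current = move(current, c)
        if not current:     # failure: no valid states remain
            return False
    return any(s in FINAL for s in current)
```

On input "in" the state sets are {0, 1, 4}, then {2, 5} after "i", then {5} after "n"; the last set contains a final state, so the input is accepted.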
recognizing tokens:
- apply algorithm for recognizing longest tokens (see above)
- when a final state has been reached and continuing fails, return the token for the first rule in the specification file that matches

Converting an NFA to a DFA can create a machine with more states than needed
- in particular, can combine states [10,11,13,15] and [11,12,13]
- more generally: states s₁ and s₂ are equivalent when an input is accepted by starting in state s₁ iff that input is accepted by starting in state s₂
  - can then change all of s₂'s incoming edges to point to s₁ and delete s₂
  - can build algorithm to identify such states
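One possible algorithm for finding equivalent states, sketched in Python as a partition refinement: start with final versus non-final states and keep splitting until nothing changes. The function name and the dictionary encoding of the DFA are assumptions made here, and the DFA must be complete (add an explicit dead state if needed):

```python
def minimize(states, alphabet, delta, finals):
    """Map each state to the representative of its equivalence class.

    delta[(state, symbol)] must be defined for every state and symbol.
    """
    order = sorted(states)
    # Initial partition: final states (1) versus non-final states (0).
    block = {s: int(s in finals) for s in order}
    while True:
        ids = {}
        refined = {}
        for s in order:
            # States stay together only if they share a block and every
            # symbol sends them into the same block.
            signature = (block[s],
                         tuple(block[delta[(s, a)]] for a in alphabet))
            refined[s] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            break           # no block was split: partition is stable
        block = refined
    # Use the smallest state of each block as its representative.
    representative = {}
    for s in order:
        representative.setdefault(block[s], s)
    return {s: representative[block[s]] for s in order}
```

Redirecting every edge through the returned map and dropping the states that are not their own representative gives the smaller machine.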

Note every DFA is automatically an NFA: can convert back and forth
Every DFA has an equivalent RE (construction not given in text)
- basic method: sequence = concatenation, choice = alternation, loop = *, ε-transition = ε
RE's, DFAs, NFAs are "equivalent" in power (in sets of languages that can be recognized/specified)
Can we construct an NFA for aⁿbⁿ?
- no! would require "matching" a's and b's, and an NFA has only finitely many states, so it cannot count arbitrarily high
- impact: cannot handle matching nested parens, braces, begin/end pairs, etc.

We will use SableCC
SableCC: takes specification (using REs) and generates Java classes which implement that specification
SableCC specification (input): up to six sections (all optional):
1. package declaration; determines package for resulting Java classes
2. helper declarations: abbreviations
3. state declarations: allow lexical analyzer to recognize certain tokens only when in some particular state; very handy for "modal" input, but not covered here
4. token declarations: definitions of tokens using REs
5. ignored tokens: tokens such as whitespace that are thrown away
6. productions: grammar for language
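A small specification fragment showing sections 1, 2, 4, and 5, offered as a sketch rather than a tested grammar. The package name demo and all helper and token names are made up for illustration; the state declarations and productions sections are simply omitted:

```
Package demo;

Helpers
  digit  = ['0' .. '9'];
  letter = ['a' .. 'z'];

Tokens
  if     = 'if';
  id     = letter (letter | digit)*;
  number = digit+;
  blank  = (' ' | 9 | 10 | 13)+;

Ignored Tokens
  blank;
```

The numbers 9, 10, and 13 are the character codes for tab, newline, and carriage return. Listing the if token before id is what lets the "use the first rule" tie-break pick the keyword.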

RE -> NFA -> DFA: all equivalent
issue: how efficient is an automatically-generated lexer?
- transition matrix version can be very large; must be careful
- Gray [1988]: DFAs translated directly into executable code (using case statements) can run as fast as hand-coded lexer
SableCC: tool for generating these
- other tools: JavaCC, flex