Parsing
syn-tax: the way in which words are put together to form phrases, clauses, or sentences.
Webster's Dictionary
OVERVIEW
The abbreviation mechanism discussed in the previous chapter, whereby a symbol stands for some regular expression, is convenient enough that it is tempting to use it in interesting ways:
These regular expressions define sums of the form 28+301+9. But now consider
This is meant to define expressions of the form:
(109+23) 61 (1+(250+3))
in which all the parentheses are balanced. But it is impossible for a finite automaton to recognize balanced parentheses (because a machine with N states cannot remember a parenthesis-nesting depth greater than N), so clearly sum and expr cannot be regular expressions. So how does a lexical analyzer implement regular-expression abbreviations such as digits? The answer is that the right-hand-side ([0-9]+) is simply substituted for digits wherever it appears in regular expressions, before translation to a finite automaton. This is not possible for the sum-and-expr language; we can first substitute sum into expr, yielding
but now an attempt to substitute expr into itself leads to
and the right-hand side now has just as many occurrences of expr as it did before - in fact, it has more! Thus, the notion of abbreviation does not add expressive power to the language of regular expressions - there are no additional languages that can be defined - unless the abbreviations are recursive (or mutually recursive, as are sum and expr). The additional expressive power gained by recursion is just what we need for parsing. Also, once we have abbreviations with recursion, we do not need alternation except at the top level of expressions, because the definition
can always be rewritten using an auxiliary definition as
In fact, instead of using the alternation mark at all, we can just write several allowable expansions for the same symbol:
The Kleene closure is not necessary, since we can rewrite it so that
becomes
What we have left is a very simple notation, called context-free grammars. Just as regular expressions can be used to define lexical structure in a static, declarative way, grammars define syntactic structure declaratively. But we will need something more powerful than finite automata to parse languages described by grammars.
In fact, grammars can also be used to describe the structure of lexical tokens, although regular expressions are adequate - and more concise - for that purpose.