
*syn-tax*: the way in which words are put together to form phrases, clauses, or sentences. (*Webster's Dictionary*)

The abbreviation mechanism discussed in the previous chapter, whereby a symbol stands for some regular expression, is convenient enough that it is tempting to use it in interesting ways:

```
digits = [0-9]+
sum    = (digits "+")* digits
```

These regular expressions define sums of the form `28+301+9`. But now consider

```
digits = [0-9]+
sum    = (expr "+")* expr
expr   = "(" sum ")" | digits
```

This is meant to define expressions of the form:

```
(109+23)+61+(1+(250+3))
```

in which all the parentheses are balanced. But it is impossible for a finite automaton to recognize balanced parentheses (because a machine with *N* states cannot remember a parenthesis-nesting depth greater than *N*), so clearly *sum* and *expr* cannot be regular expressions. So how does a lexical analyzer implement regular-expression abbreviations such as `digits`? The answer is that the right-hand side `([0-9]+)` is simply substituted for `digits` wherever it appears in regular expressions, *before* translation to a finite automaton. This is not possible for the *sum*-and-*expr* language; we can first substitute *sum* into *expr*, yielding

```
expr = "(" (expr "+")* expr ")" | digits
```

but now an attempt to substitute *expr* into itself leads to

```
expr = "(" ( ("(" (expr "+")* expr ")" | digits) "+" )*
           ( "(" (expr "+")* expr ")" | digits ) ")" | digits
```

and the right-hand side now has just as many occurrences of *expr* as it did before - in fact, it has more! Thus, the notion of abbreviation does not add expressive power to the language of regular expressions - there are no additional languages that can be defined - unless the abbreviations are recursive (or mutually recursive, as are *sum* and *expr*). The additional expressive power gained by recursion is just what we need for parsing. Also, once we have abbreviations with recursion, we do not need alternation except at the top level of expressions, because a definition such as

```
expr = a b (c | d) e
```

can always be rewritten using an auxiliary definition as

```
aux  = c | d
expr = a b aux e
```

In fact, instead of using the alternation mark at all, we can just write several allowable expansions for the same symbol:

```
aux  = c
aux  = d
expr = a b aux e
```

The Kleene closure is not necessary, since we can rewrite it so that

```
expr = (a b c)*
```

becomes

```
expr = a b c expr
expr = ε
```
What we have left is a very simple notation, called *context-free grammars*. Just as regular expressions can be used to define lexical structure in a static, declarative way, grammars define syntactic structure declaratively. But we will need something more powerful than finite automata to parse languages described by grammars.
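To make the contrast with finite automata concrete, the *sum*-and-*expr* grammar above can be recognized by a few mutually recursive methods - one per symbol of the grammar. This is a minimal sketch (the class and method names are illustrative, not from this book); the point is that the implicit call stack remembers arbitrarily deep parenthesis nesting, which no fixed set of states can do:

```java
// Recursive-descent recognizer for:
//   sum    = (expr "+")* expr
//   expr   = "(" sum ")" | digits
//   digits = [0-9]+
public class SumRecognizer {
    private final String input;
    private int pos = 0;

    private SumRecognizer(String input) { this.input = input; }

    // Accepts iff the entire string is a well-formed sum.
    public static boolean recognize(String s) {
        SumRecognizer r = new SumRecognizer(s);
        return r.sum() && r.pos == s.length();   // must consume all input
    }

    // sum = (expr "+")* expr
    private boolean sum() {
        if (!expr()) return false;
        while (pos < input.length() && input.charAt(pos) == '+') {
            pos++;                               // consume '+'
            if (!expr()) return false;
        }
        return true;
    }

    // expr = "(" sum ")" | digits
    // The recursive call to sum() is what matches balanced parentheses.
    private boolean expr() {
        if (pos < input.length() && input.charAt(pos) == '(') {
            pos++;                               // consume '('
            if (!sum()) return false;
            if (pos < input.length() && input.charAt(pos) == ')') {
                pos++;                           // consume ')'
                return true;
            }
            return false;                        // unbalanced parenthesis
        }
        return digits();
    }

    // digits = [0-9]+
    private boolean digits() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        return pos > start;                      // at least one digit
    }

    public static void main(String[] args) {
        System.out.println(recognize("(109+23)+61+(1+(250+3))")); // true
        System.out.println(recognize("(109+23"));                 // false
    }
}
```

Because `(` can never begin a `digits` token, each method can decide which alternative to try by looking at a single character, so no backtracking is needed - a property that later chapters develop into predictive parsing.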

In fact, grammars can also be used to describe the structure of lexical tokens, although regular expressions are adequate - and more concise - for that purpose.
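For instance, the `digits` token defined by `[0-9]+` could be written - as a sketch, using the rewrites above to eliminate the Kleene closure - with one production per expansion:

```
digits = digit digits
digits = digit
digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

but the regular expression says the same thing in one line.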
