Parsing

syn-tax: the way in which words are put together to form phrases, clauses, or sentences.
Webster's Dictionary

OVERVIEW

The abbreviation mechanism discussed in the previous chapter, whereby a symbol stands for some regular expression, is convenient enough that it is tempting to use it in interesting ways: Java ScreenShot

These regular expressions define sums of the form 28+301+9. But now consider Java ScreenShot

This is meant to define expressions of the form:

(109+23)
61
(1+(250+3))

in which all the parentheses are balanced. But it is impossible for a finite automaton to recognize balanced parentheses (because a machine with N states cannot remember a parenthesis-nesting depth greater than N), so clearly sum and expr cannot be regular expressions. So how does a lexical analyzer implement regular-expression abbreviations such as digits? The answer is that the right-hand-side ([0-9]+) is simply substituted for digits wherever it appears in regular expressions, before translation to a finite automaton. This is not possible for the sum-and-expr language; we can first substitute sum into expr, yielding Java ScreenShot

but now an attempt to substitute expr into itself leads to Java ScreenShot

and the right-hand side now has just as many occurrences of expr as it did before - in fact, it has more! Thus, the notion of abbreviation does not add expressive power to the language of regular expressions - there are no additional languages that can be defined - unless the abbreviations are recursive (or mutually recursive, as are sum and expr). The additional expressive power gained by recursion is just what we need for parsing. Also, once we have abbreviations with recursion, we do not need alternation except at the top level of expressions, because the definition Java ScreenShot

can always be rewritten using an auxiliary definition as Java ScreenShot

In fact, instead of using the alternation mark at all, we can just write several allowable expansions for the same symbol: Java ScreenShot

The Kleene closure is not necessary, since we can rewrite it so that Java ScreenShot

becomes Java ScreenShot

What we have left is a very simple notation, called context-free grammars. Just as regular expressions can be used to define lexical structure in a static, declarative way, grammars define syntactic structure declaratively. But we will need something more powerful than finite automata to parse languages described by grammars.

In fact, grammars can also be used to describe the structure of lexical tokens, although regular expressions are adequate - and more concise - for that purpose.