Parsing

syn-tax: the way in which words are put together to form phrases, clauses, or sentences.

Webster's Dictionary

OVERVIEW

The abbreviation mechanism discussed in the , whereby a symbol stands for some regular expression, is convenient enough that it is tempting to use it in interesting ways:Java ScreenShot

These regular expressions define sums of the form 28+301+9. But now considerJava ScreenShot

This is meant to define expressions of the form:

(109+23)
61
(1+(250+3))

in which all the parentheses are balanced. But it is impossible for a finite automaton to recognize balanced parentheses (because a machine with N states cannot remember a parenthesis-nesting depth greater than N), so clearly sum and expr cannot be regular expressions. So how does a lexical analyzer implement regular-expression abbreviations such as digits? The answer is that the right-hand-side ([0-9]+) is simply substituted for digits wherever it appears in regular expressions, before translation to a finite automaton. This is not possible for the sum-and-expr language; we can first substitute sum into expr, yieldingJava ScreenShot

but now an attempt to substitute expr into itself leads toJava ScreenShot

and the right-hand side now has just as many occurrences of expr as it did before - in fact, it has more! Thus, the notion of abbreviation does not add expressive power to the language of regular expressions - there are no additional languages that can be defined - unless the abbreviations are recursive (or mutually recursive, as are sum and expr). The additional expressive power gained by recursion is just what we need for parsing. Also, once we have abbreviations with recursion, we do not need alternation except at the top level of expressions, because the definitionJava ScreenShot

can always be rewritten using an auxiliary definition asJava ScreenShot

In fact, instead of using the alternation mark at all, we can just write several allowable expansions for the same symbol:Java ScreenShot

The Kleene closure is not necessary, since we can rewrite it so thatJava ScreenShot

becomesJava ScreenShot

What we have left is a very simple notation, called context-free grammars. Just as regular expressions can be used to define lexical structure in a static, declarative way, grammars define syntactic structure declaratively. But we will need something more powerful than finite automata to parse languages described by grammars.

In fact, grammars can also be used to describe the structure of lexical tokens, although regular expressions are adequate - and more concise - for that purpose.