# PREDICTIVE PARSING

Some grammars are easy to parse using a simple algorithm known as recursive descent. In essence, each grammar production turns into one clause of a recursive function. We illustrate this by writing a recursive-descent parser for Grammar 3.11.

GRAMMAR 3.11
• S → if E then S else S
• S → begin S L
• S → print E
• L → end
• L → ; S L
• E → num = num

A recursive-descent parser for this language has one function for each nonterminal and one clause for each production.

```
final int IF=1, THEN=2, ELSE=3, BEGIN=4, END=5, PRINT=6,
          SEMI=7, NUM=8, EQ=9;

int tok = getToken();
void advance() { tok = getToken(); }
void eat(int t) { if (tok == t) advance(); else error(); }

void S() { switch (tok) {
    case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
    case BEGIN: eat(BEGIN); S(); L(); break;
    case PRINT: eat(PRINT); E(); break;
    default: error();
}}
void L() { switch (tok) {
    case END:  eat(END); break;
    case SEMI: eat(SEMI); S(); L(); break;
    default: error();
}}
void E() { eat(NUM); eat(EQ); eat(NUM); }
```
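To see the fragment run, one can wrap it in a class and supply the missing pieces. In this sketch the token array, the sentinel-returning getToken, the EOF token, and the main driver are illustrative assumptions not in the original; eat, S, L, and E follow the fragment above.

```java
// Self-contained sketch of the recursive-descent parser for Grammar 3.11.
public class Parser311 {
    static final int IF=1, THEN=2, ELSE=3, BEGIN=4, END=5, PRINT=6,
                     SEMI=7, NUM=8, EQ=9, EOF=0;
    // Token stream for:  begin print 1=2; print 3=4 end
    static int[] input = {BEGIN, PRINT, NUM, EQ, NUM, SEMI,
                          PRINT, NUM, EQ, NUM, END, EOF};
    static int pos = 0;
    static int tok;
    // Return EOF forever once the input is exhausted.
    static int getToken() { return pos < input.length ? input[pos++] : EOF; }
    static void advance() { tok = getToken(); }
    static void error() { throw new RuntimeException("syntax error at token " + tok); }
    static void eat(int t) { if (tok == t) advance(); else error(); }

    static void S() { switch (tok) {
        case IF:    eat(IF); E(); eat(THEN); S(); eat(ELSE); S(); break;
        case BEGIN: eat(BEGIN); S(); L(); break;
        case PRINT: eat(PRINT); E(); break;
        default: error();
    }}
    static void L() { switch (tok) {
        case END:  eat(END); break;
        case SEMI: eat(SEMI); S(); L(); break;
        default: error();
    }}
    static void E() { eat(NUM); eat(EQ); eat(NUM); }

    public static void main(String[] args) {
        advance();   // load the first token
        S();         // parse one statement
        eat(EOF);    // insist that all input was consumed
        System.out.println("parsed OK");
    }
}
```

Each call to eat consumes exactly one token, so the parser visits each input token once; an unexpected token raises an error immediately.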

With suitable definitions of error and getToken, this program will parse very nicely. Emboldened by success with this simple method, let us try it with Grammar 3.10:

```
void S() { E(); eat(EOF); }
void E() { switch (tok) {
    case ?: E(); eat(PLUS); T(); break;
    case ?: E(); eat(MINUS); T(); break;
    case ?: T(); break;
    default: error();
}}
void T() { switch (tok) {
    case ?: T(); eat(TIMES); F(); break;
    case ?: T(); eat(DIV); F(); break;
    case ?: F(); break;
    default: error();
}}
```

There is a conflict here: The E function has no way to know which clause to use. Consider the strings (1*2-3)+4 and (1*2-3). In the former case, the initial call to E should use the E → E + T production, but the latter case should use E → T. Recursive-descent, or predictive, parsing works only on grammars where the first terminal symbol of each subexpression provides enough information to choose which production to use. To understand this better, we will formalize the notion of FIRST sets, and then derive conflict-free recursive-descent parsers using a simple algorithm.

Just as lexical analyzers can be constructed from regular expressions, there are parser-generator tools that build predictive parsers. But if we are going to use a tool, then we might as well use one based on the more powerful LR(1) parsing algorithm, which will be described later in this chapter. Sometimes it's inconvenient or impossible to use a parser-generator tool. The advantage of predictive parsing is that the algorithm is simple enough that we can use it to construct parsers by hand - we don't need automatic tools.

Given a string γ of terminal and nonterminal symbols, FIRST(γ) is the set of all terminal symbols that can begin any string derived from γ. For example, let γ = T * F. Any string of terminal symbols derived from γ must start with id, num, or (. Thus, FIRST(T * F) = {id, num, (}. If two different productions X → γ1 and X → γ2 have the same left-hand-side symbol (X) and their right-hand sides have overlapping FIRST sets, then the grammar cannot be parsed using predictive parsing. If some terminal symbol I is in FIRST(γ1) and also in FIRST(γ2), then the X function in a recursive-descent parser will not know what to do if the input token is I.

The computation of FIRST sets looks very simple: If γ = X Y Z, it seems as if Y and Z can be ignored, and FIRST(X) is the only thing that matters. But consider Grammar 3.12. Because Y can produce the empty string - and therefore X can produce the empty string - we find that FIRST(X Y Z) must include FIRST(Z). Therefore, in computing FIRST sets, we must keep track of which symbols can produce the empty string; we say such symbols are nullable. And we must keep track of what might follow a nullable symbol.

GRAMMAR 3.12
• Z → d
• Z → X Y Z
• Y →
• Y → c
• X → Y
• X → a

With respect to a particular grammar, given a string γ of terminals and nonterminals,

• nullable(X) is true if X can derive the empty string.

• FIRST(γ) is the set of terminals that can begin strings derived from γ.
• FOLLOW(X) is the set of terminals that can immediately follow X. That is, t ∈ FOLLOW(X) if there is any derivation containing X t. This can occur if the derivation contains X Y Z t where Y and Z both derive ∊.

A precise definition of FIRST, FOLLOW, and nullable is that they are the smallest sets for which these properties hold: For each terminal symbol Z, FIRST[Z] = {Z}. For each production X → Y1 Y2 ... Yk:

• if Y1 ... Yk are all nullable (or if k = 0), then nullable[X] = true;
• for each i from 1 to k: if Y1 ... Yi−1 are all nullable (or if i = 1), then FIRST[X] ⊇ FIRST[Yi];
• if Yi+1 ... Yk are all nullable (or if i = k), then FOLLOW[Yi] ⊇ FOLLOW[X];
• for each j from i + 1 to k: if Yi+1 ... Yj−1 are all nullable (or if i + 1 = j), then FOLLOW[Yi] ⊇ FIRST[Yj].

Algorithm 3.13 for computing FIRST, FOLLOW, and nullable just follows from these facts; we simply replace each equation with an assignment statement, and iterate.

ALGORITHM 3.13: Iterative computation of FIRST, FOLLOW, and nullable.

Algorithm to compute FIRST, FOLLOW, and nullable. Initialize FIRST and FOLLOW to all empty sets, and nullable to all false.

```
for each terminal symbol Z
    FIRST[Z] ← {Z}
repeat
    for each production X → Y1 Y2 ... Yk
        if Y1 ... Yk are all nullable (or if k = 0)
            then nullable[X] ← true
        for each i from 1 to k, each j from i + 1 to k
            if Y1 ... Yi−1 are all nullable (or if i = 1)
                then FIRST[X] ← FIRST[X] ∪ FIRST[Yi]
            if Yi+1 ... Yk are all nullable (or if i = k)
                then FOLLOW[Yi] ← FOLLOW[Yi] ∪ FOLLOW[X]
            if Yi+1 ... Yj−1 are all nullable (or if i + 1 = j)
                then FOLLOW[Yi] ← FOLLOW[Yi] ∪ FIRST[Yj]
until FIRST, FOLLOW, and nullable did not change in this iteration
```
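The pseudocode above translates almost line for line into Java. The sketch below runs it on Grammar 3.12; the Production record and the string-based symbol encoding are illustrative assumptions, but the fixed-point loop follows Algorithm 3.13 directly.

```java
import java.util.*;

// Iterative computation of nullable, FIRST, and FOLLOW (Algorithm 3.13),
// applied to Grammar 3.12.
public class FirstFollow {
    record Production(String lhs, String[] rhs) {}

    static final List<Production> GRAMMAR = List.of(
        new Production("Z", new String[]{"d"}),
        new Production("Z", new String[]{"X", "Y", "Z"}),
        new Production("Y", new String[]{}),          // Y -> (empty)
        new Production("Y", new String[]{"c"}),
        new Production("X", new String[]{"Y"}),
        new Production("X", new String[]{"a"}));

    static Map<String, Boolean> nullable = new HashMap<>();
    static Map<String, Set<String>> first = new HashMap<>(), follow = new HashMap<>();

    // Are y[from] .. y[to-1] all nullable?  (Vacuously true when from >= to.)
    static boolean allNullable(String[] y, int from, int to) {
        for (int i = from; i < to; i++) if (!nullable.get(y[i])) return false;
        return true;
    }

    static void compute() {
        for (String s : List.of("X", "Y", "Z", "a", "c", "d")) {
            nullable.put(s, false);
            first.put(s, new HashSet<>());
            follow.put(s, new HashSet<>());
        }
        for (String t : List.of("a", "c", "d"))
            first.get(t).add(t);                        // FIRST[Z] = {Z} for terminals
        boolean changed = true;
        while (changed) {                               // iterate to a fixed point
            changed = false;
            for (Production p : GRAMMAR) {
                String[] y = p.rhs();
                int k = y.length;
                if (allNullable(y, 0, k) && !nullable.get(p.lhs())) {
                    nullable.put(p.lhs(), true);
                    changed = true;
                }
                for (int i = 0; i < k; i++) {
                    if (allNullable(y, 0, i))           // Y1..Yi-1 all nullable
                        changed |= first.get(p.lhs()).addAll(first.get(y[i]));
                    if (allNullable(y, i + 1, k))       // Yi+1..Yk all nullable
                        changed |= follow.get(y[i]).addAll(follow.get(p.lhs()));
                    for (int j = i + 1; j < k; j++)
                        if (allNullable(y, i + 1, j))   // Yi+1..Yj-1 all nullable
                            changed |= follow.get(y[i]).addAll(first.get(y[j]));
                }
            }
        }
    }

    public static void main(String[] args) {
        compute();
        System.out.println("nullable: " + nullable);
        System.out.println("FIRST:    " + first);
        System.out.println("FOLLOW:   " + follow);
    }
}
```

Set.addAll returns true exactly when it adds something new, so the changed flag falls out for free; the loop stops as soon as one full pass adds nothing.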

Of course, to make this algorithm efficient it helps to examine the productions in the right order. Also, the three relations need not be computed simultaneously; nullable can be computed by itself, then FIRST, then FOLLOW. This is not the first time that a group of equations on sets has become the algorithm for calculating those sets; recall the algorithm on page 28 for computing ∊-closure. Nor will it be the last time; the technique of iteration to a fixed point is applicable in dataflow analysis for optimization, in the back end of a compiler.

We can apply this algorithm to Grammar 3.12. Initially, we have:

|   | nullable | FIRST | FOLLOW |
|---|----------|-------|--------|
| X | no       |       |        |
| Y | no       |       |        |
| Z | no       |       |        |

In the first iteration, we find that a ∈ FIRST[X], Y is nullable, c ∈ FIRST[Y], d ∈ FIRST[Z], d ∈ FOLLOW[X], c ∈ FOLLOW[X], d ∈ FOLLOW[Y]. Thus:

|   | nullable | FIRST | FOLLOW |
|---|----------|-------|--------|
| X | no       | a     | c, d   |
| Y | yes      | c     | d      |
| Z | no       | d     |        |

In the second iteration, we find that X is nullable, c ∈ FIRST[X], {a, c} ⊆ FIRST[Z], {a, c, d} ⊆ FOLLOW[X], {a, c, d} ⊆ FOLLOW[Y]. Thus:

|   | nullable | FIRST   | FOLLOW  |
|---|----------|---------|---------|
| X | yes      | a, c    | a, c, d |
| Y | yes      | c       | a, c, d |
| Z | no       | a, c, d |         |

The third iteration finds no new information, and the algorithm terminates. It is useful to generalize the FIRST relation to strings of symbols:

• FIRST(Xγ) = FIRST[X], if not nullable[X]
• FIRST(Xγ) = FIRST[X] ∪ FIRST(γ), if nullable[X]

and similarly, we say that a string γ is nullable if each symbol in γ is nullable.

### CONSTRUCTING A PREDICTIVE PARSER

Consider a recursive-descent parser. The parsing function for some nonterminal X has a clause for each X production; it must choose one of these clauses based on the next token T of the input. If we can choose the right production for each (X, T), then we can write the recursive-descent parser. All the information we need can be encoded as a two-dimensional table of productions, indexed by nonterminals X and terminals T. This is called a predictive parsing table.

To construct this table, enter production X → γ in row X, column T of the table for each T ∈ FIRST(γ). Also, if γ is nullable, enter the production in row X, column T for each T ∈ FOLLOW[X]. Image 3.14 shows the predictive parser for Grammar 3.12. But some of the entries contain more than one production! The presence of duplicate entries means that predictive parsing will not work on Grammar 3.12.

Image 3.14: Predictive parsing table for Grammar 3.12.
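The table-construction rule can be carried out mechanically. In this sketch for Grammar 3.12, the FIRST sets of each right-hand side, the nullability flags, and the FOLLOW sets are hard-coded from the iteration results above so the example stands alone; the string encoding of productions and cells is an illustrative assumption.

```java
import java.util.*;

// Build the predictive parsing table for Grammar 3.12 and flag
// duplicate entries, which show the grammar is not LL(1).
public class PredictiveTable {
    // Maps "Nonterminal,terminal" to the list of productions entered there.
    static Map<String, List<String>> build() {
        String[] prods = {"Z -> d", "Z -> X Y Z", "Y -> ", "Y -> c", "X -> Y", "X -> a"};
        // FIRST of each production's right-hand side (from the text).
        Map<String, Set<String>> firstOfRhs = Map.of(
            "Z -> d", Set.of("d"),
            "Z -> X Y Z", Set.of("a", "c", "d"),
            "Y -> ", Set.of(),
            "Y -> c", Set.of("c"),
            "X -> Y", Set.of("c"),
            "X -> a", Set.of("a"));
        Map<String, Boolean> rhsNullable = Map.of(
            "Z -> d", false, "Z -> X Y Z", false, "Y -> ", true,
            "Y -> c", false, "X -> Y", true, "X -> a", false);
        Map<String, Set<String>> follow = Map.of(
            "X", Set.of("a", "c", "d"),
            "Y", Set.of("a", "c", "d"),
            "Z", Set.of());

        Map<String, List<String>> table = new TreeMap<>();
        for (String p : prods) {
            String lhs = p.substring(0, 1);
            Set<String> cols = new TreeSet<>(firstOfRhs.get(p)); // T in FIRST(rhs)
            if (rhsNullable.get(p))
                cols.addAll(follow.get(lhs));                    // T in FOLLOW[lhs]
            for (String t : cols)
                table.computeIfAbsent(lhs + "," + t, k -> new ArrayList<>()).add(p);
        }
        return table;
    }

    public static void main(String[] args) {
        build().forEach((cell, ps) -> System.out.println(
            cell + " : " + ps + (ps.size() > 1 ? "   <-- duplicate entry" : "")));
    }
}
```

Cells (X, a), (Y, c), and (Z, d) each receive two productions, exactly the duplicates visible in Image 3.14.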

If we examine the grammar more closely, we find that it is ambiguous. The sentence d has more than one parse tree: Z can derive d directly via Z → d, or via Z → X Y Z with X and Y each deriving the empty string and the final Z deriving d.

An ambiguous grammar will always lead to duplicate entries in a predictive parsing table. If we need to use the language of Grammar 3.12 as a programming language, we will need to find an unambiguous grammar.

Grammars whose predictive parsing tables contain no duplicate entries are called LL(1). This stands for left-to-right parse, leftmost-derivation, 1-symbol lookahead. Clearly a recursive-descent (predictive) parser examines the input left-to-right in one pass (some parsing algorithms do not, but these are generally not useful for compilers). The order in which a predictive parser expands nonterminals into right-hand sides (that is, the order in which the recursive-descent parser calls functions corresponding to nonterminals) is just the order in which a leftmost derivation expands nonterminals. And a recursive-descent parser does its job just by looking at the next token of the input, never looking more than one token ahead.

We can generalize the notion of FIRST sets to describe the first k tokens of a string, and to make an LL(k) parsing table whose rows are the nonterminals and whose columns are the sequences of k terminals. This is rarely done (because the tables are so large), but sometimes when you write a recursive-descent parser by hand you need to look more than one token ahead.

Grammars parsable with LL(2) parsing tables are called LL(2) grammars, and similarly for LL(3), etc. Every LL(1) grammar is an LL(2) grammar, and so on. No ambiguous grammar is LL(k) for any k.

### ELIMINATING LEFT RECURSION

Suppose we want to build a predictive parser for Grammar 3.10. The two productions

• E → E + T
• E → T

are certain to cause duplicate entries in the LL(1) parsing table, since any token in FIRST(T) will also be in FIRST(E + T). The problem is that E appears as the first right-hand-side symbol in an E-production; this is called left recursion. Grammars with left recursion cannot be LL(1). To eliminate left recursion, we will rewrite using right recursion. We introduce a new nonterminal E′, and write

• E → T E′
• E′ → + T E′
• E′ →

This derives the same set of strings (on T and +) as the original two productions, but now there is no left recursion. In general, whenever we have productions X → X γ and X → α, where α does not start with X, we know that this derives strings of the form αγ*, an α followed by zero or more γ. So we can rewrite the regular expression αγ* using right recursion:

• X → α X′
• X′ → γ X′
• X′ →

Applying this transformation to Grammar 3.10, we obtain Grammar 3.15.

GRAMMAR 3.15
• S → E $
• E → T E′
• E′ → + T E′
• E′ → − T E′
• E′ →
• T → F T′
• T′ → * F T′
• T′ → / F T′
• T′ →
• F → id
• F → num
• F → (E)

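The left-recursion transformation just applied can itself be written as a short program. This sketch handles immediate left recursion for one nonterminal X, splitting its productions into the X → X γ and X → α groups and emitting X and X′ as described above; the list-of-strings grammar representation is an illustrative assumption.

```java
import java.util.*;

// Eliminate immediate left recursion:
//   X -> X g1 | ... | X gn | a1 | ... | am
// becomes
//   X  -> a1 X' | ... | am X'
//   X' -> g1 X' | ... | gn X' | (empty)
public class LeftRecursion {
    static Map<String, List<List<String>>> eliminate(String x, List<List<String>> prods) {
        String xp = x + "'";                                // the new nonterminal X'
        List<List<String>> newX = new ArrayList<>(), newXp = new ArrayList<>();
        for (List<String> rhs : prods) {
            if (!rhs.isEmpty() && rhs.get(0).equals(x)) {   // X -> X gamma
                List<String> gamma = new ArrayList<>(rhs.subList(1, rhs.size()));
                gamma.add(xp);                              // X' -> gamma X'
                newXp.add(gamma);
            } else {                                        // X -> alpha
                List<String> alpha = new ArrayList<>(rhs);
                alpha.add(xp);                              // X -> alpha X'
                newX.add(alpha);
            }
        }
        newXp.add(new ArrayList<>());                       // X' -> (empty)
        Map<String, List<List<String>>> out = new LinkedHashMap<>();
        out.put(x, newX);
        out.put(xp, newXp);
        return out;
    }

    public static void main(String[] args) {
        // The E-productions of Grammar 3.10.
        List<List<String>> e = List.of(
            List.of("E", "+", "T"), List.of("E", "-", "T"), List.of("T"));
        eliminate("E", e).forEach((lhs, rhss) -> {
            for (List<String> rhs : rhss)
                System.out.println(lhs + " -> " + String.join(" ", rhs));
        });
    }
}
```

Run on Grammar 3.10's E-productions, this prints exactly the E and E′ rules of Grammar 3.15.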
To build a predictive parser, first we compute nullable, FIRST, and FOLLOW (Table 3.16). The predictive parser for Grammar 3.15 is shown in Table 3.17.

### LEFT FACTORING

We have seen that left recursion interferes with predictive parsing, and that it can be eliminated. A similar problem occurs when two productions for the same nonterminal start with the same symbols. For example:

• S → if E then S else S
• S → if E then S

In such a case, we can left factor the grammar - that is, take the allowable endings (else S and ∊) and make a new nonterminal X to stand for them:

• S → if E then S X
• X →
• X → else S

The resulting productions will not pose a problem for a predictive parser. Although the grammar is still ambiguous - the parsing table has two entries for the same slot (row X, column else) - we can resolve the ambiguity by always choosing the X → else S action, which matches each else with the nearest unmatched then.
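A recursive-descent clause for the left-factored productions makes this resolution concrete: the X function simply consumes an else whenever it sees one. The sketch below runs the factored parser on a dangling-else input; the token encoding, the depth bookkeeping used to observe which if captures the else, and the driver are illustrative assumptions (only the if and print forms of S are included, for brevity).

```java
// Left-factored S with X -> else S | (empty), preferring else S.
public class DanglingElse {
    static final int IF=1, THEN=2, ELSE=3, PRINT=6, NUM=8, EQ=9, EOF=0;
    // Token stream for:  if 1=2 then if 3=4 then print 5=6 else print 7=8
    static int[] input = {IF, NUM, EQ, NUM, THEN, IF, NUM, EQ, NUM, THEN,
                          PRINT, NUM, EQ, NUM, ELSE, PRINT, NUM, EQ, NUM, EOF};
    static int pos = 0, tok;
    static int depth = 0;        // nesting depth of if-statements
    static int elseDepth = -1;   // depth at which the else was consumed
    static void advance() { tok = pos < input.length ? input[pos++] : EOF; }
    static void error() { throw new RuntimeException("syntax error"); }
    static void eat(int t) { if (tok == t) advance(); else error(); }

    static void S() { switch (tok) {
        case IF:    eat(IF); E(); eat(THEN);
                    depth++; S(); X(); depth--; break;
        case PRINT: eat(PRINT); E(); break;
        default: error();
    }}
    static void X() {                 // X -> else S  |  X -> (empty)
        if (tok == ELSE) {            // prefer else S: nearest then wins
            elseDepth = depth;
            eat(ELSE); S();
        }
    }
    static void E() { eat(NUM); eat(EQ); eat(NUM); }

    public static void main(String[] args) {
        advance(); S(); eat(EOF);
        // The else attaches to the inner (depth 2) if, not the outer one.
        System.out.println("else bound at depth " + elseDepth);
    }
}
```

Because the inner S finishes with the else still pending, the inner call to X sees it first; the outer X then finds EOF and takes the empty production.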

### ERROR RECOVERY

Armed with a predictive parsing table, it is easy to write a recursive-descent parser. Here is a representative fragment of a parser for Grammar 3.15:

```
void T() { switch (tok) {
    case ID:
    case NUM:
    case LPAREN: F(); Tprime(); break;
    default: error();
}}
void Tprime() { switch (tok) {
    case PLUS: break;
    case TIMES: eat(TIMES); F(); Tprime(); break;
    case EOF: break;
    case RPAREN: break;
    default: error();
}}
```

A blank entry in row T, column x of the LL(1) parsing table indicates that the parsing function T() does not expect to see token x - this will be a syntax error. How should error be handled? It is safe just to raise an exception and quit parsing, but this is not very friendly to the user. It is better to print an error message and recover from the error, so that other syntax errors can be found in the same compilation. A syntax error occurs when the string of input tokens is not a sentence in the language. Error recovery is a way of finding some sentence similar to that string of tokens. This can proceed by deleting, replacing, or inserting tokens. For example, error recovery for T could proceed by inserting a num token. It's not necessary to adjust the actual input; it suffices to pretend that the num was there, print a message, and return normally.

```
void T() { switch (tok) {
    case ID:
    case NUM:
    case LPAREN: F(); Tprime(); break;
    default: print("expected id, num, or left-paren");
}}
```

It's a bit dangerous to do error recovery by insertion, because if the error cascades to produce another error, the process might loop infinitely. Error recovery by deletion is safer, because the loop must eventually terminate when end-of-file is reached. Simple recovery by deletion works by skipping tokens until a token in the FOLLOW set is reached. For example, error recovery for T′ could work like this:

```
int Tprime_follow[] = {PLUS, RPAREN, EOF};
void Tprime() { switch (tok) {
    case PLUS: break;
    case TIMES: eat(TIMES); F(); Tprime(); break;
    case RPAREN: break;
    case EOF: break;
    default: print("expected +, *, right-paren, or end-of-file");
             skipto(Tprime_follow);
}}
```
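The skipto used above is not defined in the fragment; a minimal sketch is a loop that discards tokens until one of the synchronizing tokens appears. The token encoding, input stream, and driver here are illustrative assumptions; only the skipto logic is the point.

```java
// Error recovery by deletion: skip tokens until one in the stop set
// (typically FOLLOW of the current nonterminal) is reached.
public class Recovery {
    static final int EOF=0, PLUS=1, TIMES=2, RPAREN=3, ID=4;
    // Junk tokens followed by a right-paren: ID ID * ID ) <eof>
    static int[] input = {ID, ID, TIMES, ID, RPAREN, EOF};
    static int pos = 0, tok;
    static void advance() { tok = pos < input.length ? input[pos++] : EOF; }

    static int[] Tprime_follow = {PLUS, RPAREN, EOF};

    static void skipto(int[] stop) {
        while (tok != EOF) {                        // deletion terminates at EOF
            for (int s : stop)
                if (tok == s) return;               // resynchronized on FOLLOW
            advance();                              // delete the offending token
        }
    }

    public static void main(String[] args) {
        advance();                 // tok = ID, which is not in FOLLOW(T')
        skipto(Tprime_follow);     // skips ID ID * ID, stops at )
        System.out.println(tok == RPAREN ? "resynchronized at )" : "?");
    }
}
```

Since every pass through the loop either returns or consumes a token, skipto always terminates by end-of-file, which is why recovery by deletion is safe where recovery by insertion can loop.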

A recursive-descent parser's error-recovery mechanisms must be adjusted (sometimes by trial and error) to avoid a long cascade of error-repair messages resulting from a single token out of place.
