ERROR RECOVERY

LR(k) parsing tables contain shift, reduce, accept, and error actions. On page 58 we claimed that when an LR parser encounters an error action it stops parsing and reports failure. This behavior would be unkind to the programmer, who would like to have all the errors in her program reported, not just the first error.

RECOVERY USING THE ERROR SYMBOL

Local error recovery mechanisms work by adjusting the parse stack and the input at the point where the error was detected in a way that will allow parsing to resume. One local recovery mechanism - found in many versions of the Yacc parser generator - uses a special error symbol to control the recovery process. Wherever the special error symbol appears in a grammar rule, a sequence of erroneous input tokens can be matched. For example, in a Yacc grammar we might have productions such as

exp → ID
exp → exp + ext
exp → ( exps )
exps → exp
exps → exps ; exp

Informally, we can specify that if a syntax error is encountered in the middle of an expression, the parser should skip to the next semicolon or right parenthesis (these are called synchronizing tokens) and resume parsing. We do this by adding error-recovery productions such as

exp → ( error )
exps → error ; exp

What does the parser generator do with the error symbol? In parser generation, error is considered a terminal symbol, and shift actions are entered in the parsing table for it as if it were an ordinary token. When the LR parser reaches an error state, it takes the following actions:

Pop the stack (if necessary) until a state is reached in which the action for the error token is shift.
Shift the error token.
Discard input symbols (if necessary) until a lookahead is reached that has a nonerror action in the current state.
Resume normal parsing.

In the two error productions illustrated above, we have taken care to follow the error symbol with an appropriate synchronizing token - in this case, a right parenthesis or semicolon. Thus, the "nonerror action" taken in step 3 will always shift. If instead we used the production exp → error, the "nonerror action" would be reduce, and (in an SLR or LALR parser) it is possible that the original (erroneous) lookahead symbol would cause another error after the reduce action, without having advanced the input. Therefore, grammar rules that contain error not followed by a token should be used only when there is no good alternative.

Caution

One can attach semantic actions to Yacc grammar rules; whenever a rule is reduced, its semantic action is executed. explains the use of semantic actions. Popping states from the stack can lead to seemingly "impossible" semantic actions, especially if the actions contain side effects. Consider this grammar fragment:

statements: statements exp SEMICOLON
 | statements error SEMICOLON
 | /* empty */
exp : increment exp decrement
 |ID increment: LPAREN {: nest=nest+1; :}
decrement: RPAREN {: nest=nest-1; :}

"Obviously" it is true that whenever a semicolon is reached, the value of nest is zero, because it is incremented and decremented in a balanced way according to the grammar of expressions. But if a syntax error is found after some left parentheses have been parsed, then states will be popped from the stack without "completing" them, leading to a nonzero value of nest. The best solution to this problem is to have side-effect-free semantic actions that build abstract syntax trees, as described in .

Unfortunately, neither JavaCC nor SableCC support the error-symbol errorrecovery method, nor the kind of global error repair described below.

GLOBAL ERROR REPAIR

Global error repair finds the smallest set of insertions and deletions that would turn the source string into a syntactically correct string, even if the insertions and deletions are not at a point where an LL or LR parser would first report an error. Burke-Fisher error repair We will describe a limited but useful form of global error repair, which tries every possible single-token insertion, deletion, or replacement at every point that occurs no earlier than K tokens before the point where the parser reported the error. Thus, with K = 15, if the parsing engine gets stuck at the 100th token of the input, then it will try every possible repair between the 85th and 100th tokens. The correction that allows the parser to parse furthest past the original reported error is taken as the best error repair. Thus, if a single-token substitution of var for type at the 98th token allows the parsing engine to proceed past the 104th token without getting stuck, this repair is a successful one. Generally, if a repair carries the parser R = 4 tokens beyond where it originally got stuck, this is "good enough." The advantage of this technique is that the LL(k) or LR(k) (or LALR, etc.) grammar is not modified at all (no error productions), nor are the parsing tables modified. Only the parsing engine, which interprets the parsing tables, is modified. The parsing engine must be able to back up K tokens and reparse. To do this, it needs to remember what the parse stack looked like K tokens ago. Therefore, the algorithm maintains two parse stacks: the current stack and the old stack. A queue of K tokens is kept; as each new token is shifted, it is pushed on the current stack and also put onto the tail of the queue; simultaneously, the head of the queue is removed and shifted onto the old stack. With each shift onto the old or current stack, the appropriate reduce actions are also performed. Image 3.39 illustrates the two stacks and queue.
Image 3.39: Burke-Fisher parsing, with an error-repair queue. Image 3.18 shows the complete parse of this string according to Table 3.19.

Now suppose a syntax error is detected at the current token. For each possible insertion, deletion, or substitution of a token at any position of the queue, the Burke-Fisher error repairer makes that change to within (a copy of) the queue, then attempts to reparse from the old stack. The success of a modification is in how many tokens past the current token can be parsed; generally, if three or four new tokens can be parsed, this is considered a completely successful repair. In a language with N kinds of tokens, there are K + K · N + K · N possible deletions, insertions, and substitutions within the K -token window. Trying this many repairs is not very costly, especially considering that it happens only when a syntax error is discovered, not during ordinary parsing. Semantic actions Shift and reduce actions are tried repeatedly and discarded during the search for the best error repair. Parser generators usually perform programmer-specified semantic actions along with each reduce action, but the programmer does not expect that these actions will be performed repeatedly and discarded - they may have side effects. Therefore, a Burke-Fisher parser does not execute any of the semantic actions as reductions are performed on the current stack, but waits until the same reductions are performed (permanently) on the old stack. This means that the lexical analyzer may be up to K + R tokens ahead of the point to which semantic actions have been performed. If semantic actions affect lexical analysis - as they do in C, compiling the typedef feature - this can be a problem with the Burke-Fisher approach. For languages with a pure context-free grammar approach to syntax, the delay of semantic actions poses no problem. Semantic values for insertions In repairing an error by insertion, the parser needs to provide a semantic value for each token it inserts, so that semantic actions can be performed as if the token had come from the lexical analyzer. For punctuation tokens no value is necessary, but when tokens such as numbers or identifiers must be inserted, where can the value come from? The ML-Yacc parser generator, which uses Burke-Fischer error correction, has a %value directive, allowing the programmer to specify what value should be used when inserting each kind of token:

%value ID ("bogus")
%value INT (1)
%value STRING ("")

Programmer-specified substitutions Some common kinds of errors cannot be repaired by the insertion or deletion of a single token, and sometimes a particular single-token insertion or substitution is very commonly required and should be tried first. Therefore, in an ML-Yacc grammar specification the programmer can use the %change directive to suggest error corrections to be tried first, before the default "delete or insert each possible token" repairs.

%change EQ -> ASSIGN | ASSIGN -> EQ
 | SEMICOLON ELSE -> ELSE | -> IN INT END

Here the programmer is suggesting that users often write "; else"where they mean "else" and so on. These particular error corrections are often useful in parsing the ML coding language.

The insertion of in 0 end is a particularly important kind of correction, known as a scope closer. Programs commonly have extra left parentheses or right parentheses, or extra left or right brackets, and so on. In ML, another kind of nesting construct is let … in … end. If the programmer forgets to close a scope that was opened by a left parenthesis, then the automatic singletoken insertion heuristic can close this scope where necessary. But to close a let scope requires the insertion of three tokens, which will not be done automatically unless the compiler-writer has suggested "change nothing to in 0 end" as illustrated in the %change command above.