USING PARSER GENERATORS

The task of constructing a parser is simple enough to be automated. In the previous chapter we described the lexical-analyzer aspects of JavaCC and SableCC. Here we will discuss the parser-generator aspects of these tools. Documentation for JavaCC and SableCC are available via this tutorial's Web site.

JAVACC

JavaCC is an LL(k) parser generator. Productions are of the form:

void Assignment() : {} { Identifier() "=" Expression() ";" }

where the left-hand side is Assignment(); the right-hand side is enclosed between the last two curly brackets; Assignment(), Identifier(), and Expression() are nonterminal symbols; and "=" and ";" are terminal symbols. Grammar 3.30 can be represented as a JavaCC grammar as shown in Grammar 3.31. Notice that if we had written the production for StmList() in the style of Grammar 3.30, that is,

void StmList() :
{}
{ Stm()
| StmList( ) ";" Stm()
}

GRAMMAR 3.30

P → L
S → id := id
S → while id do S
S → begin L end
S → if id then S
S → if id then S else S
L → S
L → L ; S

GRAMMAR 3.31: JavaCC version of Grammar 3.30.

PARSER_BEGIN(MyParser)
 public class MyParser {}
PARSER_END(MyParser)
SKIP :
{ " " | "\t" | "\n" }
TOKEN :
{ < WHILE: "while" >
| < BEGIN: "begin" >
| < END: "end" >
| < DO: "do" >
| < IF: "if" >
| < THEN: "then" >
| < ELSE: "else" >
| < SEMI: ";" >
| < ASSIGN: "=" >
| < ID: ["a"-"z"](["a"-"z"] | ["0"-"9"])* >
}
void Prog() :
{}
{ StmList() <EOF> }
void StmList() :
{}
{ Stm() StmListPrime() }
void StmListPrime() :
{}
{ ( ";" Stm() StmListPrime() )? }
void Stm() :
{}
{ <ID> "=" <ID>
| "while" <ID> "do" Stm()
| "begin" StmList() "end"
| LOOKAHEAD(5) /* we need to lookahead till we see "else" */
"if" <ID> "then" Stm()
| "if" <ID> "then" Stm() "else" Stm()
}

then the grammar would be left recursive. In that case, JavaCC would give the following error:

Left recursion detected: "StmList... --> StmList..."

We used the techniques mentioned earlier to remove the left recursion and arrive at Grammar 3.31.

SABLECC

SableCC is an LALR(1) parser generator. Productions are of the form:

assignment = identifier assign expression semicolon ;

where the left-hand side is assignment; the right-hand side is enclosed between = and ;; assignment, identifier, and expression are nonterminal symbols; and assign and semicolon are terminal symbols that are defined in an earlier part of the syntax specification. Grammar 3.30 can be represented as a SableCC grammar as shown in Grammar 3.32. When there is more than one alternative, SableCC requires a name for each alternative. A name is given to an alternative in the grammar by prefixing the alternative with an identifier between curly brackets. Also, if the same grammar symbol appears twice in the same alternative of a production, SableCC requires a name for at least one of the two elements. Element names are specified by prefixing the element with an identifier between square brackets followed by a colon.GRAMMAR 3.32: SableCC version of Grammar 3.30.

Tokens
 while = 'while';
 begin = 'begin';
 end = 'end';
 do = 'do';
 if = 'if';
 then = 'then';
 else = 'else';
 semi = ';';
 assign = '=';
 whitespace = (' ' | '\t' | '\n')+;
 id = ['a'..'z'](['a'..'z'] | ['0'..'9'])*;
Ignored Tokens
 whitespace;
Productions
 prog = stmlist;
 stm = {assign} [left]:id assign [right]:id |
 {while} while id do stm |
 {begin} begin stmlist end |
 {if_then} if id then stm |
 {if_then_else} if id then [true_stm]:stm else [false_stm]:stm;
 stmlist = {stmt} stm |
 {stmtlist} stmlist semi stm;

SableCC reports shift-reduce and reduce-reduce conflicts. A shift-reduce conflict is a choice between shifting and reducing; a reduce-reduce conflict is a choice of reducing by two different rules. SableCC will report that the Grammar 3.32 has a shift-reduce conflict. The conflict can be examined by reading the detailed error message SableCC produces, as shown in Image 3.33.

shift/reduce conflict in state [stack: TIf TId TThen PStm *] on TElse in {
 [ PStm = TIf TId TThen PStm * TElse PStm ] (shift),
 [ PStm = TIf TId TThen PStm * ] followed by TElse (reduce)
}

Image 3.33: SableCC shift-reduce error message for Grammar 3.32.

SableCC prefixes productions with an uppercase 'P' and tokens with an uppercase 'T', and replaces the first letter with an uppercase when it makes the objects for the tokens and productions. This is what you see on the stack in the error message in Image 3.33. So on the stack we have tokens for if, id, then, and a production that matches a stm, and now we have an else token. Clearly this reveals that the conflict is caused by the familiar dangling else. In order to resolve this conflict we need to rewrite the grammar, removing the ambiguity as in Grammar 3.34.GRAMMAR 3.34: SableCC productions of Grammar 3.32 with conflicts resolved.

Productions
 prog = stmlist;
 stm = {stm_without_trailing_substm}
 stm_without_trailing_substm |
 {while} while id do stm |
 {if_then} if id then stm |
 {if_then_else} if id then stm_no_short_if
 else [false_stm]:stm;
 stm_no_short_if = {stm_without_trailing_substm}
 stm_without_trailing_substm |
 {while_no_short_if}
 while id do stm_no_short_if |
 {if_then_else_no_short_if}
 if id then [true_stm]:stm_no_short_if
 else [fals_stm]:stm_no_short_if;
 stm_without_trailing_substm = {assign} [left]:id assign [right]:id |
 {begin} begin stmlist end ;
 stmlist = {stmt} stm | {stmtlist} stmlist semi stm;

PRECEDENCE DIRECTIVES

No ambiguous grammar is LR(k) for any k; the LR(k) parsing table of an ambiguous grammar will always have conflicts. However, ambiguous grammars can still be useful if we can find ways to resolve the conflicts. For example, Grammar 3.5 is highly ambiguous. In using this grammar to describe a coding language, we intend it to be parsed so that * and = bind more tightly than + and −, and that each operator associates to the left. We can express this by rewriting the unambiguous Grammar 3.8. But we can avoid introducing the T and F symbols and their associated "trivial" reductions E → T and T → F. Instead, let us start by building the LR(1) parsing table for Grammar 3.5, as shown in Table 3.35. We find many conflicts. For example, in state 13 with lookahead + we find a conflict between shift into state 8 and reduce by rule 3. Two of the items in state 13 are Java ScreenShot

Table 3.35: LR parsing table for Grammar 3.5.

In this state the top of the stack is … E * E. Shifting will lead to a stack … E * E+ and eventually … E * E + E with a reduction of E + E to E. Reducing now will lead to the stack … E and then the + will be shifted. The parse trees obtained by shifting and reducing are Java ScreenShot

If we wish * to bind tighter than +, we should reduce instead of shift. So we fill the (13, +) entry in the table with r3 and discard the s8 action. Conversely, in state 9 on lookahead *, we should shift instead of reduce, so we resolve the conflict by filling the (9, *) entry with s12. The case for state 9, lookahead + is Java ScreenShot

Shifting will make the operator right-associative; reducing will make it leftassociative. Since we want left associativity, we fill (9, +) with r5. Consider the expression a − b − c. In most coding languages, this associates to the left, as if written (a − b) − c. But suppose we believe that this expression is inherently confusing, and we want to force the programmer to put in explicit parentheses, either (a − b) − c or a − (b − c). Then we say that the minus operator is nonassociative, and we would fill the (11, −) entry with an error entry. The result of all these decisions is a parsing table with all conflicts resolved (Table 3.36).

Table 3.36: Conflicts of Table 3.35 resolved.

Yacc has precedence directives to indicate the resolution of this class of shift-reduce conflicts. (Unfortunately, SableCC does not have precedence directives.) A series of declarations such as

precedence nonassoc EQ, NEQ;
precedence left PLUS, MINUS;
precedence left TIMES, DIV;
precedence right EXP;

indicates that + and - are left-associative and bind equally tightly; that * and / are left-associative and bind more tightly than +; that ⁁ is right-associative and binds most tightly; and that = and ≠ are nonassociative, and bind more weakly than +. In examining a shift-reduce conflict such as Java ScreenShot

there is the choice of shifting a token and reducing by a rule. Should the rule or the token be given higher priority? The precedence declarations (precedence left, etc.) give priorities to the tokens; the priority of a rule is given by the last token occurring on the right-hand side of that rule. Thus the choice here is between a rule with priority * and a token with priority +; the rule has higher priority, so the conflict is resolved in favor of reducing. When the rule and token have equal priority, then a left precedence favors reducing, right favors shifting, and nonassoc yields an error action. Instead of using the default "rule has precedence of its last token", we can assign a specific precedence to a rule using the %prec directive. This is commonly used to solve the "unary minus" problem. In most coding languages a unary minus binds tighter than any binary operator, so −6 * 8 is parsed as (−6) * 8, not −(6 * 8). Grammar 3.37 shows an example.GRAMMAR 3.37: Yacc grammar with precedence directives.

%{ declarations of yylex and yyerror %}
%token INT PLUS MINUS TIMES UMINUS
%start exp
%left PLUS MINUS
%left TIMES
%left UMINUS
%%
exp : INT
 | exp PLUS exp
 | exp MINUS exp
 | exp TIMES exp
 | MINUS exp %prec UMINUS

The token UMINUS is never returned by the lexer; it's just a placeholder in the chain of precedence declarations. The directive %prec UMINUS gives the rule exp::= MINUS exp the highest precedence, so reducing by this rule takes precedence over shifting any operator, even a minus sign.

Precedence rules are helpful in resolving conflicts, but they should not be abused. If you have trouble explaining the effect of a clever use of precedence rules, perhaps instead you should rewrite the grammar to be unambiguous.

SYNTAX VERSUS SEMANTICS

Consider a coding language with arithmetic expressions such as x + y and boolean expressions such as x + y = z or a&(b = c). Arithmetic operators bind tighter than the boolean operators; there are arithmetic variables and boolean variables; and a boolean expression cannot be added to an arithmetic expression. Grammar 3.38 gives a syntax for this language.GRAMMAR 3.38: Yacc grammar with precedence directives.

%token ID ASSIGN PLUS MINUS AND EQUAL
%start stm
%left OR
%left AND
%left PLUS
%%
stm : ID ASSIGN ae
 | ID ASSIGN be be : be OR be
 | be AND be
 | ae EQUAL ae
 | ID ae : ae PLUS ae
 | ID

The grammar has a reduce-reduce conflict. How should we rewrite the grammar to eliminate this conflict? Here the problem is that when the parser sees an identifier such as a, it has no way of knowing whether this is an arithmetic variable or a boolean variable - syntactically they look identical. The solution is to defer this analysis until the "semantic" phase of the compiler; it's not a problem that can be handled naturally with context-free grammars. A more appropriate grammar is

S → id : E

E → id
E → E & E
E → E = E
E → E + E

Now the expression a + 5&b is syntactically legal, and a later phase of the compiler will have to reject it and print a semantic error message.