LR PARSING

The weakness of LL(k) parsing techniques is that they must predict which production to use, having seen only the first k tokens of the right-hand side. A more powerful technique, LR(k) parsing, is able to postpone the decision until it has seen input tokens corresponding to the entire right-hand side of the production in question (and k more input tokens beyond). LR(k) stands for left-to-right parse, rightmost-derivation, k-token lookahead. The use of a rightmost derivation seems odd; how is that compatible with a left-to-right parse? Image 3.18 illustrates an LR parse of the program

a:=7;
b:=c+(d:=5+6,d)

Image 3.18: Shift-reduce parse of a sentence. Numeric subscripts in the Stack are DFA state numbers; see Table 3.19.

using Grammar 3.1, augmented with a new start production S′ → S$. The parser has a stack and an input. The first k tokens of the input are the lookahead. Based on the contents of the stack and the lookahead, the parser performs two kinds of actions:

Shift: Move the first input token to the top of the stack.
Reduce: Choose a grammar rule X → A B C; pop C, B, A from the top of the stack; push X onto the stack.

Initially, the stack is empty and the parser is at the beginning of the input. The action of shifting the end-of-file marker $ is called accepting and causes the parser to stop successfully. In Image 3.18, the stack and input are shown after every step, along with an indication of which action has just been performed. The concatenation of stack and input is always one line of a rightmost derivation; in fact, Image 3.18 shows the rightmost derivation of the input string, upside-down.

LR PARSING ENGINE

How does the LR parser know when to shift and when to reduce? By using a deterministic finite automaton! The DFA is not applied to the input - finite automata are too weak to parse context-free grammars - but to the stack. The edges of the DFA are labeled by the symbols (terminals and nonterminals) that can appear on the stack. Table 3.19 is the transition table for Grammar 3.1.

Table 3.19: LR parsing table for Grammar 3.1.

The elements in the transition table are labeled with four kinds of actions:

sn	Shift into state n;
gn	Goto state n;
rk	Reduce by rule k;
a	Accept; Error (denoted by a blank entry in the table).

To use this table in parsing, treat the shift and goto actions as edges of a DFA, and scan the stack. For example, if the stack is id := E, then the DFA goes from state 1 to 4 to 6 to 11. If the next input token is a semicolon, then the ";" column in state 11 says to reduce by rule 2. The second rule of the grammar is S → id:=E, so the top three tokens are popped from the stack and S is pushed. The action for "+" in state 11 is to shift; so if the next token had been + instead, it would have been eaten from the input and pushed on the stack. Rather than rescan the stack for each token, the parser can remember instead the state reached for each stack element. Then the parsing algorithm is

Look up top stack state, and input symbol, to get action; If action is

Shift(n):

Advance input one token; push n on stack.

Reduce(k):

Pop stack as many times as the number of symbols on the right-hand side of rule k;

Let X be the left-hand-side symbol of rule k;
In the state now on top of stack, look up X to get "goto n";
Push n on top of stack.

Accept:

Stop parsing, report success.

Error:

Stop parsing, report failure.

LR(0) PARSER GENERATION

An LR(k) parser uses the contents of its stack and the next k tokens of the input to decide which action to take. Table 3.19 shows the use of one symbol of lookahead. For k = 2, the table has columns for every two-token sequence and so on; in practice, k > 1 is not used for compilation. This is partly because the tables would be huge, but more because most reasonable coding languages can be described by LR(1) grammars. LR(0) grammars are those that can be parsed looking only at the stack, making shift/reduce decisions without any lookahead. Though this class of grammars is too weak to be very useful, the algorithm for constructing LR(0) parsing tables is a good introduction to the LR(1) parser construction algorithm. We will use Grammar 3.20 to illustrate LR(0) parser generation. Consider what the parser for this grammar will be doing. Initially, it will have an empty stack, and the input will be a complete S-sentence followed by $; that is, the right-hand side of the S′ rule will be on the input. We indicate this as S′ → .S$ where the dot indicates the current position of the parser.GRAMMAR 3.20

S′ → S$
S → (L)
S → x
L → S
L → L, S

In this state, where the input begins with S, that means that it begins with any possible right-hand side of an S-production; we indicate that by Java ScreenShot

Call this state 1. A grammar rule, combined with the dot that indicates a position in its right-hand side, is called an item (specifically, an LR(0) item). A state is just a set of items. Shift actions In state 1, consider what happens if we shift an x. Wethen know that the end of the stack has an x; we indicate that by shifting the dot past the x in the S → x production. The rules S′ → .S$ and S → .(L) are irrelevant to this action, so we ignore them; we end up in state 2: Java ScreenShot

Or in state 1 consider shifting a left parenthesis. Moving the dot past the parenthesis in the third item yields S → (.L), where we know that there must be a left parenthesis on top of the stack, and the input begins with some string derived by L, followed by a right parenthesis. What tokens can begin the input now? We find out by including all L-productions in the set of items. But now, in one of those L-items, the dot is just before an S, so we need to include all the S-productions: Java ScreenShot

Goto actions In state 1, consider the effect of parsing past some string of tokens derived by the S nonterminal. This will happen when an x or left parenthesis is shifted, followed (eventually) by a reduction of an S-production. All the right-hand-side symbols of that production will be popped, and the parser will execute the goto action for S in state 1. The effect of this can be simulated by moving the dot past the S in the first item of state 1, yielding state 4: Java ScreenShot

Reduce actions In state 2 we find the dot at the end of an item. This means that on top of the stack there must be a complete right-hand side of the corresponding production (S → x), ready to reduce. In such a state the parser could perform a reduce action. The basic operations we have been performing on states are closure(I) and goto(I, X), where I is a set of items and X is a grammar symbol (terminal or nonterminal). Closure adds more items to a set of items when there is a dot to the left of a nonterminal; goto moves the dot past the symbol X in all items.

Closure(I) = Goto(I, X) =
 repeat set J to the empty set
 for any item A → α.Xβ in I for any item A → α:Xβ in I
 for any production X → γ add A → αX.β to J
 I ← I ∩ {X → .γ} return Closure(J)
 until I does not change.
 return I

Now here is the algorithm for LR(0) parser construction. First, augment the grammar with an auxiliary start production S′ → S$. Let T be the set of states seen so far, and E the set of (shift or goto) edges found so far.

Initialize T to {Closure({S′ → :S$})}
Initialize E to empty.
repeat for each state I in T
 for each item A → α.Xβ in I
 let J be Goto(I, X)
 T ← T ∪ {J}
 E ← E ∪ {I *** J}
until E and T did not change in this iteration

However, for the symbol $ we do not compute Goto(I; $); instead we will make an accept action. For Grammar 3.20 this is illustrated in Image 3.21.
Image 3.21: LR(0) states for Grammar 3.20.

Now we can compute set R of LR(0) reduce actions:

R ← {}
for each state I in T
 for each item A → α. in I
 R ← R ∪ {(I, A → α)}

We can now construct a parsing table for this grammar (Table 3.22). For each edge Java ScreenShot where X is a terminal, we put the action shift J at position (I, X) of the table; if X is a nonterminal, we put goto J at position (I, X). For each state I containing an item S′ → S.$ we put an accept action at (I, $). Finally, for a state containing an item A → γ. (production n with the dot at the end), we put a reduce n action at (I, Y) for every token Y.

Table 3.22: LR(0) parsing table for Grammar 3.20.

In principle, since LR(0) needs no lookahead, we just need a single action for each state: A state will shift or reduce, but not both. In practice, since we need to know what state to shift into, we have rows headed by state numbers and columns headed by grammar symbols.

SLR PARSER GENERATION

Let us attempt to build an LR(0) parsing table for Grammar 3.23. The LR(0) states and parsing table are shown in Image 3.24.GRAMMAR 3.23

S → E $
E → T + E
E → T
T → x

Image 3.24: LR(0) states and parsing table for Grammar 3.23.

In state 3, on symbol +, there is a duplicate entry: The parser must shift into state 4 and also reduce by production 2. This is a conflict and indicates that the grammar is not LR(0) - it cannot be parsed by an LR(0) parser. We will need a more powerful parsing algorithm. A simple way of constructing better-than-LR(0) parsers is called SLR, which stands for simple LR. Parser construction for SLR is almost identical to that for LR(0), except that we put reduce actions into the table only where indicated by the FOLLOW set. Here is the algorithm for putting reduce actions into an SLR table:

R ← {}
for each state I in T
 for each item A → α. in I
 for each token X in FOLLOW(A)
 R ← R ∪ {(I, X, A → α)}

The action (I, X, A → α) indicates that in state I, on lookahead symbol X, the parser will reduce by rule A → α. Thus, for Grammar 3.23 we use the same LR(0) state diagram (Image 3.24), but we put fewer reduce actions into the SLR table, as shown in Image 3.25.
Image 3.25: SLR parsing table for Grammar 3.23.

The SLR class of grammars is precisely those grammars whose SLR parsing table contains no conflicts (duplicate entries). Grammar 3.23 belongs to this class, as do many useful programming-language grammars.

LR(1) ITEMS; LR(1) PARSING TABLE

Even more powerful than SLR is the LR(1) parsing algorithm. Most coding languages whose syntax is describable by a context-free grammar have an LR(1) grammar. The algorithm for constructing an LR(1) parsing table is similar to that for LR(0), but the notion of an item is more sophisticated. An LR(1) item consists of a grammar production, a right-hand-side position (represented by the dot), and a lookahead symbol. The idea is that an item (A → α.β, x) indicates that the sequence α is on top of the stack, and at the head of the input is a string derivable from βx. An LR(1) state is a set of LR(1) items, and there are Closure and Goto operations for LR(1) that incorporate the lookahead:

Closure(I) = Goto(I, X) =
 repeat J ← {}
 for any item (A → α.Xβ, z) in I for any item (A → α.Xβ, z) in I
 for any production X → γ add (A → αX.β, z) to J
 for any w ∈ FIRST(βz) return Closure(J).
 I ← I ∪ {(X → .γ, w)}
 until I does not change
 return I

The start state is the closure of the item (S′ → .S $, ?), where the lookahead symbol ? will not matter, because the end-of-file marker will never be shifted. The reduce actions are chosen by this algorithm:

R ← {}
for each state I in T
 for each item (A → α., z) in I
 R ← R ∪{(I, z, A → α)}

The action (I, z, A → α) indicates that in state I, on lookahead symbol z, the parser will reduce by rule A → α. Grammar 3.26 is not SLR (see Exercise 3.9), but it is in the class of LR(1) grammars. Image 3.27 shows the LR(1) states for this grammar; in the figure, where there are several items with the same production but different lookahead, as at left below, we have abbreviated as at right: Java ScreenShot
Image 3.27: LR(1) states for Grammar 3.26. GRAMMAR 3.26: A grammar capturing the essence of expressions, variables, and pointer-dereference (by the *) operator in the C language.

S′ → S $
S → V = E
S → E
E → V
V → x
V → * E

The LR(1) parsing table derived from this state graph is Table 3.28a. Wherever the dot is at the end of a production (as in state 3 of Image 3.27, where it is at the end of production E → V ), then there is a reduce action for that production in the LR(1) table, in the row corresponding to the state number and the column corresponding to the lookahead of the item (in this case, the lookahead is $). Whenever the dot is to the left of a terminal symbol or nonterminal, there is a corresponding shift or goto action in the LR(1) parsing table, just as there would be in an LR(0) table.

Table 3.28: LR(1) and LALR(1) parsing tables for Grammar 3.26.

LALR(1) PARSING TABLES

LR(1) parsing tables can be very large, with many states. A smaller table can be made by merging any two states whose items are identical except for lookahead sets. The result parser is called an LALR(1) parser, for lookahead LR(1). For example, the items in states 6 and 13 of the LR(1) parser for Grammar 3.26 (Image 3.27) are identical if the lookahead sets are ignored. Also, states 7 and 12 are identical except for lookahead, as are states 8 and 11 and states 10 and 14. Merging these pairs of states gives the LALR(1) parsing table shown in Table 3.28b.

For some grammars, the LALR(1) table contains reduce-reduce conflicts where the LR(1) table has none, but in practice the difference matters little. What does matter is that the LALR(1) parsing table requires less memory to represent than the LR(1) table, since there can be many fewer states.

HIERARCHY OF GRAMMAR CLASSES

A grammar is said to be LALR(1) if its LALR(1) parsing table contains no conflicts. All SLR grammars are LALR(1), but not vice versa. Image 3.29 shows the relationship between several classes of grammars.
Image 3.29: A hierarchy of grammar classes.

Any reasonable coding language has a LALR(1) grammar, and there are many parser-generator tools available for LALR(1) grammars. For this reason, LALR(1) has become a standard for coding languages and for automatic parser generators.

LR PARSING OF AMBIGUOUS GRAMMARS

Many coding languages have grammar rules such as

S → if E then S else S
S → if E then S
S → other

which allow programs such as

if a then if b then s1 else s2

Such a program could be understood in two ways:

(1) if a then { if b then s1 else s2 }
(2) if a then { if b then s1 } else s2

In most coding languages, an else must match the most recent possible then, so interpretation (1) is correct. In the LR parsing table there will be a shift-reduce conflict: Java ScreenShot

Shifting corresponds to interpretation (1) and reducing to interpretation (2). The ambiguity can be eliminated by introducing auxiliary nonterminals M (for matched statement)and U (for unmatched statement):

S → M
S → U
M → if E then M else M
M → other
U → if E then S
U → if E then M else U

But instead of rewriting the grammar, we can leave the grammar unchanged and tolerate the shift-reduce conflict. In constructing the parsing table this conflict should be resolved by shifting, since we prefer interpretation (1).

It is often possible to use ambiguous grammars by resolving shift-reduce conflicts in favor of shifting or reducing, as appropriate. But it is best to use this technique sparingly, and only in cases (such as the dangling-else described here, and operator-precedence to be described on page 74) that are well understood. Most shift-reduce conflicts, and probably all reduce-reduce conflicts, should not be resolved by fiddling with the parsing table. They are symptoms of an ill-specified grammar, and they should be resolved by eliminating ambiguities.

Shift(n):	Advance input one token; push n on stack.
Reduce(k):	Pop stack as many times as the number of symbols on the right-hand side of rule k;
	Let X be the left-hand-side symbol of rule k; In the state now on top of stack, look up X to get "goto n"; Push n on top of stack.
Accept:	Stop parsing, report success.
Error:	Stop parsing, report failure.