Previous Next 
The weakness of LL(k) parsing techniques is that they must predict which production to use, having seen only the first k tokens of the righthand side. A more powerful technique, LR(k) parsing, is able to postpone the decision until it has seen input tokens corresponding to the entire righthand side of the production in question (and k more input tokens beyond). LR(k) stands for lefttoright parse, rightmostderivation, ktoken lookahead. The use of a rightmost derivation seems odd; how is that compatible with a lefttoright parse? Image 3.18 illustrates an LR parse of the program
a:=7; b:=c+(d:=5+6,d)
using Grammar 3.1, augmented with a new start production S′ → S$. The parser has a stack and an input. The first k tokens of the input are the lookahead. Based on the contents of the stack and the lookahead, the parser performs two kinds of actions:
Shift: Move the first input token to the top of the stack.
Reduce: Choose a grammar rule X → A B C; pop C, B, A from the top of the stack; push X onto the stack.
Initially, the stack is empty and the parser is at the beginning of the input. The action of shifting the endoffile marker $ is called accepting and causes the parser to stop successfully. In Image 3.18, the stack and input are shown after every step, along with an indication of which action has just been performed. The concatenation of stack and input is always one line of a rightmost derivation; in fact, Image 3.18 shows the rightmost derivation of the input string, upsidedown.
How does the LR parser know when to shift and when to reduce? By using a deterministic finite automaton! The DFA is not applied to the input  finite automata are too weak to parse contextfree grammars  but to the stack. The edges of the DFA are labeled by the symbols (terminals and nonterminals) that can appear on the stack. Table 3.19 is the transition table for Grammar 3.1.
The elements in the transition table are labeled with four kinds of actions:
sn 
Shift into state n; 
gn 
Goto state n; 
rk 
Reduce by rule k; 
a 
Accept;

To use this table in parsing, treat the shift and goto actions as edges of a DFA, and scan the stack. For example, if the stack is id := E, then the DFA goes from state 1 to 4 to 6 to 11. If the next input token is a semicolon, then the ";" column in state 11 says to reduce by rule 2. The second rule of the grammar is S → id:=E, so the top three tokens are popped from the stack and S is pushed. The action for "+" in state 11 is to shift; so if the next token had been + instead, it would have been eaten from the input and pushed on the stack. Rather than rescan the stack for each token, the parser can remember instead the state reached for each stack element. Then the parsing algorithm is
Look up top stack state, and input symbol, to get action; If action is
Shift(n):
Advance input one token; push n on stack.
Reduce(k):
Pop stack as many times as the number of symbols on the righthand side of rule k;
Let X be the lefthandside symbol of rule k;
In the state now on top of stack, look up X to get "goto n";
Push n on top of stack.Accept:
Stop parsing, report success.
Error:
Stop parsing, report failure.
An LR(k) parser uses the contents of its stack and the next k tokens of the input to decide which action to take. Table 3.19 shows the use of one symbol of lookahead. For k = 2, the table has columns for every twotoken sequence and so on; in practice, k > 1 is not used for compilation. This is partly because the tables would be huge, but more because most reasonable coding languages can be described by LR(1) grammars. LR(0) grammars are those that can be parsed looking only at the stack, making shift/reduce decisions without any lookahead. Though this class of grammars is too weak to be very useful, the algorithm for constructing LR(0) parsing tables is a good introduction to the LR(1) parser construction algorithm. We will use Grammar 3.20 to illustrate LR(0) parser generation. Consider what the parser for this grammar will be doing. Initially, it will have an empty stack, and the input will be a complete Ssentence followed by $; that is, the righthand side of the S′ rule will be on the input. We indicate this as S′ → .S$ where the dot indicates the current position of the parser.
GRAMMAR 3.20S′ → S$
In this state, where the input begins with S, that means that it begins with any possible righthand side of an Sproduction; we indicate that by
Call this state 1. A grammar rule, combined with the dot that indicates a position in its righthand side, is called an item (specifically, an LR(0) item). A state is just a set of items. Shift actions In state 1, consider what happens if we shift an x. Wethen know that the end of the stack has an x; we indicate that by shifting the dot past the x in the S → x production. The rules S′ → .S$ and S → .(L) are irrelevant to this action, so we ignore them; we end up in state 2:
Or in state 1 consider shifting a left parenthesis. Moving the dot past the parenthesis in the third item yields S → (.L), where we know that there must be a left parenthesis on top of the stack, and the input begins with some string derived by L, followed by a right parenthesis. What tokens can begin the input now? We find out by including all Lproductions in the set of items. But now, in one of those Litems, the dot is just before an S, so we need to include all the Sproductions:
Goto actions In state 1, consider the effect of parsing past some string of tokens derived by the S nonterminal. This will happen when an x or left parenthesis is shifted, followed (eventually) by a reduction of an Sproduction. All the righthandside symbols of that production will be popped, and the parser will execute the goto action for S in state 1. The effect of this can be simulated by moving the dot past the S in the first item of state 1, yielding state 4:
Reduce actions In state 2 we find the dot at the end of an item. This means that on top of the stack there must be a complete righthand side of the corresponding production (S → x), ready to reduce. In such a state the parser could perform a reduce action. The basic operations we have been performing on states are closure(I) and goto(I, X), where I is a set of items and X is a grammar symbol (terminal or nonterminal). Closure adds more items to a set of items when there is a dot to the left of a nonterminal; goto moves the dot past the symbol X in all items.
Closure(I) = Goto(I, X) = repeat set J to the empty set for any item A → α.Xβ in I for any item A → α:Xβ in I for any production X → γ add A → αX.β to J I ← I ∩ {X → .γ} return Closure(J) until I does not change. return I
Now here is the algorithm for LR(0) parser construction. First, augment the grammar with an auxiliary start production S′ → S$. Let T be the set of states seen so far, and E the set of (shift or goto) edges found so far.
Initialize T to {Closure({S′ → :S$})} Initialize E to empty. repeat for each state I in T for each item A → α.Xβ in I let J be Goto(I, X) T ← T ∪ {J} E ← E ∪ {I *** J} until E and T did not change in this iteration
However, for the symbol $ we do not compute Goto(I; $); instead we will make an accept action. For Grammar 3.20 this is illustrated in Image 3.21.
Now we can compute set R of LR(0) reduce actions:
R ← {} for each state I in T for each item A → α. in I R ← R ∪ {(I, A → α)}
We can now construct a parsing table for this grammar (Table 3.22). For each edge where X is a terminal, we put the action shift J at position (I, X) of the table; if X is a nonterminal, we put goto J at position (I, X). For each state I containing an item S′ → S.$ we put an accept action at (I, $). Finally, for a state containing an item A → γ. (production n with the dot at the end), we put a reduce n action at (I, Y) for every token Y.
In principle, since LR(0) needs no lookahead, we just need a single action for each state: A state will shift or reduce, but not both. In practice, since we need to know what state to shift into, we have rows headed by state numbers and columns headed by grammar symbols.
Let us attempt to build an LR(0) parsing table for Grammar 3.23. The LR(0) states and parsing table are shown in Image 3.24.
GRAMMAR 3.23S → E $
In state 3, on symbol +, there is a duplicate entry: The parser must shift into state 4 and also reduce by production 2. This is a conflict and indicates that the grammar is not LR(0)  it cannot be parsed by an LR(0) parser. We will need a more powerful parsing algorithm. A simple way of constructing betterthanLR(0) parsers is called SLR, which stands for simple LR. Parser construction for SLR is almost identical to that for LR(0), except that we put reduce actions into the table only where indicated by the FOLLOW set. Here is the algorithm for putting reduce actions into an SLR table:
R ← {} for each state I in T for each item A → α. in I for each token X in FOLLOW(A) R ← R ∪ {(I, X, A → α)}
The action (I, X, A → α) indicates that in state I, on lookahead symbol X, the parser will reduce by rule A → α. Thus, for Grammar 3.23 we use the same LR(0) state diagram (Image 3.24), but we put fewer reduce actions into the SLR table, as shown in Image 3.25.
The SLR class of grammars is precisely those grammars whose SLR parsing table contains no conflicts (duplicate entries). Grammar 3.23 belongs to this class, as do many useful programminglanguage grammars.
Even more powerful than SLR is the LR(1) parsing algorithm. Most coding languages whose syntax is describable by a contextfree grammar have an LR(1) grammar. The algorithm for constructing an LR(1) parsing table is similar to that for LR(0), but the notion of an item is more sophisticated. An LR(1) item consists of a grammar production, a righthandside position (represented by the dot), and a lookahead symbol. The idea is that an item (A → α.β, x) indicates that the sequence α is on top of the stack, and at the head of the input is a string derivable from βx. An LR(1) state is a set of LR(1) items, and there are Closure and Goto operations for LR(1) that incorporate the lookahead:
Closure(I) = Goto(I, X) = repeat J ← {} for any item (A → α.Xβ, z) in I for any item (A → α.Xβ, z) in I for any production X → γ add (A → αX.β, z) to J for any w ∈ FIRST(βz) return Closure(J). I ← I ∪ {(X → .γ, w)} until I does not change return I
The start state is the closure of the item (S′ → .S $, ?), where the lookahead symbol ? will not matter, because the endoffile marker will never be shifted. The reduce actions are chosen by this algorithm:
R ← {} for each state I in T for each item (A → α., z) in I R ← R ∪{(I, z, A → α)}
The action (I, z, A → α) indicates that in state I, on lookahead symbol z, the parser will reduce by rule A → α. Grammar 3.26 is not SLR (see Exercise 3.9), but it is in the class of LR(1) grammars. Image 3.27 shows the LR(1) states for this grammar; in the figure, where there are several items with the same production but different lookahead, as at left below, we have abbreviated as at right:
S′ → S $
The LR(1) parsing table derived from this state graph is Table 3.28a. Wherever the dot is at the end of a production (as in state 3 of Image 3.27, where it is at the end of production E → V ), then there is a reduce action for that production in the LR(1) table, in the row corresponding to the state number and the column corresponding to the lookahead of the item (in this case, the lookahead is $). Whenever the dot is to the left of a terminal symbol or nonterminal, there is a corresponding shift or goto action in the LR(1) parsing table, just as there would be in an LR(0) table.
LR(1) parsing tables can be very large, with many states. A smaller table can be made by merging any two states whose items are identical except for lookahead sets. The result parser is called an LALR(1) parser, for lookahead LR(1). For example, the items in states 6 and 13 of the LR(1) parser for Grammar 3.26 (Image 3.27) are identical if the lookahead sets are ignored. Also, states 7 and 12 are identical except for lookahead, as are states 8 and 11 and states 10 and 14. Merging these pairs of states gives the LALR(1) parsing table shown in Table 3.28b.
For some grammars, the LALR(1) table contains reducereduce conflicts where the LR(1) table has none, but in practice the difference matters little. What does matter is that the LALR(1) parsing table requires less memory to represent than the LR(1) table, since there can be many fewer states.
A grammar is said to be LALR(1) if its LALR(1) parsing table contains no conflicts. All SLR grammars are LALR(1), but not vice versa. Image 3.29 shows the relationship between several classes of grammars.
Any reasonable coding language has a LALR(1) grammar, and there are many parsergenerator tools available for LALR(1) grammars. For this reason, LALR(1) has become a standard for coding languages and for automatic parser generators.
Many coding languages have grammar rules such as
S → if E then S else S
which allow programs such as
if a then if b then s1 else s2
Such a program could be understood in two ways:
(1) if a then { if b then s1 else s2 } (2) if a then { if b then s1 } else s2
In most coding languages, an else must match the most recent possible then, so interpretation (1) is correct. In the LR parsing table there will be a shiftreduce conflict:
Shifting corresponds to interpretation (1) and reducing to interpretation (2). The ambiguity can be eliminated by introducing auxiliary nonterminals M (for matched statement)and U (for unmatched statement):
S → M
But instead of rewriting the grammar, we can leave the grammar unchanged and tolerate the shiftreduce conflict. In constructing the parsing table this conflict should be resolved by shifting, since we prefer interpretation (1).
It is often possible to use ambiguous grammars by resolving shiftreduce conflicts in favor of shifting or reducing, as appropriate. But it is best to use this technique sparingly, and only in cases (such as the danglingelse described here, and operatorprecedence to be described on page 74) that are well understood. Most shiftreduce conflicts, and probably all reducereduce conflicts, should not be resolved by fiddling with the parsing table. They are symptoms of an illspecified grammar, and they should be resolved by eliminating ambiguities.
JaVa  Previous Next 