CONTEXT-FREE GRAMMARS

As before, we say that a language is a set of strings; each string is a finite sequence of symbols taken from a finite alphabet. For parsing, the strings are source programs, the symbols are lexical tokens, and the alphabet is the set of token-types returned by the lexical analyzer. A context-free grammar describes a language. A grammar has a set of productions of the form

symbol → symbol symbol … symbol

where there are zero or more symbols on the right-hand side. Each symbol is either terminal, meaning that it is a token from the alphabet of strings in the language, or nonterminal, meaning that it appears on the left-hand side of some production. No token can ever appear on the left-hand side of a production. Finally, one of the nonterminals is distinguished as the start symbol of the grammar. Grammar 3.1 is an example of a grammar for straight-line programs. The start symbol is S (when the start symbol is not written explicitly, it is conventional to assume that the left-hand nonterminal in the first production is the start symbol). The terminal symbols are

id   print   num   ,   +   (   )   :=   ;

GRAMMAR 3.1: A syntax for straight-line programs.

  1. S → S ; S
  2. S → id := E
  3. S → print ( L )
  4. E → id
  5. E → num
  6. E → E + E
  7. E → ( S , E )
  8. L → E
  9. L → L , E

and the nonterminals are S, E, and L. One sentence in the language of this grammar is

id := num; id := id + (id := num + num, id)

where the source text (before lexical analysis) might have been

a := 7;
b := c + (d := 5 + 6, d)

The token-types (terminal symbols) are id, num, :=, and so on; the names (a,b,c,d) and numbers (7, 5, 6) are semantic values associated with some of the tokens.
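
As an aside, a grammar such as Grammar 3.1 can be written down directly as a data structure. The following is a minimal sketch in Python (not code from the book): a table mapping each nonterminal to the right-hand sides of its productions, so that any symbol that is not a key of the table is a terminal.

  # A minimal sketch, not from the book: Grammar 3.1 as a Python dictionary.
  # Each nonterminal maps to a list of right-hand sides; every symbol that is
  # not a key is a terminal (a token-type).
  GRAMMAR_3_1 = {
      "S": [("S", ";", "S"), ("id", ":=", "E"), ("print", "(", "L", ")")],
      "E": [("id",), ("num",), ("E", "+", "E"), ("(", "S", ",", "E", ")")],
      "L": [("E",), ("L", ",", "E")],
  }
  START = "S"

  def is_terminal(symbol):
      return symbol not in GRAMMAR_3_1

  # The sentence from the text, as a sequence of token-types:
  sentence = "id := num ; id := id + ( id := num + num , id )".split()
  assert all(is_terminal(tok) for tok in sentence)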

DERIVATIONS

To show that this sentence is in the language of the grammar, we can perform a derivation: Start with the start symbol, then repeatedly replace any nonterminal by one of its right-hand sides, as shown in Derivation 3.2.

DERIVATION 3.2:

  S
  S ; S
  S ; id := E
  id := E ; id := E
  id := num ; id := E
  id := num ; id := E + E
  id := num ; id := E + ( S , E )
  id := num ; id := id + ( S , E )
  id := num ; id := id + ( id := E , E )
  id := num ; id := id + ( id := E + E , E )
  id := num ; id := id + ( id := num + E , E )
  id := num ; id := id + ( id := num + num , E )
  id := num ; id := id + ( id := num + num , id )

There are many different derivations of the same sentence. A leftmost derivation is one in which the leftmost nonterminal symbol is always the one expanded; in a rightmost derivation, the rightmost nonterminal is always the next to be expanded. Derivation 3.2 is neither leftmost nor rightmost; a leftmost derivation for this sentence would begin

  S
  S ; S
  id := E ; S
  id := num ; S
  id := num ; id := E
  ⋮
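
The leftmost derivation above can also be replayed mechanically. The following is a minimal sketch in Python, not code from the book: a helper that always expands the leftmost nonterminal of Grammar 3.1 with a chosen right-hand side.

  # A minimal sketch, not from the book: replay the first steps of the
  # leftmost derivation above by always expanding the leftmost nonterminal.
  NONTERMINALS = {"S", "E", "L"}

  def expand_leftmost(form, rhs):
      """Replace the leftmost nonterminal in `form` by the symbols in `rhs`."""
      for i, symbol in enumerate(form):
          if symbol in NONTERMINALS:
              return form[:i] + list(rhs) + form[i + 1:]
      raise ValueError("no nonterminal left to expand")

  form = ["S"]
  for rhs in [("S", ";", "S"),     # S -> S ; S
              ("id", ":=", "E"),   # S -> id := E   (the leftmost S)
              ("num",),            # E -> num
              ("id", ":=", "E")]:  # S -> id := E   (the remaining S)
      form = expand_leftmost(form, rhs)
      print(" ".join(form))
  # Prints: S ; S, then id := E ; S, then id := num ; S, then id := num ; id := E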

PARSE TREES

A parse tree is made by connecting each symbol in a derivation to the one from which it was derived, as shown in Image 3.3. Two different derivations can have the same parse tree.
Image 3.3: Parse tree.
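
To make the tree structure concrete, here is a minimal sketch in Python (not code from the book) of the parse tree of Image 3.3 written as nested tuples; reading off the leaves from left to right recovers the derived sentence, regardless of which derivation built the tree.

  # A minimal sketch, not from the book: the parse tree of Image 3.3 as nested
  # tuples. The first element of each tuple is a nonterminal; the remaining
  # elements are its children; plain strings are terminal leaves.
  TREE = ("S",
          ("S", "id", ":=", ("E", "num")),
          ";",
          ("S", "id", ":=",
           ("E",
            ("E", "id"),
            "+",
            ("E", "(",
             ("S", "id", ":=", ("E", ("E", "num"), "+", ("E", "num"))),
             ",",
             ("E", "id"),
             ")"))))

  def leaves(node):
      """Concatenate the terminal leaves from left to right."""
      if isinstance(node, str):
          return [node]
      result = []
      for child in node[1:]:          # skip the nonterminal label
          result.extend(leaves(child))
      return result

  print(" ".join(leaves(TREE)))
  # id := num ; id := id + ( id := num + num , id )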

AMBIGUOUS GRAMMARS

A grammar is ambiguous if it can derive a sentence with two different parse trees. Grammar 3.1 is ambiguous, since the sentence id := id+id+id has two parse trees (Image 3.4).
Image 3.4: Two parse trees for the same sentence using Grammar 3.1.

Grammar 3.5 is also ambiguous; Image 3.6 shows two parse trees for the sentence 1-2-3, and Image 3.7 shows two trees for 1+2*3. Clearly, if we use parse trees to interpret the meaning of the expressions, the two parse trees for 1-2-3 mean different things: (1 − 2) − 3 = −4 versus 1 − (2 − 3) = 2. Similarly, (1 + 2) × 3 is not the same as 1 + (2 × 3). And indeed, compilers do use parse trees to derive meaning.
Image 3.6: Two parse trees for the sentence 1-2-3 in Grammar 3.5.
Image 3.7: Two parse trees for the sentence 1+2*3 in Grammar 3.5.
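
To make the difference concrete, here is a minimal sketch in Python (not code from the book) that writes the two parse trees for 1-2-3 as nested triples and evaluates each one.

  # A minimal sketch, not from the book: the two parse trees for 1-2-3 as
  # nested (left, operator, right) triples, and a tiny evaluator showing that
  # the two trees mean different things.
  left_assoc  = ((1, "-", 2), "-", 3)   # the (1 - 2) - 3 tree
  right_assoc = (1, "-", (2, "-", 3))   # the 1 - (2 - 3) tree

  def evaluate(tree):
      if isinstance(tree, int):
          return tree
      lhs, op, rhs = tree
      lhs, rhs = evaluate(lhs), evaluate(rhs)
      return lhs - rhs if op == "-" else lhs + rhs

  print(evaluate(left_assoc))    # -4
  print(evaluate(right_assoc))   #  2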

GRAMMAR 3.5: An ambiguous grammar of arithmetic expressions.

  E → id        E → E + E
  E → num       E → E − E
  E → E * E     E → ( E )
  E → E / E

GRAMMAR 3.8: An unambiguous grammar of expressions; * and / bind tighter than + and −, and all operators associate to the left.

  E → E + T     T → T * F     F → id
  E → E − T     T → T / F     F → num
  E → T         T → F         F → ( E )

Therefore, ambiguous grammars are problematic for compiling: In general, we would prefer to have unambiguous grammars. Fortunately, we can often transform ambiguous grammars into unambiguous grammars. Let us find an unambiguous grammar that accepts the same language as Grammar 3.5. First, we would like to say that * binds tighter than +, or has higher precedence. Second, we want to say that each operator associates to the left, so that we get (1 − 2) − 3 instead of 1 − (2 − 3). We do this by introducing new nonterminal symbols to get Grammar 3.8. The symbols E, T, and F stand for expression, term, and factor; conventionally, factors are things you multiply and terms are things you add. This grammar accepts the same set of sentences as the ambiguous grammar, but now each sentence has exactly one parse tree. Grammar 3.8 can never produce parse trees of the form shown in Image 3.9 (see Exercise 3.17).
Image 3.9: Parse trees that Grammar 3.8 will never produce.

Had we wanted to make * associate to the right, we could have written its production as T → F * T.
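
The effect of the precedence and associativity built into Grammar 3.8 can be illustrated with a minimal sketch in Python (not code from the book): an evaluator whose three functions mirror the nonterminals E, T, and F. The left-recursive productions are implemented with loops, the usual trick in hand-written recursive descent; for brevity the sketch handles only num, not id.

  # A minimal sketch, not from the book: an evaluator that respects the
  # precedence and left associativity encoded by Grammar 3.8.
  def evaluate(tokens):
      pos = 0

      def peek():
          return tokens[pos] if pos < len(tokens) else None

      def take():
          nonlocal pos
          tok = tokens[pos]
          pos += 1
          return tok

      def parse_F():                       # F -> num | ( E )
          if peek() == "(":
              take()
              value = parse_E()
              assert take() == ")"
              return value
          return int(take())

      def parse_T():                       # T -> T * F | T / F | F
          value = parse_F()
          while peek() in ("*", "/"):      # loop realizes left associativity
              op = take()
              rhs = parse_F()
              value = value * rhs if op == "*" else value / rhs
          return value

      def parse_E():                       # E -> E + T | E - T | T
          value = parse_T()
          while peek() in ("+", "-"):
              op = take()
              rhs = parse_T()
              value = value + rhs if op == "+" else value - rhs
          return value

      result = parse_E()
      assert peek() is None, "unexpected trailing tokens"
      return result

  print(evaluate("1 - 2 - 3".split()))   # -4: parsed as (1 - 2) - 3
  print(evaluate("1 + 2 * 3".split()))   # 7:  parsed as 1 + (2 * 3)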

We can usually eliminate ambiguity by transforming the grammar. Though there are some languages (sets of strings) that have ambiguous grammars but no unambiguous grammar, such languages may be problematic as programming languages because the syntactic ambiguity may lead to problems in writing and understanding programs.

END-OF-FILE MARKER

Parsers must read not only terminal symbols such as +, −, num, and so on, but also the end-of-file marker. We will use $ to represent end of file. Suppose S is the start symbol of a grammar. To indicate that $ must come after a complete S-phrase, we augment the grammar with a new start symbol S′ and a new production S′ → S$. In Grammar 3.8, E is the start symbol, so an augmented grammar is Grammar 3.10.

GRAMMAR 3.10: Grammar 3.8, augmented with an end-of-file marker.

  S → E $
  E → E + T     T → T * F     F → id
  E → E − T     T → T / F     F → num
  E → T         T → F         F → ( E )
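
As a small illustration (a sketch in Python, not code from the book, using the dictionary representation of a grammar sketched earlier), augmentation just adds one production that puts $ after a complete phrase.

  # A minimal sketch, not from the book: augment a grammar with a new start
  # symbol whose single production places the end-of-file marker $ after a
  # complete phrase of the old start symbol.
  def augment(grammar, start, eof="$"):
      new_start = start + "'"                   # e.g. S'
      augmented = dict(grammar)
      augmented[new_start] = [(start, eof)]     # S' -> S $
      return augmented, new_start

  # Usage with a toy two-production grammar:
  toy = {"E": [("E", "+", "id"), ("id",)]}
  augmented, new_start = augment(toy, "E")
  print(new_start, "->", augmented[new_start][0])   # E' -> ('E', '$')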