
ABSTRACT PARSE TREES

It is possible to write an entire compiler that fits within the semantic action phrases of a JavaCC or SableCC parser. However, such a compiler is difficult to read and maintain, and this approach constrains the compiler to analyze the program in exactly the order it is parsed. To improve modularity, it is better to separate issues of syntax (parsing) from issues of semantics (type-checking and translation to machine code). One way to do this is for the parser to produce a parse tree - a data structure that later phases of the compiler can traverse.

Technically, a parse tree has exactly one leaf for each token of the input and one internal node for each grammar rule reduced during the parse. Such a parse tree, which we will call a concrete parse tree because it represents the concrete syntax of the source language, may be inconvenient to use directly. Many of the punctuation tokens are redundant and convey no information - they are useful in the input string, but once the parse tree is built, the structure of the tree conveys the structuring information more conveniently. Furthermore, the structure of the parse tree may depend too much on the grammar! The grammar transformations shown earlier - factoring, elimination of left recursion, elimination of ambiguity - involve the introduction of extra nonterminal symbols and extra grammar productions for technical purposes. These details should be confined to the parsing phase and should not clutter the semantic analysis.

An abstract syntax makes a clean interface between the parser and the later phases of a compiler (or, in fact, for the later phases of other kinds of program-analysis tools such as dependency analyzers). The abstract syntax tree conveys the phrase structure of the source program, with all parsing issues resolved but without any semantic interpretation.
Many early compilers did not use an abstract syntax data structure because early computers did not have enough memory to represent an entire compilation unit's syntax tree. Modern computers rarely have this problem. And many modern programming languages (ML, Modula-3, Java) allow forward reference to identifiers defined later in the same module; using an abstract syntax tree makes compilation easier for these languages. It may be that Pascal and C require clumsy forward declarations because their designers wanted to avoid an extra compiler pass on the machines of the 1970s.

Grammar 4.3 shows an abstract syntax of the expression language of Grammar 3.15. This grammar is completely impractical for parsing: it is quite ambiguous, since the precedence of the operators is not specified.

GRAMMAR 4.3: Abstract syntax of expressions.

However, Grammar 4.3 is not meant for parsing. The parser uses the concrete syntax to build a parse tree for the abstract syntax. The semantic analysis phase takes this abstract syntax tree; it is not bothered by the ambiguity of the grammar, since it already has the parse tree!

The compiler will need to represent and manipulate abstract syntax trees as data structures. In Java, these data structures are organized according to the principles outlined earlier: an abstract class for each nonterminal, a subclass for each production, and so on. In fact, the classes of Program 4.5 are abstract syntax classes for Grammar 4.3. An alternate arrangement, with all the different binary operators grouped into an OpExp class, is also possible.

Let us write an interpreter for the expression language in Grammar 3.15 by first building syntax trees and then interpreting those trees. Program 4.4 is a JavaCC grammar with semantic actions that produce syntax trees. Each class of syntax-tree nodes contains an eval function; when called, such a function will return the value of the represented expression.

PROGRAM 4.4: Building syntax trees for expressions.
Exp Start() :
  { Exp e; }
  { e=Exp() { return e; } }

Exp Exp() :
  { Exp e1, e2; }
  { e1=Term()
    ( "+" e2=Term()   { e1 = new PlusExp(e1, e2); }
    | "-" e2=Term()   { e1 = new MinusExp(e1, e2); }
    )*
    { return e1; }
  }

Exp Term() :
  { Exp e1, e2; }
  { e1=Factor()
    ( "*" e2=Factor() { e1 = new TimesExp(e1, e2); }
    | "/" e2=Factor() { e1 = new DivideExp(e1, e2); }
    )*
    { return e1; }
  }

Exp Factor() :
  { Token t; Exp e; }
  { ( t=<IDENTIFIER>        { return new Identifier(t.image); }
    | t=<INTEGER_LITERAL>   { return new IntegerLiteral(t.image); }
    | "(" e=Exp() ")"       { return e; }
    )
  }



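The syntax-tree classes that Program 4.4's semantic actions construct are not shown above; the following is one plausible sketch of them, with the class names taken from the grammar actions. The eval bodies and the use of an environment map for identifier values are assumptions, not the book's exact code.

```java
import java.util.Map;

// A hedged sketch of the syntax-tree classes built by Program 4.4.
// Each node knows how to evaluate itself; env maps identifier names
// to integer values (an assumption made for this sketch).
abstract class Exp {
    abstract int eval(Map<String, Integer> env);
}

class PlusExp extends Exp {
    final Exp left, right;
    PlusExp(Exp l, Exp r) { left = l; right = r; }
    int eval(Map<String, Integer> env) { return left.eval(env) + right.eval(env); }
}

class MinusExp extends Exp {
    final Exp left, right;
    MinusExp(Exp l, Exp r) { left = l; right = r; }
    int eval(Map<String, Integer> env) { return left.eval(env) - right.eval(env); }
}

class TimesExp extends Exp {
    final Exp left, right;
    TimesExp(Exp l, Exp r) { left = l; right = r; }
    int eval(Map<String, Integer> env) { return left.eval(env) * right.eval(env); }
}

class DivideExp extends Exp {
    final Exp left, right;
    DivideExp(Exp l, Exp r) { left = l; right = r; }
    int eval(Map<String, Integer> env) { return left.eval(env) / right.eval(env); }
}

class Identifier extends Exp {
    final String name;
    Identifier(String s) { name = s; }
    int eval(Map<String, Integer> env) { return env.get(name); }
}

class IntegerLiteral extends Exp {
    final int value;
    IntegerLiteral(String image) { value = Integer.parseInt(image); }
    int eval(Map<String, Integer> env) { return value; }
}
```

With these classes, parsing "1+2*3" yields the tree new PlusExp(new IntegerLiteral("1"), new TimesExp(new IntegerLiteral("2"), new IntegerLiteral("3"))), and calling eval on it returns 7 - note that operator precedence was resolved during parsing, so the ambiguity of the abstract grammar never arises.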

POSITIONS

In a one-pass compiler, lexical analysis, parsing, and semantic analysis (type checking) are all done simultaneously. If there is a type error that must be reported to the user, the current position of the lexical analyzer is a reasonable approximation of the source position of the error. In such a compiler, the lexical analyzer keeps a "current position" global variable, and the error-message routine just prints the value of that variable with each message.

A compiler that uses abstract-syntax-tree data structures need not do all the parsing and semantic analysis in one pass. This makes life easier in many ways, but slightly complicates the production of semantic error messages. The lexer reaches the end of file before semantic analysis even begins; so if a semantic error is detected in traversing the abstract syntax tree, the current position of the lexer (at end of file) will not be useful in generating a line number for the error message. Thus, the source-file position of each node of the abstract syntax tree must be remembered, in case that node turns out to contain a semantic error.

To remember positions accurately, the abstract-syntax data structures must be sprinkled with pos fields. These indicate the position, within the original source file, of the characters from which these abstract-syntax structures were derived. Then the type-checker can produce useful error messages. (The syntax constructors we will show in Image 4.9 do not have pos fields; any compiler that uses these exactly as given will have a hard time producing accurately located error messages.)

package syntaxtree;
Program(MainClass m, ClassDeclList cl)
MainClass(Identifier i1, Identifier i2, Statement s)
abstract class ClassDecl
ClassDeclSimple(Identifier i, VarDeclList vl, MethodDeclList ml)
ClassDeclExtends(Identifier i, Identifier j,
 VarDeclList vl, MethodDeclList ml) see Ch.14
VarDecl(Type t, Identifier i)
MethodDecl(Type t, Identifier i, FormalList fl, VarDeclList vl,
 StatementList sl, Exp e)
Formal(Type t, Identifier i)
abstract class Type
IntArrayType() BooleanType() IntegerType() IdentifierType(String s)
abstract class Statement
Block(StatementList sl)
If(Exp e, Statement s1, Statement s2)
While(Exp e, Statement s)
Print(Exp e)
Assign(Identifier i, Exp e)
ArrayAssign(Identifier i, Exp e1, Exp e2)
abstract class Exp
And(Exp e1, Exp e2)
LessThan(Exp e1, Exp e2)
Plus(Exp e1, Exp e2) Minus(Exp e1, Exp e2) Times(Exp e1, Exp e2)
ArrayLookup(Exp e1, Exp e2)
ArrayLength(Exp e)
Call(Exp e, Identifier i, ExpList el)
IntegerLiteral(int i)
True()
False()
IdentifierExp(String s)
This()
NewArray(Exp e)
NewObject(Identifier i)
Not(Exp e)
Identifier(String s)
list classes: ClassDeclList() ExpList() FormalList() MethodDeclList() StatementList() VarDeclList()



Image 4.9: Abstract syntax for the MiniJava language.
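Following the convention of an abstract class for each nonterminal and a subclass for each production, the MiniJava statement x = y + 1; would be built from the Image 4.9 constructors roughly as sketched below. The class bodies here are hypothetical stand-ins; the real syntaxtree package defines the full hierarchy.

```java
// Hedged stand-ins for a few Image 4.9 classes, just enough to build
// one Assign statement; field names follow the constructor signatures.
abstract class Exp {}

class IdentifierExp extends Exp {
    final String s;
    IdentifierExp(String s) { this.s = s; }
}

class IntegerLiteral extends Exp {
    final int i;
    IntegerLiteral(int i) { this.i = i; }
}

class Plus extends Exp {
    final Exp e1, e2;
    Plus(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
}

class Identifier {
    final String s;
    Identifier(String s) { this.s = s; }
}

abstract class Statement {}

class Assign extends Statement {
    final Identifier i;
    final Exp e;
    Assign(Identifier i, Exp e) { this.i = i; this.e = e; }
}
```

The statement x = y + 1; then becomes new Assign(new Identifier("x"), new Plus(new IdentifierExp("y"), new IntegerLiteral(1))) - the parser's semantic actions produce exactly such nested constructor calls.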

The lexer must pass the source-file positions of the beginning and end of each token to the parser. We can augment the types Exp, etc. with a position field; then each constructor must take a pos argument to initialize this field. The positions of leaf nodes of the syntax tree can be obtained from the tokens returned by the lexical analyzer; internal-node positions can be derived from the positions of their subtrees. This is tedious but straightforward.
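One way this augmentation might look is sketched below. The Pos class is hypothetical; in JavaCC, leaf positions would come from the token's beginLine and beginColumn fields, and an internal node here simply inherits the position of its leftmost subtree - one reasonable policy among several.

```java
// A hedged sketch of threading source positions through the tree.
// Pos is a hypothetical line/column record; each Image 4.9 class
// would gain a pos field and a corresponding constructor argument.
class Pos {
    final int line, column;
    Pos(int line, int column) { this.line = line; this.column = column; }
    public String toString() { return line + "." + column; }
}

abstract class Exp {
    final Pos pos;  // position of the first character of this phrase
    Exp(Pos pos) { this.pos = pos; }
}

class IdentifierExp extends Exp {
    final String s;
    // Leaf node: position taken from the token returned by the lexer.
    IdentifierExp(String s, Pos pos) { super(pos); this.s = s; }
}

class Plus extends Exp {
    final Exp e1, e2;
    // Internal node: derive the position from the leftmost subtree.
    Plus(Exp e1, Exp e2) {
        super(e1.pos);
        this.e1 = e1; this.e2 = e2;
    }
}
```

A type-checker traversing the tree can then report, say, "3.7: type error" by printing the offending node's pos field instead of the lexer's long-gone current position.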

