DATA STRUCTURES FOR TREE LANGUAGES
Many of the important data structures used in a compiler are intermediate representations of the program being compiled. Often these representations take the form of trees, with several node types, each of which has different attributes. Such trees can occur at many of the phase-interfaces shown in Image 1.1. Tree representations can be described with grammars, just like coding languages. To introduce the concepts, we will show a simple coding language with statements and expressions, but no loops or if-statements (this is called a language of straight-line programs). The syntax for this language is given in Grammar 1.3.GRAMMAR 1.3: A straight-line coding language.
Stm → Stm; Stm |
(CompoundStm) |
Stm → id := Exp |
(AssignStm) |
Stm → print (ExpList) |
(PrintStm) |
Exp → id |
(IdExp) |
Exp → num |
(NumExp) |
Exp → Exp Binop Exp |
(OpExp) |
Exp → (Stm, Exp) |
(EseqExp) |
ExpList → Exp, ExpList |
(PairExpList) |
ExpList → Exp |
(LastExpList) |
Binop →+ |
(Plus) |
Binop →− |
(Minus) |
Binop →× |
(Times) |
Binop → / |
(Div) |
The informal semantics of the language is as follows. Each Stm is a statement, each Exp is an expression. s1; s2 executes statement s1, then statement s2. i :=e evaluates the expression e, then "stores" the result in variable i. print(e1, e2,…, en) displays the values of all the expressions, evaluated left to right, separated by spaces, terminated by a newline. An identifier expression, suchas i, yields the current contents of the variable i. A number evaluates to the named integer. An operator expression e1 op e2 evaluates e1, then e2, then applies the given binary operator. And an expression sequence (s, e) behaves like the C-language "comma" operator, evaluating the statement s for side effects before evaluating (and returning the result of) the expression e. For example, executing this program
a := 5+3; b := (print(a, a-1), 10*a); print(b)
prints
8 7 80
How should this program be represented inside a compiler? One representation is source code, the characters that the programmer writes. But that is not so easy to manipulate. More convenient is a tree data structure, with one node for each statement (Stm) and expression (Exp). Image 1.4 shows a tree representation of the program; the nodes are labeled by the production labels of Grammar 1.3, and each node has as many children as the corresponding grammar production has right-hand-side symbols.
Image 1.4: Tree representation of a straight-line program.
We can translate the grammar directly into data structure definitions, as shown in Program 1.5. Each grammar symbol corresponds to an abstract class in the data structures:
Grammar |
class |
---|---|
| |
Stm |
Stm |
Exp |
Exp |
ExpList |
ExpList |
id |
String |
num |
int |
public abstract class Stm {} public class CompoundStm extends Stm { public Stm stm1, stm2; public CompoundStm(Stm s1, Stm s2) {stm1=s1; stm2=s2;}} public class AssignStm extends Stm { public String id; public Exp exp; public AssignStm(String i, Exp e) {id=i; exp=e;}} public class PrintStm extends Stm { public ExpList exps; public PrintStm(ExpList e) {exps=e;}} public abstract class Exp {} public class IdExp extends Exp { public String id; public IdExp(String i) {id=i;}} public class NumExp extends Exp { public int num; public NumExp(int n) {num=n;}} public class OpExp extends Exp { public Exp left, right; public int oper; final public static int Plus=1, Minus=2, Times=3, Div=4; public OpExp(Exp l, int o, Exp r) {left=l; oper=o; right=r;}} public class EseqExp extends Exp { public Stm stm; public Exp exp; public EseqExp(Stm s, Exp e) {stm=s; exp=e;}} public abstract class ExpList {} public class PairExpList extends ExpList { public Exp head; public ExpList tail; public PairExpList(Exp h, ExpList t) {head=h; tail=t;}} public class LastExpList extends ExpList { public Exp head; public LastExpList(Exp h) {head=h;}}
For each grammar rule, there is one constructor that belongs to the class for its left-hand-side symbol. We simply extend the abstract class with a "concrete" class for each grammar rule. The constructor (class) names are indicated on the right-hand side of Grammar 1.3. Each grammar rule has right-hand-side components that must be represented in the data structures. The CompoundStm has two Stm's on the right-hand side; the AssignStm has an identifier and an expression; and so on. These become fields of the subclasses in the Java data structure. Thus, CompoundStm has two fields (also called instance variables) called stm1 and stm2; AssignStm has fields id and exp. For Binop we do something simpler. Although we could make a Binop class - with subclasses for Plus, Minus, Times, Div - this is overkill because none of the subclasses would need any fields. Instead we make an "enumeration" type (in Java, actually an integer) of constants (final int variables) local to the OpExp class. Programming style We will follow several conventions for representing tree data structures in Java:
- Trees are described by a grammar.
- A tree is described by one or more abstract classes, each corresponding to a symbol in the grammar.
- Each abstract class is extended by one or more subclasses, one for each grammar rule.
- For each nontrivial symbol in the right-hand side of a rule, there will be one field in the corresponding class. (A trivial symbol is a punctuation symbol such as the semicolon in CompoundStm.)
- Every class will have a constructor function that initializes all the fields.
- Data structures are initialized when they are created (by the constructor functions), and are never modified after that (until they are eventually discarded).
Modularity principles for Java programs A compiler can be a big program; careful attention to modules and interfaces prevents chaos. We will use these principles in writing a compiler in Java:
- Each phase or module of the compiler belongs in its own package.
- "Import on demand" declarations will not be used. If a Java file begins with
import A.F.*; import A.G.*; import B.*; import C.*;
then the human reader will have to look outside this file to tell which package defines the X that is used in the expression X.put().
- "Single-type import" declarations are a better solution. If the module begins
import A.F.W; import A.G.X; import B.Y; import C.Z;
then you can tell without looking outside this file that X comes from A.G.
- Java is naturally a multithreaded system. We would like to support multiple simultaneous compiler threads and compile two different programs simultaneously, one in each compiler thread. Therefore, static variables must be avoided unless they are final (constant). We never want two compiler threads to be updating the same (static) instance of a variable.