SYMBOL TABLES

This phase is characterized by the maintenance of symbol tables (also called environments) mapping identifiers to their types and locations. As the declarations of types, variables, and functions are processed, these identifiers are bound to "meanings" in the symbol tables. When uses (nondefining occurrences) of identifiers are found, they are looked up in the symbol tables. Each local variable in a program has a scope in which it is visible. For example, in a MiniJava method m, all formal parameters and local variables declared in m are visible only until the end of m. As the semantic analysis reaches the end of each scope, the identifier bindings local to that scope are discarded. An environment is a set of bindings denoted by the ↦ arrow. For example, we could say that the environment σ₀ contains the bindings {g ↦ string, a ↦ int}, meaning that the identifier a is an integer variable and g is a string variable. Consider a simple example in the Java language:

1 class C {
2 int a; int b; int c;
3 public void m(){
4 System.out.println(a+c);
5 int j = a+b;
6 String a = "hello";
7 System.out.println(a);
8 System.out.println(j);
9 System.out.println(b);
10 }
11 }

Suppose we compile this class in the environment σ₀. The field declarations on line 2 give us the table σ₁ equal to σ₀ + {a ↦ int, b ↦ int, c ↦ int}, that is, σ₀ extended with new bindings for a, b, and c. The identifiers in line 4 can be looked up in σ₁. At line 5, the table σ₂ = σ₁ + {j ↦ int} is created; and at line 6, σ₃ = σ₂ + {a ↦ String} is created. How does the + operator for tables work when the two environments being "added" contain different bindings for the same symbol? When σ₂ and {a ↦ String} map a to int and String, respectively? To make the scoping rules work the way we expect them to in real coding languages, we want {a ↦ String} to take precedence. So we say that X + Y for tables is not the same as Y + X; bindings in the right-hand table override those in the left. The identifiers in lines 7, 8, and 9 can be looked up in σ₃. Finally, at line 10, we discard σ₃ and go back to σ₁. And at line 11 we discard σ₁ and go back to σ₀. How should this be implemented? There are really two choices. In a functional style, we make sure to keep σ₁ in pristine condition while we create σ₂ and σ₃. Then when we need σ₁ again, it's rested and ready. In an imperative style, we modify σ₁ until it becomes σ₂. This destructive update "destroys" σ₁; while σ₂ exists, we cannot look things up in σ₁. But when we are done with σ₂, we can undo the modification to get σ₁ back again. Thus, there is a single global environment σ which becomes σ₀, σ₁, σ₂, σ₃, σ₁, σ₀ at different times and an "undo stack" with enough information to remove the destructive updates. When a symbol is added to the environment, it is also added to the undo stack; at the end of scope (e.g., at line 10), symbols popped from the undo stack have their latest binding removed from σ (and their previous binding restored). Either the functional or imperative style of environment management can be used regardless of whether the language being compiled or the implementation language of the compiler is a "functional" or "imperative" or "objectoriented" language.

MULTIPLE SYMBOL TABLES

In some languages there can be several active environments at once: Each module, or class, or record in the program has a symbol table σ of its own. In analyzing Image 5.1, let σ₀ be the base environment containing predefined functions, and let Java ScreenShot

structure M = struct
 structure E = struct
 val a = 5;
 end
 structure N = struct
 val b = 10
 val a = E.a + b
 end
 structure D = struct
 val d = E.a + N.a
 end end

 package M;
 class E {
  static int a = 5;
 }
 class N {
  static int b = 10;
  static int a = E.a + b;
 }
 class D {
  static int d = E.a + N.a;
 }

(a) An example in ML

(b) An example in Java

Image 5.1: Several active environments at once.

In ML, the N is compiled using environment σ₀ + σ₂ to look up identifiers; D is compiled using σ₀ + σ₂ + σ₄, and the result of the analysis is {M ↦ σ₇}.

In Java, forward reference is allowed (so inside N the expression D.d would be legal), so E, N, and D are all compiled in the environment σ₇; for this program the result is still {M ↦ σ₇}.

EFFICIENT IMPERATIVE SYMBOL TABLES

Because a large program may contain thousands of distinct identifiers, symbol tables must permit efficient lookup. Imperative-style environments are usually implemented using hash tables, which are very efficient. The operation σ′ = σ + {a ↦ τ} is implemented by inserting τ in the hash table with key a. A simple hash table with external chaining works well and supports deletion easily (we will need to delete {a ↦ τ} to recover σ at the end of the scope of a). Program 5.2 implements a simple hash table. The ith bucket is a linked list of all the elements whose keys hash to i mod SIZE.Hash table with external chaining.

class Bucket {String key; Object binding; Bucket next;
 Bucket(String k, Object b, Bucket n) {key=k; binding=b; next=n;}
}
class HashT {
 final int SIZE = 256;
 Bucket table[] = new Bucket[SIZE];
 private int hash(String s) {
 int h=0;
 for(int i=0; i<s.length(); i++)
 h=h*65599+s.charAt(i);
 return h;
 }
 void insert(String s, Binding b) {
 int index=hash(s)%SIZE
 table[index]=new Bucket(s,b,table[index]);
 }
 Object lookup(String s) {
 int index=hash(s)%SIZE
 for (Binding b = table[index]; b!=null; b=b.next)
 if (s.equals(b.key)) return b.binding;
 return null;
 }
 void pop(String s) {
 int index=hash(s)%SIZE
 table[index]=table[index].next;
 }
}

Consider σ + {a ↦ τ₂} when σ contains a ↦ τ₁ already. The insert function leaves a ↦ τ₁ in the bucket and puts a ↦ τ₂ earlier in the list. Then, when pop(a) is done at the end of a's scope, σ is restored. Of course, pop works only if bindings are inserted and popped in a stacklike fashion.

An industrial-strength implementation would improve on this in several ways; see Exercise 5.1.

EFFICIENT FUNCTIONAL SYMBOL TABLES

In the functional style, we wish to compute σ′ = σ + {a ↦ τ} in such a way that we still have σ available to look up identifiers. Thus, instead of "altering" a table by adding a binding to it we create a new table by computing the "sum" of an existing table and a new binding. Similarly, when we add 7 + 8 we don't alter the 7 by adding 8 to it; we create a new value, 15 − and the 7 is still available for other computations. However, nondestructive update is not efficient for hash tables. Image 5.3a shows a hash table implementing mapping m₁. It is fast and efficient to add mouse to the fifth slot; just make the mouse record point at the (old) head of the fifth linked list, and make the fifth slot point to the mouse record. But then we no longer have the mapping m₁:Wehavedestroyedittomake m₂. The other alternative is to copy the array, but still share all the old buckets, as shown in Image 5.3b. But this is not efficient: The array in a hash table should be quite large, proportional in size to the number of elements, and we cannot afford to copy it for each new entry in the table.
Image 5.3: Hash tables.

By using binary search trees we can perform such "functional" additions to search trees efficiently. Consider, for example, the search tree in Image 5.4, which represents the mapping Java ScreenShot
Image 5.4: Binary search trees.

We can add the binding mouse ↦ 4, creating the mapping m₂ without destroying the mapping m₁, as shown in Image 5.4b. If we add a new node at depth d of the tree, we must create d new nodes - but we don't need to copy the whole tree. So creating a new tree (that shares some structure with the old one) can be done as efficiently as looking up an element: in log(n) time for a balanced tree of n nodes. This is an example of a persistent data structure; a persistent red-black tree can be kept balanced to guarantee log(n) access time (see Exercise 1.1c, and also page 276).

SYMBOLS

The hash table of Program 5.2 must examine every character of the string s for the hash operation, and then again each time it compares s against a string in the ith bucket. To avoid unnecessary string comparisons, we can convert each string to a symbol, so that all the different occurrences of any given string convert to the same symbol object. The Symbol module implements symbols and has these important properties:

Comparing symbols for equality is fast (just pointer or integer comparison).
Extracting an integer hash key is fast (in case we want to make a hash table mapping symbols to something else).
Comparing two symbols for "greater-than" (in some arbitrary ordering) is fast (in case we want to make binary search trees).

Even if we intend to make functional-style environments mapping symbols to bindings, we can use a destructive-update hash table to map strings to symbols: We need this to make sure the second occurrence of "abc" maps to the same symbol as the first occurrence. Program 5.5 shows the interface of the Symbol module.The interface of package Symbol.

package Symbol;
public class Symbol {
 public String toString();
 public static Symbol symbol(String s);
}
public class Table {
 public Table();
 public void put(Symbol key, Object value);
 public Object get(Symbol key);
 public void beginScope();
 public void endScope();
 public java.util.Enumeration keys();
}

Environments are implemented in the Symbol.Table class as Tables mapping Symbols to bindings. We want different notions of binding for different purposes in the compiler - type bindings for types, value bindings for variables and functions - so we let the bindings be Object, though in any given table every binding should be a type binding, or every binding should be a value binding, and so on. To implement the Symbol class (Program 5.6), we rely on the intern() method of the java.lang.String class to give us a unique object for any given character sequence; we can map from Symbol to String by having each symbol contain a string variable, but the reverse mapping must be done using a hash table (we use java.util.Hashtable).Symbol table implementation.

package Symbol;
public class Symbol {
 private String name;
 private Symbol(String n) {name=n; }
 private static java.util.Dictionary dict = new java.util.Hashtable();
 public String toString() {return name;}
 public static Symbol symbol(String n) {
 String u = n.intern();
 Symbol s = (Symbol)dict.get(u);
 if (s==null) {s = new Symbol(u); dict.put(u,s); }
 return s;
 }
}

To handle the "undo" requirements of destructive update, the interface function beginScope remembers the current state of the table, and endScope restores the table to where it was at the most recent beginScope that has not already been ended. An imperative table is implemented using a hash table. When the binding x ↦ b is entered (table.put(x,b)), x is hashed into an index i, and a Binder object x ↦ b is placed at the head of the linked list for the ith bucket. If the table had already contained a binding x ↦ b′, that would still be in the bucket, hidden by x ↦ b. This is important because it will support the implementation of undo (beginScope and endScope). The key x is not a character string, but is the Symbol object itself. There must also be an auxiliary stack, showing in what order the symbols were "pushed" into the symbol table. When x ↦ b is entered, then x is pushed onto this stack. A beginScope operation pushes a special marker onto the stack. Then, to implement endScope, symbols are popped off the stack down to and including the topmost marker. As each symbol is popped, the head binding in its bucket is removed. The auxiliary stack can be integrated into the Binder by having a global variable top showing the most recent Symbol bound in the table. Then "pushing" is accomplished by copying top into the prevtop field of the Binder. Thus, the "stack" is threaded through the binders. If we wanted to use functional-style symbol tables, the Table interface might look like this:

public class Table {
 public Table();
 public Table put(Symbol key, Object value);
 public Object get(Symbol key);
 public java.util.Enumeration keys();
}

The put function would return a new table without modifying the old one. We wouldn't need beginScope and endScope, because we could keep an old version of the table even as we use the new version.