Index Implementations with Symbol Tables - Algorithms

For many apps we want a search structure simply to help us find items, without moving them around. For example, we might have an array of items with keys, and we might want the search method to give us the index into that array of the item matching a certain key. Or we might want to remove the item with a given index from the search structure but still keep it in the array for some other use. An important feature of this approach is that it allows the client to maintain extra arrays (extra information associated with each node) to be added without the symbol-table code being changed at all. When the search routine returns the index for an item, it gives a way to access immediately all the information associated with that item, by using the index to access an appropriate array.

String-index ADT

This ADT describes an essential operation for strings: build an index (preprocess a text string) such that, given a query string, we can quickly search the text for the query. We normally specify that search should return -1 if the query string is not found in the text; otherwise it should return the text index where it found a match.

class TI // ADT interface { // implementations and private members hidden TI(String) int search(String) }

In , we considered the advantages of processing index items in priority queues, referring to data in a client array indirectly. For symbol tables, the same concept leads to the familiar index: a search structure external to a set of items that provides quick access to items with a given key. In , we shall consider the case where the items and perhaps even the index are in external storage; in this section, we briefly consider the case when both the items and the index fit in memory.

We can adapt any symbol-table implementation to build indices in precisely the same manner as we provided indirection for heaps in : pass an array of items to the constructor and redefine less and equals to refer to items through their indices. This solution is straightforward and useful, but leave it for exercises (see Exercises 12.50 through 12.52).

We consider instead an important special case: implementing an ADT to provide keyword searching in a string of text, as specified in Program 12.11. Given a text string, we want to support queries of the form: "Is this query string found in the text string?" We will be considering several versions and implementations of this kind of ADT in Part 6; for this section, we assume that the text is huge and that we are willing to spend some time preprocessing it (in the constructor) so as to be able to quickly respond to queries (see Screenshot).

Screenshot Text string index

In this example of a string index, we define a string key to begin with each word in a text; then, we build a table with the indices appearing in order of the keys that they reference. The first entry, string index 27, refers to the string key that begins ago ...; the second entry refers to call ...; the third entry to how ...; and so forth. The we can use binary search on this table. For example, to find out whether the phrase never mind appears in this text, we compare with long... in the middle (string index 46 at position 4 in the middle of the index array), then we find a match in the middle of the right half (string index 31 at position 7 in the middle of the right half of the index array). We use word boundaries for clarity; our implementations typically define a new key starting at each character position. The keys are arbitrarily long in principle, but only a few leading characters are generally examined, in practice.

Java graphics 12fig05

String-index client

This client uses an index to search in a text string for a sequence of strings from standard input. The main program reads the text from a specified file, puts it in a (potentially huge) String and builds an index from the substrings defined by starting at each character in the text string. Then, it reads query strings from standard input and prints a position where the query is found in the text (or prints not found). A naive index implementation would take time proportional to the length of the text, but we can use a symbol table to make the search fast, even for huge strings.

import java.io.*; class TextSearch { public static void main(String[] args) throws IOException { FileReader f = new FileReader(args[0]); BufferedReader b = new BufferedReader(f); String text = "", line = ""; while ((line = b.readLine()) != null) text += line + " "; TI ti = new TI(text); In.init(); String q; int i; while ((q = In.getString()) != null) if ((i = ti.search(q)) < 0) Out.println(q + " not found"); else Out.println(q + " "+i); } }

Program 12.12 is an example of a client. It reads a text string from an external file, builds an index, then reads a series of query strings from standard input and uses search to print the position in the text of some occurrence of each query (or an indication that the query does not occur in the text).

The first step in developing an implementation is to consider each position in the text string to define a string key starting at that position and going to the end of the string. In the abstract, these keys can be quite long, but we will take pains in our implementations to represent each key only with an index, and we will look at only enough of their characters to decide how to compare them. No two keys are equal (for example, they are all of different lengths), but if we consider two strings to be equal if one is a prefix of the other, we can use our standard symbol-table implementations to decide whether a given query string is in the text.

One approach to doing so is to develop appropriate implementations of myItem and myKey (see Exercise 12.1). The advantage of this approach is that it enables use of any of our symbol-table implementations and provides the flexibility to handle multiple text strings or to extend the basic ADT in a variety of ways. We leave that for an exercise because it leads to an excessive number of references when we have a huge text string (since we have one reference per key). We might ameliorate this cost by indexing the string only on word boundaries or some similar scheme.

Programs 12.13 and 12.14 illustrate a direct implementation that is based on binary search. We construct a symbol table that is made up of an ordered array of keys, except that we keep in that array not the key, but an index into the text string that points to the first character of the key. We start with the keys in the order in which they appear in the text string (with the ith entry in the index equal to i, the position in the text of the ith key) and rearrange them to be in sorted order (with the ith entry in the index equal to position in the text of the ith smallest key). As in , the key to the implementation is the definition of the less method. When we are to compare key i with key j, we actually want to compare the substring of the text beginning at i with the substring of the text beginning at j. If less is implemented to do so, we can think of the keys residing in the array, when actually it is their indices. Any of our sort implementations can be used (with int item type) to perform the sort.

If the text is in a String s, we might compare key i with key j with the Java code

s.substring(i).compareTo(s.substring(j))

but we avoid using these methods because they carry no performance guarantees. In particular, substring might take time or space proportional to the length of the substring, which would be disastrous for this app. We are assuming that charAt is a constant-time operation, which may not be true in some implementations. If that is the case, it is not difficult to modify Programs 12.13 and 12.14 to use character arrays rather than Strings (see ). An additional point to note is that compareTo is carefully implemented so as to provide proper ordering of Unicode strings, but that feature is irrelevant to this app, where we are only using the ordering to structure an efficient search for equal substrings.

String-index implementation (constructor)

This code uses binary search to build an index for a given text string, using an indirect sort. The constructor creates an array index, initializes it so that index[i] = i, and uses quicksort to rearrange the entries such that index[i] is the position in the text of the ith lexicographically smallest substring. We do not need to change the quicksort code at all to accomplish this result-we just define less so that less(i, j) compares the substring of the text starting at i with the substring of the text starting at j.

class TI { private String text; private int[] index; private int N; TI(String s) { text = s; N = text.length(); index = new int[N+1]; index[N] = -1; for (int i = 0; i < N; i++) index[i] = i; quicksort(index, 0, N-1); } private char s(int i) { return text.charAt(i); } private boolean less(int v, int w) { if (v == w) return false; for (int i = 0; ; i++) if (w+i >= N) return false; else if (v+i >= N) return true; else if (s(v+i) < s(w+i)) return true; else if (s(v+i) > s(w+i)) return false; } private void exch(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; } private void quicksort(int[] a, int l, int r) // Program 7.1 with int for ITEM int search(String v) // Program 12.14 }

String-index implementation (search)

To search in the text using the index array constructed in Program 12.13, we again redefine the comparison methods to refer to the text through the index array. Since comparisons may be expensive, we avoid doing both less and equals at every step by using a modified version of Program 12.10 based on a three-way comparison (see text).

private int compare(String s, int v) { char[] key = s.toCharArray(); int t = index[v]; for (int i = 0; i < key.length; i++) if (t+i >= N) return 1; else if (key[i] > text(t+i)) return 1; else if (key[i] < text(t+i)) return -1; return 0; } private int searchR(int l, int r, String v) { int m = (l+r)/2; if (l > r) return N; switch (compare(v, m)) { case -1: return searchR(l, m-1, v); case 1: return searchR(m+1, r, v); } return m; // case 0 } int search(String v) { return index[searchR(0, N-1, v)]; }

Once the constructor builds the sorted key index array, then it is a simple matter to use it for a binary search, again by implementing an appropriate comparison method, as shown in Program 12.14. We could implement less and equals methods and use our binary search code from Program 12.10, but, as mentioned in , it is simpler and more efficient to use a three-way comparison method like the Java compareTo method for strings, since our binary search implementation always follows an equals text with a less test on the same keys. The compare method in Program 12.14 compares a given string with a given substring of the text and returns -1 if less, 0 if equal (the whole substring is found in the text), and 1 if greater.

Since the symbol table is implemented with binary search, then we expect from Property 2.3 that the search will involve about lg N comparisons. For example, once the index is built, we could find any phrase in a text consisting of about 1 million characters (such as Moby Dick) with about 20 string comparisons.

There are other issues for us to consider when we are building indices in practical apps, many of which we consider in the context of particular implementations (see, for example, Exercises 12.54 and 12.81). As we shall see in , there are also many ways that we can take particular advantage of the properties of string keys to speed up our algorithms. More sophisticated methods for string search and for providing indices with useful capabilities for string keys will be primary topics in Part 6.

Exercises

12.50 Give an ADT interface similar to Program 9.11 for building a symbol table using indices into a client array that supports the search and select operations.
Provide an implementation of your interface from Exercise 12.50 that uses class Sort (see Program 6.3)-so that you can use any sort implementation-and binary search.
Write a driver program that tests your interface from Exercise 12.50 and the performance of your implementation from Exercise 12.51.
Give the index array that is constructed by Program 12.13 for the text string abracadabra, and then the the index values that are returned by Program 12.14 for the query strings cad, abra and a.
Modify the ADT interface of Program 12.11 such that search returns an array with all the text indices that match the query, and modify Program 12.14 to implement your interface.
Give an example of a text string where the number of character comparisons for the index-construction part of Program 12.12 is quadratic in the length of the string.
Modify our string index implementation (Program 12.12) to use only the keys that start on word boundaries to build the index (see Screenshot). (For Moby Dick, this change cuts the size of the index by more than a factor of five.)